TaskCluster Platform: 2015Q3 Retrospective

Welcome to TaskCluster Platform’s 2015Q3 Retrospective! I’ve been managing this team this quarter and thought it would be nice to look back on what we’ve done. This report covers what we did for our quarterly goals. I’ve linked to “Publications” at the bottom of this page, and we have a TaskCluster Mozilla Wiki page that’s worth checking out.

High level accomplishments

  • Dramatically improved stability of TaskCluster Platform for Sheriffs by fixing TreeHerder ingestion logic and regexes, adding better logging and fixing bugs in our taskcluster-vcs and mozilla-taskcluster components
  • Created and Deployed CI builds on three major platforms:
    • Added Linux64 (CentOS), Mac OS X cross-compiled builds as Tier2 CI builds
    • Completed and documented a prototype Windows 2012 builds in AWS and task configuration
  • Deployed auth.taskcluster.net, enabling better security, better support for self-service authorization and easier contributions from outside our team
  • Added region biasing based on cost and availability of spot instances to our AWS provisioner
  • Managed the workload of two interns, and significantly mentored a third
  • Onboarded Selena as a new manager
  • Held a workweek to focus attention on bringing our environment into production support of Release Engineering

Goals, Bugs and Collaborators

We laid out our Q3 goals in this etherpad. Our chosen themes this quarter were:

  • Improve operational excellence — focus on sheriff concerns, data collection,
  • Facilitate self-serve consumption — refactoring auth and supporting roles for scopes, and
  • Exploit opportunities to differentiate from other platforms — support for interactive sessions, docker images as artifacts, github integration and more blogging/docs.

We had 139 Resolved FIXED bugs in TaskCluster product.

Link to graph of resolved bugs

We also resolved 7 bugs in FirefoxOS, TreeHerder and RelEng products/components.

We received significant contributions from other teams: Morgan (mrrrgn) designed, created and deployed taskcluster-github; Ted deployed Mac OS X cross compiled builds; Dustin reworked the Linux TC builds to use CentOS, and resolved 11 bugs related to TaskCluster and Linux builds.

An additional 9 people contributed code to core TaskCluster, intree build scripts and and task definitions: aus, rwood, rail, mshal, gerard-majax, mihneadb@gmail.com, htsai, cmanchester, and echen.

The Big Picture: TaskCluster integration into Platform Operations

Moving from B2G to Platform was a big shift. The team had already made a goal of enabling Firefox Release builds, but it wasn’t entirely clear how to accomplish that. We spent a lot of this quarter learning things from RelEng and prioritizing. The whole team spent the majority of our time supporting others use of TaskCluster through training and support, developing task configurations and resolving infrastructure problems. At the same time, we shipped docker-worker features, provisioner biasing and a new authorization system. One tricky infra issue that John and Jonas worked on early in the quarter was a strange AWS Provisioner failure that came down to an obscure missing dependency. We had a few git-related tree closures that Greg worked closely on and ultimately committed fixes to taskcluster-vcs to help resolve. Everyone spent a lot of time responding to bugs filed by the sheriffs and requests for help on IRC.

It’s hard to overstate how important the Sheriff relationship and TreeHerder work was. A couple teams had the impression that TaskCluster itself was unstable. Fixing this was a joint effort across TreeHerder, Sheriffs and TaskCluster teams.

When we finished, useful errors were finally being reported by tasks and starring became much more specific and actionable. We may have received a partial compliment on this from philor. The extent of artifact upload retries, for example, was made much clearer and we’ve prioritized fixing this in early Q4.

Both Greg and Jonas spent many weeks meeting with Ed and Cam, designing systems, fixing issues in TaskCluster components and contributing code back to TreeHerder. These meetings also led to Jonas and Cam collaborating more on API and data design, and this work is ongoing.

We had our own “intern” who was hired on as a contractor for the summer, Edgar Chen. He did some work with the docker-worker, implementing Interactive Sessions, and did analysis on our provisioner/worker efficiency. We made him give a short, sweet presentation on the interactive sessions. Edgar is now at CMU for his sophomore year and has referred at least one friend back to Mozilla to apply for an internship next summer.

Pete completed a Windows 2012 prototype build of Firefox that’s available from Try, with documentation and a completely automated process for creating AMIs. He hasn’t created a narrated video with dueling, British-English accented robot voices for this build yet.

We also invested a great deal of time in the RelEng interns. Jonas and Greg worked with Anhad on getting him productive with TaskCluster. When Anthony arrived, we also onboarded him. Jonas worked closely to get him working on a new project, hooks.taskcluster.net. To take these two bits of work from RelEng on, I pushed TaskCluster’s roadmap for generic-worker features back a quarter and Jonas pushed his stretch goal of getting the big graph scheduler into production to Q4.

We worked a great deal with other teams this quarter on taskcluster-github, supporting new Firefox and B2G builds, RRAs for the workers and generally telling Mozilla about TaskCluster.

Finally, we spent a significant amount of time interviewing, and then creating a more formal interview process that includes a coding challenge and structured-interview type questions. This is still in flux, but the first two portions are being used and refined currently. Jonas, Greg and Pete spent many hours interviewing candidates.

Berlin Work Week

TaskCluster Platform Team in Berlin

Toward the end of the quarter, we held a workweek in Berlin to focus our next round of work on critical RelEng and Release-specific features as well as production monitoring planning. Dustin surprised us with delightful laser cut acrylic versions of the TaskCluster logo for the team! All team members reported that they benefited from being in one room to discuss key designs, get immediate code review, and demonstrate work in progress.

We came out of this with 20+ detailed documents from our conversations, greater alignment on the priorities for Platform Operations and a plan for trainings and tutorials to give at Orlando. Dustin followed this up with a series of ‘TC Topics’ Vidyo sessions targeted mostly at RelEng.

Our Q4 roadmap is focused on key RelEng features to support Release.


Our team published a few blog posts and videos this quarter:

Release Engineering: A draft of an architecture diagram

One of the things that I like to do is create architecture diagrams of complicated systems.

We had Release Engineering and Release Operations in the Portland Mozilla office this week, providing a perfect opportunity to pick everyone’s brains about what the current state of our release infrastructure is like.

Behold: releng flow onepage

And here’s a version that includes some “tree closure reasons” in magenta:

Releng infra with tree closure reason codes

A tree closure is defined as an hg hook that prevents people from committing to a tree (like mozilla-central). It looks up status at treestatus.mozilla.org to figure out whether or not the tree is closed, and this value is updated manually by “sheriffs” who track tree status.

And the an initial key to the tree closure reasons (the numbers on the magenta blobs), is documented on the Mozilla wiki.

The goal of this document was to take brain dump information from everyone in the meeting, and create a relationship diagram of all the systems that everyone here supports. As you can see, it is pretty complex.

What I took away from creating this was:

  • The cognitive load is very high for trying to diagnose the root cause for several kinds of tree closures.
  • People loved being able to look at how each systems related to the others.
  • No single person really had a model in their head of how everything represented in this diagram was related.

There’s a lot more work to do to link in documentation and create some related diagrams, which I’ll tackle next week. The kinds of questions I’d like to try to answer based on the information that I’ve gathered include:

  • How does my patch get a build created for it?
  • What single points of failure can we mitigate?
  • What kinds of resilience do we need for our typical transient failures?

I really enjoyed identifying sources of tree closure and the kinds of failures that cause it. These are the kinds of problems I love working on solving — complicated, often unpredictable and largely driven by the normal work that people need to do to get their jobs done. There’s rarely a simple solution to things like experimental patches taking down large portions of a build infrastructure, and how we solve, or at least mitigate, these problems is fascinating.

Weekly Feminist Work in Tech by Mozillians roundup – Week of March 3, 2014

We have a ton of individual work done by MoFo and MoCo employees related to feminism, feminist activism and the larger technology community. So much is happening, I can barely keep track!

I’ve reached out to a few people I work with to get some highlights and spread the word about interesting projects we’re all working on. If you are a Mozillian and occasionally or regularly work on feminist issues in the tech community, please let me know! My plan is to ping people every Friday morning and post a blog post about what’s happened in the last week.

Without further ado:

Dispatch from me, Selena Deckelmann:

  • I’m presenting at SF Github HQ on Thurs March 13, 7pm as part of the Passion Projects series (Julie Horvath’s project). I’ll be talking about teaching beginners how to code and contribute to open source, specifically through my work with PyLadies. I’m giving a similar talk this afternoon at Portland State University to their chapter of the ACM.
  • Just wrapped up a Git workshop for PyLadiesPDX and am gearing up for a test-run of a “make a Flask blog in 80-lines of code” workshop! Course materials are available here for “intro to git” workshops.
  • Lukas, Liz, me and others (I’m not sure who all else!!) are coordinating a Geekfeminism and feminist hackerspace meetup at PyCon 2014. The details aren’t published yet, so stay tuned!
  • PyLadies PyCon 2014 lunch is happening again!
  • PyLadies will also be holding a Mani-Pedi party just like in 2013. Stay tuned for details!
  • Brownbags for the most recent GNOME Outreach Program for Women contributors are scheduled for next Friday March 14, 10am and 2pm. (thanks Larissa!!) Tune in at http://air.mozilla.com. One of the GNOME Outreach Program for Women contributors is Jennie Rose Halperin, and another is Sabina Brown.

Dispatch from Liz Henry:

  • I’m doing a lot of work to support Double Union feminist hackerspace, a nonprofit in San Francisco. We are hosting tech and arts workshops, and establishing connections with other hackerspaces in the US and around the world. Lukas is also involved with this effort! We have over 100 members now using the space.
  • For PyCon I would like to host fairly informal sessions in our Feminist Hacker Lounge, on QA, bug triaging, and running/writing WebQA automated tests with pytest and selenium.
  • I’m hoping to have funding for an OPW intern for this upcoming round to work on the back end of a QA community facilitating tool, using Python and various APIs for Mozilla tools like Bugzilla, Moztrap, and the Mozillians profiles.

Dispatch from Lukas Blakk:

  • Just held the Lesbians Who Tech hackathon at the Mozilla SF space and it was an amazing weekend of networking, recruiting for Mozilla, doing a stump speech on the radical/political possibilities of open source, and also just a lot of social fun.
  • I’m nearing the point of Project Kick Off for The Ascend Project which will be a 6 week training course for underrepresented in current tech mainstream (and underemployed/unpaid) persons who will learn how to write automatable tests for MozMill. This first one will take place at the Portland office in Sept/Oct 2014 (Starts on Sept 8th). There’s so much more here, but this is just a sound bite.
  • I’m trying to determine what budget I can get agreement on to put towards women in tech outreach this year.
  • PyCon – yes! Such Feminist, So Hackerspace, Much gathering of geek feminists!

Dispatch from Larissa Shapiro:

  • OPW wrapup and next session – we’re wrapping up the current round, scheduling brownbags for two of the current interns, etc. Funding is nearly secured for the next round and we have like 6 willing mentors. w00t.
  • I’m also providing space for/speaking at an upcoming event in the Mountain View office: last year’s African Techwomen emerging leaders were part of a documentary and the Diaspora African Women’s Network is holding a screening and a planning session for how to support next year’s ELs and other African and African-American bay area women in tech both through this and other projects, March 29. Open to Mozilla folks, let me know if you’re interested.

Anything else that’s come up in the last week, or that you’d like Mozillians to know about? Let me know in the comments!