[workweek] tc-worker workweek recap

Sprint recap

We spent this week sprinting on the tc-worker, engines and plugins. We merged 19 pull requests and had many productive discussions!

tc-worker core

We implemented the task loop! This basic loop should start when the worker is invoked. It spins up a task claimer and manager responsible for claiming as many tasks as its available capacity allows and running them to completion. You can find details in this commit. We’re still working on some high-level documentation.
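
For a concrete picture of what that loop does, here is a minimal Go sketch of claiming up to capacity and running each claim to completion. The type and method names are hypothetical, not the actual tc-worker API from the commit above.

```go
package worker

import "sync"

// Claim represents a task claim returned by the queue.
type Claim struct{ TaskID string }

// Claimer claims tasks from the queue; Runner runs a claimed task to completion.
type Claimer interface {
	ClaimTask() (Claim, error)
}

type Runner interface {
	RunToCompletion(c Claim)
}

// taskLoop claims as many tasks as the capacity allows and runs each one to
// completion in its own goroutine, stopping cleanly when `stop` is closed.
func taskLoop(claimer Claimer, runner Runner, capacity int, stop <-chan struct{}) {
	slots := make(chan struct{}, capacity) // bounds concurrent tasks
	var wg sync.WaitGroup
	for {
		select {
		case <-stop:
			wg.Wait() // let in-flight tasks finish before returning
			return
		case slots <- struct{}{}: // acquire a capacity slot
			claim, err := claimer.ClaimTask()
			if err != nil {
				<-slots // release the slot and keep looping
				continue
			}
			wg.Add(1)
			go func(c Claim) {
				defer wg.Done()
				defer func() { <-slots }() // free the slot when the task ends
				runner.RunToCompletion(c)
			}(claim)
		}
	}
}
```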

We did some cleanups to make it easier to download and get started with builds. We fixed up the packages related to generating Go types from JSON schemas, and the types now conform to the linting rules.

We also implemented the webhookserver. The package provides implementations of the WebHookServer interface, which allows attaching and detaching webhooks on an internet-exposed server. This will support both the livelog and interactive features. Work is detailed in PR 37.
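
To make that concrete, here is a hedged sketch of what such an interface could look like; the real interface in PR 37 may differ in names and signatures.

```go
package webhookserver

import "net/http"

// WebHookServer is the rough shape of the interface described above: a hook
// handler is attached under a publicly reachable URL and can be detached
// again when the task is done with it.
type WebHookServer interface {
	// AttachHook registers a handler and returns the public URL that the
	// internet-exposed server will forward matching requests to, plus a
	// function to detach the hook.
	AttachHook(handler http.Handler) (url string, detach func())
}
```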

engine: hello, world

Greg created a proof of concept and pushed a successful task that emits a hello, world artifact. Greg will be writing up something to describe this process next week.

plugin: environment variables

Wander landed this plugin this week to support setting environment variables for tasks. The work is described in PR 39.
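
As a rough illustration (not the code from PR 39), setting environment variables mostly boils down to merging a task-supplied map over the environment the engine starts the command with:

```go
package envplugin

// payload is the slice of the task payload such a plugin would care about:
// a simple map of variable names to values.
type payload struct {
	Env map[string]string `json:"env"`
}

// apply merges the task-supplied variables over the worker's base
// environment, so the task can override defaults set by the worker.
func apply(base map[string]string, p payload) map[string]string {
	merged := make(map[string]string, len(base)+len(p.Env))
	for k, v := range base {
		merged[k] = v
	}
	for k, v := range p.Env {
		merged[k] = v
	}
	return merged
}
```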

plugin: artifact uploads

This plugin will support artifact uploads to S3 for all engines and is based on generic-worker code. This work was started in PR 55.
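
As a hedged sketch of the core step: for S3 artifacts the queue typically hands back a signed PUT URL, and the worker streams the file body to it. Everything below is illustrative only and glosses over details the real plugin has to handle.

```go
package artifacts

import (
	"fmt"
	"net/http"
	"os"
)

// putArtifact streams a local file to a pre-signed S3 PUT URL, which is how
// S3 artifacts typically get uploaded once the queue has created the artifact
// record. Retries, content encoding and error artifacts are left out here.
func putArtifact(putURL, path, contentType string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return err
	}

	req, err := http.NewRequest("PUT", putURL, f)
	if err != nil {
		return err
	}
	req.ContentLength = info.Size()
	req.Header.Set("Content-Type", contentType)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("artifact upload failed: %s", resp.Status)
	}
	return nil
}
```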

TaskCluster design principles

We discussed as a team the ideas behind the design of TaskCluster. The umbrella principle we try to stick to is: Getting Things Built. We felt it was important to say that first because it helps us remember that we’re here to provide features to users, not just design systems. The four key design principles were distilled to:

  • Self-service
  • Robustness
  • Enable rapid change
  • Community friendliness

One surprising connection (to me) we made was that our privacy and security features are driven by community friendliness.

We plan to add our ideas about this to a TaskCluster “about” page.

TaskCluster code review

We discussed our process for code reviews and how we’d like to do them in the future. We covered issues around when to do architecture reviews and how to get “pre-reviews” of ideas from the colleagues who will be doing our reviews. We made an outline of ideas and will be giving them a permanent home on our docs site.

Q2 Planning

We made a first pass at our 2016q2 goals. The main themes are adding OS X engine support to taskcluster-worker, continuing the work on refactoring in-tree config, and building out our monitoring system beyond InfluxDB. Further refinements to our plan will come in a couple of weeks, as we close out Q1 and get a better understanding of the work related to the Buildbot to TaskCluster migration.

Tier-1 status for Linux 64 Debug build jobs on March 14, 2016

I sent this to dev-planning, dev-platform, sheriffs and tools-taskcluster today. I added a little more context for a non-Mozilla audience.

The time has come! We are planning to switch to Tier-1 on Treeherder for TaskCluster Linux 64 Debug build jobs on March 14. At the same time, we will hide the Buildbot build jobs, but continue running them. This means that these jobs will become what Sheriffs use to determine the health of patches and our trees.

On March 21, we plan to switch the Linux 64 Debug tests to Tier-1 and hide the related Buildbot test jobs.

After about 30 days, we plan to disable and remove all Buildbot jobs related to Linux Debug.

Background:

We’ve been running Linux 64 Debug builds and tests using TaskCluster side-by-side with Buildbot jobs since February 18th. Some of the project work that was done to green up the tests is documented here.

The new tests are running in Docker-ized environments, and the Docker images we use are defined in-tree and publicly accessible.

This work was the culmination of many months of effort, with Joel Maher, Dustin Mitchell and Armen Zambrano primarily focused on test migration this quarter. Thank you to everyone who responded to NEEDINFOs, emails and pings on IRC to help with untangling busted test runs.

On performance, we’re taking a 14% hit across all the new test jobs vs. the old jobs in Buildbot. We ran two large-scale tests to help determine where slowness might still be lurking, and were able to find and fix many issues. There are a handful of jobs remaining that seem significantly slower, while others are significantly faster. We decided that it was more important to deprecate the old jobs and start exclusively maintaining the new jobs now, rather than wait to resolve the remaining performance issues. Over time we hope to address issues with the owners of the affected test suites.

[portland] taskcluster-worker Hello, World

The TaskCluster Platform team is in Portland this week, hacking on the taskcluster-worker.

Today, we all sync’d up on the current state of our worker, and what we’re going to hack on this week. We started with the current docs.

The reason we’re investing so much time in the worker is twofold:

  • The worker code previously lived in two code bases – docker-worker and generic-worker. We need to unify these code bases so that multiple engineers can work on it, and to help us maintain feature parity.
  • We need to get a worker that supports Windows into production. For now, we’re using the generic-worker, but we’d like to switch over to taskcluster-worker in late Q2 or early Q3. This timeline lines up with when we expect the Windows migration from Buildbot to happen.

One of the things I asked this team to do was come up with some demos of the new worker. The first demo today, from Greg Arndt, was to simply output a log and upload it.

The rest of the team is getting their Go environments set up to run tests and get hacking on crucial plugins, like our environment variable handling and additional artifact uploading logic we need for our production workers.

We’re also taking the opportunity to sync up with our Windows environment guru. Our goal for Buildbot to TaskCluster migration this quarter is focused on Linux builds and tests. Next quarter, we’ll be finishing Linux and, I hope, landing Windows builds in TaskCluster. To do that, we have a lot of details to sort out with how we’ll build Windows AMIs and deploy them. It’s a very different model because we don’t have the same options with Docker as we have on Linux.

TaskCluster Platform: 2015Q3 Retrospective

Welcome to TaskCluster Platform’s 2015Q3 Retrospective! I’ve been managing this team this quarter and thought it would be nice to look back on what we’ve done. This report covers what we did for our quarterly goals. I’ve linked to “Publications” at the bottom of this page, and we have a TaskCluster Mozilla Wiki page that’s worth checking out.

High level accomplishments

  • Dramatically improved stability of TaskCluster Platform for Sheriffs by fixing TreeHerder ingestion logic and regexes, adding better logging and fixing bugs in our taskcluster-vcs and mozilla-taskcluster components
  • Created and Deployed CI builds on three major platforms:
    • Added Linux64 (CentOS) and Mac OS X cross-compiled builds as Tier-2 CI builds
    • Completed and documented a prototype Windows 2012 build in AWS, along with its task configuration
  • Deployed auth.taskcluster.net, enabling better security, better support for self-service authorization and easier contributions from outside our team
  • Added region biasing based on cost and availability of spot instances to our AWS provisioner
  • Managed the workload of two interns, and significantly mentored a third
  • Onboarded Selena as a new manager
  • Held a workweek to focus attention on bringing our environment into production support of Release Engineering

Goals, Bugs and Collaborators

We laid out our Q3 goals in this etherpad. Our chosen themes this quarter were:

  • Improve operational excellence — focus on sheriff concerns, data collection,
  • Facilitate self-serve consumption — refactoring auth and supporting roles for scopes, and
  • Exploit opportunities to differentiate from other platforms — support for interactive sessions, docker images as artifacts, github integration and more blogging/docs.

We had 139 Resolved FIXED bugs in the TaskCluster product.

Link to graph of resolved bugs

We also resolved 7 bugs in FirefoxOS, TreeHerder and RelEng products/components.

We received significant contributions from other teams: Morgan (mrrrgn) designed, created and deployed taskcluster-github; Ted deployed Mac OS X cross compiled builds; Dustin reworked the Linux TC builds to use CentOS, and resolved 11 bugs related to TaskCluster and Linux builds.

An additional 9 people contributed code to core TaskCluster, in-tree build scripts and task definitions: aus, rwood, rail, mshal, gerard-majax, mihneadb@gmail.com, htsai, cmanchester, and echen.

The Big Picture: TaskCluster integration into Platform Operations

Moving from B2G to Platform was a big shift. The team had already made a goal of enabling Firefox Release builds, but it wasn’t entirely clear how to accomplish that. We spent a lot of this quarter learning things from RelEng and prioritizing. The whole team spent the majority of our time supporting others’ use of TaskCluster through training and support, developing task configurations and resolving infrastructure problems. At the same time, we shipped docker-worker features, provisioner biasing and a new authorization system. One tricky infra issue that John and Jonas worked on early in the quarter was a strange AWS Provisioner failure that came down to an obscure missing dependency. We had a few git-related tree closures that Greg worked closely on and ultimately committed fixes to taskcluster-vcs to help resolve. Everyone spent a lot of time responding to bugs filed by the sheriffs and requests for help on IRC.

It’s hard to overstate how important the Sheriff relationship and TreeHerder work was. A couple teams had the impression that TaskCluster itself was unstable. Fixing this was a joint effort across TreeHerder, Sheriffs and TaskCluster teams.

When we finished, useful errors were finally being reported by tasks and starring became much more specific and actionable. We may have received a partial compliment on this from philor. The extent of artifact upload retries, for example, was made much clearer and we’ve prioritized fixing this in early Q4.

Both Greg and Jonas spent many weeks meeting with Ed and Cam, designing systems, fixing issues in TaskCluster components and contributing code back to TreeHerder. These meetings also led to Jonas and Cam collaborating more on API and data design, and this work is ongoing.

We had our own “intern” who was hired on as a contractor for the summer, Edgar Chen. He did some work with the docker-worker, implementing Interactive Sessions, and did analysis on our provisioner/worker efficiency. We made him give a short, sweet presentation on the interactive sessions. Edgar is now at CMU for his sophomore year and has referred at least one friend back to Mozilla to apply for an internship next summer.

Pete completed a Windows 2012 prototype build of Firefox that’s available from Try, with documentation and a completely automated process for creating AMIs. He hasn’t created a narrated video with dueling, British-English accented robot voices for this build yet.

We also invested a great deal of time in the RelEng interns. Jonas and Greg worked with Anhad on getting him productive with TaskCluster. When Anthony arrived, we also onboarded him. Jonas worked closely to get him working on a new project, hooks.taskcluster.net. To take these two bits of work from RelEng on, I pushed TaskCluster’s roadmap for generic-worker features back a quarter and Jonas pushed his stretch goal of getting the big graph scheduler into production to Q4.

We worked a great deal with other teams this quarter on taskcluster-github, supporting new Firefox and B2G builds, RRAs for the workers and generally telling Mozilla about TaskCluster.

Finally, we spent a significant amount of time interviewing, and then creating a more formal interview process that includes a coding challenge and structured-interview type questions. This is still in flux, but the first two portions are being used and refined currently. Jonas, Greg and Pete spent many hours interviewing candidates.

Berlin Work Week

TaskCluster Platform Team in Berlin

Toward the end of the quarter, we held a workweek in Berlin to focus our next round of work on critical RelEng and Release-specific features as well as production monitoring planning. Dustin surprised us with delightful laser cut acrylic versions of the TaskCluster logo for the team! All team members reported that they benefited from being in one room to discuss key designs, get immediate code review, and demonstrate work in progress.

We came out of this with 20+ detailed documents from our conversations, greater alignment on the priorities for Platform Operations and a plan for trainings and tutorials to give at Orlando. Dustin followed this up with a series of ‘TC Topics’ Vidyo sessions targeted mostly at RelEng.

Our Q4 roadmap is focused on key RelEng features to support Release.

Publications

Our team published a few blog posts and videos this quarter:

Release Engineering: A draft of an architecture diagram

One of the things that I like to do is create architecture diagrams of complicated systems.

We had Release Engineering and Release Operations in the Portland Mozilla office this week, providing a perfect opportunity to pick everyone’s brains about what the current state of our release infrastructure is like.

Behold: releng flow onepage

And here’s a version that includes some “tree closure reasons” in magenta:

Releng infra with tree closure reason codes

A tree closure is enforced by an hg hook that prevents people from committing to a tree (like mozilla-central). The hook looks up status at treestatus.mozilla.org to figure out whether or not the tree is closed, and this value is updated manually by “sheriffs” who track tree status.
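
As a hedged illustration of that lookup, a hook could query the tree status service and refuse pushes when the tree isn’t open. The URL and JSON shape below are assumptions for illustration, not the production hook.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// treeStatus mirrors the small piece of the tree status response a hook
// would look at; the real service returns more fields.
type treeStatus struct {
	Status string `json:"status"` // e.g. "open", "closed", "approval required"
}

// treeIsOpen asks the tree status service whether pushes should be allowed.
func treeIsOpen(tree string) (bool, error) {
	resp, err := http.Get("https://treestatus.mozilla.org/" + tree + "?format=json")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var ts treeStatus
	if err := json.NewDecoder(resp.Body).Decode(&ts); err != nil {
		return false, err
	}
	return ts.Status == "open", nil
}

func main() {
	open, err := treeIsOpen("mozilla-central")
	if err != nil {
		fmt.Println("could not reach treestatus:", err)
		return
	}
	fmt.Println("tree open:", open)
}
```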

An initial key to the tree closure reasons (the numbers on the magenta blobs) is documented on the Mozilla wiki.

The goal of this document was to take brain dump information from everyone in the meeting, and create a relationship diagram of all the systems that everyone here supports. As you can see, it is pretty complex.

What I took away from creating this was:

  • The cognitive load is very high for trying to diagnose the root cause for several kinds of tree closures.
  • People loved being able to look at how each system related to the others.
  • No single person really had a model in their head of how everything represented in this diagram was related.

There’s a lot more work to do to link in documentation and create some related diagrams, which I’ll tackle next week. The kinds of questions I’d like to try to answer based on the information that I’ve gathered include:

  • How does my patch get a build created for it?
  • What single points of failure can we mitigate?
  • What kinds of resilience do we need for our typical transient failures?

I really enjoyed identifying sources of tree closure and the kinds of failures that cause it. These are the kinds of problems I love to work on — complicated, often unpredictable and largely driven by the normal work that people need to do to get their jobs done. There’s rarely a simple solution to things like experimental patches taking down large portions of a build infrastructure, and how we solve, or at least mitigate, these problems is fascinating.

Weekly Feminist Work in Tech by Mozillians roundup – Week of March 3, 2014

We have a ton of individual work done by MoFo and MoCo employees related to feminism, feminist activism and the larger technology community. So much is happening, I can barely keep track!

I’ve reached out to a few people I work with to get some highlights and spread the word about interesting projects we’re all working on. If you are a Mozillian and occasionally or regularly work on feminist issues in the tech community, please let me know! My plan is to ping people every Friday morning and post a blog post about what’s happened in the last week.

Without further ado:

Dispatch from me, Selena Deckelmann:

  • I’m presenting at SF Github HQ on Thurs March 13, 7pm as part of the Passion Projects series (Julie Horvath’s project). I’ll be talking about teaching beginners how to code and contribute to open source, specifically through my work with PyLadies. I’m giving a similar talk this afternoon at Portland State University to their chapter of the ACM.
  • Just wrapped up a Git workshop for PyLadiesPDX and am gearing up for a test run of a “make a Flask blog in 80 lines of code” workshop! Course materials are available here for “intro to git” workshops.
  • Lukas, Liz, I and others (I’m not sure who else!) are coordinating a Geekfeminism and feminist hackerspace meetup at PyCon 2014. The details aren’t published yet, so stay tuned!
  • PyLadies PyCon 2014 lunch is happening again!
  • PyLadies will also be holding a Mani-Pedi party just like in 2013. Stay tuned for details!
  • Brownbags for the most recent GNOME Outreach Program for Women contributors are scheduled for next Friday March 14, 10am and 2pm. (thanks Larissa!!) Tune in at http://air.mozilla.com. One of the GNOME Outreach Program for Women contributors is Jennie Rose Halperin, and another is Sabina Brown.

Dispatch from Liz Henry:

  • I’m doing a lot of work to support Double Union feminist hackerspace, a nonprofit in San Francisco. We are hosting tech and arts workshops, and establishing connections with other hackerspaces in the US and around the world. Lukas is also involved with this effort! We have over 100 members now using the space.
  • For PyCon I would like to host fairly informal sessions in our Feminist Hacker Lounge, on QA, bug triaging, and running/writing WebQA automated tests with pytest and selenium.
  • I’m hoping to have funding for an OPW intern for this upcoming round to work on the back end of a QA community facilitating tool, using Python and various APIs for Mozilla tools like Bugzilla, Moztrap, and the Mozillians profiles.

Dispatch from Lukas Blakk:

  • Just held the Lesbians Who Tech hackathon at the Mozilla SF space and it was an amazing weekend of networking, recruiting for Mozilla, doing a stump speech on the radical/political possibilities of open source, and also just a lot of social fun.
  • I’m nearing the point of project kick-off for The Ascend Project, a 6-week training course for people underrepresented in the current tech mainstream (and underemployed or unpaid) who will learn how to write automatable tests for MozMill. The first one will take place at the Portland office in Sept/Oct 2014 (starting on Sept 8th). There’s so much more here, but this is just a sound bite.
  • I’m trying to determine what budget I can get agreement on to put towards women in tech outreach this year.
  • PyCon – yes! Such Feminist, So Hackerspace, Much gathering of geek feminists!

Dispatch from Larissa Shapiro:

  • OPW wrapup and next session – we’re wrapping up the current round, scheduling brownbags for two of the current interns, etc. Funding is nearly secured for the next round and we have like 6 willing mentors. w00t.
  • I’m also providing space for, and speaking at, an upcoming event in the Mountain View office on March 29: last year’s African Techwomen emerging leaders were part of a documentary, and the Diaspora African Women’s Network is holding a screening plus a planning session on how to support next year’s ELs and other African and African-American Bay Area women in tech, through this and other projects. Open to Mozilla folks; let me know if you’re interested.

Anything else that’s come up in the last week, or that you’d like Mozillians to know about? Let me know in the comments!