TaskCluster Platform Team: Q1 retrospective

TaskCluster Platform team did a lot of foundational work in Q1, to set the stage for some aggressive goals in Q2 around landing new OS support and migrating as fast as we can out of Buildbot.

The two big categories of work we had were “Moving Forward” — things that move TaskCluster forward in terms of developing our team and adding cool features, and “Paying debt” — upgrading infra, improving security, cleaning up code, improving existing interfaces and spinning out code into separate libraries where we can.

As you’ll see, there’s quite a lot of maintenance that goes into our services at this point. There’s probably some overlap of features in the “paying debt” section. Despite a little bit of fuzziness in the definitions, I think this is an interesting way to examine our work, and a way for us to prioritize features that eliminate certain classes of unpleasant debt-paying work. I’m planning to do a similar retrospective for Q2 in July.

I’m quite proud of the foundational work we did on taskcluster-worker, and it’s already paying off in rapid progress with OS X support on hardware in Q2. We’re making fairly good progress on Windows in AWS as well, but we had to pay down years of technical debt around Windows configuration to get our builds running in TaskCluster. Making a choice on our monitoring systems was also a huge win, paying off in much better dashboarding and attention to metrics across services. We’re also excited to have shipped the “Big Graph Scheduler”, which enables cross-graph dependencies and arbitrarily large task graphs (previous graphs were limited to about 1300 tasks). Our team also grew by two people – we added Dustin Mitchell, who will continue to do all kinds of work around our systems, focus on security-related issues and ship a new in-tree configuration in Q2, and Eli Perelman, who will focus on front-end concerns.

The TaskCluster Platform team put the following list together at the start of Q2.

Moving forward:

  • Kicked off and made excellent progress on the taskcluster-worker, a new worker with more robust abstractions and our path forward for worker support on hardware and AWS (the OS X worker implementation currently in testing uses this)
  • Shipped task.dependencies in the queue and will be shipping the rest of the “big graph scheduler” changes just in time to support some massive release promotion graphs (a sketch of a dependent task follows this list)
  • Deployed the first sketch of a monitoring dashboard
  • Shipped login v3 (welcome, dustin!)
  • Rewrote and tested a new method for mirroring data between AWS regions (cloud-mirror)
  • Researched a monitoring solution and made a plan for Q2 rollout of signalFX
  • Prototyped and deployed aggregation service: statsum (and client for node.js)
  • Contributed to upstream open source tools and libraries in golang and node ecosystem
  • Brought bstack and rthijssen up to speed, brought Dustin onboard!
  • Working with both GSoC and Outreachy, and Mozilla’s University recruiting to bring five interns into our team in Q2/Q3
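
Since the task.dependencies work shows up in a few of these items, here is a minimal sketch of what a dependent task definition can look like once the big graph scheduler changes land. This is illustrative rather than copied from a real task: the provisioner, worker type, task ID and payload are made up, and the point is just the dependencies and requires fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// A minimal, made-up task definition showing the new dependency fields.
	// Only "dependencies" and "requires" are the point of this example.
	task := map[string]interface{}{
		"provisionerId": "aws-provisioner-v1",
		"workerType":    "b2gtest",
		"dependencies": []string{
			"fN1SbArXTPSVFNUvaOlinQ", // e.g. the build task this test depends on (made-up taskId)
		},
		"requires": "all-completed", // only run once every dependency has finished successfully
		"payload": map[string]interface{}{
			"command": []string{"echo", "hello, world"},
		},
	}
	out, _ := json.MarshalIndent(task, "", "  ")
	fmt.Println(string(out))
}
```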

Paying debt:

  • Shipped better error messages related to schema violations
  • Rolled out formalization of error messages: {code: “…”, message: “…”, details: {…}} (illustrated after this list)
  • Sentry integration — you see a 5xx error with an incidentId, we see it too!
  • Automatic creation of sentry projects, and rotation of credentials
  • go-got — simple HTTP client for go with automatic retries
  • queue.listArtifacts now takes a continuationToken for paging
  • queue.listTaskGroup refactored for correctness (also returns more information)
  • Pre-compilation of queue, index and aws-provisioner with babel-compile (no longer using babel-node)
  • One-click loaners (related work by armenzg and jmaher to make loaners awesome: instructions + special start mode)
  • Various UI improvements to tools.taskcluster.net (react.js upgrade, favicons, auth tools, login-flow, status, previous taskIds, more)
  • Upgrade libraries for taskcluster-index (new config loader, component loader)
  • Fixed stateless-dns case-sensitivity (livelogs now work with DNS resolvers from German ISPs too)
  • Further greening of travis for our repositories
  • Better error messages for insufficient scope errors
  • Upgraded heroku stack for events.taskcluster.net (pulse -> websocket bridge)
  • Various fixes to automatic retries in go code (httpbackoff, proxy in docker-worker, taskcluster-client-go)
  • Moved towards shrinkwrapping all of the node services (integrity checks for packages)
  • Added worker level timestamps to task logs
  • Added metrics for docker/task image download and load times
  • Added artifact expiration error handling and saner default values in docker-worker
  • Made a version jump from docker 1.6 to 1.10 in production (included version upgrades of packages and kernel, refactoring of some existing logic)
  • Improved taskcluster and treeherder integration (retrigger errors, prep for offloading resultset creation to TH)
  • Rolling out temp credential support in docker-worker
  • Added mach support for downloading task image for local development
  • Client support for temp credentials in go and java client
  • JSON schema cleanups
  • CI cleanup (all green) and turning off circle CI
  • Enhancements to jsonschema2go
  • Windows build work by rob and pete for getting windows builds migrated off Buildbot
  • Added stability levels to APIs
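
To make the error-message formalization mentioned in this list concrete, here is a small sketch of the envelope shape in Go. The three top-level fields (code, message, details) come from the list item itself; the specific values below are invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// APIError mirrors the formalized error envelope: a machine-readable code,
// a human-readable message, and a free-form details object.
type APIError struct {
	Code    string                 `json:"code"`
	Message string                 `json:"message"`
	Details map[string]interface{} `json:"details"`
}

func main() {
	// Invented example: roughly what a schema-violation response could carry.
	apiErr := APIError{
		Code:    "InputValidationError",
		Message: "Task payload does not satisfy the payload schema",
		Details: map[string]interface{}{
			"errors": []string{"payload.command: expected an array"},
		},
	}
	out, _ := json.MarshalIndent(apiErr, "", "  ")
	fmt.Println(string(out))
}
```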

[workweek] tc-worker workweek recap

Sprint recap

We spent this week sprinting on the tc-worker, engines and plugins. We merged 19 pull requests and had many productive discussions!

tc-worker core

We implemented the task loop! This basic loop should start when the worker is invoked. It spins up a task claimer and manager responsible for claiming as many tasks as its available capacity allows and running them to completion. You can find details in this commit. We’re still working on some high level documentation.
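
As a rough illustration of that loop (not the actual tc-worker code, which the commit above contains), the shape is: block until there is spare capacity, claim a task, run it to completion in the background, and repeat. Everything named below is a placeholder rather than a tc-worker API.

```go
package main

import (
	"fmt"
	"time"
)

// Task stands in for a claimed task; the real worker deals in queue claims
// with credentials, runIds and so on.
type Task struct{ ID string }

// claimOne is a placeholder for asking the queue for work; here it just
// fabricates a task so the example runs.
func claimOne() (Task, bool) {
	return Task{ID: fmt.Sprintf("task-%d", time.Now().UnixNano())}, true
}

// runToCompletion stands in for the task manager executing a claim.
func runToCompletion(t Task) {
	time.Sleep(100 * time.Millisecond)
	fmt.Println("resolved", t.ID)
}

func main() {
	const capacity = 4
	slots := make(chan struct{}, capacity) // one slot per unit of capacity

	for i := 0; i < 10; i++ { // the real loop runs until the worker shuts down
		slots <- struct{}{} // block until there is spare capacity
		t, ok := claimOne()
		if !ok {
			<-slots
			time.Sleep(time.Second) // nothing pending; back off and poll again
			continue
		}
		go func(t Task) {
			defer func() { <-slots }() // free the slot when the task resolves
			runToCompletion(t)
		}(t)
	}
	time.Sleep(time.Second) // give in-flight tasks a moment to finish (illustrative only)
}
```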

We did some cleanups to make it easier to download and get started with builds. We fixed up the packages related to generating Go types from JSON schemas, and the types now conform to the linting rules.

We also implemented the webhookserver. The package provides implementations of the WebHookServer interface, which allows attachment and detachment of web-hooks to an internet-exposed server. This will support both the livelog and interactive features. Work is detailed in PR 37.
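
The attach/detach idea can be pictured roughly like the interface below. This is a hedged approximation of the concept, not the exact signatures from PR 37.

```go
package webhookexample

import "net/http"

// WebHookServer is an illustrative approximation of the interface described
// above: hooks are attached to an internet-exposed server under an id and
// can later be detached (method names and signatures here are assumptions).
type WebHookServer interface {
	// AttachHook starts routing requests for id to handler and returns the
	// publicly reachable URL, e.g. for a livelog or interactive session.
	AttachHook(id string, handler http.Handler) (url string, err error)

	// DetachHook stops routing requests for id.
	DetachHook(id string) error
}
```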

engine: hello, world

Greg created a proof of concept and pushed a successful task that emits a “hello, world” artifact. Greg will be writing up something to describe this process next week.

plugin: environment variables

Wander landed this plugin this week to support environment variable setting. The work is described in PR 39.

plugin: artifact uploads

This plugin will support artifact uploads for all engines to S3 and is based on generic-worker code. This work was started in PR 55.

TaskCluster design principles

We discussed as a team the ideas behind the design of TaskCluster. The umbrella principle we try to stick to is: Getting Things Built. We felt it was important to say that first because it helps us remember that we’re here to provide features to users, not just design systems. The four key design principles were distilled to:

  • Self-service
  • Robustness
  • Enable rapid change
  • Community friendliness

One surprising connection (to me) we made was that our privacy and security features are driven by community friendliness.

We plan to add our ideas about this to a TaskCluster “about” page.

TaskCluster code review

We discussed our process for code review, and how we’d like to do them in the future. We covered issues around when to do architecture reviews and how to get “pre-reviews” for ideas done with colleagues who will be doing our reviews. We made an outline of ideas and will be giving them a permanent home on our docs site.

Q2 Planning

We made a first pass at our 2016q2 goals. The main theme is to add OS X engine support to taskcluster-worker, continue work on refactoring intree config and build out our monitoring system beyond InfluxDB. Further refinements to our plan will come in a couple weeks, as we close out Q1 and get a better understanding of work related to the Buildbot to TaskCluster migration.

Tier-1 status for Linux 64 Debug build jobs on March 14, 2016

I sent this to dev-planning, dev-platform, sheriffs and tools-taskcluster today. I added a little more context for a non-Mozilla audience.

The time has come! We are planning to switch to Tier-1 on Treeherder for TaskCluster Linux 64 Debug build jobs on March 14. At the same time, we will hide the Buildbot build jobs, but continue running them. This means that these jobs will become what Sheriffs use to determine the health of patches and our trees.

On March 21, we plan to switch the Linux 64 Debug tests to Tier-1 and hide the related Buildbot test jobs.

After about 30 days, we plan to disable and remove all Buildbot jobs related to Linux Debug.

Background:

We’ve been running Linux 64 Debug builds and tests using TaskCluster side-by-side with Buildbot jobs since February 18th. Some of the project work that was done to green up the tests is documented here.

The new tests are running in Docker-ized environments, and the Docker images we use are defined in-tree and publicly accessible.

This work was the culmination of many months of effort, with Joel Maher, Dustin Mitchell and Armen Zambrano primarily focused on test migration this quarter. Thank you to everyone who responded to NEEDINFOs, emails and pings on IRC to help with untangling busted test runs.

On performance, we’re taking a 14% hit across all the new test jobs vs. the old jobs in Buildbot. We ran two large-scale tests to help determine where slowness might still be lurking, and were able to find and fix many issues. There are a handful of jobs remaining that seem significantly slower, while others are significantly faster. We decided that it was more important to deprecate the old jobs and start exclusively maintaining the new jobs now, rather than wait to resolve the remaining performance issues. Over time, we hope to work with the owners of the affected test suites to address the remaining issues.

[portland] taskcluster-worker Hello, World

The TaskCluster Platform team is in Portland this week, hacking on the taskcluster-worker.

Today, we all sync’d up on the current state of our worker, and what we’re going to hack on this week. We started with the current docs.

The reason we’re investing so much time in the worker is twofold:

  • The worker code previously lived in two code bases – docker-worker and generic-worker. We need to unify these code bases so that multiple engineers can work on the worker, and to help us maintain feature parity.
  • We need to get a worker that supports Windows into production. For now, we’re using the generic-worker, but we’d like to switch over to taskcluster-worker in late Q2 or early Q3. This timeline lines up with when we expect the Windows migration from Buildbot to happen.

One of the things I asked this team to do was come up with some demos of the new worker. The first demo today, from Greg Arndt, was to simply output a log and upload it.

The rest of the team is getting their Go environments set up to run tests and get hacking on crucial plugins, like our environment variable handling and additional artifact uploading logic we need for our production workers.

We’re also taking the opportunity to sync up with our Windows environment guru. Our goal for Buildbot to TaskCluster migration this quarter is focused on Linux builds and tests. Next quarter, we’ll be finishing Linux and, I hope, landing Windows builds in TaskCluster. To do that, we have a lot of details to sort out with how we’ll build Windows AMIs and deploy them. It’s a very different model because we don’t have the same options with Docker as we have on Linux.

TaskCluster Platform: 2015Q3 Retrospective

Welcome to TaskCluster Platform’s 2015Q3 Retrospective! I’ve been managing this team this quarter and thought it would be nice to look back on what we’ve done. This report covers what we did for our quarterly goals. I’ve linked to “Publications” at the bottom of this page, and we have a TaskCluster Mozilla Wiki page that’s worth checking out.

High level accomplishments

  • Dramatically improved stability of TaskCluster Platform for Sheriffs by fixing TreeHerder ingestion logic and regexes, adding better logging and fixing bugs in our taskcluster-vcs and mozilla-taskcluster components
  • Created and Deployed CI builds on three major platforms:
    • Added Linux64 (CentOS) and Mac OS X cross-compiled builds as Tier-2 CI builds
    • Completed and documented prototype Windows 2012 builds in AWS, along with their task configuration
  • Deployed auth.taskcluster.net, enabling better security, better support for self-service authorization and easier contributions from outside our team
  • Added region biasing based on cost and availability of spot instances to our AWS provisioner
  • Managed the workload of two interns, and significantly mentored a third
  • Onboarded Selena as a new manager
  • Held a workweek to focus attention on bringing our environment into production support of Release Engineering

Goals, Bugs and Collaborators

We laid out our Q3 goals in this etherpad. Our chosen themes this quarter were:

  • Improve operational excellence — focus on sheriff concerns, data collection,
  • Facilitate self-serve consumption — refactoring auth and supporting roles for scopes, and
  • Exploit opportunities to differentiate from other platforms — support for interactive sessions, docker images as artifacts, github integration and more blogging/docs.

We had 139 Resolved FIXED bugs in the TaskCluster product.

Link to graph of resolved bugs

We also resolved 7 bugs in FirefoxOS, TreeHerder and RelEng products/components.

We received significant contributions from other teams: Morgan (mrrrgn) designed, created and deployed taskcluster-github; Ted deployed Mac OS X cross compiled builds; Dustin reworked the Linux TC builds to use CentOS, and resolved 11 bugs related to TaskCluster and Linux builds.

An additional 9 people contributed code to core TaskCluster, intree build scripts and task definitions: aus, rwood, rail, mshal, gerard-majax, mihneadb@gmail.com, htsai, cmanchester, and echen.

The Big Picture: TaskCluster integration into Platform Operations

Moving from B2G to Platform was a big shift. The team had already made a goal of enabling Firefox Release builds, but it wasn’t entirely clear how to accomplish that. We spent a lot of this quarter learning things from RelEng and prioritizing. The whole team spent the majority of our time supporting others’ use of TaskCluster through training and support, developing task configurations and resolving infrastructure problems. At the same time, we shipped docker-worker features, provisioner biasing and a new authorization system. One tricky infra issue that John and Jonas worked on early in the quarter was a strange AWS Provisioner failure that came down to an obscure missing dependency. We had a few git-related tree closures that Greg worked closely on and ultimately committed fixes to taskcluster-vcs to help resolve. Everyone spent a lot of time responding to bugs filed by the sheriffs and requests for help on IRC.

It’s hard to overstate how important the Sheriff relationship and TreeHerder work was. A couple teams had the impression that TaskCluster itself was unstable. Fixing this was a joint effort across TreeHerder, Sheriffs and TaskCluster teams.

When we finished, useful errors were finally being reported by tasks and starring became much more specific and actionable. We may have received a partial compliment on this from philor. The extent of artifact upload retries, for example, was made much clearer and we’ve prioritized fixing this in early Q4.

Both Greg and Jonas spent many weeks meeting with Ed and Cam, designing systems, fixing issues in TaskCluster components and contributing code back to TreeHerder. These meetings also led to Jonas and Cam collaborating more on API and data design, and this work is ongoing.

We had our own “intern” who was hired on as a contractor for the summer, Edgar Chen. He did some work with the docker-worker, implementing Interactive Sessions, and did analysis on our provisioner/worker efficiency. We made him give a short, sweet presentation on the interactive sessions. Edgar is now at CMU for his sophomore year and has referred at least one friend back to Mozilla to apply for an internship next summer.

Pete completed a Windows 2012 prototype build of Firefox that’s available from Try, with documentation and a completely automated process for creating AMIs. He hasn’t created a narrated video with dueling, British-English accented robot voices for this build yet.

We also invested a great deal of time in the RelEng interns. Jonas and Greg worked with Anhad on getting him productive with TaskCluster. When Anthony arrived, we also onboarded him. Jonas worked closely to get him working on a new project, hooks.taskcluster.net. To take these two bits of work from RelEng on, I pushed TaskCluster’s roadmap for generic-worker features back a quarter and Jonas pushed his stretch goal of getting the big graph scheduler into production to Q4.

We worked a great deal with other teams this quarter on taskcluster-github, supporting new Firefox and B2G builds, RRAs for the workers and generally telling Mozilla about TaskCluster.

Finally, we spent a significant amount of time interviewing, and then creating a more formal interview process that includes a coding challenge and structured-interview type questions. This is still in flux, but the first two portions are being used and refined currently. Jonas, Greg and Pete spent many hours interviewing candidates.

Berlin Work Week

TaskCluster Platform Team in Berlin

Toward the end of the quarter, we held a workweek in Berlin to focus our next round of work on critical RelEng and Release-specific features as well as production monitoring planning. Dustin surprised us with delightful laser cut acrylic versions of the TaskCluster logo for the team! All team members reported that they benefited from being in one room to discuss key designs, get immediate code review, and demonstrate work in progress.

We came out of this with 20+ detailed documents from our conversations, greater alignment on the priorities for Platform Operations and a plan for trainings and tutorials to give at Orlando. Dustin followed this up with a series of ‘TC Topics’ Vidyo sessions targeted mostly at RelEng.

Our Q4 roadmap is focused on key RelEng features to support Release.

Publications

Our team published a few blog posts and videos this quarter:

TaskCluster migration: about the Buildbot Bridge

Back on May 7, Ben Hearsum gave a short talk about an important piece of technology supporting our transition to TaskCluster, the Buildbot Bridge. A recording is available.

I took some detailed notes to spread the word about how this work is enabling a great deal of important Q3 work like the Release Promotion project. Basically, the bridge allows us to separate out work that Buildbot currently runs in a somewhat monolithic way into TaskGraphs and Tasks that can be scheduled separately and independently. This decoupling is a powerful enabler for future work.

Of course, you might argue that we could perform this decoupling in Buildbot.

However, moving to TaskCluster means adopting a modern, distributed queue-based approach to managing incoming jobs. We will be freed of the performance tradeoffs and careful attention required when using relational databases for queue management (Buildbot uses MySQL for its queues; TaskCluster uses RabbitMQ and Azure). We will also be moving “decision tasks” in-tree, meaning that they will be closer to developer environments, likely making it easier to keep developer and build system environments in sync.

Here are my notes:

Why have the bridge?

  • Allows a graceful transition
  • We’re in an annoying state where we can’t have dependencies between buildbot builds and taskcluster tasks. For example: we can’t move firefox linux builds into taskcluster without moving everything downstream of those also into taskcluster
  • It’s not practical, and sometimes just not possible, to move everything at the same time. This lets us reimplement Buildbot schedulers as task graphs. Buildbot builds become tasks in those task graphs, enabling us to change each task to be implemented by a Docker worker, a generic worker or anything we want or need at that point.
  • One of the driving forces is the build promotion project – the funsize and anti-virus scanning and binary moving work – which is going to be implemented in TaskCluster tasks while the rest stays in Buildbot. We need to be able to bounce between the two.

What is the Buildbot Bridge (BBB)

BBB acts as a TC worker and provisioner and delegates all those things to BuildBot. As far as TC is concerned, BBB is doing all this work, not Buildbot itself. TC knows nothing about Buildbot.

There are three services:

  • TC Listener: responds to things happening in TC
  • BuildBot Listener: responds to BB events
  • Reflector: takes care of things that can’t be done in response to events — it reclaims tasks periodically, for example. TC expects running tasks to be reclaimed; if a Task stops being reclaimed, TC considers that Task dead.

BBB has a small database that associates build requests with TC taskids and runids.

BBB is designed to be multihomed. It is currently deployed but not running on three Buildbot masters. We can lose an AWS region and the bridge will still function. It consumes from Pulse.

The system is dependent on Pulse, SchedulerDB and Self-serve (in addition to a Buildbot master and Taskcluster).

Taskcluster Listener

Reacts to events coming from TC Pulse exchanges.

Creates build requests in response to tasks becoming “pending”. When someone pushes to mozilla-central, BBB inserts BuildRequests into BB SchedulerDB. Pending jobs appear in BB. BBB cancels BuildRequests as well — this can happen due to timeouts or someone explicitly cancelling in TC.

Buildbot Listener

Responds to events coming from the BB Pulse exchanges.

Claims a Task when its build starts. Attaches Buildbot properties (the buildslave name and other metadata) to Tasks as artifacts. It also resolves those Tasks.

Buildbot and TC don’t have a 1:1 mapping between BB statuses and TC resolutions. They also need to coordinate with Treeherder colors. A short discussion happened about implementing these colors in an artifact rather than inferring them from return codes or statuses inherent to BB or TC.

Reflector

  • Runs on a timer – every 60 seconds
  • Reclaims tasks: need to do this every 30-60 minutes (a sketch of this loop follows the list)
  • Cancels Tasks when a BuildRequest is cancelled on the BB side (the bridge has to trawl through the BB DB to detect this state)
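
Purely as an illustration of the reflector’s job described above (the real Buildbot Bridge is its own codebase), a timer-driven loop could look like the sketch below. The task ID, the helper functions and their signatures are all placeholders.

```go
package main

import (
	"fmt"
	"time"
)

// Placeholders for the real calls the bridge makes to the TaskCluster queue
// and to Buildbot's self-serve API.
func reclaim(taskID string, runID int) error { fmt.Println("reclaimed", taskID); return nil }
func cancelledInBuildbot(taskID string) bool { return false } // would trawl the BB DB
func cancelTask(taskID string) error         { fmt.Println("cancelled", taskID); return nil }

type tracked struct {
	runID         int
	lastReclaimed time.Time
}

func main() {
	// taskId -> state; the real bridge keeps this association in its small database.
	tasks := map[string]*tracked{
		"aBcDeFgHiJkLmNoPqRsTuV": {runID: 0}, // made-up taskId
	}

	ticker := time.NewTicker(60 * time.Second) // the reflector wakes up every 60 seconds
	defer ticker.Stop()

	for range ticker.C {
		for taskID, t := range tasks {
			// Claims only need refreshing every 30-60 minutes; reclaim when due
			// so TaskCluster does not consider the task dead.
			if time.Since(t.lastReclaimed) > 30*time.Minute {
				if err := reclaim(taskID, t.runID); err == nil {
					t.lastReclaimed = time.Now()
				}
			}
			// Mirror Buildbot-side cancellations back into TaskCluster.
			if cancelledInBuildbot(taskID) {
				cancelTask(taskID)
				delete(tasks, taskID)
			}
		}
	}
}
```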

Scenarios

  • A successful build!

Task is created. The Task in TC is pending, nothing in BB yet. TCListener picks up the event and creates a BuildRequest (pending).

BB creates a Build. BBListener receives buildstarted event, claims the Task.

Reflector reclaims the Task while the Build is running.

Build completes successfully. BBListener receives log uploaded event (build finished), reports success in TaskCluster.

  • Build fails initially, succeeds upon retry

(500 from hg – common reason to retry)

The flow is the same as the successful case up through the Reflector reclaiming the Task.

BB fails and the build is marked as RETRY. BBListener receives the log uploaded event, reports an exception to TaskCluster and calls rerun on the Task.

BB has already started a new Build. TCListener receives the task-pending event, updates the runid, and does not create a new BuildRequest.

The Build completes successfully. The Buildbot Listener receives the log uploaded event and reports success to TaskCluster.

  • Task exceeds deadline before Build starts

Task is created. TCListener receives the task-pending event and creates a BuildRequest. Nothing happens; the Task goes past its deadline and TaskCluster cancels it. TCListener receives the task-exception event and cancels the BuildRequest through Self-serve.

QUESTIONS:

  • What is the TC deadline? Queue: a task past its deadline is marked as timeout/deadline exceeded

  • On TH, if someone requests a rebuild twice, what happens? There is no retry/rerun; we duplicate the subgraph — wherever we retrigger, you get everything below it, so you’d end up with duplicates. Retries and rebuilds are separate: rebuilds are triggered by humans, retries are internal to BB. TC doesn’t have a concept of retries.

  • How do we avoid duplicate reporting? TC will be considered the source of truth in the future. Unsure about the interim. Maybe TH can ignore duplicates, since the builder names will be the same.

  • Replacing the scheduler what does that mean exactly?

    • Mostly moving decision tasks in-tree — practical impact: YAML files get moved into the tree
    • Remove all scheduling from BuildBot and Hg polling

Roll-out plan

  • Connected to the Alder branch currently
  • Replacing some of the Alder schedulers with TaskGraphs
  • All the BB Alder schedulers are disabled, and we were able to get a push to generate a TaskGraph!

Next steps might be release scheduling tasks, rather than merging into central. Someone else might be able to work on other CI tasks in parallel.

TaskCluster migration: a “hello, world” for worker task creator

On June 1, 2015, Morgan and Dustin presented an introduction to configuring and testing TaskCluster worker tasks. The session was recorded. Their notes are also available in an etherpad.

The key tutorial information centered on how to set up jobs, test/run them locally and selecting appropriate worker types for jobs.

This past quarter Morgan has been working on Linux Docker images and TaskCluster workers for Firefox builds. Using that work as an example, Morgan showed how to set up new jobs with Docker images. She also touched on a couple issues that remain, like sharing sensitive or encrypted information on publicly available infrastructure.

A couple really nice things:

  • You can run the whole configuration locally by copying and pasting a shell script that’s output by the TaskCluster tools
  • There are a number of predefined workers you can use, so that you’re not creating everything from scratch

Dustin gave an overview of task graphs using a specific example. Looking through the docs, I think the best source of documentation other than this video is probably the API documentation. The docs could use a little more narrative for context, as Dustin’s short talk about it demonstrated.

The talk closed with an invitation to help write new tasks, with pointers to the Android work Dustin’s been doing.

Migrating to Taskcluster: work underway!

Mozilla’s build and test infrastructure has relied on Buildbot as the backbone of our systems for many years. Asking around, I heard that we started using Buildbot around 2008. The time has come for a change!

Many of the people working on migrating from Buildbot to Taskcluster gathered all together for the first time to talk about migration this morning. (A recording of the meeting is available)

The goal of this work is to shut down Buildbot, and we need to identify a timeline for doing so. Our first goal post is to eliminate the Buildbot Scheduler by moving build production entirely into TaskCluster, and scheduling tests in TaskCluster.

Today, most FirefoxOS builds and tests are in Taskcluster. Nearly everything else for Firefox is driven by Buildbot.

Our current tracker bug is ‘Buildbot -> TaskCluster transition‘. At a high level, the big projects underway are:

We have quite a few things to figure out in the Windows and Mac OS X realm where we’re interacting with hardware, and some work is left to be done to support Windows in AWS. We’re planning to get more clarity on the work that needs to be done there next week.

The bugs identified seem tantalizingly close to describing most of the issues that remain in porting our builds. The plan is to have a timeline documented for builds to be fully migrated over by Whistler! We are also working on migrating tests, but for now believe the Buildbot Bridge will help us get tests out of the Buildbot scheduler, even if we continue to need Buildbot masters for a while. An interesting idea was raised during the meeting about using runner to manage hardware instead of the masters; we’ll be exploring this further.

If you’re interested in learning more about TaskCluster and how to use it, Chris Cooper is running a training on Monday June 1 at 1:30pm PT.

Ping me on IRC, Twitter or email if you have questions!