TaskCluster 2016Q2 Retrospective

The TaskCluster Platform team worked very hard in Q2 to support the migration off Buildbot, bring new projects into our CI system and look forward with experiments that might enable fully-automated VM deployment on hardware in the future.

We also brought on 5 interns. For a team of 8 engineers and one manager, this was a tremendous team accomplishment. We are also working closely with interns on the Engineering Productivity and Release Engineering teams, resulting in a much higher communication volume than in months past.

We continued our work with RelOps to land Windows builds, and those are available in pushes to Try. This means people can use “one click loaners” for Windows builds as well as Linux (through the Inspect Task link for jobs)! Work on Windows tests is proceeding.

We also created try pushes for Mac OS X tests, and integrated them with the Mac OS X cross-compiled builds. This also meant deep diving into the cross-compiled builds to green them up in Q3 after some compiler changes.

A big part of the work for our team and for RelEng was preparing to implement a new kind of signing process. Aki and Jonas spent a good deal of time on this, as did many other people across PlatformOps. What came out of that work was a detailed specification for TaskCluster changes and for a new service from RelEng. We expect to see prototypes of these ideas by the end of August, and the major blocking changes to the workers and provisioner to be complete then too.

This all leads to being able to ship Linux Nightlies directly from TaskCluster by the end of Q3. We’re optimistic that this is possible, with the knowledge that there are still a few unknowns and a lot has to come together at the right time.

Much of the work on TaskCluster is like building a 747 in-flight. The microservices architecture enables us to ship small changes quickly and without much pre-arranged coordination. As time has gone on, we have consolidated some services (the scheduler is deprecated in favor of the “big graph” scheduling done directly in the queue), separated others (we’ve moved Treeherder-specific services into their own component, and are working to deprecate mozilla-taskcluster in favor of a taskcluster-hg component), and refactored key parts of our systems (in-tree scheduling last quarter was an important change for usability going forward). This kind of change is starting to slow down as the software and the team adapt and mature.

I can’t wait to see what this team accomplishes in Q3!

Below is the team’s partial list of accomplishments and changes. Please drop by #taskcluster or send an email to our tools-taskcluster mailing list on lists.mozilla.org with questions or comments!

Things we did this quarter:

  • initial investigation and timing data around using sccache for linux builds
  • released update for sccache to allow working in a more modern python environment
  • created taskcluster managed s3 buckets with appropriate policies
  • tested linux builds with patched version of sccache
  • tested docker-worker on packet.net for on hardware testing
  • worked with jmaher on talos testing with docker-worker on releng hardware
  • created livelog plugin for taskcluster-worker (just requires tests now)
  • added reclaim logic to taskcluster-worker
  • converted gecko and gaia in-tree tasks to use new v2 treeherder routes
  • Updated gaia-taskcluster to allow github repos to use new taskcluster-treeherder reporting
  • move docs, schemas, references to https
  • refactor documentation site into tutorial / manual / reference
  • add READMEs to reference docs
  • switch from a * certificate to a SAN certificate for taskcluster.net
  • increase accessibility of AWS provisioner by separating bar-graph stuff from workerType configuration
  • use roles for workerTypes in the AWS provisioner, instead of directly specifying scopes
  • allow non-employees to login with Okta, improve authentication experience
  • named temporary credentials
  • use npm shrinkwrap everywhere
  • enable coalescing
  • reduce the artifact retention time for try jobs (to reduce S3 usage)
  • support retriggering via the treeherder API
  • document azure-entities
  • start using queue dependencies (big-graph-scheduler)
  • worked with NSS team to have tasks scheduled and displayed within treeherder
  • Improve information within docker-worker live logs to include environment information (ip address, instance type, etc)
  • added hg fingerprint verification to decision task
  • Responded and deployed patches to security incidents discovered in q2
  • taskcluster-stats-collector running with signalfx
  • most major services using signalfx and sentry via new monitoring library taskcluster-lib-monitor
  • Experimented with QEMU/KVM and libvirt for powering a taskcluster-worker engine
  • QEMU/KVM engine for taskcluster-worker
  • Implemented Task Group Inspector
  • Organized efforts around front-end tooling
  • Re-wrote and generalized the build process for taskcluster-tools and future front-end sites
  • Created the Migration Dashboard
  • Organized efforts with contractors to redesign and improve the UX of the taskcluster-tools site
  • First Windows tasks in production – NSS builds running on Windows 2012 R2
  • Windows Firefox desktop builds running in production (currently shown on staging treeherder)
  • new features in generic worker (worker type metadata, retaining task users/directories, managing secrets in secrets store, custom drive for user directories, installing as a startup item rather than service, improved syscall integration for logins and executing processes as different users)
  • many firefox desktop build fixes including fixes to python build scripts, mozconfigs, mozharness scripts and configs
  • CI cleanup https://travis-ci.org/taskcluster
  • support for relative definitions in jsonschema2go
  • schema/references cleanup

Paying down technical debt

  • Fixed numerous issues/requests within mozilla-taskcluster
  • properly schedule and retrigger tasks using new task dependency system
  • add more supported repositories
  • Align job state between treeherder and taskcluster better (i.e. cancels)
  • Add support for additional platform collection labels (pgo/asan/etc)
  • fixed retriggering of github tasks in treeherder
  • Reduced space usage on workers using docker-worker by removing temporary images
  • fixed issues with the gaia decision task that had prevented it from running since March 30th
  • Improved robustness of image creation
  • Fixed all linter issues for taskcluster-queue
  • finished rolling out shrinkwrap to all of our services
  • began trial of having travis publish our libraries (rolled out to 2 libraries now. talking to npm to fix a bug for a 3rd)
  • turned on greenkeeper everywhere then turned it off again for the most part (it doesn’t work with shrinkwrap, etc)
  • “modernized” (newer node, lib-loader, newest config, directory structure, etc) most of our major services
  • fix a lot of subtle background bugs in tc-gh and improve logging
  • shared eslint and babel configs created and used in most services/libraries
  • instrumented taskcluster-queue with statistics and error reporting
  • fixed issue where task dependency resolver would hang
  • Improved error message rendering on taskcluster-tools
  • Web notifications for one-click-loaner UI on taskcluster-tools
  • Migrated stateless-dns server from tutum.co to docker cloud
  • Moved provisioner off azure storage development account
  • Moved our npm package to a single npm organization

TaskCluster Platform Team: Q1 retrospective

The TaskCluster Platform team did a lot of foundational work in Q1 to set the stage for some aggressive goals in Q2 around landing new OS support and migrating off Buildbot as fast as we can.

The two big categories of work we had were “Moving Forward” — things that move TaskCluster forward in terms of developing our team and adding cool features, and “Paying debt” — upgrading infra, improving security, cleaning up code, improving existing interfaces and spinning out code into separate libraries where we can.

As you’ll see, there’s quite a lot of maintenance that goes into our services at this point. There’s probably some overlap of features in the “paying debt” section. Despite a little bit of fuzziness in the definitions, I think this is an interesting way to examine our work, and a way for us to prioritize features that eliminate certain classes of unpleasant debt-paying work. I’m planning to do a similar retrospective for Q2 in July.

I’m quite proud of the foundational work we did on taskcluster-worker, and it’s already paying off in rapid progress with OS X support on hardware in Q2. We’re making fairly good progress on Windows in AWS as well, but we had to pay down years of technical debt around Windows configuration to get our builds running in TaskCluster. Making a choice on our monitoring systems was also a huge win, paying off in much better dashboarding and attention to metrics across services. We’re also excited to have shipped the “Big Graph Scheduler”, which enables cross-graph dependencies and arbitrarily large task graphs (previous graphs were limited to about 1300 tasks). Our team also grew by 2 people – we added Dustin Mitchell, who will continue to do all kinds of work around our systems, focus on security-related issues, and ship a new in-tree configuration in Q2, and Eli Perelman, who will focus on front-end concerns.

The TaskCluster Platform team put the following list together at the start of Q2.

Moving forward:

  • Kicked off and made excellent progress on the taskcluster-worker, a new worker with more robust abstractions and our path forward for worker support on hardware and AWS (the OS X worker implementation currently in testing uses this)
  • Shipped task.dependencies in the queue and will be shipping the rest of the “big graph scheduler” changes just in time to support some massive release promotion graphs (a sketch of a task definition that uses dependencies appears after this list)
  • Deployed the first sketch of a monitoring dashboard
  • Shipped login v3 (welcome, dustin!)
  • Rewrote and tested a new method for mirroring data between AWS regions (cloud-mirror)
  • Researched a monitoring solution and made a plan for Q2 rollout of signalFX
  • Prototyped and deployed aggregation service: statsum (and client for node.js)
  • Contributed to upstream open source tools and libraries in golang and node ecosystem
  • Brought bstack and rthijssen up to speed, brought Dustin onboard!
  • Working with both GSoC and Outreachy, and Mozilla’s University recruiting to bring five interns into our team in Q2/Q3
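
For a concrete picture of what queue dependencies look like, here is an abbreviated, hypothetical task definition. The field values are placeholders, but dependencies and requires are the fields the queue’s createTask schema uses for this. A test task that should wait for its build might be defined roughly like so:

    # Hypothetical, abbreviated task definition using queue dependencies.
    # Only the dependency-related fields matter here; the rest are placeholders.
    test_task = {
        "provisionerId": "aws-provisioner-v1",
        "workerType": "desktop-test",
        "created": "2016-07-01T00:00:00Z",
        "deadline": "2016-07-02T00:00:00Z",
        "dependencies": ["<taskId of the build task>"],  # run only after the build task
        "requires": "all-completed",                     # ...and only if it succeeded
        "payload": {"command": ["run-tests.sh"]},
        "metadata": {
            "name": "example test task",
            "description": "runs after its build task completes",
            "owner": "nobody@example.com",
            "source": "https://example.com/",
        },
    }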

Paying debt:

  • Shipped better error messages related to schema violations
  • Rolled out formalization of error messages: {code: “…”, message: “…”, details: {…}}
  • Sentry integration — you see a 5xx error with an incidentId, we see it too!
  • Automatic creation of sentry projects, and rotation of credentials
  • go-got — simple HTTP client for go with automatic retries
  • queue.listArtifacts now takes a continuationToken for paging
  • queue.listTaskGroup refactored for correctness (also returns more information)
  • Pre-compilation of queue, index and aws-provisioner with babel-compile (no longer using babel-node)
  • One-click loaners (related work by armenzg and jmaher to make loaners awesome: instructions + special start mode)
  • Various UI improvements to tools.taskcluster.net (react.js upgrade, favicons, auth tools, login-flow, status, previous taskIds, more)
  • Upgrade libraries for taskcluster-index (new config loader, component loader)
  • Fixed stateless-dns case-sensitivity (livelogs now work with DNS resolvers from German ISPs too)
  • Further greening of travis for our repositories
  • Better error messages for insufficient scope errors
  • Upgraded heroku stack for events.taskcluster.net (pulse -> websocket bridge)
  • Various fixes to automatic retries in go code (httpbackoff, proxy in docker-worker, taskcluster-client-go)
  • Moved towards shrinkwrapping all of the node services (integrity checks for packages)
  • Added worker level timestamps to task logs
  • Added metrics for docker/task image download and load times
  • Added artifact expiration error handling and saner default values in docker-worker
  • Made a version jump from docker 1.6 to 1.10 in production (included version upgrades of packages and kernel, refactoring of some existing logic)
  • Improved taskcluster and treeherder integration (retrigger errors, prep for offloading resultset creation to TH)
  • Rolling out temp credential support in docker-worker
  • Added mach support for downloading task image for local development
  • Client support for temp credentials in go and java client
  • JSON schema cleanups
  • CI cleanup (all green) and turning off circle CI
  • Enhancements to jsonschema2go
  • Windows build work by rob and pete for getting windows builds migrated off Buildbot
  • Added stability levels to APIs

TaskCluster migration: about the Buildbot Bridge

Back on May 7, Ben Hearsum gave a short talk about an important piece of technology supporting our transition to TaskCluster, the Buildbot Bridge. A recording is available.

I took some detailed notes to spread the word about how this work is enabling a great deal of important Q3 work like the Release Promotion project. Basically, the bridge allows us to separate out work that Buildbot currently runs in a somewhat monolithic way into TaskGraphs and Tasks that can be scheduled separately and independently. This decoupling is a powerful enabler for future work.

Of course, you might argue that we could perform this decoupling in Buildbot.

However, moving to TaskCluster means adopting a modern, distributed, queue-based approach to managing incoming jobs. We will be freed of the performance tradeoffs and careful attention required when using relational databases for queue management (Buildbot uses MySQL for its queues; TaskCluster uses RabbitMQ and Azure). We will also be moving “decision tasks” in-tree, meaning that they will be closer to developer environments and it will likely be easier to keep developer and build system environments in sync.

Here are my notes:

Why have the bridge?

  • Allows a graceful transition
  • We’re in an annoying state where we can’t have dependencies between buildbot builds and taskcluster tasks. For example, we can’t move firefox linux builds into taskcluster without also moving everything downstream of those builds into taskcluster
  • It’s not practical, and sometimes just not possible, to move everything at the same time. This lets us reimplement buildbot schedulers as task graphs. Buildbot builds become tasks in those task graphs, enabling us to change each task to be implemented by a Docker worker, a generic worker, or anything we want or need at that point.
  • One of the driving forces is the build promotion project – funsize, anti-virus scanning and binary moving – which will be implemented in taskcluster tasks while the rest stays in Buildbot. We need to be able to bounce between the two.

What is the Buildbot Bridge (BBB)

BBB acts as a TC worker and provisioner and delegates all of that work to Buildbot. As far as TC is concerned, BBB is doing all this work, not Buildbot itself. TC knows nothing about Buildbot.

There are three services:

  • TC Listener: responds to things happening in TC
  • BuildBot Listener: responds to BB events
  • Reflector: takes care of things that can’t be done in response to events — it reclaims tasks periodically, for example. TC expects running Tasks to be reclaimed regularly; if a Task stops being reclaimed, TC considers that Task dead.

BBB has a small database that associates build requests with TC taskids and runids.
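
As a rough illustration (this is not the real BBB schema; the table and column names here are invented), that mapping could be as small as a single table:

    import sqlite3

    # Hypothetical schema: one row per Buildbot BuildRequest, remembering
    # which TaskCluster task and run it corresponds to.
    conn = sqlite3.connect("bbb.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS bridge_tasks (
            buildrequest_id INTEGER PRIMARY KEY,  -- Buildbot BuildRequest id
            task_id         TEXT NOT NULL,        -- TaskCluster taskId
            run_id          INTEGER NOT NULL      -- TaskCluster runId
        )
    """)

    def remember(buildrequest_id, task_id, run_id):
        """Record which TC task/run a Buildbot BuildRequest belongs to."""
        conn.execute(
            "INSERT OR REPLACE INTO bridge_tasks VALUES (?, ?, ?)",
            (buildrequest_id, task_id, run_id),
        )
        conn.commit()

    def lookup(buildrequest_id):
        """Return the (task_id, run_id) pair for a BuildRequest, or None."""
        return conn.execute(
            "SELECT task_id, run_id FROM bridge_tasks WHERE buildrequest_id = ?",
            (buildrequest_id,),
        ).fetchone()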

BBB is designed to be multihomed. It is currently deployed, but not running, on three Buildbot masters. We can lose an AWS region and the bridge will still function. It consumes from Pulse.

The system is dependent on Pulse, SchedulerDB and Self-serve (in addition to a Buildbot master and Taskcluster).

Taskcluster Listener

Reacts to events coming from TC Pulse exchanges.

Creates build requests in response to tasks becoming “pending”. When someone pushes to mozilla-central, BBB inserts BuildRequests into BB SchedulerDB. Pending jobs appear in BB. BBB cancels BuildRequests as well — can happen from timeouts, someone explicitly cancelling in TC.
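
In rough pseudocode (this is not the actual BBB source; the message fields follow the taskcluster-queue Pulse payloads as I understand them, and schedulerdb/bridge_db are placeholder objects), the TC Listener’s reaction to a pending task looks something like this:

    import json

    def on_task_pending(message_body, schedulerdb, bridge_db):
        """Handle a task-pending message from the taskcluster-queue exchange.

        schedulerdb and bridge_db stand in for the Buildbot SchedulerDB and
        the bridge's own mapping database; both are placeholders here.
        """
        msg = json.loads(message_body)
        task_id = msg["status"]["taskId"]
        run_id = msg["runId"]

        # Insert a pending BuildRequest so the job shows up in Buildbot...
        buildrequest_id = schedulerdb.create_build_request(task_id)

        # ...and remember the association so later Buildbot events can be
        # mapped back to the right TaskCluster task and run.
        bridge_db.remember(buildrequest_id, task_id, run_id)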

Buildbot Listener

Responds to events coming from the BB Pulse exchanges.

Claims a Task when its build starts. Attaches Buildbot Properties to Tasks as artifacts (the buildslave name and other information/metadata). It also resolves those Tasks.

Buildbot and TC don’t have a 1:1 mapping between BB statuses and TC resolutions, and these also need to coordinate with Treeherder colors. A short discussion happened about implementing those colors in an artifact rather than inferring them from return codes or statuses inherent to BB or TC.

Reflector

  • Runs on a timer – every 60 seconds
  • Reclaims tasks: need to do this every 30-60 minutes
  • Cancels Tasks when a BuildRequest is cancelled on the BB side (the bridge has to trawl through the BB DB to detect that state); a rough sketch of this loop appears below
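
A minimal sketch of that loop, with placeholder names for the bridge database and the TaskCluster queue client (this is not the real BBB code), might look like:

    import time

    RECLAIM_MARGIN = 10 * 60  # reclaim when a claim has less than ~10 minutes left

    def reflector_loop(bridge_db, tc_queue, interval=60):
        """Periodic maintenance that can't be driven purely by Pulse events."""
        while True:
            for entry in bridge_db.all_tracked_tasks():
                # Keep the claim alive for builds that are still running.
                if entry.claim_expires_in() < RECLAIM_MARGIN:
                    tc_queue.reclaimTask(entry.task_id, entry.run_id)

                # If the BuildRequest was cancelled on the Buildbot side,
                # cancel the corresponding TaskCluster task as well.
                if entry.buildrequest_cancelled():
                    tc_queue.cancelTask(entry.task_id)
                    bridge_db.forget(entry.buildrequest_id)

            time.sleep(interval)  # the reflector wakes up every 60 seconds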

Scenarios

  • A successful build!

Task is created. The Task in TC is pending, nothing in BB. TCListener picks up the event and creates a BuildRequest (pending).

BB creates a Build. BBListener receives buildstarted event, claims the Task.

Reflector reclaims the Task while the Build is running.

Build completes successfully. BBListener receives log uploaded event (build finished), reports success in TaskCluster.

  • Build fails initially, succeeds upon retry

(500 from hg – common reason to retry)

Same through Reflector.

The BB build fails and is marked as RETRY. BBListener receives the log uploaded event, reports an exception to TaskCluster and calls rerun on the Task.

BB has already started a new Build. TCListener receives the task-pending event, updates the runid, and does not create a new BuildRequest.

The Build completes successfully. The Buildbot Listener receives the log uploaded event and reports success to TaskCluster.

  • Task exceeds deadline before Build starts

The Task is created. TCListener receives the task-pending event and creates a BuildRequest. Nothing happens. The Task goes past its deadline and TaskCluster cancels it. TCListener receives the task-exception event and cancels the BuildRequest through Self-serve.

QUESTIONS:

  • What is the TC deadline? A task past its deadline is marked by the Queue as timeout/deadline exceeded

  • On TH, if someone requests a rebuild twice, what happens? There is no retry/rerun; we duplicate the subgraph — wherever we retrigger, you get everything below it, so you’d end up with duplicates. Retries and rebuilds are separate: rebuilds are triggered by humans, retries are internal to BB. TC doesn’t have a concept of retries.

  • How do we avoid duplicate reporting? TC will be considered the source of truth in the future; unsure about the interim. Maybe TH can ignore duplicates since the builder names will be the same.

  • Replacing the scheduler: what does that mean exactly?

    • Mostly moving decision tasks in-tree — practical impact: YAML files get moved into the tree
    • Remove all scheduling and Hg polling from Buildbot

Roll-out plan

  • Connected to the Alder branch currently
  • Replacing some of the Alder schedulers with TaskGraphs
  • All the BB Alder schedulers are disabled, and we were able to get a push to generate a TaskGraph!

Next steps might be release scheduling tasks, rather than merging into central. Someone else might be able to work on other CI tasks in parallel.

TaskCluster migration: a “hello, world” for worker task creator

On June 1, 2015, Morgan and Dustin presented an introduction to configuring and testing TaskCluster worker tasks. The session was recorded. Their notes are also available in an etherpad.

The key tutorial information centered on how to set up jobs, test/run them locally and selecting appropriate worker types for jobs.

This past quarter Morgan has been working on Linux Docker images and TaskCluster workers for Firefox builds. Using that work as an example, Morgan showed how to set up new jobs with Docker images. She also touched on a couple issues that remain, like sharing sensitive or encrypted information on publicly available infrastructure.

A couple really nice things:

  • You can run the whole configuration locally by copying and pasting a shell script that’s output by the TaskCluster tools
  • There are a number of predefined workers you can use, so that you’re not creating everything from scratch

Dustin gave an overview of task graphs using a specific example. Looking through the docs, I think the best source of documentation other than this video is probably the API documentation. The docs could use a little more narrative for context, as Dustin’s short talk about it demonstrated.

The talk closed with an invitation to help write new tasks, with pointers to the Android work Dustin’s been doing.

Migrating to Taskcluster: work underway!

Mozilla’s build and test infrastructure has relied on Buildbot as the backbone of our systems for many years. Asking around, I heard that we started using Buildbot around 2008. The time has come for a change!

Many of the people working on migrating from Buildbot to Taskcluster gathered all together for the first time to talk about migration this morning. (A recording of the meeting is available)

The goal of this work is to shut down Buildbot, and the goal of the meeting was to identify a timeline for doing that. Our first goal post is to eliminate the Buildbot Scheduler by moving build production entirely into TaskCluster, and scheduling tests in TaskCluster.

Today, most FirefoxOS builds and tests are in Taskcluster. Nearly everything else for Firefox is driven by Buildbot.

Our current tracker bug is ‘Buildbot -> TaskCluster transition’, which breaks down the big projects underway at a high level.

We have quite a few things to figure out in the Windows and Mac OS X realm where we’re interacting with hardware, and some work is left to be done to support Windows in AWS. We’re planning to get more clarity on the work that needs to be done there next week.

The bugs identified seem tantalizingly close to describing most of the issues that remain in porting our builds. The plan is to have a timeline documented for builds to be fully migrated over by Whistler! We are also working on migrating tests, but for now believe the Buildbot Bridge will help us get tests out of the Buildbot scheduler, even if we continue to need Buildbot masters for a while. An interesting idea was raised during the meeting about using runner to manage hardware instead of the masters, which we’ll be exploring further.

If you’re interested in learning more about TaskCluster and how to use it, Chris Cooper is running a training on Monday June 1 at 1:30pm PT.

Ping me on IRC, Twitter or email if you have questions!

pushlog from last night, a brief look at Try

One of the mysterious and amazing parts of Mozilla’s Release Engineering infrastructure is the Try server, or just “Try”. This is how Firefox and FirefoxOS developers can push changes to Mozilla’s large build and test system, made up of about 3000 servers at any time. There are a couple of amazing things about this — one is that anyone can request access to push to Try, not just core developers or Mozilla employees. It is a publicly-available system and tool. The second is that a remarkable amount of control is offered over which builds are produced and which tests are run. Each Try run could consume 300+ hours of machine time if every possible build and test option is selected.

This blog post is a brain dump from a few days of noodling and a quick review of the pushlog for Try, which shows exactly the options developers are choosing for their Try runs.

To use Try, you need to include a string of configuration that looks something like this in your topmost hg commit:

try: -b do -p emulator,emulator-jb,emulator-kk,linux32_gecko,linux64_gecko,macosx64_gecko,win32_gecko -u all -t none

That’s a recommended string for B2G developers from the Sheriff best practices wiki page. If you’re interested in how this works, the code for the try syntax parser itself is here.
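
As a toy illustration of the structure (this is not the real try syntax parser), the flags in a string like that map onto options roughly as follows:

    import argparse

    def parse_try_syntax(commit_message):
        """Toy parser: split a 'try:' line into its main options."""
        if "try:" not in commit_message:
            return None
        # Everything after the "try:" marker, up to the end of that line,
        # is treated as a set of CLI-style flags.
        syntax = commit_message.split("try:", 1)[1].split("\n", 1)[0]

        parser = argparse.ArgumentParser()
        parser.add_argument("-b", "--build", help="build types: d, o, or do")
        parser.add_argument("-p", "--platforms", help="comma-separated platforms, or 'all'")
        parser.add_argument("-u", "--unittests", help="unit test suites, or 'all'/'none'")
        parser.add_argument("-t", "--talos", help="Talos suites, or 'all'/'none'")
        args, _ignored = parser.parse_known_args(syntax.split())

        return {
            "builds": args.build,
            "platforms": (args.platforms or "").split(","),
            "unittests": args.unittests,
            "talos": args.talos,
        }

    # e.g. parse_try_syntax("try: -b do -p linux64_gecko,macosx64_gecko -u all -t none")

Run on the B2G string above, this would report both build types, the seven listed platforms, all unit tests and no Talos jobs.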

You can include a Try configuration string in an empty commit, or as the last part of an existing commit message. What most developers tell me they do is have an empty commit with the Try string in it, and then remove the extra commit before merging a patch. From all the feedback I’ve read and heard, I think that’s probably what we should document on the wiki page for Try, and maybe have a secondary page with “variants” for those that want to use more advanced tooling. The KISS rule seems to apply here.

If you’re a regular user of Try, you might have heard of the high scores tracker. What you might not know is that there is a JSON file behind this page and it contains quite a bit of history that’s used to generate that page. You can find it if you just replace ‘.html’ with ‘.json’.

Something about the 8-bit ambiance of this page made me think of “texts from last night”. But in reality, Try is most busy during typical Pacific Time working hours.

The high scores page also made me curious about the actual Try strings that people were using. I pulled them all out and had a look at what config options were most common.
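
The counting itself is simple once the strings are in hand; a sketch along these lines (how the try strings get pulled out of the JSON is glossed over here, since I haven’t described its exact structure) does the job:

    from collections import Counter

    FLAGS = ["-b do", "-p all", "-u all", "-t none"]

    def count_flags(try_strings):
        """Count how often each interesting flag appears across try pushes."""
        counts = Counter()
        for syntax in try_strings:
            for flag in FLAGS:
                if flag in syntax:
                    counts[flag] += 1
        return counts

    # The real data came out of the high scores JSON; here is a tiny stand-in:
    pushes = [
        "try: -b do -p all -u all -t none",
        "try: -b d -p linux64 -u mochitests -t none",
    ]
    for flag, count in count_flags(pushes).most_common():
        print(count, "pushes used", flag)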

Of the 1262 pushes documented today in that file:

  • 760 used ‘-b do’, meaning both Debug and Opt builds are made. I wonder whether this should just be the default, or whether we should have some clear recommendations about what developers should do here.
  • 366 used ‘-p all’ meaning build on 28 platforms, and produce 28 binaries. Some people might intend this, but I wonder if some other default might be more helpful.
  • 456 used ‘-u all’ meaning that all the unit tests were run.
  • 1024 used ‘-t none’ reflecting the waning use of Talos tests.

I’m still thinking about how to use this information. I have a few ideas:

  • Change the defaults for a minimal try run
  • Make some commonly-used aliases for things like “build on B2G platforms” or “tests that catch a lot of problems”
  • Create a dashboard that shows TopN try syntax strings
  • update the parser to include more of the options documented on various wiki pages as environment variables

If you’re a regular user of Try, what would you like to see changed? Ping me in #releng, email or tweet your thoughts!

And some background on me: I’ve been working with the Release Engineering team since April 1, 2015, and most of that time so far was spent on #buildduty, a topic I’m planning to write a few blog posts about. I’m also having a look at planning for the TaskCluster migration (away from Buildbot), monitoring, and developer environments for releng tooling. I’m also working on a zine to share at Whistler about what is going on when you push to Try. Finally, I stood up a bot for reporting alerts through AWS SNS and for filing #buildduty-related bugzilla bugs.

My goal right now is to find ways of communicating about and sharing the important work that Release Engineering is doing. Some of that is creating tracker bugs and having meetings to spread knowledge. Some of that is documenting our infrastructure and drawing pictures of how our systems interact. A lot of it is listening and learning from the engineers who build and ship Firefox to the world.