[workweek] tc-worker workweek recap

Sprint recap

We spent this week sprinting on the tc-worker, engines and plugins. We merged 19 pull requests and had many productive discussions!

tc-worker core

We implemented the task loop! This basic loop should start when the worker is invoked. It spins up a task claimer and a task manager, which together claim as many tasks as the worker’s available capacity allows and run them to completion. You can find details in this commit. We’re still working on some high-level documentation.
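To make the claim-up-to-capacity idea concrete, here is a minimal sketch in Go. It is not the code from the commit: `Task`, `Claimer` and `RunLoop` are hypothetical names made up for illustration, and the sketch returns once the claimer reports no work, where a real worker would presumably keep polling the queue.

```go
package main

// Task is a hypothetical unit of work; the real worker's task type is richer.
type Task struct {
	ID  string
	Run func() // executes the task to completion
}

// Claimer is a hypothetical stand-in for the task claimer. It must return at
// most n tasks, and an empty slice when no work is pending.
type Claimer func(n int) []Task

// RunLoop claims tasks up to the available capacity, runs each one to
// completion in its own goroutine, and returns the number of completed tasks
// once nothing is pending and nothing is running.
func RunLoop(claim Claimer, capacity int) int {
	if capacity < 1 {
		return 0
	}
	finished := make(chan struct{}, capacity)
	running, completed := 0, 0
	for {
		// Ask only for as many tasks as there are free slots.
		if free := capacity - running; free > 0 {
			for _, t := range claim(free) {
				running++
				go func(t Task) {
					t.Run()
					finished <- struct{}{}
				}(t)
			}
		}
		if running == 0 {
			// Nothing was claimed and nothing is in flight.
			return completed
		}
		<-finished // wait for one running task to free its slot
		running--
		completed++
	}
}
```

Only the loop goroutine touches `running` and `completed`, so the sketch needs no mutex; running tasks report back over the `finished` channel.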

We did some cleanups to make it easier to download and get started with builds. We fixed up packages related to generating Go types from JSON schemas, and the types now conform to the linting rules.

We also implemented the webhookserver. The package provides implementations of the WebHookServer interface, which allows webhooks to be attached to and detached from an internet-exposed server. This will support both the livelog and interactive features. Work is detailed in PR 37.

engine: hello, world

Greg created a proof of concept and pushed a successful task to emit a hello, world artifact. Greg will be writing up something to describe this process next week.

plugin: environment variables

Wander landed this plugin this week to support environment variable setting. The work is described in PR 39.

plugin: artifact uploads

This plugin will support artifact uploads to S3 for all engines and is based on generic-worker code. This work was started in PR 55.

TaskCluster design principles

We discussed as a team the ideas behind the design of TaskCluster. The umbrella principle we try to stick to is: Getting Things Built. We felt it was important to say that first because it helps us remember that we’re here to provide features to users, not just design systems. The four key design principles were distilled to:

  • Self-service
  • Robustness
  • Enable rapid change
  • Community friendliness

One surprising connection (to me) we made was that our privacy and security features are driven by community friendliness.

We plan to add our ideas about this to a TaskCluster “about” page.

TaskCluster code review

We discussed our process for code reviews, and how we’d like to do them in the future. We covered issues around when to do architecture reviews and how to get “pre-reviews” for ideas done with colleagues who will be doing our reviews. We made an outline of ideas and will be giving them a permanent home on our docs site.

Q2 Planning

We made a first pass at our 2016q2 goals. The main themes are to add OS X engine support to taskcluster-worker, continue work on refactoring in-tree config, and build out our monitoring system beyond InfluxDB. Further refinements to our plan will come in a couple of weeks, as we close out Q1 and get a better understanding of work related to the Buildbot to TaskCluster migration.

Tier-1 status for Linux 64 Debug build jobs on March 14, 2016

I sent this to dev-planning, dev-platform, sheriffs and tools-taskcluster today. I added a little more context for a non-Mozilla audience.

The time has come! We are planning to switch to Tier-1 on Treeherder for TaskCluster Linux 64 Debug build jobs on March 14. At the same time, we will hide the Buildbot build jobs, but continue running them. This means that the TaskCluster jobs will become what Sheriffs use to determine the health of patches and our trees.

On March 21, we plan to switch the Linux 64 Debug tests to Tier-1 and hide the related Buildbot test jobs.

After about 30 days, we plan to disable and remove all Buildbot jobs related to Linux Debug.

Background:

We’ve been running Linux 64 Debug builds and tests using TaskCluster side-by-side with Buildbot jobs since February 18th. Some of the project work that was done to green up the tests is documented here.

The new tests are running in Docker-ized environments, and the Docker images we use are defined in-tree and publicly accessible.

This work was the culmination of many months of effort, with Joel Maher, Dustin Mitchell and Armen Zambrano primarily focused on test migration this quarter. Thank you to everyone who responded to NEEDINFOs, emails and pings on IRC to help with untangling busted test runs.

On performance, we’re taking a 14% hit across all the new test jobs vs. the old jobs in Buildbot. We ran two large-scale tests to help determine where slowness might still be lurking, and were able to find and fix many issues. There are a handful of jobs remaining that seem significantly slower, while others are significantly faster. We decided that it was more important to deprecate the old jobs and start exclusively maintaining the new jobs now, rather than wait to resolve the remaining performance issues. Over time we hope to address issues with the owners of the affected test suites.

[portland] taskcluster-worker Hello, World

The TaskCluster Platform team is in Portland this week, hacking on the taskcluster-worker.

Today, we all sync’d up on the current state of our worker, and what we’re going to hack on this week. We started with the current docs.

The reason we’re investing so much time in the worker is twofold:

  • The worker code previously lived in two code bases – docker-worker and generic-worker. We need to unify these code bases so that multiple engineers can work on it, and to help us maintain feature parity.
  • We need to get a worker that supports Windows into production. For now, we’re using the generic-worker, but we’d like to switch over to taskcluster-worker in late Q2 or early Q3. This timeline lines up with when we expect the Windows migration from Buildbot to happen.

One of the things I asked this team to do was come up with some demos of the new worker. The first demo today, from Greg Arndt, simply output a log and uploaded it.

The rest of the team is getting their Go environments set up to run tests and get hacking on crucial plugins, like our environment variable handling and additional artifact uploading logic we need for our production workers.

We’re also taking the opportunity to sync up with our Windows environment guru. Our goal for Buildbot to TaskCluster migration this quarter is focused on Linux builds and tests. Next quarter, we’ll be finishing Linux and, I hope, landing Windows builds in TaskCluster. To do that, we have a lot of details to sort out with how we’ll build Windows AMIs and deploy them. It’s a very different model because we don’t have the same options with Docker as we have on Linux.