Monitorama 2014 wrapup

I’m just settling back into the daily routine after RelEng/RelOps’ workweek and then Monitorama back-to-back.

Videos will eventually be posted here.

I thought it was awesome that the conference started with some #hugops.

Here are my highlights:

  • I gave a talk about crontabber! I have my speaker’s notes if you’re interested!

  • Dan Slimmons gave a nice talk about basic probability and how understanding the difference between sensitivity and specificity can help you choose more useful alerts. It was super basic stats stuff, but a good foundation for building up stats competency in teams. (There’s a small worked example after this list.)

  • James Mickens gave a hilarious talk about the cloud that will be well worth watching when the video goes up.

  • Ashe Dryden gave a talk about gender issues and “our most wicked problem”. It was very well-received by the audience, which was gratifying for me personally. I think the audience walked away with some very practical things to do: speak up among peers when someone says things that make you uncomfortable and ask questions about equal treatment in your company for things like salary, perks and benefits.

  • Several talks were given about monitoring and managing ops inside companies. My favorite was from Daniel Schauenberg (a statsd contributor) of Etsy, and Scott Sanders spoke about similar topics in this presentation on GitHub’s outage lifecycle. Relatedly, though not at the conference, Heroku just published an incident response runbook.

  • There was a hilarious lightning talk about the failure of the Swedish ship Vasa as an object lesson for massive project failure. Here’s a link to the case study the lightning talk was based on.

  • Larry Price (@laprice) gave a 5-minute talk about Postgres autovacuum tuning, which was awesome, and I hope he posts the slides. (I’ve put a rough, related sketch after this list.) It reminded me that I should do a couple of brownbags about Postgres config this summer!

  • I was struck by how many people said they use Postgres in production. Someone asked during one of the talks who was running it, and nearly half the audience raised their hands.

  • InfluxDB, a new timeseries database emphasizing an HTTP API (remind anyone of CouchDB? :D), seemed interesting, although maybe rough around the edges when it came to documenting useful features and best practices. When I mentioned it on Twitter, I found a few folks already trying to use it in production and got at least one bug filed. 🙂 (There’s a minimal write example after this list.)

  • I also saw an amazing demo of Kibana, which seems like a very interesting dashboard/investigation/querying interface to Elasticsearch. I watched a friend deploy it in about an hour to look at their ES systems last Wednesday.

  • Dashing from Shopify was also very interesting, although it’s a Ruby project, so not easy to integrate with our Pythonic world. However, putting on a contributor relations hat, it could be a wonderful and beautiful way for contributors to interact with our many APIs.
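
On the sensitivity/specificity point: here’s a tiny worked example, with numbers invented purely for illustration, of how the two measures describe an alert. The punchline is that when real incidents are rare, even a high-specificity alert can page you mostly with false alarms.

```python
# Toy numbers, invented for illustration.

def sensitivity(true_positives, false_negatives):
    """Of the real incidents, what fraction did the alert catch?"""
    return true_positives / float(true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Of the healthy checks, what fraction stayed quiet?"""
    return true_negatives / float(true_negatives + false_positives)

# An alert that fired on 9 of 10 real incidents...
print(sensitivity(9, 1))     # 0.9 -- good at catching problems

# ...and stayed quiet on 950 of 1000 healthy checks.
print(specificity(950, 50))  # 0.95 -- sounds high, but that is still
                             # 50 false pages for ~9 real ones here
```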
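And on the autovacuum bullet: until Larry’s slides appear, here’s a rough sketch of the kind of check I’d run to see whether autovacuum is keeping up. The connection string and the 10% threshold are my own invention, not from his talk; pg_stat_user_tables is standard Postgres.

```python
# Rough sketch: find tables with a high dead-tuple ratio, a hint that
# autovacuum settings (e.g. autovacuum_vacuum_scale_factor or
# autovacuum_max_workers) may need tuning.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection
cur = conn.cursor()
cur.execute("""
    SELECT relname, n_dead_tup, n_live_tup, last_autovacuum
      FROM pg_stat_user_tables
     WHERE n_live_tup > 0
       AND n_dead_tup::float / n_live_tup > 0.1  -- >10% dead: worth a look
     ORDER BY n_dead_tup DESC
""")
for relname, dead, live, last_av in cur.fetchall():
    print("%s: %d dead / %d live, last autovacuum: %s"
          % (relname, dead, live, last_av))
```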
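And for InfluxDB: writing points really is just an HTTP POST. This is a minimal sketch against the 0.x-era /db/&lt;name&gt;/series endpoint as I understood it; the database name, credentials and series are all made up, and the API is still moving fast, so check the current docs before borrowing any of it.

```python
# Minimal sketch of a write to InfluxDB's 0.x-era HTTP API.
# The endpoint shape and auth parameters are my reading of the docs at
# the time and may have changed; everything else here is invented.
import json
import requests

payload = [{
    "name": "webapp.response_time",   # series name -- made up
    "columns": ["time", "value"],
    "points": [[1400000000, 250]],    # (unix seconds, milliseconds)
}]
resp = requests.post(
    "http://localhost:8086/db/mydb/series",
    params={"u": "root", "p": "root", "time_precision": "s"},
    data=json.dumps(payload),
)
print(resp.status_code)
```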

I’m looking forward to the videos coming out and a list of slide decks, as I missed a few talks during hallway track conversations. I met several people who are managing similar or larger event loads than we do with Socorro, so it was fun swapping stories and seeing how their software stacks are evolving. RabbitMQ was a weapon of choice for reporting environments, along with Storm. There was lots of love for Kafka among the people dealing with real-time customer response.

Overall, I highly recommend attending Monitorama to dip a toe into the state of the art in system operations, monitoring and ops management.

Broken windows, broken code, broken systems

A few days ago, I asked:

I spend a lot of time thinking about the little details in systems – like the number of ephemeral ports consumed, number of open file descriptors and per-process memory utilization over time. Small changes across 50 machines can add up to a large overall change in performance.
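
To make that concrete, those particular details are cheap to measure on Linux. Here’s a rough sketch using /proc (the paths are standard; everything else is just a demo):

```python
# Rough sketch (Linux-only) of two of those "little details":
# open file descriptors for one process, and TCP sockets in use.
import os

def open_fds(pid):
    """Count open file descriptors for a process via /proc."""
    return len(os.listdir('/proc/%d/fd' % pid))

def tcp_sockets():
    """Count entries in /proc/net/tcp, minus the header line."""
    with open('/proc/net/tcp') as f:
        return sum(1 for _ in f) - 1

# Track these per-process, per-machine, over time; small drifts add up.
print(open_fds(os.getpid()), tcp_sockets())
```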

And then, today, I saw this article:

One of the more telling comments I received was the idea that since the advent of virtualization, there’s no point in trying to fix anything anymore. If a weird error pops up, just redeploy the original template and toss the old VM on the scrap heap. Similar ideas revolved around re-imaging laptops and desktops rather than fixing the problem. OK. Full stop. A laptop or desktop is most certainly not a server, and servers should not be treated that way. But even that’s not the full reality of the situation.

I’m starting to think that current server virtualization technologies are contributing to the decline of real server administration skills.

There definitely has been a shift – “real server administration skills” are now more about packaging, software selection and managing dramatic shifts in utilization. It’s less important to know exactly how to manage M4 with sendmail, and more important to know that you should probably use postfix instead. I don’t spend much time convincing clients that they need connection pooling; I debug the connection pooler that was chosen.

The range of choices in web development and operations is quite broad: which version of Linux you select, whether or not you have vendor support, and the sheer volume of open source tools available to support your applications.

Inevitably, the industry has shifted to configuration management, rather than hand-maintained configuration. And, honestly, the shift started about 15 years ago with cfengine.

Now we call this DevOps, the idea that systems management should be programmable. Mark Burgess called this “Computer Immunology”. DevOps is a much better marketing term, but I think the core ideas remain the same: make programmatic interfaces to manage systems, and automate.
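
To illustrate what “programmable systems management” means at its core: every tool in this lineage, from cfengine on down, is built around idempotent operations that declare a desired state and only act when reality differs. A toy sketch, not any real tool’s API:

```python
# Toy sketch of the core configuration-management idea: declare the
# desired state, converge only if the system differs. Not a real API.
import os

def ensure_line(path, line):
    """Make sure `line` appears in the file at `path`."""
    existing = []
    if os.path.exists(path):
        with open(path) as f:
            existing = f.read().splitlines()
    if line in existing:
        return False              # already converged; change nothing
    with open(path, 'a') as f:
        f.write(line + '\n')
    return True                   # we changed the system

# Running this twice changes the system once -- that's the whole trick.
ensure_line('/tmp/demo.conf', 'max_connections = 100')
ensure_line('/tmp/demo.conf', 'max_connections = 100')
```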

But, back to the broken window thing! I did some searching on broken windows in software development and found that in 2007, a developer talked about Broken Window Theory:

People are reluctant to break something that works, but not so much when it doesn’t. If the build is already broken, then people won’t spend much time making sure their change doesn’t break it (well, break it further). But if the build is pristine green, then they will be very careful about it.

In 2005, Jeff Atwood mentioned the original source, and said “Maybe we should be sweating the small stuff.”

That stuck with me because I admit that I focus on the little details first. I try to fix and automate where I can, but for political or practical reasons, I often am unable to make the comprehensive system changes I’d like to see.

So, given that most of us live in the real world where some things are just left undone, where do we draw the line? What do we consider a bit of acceptable street litter, and what do we consider a broken window? When is it ok to just reboot the system, and when do you really need to figure out exactly what went wrong?

This decision-making process is often the difference between a productive work day and one filled with frustration.

The strategies we use to make this choice are probably the most important aspect of system administration and devops today. There is, of course, never a single right answer for every business. But I’m sure there are some common themes.

For example:

James posted “Rules for Infrastructure” just the other day, which is a repost of the original gist. What I like about these rules is that they’re phrased philosophically: here are the lines in the sand, and the definitions that we’re all going to agree to.

Where do you draw the line? And how do you communicate to your colleagues where the line is?