Getting real about post-mortems

I talk a lot about post-mortems. I started thinking about them a long time ago, and I’ve run quite a few.

I tend to think about meetings in general as post-mortems, since typical meetings tend to be for talking about what’s been done and what we might do, rather than *actually doing work*. But we can change our meetings to be better.

In the slides from my keynote on Sunday, I posted some specific information about how to operate post-mortems.

The key points for conducting the meeting were:

  • Set expectation for 100% participation
  • Designate a note keeper & time keeper
  • Everyone shares a success, a failure, and something to do better
  • Vote anonymously on what to do next
  • Communicate meeting notes out

There’s great research into each one of these items. Some of it comes from the “Effective Meetings” curriculum taught by Intel University. Fast Company had a great “meeting myths” article back in 1996 that still holds true (and references Intel’s meeting culture). The bit about anonymous voting comes from research into group dynamics: people say different things depending on who is listening and how much social pressure there is to lie or tell the truth.

How do you run your post-mortems? Anything you’d add?

Going from Vagrant and Puppet into EC2: A short survey of 5 tools (and two I didn’t bother trying)

I thought this would be easy.

I started using Vagrant, and was productive with it in about a day. Really a couple hours. Most of my time was spent downloading the correct version of VirtualBox, looking for starter images and then a small amount of time experimenting with the Vagrantfile scripting language (for multiple VMs).

And we made some Puppet configs.

Then I wanted to use those same Puppet configs with EC2.

So my ultimate goals were:

  1. Reuse my existing puppet configs as much as possible
  2. Have a completely automated deploy of a server system (including checkouts of code from a private github repo)
  3. Have a puppetmaster in EC2
  4. Be able to provision systems from EC2 or my laptop
  5. Make the whole process easy for my coworkers

This is mostly a list of what I failed at using, and the thing I succeeded with at the end.

Short aside:

Pro tip to people writing documentation: Most tutorials and sites that make recommendations for tools leave out the part where you run into all kinds of insane problems. Create a wiki page or a place where you collect the problems. Please.

For example: My Cloud Formation to Ubuntu AMI deploy was failing with an error in cfn.rb that said: “Unexpected return.” Um. Ok. *facepalm*

The problem was that an AWS-image-specific JSON file wasn’t present (and couldn’t be created) on the target machine. So instead of noting (raise an exception, anyone?) that the file wasn’t present, the module just executed a bare return.
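
To make the “raise an exception, anyone?” point concrete, here is a minimal sketch of the failure mode I would have preferred. This is not the actual cfn.rb code; the `load_metadata` method and the file paths are made up for illustration. The only idea it shows is: check for the file, and say exactly what was missing.

```ruby
require 'json'

# Hypothetical helper for this sketch; not taken from any real Puppet module.
# Loads a JSON metadata file and fails loudly when the file is absent,
# instead of silently returning.
def load_metadata(path)
  unless File.exist?(path)
    # Errno::ENOENT is Ruby's standard "no such file" error; the extra text
    # tells the reader what the code was looking for and why.
    raise Errno::ENOENT, "metadata file #{path} (needed before deploying)"
  end
  JSON.parse(File.read(path))
end

begin
  load_metadata('/nonexistent/metadata.json')
rescue Errno::ENOENT => e
  warn "deploy aborted: #{e.message}"
end
```

An error message that names the missing file would have turned an afternoon of digging into a quick fix.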

Because I don’t know much about Puppet internals, this was a very annoying problem to solve. (Like, what gets installed in /var/lib/puppet/lib vs. in the gem install vs. the cloudpack library I was told to install in /etc/puppet/modules?)

Stepping back a bit – a useful note from the Cloud Formation folks would have been: “Hey – this probably won’t work if you try to deploy to non-Amazon Linux AMI distros of Linux.” It’s not obvious that’s the case! You’re supposed to be able to completely control the classes being installed on the target system, right? Bad assumption, apparently.

And we’re back!

Let me know in the comments if you’ve successfully navigated any of the tools I didn’t pick. Juju, in particular, I don’t think I gave a fair chance (since I didn’t try it at all).

Here’s my list:

  1. Juju

    I just wasn’t sure this was a reasonable thing to install/use. No one I knew had ever heard of it. Didn’t try it.

  2. Mcollective + tools ported to PHP

    I’m interested in Mcollective, but the configs looked overly complex, and I didn’t have anyone close by who was actively using it.

    The examples scared me away because of the PHP. I already had three languages at play in the deployment, and I didn’t need another language dependency. So, I didn’t bother trying it.

  3. Custom scripts based on the ec2-tools packages

    This approach works, but is fragile and a PITA to keep updated. I tried it as a “getting oriented” exercise, and abandoned it.

  4. Mccloud

    This looked awesome! I could reuse all my Vagrant configs and not really have to change anything… Except I had to maintain duplicate configs, with ‘Mccloud’ substituted for ‘Vagrant’. Eh.

    I may revisit this tool in the future, but it seemed to require pretty much the same things as the tool I ultimately decided to use, and didn’t seem as flexible. I also had a weird restriction where it wouldn’t allow me to spin up the correct type of image (I wanted m1.small in my testing). Could have been PEBKAC — I didn’t take good enough notes to say for sure.

  5. cloud-init

    This looked very promising! We were already using Ubuntu, so it seemed like a good fit.

    Pros: easy – pass in a shell script when starting an EC2 instance from the web. Cons: required yet-another-configuration style. But there were command-line tools and it was looking very promising.

    In the end, using a supported package would have required me to be running a Linux desktop to start my puppetmaster. I didn’t search much harder than brew install cloud-init for a Mac-equivalent (that doesn’t exist). So, I moved on to the next thing.

  6. AWS Cloud Formation

    I launched a puppetmaster pre-configured instance! I sort of got puppetmaster running! Then I tried to deploy an Ubuntu AMI from it… This does not work.

    So, I will save you a ton of time: Avoid trying to mix the pre-specified Cloud Formation images with other systems.

    Someone showed me the chunk of the config you can rip out and probably get it to work. I was frustrated at that point, and moved on. Too much tweaking was required, for what was uncertain gain at that point.

  7. Puppet Labs’ Cloud Provisioner

    This is what I am currently using! I’m running HEAD pulled directly from github. Older versions are not recommended. (I tried three versions.)

    The configuration is pretty straightforward and documented. The one catch (and it’s an important one) is that you have to amend your $RUBYLIB to include the code if you don’t install it in your Ruby’s default libdir. There’s no gem. Yet.

    I customized the deploy script to my liking – there is an unsupported option called --install-script you can pass in that will execute whatever .erb (a shell script!) you’d like if you put it in ~/.puppet/scripts. You can also pass in your puppetmaster hostname with --server.

    Totally sweet.

    The command line is OK, but there’s also a programmatic interface in Ruby. Dan Bode showed me a short code snippet that worked (hostnames & keys sanitized):


    require 'puppet'
    require 'puppet/face'
    Puppet::Face[:node, :current].install('myserver.compute-1.amazonaws.com',
      :keyfile => 'mykey.pem', :login => 'ubuntu',
      :install_script => 'custom-puppetmaster',
      :server => 'myserver.compute-1.amazonaws.com')

    I so appreciate this! Faces is awesome.
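
On the $RUBYLIB point from item 7: RUBYLIB is a list of directories that Ruby adds to its load path at startup, separated by the platform's path separator. Here is a small sketch of building that value without duplicating entries. The `rubylib_with` helper and the checkout path are hypothetical; substitute wherever you actually cloned the code.

```ruby
# Build a RUBYLIB value that includes an extra library directory.
# Entries are separated by File::PATH_SEPARATOR (":" on Unix-like systems).
# `rubylib_with` is a made-up helper for this sketch.
def rubylib_with(dir, current = ENV['RUBYLIB'])
  entries = current.to_s.split(File::PATH_SEPARATOR).reject(&:empty?)
  # Put the new directory first, unless it is already listed.
  entries.unshift(dir) unless entries.include?(dir)
  entries.join(File::PATH_SEPARATOR)
end

# Hypothetical checkout location, for illustration only.
checkout_lib = '/home/me/src/puppetlabs-cloud-provisioner/lib'

# Print a line you could paste into your shell profile.
puts "export RUBYLIB=#{rubylib_with(checkout_lib)}"
```

You could just as easily write the export line by hand; the sketch only shows what the variable is supposed to contain.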

I’ve got some additional tweaking to do yet, but I’m planning to commit a few amendments to the provisioner scripts included by default and the README. And I filed a couple bugs.
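
For anyone who hasn't written one, an install script of the kind described in item 7 is an ERB template that renders to a shell script. Below is a rough sketch of that idea using Ruby's standard ERB library. The template contents and the `server` variable are my assumptions for illustration; I'm not documenting which variables the provisioner actually exposes to its templates, so check the scripts that ship with it.

```ruby
require 'erb'

# A made-up install script template. The `server` variable is an
# assumption for this sketch, not a documented provisioner binding.
def install_script_template
  <<~'SCRIPT'
    #!/bin/bash
    set -e
    apt-get update
    apt-get install -y puppet
    puppet agent --test --server <%= server %>
  SCRIPT
end

# Render the template; `server` is picked up from this method's binding.
def render_install_script(server)
  ERB.new(install_script_template).result(binding)
end

puts render_install_script('myserver.compute-1.amazonaws.com')
```

The rendered output is a plain shell script, which is why the .erb files can be treated as shell scripts with a few placeholders.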

Overall, I’d bet that cloud-provisioner (if you use the version currently on github) will work for most people.

High availability and Postgres

A friend contacted me today, asking me “What are the best practices for failover with Postgres?” And he mentioned pgpool-II.

He was interested in 9.0, since 9.1 hasn’t been released yet. (But it’s looking like we’re gearing up for a September release!)

My off-the-cuff response was:

There isn’t a single solution, although pgpool-II is a common one.

pgpool-II is what I’ve used in AWS. I’ve also seen people use heartbeat (I guess pacemaker now?). I think either works fine. The frustrating bit is that we don’t have the ability to refresh the failed system easily.

There’s also repmgr: https://github.com/greg2ndQuadrant/repmgr

It’s new, but might be worth exploring.

I started a High Availability page on the PostgreSQL wiki. We really need a canonical source of information for this. Devs are struggling to figure it out from our docs.

What are you doing for HA and Postgres?

Broken windows, broken code, broken systems

A few days ago, I asked:

I spend a lot of time thinking about the little details in systems – like the number of ephemeral ports consumed, number of open file descriptors and per-process memory utilization over time. Small changes across 50 machines can add up to a large overall change in performance.

And then, today, I saw this article:

One of the more telling comments I received was the idea that since the advent of virtualization, there’s no point in trying to fix anything anymore. If a weird error pops up, just redeploy the original template and toss the old VM on the scrap heap. Similar ideas revolved around re-imaging laptops and desktops rather than fixing the problem. OK. Full stop. A laptop or desktop is most certainly not a server, and servers should not be treated that way. But even that’s not the full reality of the situation.

I’m starting to think that current server virtualization technologies are contributing to the decline of real server administration skills.

There definitely has been a shift – “real server administration skills” are now more about packaging, software selection and managing dramatic shifts in utilization. It’s less important to know exactly how to manage m4 with sendmail, and more important to know that you should probably use postfix instead. I don’t spend much time convincing clients that they need connection pooling; I debug the connection pooler that was chosen.

The available software for web development and operations is quite broad – the version of Linux you select, whether you are vendor supported or not, and the volume of open source tools to support applications.

Inevitably, the industry has shifted to configuration management, rather than configuration. And, honestly, the shift started about 15 years ago with cfengine.

Now we call this DevOps, the idea that systems management should be programmable. Burgess called this “Computer Immunology”. DevOps is a much better marketing term, but I think the core ideas remain the same: Make programmatic interfaces to manage systems and automate.

But, back to the broken window thing! I did some searching for development and broken windows and found that in 2007, a developer talked about Broken Window Theory:

People are reluctant to break something that works, but not so much when it doesn’t. If the build is already broken, then people won’t spend much time making sure their change doesn’t break it (well, break it further). But if the build is pristine green, then they will be very careful about it.

In 2005, Jeff Atwood mentioned the original source, and said “Maybe we should be sweating the small stuff.”

That stuck with me because I admit that I focus on the little details first. I try to fix and automate where I can, but for political or practical reasons, I often am unable to make the comprehensive system changes I’d like to see.

So, given that most of us live in the real world where some things are just left undone, where do we draw the line? What do we consider a bit of acceptable street litter, and what do we consider a broken window? When is it ok to just reboot the system, and when do you really need to figure out exactly what went wrong?

This decision-making process is often the difference between a productive work day and one filled with frustration.

The strategies that we use to make this choice are probably the most important aspects of system administration and devops today. There is, of course, never a single right answer for every business. But I’m sure there are some themes.

For example:

James posted “Rules for Infrastructure” just the other day, which is a repost of the original gist. What I like about this is that they are phrased philosophically: here are the lines in the sand, and the definitions that we’re all going to agree to.

Where do you draw the line? And how do you communicate to your colleagues where the line is?