Broken windows, broken code, broken systems

A few days ago, I asked:

I spend a lot of time thinking about the little details in systems – like the number of ephemeral ports consumed, number of open file descriptors and per-process memory utilization over time. Small changes across 50 machines can add up to a large overall change in performance.

And then, today, I saw this article:

One of the more telling comments I received was the idea that since the advent of virtualization, there’s no point in trying to fix anything anymore. If a weird error pops up, just redeploy the original template and toss the old VM on the scrap heap. Similar ideas revolved around re-imaging laptops and desktops rather than fixing the problem. OK. Full stop. A laptop or desktop is most certainly not a server, and servers should not be treated that way. But even that’s not the full reality of the situation.

I’m starting to think that current server virtualization technologies are contributing to the decline of real server administration skills.

There definitely has been a shift – “real server administration skills” are now more about packaging, software selection and managing dramatic shifts in utilization. It’s less important know to know exactly how to manage M4 with sendmail, and more important that you know you should probably use postfix instead. I don’t spend much time convincing clients that they need connection pooling; I debug the connection pooler that was chosen.

The available software for web development and operations is quite broad – the version of Linux you select, whether you are vendor supported or not, and the volume of open source tools to support applications.

Inevitably, the industry has shifted to configuration management, rather than configuration. And, honestly, the shift started about 15 years ago with cfengine.

Now we call this DevOps, the idea that systems management should be programmable. Burgess called this “Computer Immunology”. DevOps is a much better marketing term, but I think the core ideas remain the same: Make programmatic interfaces to manage systems and automate.

But, back to the broken window thing! I did some searching for development and broken windows and found that in 2007, a developer talked about Broken Window Theory:

People are reluctant to break something that works, but not so much when it doesn’t. If the build is already broken, then people won’t spend much time making sure their change doesn’t break it (well, break it further). But if the build is pristine green, then they will be very careful about it.

In 2005, Jeff Atwood mentioned the original source, and said “Maybe we should be sweating the small stuff.”

That stuck with me because I admit that I focus on the little details first. I try to fix and automate where I can, but for political or practical reasons, I often am unable to make the comprehensive system changes I’d like to see.

So, given that most of us live in the real world where some things are just left undone, where do we draw the line? What do we consider a bit of acceptable street litter, and what do we consider a broken window? When is it ok to just reboot the system, and when do you really need to figure out exactly what went wrong?

This decision making process is often the difference between a productive work day, and one filled with frustration.

The strategies that we use to make this choice are probably the most important aspects of system administration and devops today. There, of course, is never a single right answer for every business. But I’m sure there are some themes.

For example:

James posted “Rules for Infrastructure” just the other day, which is a repost of the original gist. What I like about this is that they are phrased philosophically: here are the lines in the sand, and the definitions that we’re all going to agree to.

Where do you draw the line? And how do you communicate to your colleagues where the line is?