Broken windows, broken code, broken systems

A few days ago, I asked:

I spend a lot of time thinking about the little details in systems – like the number of ephemeral ports consumed, number of open file descriptors and per-process memory utilization over time. Small changes across 50 machines can add up to a large overall change in performance.

And then, today, I saw this article:

One of the more telling comments I received was the idea that since the advent of virtualization, there’s no point in trying to fix anything anymore. If a weird error pops up, just redeploy the original template and toss the old VM on the scrap heap. Similar ideas revolved around re-imaging laptops and desktops rather than fixing the problem. OK. Full stop. A laptop or desktop is most certainly not a server, and servers should not be treated that way. But even that’s not the full reality of the situation.

I’m starting to think that current server virtualization technologies are contributing to the decline of real server administration skills.

There definitely has been a shift – “real server administration skills” are now more about packaging, software selection and managing dramatic shifts in utilization. It’s less important to know exactly how to manage M4 with sendmail, and more important to know that you should probably use postfix instead. I don’t spend much time convincing clients that they need connection pooling; I debug the connection pooler that was chosen.

The available software for web development and operations is quite broad – the version of Linux you select, whether you are vendor supported or not, and the volume of open source tools to support applications.

Inevitably, the industry has shifted to configuration management, rather than configuration. And, honestly, the shift started about 15 years ago with cfengine.

Now we call this DevOps, the idea that systems management should be programmable. Burgess called this “Computer Immunology”. DevOps is a much better marketing term, but I think the core ideas remain the same: Make programmatic interfaces to manage systems and automate.

But, back to the broken window thing! I did some searching for development and broken windows and found that in 2007, a developer talked about Broken Window Theory:

People are reluctant to break something that works, but not so much when it doesn’t. If the build is already broken, then people won’t spend much time making sure their change doesn’t break it (well, break it further). But if the build is pristine green, then they will be very careful about it.

In 2005, Jeff Atwood mentioned the original source, and said “Maybe we should be sweating the small stuff.”

That stuck with me because I admit that I focus on the little details first. I try to fix and automate where I can, but for political or practical reasons, I often am unable to make the comprehensive system changes I’d like to see.

So, given that most of us live in the real world where some things are just left undone, where do we draw the line? What do we consider a bit of acceptable street litter, and what do we consider a broken window? When is it ok to just reboot the system, and when do you really need to figure out exactly what went wrong?

This decision-making process is often the difference between a productive work day and one filled with frustration.

The strategies that we use to make this choice are probably the most important aspects of system administration and devops today. There is, of course, never a single right answer for every business. But I’m sure there are some themes.

For example:

James posted “Rules for Infrastructure” just the other day, which is a repost of the original gist. What I like about this is that they are phrased philosophically: here are the lines in the sand, and the definitions that we’re all going to agree to.

Where do you draw the line? And how do you communicate to your colleagues where the line is?

8 thoughts on Broken windows, broken code, broken systems


  1. This is a generic problem across most types of work these days, from auto shops that don’t actually repair parts but swap them out, to throwaway cell phones. There are many reasons for this: planned obsolescence, modular components, lack of training, impatience and so on. The one overriding reason, as I see it, is that in developed countries we overvalue our time. When the choice comes down to paying a sysadmin X dollars/hr x Y hours to track down the problem or rebooting, guess what the choice is. There was a time when economic and political imbalances allowed people in developed countries to get more and more for their time; that time has passed. This also accounts for two other issues in IT and other fields in this country: jobs moving overseas and difficulty in finding jobs that pay what they used to. For the foreseeable future I do not see the sort of in-depth work you talk of being done on a wide scale in this country. The economics are against it. Or, to use your analogy, the line has been drawn and it is $/hr.

  2. I agree that economic conditions have influenced this a lot.

    But I’m interested in the negotiation process – because at least in my world, this isn’t just an economic decision. Customer complaints, support headaches and personal pet peeves play a role as well 🙂

    So, given that more than just economics tends to decide these issues in any particular organization, I am curious about the policies each org comes up with. In my consulting, I’ve found that the policies vary to an extreme.

    I’d love to hear about how each workplace struggles with this process internally.

  3. Well, I came of age during Watergate and I tend to follow Deep Throat’s suggestion to Woodward/Bernstein: ‘Follow the money’. That being said, other issues do come into play. From my experience, especially for smaller organizations, the philosophy is driven top down. If the boss(es) are open to problem solving rather than problem hiding, then the organization is. If you are lucky they come to the table with that attitude. Otherwise it is a matter of education, and that is the hitch: how to prove that solving problems is truly beneficial. Anywhere revenue is involved, it usually means proving that an ounce of prevention is worth a pound of cure. Otherwise it is a matter of people’s time: showing that having people repeat the same mistakes over and over is counterproductive, and that the better option is to solve the problem and move on.

    The issue is that life is a series of problems/challenges thrown at us. Some people relish the opportunity to surmount each as it comes up and then move on to the next. Other people see it as insurmountable and choose to take the path of least resistance, i.e. apply a Band-Aid. In my more cynical moments that leads me to divide the world into two camps, problem solvers and problem creators. The negotiation process then is between those two groups. To make it work means getting at least some people from each group to the middle. Again, from my experience that generally means baby steps. Start with a process that is simple and has a high likelihood of working.

    For example, at a greenhouse where I took over maintenance duties I came into a situation where ‘deferred’ maintenance was the norm. This was driven by low expectations all the way around. To change that I started small by instituting a simple work order system: basically slips of paper on a clipboard by the break room. I explained the process at a company-wide meeting: you write down the problem, I jump on it as quickly as possible. As in any situation there were early adopters, and the key was making sure that their experience was positive. Over time word of mouth sold the program and I was on my way to going from quite literally putting out fires to preventing them in the first place.

  4. While economics and sociological concerns both play parts, there’s the third leg of the driving triad – regulation. While much of regulatory enforcement boils down to economic impact in the commercial world, it isn’t all there – there are interesting (read, painful) sociological results from some kinds of publicity available to regulatory bodies.

    As such, while ‘following the money’, I’d suggest keeping an eye on those big bricks dangling overhead with various regulations tagged on ’em – after that, it’s the classic risk calculation: vulnerability * threat * rate.

    While you CAN use FUD here, it tends to be self-limiting, and finding realistic, useful numbers for such calculation is an artform in and of its own right.

    IMO, ‘course – YMMV. This *IS* the world-wide web, after all…

  5. I first read the broken windows analogy in The Pragmatic Programmer (2000). (Excerpt of that part here:

    As has already been said, in the end it always comes down to resources. But the point made in the above-mentioned book, and probably also in the original “neighborhood safety” context, is that you should not measure the cost against the current status, with a single broken window, but against a future of many more windows that eventually *will* break unless you take care of the first ones. From “Broken windows” (the article linked as the original source):

    “A particular rule that seems to make sense in the individual case makes no sense when it is made a universal rule and applied to all cases. It makes no sense because it fails to take into account the connection between one broken window left untended and a thousand broken windows.”

  6. Daniel: Ahh, nice. I figured there would be a Pragmatic Programmer link around somewhere. 🙂

    Great point. I also think of this as related to technical debt, and probably should have said so earlier.

    This reminds me of a presentation that I think Damian Conway or maybe Andy Lester gave at OSCON about tracking features implemented in his consulting team. You got credit for implementing features, but the credit evaporated when bugs were reported. This was a way to dis-incentivize production of code without proper tests, or customer acceptance.

    I think it’s related, but still a difficult thing to argue about with sysadmins.

  7. Wasn’t me, Selena. If I said anything about bugs and testing, it would have been to write the test that fails in the presence of the bug, BEFORE fixing the bug. That way you know that you have a test, and that the test works, and that the bug has been fixed.
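
The test-first order described in that last comment can be sketched in Python. This is only an illustration under stated assumptions: `parse_port_buggy`, `parse_port_fixed` and `bug_is_absent` are hypothetical names invented for this example, and the "bug" (accepting out-of-range port numbers) is made up. The point is the sequence: the check fails against the buggy code, which shows the test really detects the bug, and then passes once the fix is in.

```python
def parse_port_buggy(text):
    """Hypothetical buggy version: accepts any integer, even an invalid port."""
    return int(text)


def parse_port_fixed(text):
    """Hypothetical fixed version: rejects ports outside 1..65535."""
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def bug_is_absent(parse_port):
    """Regression check written BEFORE the fix.

    Returns True only if the implementation rejects an out-of-range port.
    """
    try:
        parse_port("70000")
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    # Step 1: against the buggy code, the check reports the bug is present.
    print("buggy passes check:", bug_is_absent(parse_port_buggy))
    # Step 2: after the fix, the same check passes.
    print("fixed passes check:", bug_is_absent(parse_port_fixed))
```

Running the check against the buggy version first is what gives you the three things the comment lists: you have a test, you know the test works, and you know the bug has been fixed.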