Josef “Jeff” Sipek

Optimizing for Failure

For the past two years, I’ve been working at Barracuda Networks on a key-value storage system called Moebius. As with any other software project, development focused first on stability and basic functionality. Lately, however, we have managed to find some spare cycles to tackle some of the big features we’ve been wishing for, as well as to revisit some of our initial decisions. One of those decisions is error handling: specifically, which kinds of hardware failures should be handled, and how. During this brainstorming, I made an (in my opinion) interesting observation about optimizing systems.

If you take any computer architecture or organization course, you will hear about Amdahl’s law. Even if you have never taken an architecture course, or have simply never heard of Amdahl, you have probably come to the realization that one should optimize for the common case. (Technically, Amdahl’s law is about parallel speedup, but the idea of an upper bound on performance improvement applies here as well.) A couple of years ago, when I used to spend more time around architecture people, a day wouldn’t go by when I didn’t hear them talk about making the common case fast and the uncommon case correct, while always guaranteeing forward progress.
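
For reference, Amdahl’s law quantifies this: if a fraction p of the total run time can be sped up by a factor s, the overall speedup is

  speedup = 1 / ((1 - p) + p / s)

so even as s grows without bound, the overall speedup is capped at 1 / (1 - p). Whatever you leave unoptimized puts a hard ceiling on the improvement, which is exactly why the usual advice is to spend your effort on the common case.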

My realization is that straightforward optimization for the common case is not sufficient. I’m not claiming that this realization is novel in any way, simply that it surprised me more than it should have.

Suppose you are writing a storage system. The common case (all hardware and software operate correctly) has been optimized and the whole storage system is performing great. Now, suppose that a hardware failure (or even a bug in other software!) occurs. Since this is a rare occurrence, you did not optimize for it. The system is still operating, but you want to take some corrective action. Sadly, the failure has pushed the system out of the common case, so you are left with a degraded system whose poor performance hinders the very corrective action you are trying to take. Ouch!

The answer is to optimize not just for the common case, but for some uncommon cases as well. Which uncommon cases? Well, the most common ones. :) The problem in the above scenario could (hopefully) have been avoided by optimizing not just for the common case, but also for the common failure! This is the weird bit… optimize for failures because you will see them.

In the case of a storage system, some failures to consider include:

  • one or more disks failing
  • random bit flips on one or more disks
  • one or more disks responding slowly
  • one or more disks temporarily disappearing and shortly after reappearing
  • low memory conditions

This list is far from exhaustive. You may even decide that some of these failures are outside the scope of your storage system’s reliability guarantees. But no matter what you decide, you need to keep in mind that your system will see failures and it must still behave well enough to not be a hindrance.
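
To make this a bit more concrete, here is a minimal sketch, in C, of what optimizing for one of these failures might look like. None of this is Moebius code, and every name and structure in it is made up for illustration: each block is mirrored and checksummed, and a corrupted copy (the random bit flip case above) is detected, bypassed, and repaired inline, so the degraded case costs one extra read and one write instead of an offline repair pass.

/*
 * A self-contained sketch of a checksum-and-mirror read path.  All names
 * and structures here are made up for illustration; real code would talk
 * to actual devices and use a real checksum (e.g., CRC32C).
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 64   /* tiny blocks keep the demo readable */
#define NCOPIES    2    /* each block is mirrored */
#define NBLOCKS    4

struct block {
    uint8_t  data[BLOCK_SIZE];
    uint32_t csum;      /* checksum of data[] */
};

/* Two in-memory "devices" standing in for real disks. */
static struct block devs[NCOPIES][NBLOCKS];

/* Toy checksum (FNV-1a). */
static uint32_t csum(const uint8_t *buf, size_t len)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 16777619u;
    }

    return h;
}

/* Write a block to every copy, along with its checksum. */
static void write_block(size_t blkno, const uint8_t *data)
{
    for (int copy = 0; copy < NCOPIES; copy++) {
        memcpy(devs[copy][blkno].data, data, BLOCK_SIZE);
        devs[copy][blkno].csum = csum(data, BLOCK_SIZE);
    }
}

/*
 * Read a block, tolerating a corrupted copy.  The failure path is handled
 * inline: fall back to the mirror and, if it is intact, rewrite the damaged
 * copy so the system heals as it runs.
 */
static bool read_block(size_t blkno, uint8_t *out)
{
    for (int copy = 0; copy < NCOPIES; copy++) {
        struct block *blk = &devs[copy][blkno];

        if (csum(blk->data, BLOCK_SIZE) == blk->csum) {
            memcpy(out, blk->data, BLOCK_SIZE);
            if (copy != 0)
                devs[0][blkno] = *blk;  /* repair copy 0 */
            return true;    /* common case: first good copy wins */
        }

        fprintf(stderr, "blk %zu copy %d: checksum mismatch\n", blkno, copy);
    }

    return false;   /* every copy is bad; surface the error */
}

int main(void)
{
    uint8_t orig[BLOCK_SIZE] = "optimize for the common failure";
    uint8_t buf[BLOCK_SIZE];

    write_block(0, orig);

    /* Simulate a random bit flip on the first copy. */
    devs[0][0].data[5] ^= 0x08;

    if (read_block(0, buf))
        printf("read ok despite the bit flip: %s\n", buf);

    return 0;
}

The point is not the toy checksum or the in-memory “devices”; it is that the recovery path sits right next to the common path and is cheap enough that hitting it does not turn a single flipped bit into an outage.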

None of what I have written here is groundbreaking. I just found it sufficiently different from what one normally hears that I thought I would write it up. Sorry, architecture friends: the uncommon case needs to be fast too. :)
