Josef “Jeff” Sipek

Meili upgrades

A couple of months ago, I decided to update my almost two-and-a-half-year-old laptop. Twice.

First, I got more RAM. This upped it to 12 GB. While still on the low side for a box which actually gets to see some heavy usage (compiling illumos takes a couple of hours and generates a couple of GB of binaries), it was better than the 4 GB I used for way too long.

Second, I decided to bite the bullet and replaced the 320 GB disk with a 256 GB SSD (Samsung 840 Pro). Sadly, in the process I had the pleasure of reinstalling the system — both Windows 7 and OpenIndiana. Overall, the installation was uneventful as my Windows partition has no user data and my OI storage is split into two pools (one for system and one for my data).

The nice thing about reinstalling OI was getting back to a stock OI setup. A while ago, I managed to play with software packaging a bit too much, and before I knew it I was using a customized fork of OI that I had no intention of maintaining. Of course, I didn’t realize this until it was too late to roll back. Oops. (Specifically, I had a custom pkg build which was incompatible with all versions OI ever released.)

One of the painful things about my messed-up OI install was that I was running a debug build of illumos, which made some things pretty slow. One such thing was boot: the ZFS-related pieces alone took about a minute to complete, and the whole boot procedure took about 2.5 minutes. Currently, with a non-debug build and an SSD, my laptop goes from the Grub prompt to the gdm login in about 40 seconds. I realize that this is an apples-to-oranges comparison.

I knew SSDs were supposed to be blazing fast, but I resisted getting one for the longest time, mostly due to reliability concerns. What changed my mind? I got to use a couple of SSDs in my workstation at work. I saw the performance, and I figured that ZFS would take care of alerting me to any corruption. Since most of my work is version controlled, chances are that I wouldn’t lose anything. Lastly, SSDs have seen a fair number of improvements over the past few years.

Optimizing for Failure

For the past two years, I’ve been working at Barracuda Networks on a key-value storage system called Moebius. As with any other software project, development focused on stability and basic functionality at first. Lately, however, we’ve managed to find some spare cycles to consider tackling some of the big features we’ve been wishing for, as well as to revisit some of the initial decisions. This includes error handling — specifically, how and what scale of hardware failures should be handled. During this brainstorming, I made an interesting (in my opinion) observation about optimizing systems.

If you take any computer architecture or organization course, you will hear about Amdahl’s law. Even if you never took an architecture course or never heard of Amdahl, you have probably come to the realization that one should optimize for the common case. (Technically, Amdahl’s law is about parallel speedup, but the idea of an upper bound on performance improvement applies here as well.) A couple of years ago, when I used to spend more time around architecture people, a day wouldn’t go by when I didn’t hear them talk about making the common case fast and the uncommon case correct — as well as always guaranteeing forward progress.
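For reference, the law itself: if a fraction f of the run time benefits from a speedup of S, the overall speedup is

\[
S_{\text{overall}} = \frac{1}{(1 - f) + \frac{f}{S}},
\qquad
\lim_{S \to \infty} S_{\text{overall}} = \frac{1}{1 - f}
\]

In other words, no matter how much you speed up the part you care about, the time spent in the remaining fraction (1 - f) puts a hard ceiling on the overall improvement.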

My realization is that straightforward optimization for the common case is not sufficient. I’m not claiming that my realization is novel in any way. Simply that it surprised me more than it should have.

Suppose you are writing a storage system. The common case (all hardware and software operate correctly) has been optimized and the whole storage system is performing great. Now, suppose that a hardware failure (or even a bug in other software!) occurs. Since this is a rare occurrence, you did not optimize for it. The system is still operating, but you want to take some corrective action. Sadly, the failure has caused the system to no longer operate under the common case. So, you have a degraded system whose performance is hindering your corrective action! Ouch!

The answer is to optimize not just for the common case, but for some uncommon cases. Which uncommon cases? Well, the most common ones. :) The problem in the above scenario could have been (hopefully) avoided by not just optimizing for the common case, but also optimizing for the common failure! This is the weird bit… optimize for failures because you will see them.

In the case of a storage system, some failures to consider include:

  • one or more disks failing
  • random bit flips on one or more disks
  • one or more disks responding slowly
  • one or more disks temporarily disappearing and shortly after reappearing
  • low memory conditions

This list is far from exhaustive. You may even decide that some of these failures are outside the scope of your storage system’s reliability guarantees. But no matter what you decide, you need to keep in mind that your system will see failures and it must still behave well enough to not be a hindrance.
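To make this concrete, here is a minimal sketch in C of a mirrored read path that treats the slow-disk case as a first-class citizen. The disk_t type and the disk_read_deadline()/disk_mark_suspect() helpers are made up purely for illustration; this is not Moebius code, just the shape of the idea.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define READ_DEADLINE_MS 100    /* latency budget for the "happy" disk */

typedef struct disk disk_t;     /* hypothetical disk handle */

/* hypothetical helper: returns false on error or missed deadline */
extern bool disk_read_deadline(disk_t *d, uint64_t blkno, void *buf,
    size_t len, unsigned deadline_ms);

/* hypothetical helper: hands the disk off to background health checks */
extern void disk_mark_suspect(disk_t *d);

bool
mirror_read(disk_t *primary, disk_t *secondary, uint64_t blkno,
    void *buf, size_t len)
{
    /* common case: the primary answers within the deadline */
    if (disk_read_deadline(primary, blkno, buf, len, READ_DEADLINE_MS))
        return (true);

    /*
     * common *failure* case: the primary is slow or briefly gone;
     * instead of stalling the caller, read the redundant copy and
     * let a background task deal with the suspect disk.
     */
    disk_mark_suspect(primary);
    return (disk_read_deadline(secondary, blkno, buf, len,
        READ_DEADLINE_MS));
}

The point is that the degraded path is short and bounded: a misbehaving disk costs the caller at most one missed deadline rather than an arbitrarily long stall.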

None of what I have written here is groundbreaking. I just found it sufficiently different from what one normally hears that I thought I would write it up. Sorry, architecture friends, the uncommon case needs to be fast too :)

Benchmark Assumptions

Today I came across a blog post about Running PostgreSQL on Compression-enabled ZFS. I found the article because (1) I am a fan of ZFS, and (2) transparent storage compression interests me. (Maybe I’ll talk about the latter in the future.)

Whoever ran the benchmark decided to compare ZFS with lzjb and ZFS with gzip against ext3. Their analysis states that ZFS-gzip is faster than ZFS-lzjb, which is faster than ext3. They admit that the benchmark is I/O bound. Then they state that compression effectively speeds up disk I/O by making every byte transferred contain more information. The analysis goes down the drain right after that.

“While doing background research for this blog post we also got a chance to investigate some of the other features besides compression that differentiate ZFS from older file system architectures like ext3. One of the biggest differences is ZFS’s approach to scheduling disk IOs which employs explicit IO priorities, IOP reordering, and deadline scheduling in order to avoid flooding the request queues of disk controllers with pending requests.”

Anyone who’s benchmarked a system should have a red flag going off after reading those sentences. My reaction was something along the lines of: “What?! You know that there are at least three major differences between ZFS and ext3 in addition to compression and you still try to draw conclusions about compression effectiveness by comparing ZFS with compression against ext3?!”

All they had to do to make their analysis so much more interesting and keep me quiet was to include another set of numbers — ZFS without compression. That way, one could compare ext3 with uncompressed ZFS to see how much difference the radically different filesystem design makes, and then compare uncompressed ZFS with the lzjb and gzip numbers to see whether compression helps. Based on the data presented, we have no idea if compression helps — we just know that ZFS with compression outperforms ext3. What if ZFS without compression were 5x faster than ext3? Then gzip (~4x faster than ext3) would actually not be the fastest configuration.
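To spell out that hypothetical arithmetic (remember, the 5x figure is made up purely for the sake of argument):

\[
\frac{\text{ZFS+gzip vs. ext3}}{\text{ZFS uncompressed vs. ext3}} = \frac{4}{5} = 0.8
\]

That is, gzip would be giving up 20% of the throughput relative to plain ZFS, even though both comfortably beat ext3.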

To be fair, knowing how modern disk drives behave, chances are that compressed ZFS is faster than uncompressed ZFS. Since CPU cycles are so plentiful these days, all my systems have lzjb compression enabled everywhere. I do this mostly to conserve space, but also in hopes of transferring less data to disk. Yes, this is exactly what their benchmark attempts to show. (I haven’t had a chance to experiment with the new-ish lz4 compression algorithm in ZFS.) My point here is solely about benchmark analysis and unfounded (or at least unstated) assumptions found in just about every benchmark out there.

PAPI - Getting at Hardware Performance Counters

Recently, I wanted to figure out whether an application I was analyzing was memory bound. While on this quest, I was introduced to the Performance Application Programming Interface (PAPI).

There is a rather good HOWTO with step-by-step instructions for getting it all running on Debian. The text below is more or less a short version of that HOWTO, with my thoughts interspersed.

PAPI is a library that hooks into the hardware performance counters, and presents them in a uniform way. Installation is rather simple if you pay attention to the installation instructions.

  1. Get the kernel source
  2. Get the perfctr tarball
  3. Extract the sources, and run the update-kernel script. I really mean this: if you try to be clever and apply the patch by hand, you’ll end up with a broken source tree. (The script runs patch to fix up some existing kernel files, and then it copies a whole bunch of other files into the kernel tree.)
  4. Configure, build, install, and reboot into the new kernel
  5. You can modprobe perfctr and see spew in dmesg

That’s it for perfctr. Now PAPI itself…

  1. Get & extract the source
  2. ./configure, make, make fulltest, make install-all

That’s it for PAPI. The make fulltest step runs the tests. Chances are that they will either all pass or all fail. If they fail, then something is wrong (probably with perfctr). If they pass, then you are all set.

There are some examples in the src/examples directory. Those should get you started with using PAPI. It takes about 100 lines of C to get an arbitrary counter going.
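For a taste of what that looks like, below is a rough sketch that uses PAPI’s low-level API to count total cycles and instructions around a bit of busy work. The work() function is just a stand-in for whatever you actually want to measure, and error handling is kept to a minimum.

#include <stdio.h>
#include <papi.h>

/* stand-in for the code you actually want to measure */
static void
work(void)
{
    volatile long x = 0;
    long i;

    for (i = 0; i < 10 * 1000 * 1000; i++)
        x += i;
}

int
main(void)
{
    int eventset = PAPI_NULL;
    long long values[2];

    /* initialize the library */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI_library_init failed\n");
        return (1);
    }

    /* create an event set with two preset events */
    if (PAPI_create_eventset(&eventset) != PAPI_OK ||
        PAPI_add_event(eventset, PAPI_TOT_CYC) != PAPI_OK ||
        PAPI_add_event(eventset, PAPI_TOT_INS) != PAPI_OK) {
        fprintf(stderr, "could not set up the event set\n");
        return (1);
    }

    /* count around the region of interest */
    if (PAPI_start(eventset) != PAPI_OK) {
        fprintf(stderr, "PAPI_start failed\n");
        return (1);
    }

    work();

    if (PAPI_stop(eventset, values) != PAPI_OK) {
        fprintf(stderr, "PAPI_stop failed\n");
        return (1);
    }

    printf("cycles:       %lld\n", values[0]);
    printf("instructions: %lld\n", values[1]);
    printf("IPC:          %.2f\n", (double)values[1] / values[0]);

    return (0);
}

Build it with something like cc -o papi-demo papi-demo.c -lpapi (assuming the headers and library are installed where the compiler can find them).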

Some other time, I’ll talk more about PAPI, and how I used it in my experiments.
