Josef “Jeff” Sipek

Simple File System

On three or four occasions over the past four years, I had a use for a simple file system spec: either to teach people about file systems, or to have a simple file system to implement in order to learn the idiosyncrasies of an operating system’s VFS layer. This is what I came up with back in 2011 when helping a friend learn about file systems.

The structure is really simple. All multi-byte integers are stored as big endian.

A disk is a linear sequence of blocks. Each block is 512 bytes long. You can read/write a block at a time. The first block on the disk is number 0, the second is 1, etc.

The following is the file system structure. First of all, the file system uses 1024-byte blocks, and therefore you need to issue two disk I/Os to process one file system block’s worth of data.
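
For illustration only, here is a minimal sketch of reading one file system block as two 512-byte disk reads; disk_read_block() is a hypothetical driver-level helper (not part of this spec) that reads a single disk block into a buffer.

#include <stdint.h>

#define DISK_BLOCK_SIZE 512
#define FS_BLOCK_SIZE   1024

/* hypothetical: read one 512-byte disk block into buf, returns 0 on success */
extern int disk_read_block(uint32_t disk_blkno, void *buf);

/* read fs block 'fs_blkno' into a 1024-byte buffer */
static int fs_read_block(uint32_t fs_blkno, void *buf)
{
        /* each 1024-byte fs block is made up of 2 consecutive disk blocks */
        uint32_t disk_blkno = fs_blkno * (FS_BLOCK_SIZE / DISK_BLOCK_SIZE);
        int ret;

        ret = disk_read_block(disk_blkno, buf);
        if (ret)
                return ret;

        return disk_read_block(disk_blkno + 1,
                               (uint8_t *)buf + DISK_BLOCK_SIZE);
}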

The first fs block (disk blocks 0 & 1) is reserved; you should not change it in any way.

The second fs block (disk blocks 2 & 3) contains the superblock:

struct superblock {
        uint32_t magic;        /* == 0x42420374 */
        uint32_t root_inode;   /* the disk block containing the root inode */
        uint32_t nblocks;      /* number of blocks on the disk */
        uint32_t _pad[253];    /* unused (should be '\0' filled) */
};
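
As an illustration, a sketch of reading and sanity-checking the superblock might look like the following; fs_read_block() is the hypothetical helper sketched above, ntohl() converts the on-disk big endian integers to host byte order, and SFS_MAGIC/SFS_SB_BLOCK are made-up names, not part of the spec.

#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>  /* ntohl(): big endian -> host byte order */

#define SFS_MAGIC    0x42420374
#define SFS_SB_BLOCK 1          /* the superblock is the second fs block */

static int fs_read_super(struct superblock *sb)
{
        int ret;

        ret = fs_read_block(SFS_SB_BLOCK, sb);
        if (ret)
                return ret;

        /* all multi-byte integers on disk are big endian */
        sb->magic      = ntohl(sb->magic);
        sb->root_inode = ntohl(sb->root_inode);
        sb->nblocks    = ntohl(sb->nblocks);

        if (sb->magic != SFS_MAGIC) {
                fprintf(stderr, "bad superblock magic: 0x%08x\n", sb->magic);
                return -1;
        }

        return 0;
}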

Starting at the third fs block is the block allocation map. The most significant bit of the first byte of this map represents fs block 0, the next bit represents fs block 1, and so on.
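
A sketch of testing and setting bits in the allocation map, assuming the map has been read into memory as a byte array (block 0 is the most significant bit of byte 0, block 7 the least significant bit of byte 0, block 8 the most significant bit of byte 1, and so on):

#include <stdint.h>
#include <stdbool.h>

/* is fs block 'blkno' marked as allocated? */
static bool bmap_test(const uint8_t *bmap, uint32_t blkno)
{
        return (bmap[blkno / 8] >> (7 - (blkno % 8))) & 1;
}

/* mark fs block 'blkno' as allocated */
static void bmap_set(uint8_t *bmap, uint32_t blkno)
{
        bmap[blkno / 8] |= 1 << (7 - (blkno % 8));
}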

Each file is represented by an inode. The inode contains a number of data block pointers. The first pointer (blocks[0]) contains the first 1024 bytes of the file, blocks[1] the second, etc. The timestamps are in microseconds since 00:00:00 Jan 1, 1900 UTC.

struct inode {
        uint32_t size;         /* file length in bytes */
        uint32_t _pad0;        /* unused (should be 0) */
        uint64_t ctime;        /* creation time stamp */
        uint64_t mtime;        /* last modification time stamp */
        uint16_t nblocks;      /* number of data blocks in this file */
        uint16_t _pad1;        /* unused (should be 0) */
        uint32_t _pad2;        /* unused (should be 0) */
        uint32_t blocks[248];  /* file block ptrs */
};
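
Purely as a sketch, reading a whole file into memory given its inode could look like this; it assumes the inode’s fields were already converted from big endian to host byte order, and it reuses the hypothetical fs_read_block() and FS_BLOCK_SIZE from the earlier sketches.

#include <stdint.h>
#include <string.h>

/* copy a file's contents into 'buf', which must hold at least ino->size bytes */
static int fs_read_file(const struct inode *ino, void *buf)
{
        uint8_t blk[FS_BLOCK_SIZE];
        uint32_t copied = 0;
        uint16_t i;

        for (i = 0; i < ino->nblocks && copied < ino->size; i++) {
                uint32_t len = ino->size - copied;
                int ret;

                if (len > FS_BLOCK_SIZE)
                        len = FS_BLOCK_SIZE;

                ret = fs_read_block(ino->blocks[i], blk);
                if (ret)
                        return ret;

                memcpy((uint8_t *)buf + copied, blk, len);
                copied += len;
        }

        return 0;
}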

The root directory is represented by an inode; the data pointed to by this inode’s blocks[] have a special format: they should be treated as arrays of directory entries. The filename is space padded to the full width of the fname field (so, “foo.txt” would be stored as “foo.txt” followed by 21 spaces).

struct direntry {
        char fname[28];        /* the filename */
        uint32_t inode;        /* the inode block ptr */
};
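
For example, here is a sketch of looking up a name in one directory data block, taking the space padding into account; as before, FS_BLOCK_SIZE, the assumption that the block has already been read into memory, and the function itself are mine, not part of the spec.

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>  /* ntohl() */

#define SFS_FNAME_LEN 28

/*
 * Search one directory block for 'name'.  Returns the inode block
 * pointer (in host byte order), or 0 if the name is not present.
 */
static uint32_t dir_lookup(const void *dirblk, const char *name)
{
        const struct direntry *de = dirblk;
        size_t nentries = FS_BLOCK_SIZE / sizeof(struct direntry);  /* 32 */
        char padded[SFS_FNAME_LEN];
        size_t namelen = strlen(name);
        size_t i;

        if (namelen > SFS_FNAME_LEN)
                return 0;

        /* build the space padded on-disk form of the name */
        memset(padded, ' ', sizeof(padded));
        memcpy(padded, name, namelen);

        for (i = 0; i < nentries; i++)
                if (!memcmp(de[i].fname, padded, SFS_FNAME_LEN))
                        return ntohl(de[i].inode);

        return 0;
}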

So, graphically, the file system looks something like:

sb -> rootinode
        |-> direntries
        |     |-> <"foo.txt", inodeptr>
        |     |                 \-> inode
        |     |                       |-> data
        |     |                       |-> data
        |     |                       \-> data
        |     |-> <"bar.txt", inodeptr>
        |     |                 \-> inode
        |     |                       |-> data
        |     |                       |-> data
        |     |                       \-> data
        |     |             .
        |     |             .
        |     |             .
        |     \-> <"xyz.txt", inodeptr>
        |                       \-> inode
        |                             |-> data
        |                             |-> data
        |                             \-> data
        |                   .
        |                   .
        |                   .
        \-> direntries
              |-> <"aaa.txt", inodeptr>
              |                 \-> inode
              |                       |-> data
              |                       |-> data
              |                       \-> data
              |-> <"bbb.txt", inodeptr>
              |                 \-> inode
              |                       |-> data
              |                       |-> data
              |                       \-> data
              |             .
              |             .
              |             .
              \-> <"ccc.txt", inodeptr>
                                \-> inode
                                      |-> data
                                      |-> data
                                      \-> data

Operating Systems: Three Easy Pieces

I just found out that Remzi and Andrea decided to write a textbook about operating systems. This is exciting for several reasons. Here are the top two.

First and foremost, the book is free. That’s right, a free textbook, while every other computer science textbook easily costs around $100. Why? I’ll let Remzi make the case. Long story short, publishing a textbook isn’t about making money. It is about sharing ideas. You can download it from the textbook’s website.

Second, the book is by Remzi and Andrea. This pair of professors from the University of Wisconsin is responsible for a ton of amazing storage related research. If you don’t believe me, check out their publication track record.

I suppose I should mention that I have read only a small part of the book so far, but I did push it onto the top of my to-read stack and I’m slowly making my way through it. I’ll let you all know how it goes.

Optimizing for Failure

For the past two years, I’ve been working at Barracuda Networks on a key-value storage system called Moebius. As with any other software project, the development was focused more on stability and basic functionality at first. Lately, however, we have managed to get some spare cycles to consider tackling some of the big features we’ve been wishing for, as well as revisiting some of the initial decisions. This includes error handling, specifically how, and at what scale, hardware failures should be handled. During this brainstorming, I made an interesting (in my opinion) observation about optimizing systems.

If you take any computer architecture or organization course, you will hear about Wikipedia article: Amdahl’s law. Even if you never took an architecture course or never heard of Amdahl, you probably came to the realization eventually that one should optimize for the common case. (Technically, Amdahl’s law is about parallel speedup, but the idea of an upper bound on performance improvement applies here as well.) A couple of years ago, when I used to spend more time around architecture people, a day wouldn’t go by when I didn’t hear them talk about making the common case fast and the uncommon case correct, as well as always guaranteeing forward progress.

My realization is that straightforward optimization for the common case is not sufficient. I’m not claiming that my realization is novel in any way. Simply that it surprised me more than it should.

Suppose you are writing a storage system. The common case (all hardware and software operating correctly) has been optimized, and the whole storage system is performing great. Now, suppose that a hardware failure (or even a bug in other software!) occurs. Since this is a rare occurrence, you did not optimize for it. The system is still operating, but you want to take some corrective action. Sadly, the failure has caused the system to no longer operate under the common case. So, you have a degraded system whose performance is hindering your corrective action! Ouch!

The answer is to optimize not just for the common case, but for some uncommon cases. Which uncommon cases? Well, the most common ones. :) The problem in the above scenario could have been (hopefully) avoided by not just optimizing for the common case, but also optimizing for the common failure! This is the weird bit… optimize for failures because you will see them.

In the case of a storage system, some failures to consider include:

  • one or more disks failing
  • random bit flips on one or more disks
  • one or more disks responding slowly
  • one or more disks temporarily disappearing and shortly after reappearing
  • low memory conditions

This list is far from exhaustive. You may even decide that some of these failures are outside the scope of your storage system’s reliability guarantees. But no matter what you decide, you need to keep in mind that your system will see failures and it must still behave well enough to not be a hindrance.
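
Purely as an illustration of what “optimizing for the common failure” can look like in code (all of the names below are hypothetical, not Moebius internals): a read path that treats checksum mismatches and flaky disks as expected events, falls back to another copy, and defers the repair instead of stalling the request.

#include <stdint.h>
#include <stdbool.h>

/* hypothetical helpers, for illustration only */
extern int  replica_read(int replica, uint64_t blkno, void *buf);
extern bool checksum_ok(const void *buf);
extern void repair_later(int replica, uint64_t blkno);  /* queue a background fix */

#define NREPLICAS 2

/*
 * Read one block, expecting that bit flips and flaky disks are normal:
 * try each replica in turn and queue a background repair for any copy
 * that fails, rather than fixing it inline on the read path.
 */
int read_block(uint64_t blkno, void *buf)
{
        int r;

        for (r = 0; r < NREPLICAS; r++) {
                if (replica_read(r, blkno, buf) == 0 && checksum_ok(buf))
                        return 0;  /* fast for the common case and the common failure */

                repair_later(r, blkno);
        }

        return -1;  /* every copy failed: the truly uncommon case */
}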

None of what I have written here is ground breaking. I just found it sufficiently different from what one normally hears that I thought I would write it up. Sorry architecture friends, the uncommon case needs to be fast too :)

FAST 2012

Last week I went to FAST ’12. As always, it was fun to attend. Overall, the talks were good and gave me a couple of ideas. There was definitely a lot of flash and SSD related work, neither of which I find all that exciting. There was, however, a bunch of work related to dedup and backup.

Anyway, here’s my reading list. I’ve skimmed most of the papers, but I want to take a closer look at these.

  • Characteristics of Backup Workloads in Production Systems
  • WAN Optimized Replication of Backup Datasets Using Stream-Informed Delta Compression
  • Recon: Verifying File System Consistency at Runtime
  • Consistency Without Ordering
  • ZZFS: A Hybrid Device and Cloud File System for Spontaneous Users
  • Serving Large-scale Batch Computed Data with Project Voldemort
  • BlueSky: A Cloud-Backed File System for the Enterprise
  • Extracting Flexible, Replayable Models from Large Block Traces
  • iDedup: Latency-aware, Inline Data Deduplication for Primary Storage

Ancient Storage Device

I decided to clean out my room aaaand I found this:

ancient storage device

I vaguely remember hearing that it was an ancient storage device of some sort… :)

Busy

So, yeah. I finally forced myself to write one of these blog entries. Woohoo! I’ve been quite busy for the past two weeks (doing some cluster building, and then slideshow making; I leave for OLS in only 4 days!). I must say it should be interesting: the presentation, the impromptu stackable filesystems BOF.

I just found this very bizarre image (mirror) (too wide to include here) which rather humorously explains Wikipedia article: RAID.
