Josef “Jeff” Sipek

April 18, 2009

O_PONIES & Other Assorted Wishes

Filed under: rants programming programming/kernel filesystems — JeffPC @ 03:26

You might have already heard about ext4 "eating" people's data. That's simply not true.

While I am far from being a fan of ext4, I feel an obligation to set the record straight. But first, let me give you some references with an approximate timeline. I'm sure I managed to leave out a ton of details.

In mid-January, a bug titled Ext4 data loss showed up in the Ubuntu bug tracker. The complaining users apparently were losing data on system crashes when using ext4. (The fact that Ubuntu likes to include every unstable & crappy driver into their kernels doesn't help at all.) As part of the discussion, Ted Ts'o explained that the problem wasn't with ext4 but with applications that did not ensure that the data they wrote was actually safe. The people did not like hearing that.

Things went pretty quiet until mid-March. That's when a slashdot article made it painfully obvious that many of today's apps are buggy. Some applications (KDE being a whole suite of applications) have gotten used to the fact that ext3 was a very common filesystem used by Linux installations. More specifically, they got used to the behavior that ext3's default mount option (data=ordered) provided. This is really the issue. The application developers assumed that the POSIX interface gave them more guarantees than it did! To make matters worse, the one way to ensure that the contents of a file get to the disk (the fsync system call) is very expensive on ext3. So over the past (almost) decade that ext3 has been around, application developers have been "trained" (think Pavlov reflexes) to not use fsync --- on ext3, it's expensive and the likelihood of you losing data is much lower due to the default mount options. ext4's fsync implementation, much like other filesystems' implementations (e.g., XFS), does not suffer from this. (You may have heard about fsync on ext3 being expensive almost a year ago when Firefox was hit by this: Fsyncers and curveballs (the Firefox 3 fsync() problem). Note that in this case, as Ted Ts'o points out, the problem is that Firefox uses the same thread to draw the UI and do IO. That's plain stupid.)

Over the next few days, Ted Ts'o posted two blog entries about delayed allocation (people seem to like to blame it for dataloss): Delayed allocation and the zero-length file problem, Don't fear the fsync!.

About the same time, Eric Sandeen wrote a blurb about the state of affairs: fsync, sigh. He points out that XFS faced the same issue years ago. When the application developers were confronted about their application being broken, they just put fingers in their ears, hummed loudly, and yelled "I can't hear you!" There is a word for that, and here's the OED definition for it:

denial,

The asserting (of anything) to be untrue or untenable; contradiction of a statement or allegation as untrue or invalid; also, the denying of the existence or reality of a thing.

The problem is application developers not wanting to believe that it's an application problem. Well, it really is! Not only are those apps broken, but they are not portable. AIX, IRIX, or Solaris will not give you the same guarantees as ext3!

(Eric is also trying to fight the common misconception that XFS nulls files, which I assure you is not the case: XFS does not null files, and requires no flux.)

About a week later, on an episode of Free Software Round Table, the problem was discussed a bit. They got most of it right :) (Here's a 55MB mp3 of the show: 2009-03-21.)

When April 1st came about, the linux-fsdevel mailing list got a patch from yours truly: [PATCH] fs: point out any processes using O_PONIES. (The pony thing...it's a bit of an inside joke among the Linux filesystem developers.) The idea of having O_PONIES first came up in #linuxfs on OFTC. While I don't remember who first thought of it (my guess would be Eric), I know for sure that it wasn't me. At the same time, I couldn't help it, and considering that the patch took only a minute to make (and compile test), it was well worth it.

A few days later, during the Linux Storage and Filesystem workshop, the whole fsync issue got some discussion time. (See "Rename, fsync, and ponies" at Linux Storage and Filesystem workshop, day 1.) The part that really amused me:

Prior to Ted Ts'o's session on fsync() and rename(), some joker filled the room with coloring-book pages depicting ponies. These pages reflected the sentiment that Ted has often expressed: application developers are asking too much of the filesystem, so they might as well request a pony while they're at it.

In the comments for that article you can find Ted Ts'o saying:

Actually, it was Josef 'Jeff' Sipek who deserves the first mention of application programmers asking for pones, when he posted an April Fools patch submission for the new open flag, O_PONIES --- unreasonable file system assumptions desired.

Another file system developer who had worked on two major filesystems (ext4 and XFS) had a t-shirt on that had O_PONIES written on the front. And the joker who distributed the colouring book pages with pictures of ponies was another file system developer working yet another next generation file system.

Application programmers, while they were questioning my competence, judgement, and even my paternity, didn't quite believe me when I told them that I was the moderate on these issues, but it's safe to say that most of the file system developers in the room were utterly unsympathetic to the idea that it was a good idea to encourage application programmers to avoid the use of fsync(). About the only one who was also a moderate in the room was Val Aurora (formerly Henson). Both of us recognize that ext3's data=ordered mode was responsible for people deciding that fsync() was harmful, and I've said already that if we had known how badly it would encourage application writers to Do The Wrong Thing, I would have pushed hard not to make data=ordered the default. Unfortunately, memory wasn't as plentiful in those days, and so the associated page writeback latencies wasn't nearly as bad ten years ago.

Hrm, I'm not sure how to take it...he makes it sound like I'm an extremist. Jeff --- a freedom fighter for sanity of filesystem interfaces! :) As I said, I can't take credit for the idea of O_PONIES. As I was writing this entry, I mentioned it to Eric and he promptly wrote an entry of his own: Coming clean on O_PONIES. It looks like he isn't sure that he was the one to invent it! I'll give him credit for it anyway.

The next day, a group photo of the attendees was taken... You can clearly see Val Aurora wearing an O_PONIES shirt. The idea was Eric's, and as far as I know, he had his shirt the first day.

Fedora 11 is supposedly going to use ext4 as the default filesystem. When Ars Technica published an article about it (First look: Fedora 11 beta shows promise), some misguided people thinking that ext4 eats your data left a bunch of comments....*sigh*

Well, there you have it. That's the summary of events with some of my thoughts interleaved. If you are writing a userspace application that does file IO, do the right thing: fsync the data you care about (or at least fdatasync).
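To make "fsync the data you care about" a bit more concrete, here is a minimal sketch of the usual write-to-a-temporary-file, fsync, then rename pattern. It is only an illustration under my own assumptions: the function name write_file_durably and the ".tmp" suffix are made up for this example, and a real application would also want error handling that cleans up the temporary file.

```python
import os

def write_file_durably(path, data):
    """Replace `path` with `data` (bytes) so that a crash leaves either
    the complete old contents or the complete new contents."""
    tmp_path = path + ".tmp"  # hypothetical temporary name, same directory
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # os.write() may write fewer bytes than requested, so loop.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the file's data to stable storage
        # before the new name becomes visible.
        os.fsync(fd)
    finally:
        os.close(fd)

    # rename() atomically swaps the new file in over the old one.
    os.rename(tmp_path, path)

    # fsync the containing directory so the rename itself is on disk.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The important part is the ordering: the data is fsync'd before the rename, so the new name never points at a file whose contents have not reached the disk. Skipping that fsync and relying on the filesystem to order things for you is exactly the assumption that data=ordered trained people into.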

February 29, 2008

Dumping & restoring XFS volumes

Over the past few years, I've been using XFS wherever I could. I never really tried to tweak the mkfs options, and therefore most of my filesystems were quite sub-optimal. I managed to get my hands on an external 500GB disk that I decided to use for all this data shuffling...

320GB external firewire disk


This was probably the most offensively made fs. Here's the old info:


meta-data=/dev/sdb1              isize=512    agcount=17, agsize=4724999 blks
         =                       sectsz=512   attr=1
data     =                       bsize=4096   blocks=78142042, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=65536  blocks=0, rtextents=0

It had 512 byte inodes (instead of the more sane, and default, 256 byte inodes) because I was playing around with SELinux when I made this filesystem, and the bigger inodes allow more extended attributes to be stored there - improving performance a whole lot. When I first made the fs, it had 16 allocation groups, but I grew the filesystem by about 10GB, which had been used by a FAT32 partition that I used for Windows<->Linux data shuffling. On a simple disk (e.g., not a RAID 5), 4 allocation groups is far more logical than the 17 I had before. Another thing I wanted to use is the lazy-count. That got introduced in 2.6.23, and improved performance when multiple processes were modifying filesystem metadata (create/unlink/mkdir/rmdir). And last, but not least, I wanted to use version 2 inodes.

The simplest way to change the filesystem to use all these features is to backup, mkfs, and restore...and that's what I did.

This is what the fs is like after the whole process (I bolded all the changes):


meta-data=/dev/sdb1              isize=256    agcount=4, agsize=19535511 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=78142042, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

dumping...

I mkfs.xfs'd the 500GB disk, and mounted it on /mnt/dump. Since I like tinkering with storage, I couldn't help but start blktrace for both of the disks (the one being dumped, and the one storing the dump).

Instead of using rsync, tar, or dd, I went with the xfsdump/xfsrestore combo. xfsdump is a lot like tar - it creates a single file with all the data, but unlike tar, it also saves extended attributes, and preserves the hole information for sparse files. So, with blktrace running, it was time to start the dump:


# xfsdump -f /mnt/dump/acomdata_xfs.dump -p 60 -J /mnt/acomdata

The dump took about 9300 seconds (2 hours, 35 mins). Here are the graphs created by seekwatcher (which uses the blktrace traces)...The source disk is the firewire disk being dumped, and the target disk is the one being dumped to.

source disk

The IO here makes sense: xfsdump scans the entire filesystem and backs up every inode sorted by the inode number (which is a function of the block number). The scattered accesses are because of fragmented files having data all over the place.

target disk

I'm not quite sure why XFS decided to break the dump file into 8 extents. These extents show up nicely as the 8 ascending lines. The horizontal line ~250GB is the journal being written to. (The seeks/second graph's y-axis shows that seekwatcher has a bug when there's very little seeking :) )

...and restoring

After the dump finished, I unmounted the 320GB fs, and ran mkfs on it (lazy-count=1, agcount=4, etc.). Then it was time to mount, start a new blktrace run on the 2 disks, and run xfsrestore - to extract all the files from the dump.


# xfsrestore -f /mnt/dump/acomdata_xfs.dump -p 60 -A -B -J /mnt/acomdata

I used the -A option to NOT restore xattrs as the only xattrs that were on the filesystem were some stray SELinux labels that managed to survive.

The restore took a bit longer...12000 seconds (3 hours, 20 minutes). And here are the traces for the restore:

source disk

Reading the 240GB file that was in 8 extents created an IO trace that's pretty self-explanatory. The constant writing to the journal was probably because of the inode access time updates. (And again, seekwatcher managed to round the seeks/second y-axis labels.)

target disk

This looks messy, but it actually isn't bad at all. The 4 horizontal lines that look a lot like journal writes are probably the superblocks being updated to reflect the inode counts (4 allocation groups == 4 sets of superblock + AG structures).

some analysis...


After the restore, I ran some debug tools to see how clean the filesystem ended up being...

...fragmentation


37945 extents used, ideal 37298 == not bad at all

...free space fragmentation



   from      to extents  blocks    pct
      1       1      19      19   0.00
      2       3       1       3   0.00
     64     127       2     150   0.00
    128     255       1     134   0.00
    512    1023       1     584   0.00
   4096    8191       1    4682   0.02
  32768   65535       1   36662   0.19
 131072  262143       1  224301   1.16
 262144  524287       3 1315076   6.79
 524288 1048575       2 1469184   7.59
1048576 2097151       4 6524753  33.71
2097152 4194303       4 9780810  50.53

== pretty much squeaky clean

...per allocation group block usage



/dev/sdb1:
 AG   1K-blocks        Used   Available  Use%
  0    78142044    40118136    38023908   51%
  1    78142044    78142040           4   99%
  2    78142044    42565780    35576264   54%
  3    78142036    74316844     3825192   95%
ALL   312568168   235142800    77425368   75%

I'm somewhat surprised that the 2nd and 4th are near full (well, 2nd ag has only 4kB free!), while the 1st and 3rd are only half full. As you can see, the 320GB disk is 75% used.
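If you want to double-check the table, the totals and percentages can be recomputed from the per-AG numbers. This is just a quick sanity-check sketch: the numbers are copied from the table, and the assumption that the Use% column is truncated rather than rounded is mine (it is what makes the 2nd AG show 99% instead of 100%).

```python
# (size, used) in 1K-blocks for each allocation group, copied from the table
ags = [
    (78142044, 40118136),
    (78142044, 78142040),
    (78142044, 42565780),
    (78142036, 74316844),
]

# Integer division truncates, which is assumed to match the Use% column.
percent_used = [100 * used // size for size, used in ags]
free_kb = [size - used for size, used in ags]

total_size = sum(size for size, _ in ags)
total_used = sum(used for _, used in ags)
total_percent = 100 * total_used // total_size

for index, (pct, free) in enumerate(zip(percent_used, free_kb)):
    print(f"AG {index}: {free} KB free, {pct}% used")
print(f"ALL: {total_size - total_used} KB free, {total_percent}% used")
```

The second AG really does come out to 4 KB free, and the whole disk to 75% used.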

Bonus features


I decided to render mpeg versions of the IO traces...

source disk (dump) (4MB)

target disk (dump) (2MB)

source disk (restore) (2MB)

target disk (restore) (4.1MB) <- this is the best one of the bunch

August 23, 2007

Smile!

Filed under: programming programming/kernel random humor filesystems — JeffPC @ 16:29

Last night, someone (I forget who) challenged me to make a smilie with disk io and seekwatcher. Well, I couldn't let such a challenge just pass me by (click to enlarge):

Smile!

All that I used was: blktrace, seekwatcher, python (to do the math - sin, cos, etc.), and dd (to do the disk io). I am already planning bigger and better things :)
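For the curious, here is a rough sketch of the kind of math involved. It is not the script I used; it is a minimal reconstruction that assumes seekwatcher plots time on the x-axis and disk offset on the y-axis, so a circle is just a set of (time, offset) pairs. The names circle_points and dd_command, the device path, and all the numbers are hypothetical.

```python
import math

def circle_points(center_time, center_offset_mb, radius_time, radius_offset_mb, steps=64):
    """Return (time_in_seconds, offset_in_mb) pairs tracing a circle,
    sorted by time so they can be replayed in order."""
    points = []
    for i in range(steps):
        theta = 2 * math.pi * i / steps
        t = center_time + radius_time * math.cos(theta)
        offset = center_offset_mb + radius_offset_mb * math.sin(theta)
        points.append((t, int(round(offset))))
    return sorted(points)

def dd_command(device, offset_mb):
    """Build a dd invocation that reads 1MB at the given offset.
    The device path is whatever disk you are willing to read from."""
    return f"dd if={device} of=/dev/null bs=1M count=1 skip={offset_mb} iflag=direct"

if __name__ == "__main__":
    # Outline of the face; eyes and mouth would be more (smaller) arcs.
    for t, offset_mb in circle_points(60, 500, 50, 400):
        print(f"# at t={t:.1f}s")
        print(dd_command("/dev/sdX", offset_mb))
```

Running the printed dd commands at the listed times while blktrace is recording should leave a dot on the graph for each read.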

August 22, 2007

XFS, blktrace, seekwatcher

Filed under: programming programming/kernel filesystems — JeffPC @ 23:50

Today I was playing with blktrace, and graphing the results with seekwatcher. At one point, I ran acp (which is a lot like tar, but tries to be smarter) on a directory stored on an XFS volume, but I forgot that months ago, I created a sparse file 101PB (that's peta) in size. Well, acp was happily reading all the sparse regions. I killed it, and decided to remove the gigantic file which was totally useless. About 30 seconds into the removal, I realized it would have been great to have a trace of that. Well, I started blktrace and about 12 minutes later the rm process finished.

I graphed it and here's the result (click for larger version):

XFS removing a large sparse file

At first I was very confused why things looked the way they did, but eventually it dawned on me (after some discussion with Dave Chinner - XFS dude) that it's all journal log traffic. I quickly ran xfs_info on the filesystem:


meta-data=/dev/sdb1              isize=256    agcount=16, agsize=1120031 blks
         =                       sectsz=512   attr=1
data     =                       bsize=4096   blocks=17920496, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=8750, version=1
         =                       sectsz=512   sunit=0 blks
realtime =none                   extsz=65536  blocks=0, rtextents=0

And things just made sense. I calculated the size of the log (see bolded numbers) to be (4096*8750) bytes, or 34.18 MB (base 2) or 35.84 MB (base 10). If you look at the graph, you'll see that the disk offsets accessed ranged from 35001 to 35035 MB, a span of about 35MB! XFS puts the log near the middle of the disk to minimize seeks as much as possible, so as you may have guessed, my disk is about 70GB in size (it's a U160 73GB SCSI disk).
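The arithmetic is easy to redo from the xfs_info output. The sketch below takes the block size and block counts from the "log" and "data" lines; the assumptions that the log starts exactly at the midpoint of the data section and that the graph's "MB" are base 2 are mine, but they line up nicely with the offsets on the graph.

```python
BLOCK_SIZE = 4096        # bsize, in bytes
LOG_BLOCKS = 8750        # blocks on the "log" line
DATA_BLOCKS = 17920496   # blocks on the "data" line

MIB = 2**20   # base 2 megabyte
MB = 10**6    # base 10 megabyte

log_bytes = BLOCK_SIZE * LOG_BLOCKS
log_mib = log_bytes / MIB
log_mb = log_bytes / MB

# Assumption: the log begins at the midpoint of the data section.
log_start_mib = DATA_BLOCKS * BLOCK_SIZE / 2 / MIB
log_end_mib = log_start_mib + log_mib

print(f"log size: {log_bytes} bytes = {log_mib:.2f} MB (base 2) = {log_mb:.2f} MB (base 10)")
print(f"log spans roughly {log_start_mib:.0f} to {log_end_mib:.0f} MB (base 2)")
```

With these numbers the log works out to 35840000 bytes, and the estimated span starts at about 35001 and ends at about 35035, which is the same range that shows up in the trace.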

May 28, 2007

Looking up Files, Part II

Filed under: programming programming/kernel fsl fsl/unionfs filesystems — JeffPC @ 05:28

So, here are more updates about my adventures within the realm of unionfs_lookup (I suggest you read part I first). After my first post about lookup code, I went back to coding, and I had the pleasure of trying to figure out why I was hitting a BUG_ON() with my new code, but not with the old code.

I made a simple test case: in one terminal I'd run fsx (a POSIX compliance tester program) on unionfs:


mount -t unionfs -o dirs=/mnt/foo/b0:/mnt/foo/b1=ro none unionfs/
cd unionfs/
fsx -l 104060000 -q foo

And then mid-way through, I'd insert a branch as the new branch index 0:


mount -o remount,add=/mnt/foo/b0:/mnt/foo/b2=rw /mnt/unionfs

The remount command immediately caused the BUG_ON (that tests for dentry validity) in unionfs_setattr to trigger. It seemed rather odd that the lookup code replacement would do something that'd cause the unionfs dentry to be invalid. I pondered for a bit, and then I tried to insert a number of branches quickly with the old code. Eureka! The same BUG_ON() got triggered. Some lxr-ing later, it became apparent that we need to potentially revalidate inside the inode ops (like unionfs_setattr). Seems kinda obvious now, oh well. I'm also pondering the possibility of changing the VFS to call d_revalidate, but I'm still not sure if that's the Right Thing(tm) to do.

Until next time!

May 28, 2007

Looking up Files

Filed under: programming programming/kernel fsl fsl/unionfs filesystems — JeffPC @ 03:08

So, I spent the last two to three days mucking around with unionfs_lookup. Before I touched it, it was a very, very ugly, 340 line beast that no one on the Unionfs team wanted to touch in the past year or so, because it seemed that just looking at it the wrong way would make it not work. The function had 4 different modes of operation, which overlapped in subtle ways.

Well, I decided to have some fun - rewriting it from scratch. :) Currently, the lookup function is 210 lines long, and it looks like it is working. It has only one mode - as it should. Since I am still not done, the original lookup code is still there, and used for the other 3 modes. I'll hack on it some more, and either remove them completely, or collapse some of them because they seem a bit redundant.

In the end, even if the total code size is still around 340 lines, I'll be happy. Having 340 lines of readable code is way better than 340 lines of barely readable code.

January 9, 2007

Step 1: Fame

Totally awesome day! I submitted Unionfs to the usual places (linux-kernel, fsdevel, and the key people), then I stayed up all night. In the morning, I got a form for permission to enroll in the graduate version of compilers (I'd much prefer lex & yacc to some made up java thing the undergrad course uses). At around 10, I decided to head home and get some sleep. I woke up about 8 hours later, and checked my email. I replied to a lot of comments/questions by Andrew Morton and some other people, and when I finally managed to check the rest of the inbox, I saw:


Jan 08 akpm@osdl.org ( 236) + unionfs-documentation.patch added to -mm tree
Jan 08 akpm@osdl.org ( 107) + lookup_one_len_nd-lookup_one_len-with-nameidata-argument.patch added to -mm
Jan 08 akpm@osdl.org ( 138) + unionfs-branch-management-functionality.patch added to -mm tree
Jan 08 akpm@osdl.org ( 649) + unionfs-common-file-operations.patch added to -mm tree
Jan 08 akpm@osdl.org ( 733) + unionfs-copyup-functionality.patch added to -mm tree
Jan 08 akpm@osdl.org ( 299) + unionfs-dentry-operations.patch added to -mm tree
Jan 08 akpm@osdl.org ( 313) + unionfs-file-operations.patch added to -mm tree
Jan 08 akpm@osdl.org ( 319) + unionfs-directory-file-operations.patch added to -mm tree
Jan 08 akpm@osdl.org ( 326) + unionfs-directory-manipulation-helper-functions.patch added to -mm tree
Jan 08 akpm@osdl.org ( 995) + unionfs-inode-operations.patch added to -mm tree
Jan 08 akpm@osdl.org ( 572) + unionfs-lookup-helper-functions.patch added to -mm tree
Jan 08 akpm@osdl.org ( 743) + unionfs-main-module-functions.patch added to -mm tree
Jan 08 akpm@osdl.org ( 344) + unionfs-readdir-state.patch added to -mm tree
Jan 08 akpm@osdl.org ( 501) + unionfs-rename.patch added to -mm tree
Jan 08 akpm@osdl.org ( 263) + unionfs-privileged-operations-workqueue.patch added to -mm tree
Jan 08 akpm@osdl.org ( 168) + unionfs-handling-of-stale-inodes.patch added to -mm tree
Jan 08 akpm@osdl.org ( 228) + unionfs-miscellaneous-helper-functions.patch added to -mm tree
Jan 08 akpm@osdl.org ( 402) + unionfs-superblock-operations.patch added to -mm tree
Jan 08 akpm@osdl.org ( 233) + unionfs-helper-macros-inlines.patch added to -mm tree
Jan 08 akpm@osdl.org ( 552) + unionfs-internal-include-file.patch added to -mm tree
Jan 08 akpm@osdl.org ( 87) + unionfs-include-file.patch added to -mm tree
Jan 08 akpm@osdl.org ( 218) + unionfs-unlink.patch added to -mm tree
Jan 08 akpm@osdl.org ( 109) + unionfs-kconfig-and-makefile.patch added to -mm tree

Unionfs is now in -mm!

If you actually look at the next -mm changelog, you only see one patch containing all of Unionfs, as Andrew decided to use the git tree that I set up (gitweb) on kernel.org.

September 19, 2006

XFS & ext3

Filed under: programming programming/kernel random rants filesystems — JeffPC @ 08:54

So, here's a mini-rant...There are just as many XFS complaints as ext3 complaints on the linux-kernel mailing list. Yep. It is that simple. I can't stand the fact that some people make a big deal out of complaints about XFS, but are oddly silent (or ignorant?) of the fact that there are just as many "problems" with Ext2/3. I'm not even considering Ext4, as it is in development.

September 19, 2006

Papers...

Filed under: programming programming/kernel fsl fsl/unionfs filesystems — JeffPC @ 08:59

Hrm, the never-ending TODO list currently contains two papers I want to read. Since the TODO list doesn't actually exist anywhere beyond my head, I figured that I'd post the links to the papers here, and maybe someone who reads this will bug me, in turn making me actually read them.

September 1, 2006

Unionfs Request For Comments

So, finally, after a long time of trying to get Unionfs into the Linux kernel, we submitted it to linux-kernel, fsdevel, cc'ing all the people that should be cc'd (Al Viro, Christoph Hellwig, and Andrew Morton).
