O_PONIES & Other Assorted Wishes
You might have already heard about ext4 “eating” people’s data. That’s simply not true.
While I am far from being a fan of ext4, I feel an obligation to set the record straight. But first, let me give you some references with an approximate timeline. I’m sure I managed to leave out a ton of details.
In mid-January, a bug titled Ext4 data loss showed up in the Ubuntu bug tracker. The complaining users apparently were losing data on system crashes when using ext4. (The fact that Ubuntu likes to include every unstable & crappy driver in their kernels doesn’t help at all.) As part of the discussion, Ted Ts’o explained that the problem wasn’t with ext4 but with applications that did not ensure that the data they wrote was actually safe. People did not like hearing that.
Things went pretty quiet until mid-March. That’s when a slashdot article made it painfully obvious that many of today’s apps are buggy. Some applications (KDE being a whole suite of applications) had gotten used to the fact that ext3 was a very common filesystem used by Linux installations. More specifically, they got used to the behavior that ext3’s default mount option (data=ordered) provided. This is really the issue. The application developers assumed that the POSIX interface gave them more guarantees than it did! To make matters worse, the one way to ensure that the contents of a file get to the disk (the fsync system call) is very expensive on ext3. So over the past (almost) decade that ext3 has been around, application developers have been “trained” (think Pavlov reflexes) to not use fsync — on ext3, it’s expensive and the likelihood of you losing data is much lower due to the default mount options. ext4’s fsync implementation, much like other filesystems’ implementations (e.g., XFS), does not suffer from this. (You may have heard about fsync on ext3 being expensive almost a year ago when Firefox was hit by this: Fsyncers and curveballs (the Firefox 3 fsync() problem). Note that in this case, as Ted Ts’o points out, the problem is that Firefox uses the same thread to draw the UI and do IO. That’s plain stupid.)
Over the next few days, Ted Ts’o posted two blog entries about delayed allocation (people seem to like to blame it for dataloss): Delayed allocation and the zero-length file problem, Don’t fear the fsync!.
About the same time, Eric Sandeen wrote a blurb about the state of affairs: fsync, sigh. He points out that XFS faced the same issue years ago. When the application developers were confronted about their applications being broken, they just put their fingers in their ears, hummed loudly, and yelled “I can’t hear you!” There is a word for that, and here’s the OED definition for it:
denial,
The asserting (of anything) to be untrue or untenable; contradiction of a statement or allegation as untrue or invalid; also, the denying of the existence or reality of a thing.
The problem is application developers not wanting to believe that it’s an application problem. Well, it really is! Not only are those apps broken, but they are not portable. AIX, IRIX, or Solaris will not give you the same guarantees as ext3!
(Eric is also trying to fight the common misconception that XFS nulls files, which I assure you is not the case: XFS does not null files, and requires no flux.)
About a week later, on an episode of Free Software Round Table, the problem was discussed a bit. They got most of it right :) (Here’s a 55MB mp3 of the show: 2009-03-21.)
When April 1st came about, the linux-fsdevel mailing list got a patch from yours truly: [PATCH] fs: point out any processes using O_PONIES. (The pony thing…it’s a bit of an inside joke among the Linux filesystem developers.) The idea of having O_PONIES first came up in #linuxfs on OFTC. While I don’t remember who first thought of it (my guess would be Eric), I know for sure that it wasn’t me. At the same time, I couldn’t help it, and considering that the patch took only a minute to make (and compile test), it was well worth it.
A few days later, during the Linux Storage and Filesystem workshop, the whole fsync issue got some discussion time. (See “Rename, fsync, and ponies” at Linux Storage and Filesystem workshop, day 1.) The part that really amused me:
Prior to Ted Ts’o’s session on fsync() and rename(), some joker filled the room with coloring-book pages depicting ponies. These pages reflected the sentiment that Ted has often expressed: application developers are asking too much of the filesystem, so they might as well request a pony while they’re at it.
In the comments for that article you can find Ted Ts’o saying:
Actually, it was Josef ’Jeff’ Sipek who deserves the first mention of application programmers asking for ponies, when he posted an April Fools patch submission for the new open flag, O_PONIES — unreasonable file system assumptions desired.
Another file system developer who had worked on two major filesystems (ext4 and XFS) had a t-shirt on that had O_PONIES written on the front. And the joker who distributed the colouring book pages with pictures of ponies was another file system developer working yet another next generation file system.
Application programmers, while they were questioning my competence, judgement, and even my paternity, didn’t quite believe me when I told them that I was the moderate on these issues, but it’s safe to say that most of the file system developers in the room were utterly unsympathetic to the idea that it was a good idea to encourage application programmers to avoid the use of fsync(). About the only one who was also a moderate in the room was Val Aurora (formerly Henson). Both of us recognize that ext3’s data=ordered mode was responsible for people deciding that fsync() was harmful, and I’ve said already that if we had known how badly it would encourage application writers to Do The Wrong Thing, I would have pushed hard not to make data=ordered the default. Unfortunately, memory wasn’t as plentiful in those days, and so the associated page writeback latencies wasn’t nearly as bad ten years ago.
Hrm, I’m not sure how to take it…he makes it sound like I’m an extremist. Jeff — a freedom fighter for sanity of filesystem interfaces! :) As I said, I can’t take credit for the idea of O_PONIES. As I was writing this entry, I mentioned it to Eric and he promptly wrote an entry of his own: Coming clean on O_PONIES. It looks like he isn’t sure that he was the one to invent it! I’ll give him credit for it anyway.
The next day, a group photo of the attendees was taken… You can clearly see Val Aurora wearing an O_PONIES shirt. The idea was Eric’s, and as far as I know, he had his shirt the first day.
Fedora 11 is supposedly going to use ext4 as the default filesystem. When Ars Technica published an article about it (First look: Fedora 11 beta shows promise), some misguided people who think that ext4 eats your data left a bunch of comments… *sigh*
Well, there you have it. That’s the summary of events with some of my thoughts interleaved. If you are writing a userspace application that does file IO, do the right thing: fsync the data you care about (or at least fdatasync).
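For the record, here is a minimal sketch of what “doing the right thing” looks like when replacing a file’s contents: write a temporary file, fsync() it, then rename() it over the original. The function name and the ".tmp" suffix are made up for illustration, and error handling is trimmed to the essentials.

```c
/* Sketch: replace the file at @path with @len bytes from @buf so
 * that a crash leaves either the old contents or the new ones on
 * disk, never a zero-length file.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int save_file(const char *path, const char *buf, size_t len)
{
	char tmp[4096];
	int fd;

	snprintf(tmp, sizeof(tmp), "%s.tmp", path);

	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	close(fd);

	/* rename() atomically swaps the name; the fsync() above
	 * makes sure the new data is on disk before it does. */
	if (rename(tmp, path) < 0) {
		unlink(tmp);
		return -1;
	}
	return 0;
}
```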
Comment by unknown — January 1, 1970 @ 00:00
This is a great example of what is wrong with fsync()! As a _user_, I absolutely detest fsync(). I feel that applications are completely overusing it. I completely understand why everybody is whining. The problem is the lack of a better solution. The answer is that fsync() is the only portable solution for now, and IT TOTALLY SUCKS.
For a while there, mutt would fsync() after every message it wrote to a folder. Imagine moving 10,000 messages to another folder. Yes, mutt should probably have just fsync()d after it was done writing everything, but even that is annoying. I don't even want to wait for that. I want an interface that responds as fast as damn well possible (see threading below).
LILO back in 1998 used to install in a split second. Boom, done, reboot. Recently (before I started using grub), it changed into some beast that would fsync() so many times during an install that it would take about 5 seconds to run. The result is a higher chance of a power failure or crash while it is fsync()ing, meaning that the time the system is unbootable has been increased, meaning that the use of fsync() has actually made it LESS reliable. Oops! But I really cared about that data!
mythtv (and most syslogs by default) opens its log files O_SYNC or fsync()s them, and mythtv fsync()s its video files so often that I can't sit in the same room as the hard drive; it grinds every second or more. This is incredibly frustrating. When pressing the up arrow in the listings, it decides to log some silly error, which grinds away at the hard drive. Argh!
This doesn't go away as we move to SSDs! Instead of making noise, it just wears out the flash! Really, we don't actually want to write this often.
There were just so many cases where the speed or responsiveness impact from the use of fsync() annoyed me that I ended up writing an LD_PRELOAD library that no-ops fsync(), fdatasync(), and O_SYNC. I have been running this on every desktop I have since around January, 2000, and I have yet to lose any data as a result. This is as a DESKTOP user. Sure, I'm not recording bank transactions.
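Such a shim really is only a few lines. A sketch of what the commenter describes (fsync and fdatasync are the real libc entry points; the file name and build line are just examples, and neutering O_SYNC would additionally require wrapping open() to mask the flag off):

```c
/* fsync-noop.c: an LD_PRELOAD shim that turns fsync() and
 * fdatasync() into no-ops.
 *
 *   gcc -shared -fPIC -o fsync-noop.so fsync-noop.c
 *   LD_PRELOAD=./fsync-noop.so mutt
 */
int fsync(int fd)
{
	(void)fd;	/* lie: claim the data is already on disk */
	return 0;
}

int fdatasync(int fd)
{
	(void)fd;
	return 0;
}
```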
YES, there needs to be a way of an application knowing when something is written. YES, there needs to be a way to ask for it to happen soon, please. But blocking is the worst possible implementation for an application developer.
What about some sort of notification event when the data is actually on disk? To me, the needs of an application always seemed to be more along the lines of "let me know when that's on disk so I can forget about it", like the way an ACK back to a TCP stack lets it forget those bytes in its window, or grouping some changes into an atomic transaction or ordered commits (eg: mutt not wanting to write the new msgs, delete the old, and lose both because of write ordering).
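Something close to this actually exists in POSIX AIO, though hardly anyone uses it: aio_fsync() queues the flush and can deliver a completion notification instead of blocking. A hedged sketch of that model (on_durable() and request_flush() are invented names, and note that SIGEV_THREAD runs the callback on a helper thread behind the scenes):

```c
/* Sketch of the "notify me when it's durable" model using POSIX
 * AIO (link with -lrt on Linux).  aio_fsync() queues the flush
 * and returns immediately; the callback fires once it completes.
 */
#include <aio.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>

static struct aiocb cb;

static void on_durable(union sigval sv)
{
	/* the data is on disk; safe to forget it now */
	printf("fd %d flushed\n", sv.sival_int);
}

int request_flush(int fd)
{
	cb.aio_fildes = fd;
	cb.aio_sigevent.sigev_notify = SIGEV_THREAD;
	cb.aio_sigevent.sigev_notify_function = on_durable;
	cb.aio_sigevent.sigev_value.sival_int = fd;

	/* queue the flush; the caller keeps running */
	return aio_fsync(O_SYNC, &cb);
}
```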
Does a Firefox user care that they might lose a site in their history if their machine crashes a few seconds after loading the page? No. What they care about is if their entire history disappears (or gets corrupted due to lack of ordering and has the same effect).
You say "the same thread to draw the UI and do IO. That's plain stupid". Consider the alternative! Mutt would have to become a multithreaded application ONLY so that it could wait on the disk in the background and still be responsive. As you well know, threaded programming is hard. Most people get it wrong.
All OSes already have an asynchronous write-back queue (dirty pages and all these write-back timers and VM heuristics). These exist because UNIX is not DOS. Blocking on the creation of data is just not feasible for performance. The distinction is between writing streams and writing records. It sure as hell doesn't make sense to spawn a thread every time one wants to write anything to disk. So where do we draw the line? Bash doesn't know that your writing to a file is uber-important. Do we add --but-please-fsync-it-because-i-like-it to every shell utility? Do we run "sync" between every step? No, because the power supply could explode at any time anyway.
Imagine an SMTP server where it can pile up a bunch of writes to be checkpointed, ask for that to happen sometime soon (not in that order), and be notified when it's on disk so that it can write back "OK" to the sender. This would let any mail server be completely single-threaded (or at least have no need for multiple threads except for more worthy, CPU-bound tasks). The same applies to nearly any daemon or application that needs to write to disk and still be responsive. (Reading, e.g., in Apache, is another problem; hence my surprise that there is no wakeup support for O_NONBLOCK on files, but that's another story.)
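To make the batching idea concrete, a toy commit path might write a whole pile of messages, issue a single fsync(), and only then acknowledge the batch. Everything here (struct msg, commit_batch(), the spool fd) is an invented illustration, not any real server's code:

```c
/* One flush per batch of messages instead of one per message. */
#include <stddef.h>
#include <unistd.h>

struct msg {
	const char *data;
	size_t len;
};

/* Returns 0 once every message in the batch is durable; only
 * then may the server send "OK" back for each of them. */
int commit_batch(int spool_fd, const struct msg *q, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (write(spool_fd, q[i].data, q[i].len) < 0)
			return -1;

	/* a single flush covers the whole batch */
	return fsync(spool_fd);
}
```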
Anyway, those are my feelings on the topic. "fsync the data you care about", but also realize how annoying, wasteful, and counter-productive it can be in some cases to do so. Most people don't have data that they care about that much anyway. I just want "cp a b; rm a" to at least leave a or b around. I know POSIX doesn't guarantee it, but don't you agree that "cp a b ; sync ; rm a" seems like overkill?
(No, I don't know how to implement a callback API for shell programs. Hmm, I wonder why developers would like implicit ordering...)