Josef “Jeff” Sipek

lguest: The New Kid on the Block

As most of you know, virtuallization doesn’t really interest me, so me writing about lguest is rather unusual. For those who don’t know, lguest is Rusty Russell’s way of saying virtualization sucks and I can make it better (don’t quote me on that).

Yesterday, Rusty sent out 7 patch series ( 1, 2, 3, 4, 5, 6, 7) that contains most of the documentation for lguest. This is not the normal style of documentation you’ll find in the kernel. Here’s Rusty’s description…

Lguest is an adventure, with you, the reader, as Hero. I can’t think of many 5000-line projects which offer both such capability and glimpses of future potential; it is an exciting time to be delving into the source!

But be warned; this is an arduous journey of several hours or more! And as we know, all true Heroes are driven by a Noble Goal. Thus I offer a Beer (or equivalent) to anyone I meet who has completed this documentation.

So get comfortable and keep your wits about you (both quick and humorous). Along your way to the Noble Goal, you will also gain masterly insight into lguest, and hypervisors and x86 virtualization in general.

There is a very large number of totally hillarious comments. It looks like one doesn’t have to be an x86 expert to get a laugh out of them, but knowing a thing or two about the architecture makes it all the more enjoyable.

I can’t help but include few excerpts here…

Intel provided a special instruction to clear the TS bit for people too cool to use write_cr0() to do it. This "clts" instruction is faster, because all the vowels have been optimized out.

I’m told there are only two stories in the world worth telling: love and hate. So there used to be a love scene here like this:

Launcher: We could make beautiful I/O together, you and I.
Guest: My, that’s a big disk!

Unfortunately, it was just too raunchy for our otherwise-gentle tale.

Just read the patches. They are really amusing :)

Looking up Files, Part II

So, here’s more updates about my adventures within the realm of unionfs_lookup (I suggest you read part I first). After my first post about lookup code, I went back to coding, and I had the pleasure to try to figure out why I was hitting a BUG_ON() with my new code, but not with the old code.

I made a simple test case, in one terminal I’d run fsx (a POSIX compliance tester program) on unionfs:

mount -t unionfs -o dirs=/mnt/foo/b0:/mnt/foo/b1=ro none unionfs/
cd unionfs/
fsx -l 104060000 -q foo

And then mid-way through, I’d insert a branch as the new branch index 0:

mount -o remount,add=/mnt/foo/b0:/mnt/foo/b2=rw /mnt/unionfs

The remount command immediatelly caused the BUG_ON (that tests for dentry validity) in unionfs_setattr to trigger. It seemed rather odd that the lookup code replacement would do something that’d cause the unionfs dentry to be invalid. I pondered for a bit, and then I tried to insert a number of branches quickly with the old code. Eureka! The same BUG_ON() got triggered. Some lxr-ing later, it became apparent that we need to potentially revalidate inside the inode ops (like unionfs_setattr). Seems kinda obvious now, oh well. I’m also pondering about the posibility of changing the VFS to call d_revalidate, but I’m still not sure if that’s the Right Thing(tm) to do.

Until next time!

Looking up Files

So, I spend the last two to three days mucking around with unionfs_lookup. Before I touched it, it was a very, very ugly, 340 line beast that no one on the Unionfs team wanted to touch in the past year or so, because it seemed that just looking at it the wrong way would make it not work. The function had 4 different modes of operation, which overlapped in subtle ways.

Well, I decided to have some fun — rewriting it from scratch. :) Currently, the lookup function is 210 lines long, and it looks like it is working. It has only one mode - as it should. Since, I am still not done, the original lookup code is still there, and used for the other 3 modes. I’ll hack on it some more, and either remove them completely, or collapse some of them because they seem a bit redundant.

In the end, even if the total code size is still around 340 lines, I’ll be happy. Having 340 lines of readable code is way better than 340 lines of barely readable code.

I'd Tap That

So, I just spend the whole night playing around with Systemtap. A neat way to dynamically instrument the kernel.

As far as I know, it uses the kprobe interface to do the hard work. The neat thing about it is, the fact that you use a rather safe language to write the code that’ll get inserted into the kernel. This code then gets translated into C code, which then gets compiled as a kernel module.

When I first heard about systemtap, I decide to ignore it for whatever reason. Now, I’m kind of sorry I did, because it definitely looks like a very useful tool.

Fun Kernel Experiment

Ok, so after doing all that work on Guilt, I decided that it would be an interesting experiment to resurect an idea I had about 2.5 years ago…my own kernel patch series. Back then, I called it -js, for rather obvious reasons (my name, if you haven’t noticed).

I wrote a small script, which allows me to make a release of a 2.6-js kernel patch series within seconds. I have no idea if it is worth anything having yet another kernel patch series, and I might not actually maintain it for long, but just in case someone is brave enough to try it, here it is: Yeah, I could put it on, but I won’t unless I see that -js is actually getting somewhere.

Bug reports, etc. are..well..I guess, welcome :)

Step 1: Fame

Totally awesome day! I submitted Unionfs to the usual places (linux-kernel, fsdevel, and the key people), then I stayed up all night. In the morning, I got a form for permission to enroll in the graduate version of compilers (I’d much prefer lex & yacc to some made up java thing the undergrad course uses). At around 10, I decided to head home and get some sleep. I woke up about 8 hours later, and checked my email. I replied to a lot of comments/questions by Andrew Morton and some other people, and when I finally managed to check the rest of the inbox, I saw:

Jan 08   ( 236) + unionfs-documentation.patch added to -mm tree
Jan 08   ( 107) + lookup_one_len_nd-lookup_one_len-with-nameidata-argument.patch added to -mm
Jan 08   ( 138) + unionfs-branch-management-functionality.patch added to -mm tree
Jan 08   ( 649) + unionfs-common-file-operations.patch added to -mm tree
Jan 08   ( 733) + unionfs-copyup-functionality.patch added to -mm tree
Jan 08   ( 299) + unionfs-dentry-operations.patch added to -mm tree
Jan 08   ( 313) + unionfs-file-operations.patch added to -mm tree
Jan 08   ( 319) + unionfs-directory-file-operations.patch added to -mm tree
Jan 08   ( 326) + unionfs-directory-manipulation-helper-functions.patch added to -mm tree
Jan 08   ( 995) + unionfs-inode-operations.patch added to -mm tree
Jan 08   ( 572) + unionfs-lookup-helper-functions.patch added to -mm tree
Jan 08   ( 743) + unionfs-main-module-functions.patch added to -mm tree
Jan 08   ( 344) + unionfs-readdir-state.patch added to -mm tree
Jan 08   ( 501) + unionfs-rename.patch added to -mm tree
Jan 08   ( 263) + unionfs-privileged-operations-workqueue.patch added to -mm tree
Jan 08   ( 168) + unionfs-handling-of-stale-inodes.patch added to -mm tree
Jan 08   ( 228) + unionfs-miscellaneous-helper-functions.patch added to -mm tree
Jan 08   ( 402) + unionfs-superblock-operations.patch added to -mm tree
Jan 08   ( 233) + unionfs-helper-macros-inlines.patch added to -mm tree
Jan 08   ( 552) + unionfs-internal-include-file.patch added to -mm tree
Jan 08   (  87) + unionfs-include-file.patch added to -mm tree
Jan 08   ( 218) + unionfs-unlink.patch added to -mm tree
Jan 08   ( 109) + unionfs-kconfig-and-makefile.patch added to -mm tree

Unionfs is now in -mm!

If you actually look at the next -mm changelog, you only see one patch containing all of Unionfs, as Andrew decided to use the git tree that I set up (gitweb) on

Linus has gone insane

So, it would seem that Linus has gone insane. His announcement of 2.6.18 is rather odd — I didn’t expect him to go for the Internation Talk-like a Pirate day thing…


She’s good to go, hoist anchor!

Here’s some real booty for all you land-lubbers.

There’s not too many changes, with t’bulk of the patch bein’ defconfig updates, but the shortlog at the aft of this here email describes the details if you care, you scurvy dogs.

Header cleanups, various one-liners, and random other fixes.

Linus "but you can call me Cap’n"

Oh, and I got a simple patch into the kernel. I have about 8 more of similar ones that should get into 2.6.19-rc1. Oh, and of course, I should resubmit Unionfs in the next week or two. Grr…so much to do!

Unionfs Request For Comments

So, finally, after long time of trying to get Unionfs into the Linux kernel, we submited it to linux-kernel, fsdevel, cc’ing all the people that should be cc’d (Al Viro, Christoph Hellwig, and Andrew Morton.)

Double u, tee, ef

You can look at this post as either a sequal to a post I made few days ago, or as another sequal to a post I made about a year ago.

So, while doing some more cleanup, Dave Quigley discovered this gem in Unionfs:

for (bindex = bstart - 1; bindex >= 0; bindex--) {
        err = copyup_file(dentry->d_parent->d_inode,
                          file, bstart, bindex,

        if (!err)

The particular piece of interest is the if statement:

if (!err)

What does it mean? If there is no error, we break out of the loop. Makes sense. If there is an error, we go to the top of the loop. Makes sense…right? Well, what happens if you reach the end of a for loop? We go to the top of the loop! See the brain damage? We are at the end of the loop, so we don’t need an explicit continue statement.

Here’s the if statement the way it should be (for completeness):

if (!err)

Oh, so simple, it almost brings a tear into one’s eye.

Bizarre Code

Think of this as a sequel to my post about a year ago.

The following peice of code is from Unionfs’s copyup_permission(). It is (or should be) a simple function that is supposed to copy the permissions from one inode to another. Well, I can’t help but remember this one slide from Greg Kroah-Hartman’s keynote this year at Ottawa Linux Symposium. Here’s the slide:

Linux is evolution, not intelligent design.

The quote definitely applies to copyup_permission(). It seems very clear that it “evolved” to something very odd. Anyway, I shall torment you no longer, here is the code:

static int copyup_permissions(struct super_block *sb,
                              struct dentry *old_hidden_dentry,
                              struct dentry *new_hidden_dentry)
        struct iattr newattrs;
        int err;


        newattrs.ia_atime = old_hidden_dentry->d_inode->i_atime;
        newattrs.ia_mtime = old_hidden_dentry->d_inode->i_mtime;
        newattrs.ia_ctime = old_hidden_dentry->d_inode->i_ctime;
        newattrs.ia_valid = ATTR_CTIME | ATTR_ATIME | ATTR_MTIME |
                        ATTR_ATIME_SET | ATTR_MTIME_SET;
        /* original mode of old file */
        newattrs.ia_mode = old_hidden_dentry->d_inode->i_mode;
        newattrs.ia_gid = old_hidden_dentry->d_inode->i_gid;
        newattrs.ia_uid = old_hidden_dentry->d_inode->i_uid;
        newattrs.ia_valid |= ATTR_FORCE | ATTR_GID | ATTR_UID | ATTR_MODE;
        if (newattrs.ia_valid & ATTR_MODE) {
                newattrs.ia_mode = (newattrs.ia_mode & S_IALLUGO) |                
                        (old_hidden_dentry->d_inode->i_mode & ~S_IALLUGO);

        err = notify_change(new_hidden_dentry, &newattrs);

        return err;

It was actually Seth Arnold that noticed that the condition will ALWAYS be true because ATTR_MODE is set in the line just above it. Furthermore, if one eliminates the if statement and replaces newattr.ia_mode in the assignment with what it is set to just few lines before (right after the comment) he gets:

newattrs.ia_mode = (old_hidden_dentry->d_inode->i_mode & S_IALLUGO) |
                   (old_hidden_dentry->d_inode->i_mode & ~S_IALLUGO);

If you are up to speed with bitwise operations, you’ll realize that it can be simplified to:

newattrs.ia_mode = old_hidden_dentry->d_inode->i_mode;


Powered by blahgd