Josef “Jeff” Sipek

Making Version Control Systems Go Boom

So, the time has come, once again, to talk of many things…of Git and Mercurial. :)

For a fun project which I’ll describe here some other time, I want to version about 2GB of files. Here’s the breakdown:

  • 5x 312MB
  • 3x 100MB
  • 2x 16MB
  • 80 other files all under 5MB each

My first instinct was to use Mercurial, and so I did. It made sense, because it stores compressed deltas for the files. I don’t expect more than ~20MB to change between two consecutive versions, so it made sense on an architectural level as well.

The setup

There are a number of computers involved; unless I say otherwise, I’m talking about my laptop.

  • laptop: 3.06GHz P4, 1GB RAM
  • server: Athlon 2000, 1.25GB RAM
  • kernel devel box: 2x 2.8GHz Xeon, 2GB RAM, 4GB swap
  • big box: 4x 1.8GHz Opteron, 64GB RAM

Unfortunately, I can’t use the “big box” much. :( Oh well.

Attempt #1: Mercurial

First, I set up the directory hierarchy with all the files. Virtually all of the data in the 100MB & 312MB files consists of binary zeros, so it came as no surprise that the initial commit created approximately 50MB worth of history. Not bad at all! I ran some commands that changed the files the way I wanted, and committed each time I felt it was a good place to checkpoint. Mercurial’s compressed delta way of storing history really worked well: only a 4MB increase in history between the initial and the 6th commit.
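
For the curious, a minimal sketch of roughly how such a repository could be set up (the file names and sizes here are made up for illustration, not the actual project layout):

mkdir project && cd project
# create a few large, mostly-zero files; dd from /dev/zero stands in for the real data
for i in 1 2 3 4 5; do dd if=/dev/zero of=big$i.img bs=1M count=312; done
hg init
hg add
hg commit -m 'initial import'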

At this point, I decided that I should make a clone on another computer — yeah, I use distributed version control systems for backups of individual projects. :) Now, this is where things went crazy. I initiated a clone on my server, and after about two minutes, the hg process on my laptop died with a memory allocation error. That sucks. It was probably because of the protocol, which tries to uncompress everything, and recompress it to save bandwidth. Since I was on a LAN, I tried to use the --uncompressed option, which doesn’t try to be smart, and just wastes bandwidth, but I forgot that I need to enable it on the server side, and so, unknown to me, it still tried to compress the data. It died with a memory error, just as before. Oh well. At this point, I decided to try Git for this project.
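
For reference, this is roughly what I should have done (a sketch; the paths and hostname are made up): allow uncompressed serving in the server-side repository’s hgrc first, and only then clone with --uncompressed.

# on the server: allow streaming/uncompressed clones for this repository
cat >> /path/to/repo/.hg/hgrc << 'EOF'
[server]
uncompressed = True
EOF

# on the laptop: clone over the LAN without the memory-hungry recompression
hg clone --uncompressed ssh://server//path/to/repo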

Attempt #2: Git

Git uses a different storage scheme; well, actually it has two. Whenever you commit, git stores the full file versions — compressed. I did a quick conversion of the hg repo to git — by hand, as there were only 6 commits. I had to use:

hg update -C <rev>

otherwise, hg was trying to be too smart — something that makes you run out of memory. :)
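
The conversion itself was nothing fancy; something along these lines (paths and revision numbers are illustrative) is all it takes: check out each hg revision, and commit the resulting tree into a fresh git repository.

mkdir ../converted && (cd ../converted && git init)
for rev in 0 1 2 3 4 5; do
	hg update -C $rev
	# copy the working tree over, leaving the .hg metadata (and the target's .git) alone
	rsync -a --delete --exclude=.hg --exclude=.git ./ ../converted/
	(cd ../converted && git add -A && git commit -m "import hg rev $rev")
done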

After the conversion, the resulting .git repo was also about 50MB in size. Everything worked just as well. It is possible that the commits took a little bit less time, as committing consists of just compressing the files and storing them on disk. I am not sure which one was faster, and knowing how each works doesn’t help with the psychological effects. :)

Anyway, it was time for me to clone the repository — again, going from my laptop to the server. I was afraid of this step, because when git transfers data between repositories, it tries to conserve bandwidth by making a packfile — a file containing a number of deltified objects (such as the compressed files stored during commit). It started to create the packfile, but it died with a nice message saying that it ran out of memory. Great! Now what? At that point, I decided to cheat. Since I need a packfile sooner or later, I just rsync’d the whole git repo to the kernel test box I have — a box that has twice the RAM and 4GB of swap, and I tried to clone from that. It got to about 66% done, when it was using most of the RAM, and far too much swap. After about an hour and twenty minutes, I decided to rsync the repo to the box that has 64GB of RAM. On it, I ran the commands necessary to just create a packfile — without pulling/pushing/cloning. In about 10 minutes, it was done. Great! I then aborted the clone that had been running for an hour and a half, and cloned from the repo that had the packfile all set up. Everything worked rather nicely. :) I moved things back onto my laptop.
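
Creating the pack ahead of time doesn’t require a clone, push, or pull at all; repacking the repository into a single packfile is enough, roughly like this (exact flags from memory, so treat it as a sketch):

cd /path/to/repo
# repack all objects into one packfile and drop the now-redundant ones
git repack -a -d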

Additional commits

Now it was time to resume what I was doing before — “the project”…I made some additional changes to the files, and made another commit. And it was time to push the changes. Git wasn’t happy. I wasn’t going to fight it as I was getting tired, so I just rsync’d the ~6 newly created objects to the server.
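
The manual “push” amounted to something like this (hostname and paths are made up): copy the new loose objects over, and then point the branch on the server at the new commit.

# copy any new loose objects into the server's object store
rsync -av .git/objects/ server:/path/to/repo/.git/objects/
# make the server's master branch point at the commit we just created
ssh server "cd /path/to/repo && git update-ref refs/heads/master $(git rev-parse HEAD)"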

Recently, there have been some patches on the git mailing list to make git a little smarter about the way it uses multiple pack files. This doesn’t apply to me - at least not yet.

Conclusions

So, here it is. Both of the version control systems I like to use (each one has its area where I wouldn’t want to switch to the other) die on me because my 3-year-old laptop has only 1GB of RAM. Just great. :-/ And please, don’t tell me about Subversion and other non-distributed VCS tools. As far as I know, the other distributed systems consume even more resources.

Looking up Files, Part II

So, here are more updates about my adventures within the realm of unionfs_lookup (I suggest you read part I first). After my first post about the lookup code, I went back to coding, and I had the pleasure of trying to figure out why I was hitting a BUG_ON() with my new code, but not with the old code.

I made a simple test case: in one terminal, I’d run fsx (a file system stress-testing program) on unionfs:

mount -t unionfs -o dirs=/mnt/foo/b0:/mnt/foo/b1=ro none unionfs/
cd unionfs/
fsx -l 104060000 -q foo

And then mid-way through, I’d insert a branch as the new branch index 0:

mount -o remount,add=/mnt/foo/b0:/mnt/foo/b2=rw /mnt/unionfs

The remount command immediately caused the BUG_ON() (that tests for dentry validity) in unionfs_setattr to trigger. It seemed rather odd that the lookup code replacement would do something that’d cause the unionfs dentry to be invalid. I pondered for a bit, and then I tried to insert a number of branches quickly with the old code. Eureka! The same BUG_ON() got triggered. Some lxr-ing later, it became apparent that we need to potentially revalidate inside the inode ops (like unionfs_setattr). Seems kinda obvious now, oh well. I’m also pondering the possibility of changing the VFS to call d_revalidate, but I’m still not sure if that’s the Right Thing(tm) to do.

Until next time!

Looking up Files

So, I spent the last two to three days mucking around with unionfs_lookup. Before I touched it, it was a very, very ugly, 340-line beast that no one on the Unionfs team wanted to touch in the past year or so, because it seemed that just looking at it the wrong way would make it not work. The function had 4 different modes of operation, which overlapped in subtle ways.

Well, I decided to have some fun — rewriting it from scratch. :) Currently, the lookup function is 210 lines long, and it looks like it is working. It has only one mode - as it should. Since I am still not done, the original lookup code is still there, and used for the other 3 modes. I’ll hack on it some more, and either remove them completely, or collapse some of them because they seem a bit redundant.

In the end, even if the total code size is still around 340 lines, I’ll be happy. Having 340 lines of readable code is way better than 340 lines of barely readable code.

I'd Tap That

So, I just spent the whole night playing around with Systemtap, a neat way to dynamically instrument the kernel.

As far as I know, it uses the kprobe interface to do the hard work. The neat thing about it is that you use a rather safe language to write the code that’ll get inserted into the kernel. This code then gets translated into C, which then gets compiled as a kernel module.
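
For a feel of what that looks like, here’s a trivial example (not from the original post, just a sketch): count how many times vfs_read gets called over five seconds.

stap -e 'global n
probe kernel.function("vfs_read") { n++ }
probe timer.s(5) { printf("vfs_read called %d times\n", n); exit() }'

That one-liner gets translated to C, built as a module, loaded, and unloaded again once it exits.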

When I first heard about Systemtap, I decided to ignore it for whatever reason. Now, I’m kind of sorry I did, because it definitely looks like a very useful tool.

201st post

Ok, I just realized that this is the 201st post - I missed the nice, round 200. Oh well.

In other news, my wonderful blag is now syndicated by Planet LILUG (run by LILUG).

Fun Kernel Experiment

Ok, so after doing all that work on Guilt, I decided that it would be an interesting experiment to resurrect an idea I had about 2.5 years ago…my own kernel patch series. Back then, I called it -js, for rather obvious reasons (my name, if you haven’t noticed).

I wrote a small script which allows me to make a release of a 2.6-js kernel patch series within seconds. I have no idea if having yet another kernel patch series is worth anything, and I might not actually maintain it for long, but just in case someone is brave enough to try it, here it is: http://www.fsl.cs.sunysb.edu/~jsipek/kernel/js/. Yeah, I could put it on kernel.org, but I won’t unless I see that -js is actually getting somewhere.
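
The script itself is nothing special; a hypothetical sketch of the idea (the version numbers, tags, and upload path below are made up) would be something like:

ver=2.6.21-js1
# generate the combined patch against the mainline tag the series is based on
git diff v2.6.21..HEAD > patch-$ver
gzip -9 patch-$ver
# publish it alongside the other releases
scp patch-$ver.gz server:public_html/kernel/js/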

Bug reports, etc. are…well…I guess, welcome :)

Guilt: Taking over the world one repository at a time

It is really interesting how sometimes a bit of luck makes things happen. For example, a little over 6 months ago, I wrote a few shell scripts, which I called gq, to make my life a little easier. I worked on them for about a week, and then I decided I should share them with the community. So I tagged the sources as version 0.10, and announced it on the git mailing list. One of the comments I got was about the fact that there is another project (completely unrelated) that had the name gq for a long time. Oh well, it was time for me to rename it. After some procrastination and hacking, the new year rolled around, and I decided to release the 6th version (v0.15), but this time it wouldn’t be gq anymore — instead I would call it Guilt. My post from January describes how it got the name. As with every version of gq, I announced Guilt v0.15. I could see that Guilt was getting way better, and so I felt even more motivated to hack on it. v0.16 came out. And then a very unexpected thing happened: I got two patches from a guy on the mailing list. Sweet! I applied them, and released v0.17. Shortly thereafter, during the Linux Storage and Filesystem (LSF) workshop in San Jose, I got a patch from Ted Ts’o (of ext[234] fame). I couldn’t believe it, but it was true. I decided to release v0.19 the next day. At LSF, I met Brandon Philips, and we talked about Guilt. Rather shortly after LSF, he sent me an email saying that he’d try to get Guilt into Debian. :) Well, about a month ago, he succeeded.

As many of you may already know, I hang around in a number of channels on OFTC’s IRC network, and it is rather interesting to see people try Guilt, or talk about it, generally suggesting that someone use it — and people do!

Anyway, I hope I didn’t bore everyone to death with my little tour of history behind Guilt.
