Making Version Control Systems Go Boom

— JeffPC @ May 30, 2007 21:57

So, the time has come, once again, to talk of many things…of Git and Mercurial. :)

For a fun project which I’ll describe here some other time, I want to version about 2GB of files. Here’s the breakdown:

5x 312MB
3x 100MB
2x 16MB
80 other files all under 5MB each
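As a quick sanity check on the "about 2GB" figure, the breakdown above can be totalled. The 80 small files are only bounded at 5MB each, so this sketch reports an upper bound instead of an exact total; the variable names are made up for the example.

```python
# Sizes in MB, taken from the breakdown above.
large_files = [(5, 312), (3, 100), (2, 16)]  # (count, size in MB)
small_file_count = 80
small_file_max_mb = 5  # each small file is "under 5MB", so this is only an upper bound

known_mb = sum(count * size for count, size in large_files)
upper_bound_mb = known_mb + small_file_count * small_file_max_mb

print(f"large files: {known_mb} MB")
print(f"upper bound: {upper_bound_mb} MB")
```

That works out to 1892MB for the ten large files and at most 2292MB overall, which is consistent with "about 2GB".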

My first instinct was to use Mercurial, and so I did. It made sense, because it stores compressed deltas for the files. I don’t expect more than ~20MB to change between two consecutive versions, so it made sense on an architectural level as well.
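To make the "compressed deltas" idea concrete, here is a toy sketch in Python. It is not Mercurial's actual delta format or algorithm; the fixed-size block comparison and the names `make_delta` and `apply_delta` are assumptions made purely for illustration. It records only the blocks that differ between two same-sized versions of a file, then compresses that record.

```python
import zlib

BLOCK = 4096  # hypothetical block size for this sketch


def make_delta(old: bytes, new: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, data) pairs for every block of `new` that differs from `old`.

    Assumes both versions have the same length, which keeps the sketch short.
    """
    assert len(old) == len(new)
    return [
        (off, new[off:off + BLOCK])
        for off in range(0, len(new), BLOCK)
        if old[off:off + BLOCK] != new[off:off + BLOCK]
    ]


def apply_delta(old: bytes, delta: list[tuple[int, bytes]]) -> bytes:
    """Rebuild the new version from the old version plus the delta."""
    buf = bytearray(old)
    for off, data in delta:
        buf[off:off + len(data)] = data
    return bytes(buf)


# An 8MB file of zeros, then a version with 1KB changed in the middle.
old = bytes(8 * 1024 * 1024)
changed = bytearray(old)
changed[4_000_000:4_001_024] = b"\x01" * 1024
new = bytes(changed)

delta = make_delta(old, new)
# Only the changed block data is compressed here; offsets are not counted.
stored = zlib.compress(b"".join(data for _, data in delta))

assert apply_delta(old, delta) == new
print(f"changed blocks: {len(delta)}, compressed delta payload: {len(stored)} bytes")
```

The point is only that the stored history grows with the amount of changed data, not with the size of the file.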

The setup

There are a number of computers involved. Unless I say otherwise, I’m talking about my laptop.

laptop: 3.06GHz P4, 1GB RAM
server: Athlon 2000, 1.25GB RAM
kernel devel box: 2x 2.8GHz Xeon, 2GB RAM, 4GB swap
big box: 4x 1.8GHz Opteron, 64GB RAM

Unfortunately, I can’t use the “big box” much. :( Oh well.

Attempt #1: Mercurial

First, I set up the directory hierarchy with all the files. Virtually all of the data in the 100MB & 312MB files consists of binary zeros, so it came as no surprise that the initial commit created approximately 50MB worth of history. Not bad at all! I ran some commands that changed the files the way I wanted, and committed each time I felt it was a good place to checkpoint. Mercurial’s compressed-delta way of storing history really worked well: only a 4MB increase in history between the initial and the 6th commit.
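A small initial history is what you would expect from zero-filled data. As a rough illustration (using Python's `zlib` module, not Mercurial's own storage code), a long run of binary zeros compresses to a tiny fraction of its size:

```python
import zlib

size = 16 * 1024 * 1024   # 16MB of binary zeros, standing in for a mostly-empty file
zeros = bytes(size)
compressed = zlib.compress(zeros)

ratio = len(compressed) / size
print(f"{size} bytes -> {len(compressed)} bytes ({ratio:.4%} of the original)")
```

Real files with some non-zero content will of course compress less well than this best case.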

At this point, I decided that I should make a clone on another computer. Yeah, I use distributed version control systems for backups of individual projects. :) Now, this is where things went crazy. I initiated a clone on my server, and after about two minutes, the hg process on my laptop died with a memory allocation error. That sucks. It was probably because of the protocol, which tries to uncompress everything and recompress it to save bandwidth. Since I was on a LAN, I tried to use the --uncompressed option, which doesn’t try to be smart and just wastes bandwidth. However, I forgot that it needs to be enabled on the server side as well, so, unknown to me, it still tried to compress the data. It died with a memory error, just as before. Oh well. At this point, I decided to try Git for this project.

Attempt #2: Git

Git uses a different storage scheme. Well, it actually has two. Whenever you commit, git stores the full file versions, compressed. I did a quick conversion of the hg repo to git, by hand, as there were only 6 commits. I had to use:

hg update -C <rev>

otherwise, hg was trying to be too smart — something that makes you run out of memory. :)

After the conversion, the resulting .git repo was also about 50MB in size. Everything worked just as well. It is possible that the commits took a little bit less time, as committing consists of just compressing the files and storing them on disk. I am not sure which one was faster, and knowing how each works doesn’t help with psychological effects. :)
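For the curious, a loose git object for a file is roughly this: a small header plus the file contents, named by the SHA-1 of the two together and stored zlib-compressed. The sketch below reproduces that in Python. The function names are made up for the example, and it only computes the ID and the compressed bytes; it does not write into a real .git directory.

```python
import hashlib
import zlib


def _with_header(content: bytes) -> bytes:
    """Prefix the content with git's blob header: b"blob <size>\\0"."""
    return b"blob " + str(len(content)).encode("ascii") + b"\0" + content


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    return hashlib.sha1(_with_header(content)).hexdigest()


def loose_object_bytes(content: bytes) -> bytes:
    """Return zlib-compressed bytes in the form a loose blob is stored in."""
    return zlib.compress(_with_header(content))


data = b"hello world\n"
print(git_blob_id(data))
print(len(loose_object_bytes(data)), "bytes after compression")
```

Note that every commit of a changed 312MB file means reading and compressing the whole file again, which is simple but says nothing about deltas; those only appear later, in packfiles.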

Anyway, it was time for me to clone the repository, again going from my laptop to the server. I was afraid of this step, because when git transfers data between repositories, it tries to conserve bandwidth by making a packfile: a file containing a number of deltified objects (such as the compressed files stored during commit). It started to create the packfile, but it died with a nice message saying that it ran out of memory. Great! Now what? At that point, I decided to cheat. Since I would need a packfile sooner or later, I just rsync’d the whole git repo to the kernel devel box, which has twice the RAM and 4GB of swap, and I tried to clone from that. It got to about 66% done, by which point it was using most of the RAM and far too much swap. After about an hour and twenty minutes, I decided to rsync the repo to the box that has 64GB of RAM. On it, I ran the commands necessary to just create a packfile, without pulling/pushing/cloning. In about 10 minutes, it was done. Great! I then aborted the clone that had been running for an hour and a half, and cloned from the repo that had the packfile all set up. Everything worked rather nicely. :) I moved things back onto my laptop.
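A back-of-the-envelope estimate suggests why the machines behaved so differently. This is a sketch under a loud assumption: that while searching for deltas, the packer keeps up to 10 uncompressed candidate objects in memory at once (10 is git's default delta window size, but how much memory git really used here was not measured). The swap sizes for the laptop and the big box are not stated above, so they are treated as 0.

```python
FILE_MB = 312          # size of each of the largest files (from the breakdown above)
ASSUMED_RESIDENT = 10  # ASSUMPTION: large objects held in memory at once

# (RAM in MB, swap in MB). Swap is only stated for the kernel devel box;
# for the other machines it is treated as 0 here, which is an assumption.
MACHINES = {
    "laptop": (1024, 0),
    "kernel devel box": (2048, 4096),
    "big box": (64 * 1024, 0),
}


def where_it_fits(need_mb: int, ram_mb: int, swap_mb: int) -> str:
    """Classify whether the working set fits in RAM, needs swap, or fits in neither."""
    if need_mb <= ram_mb:
        return "RAM"
    if need_mb <= ram_mb + swap_mb:
        return "swap"
    return "neither"


need_mb = FILE_MB * ASSUMED_RESIDENT
results = {name: where_it_fits(need_mb, ram, swap) for name, (ram, swap) in MACHINES.items()}

for name, verdict in results.items():
    print(f"{name}: {need_mb} MB working set fits in: {verdict}")
```

Under that assumption the working set is 3120MB: too big for the laptop, only possible with heavy swapping on the kernel devel box, and comfortable on the big box, which lines up with what happened.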

Additional commits

Now it was time to resume what I was doing before: “the project”. I made some additional changes to the files, and made another commit. Then it was time to push the changes. Git wasn’t happy. I wasn’t going to fight as I was getting tired, so I just rsync’d the 6 newly created objects to the server.

Recently, there have been some patches on the git mailing list to make git a little smarter about the way it uses multiple packfiles. This doesn’t apply to me, at least not yet.

Conclusions

So, here it is. Both of the version control systems I like to use (each one has its area where I wouldn’t want to switch to the other) die on me because my 3-year-old laptop has only 1GB of RAM. Just great. :-/ And please, don’t tell me about Subversion and other non-distributed VCS tools. As far as I know, the other distributed systems consume even more resources.

Comments (2)

http://www.youtube.com/watch?v=4XpnKHJAok8
Comment by [unknown] — January 1, 1970 @ 00:00
Yeah, I saw it. It wasn't very technical, but it was interesting to see Linus preach the ways of distributed version control.
Comment by [unknown] — January 1, 1970 @ 00:00

Josef “Jeff” Sipek