Josef “Jeff” Sipek

Modern Mercurial

I’ve been using both Git and Mercurial since they were first released in 2005. I’ve messed with the internals of both, but I always had a preference for Mercurial (its user interface is cleaner, its design is well thought-out, and so on). So, it should be no surprise that I felt a bit sad every time I heard that some project chose Git over Mercurial (or worse yet, migrated from Mercurial to Git). At the same time, I could see Git improving release after release—but Mercurial did not seem to. Seem is the operative word here.

A couple of weeks ago, I realized that more and more of my own repositories have been Git based. Not for any particular reason other than that I happened to type git init instead of hg init. After some reflection, I decided that I should convert a number of these repositories from Git to Mercurial. The conversion itself was painless thanks to the most excellent hggit extension that lets you clone, pull, and push Git repositories with Mercurial. (I just cloned the Git repository with a hg clone and then cleaned up some of the mess manually—for example, I don’t need the bookmark corresponding to the one and only branch in the original Git repository.) Then the real fun began.

I resumed the work on my various projects, but now with the brand-new Mercurial repositories. Soon after I started hitting various quirks with the Mercurial UI. I realized that the workflow I was using wasn’t really aligned with the UI. Undeterred, I looked for solutions. I enabled the pager extension, the color extension, overrode some of the default colors to be less offensive (and easier to read), enabled the shelve, rebase, and histedit extensions to (along with mq) let me do some minor history rewriting while I iteratively work on changes. (I learned about and switched to the evolve extension soon after.) With each tweak, the user experience got better and better.

Then it suddenly hit me—before these tweaks, I had been using Mercurial like it’s still 2005!

I think this is a very important observation. Mercurial didn’t seem to be improving because none of the user-visible changes were forced onto the users. Git, on the other hand, started with a dreadful UI so it made sense to enable new features by default to lessen the pain.

One could say that Mercurial took the Unix approach—simple and not exactly friendly by default, but incredibly powerful if you dig in a little. (This extensibility is why Facebook chose Mercurial over Git as a Subversion replacement.)

Now I wonder if some of the projects chose Git over Mercurial at least partially because by default Mercurial has been a bit…spartan.

With my .hgrc changes, I get exactly the information I want in a format that’s even better than what Git provided me. (Mercurial makes so much possible via its templating engine and the revsets language.)
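As a rough sketch of what such a configuration could look like, the shell function below writes a sample hgrc. The extension names are the ones mentioned above; the color override, the alias, and the function name write_sample_hgrc are illustrative assumptions, not the actual settings used here.

```shell
#!/bin/sh
# Write a sample Mercurial configuration to the file named by $1.
# Hypothetical helper: nothing here is copied from a real .hgrc.
write_sample_hgrc() {
    cat > "$1" <<'EOF'
[extensions]
# Bundled extensions named in the post.
pager =
color =
shelve =
rebase =
histedit =
mq =
# evolve is distributed separately; uncomment once it is installed.
# evolve =

[color]
# Example override only; the real color choices are not given in the post.
status.modified = cyan

[alias]
# Revset example: list changesets that are still in the draft phase.
draft = log -r "draft()"
EOF
}
```

Calling write_sample_hgrc hgrc.sample produces a file that can be reviewed and merged into ~/.hgrc by hand, which avoids overwriting an existing configuration.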

So, what does all this mean for Mercurial? It’s hard to say, but I’m happy to report that there are a number of good improvements that should land in the upcoming 4.2 release scheduled for early May. For example, the pager and color functionality is moving into the core and will be on by default.

Finally, I like my current Mercurial environment quite a lot. The hggit extension is making me seriously consider using Mercurial when dealing with Git repositories that I can’t convert.

git filter-branch

Recently, I had to rewrite some commits in a git repository. All I wanted to do was set the author and committer names and emails to the correct value for all the commits in a repository. (Have you ever accidentally committed with user@some.host.local as the email address? I have.) It turns out that git has a handy command for that: git filter-branch. Unfortunately, using it is a bit challenging. Here’s what I ended up doing. (In case it isn’t clear, I am documenting what I have done in case I ever need to do it again on another repository.)

The invocation is relatively easy. We want to pass each commit to a script that creates a new commit with the proper name and email. This is done via the --commit-filter argument. Further, we want to rewrite each tag to point to the new commit hash. This is done via the --tag-name-filter argument. Since we’re not trying to change the names of the tags, we use cat to simply pass each tag name through.

$ git filter-branch \
        --commit-filter '/home/jeffpc/src/poc-clean/process.sh "$@"' \
        --tag-name-filter cat \
        -- fmt4 load-all master
Rewrite a95e3603e5ec40e6f229e75425f1969f13c17820 (710/710)
Ref 'refs/heads/fmt4' was rewritten
Ref 'refs/heads/load-all' was rewritten
Ref 'refs/heads/master' was rewritten
v3.0 -> v3.0 (b56481e52236c8bd85e647c30bafad6ac651e3fb -> b53c5b3ae8e18de02e1067bada7a0f05d4bcd230)
v3.1 -> v3.1 (993683bf104f42a74a2c58f2a91aee561573f7cc -> 1a1f4ff657abc8e97879f68a5dc4add664980b71)
v3.2 -> v3.2 (090b3ff1a66fa82d7d8fc99976c42c9495d5a32f -> 60fbeb91b689c65217b5ea17e68983d6aebc0239)
v3.3 -> v3.3 (4fb6d3ac2c5b88e69129cefe92d08decb341e1ae -> dd75fbb92353021c2738da2848111b78d1684405)

Caution: git filter-branch changes the directory while it does all the work so don’t try to use relative paths to specify the script.
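One way to sidestep the relative-path problem is to resolve the script location to an absolute path before invoking git. This is a minimal sketch; the helper name abs_path and the ./process.sh location in the comments are assumptions for illustration.

```shell
#!/bin/sh
# Print an absolute path for the (existing) file named by $1.
# Hypothetical helper, not part of git.
abs_path() {
    dir=$(cd "$(dirname "$1")" && pwd) || return 1
    printf '%s/%s\n' "$dir" "$(basename "$1")"
}

# Possible usage (breaks if the path contains spaces):
#   filter=$(abs_path ./process.sh)
#   git filter-branch --commit-filter "$filter \"\$@\"" \
#           --tag-name-filter cat -- master
```

Because the path is computed before git changes directories, the filter keeps working no matter where the rewrite happens.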

The commit filter script is rather simple:

#!/bin/sh

name="Josef 'Jeff' Sipek"
email="jeffpc@josefsipek.net"

export GIT_AUTHOR_NAME="$name"
export GIT_AUTHOR_EMAIL="$email"
export GIT_COMMITTER_NAME="$name"
export GIT_COMMITTER_EMAIL="$email"

exec git commit-tree "$@"

It just sets the right environment variables to pass the right name and email to git commit-tree, which writes out the commit object.
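To check the result afterwards, it may help to tally which author and committer identities remain in the history. The sketch below assumes the identities are fed in on standard input, one per line; the function name tally_identities is made up for this example.

```shell
#!/bin/sh
# Count identical "Name <email>" lines from stdin, most frequent first.
# Hypothetical helper for inspecting a rewritten history.
tally_identities() {
    sort | uniq -c | sort -rn
}

# Possible usage inside the rewritten repository:
#   git log --all --format='%an <%ae>%n%cn <%ce>' | tally_identities
```

If the rewrite covered every ref, the output should consist of a single line with the corrected name and email.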

That’s it! I hope this helps.

OLS 2008 - Day 4

I’m still a day late when it comes to writing about OLS. Here’s Friday’s list of talks, and other happenings.

The day began with A Practical Guide to using Git (From a Kernel Maintainer) — it was very crowded in the room, so much so that I didn’t really see the slideshow, but since I already know enough about how to use Git, I don’t mind all that much. Good talk.

The next talk which I kinda had to go to was SynergyFS: A Stackable File System Creating Synergies Between Heterogeneous Storage Devices. It was a disaster — and I put that mildly. The first 30 minutes of the 45 minute talk consisted of Samsung branded marketing material showing that solid state disks were better than the regular platter-based disks. Since the marketing people care mostly about Windows users, the propaganda materials consisted of things like a video thing showing Microsoft Windows Vista booting on two identical laptops — with the exception of the storage device.

Anyway, about 2/3 of the way through the talk, SynergyFS finally got mentioned. And that’s when one quite important bit came out. At the time of the writing of the paper, the filesystem was a “proposed filesystem.” In other words, it didn’t exist. I am not certain if it exists at the moment, and if it does, what state it is in, but I do know (since an audience member asked when/where he could look at the code) that unless one signs an NDA with Samsung, he can’t even look at it. The code is not GPL licensed, since Samsung lawyers apparently see it as a way to lose some magical intellectual property, which as far as I know they never had. There have been papers published about hybrid storage, there have been papers published about fanout stackable filesystems, there have been papers published about fanout stackable filesystems which use different storage technologies (in no particular order: FiST, GreenFS, RAIF, Unionfs).

Overall, I feel like going to the talk was a waste of time. Meh.

Then I lunched.

Well, just before lunch, I was playing around with SELinux on my laptop, and after logging in, the processes weren’t getting the right context. After lunch, I went to SELinux for Consumer Electronic Devices. I walked into the room, and saw the NSA/Tresys/RedHat SELinux developers (including Dave Quigley) clustered in one area of the room. I just couldn’t resist, and I said “SELinux sucks” and then proceeded to walk away. The really amusing thing was all the SELinux people turned around to see who it was that dared to say such a thing. Very amusing. I sat next to them, and mentioned my SELinux problem. Stephen Smalley tried to figure out what the problem was, and in the end, reached the conclusion that somehow, even though the targeted policy was in use, the system was using some information from the strict policy, completely confusing everything. I installed the strict policy, and things started working…well, for the most part. I should file this in the Debian bug tracker since it is a bug.

The SELinux talk was ok. It was what I expected…SELinux is kinda bloated for embedded systems. Some time after the talk, I overheard Stephen Smalley talking to Dave, saying that they should look into it a bit.

The next talk which I went to was Around the Linux File System World in 45 minutes. The reason I went to it was because it was being presented by Steve French. It was interesting, as I expected, and I’m going to read through his paper to see what exactly he did for the accounting (and what his thoughts are).

After Steve’s talk, I was going to go to a BOF about the MIPS kernel port, but got distracted by people (including Steve).

At first, I wasn’t sure if I was going to go to the keynote (The Joy of Synchronicity) by Mark Shuttleworth (of the Ubuntu, space travel, and other-random-stuff fame). The title alone makes it sound like a hand-wavy, dreamy thing, but in the end I decided to at least spend 5 minutes listening. It was ok. Not great, from a technical perspective, but he did have some interesting ideas…well, it was really all just one idea — open source projects should have regular release schedules. I don’t know if I agree or not. On one hand it’s a nice thing, but at the same time, schedules are quite annoying when you want to make major changes (the KDE 3.x to 4.0 changes come to mind). In the end, I did stay the entire time, but I bailed at the beginning of the Q&A session.

Some food later, I headed to the hotel room to finish up writing notes for the day before. Well, I tried to upgrade my Wordpress install…but more about that later.

Making Version Control Systems Really Go Boom

This is part 2 of my adventures of making version control systems go boom.

As I described before, I need to version some reasonably large files. After trying Mercurial and Git, I decided to go with Git as it presented me with fewer problems.

To make matters worse than before, I now need to version 3 files which are about 2.7GB in size each. I tried to git-add the directory, but I got this wonderful message:

$ git-add dir/
The following paths are ignored by one of your .gitignore files:
dir/ (directory)
Use -f if you really want to add them.
$ git-add -f dir/
fatal: dir/: can only add regular files or symbolic links

Wha?

  1. I don’t have any .gitignore files in this repository
  2. Adding a directory like that worked (and still works!) on other directories

Really painful. Time to experiment, but first I run git-status to see what other files I have not committed yet, and I see everything listed except the directory!…So, I moved one of the files to the top directory of the repo, ran git-status — the file did not show up — but tried to add it anyway:

$ git-add file
fatal: pathspec 'file' did not match any files

Ok, this time around, I at least get an error message which I’ve seen before. It is still wrong, but oh well. Thankfully, the program that uses these files has been made in such a way that it can handle filesystems which don’t support files larger than 2GB. I regenerate the file, and now I have 2 files, the first one 2GB and the other 667MB. git-status displays both. Great! git-add on the smaller file works flawlessly, but…you guessed it! Adding the larger file dies. With which error message?

fatal: Out of memory, malloc failed

Yep, great. My laptop’s 1GB of RAM just isn’t good enough, eh? I’m not quite sure what I’ll do, I’ll probably scp everything over to a box with 2+GB RAM, and commit things there. This really sucks :-/

Update: I asked around on IRC (#git) where I got a few pointers and the code confirms things…it would seem that git-hash-object tries to mmap the entire file. This explains the out of memory error. The other problem is the fact that the file size is stored in an unsigned long variable, which is 32 bits on my laptop. Oh well, so much for files over 4GB. I think (but I’m not sure, and I’m too lazy to check) that the stat structure may return a signed int, which would limit things to 2GB, which is what I see.
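The arithmetic behind those two limits can be sketched in a few lines of shell. The helper names are made up, the 2700000000-byte size is an assumed round figure for a "2.7GB" file, and the sketch assumes a shell with 64-bit arithmetic (true of current bash and dash).

```shell
#!/bin/sh
# Largest value that fits in a signed or unsigned integer of $1 bits.
max_signed()   { echo $(( (1 << ($1 - 1)) - 1 )); }
max_unsigned() { echo $(( (1 << $1) - 1 )); }

# Succeed if size $1 (in bytes) is no larger than limit $2.
fits() { [ "$1" -le "$2" ]; }

size=2700000000   # assumed: roughly one of the 2.7GB files

if fits "$size" "$(max_signed 32)"; then
    echo "fits in a signed 32-bit size"
else
    echo "too big for a signed 32-bit size"
fi

if fits "$size" "$(max_unsigned 32)"; then
    echo "fits in an unsigned 32-bit size"
else
    echo "too big for an unsigned 32-bit size"
fi
```

A signed 32-bit value tops out just under 2GiB and an unsigned one just under 4GiB, so a file of this size lands between the two limits.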

Making Version Control Systems Go Boom

So, time has come, once again, to talk of many things…of Git and Mercurial. :)

For a fun project which I’ll describe here some other time, I want to version about 2GB of files. Here’s the breakdown:

  • 5x 312MB
  • 3x 100MB
  • 2x 16MB
  • 80 other files all under 5MB each

My first instinct was to use Mercurial, and so I did. It made sense, because it stores compressed deltas for the files. I don’t expect more than ~20MB to change between two consecutive versions, so it made sense on an architectural level as well.

The setup

There are a number of computers involved; unless I say otherwise, I’m talking about my laptop.

  • laptop: 3.06GHz P4, 1GB RAM
  • server: Athlon 2000, 1.25GB RAM
  • kernel devel box: 2x 2.8GHz Xeon, 2GB RAM, 4GB swap
  • big box: 4x 1.8GHz Opteron, 64GB RAM

Unfortunately, I can’t use the “big box” much. :( Oh well.

Attempt #1: Mercurial

First, I set up the directory hierarchy with all the files. Virtually all of the data in the 100MB & 312MB files consists of binary zeros, so it came as no surprise that the initial commit created approximately 50MB worth of history. Not bad at all! I ran some commands that changed the files the way I wanted, and committed each time I felt it was a good place to checkpoint. Mercurial’s compressed delta way of storing history really worked well: only a 4MB increase in history between the initial and the 6th commit.

At this point, I decided that I should make a clone on another computer (yeah, I use distributed version control systems for backups of individual projects). :) Now, this is where things went crazy. I initiated a clone on my server, and after about two minutes, the hg process on my laptop died with a memory allocation error. That sucks. It was probably because of the protocol, which tries to uncompress everything, and recompress it to save bandwidth. Since I was on a LAN, I tried to use the --uncompressed option, which doesn’t try to be smart, and just wastes bandwidth, but I forgot that I need to enable it on the server side, and so unknown to me, it still tried to compress the data. It died with a memory error, just as before. Oh well. At this point, I decided to try Git for this project.
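For reference, the server-side switch for uncompressed clones lives in the [server] section of the hgrc of the repository being served. The sketch below appends it; the function name and the repository paths in the comments are hypothetical.

```shell
#!/bin/sh
# Allow uncompressed (streaming) clones of the repository rooted at $1.
# Hypothetical helper; it only appends to that repository's .hg/hgrc.
enable_uncompressed_clone() {
    cat >> "$1/.hg/hgrc" <<'EOF'
[server]
uncompressed = True
EOF
}

# Possible usage, run on the machine that serves the repository:
#   enable_uncompressed_clone /path/to/repo
# and then, on the machine doing the clone:
#   hg clone --uncompressed ssh://laptop//path/to/repo
# (newer Mercurial releases spell the option --stream)
```

Whether this would have avoided the memory error is untested here; it only shows where the setting goes.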

Attempt #2: Git

Git uses a different storage scheme, well it actually has two. Whenever you commit, git stores the full file versions — compressed. I did a quick conversion of the hg repo to git — by hand as there were only 6 commits. I had to use:

hg update -C <rev>

otherwise, hg was trying to be too smart — something that makes you run out of memory. :)

After the conversion, the resulting .git repo was also about 50MB in size. Everything worked just as well. It is possible that the commits took a little bit less time, as committing consists of just compressing the files, and storing them on disk. I am not sure which one was faster, and knowing how each works doesn’t help with psychological effects :)

Anyway, it was time for me to clone the repository, again going from my laptop to the server. I was afraid of this step, because when git transfers data between repositories, it tries to conserve bandwidth by making a packfile: a file containing a number of deltified objects (such as the compressed files stored during commit). It started to create the packfile, but it died with a nice message saying that it ran out of memory. Great! Now what? At that point, I decided to cheat. Since I need a packfile sooner or later, I just rsync’d the whole git repo to the kernel test box I have (a box that has twice the RAM, and 4GB of swap), and I tried to clone from that. It got to about 66% done, when it was using most of the RAM, and far too much swap. After about an hour and twenty minutes, I decided to rsync the repo to the box that has 64GB of RAM. On it, I ran the commands necessary to just create a pack file, without pulling/pushing/cloning. In about 10 minutes, it was done. Great! I then aborted the clone that had been running for an hour and a half, and cloned from the repo that had the packfile all set up. Everything worked rather nicely :) I moved things back onto my laptop.
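The exact commands used to build the pack file are not recorded above. With a current Git, one way to create a single pack while keeping memory use down might look like the sketch below; the function name and the 256m limit are assumptions.

```shell
#!/bin/sh
# Repack the repository at $1 into a single pack file, using one thread
# and capping the memory used for the delta search window.
# Hypothetical helper; the limits are arbitrary example values.
repack_low_memory() {
    git -C "$1" -c pack.threads=1 -c pack.windowMemory=256m repack -a -d
}

# Possible usage:
#   repack_low_memory /path/to/repo
```

Once the objects are already packed, a later clone can usually reuse the existing pack data instead of recomputing all the deltas.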

Additional commits

Now it was time to resume what I was doing before, “the project”…I made some additional changes to the files, and made another commit. And it was time to push the changes. Git wasn’t happy. I wasn’t going to fight as I was getting tired, so I just rsync’d the 6 newly created objects to the server.

Recently, there have been some patches on the git mailing list to make git a little smarter about the way it uses multiple pack files. This doesn’t apply to me - at least not yet.

Conclusions

So, here it is. Both of the version control systems I like to use (each one has its area where I wouldn’t want to switch to the other) die on me because my 3-year-old laptop has only 1GB of RAM. Just great. :-/ And please, don’t tell me about Subversion, and other non-distributed vcs tools. As far as I know, the other distributed systems consume even more resources.

Guilt: Taking over the world one repository at a time

It is really interesting how sometimes a bit of luck makes things happen. For example, a little over 6 months ago, I wrote a few shell scripts, which I called gq, to make my life a little easier. I worked on them for about a week, and then I decided I should share them with the community. So I tagged the sources as version 0.10, and announced it on the git mailing list. One of the comments I got was about the fact that there is another project (completely unrelated) that had the name gq for a long time. Oh well, it was time for me to rename it. After some procrastination and hacking, the new year rolled around, and I decided to release the 6th version (v0.15), but this time it wouldn’t be gq anymore; instead I would call it Guilt. My post from January describes how it got the name. As with every version of gq, I announced Guilt v0.15. I could see that Guilt was getting way better, and so I felt even more motivated to hack on it. v0.16 came out. And then a very unexpected thing happened. I got two patches from a guy on the mailing list. Sweet! I applied them, and released v0.17. Shortly thereafter, during the Linux Storage and Filesystem (LSF) workshop in San Jose, I got a patch from Ted Ts’o (of the ext[234] fame). I couldn’t believe it, but it was true. I decided to release v0.19 the next day. At LSF, I met Brandon Philips, and we talked about Guilt. Rather shortly after LSF, he sent me an email saying that he’ll try to get Guilt into Debian. :) Well, about a month ago, he succeeded.

As many of you may already know, I stick around a number of channels on OFTC’s IRC network, and it is rather interesting to see people try Guilt, or people talk about Guilt; generally suggesting that someone use it — and people do!

Anyway, I hope I didn’t bore everyone to death with my little tour of history behind Guilt.

Git Quilt or Guilt for short

Here’s another update on my version control system escapades (a follow up to Do I have…).

As several people mentioned during the 0.10 release of gq, the name is already in use by a rather well established project. So, after some idling and hacking, I decided that it was time to give the scripts a new name, and announce the new version on the git and linux-kernel mailing lists (the announcement). I can’t take credit for the rather clever name; I asked a few people, and the best suggestion was by Dave - Git Quilt or Guilt for short.

One thing I did not expect was the fact that someone would contribute 2 patches very shortly after I announced it. Here’s the list of changes made between v0.16 and v0.17:

Horst H. von Brand (2):
      Fix up Makefiles
      Run regression on the current version

Josef 'Jeff' Sipek (24):
      A minimalistic makefile
      Contributing doc file
      Added guilt-add
      Added guilt-status
      Expanded the HOWTO
      Added usage strings to all commands
      All arguments to guilt-add are filenames
      More thorough argument checking & display usage string on failure
      Changed status file format to include the hash of the commit
      Fixed guilt-refresh doing an unnecessary and somewhat wrong pop&push
      Fixed up guilt-{delete,pop} not matching the patch name properly
      Fixed guilt-{delete,pop} regexps some more
      Force UTC as timezone for regression tests
      Fixed a bug in guilt-pop introduced by the status file format switch
      Error messages should go to stderr
      Merge branch 'usage'
      Merge branch 'status-file'
      Yet another TODO update
      Added guilt-rm
      Makefile update & cleanup
      pop: Display the name of the patch from the status file, not the series file
      new: Create dir structure for the patch if necessary
      Documentation/TODO: Mark guilt-rm as done
      Guilt v0.17

I haven’t had much time to work on Guilt since then, but I got a rather encouraging email from someone who tried to apply Andrew Morton’s -mm patch series on top of the kernel tree, but failed. The problem is with the way git-apply works. If it applies a patch with an offset, it still returns non-zero status. This makes guilt think that at least one of the hunks in the patch did not apply at all. As far as I know, there is no way to get the necessary information out of git-apply without either modifying it (which I might as well do), or parsing the output for signs of rejection and ignoring the return status completely. I don’t like the latter, but changing git-apply would limit the number of compatible git versions. :-/

Needless to say, patches are welcomed :)

Do I have a thing for Version Control Systems?

So, for whatever reason, I seem to be working on version control systems far too much. I have a decent amount of code in Mercurial, and I wrote a bunch of wrappers for CVS (I call them CDS, which stands for Completely Dumb System, an apt description of CVS). And now I am working on gq (git repo: git://git.kernel.org/pub/scm/linux/kernel/git/jsipek/gq.git) which is a porcelain (set of wrapper scripts for git) that gives Mercurial Queues-like functionality to git users.

Yep, I think it is official, I have a thing for version control systems. Ever since I became very interested in them (~April 2005), I learned a lot about them, and I am kind of tempted to give it a go and try something of my own. :)

OLS 2006 - Day 5

The day began with an awesome presentation I gave about Unionfs. :) Shawn was recording it, but after the presentation, he found out that the video turned out to be crap. He has audio only. I’m sure he’ll share it soon. :) I was pleasantly surprised at the number of people that use Unionfs or were interested in Unionfs.

The keynote was excellent as always. However I must say that Greg K-H made it sound like any piece of code will get into the kernel. Yeah, right :) But he did say a few nice things about the status of Linux.

After the keynote, there was the GPG key signing - which I did not attend, although I wanted to. Instead we went to get some food. Food was good, we (I, Dave, Mike Halcrow, and Prof. Zadok) talked about a bunch of things ranging from MythTV and terabyte storage servers, to things like the number of ants in Texas. (Apparently, it is a lot of fun to watch termites and fire ants battle to the death. O_o )

We finished food around 19:45 which was about right to head over to the Black Thorn for the after event party. Just as last year it was quite interesting. Pretty much as soon as I got there, I noticed Petr Baudis aka. pasky - the cogito maintainer. We chatted about how git and Mercurial differ (Matt’s talk the day before came in handy :) ). I mentioned I was slowly working on a generic benchmark script that would test a number of popular SCMs including Mercurial, Subversion, and CVS. He was thrilled about the prospect of knowing exactly where git sucked compared to other SCMs - my guess is that he wants to fix it and make it better, a noble goal, but unnecessary as Mercurial already exists and why reinvent the wheel? ;) Seriously, though, I think a lot of people would benefit from knowing exactly where each SCM excels, and where each sucks. The nice thing about collaborating with the git people would be that it would make it more apparent that this wouldn’t just be yet-another-fake-test. After some time, a bunch of other Czech people popped up right next to us (Pavel Machek, among others). It was quite interesting. :)

After that I joined a conversation with some Intel people. As it turns out, one of the Intel people is working on the e1000 driver (awesome piece of hardware, by the way, don’t ever buy anything other than it). :) Some time later, Jens Axboe joined the group briefly. When he said my name seemed familiar, I mentioned how I tried to implement IO priorities - and failed :) Later on, a guy from University of Toronto joined the group. He approached me earlier in the day about unionfs on clusters. We chatted about things ranging from school (undergraduate program, and grad school) to submitting kernel code to lkml. The e1000 guy said a similar thing: that we should split unionfs up into a few patches, and send it off. During the event a few people still asked me about Unionfs, which felt good :)

Then, I decided that it would be fun to talk to some IRC people. I found John Levon and Seth Arnold. We sat down, and had an interesting conversation about a number of things. Since at least some of these were quite interesting, here’s a list:

  1. How can I deal with VFS and not drink vodka or other hard liquor
  2. Everybody hates CDE, even people at Sun
  3. Solaris is dead (well, they didn’t say it, but that’s the feeling I got)
  4. Britons have some interesting sports, or at least some of the expected behavior during the sport is interesting, namely:
       1. darts - you are expected to drink as you play
       2. I can’t recall the name - gigantic pool table
       3. cricket - everyone smokes "reefer" (to quote Movement, I just find this name of the substance mildly amusing) because their games sometimes take several days

After that, they kicked everyone out as it was 2:45 already. We (Seth, John, and I) went back to the hotel. There, Prof. Zadok and Chip (who arrived on Friday) were about to get up and head to the airport. :) I just went to bed.
