Josef “Jeff” Sipek

Installing Debian under FreeBSD's bhyve

This weekend I tried to move my work dev vm to a slightly beefier vm host. This meant moving the vm from kvm on OmniOS to bhyve on FreeBSD 12.1. After moving the disk images over, I tried to configure vm-bhyve to use them as-is. Unfortunately, grub2-bhyve fails to boot XFS root file systems with CRC support and there was no good way to convert the disk image to get it to boot via UEFI instead (to avoid grub2-bhyve completely). So, I decided to simply reinstall it.

In theory, this shouldn’t have been too difficult since I had the foresight to have /home on a separate virtual disk. In practice, I spent several hours reinstalling Debian Buster over and over, trying to figure out why it’d install and reboot fine, but subsequent boots wouldn’t make it past the EFI firmware.

It turns out that Debian tries to make multi-booting possible and puts its EFI binary into a non-standard location. That combined with bhyve not persisting EFI variables after shutdown results in any boot after the the first poweroff not even trying to look at the Debian-specific path.

This is not a Debian bug, but rather bhyve’s EFI support being incomplete. The easiest way around this is to copy the Debian binary into the standard location immediately after installation. In other words:

# cd /boot/efi/EFI
# mkdir BOOT
# cp debian/grubx64.efi BOOT/bootx64.efi

That’s all that’s needed. The downside to this is that the copy will not get automatically upgraded when grub gets an update.

For completeness, here are the relevant bits of the vm’s config:

loader="uefi"
graphics="yes"
cpu="6"
memory="1G"
network0_type="virtio-net"
network0_switch="public"
zfs_zvol_opts="volblocksize=4096"
disk0_dev="sparse-zvol"
disk0_type="virtio-blk"
disk0_name="disk0"
disk1_dev="sparse-zvol"
disk1_type="virtio-blk"
disk1_name="disk1"

Prague & IETF

This post is part of a series named “Europe 2017” where I share photos from my adventures in Europe during the summer 2017.

In mid-July 2017, I got to travel to Prague to participate in IETF’s 99th meeting.

The IETF meeting itself was great—lots of useful discussion about the next-generation email accessing protocol (called JMAP).

I stayed a couple of days extra to enjoy Prague, and Holly flew out from Helsinki to revisit Prague where she’s been once before—for our honeymoon.

I dragged my D750 and the two lenses with me and made to sure to take photos (almost) all the time. The gallery contains only a handful of the approximately 1100 raw files. Of those, I selected 11 for this blahg post.

Wikipedia article: Malá Strana Bridge Tower:

Wikipedia article: St. Nicholas Church with the Wikipedia article: Žižkov Television Tower in the background:

Wikipedia article: Matthias Gate with Wikipedia article: St. Vitus Cathedral peeking in the background:

The Wikipedia article: National Theatre:

Wikipedia article: Charles Bridge and a view of Wikipedia article: Old Town:

Wikipedia article: St. Vitus Cathedral from Wikipedia article: Petřín near sunset:

St. Nicholas Church again during sunset (and without the ugly Žižkov TV tower):

A gargoyle keeping the St. Nicholas Church’s masonry safe:

A view from the top of Wikipedia article: Old Town Bridge Tower with roofs and towers of (left to right):

Church of Saint Wikipedia article: Francis of Assisi, the left tower of Wikipedia article: Church of Our Lady before Týn, the clock tower and the Astronomical tower of Wikipedia article: Clementinum:

St. Nicholas Church yet again, this time from the Malá Strana Bridge Tower:

Wikipedia article: Charles Bridge, Wikipedia article: Old Town Bridge Tower, Church of Saint Wikipedia article: Francis of Assisi, and Wikipedia article: Žižkov Television Tower (from the Malá Strana Bridge Tower):

Prague offers a lot to see. The few photos I selected for this blahg post don’t show anywhere near enough of it. There are more photos in the gallery, but even those are merely highlights of what one can see in Prague. If you haven’t been to Prague yet, I highly recommend a trip.

Creative xor Use

Last month at work I got to try to optimize a function that takes a number and rounds it up to the next power of 2. The previous implementation used a simple loop. I didn’t dive into obscure bit twiddling, but rather used a helper function that is already in the codebase. Yes, I let the compiler do the heavy lifting of turning easy to understand code into good machine code. The x86 binary that gcc 6.3 produced has an interesting idiom, and that’s why I’m writing this entry.

The new code:

static inline unsigned int bits_required32(uint32_t num)
{
        return num == 0 ? 0 : 32 - __builtin_clz(num);
}

/* Returns x, such that x is the smallest power of 2 >= num. */
uint32_t nearest_power(uint32_t num)
{
	if (num == 0)
		return 1;

        return 1U << bits_required32(num - 1);
}

This is a slightly simplified version of the code, but it demonstrates the optimization quite well.

The nearest_power function disassembles as:

nearest_power()
    nearest_power:      8b 54 24 04        movl   0x4(%esp),%edx
    nearest_power+0x4:  b8 01 00 00 00     movl   $0x1,%eax
    nearest_power+0x9:  85 d2              testl  %edx,%edx
    nearest_power+0xb:  74 14              je     +0x14	<nearest_power+0x21>
    nearest_power+0xd:  83 ea 01           subl   $0x1,%edx
    nearest_power+0x10: 74 0f              je     +0xf	<nearest_power+0x21>
    nearest_power+0x12: 0f bd d2           bsrl   %edx,%edx
    nearest_power+0x15: b9 20 00 00 00     movl   $0x20,%ecx
    nearest_power+0x1a: 83 f2 1f           xorl   $0x1f,%edx
    nearest_power+0x1d: 29 d1              subl   %edx,%ecx
    nearest_power+0x1f: d3 e0              shll   %cl,%eax
    nearest_power+0x21: c3                 ret    

The first 6 instructions contain the prologue and deal with num being zero or one—both cases produce the result 1. The remaining 6 instructions make up the epilogue and are where the calculation happens. I’m going to ignore the first half of the function, since the second half is where the interesting things happen.

First, we get the number of leading zeros in num - 1 and stash the value 32 in a register:

    nearest_power+0x12: 0f bd d2           bsrl   %edx,%edx
    nearest_power+0x15: b9 20 00 00 00     movl   $0x20,%ecx

The number of leading zeros (%edx) is in the range 0–31.

Here is the really interesting bit:

    nearest_power+0x1a: 83 f2 1f           xorl   $0x1f,%edx

This xors the number of leading zeros (i.e., 0–31) with 31. To decipher what this does, I find it easier to consider the top 27 bits and the bottom 5 bits separately.

operand binary
0x1f 00000000 00000000 00000000 000 11111
edx 00000000 00000000 00000000 000 xxxxx

The xor of the top bits produces 0 since both the constant 31 and the register containing any of the numbers 0–31 have zeros there.

The xor of the bottom bits negates them since the constant has ones there.

When combined, the xor has the same effect as this C expression:

out = (~in) & 0x1f;

This seems very weird and useless, but it is far from it. It turns out that for inputs 0–31 the above expression is the same as:

out = 31 - in;

I think it is really cool that gcc produced this xor instead of a less optimal multi-instruction version.

The remainder of the disassembly just subtracts and shifts to produce the return value.

Why xor?

I think the reason gcc (and clang for that matter) produce this sort of xor instruction instead of a subtraction is very simple: on x86 the sub instruction’s left hand side and the destination must be the same register. That is, on x86 the sub instruction works as:

x -= y;

Since the destination must be a register, it isn’t possible to express out = 31 - in using just one sub.

Anyway, that’s it for today. I hope you enjoyed this as much as I did.

bool bitfield:1

This is the first of hopefully many posts related to interesting pieces of code I’ve stumbled across in the dovecot repository.

Back in 1999, C99 added the bool type. This is old news. The thing I’ve never seen before is what amounts to:

struct foo {
	bool	a:1;
	bool	b:1;
};

Sure, I’ve seen bitfields before—just never with booleans. Since this is C, the obvious thing happens here. The compiler packs the two bool bits into a single byte. In other words, sizeof(struct foo) is 1 (instead of 2 had we not used bitfields).

The compiler emits pretty compact code as well. For example, suppose we have this simple function:

void set(struct foo *x)
{
	x->b = true;
}

We compile it and disassemble:

$ gcc -c -O2 -Wall -m64 test.c
$ dis -F set test.o
disassembly for test.o

set()
    set:     80 0f 02           orb    $0x2,(%rdi)
    set+0x3: c3                 ret

Had we used non-bitfield booleans, the resulting code would be:

set()
    set:     c6 47 01 01        movb   $0x1,0x1(%rdi)
    set+0x4: c3                 ret

There’s not much of a difference in these simple examples, but in more complicated structures with many boolean flags the structure size difference may be significant.

Of course, the usual caveats about bitfields apply (e.g., the machine’s endian matters).

Working @ Dovecot

It’s been a hectic couple of weeks, and so this post is a bit delayed. Oh well.

A couple of months ago, I decided that it was time for me to move on work-wise. As a result, four weeks ago, I joined Dovecot Oy (a part of Open-Xchange).

As you may have guessed from the name of the company, I get to spend my time making the Dovecot email server code better, more featureful, and otherwise more excellent. It is certainly a significant but fun change—going from kernel hacking on a fairly unknown operating system to hacking on the world’s most popular IMAP server. Not a day goes by where I’m not surprised just how much functionality is in the Dovecot codebase, or when I get to consult an RFC related to some IMAP extension I didn’t even know existed.

So, with this said, you should expect to see some posts related to Dovecot, Dovecot code, and email in general.

Wannalancit Mill

As you may or may not know, Nexenta has a small office in Massachusetts, and so I end up in Wikipedia article: Lowell a couple of times a week. Lowell is a decent size city with a history of textile production. As a result, the city is peppered with old mills, most of which have been converted to office and apartment buildings and a couple serve as museums.

The Wannalancit mill is one of the mills that ended up turning into an office building. (It is connected to the adjacent Suffolk mill so I often forget that technically they are separate buildings. You’ve been warned.)

The thing that makes this mill more interesting is that the National Park Service maintains an operational water turbine in the basement. The turbine turns a large flywheel (I am guessing it is about 5 m in diameter).

The turbine itself is in the “basement” along with other goodies. The basement is not very well lit, but the D750 performed quite well even at ISO 5000–8000.

The turbine is geared to the flywheel.

Finally, here is the turbine (inside the red-brown metal object in the background) and the governor (green machine in the foreground) controlling the amount of water entering the turbine and therefore the amount of energy getting stored in the flywheel.

The park service shows up in the mornings to “turn-on” the turbine.

A close-up of the governor:

The table on the left was used for repairing of the 2 cm thick leather belts. I got the impression that the four section cabinet housed containers of oil used to lubricate various parts around the mill.

There are more than three times as many photos in the full gallery. Enjoy!

Bugs in Time

Recently, I blahgd about GCC optimizing code interestingly. There, I mentioned a couple of bugs I’ve stumbled across. I’m going to talk more about them in this post.

ddi_get_time

It all started when I got assigned a bug at work. “The installer hangs while checking available disks.” That’s the extent of the information I was given along with a test system. It didn’t take long to figure that devfsadm -c disk was waiting on a kernel thread that didn’t seem to be making any progress:

swtch+0x141
cv_timedwait_hires+0xec
cv_reltimedwait+0x51
ibdm`ibdm_ibnex_port_settle_wait+0x5f
ib`ibnex_bus_config+0x1e8
devi_config_common+0xa5
mt_config_thread+0x58
thread_start+8

The function of interest here is ibdm_ibnex_port_settle, but before I talk about it I need to mention that the ibdm kmod stashes a ddi_get_time timestamp of when the HCA attached. Now, ibdm_ibnex_port_settle calls ibdm_get_waittime to get a delay to feed to cv_reltimedwait. The delay is (more or less) calculated as: ddi_get_time() - hca_attach_time. This works fine as long as ddi_get_time continues incrementing at a constant rate (1 sec/sec).

You may already see where this is going. The problem is that ddi_get_time returns a Unix timestamp based on the current time-of-day clock. If the TOD setting changes for whatever reason (daylight saving time adjustments, NTP, etc.), the value returned by ddi_get_time may change non-monotonically. This makes it unsuitable for calculating timeouts and wait times. Converting ibdm_get_waittime to use a monotonic clock source (like gethrtime or ddi_get_lbolt) fixes this bug. (Illumos bug 4777)

Things get a bit worse. While figuring out what ddi_get_time does, I noticed that the man page actively encouraged developers to use it for timeouts. (Illumos bug 4776)

Of course, once I knew about this potential abuse, I had to check that there weren’t similar issues elsewhere in the kernel… and so I got to file bugs for iprb (4778), vhci (4779), COMSTAR iSCSI target (4780), sd (4781), usba (4782), emlxs (4786), ipf (4787), mac (4788), amr (4789), arcmsr (4790), aac (4791), and heci (4792).

I’m fixing all except: amr, arcmsr, aac, and heci.

NANOSEC

While developing the series of fixes mentioned in the previous section, I ran into the fact that NANOSEC was defined as 1000000000. This made it an int — a 32-bit signed integer (on both ILP32 and LP64).

If NANOSEC (defined this way) is used to convert seconds to nanoseconds (by multiplying), the naive approach will fail with quantities larger than 2 seconds. For example (hrtime_t is a 64-bit signed int):

hrtime_t convert(int secs)
{
        return (secs * NANOSEC);
}

Since both secs and NANOSEC are integers, the compiler will compute the product and then sign extend the result to 64-bits. If you look around the Illumos codebase, you’ll see plenty of places that cast or use ULL or LL suffix to make the compiler do the right thing. Why not just change the definition of NANOSEC to include a LL suffix releaving the users of this tedious (and error prone!) duty? Well, now you know what Illumos bug 4809 is about. :)

So, I changed the definition and rebuilt everything. Then, using wsdiff (think: recursive diff that understands how to compare ELF files) I found two places where the before and after binaries differed for non-trivial reasons. (I define a trivial reason as “the compiler decided to use registers differently, but the result is the same”.) Each non-trivial difference implies that there was an expression that changed — it used to be busted!

The first difference was in ZFS (Illumos bug 4810). There, spa_async_tasks_pending miscalculated a timeout making the condition always true.

The second difference was in in.mpathd. 4811). This daemon has a utility function to convert a struct timeval into a hrtime_t. You can read more about it in my previous post.

Before the NANOSEC change, I would have needed casts to fix this. With the change in definition, I don’t have to change a thing! And that’s how a one liner closed three bugs at the same time:

commit b59e2127f21675e88c58a4dd924bc55eeb83c7a6
Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Date:   Mon Apr 28 15:53:04 2014 -0400

    4809 NANOSEC should be 'long long' to avoid integer overflow bugs
    4810 spa_async_tasks_pending suffers from an integer overflow bug
    4811 in.mpathd: tv2ns suffers from an integer overflow bug
    Reviewed by: Marcel Telka <marcel.telka@nexenta.com>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Approved by: Robert Mustacchi <rm@joyent.com>

Greetings from Nexenta

In case you missed it, back in mid-2011 I discovered Illumos and OpenIndiana. At that point, I already missed hacking on the (Linux) kernel. Based on my blahg posts [1,2], it shouldn’t surprise you that it didn’t take long before I wanted to hack on the Illumos kernel…and so I did.

If you ever contributed to an open source project in your free time while employed full-time, you understand that there’s only so much time you can devote to the open source project and therefore there is only so much you can do.

A couple of months ago, I decided to explore the possibility of working full-time on Illumos. There are only a handful of companies that visibly participate in the Illumos ecosystem, but their use of Illumos is pretty varied (from public clouds to virtualized databases to SAN/NAS appliances). As of this past Tuesday (Monday was a holiday), I’m at Nexenta. At least for now, I’m working remotely (from Ann Arbor) with the fine folks in the Wikipedia article: Lowell office. It feels great to work on open source again.

CMake

Recently, I looked into various build system in hopes of finding one that sucks less than the custom built (not by me) one I had the “pleasure” of dealing with for about a month. The requirements were that it had to work on Windows, Linux, and Mac OS X. Of all the possibilities, CMake looked really promising. After successfully convincing the others that CMake was better for the project, most of the code got switched over. I have to say, CMake is nice.

Since then, I have replaced Makefiles in several of my projects (e.g., my blogging system) with CMakeLists.txt. Not only is the console output cleaner, but it does proper dependencies, some sanity checks, and in general simplifies one’s life. I am actually considering switching HVF build system to it.

Accepted to EuroSys '08!

My summer internship at IBM resulted in a publication in EuroSys 2008!

So, keep an eye out for my name in EuroSys proceedings near you. :)

Powered by blahgd