Josef “Jeff” Sipek

Inlining Atomic Operations

One of the items on my ever growing TODO list (do these ever shrink?) was to see if inlining Illumos’s atomic_* functions would make any difference. (For the record, these functions atomically manipulate variables. You can read more about them in the various man pages: atomic_add, atomic_and, atomic_bits, atomic_cas, atomic_dec, atomic_inc, atomic_or, atomic_swap.) Of course, once I looked at the issue deeply enough, I ended up with five cleanup patches. The gist of it is that inlining them not only improved kernel performance by about 1% on the benchmarks, but also reduced the kernel size by a couple of kilobytes. You can read all about it in the associated bugs (5042, 5043, 5044, 5045, 5046, 5047) and the patch 0/6 email I sent to the developer list. In this blahg post, I want to talk about how exactly Illumos presents these atomic functions in a stable ABI but at the same time allows for inlines.


It should come as no surprise that the “content” of these functions really needs to be written in assembly. The functions are 100% implemented in assembly in usr/src/common/atomic. There, you will find a directory per architecture. For example, in the amd64 directory, we’ll find the code for a 64-bit atomic increment:

	incq	(%rdi)

The ENTRY, ALTENTRY, and SET_SIZE macros are C preprocessor macros to make writing assembly functions semi-sane. Anyway, this code is used by both the kernel as well as userspace. I am going to ignore the userspace side of the picture and talk about the kernel only.

These assembly functions get mangled by the C preprocessor and are then fed into the assembler. The object file is then linked into the rest of the kernel. When a module binary references these functions, krtld (the linker-loader) wires up those references to this code.


Replacing these functions with inline functions (using the GNU definition) would be fine as far as all the code in Illumos is concerned. However, doing so would remove the actual functions (as well as the symbol table entries), and so the linker would not be able to wire up any references from modules. Since Illumos cares about not breaking existing external modules (both open source and closed source), this simple approach would be a no-go.

Inline v2

Before I go into the next and final approach, I’m going to make a small detour through C land.

extern inline

First off, let’s say that we have a simple function, add, that returns the sum of the two integer arguments, and we keep it in a file called add.c:

#include "add.h"

int add(int x, int y)
{
	return x + y;
}

In the associated header file, add.h, we may include a prototype like the following to let the compiler know that add exists elsewhere and what types to expect.

extern int add(int, int);

Then, we attempt to call it from a function in, say, test.c:

#include "add.h"

int test()
{
	return add(5, 7);
}

Now, let’s turn these two .c files into a .so. We get the obvious result — test calls add:

    test:     be 07 00 00 00     movl   $0x7,%esi
    test+0x5: bf 05 00 00 00     movl   $0x5,%edi
    test+0xa: e9 b1 fe ff ff     jmp    -0x14f	<0xc90>

And the binary contains both functions:

$ /usr/bin/nm | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[74]	|                3520|                   4|FUNC |GLOB |0    |13   |add
[65]	|                3536|                  15|FUNC |GLOB |0    |13   |test

Now suppose that we modify the header file to include the following (assuming GCC’s inline definition):

extern int add(int, int);

extern inline int add(int a, int b)
{
	return a + b;
}

If we compile and link the same .so the same way, that is, we feed in the object file with the previously used implementation of add, we’ll get a slightly different binary. The invocation of add will use the inlined version:

    test:     b8 0c 00 00 00     movl   $0xc,%eax
    test+0x5: c3                 ret    

But the binary will still include the symbol:

$ /usr/bin/nm | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[72]	|                3408|                   4|FUNC |GLOB |0    |11   |add
[63]	|                3424|                   6|FUNC |GLOB |0    |11   |test

Neat, eh?
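If you want to try this with a current compiler, there is one catch: GCC has defaulted to the C99 inline rules for years, and under those rules extern inline means roughly the opposite of what is described above. The sketch below squeezes the header and the library copy into a single file and uses the gnu_inline attribute to ask for the GNU rules explicitly (building with -fgnu89-inline should have the same effect). The name call_add is made up for this example and plays the role of test; I am assuming GCC or Clang here, so treat it as an illustration rather than a recipe.

```c
/* "Header" part: prototype plus a GNU-style extern inline definition.
 * The compiler may use this body for inlining, but it emits no symbol
 * for it. */
extern int add(int, int);

extern inline __attribute__((gnu_inline)) int add(int a, int b)
{
	return a + b;
}

/* "Library" part: the ordinary out-of-line definition. This is the copy
 * that shows up in the symbol table. Redefining add like this is only
 * accepted because the earlier definition is a gnu_inline extern inline. */
int add(int x, int y)
{
	return x + y;
}

/* Caller, standing in for test() from the text above. */
int call_add(void)
{
	return add(5, 7);
}
```

In a real project the first definition would live in add.h and the second in add.c, exactly as in the walkthrough above.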

extern inline atomic what?

How does this apply to the atomic functions? Pretty simply. As I pointed out, usr/src/common/atomic contains the pure assembly implementations — these are the functions you’ll always find in the symbol table.

The common header file that defines extern prototypes is usr/src/uts/common/sys/atomic.h.

Now, the trick. If you look carefully at the header file, you’ll spot a check on line 39. If all the conditions are true (kernel code, GCC, inline assembly is allowed, and x86), we include asm/atomic.h — which lives at usr/src/uts/intel/asm/atomic.h. This is where the extern inline versions of the atomic functions get defined.

So, kernel code simply includes <sys/atomic.h>, and if the stars align properly, any atomic function use will get inlined.
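To make the arrangement a bit more concrete, here is a minimal sketch of the three pieces side by side. Everything in it is an assumption made for illustration: the sketch_ names and the SKETCH_NO_INLINES switch are invented, and a compiler builtin stands in for the hand-written assembly that the real files contain. It is not the contents of either header.

```c
#include <stdint.h>

/* The prototype every consumer sees (the role of <sys/atomic.h>). */
extern void sketch_atomic_inc_64(volatile uint64_t *target);

/* Hypothetical stand-in for the kernel/GCC/x86 checks in the real header. */
#if defined(__GNUC__) && !defined(SKETCH_NO_INLINES)
/* The inline flavor (the role of asm/atomic.h). */
extern inline __attribute__((gnu_inline)) void
sketch_atomic_inc_64(volatile uint64_t *target)
{
	/* A builtin instead of inline assembly, to keep the sketch portable. */
	(void) __atomic_add_fetch(target, 1, __ATOMIC_SEQ_CST);
}
#endif

/* The out-of-line copy (the role of the code in usr/src/common/atomic).
 * It always exists, so modules that were built without the inlines can
 * still link against the symbol. */
void
sketch_atomic_inc_64(volatile uint64_t *target)
{
	(void) __atomic_add_fetch(target, 1, __ATOMIC_SEQ_CST);
}
```

Callers just write sketch_atomic_inc_64(&counter) and do not need to know which flavor they ended up with.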

Phew! This ended up being longer than I expected. :)

Segment Drivers

Lately, I started poking around the Illumos memory management code. As I’ve done in the past, I decided to use this blahg as a place to document some of my discoveries.

Memory Layout

In Illumos (and Solaris), address spaces are managed as sets of segments. Each segment has a base address, length, and a number of other properties. This is true for both process memory as well as kernel memory. Do not confuse these segments with the memory segmentation that processors like x86 provide.

Each process has its own struct as:

> ::pgrep vim
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  10852  10777  10850  10777    101 0x4a004000 ffffff0411e1c0a0 vim
> ffffff0411e1c0a0::print proc_t p_as | ::print struct as a_segtree
a_segtree = {
    a_segtree.avl_root = 0xffffff03f7c62ea8
    a_segtree.avl_compar = as_segcompar
    a_segtree.avl_offset = 0x20
    a_segtree.avl_numnodes = 0x18
    a_segtree.avl_size = 0x60
}

The kernel address space is maintained in the kas global:

> kas::print a_segtree
a_segtree = {
    a_segtree.avl_root = kvseg+0x20
    a_segtree.avl_compar = as_segcompar
    a_segtree.avl_offset = 0x20
    a_segtree.avl_numnodes = 0x9
    a_segtree.avl_size = 0x60
}

(Once upon a time this set of segments was a linked list, but for a long while now it has been an AVL tree indexed by the base address.)

Regardless of which address space we’re dealing with, the same rules apply: segments represent contiguous regions within the address space. Each segment can represent a different type of memory. For example, walking the kernel address space segment tree yields nine different segments of four different types (kpm, kmem, kp, and map):

> kas::print a_segtree | ::walk avl | ::printf "%p.%016x %a\n" "struct seg" s_base s_size s_ops
fffffe0000000000.000000031e000000 segkpm_ops
ffffff0000000000.0000000017000000 segkmem_ops
ffffff0017000000.0000000080000000 segkp_ops
ffffff0097000000.00000002fca00000 segkmem_ops
ffffff03d3a00000.0000000004000000 segmap_ops
ffffff03d7a00000.000000fbe8600000 segkmem_ops
ffffffffc0000000.000000003b7fb000 segkmem_ops
fffffffffb800000.0000000000550000 segkmem_ops
ffffffffff800000.0000000000400000 segkmem_ops

Segment Drivers

Illumos comes with seven different architecture- and platform-independent segment drivers. A segment driver is a “driver” that implements a couple of functions to manage a segment of memory. That is, each segment type can handle page faults, page locking, sync operations, etc. differently.

For example, suppose that a page fault occurs because a process tried to load a value from a page that lacks a page table entry. The platform specific (assembly) fault handling code gets invoked by the processor. After doing a little bit of work, it calls into the generic (C) fault handling code, as_fault. There, the segtree AVL tree is consulted and the corresponding segment’s fault operation gets invoked.
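The lookup-then-dispatch step can be sketched in a few lines of C. This is a toy model and not the Illumos code: all of the sketch_ names are invented, the real struct seg and its ops vector have many more members, and I use a binary search over a sorted array where the kernel uses the AVL tree.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_seg;

/* Per-driver operations vector; only the fault hook is modeled. */
struct sketch_seg_ops {
	int (*fault)(struct sketch_seg *seg, uintptr_t addr);
};

struct sketch_seg {
	uintptr_t base;
	size_t size;
	const struct sketch_seg_ops *ops;
};

/* Two pretend drivers that just report who handled the fault. */
static int sketch_vn_fault(struct sketch_seg *seg, uintptr_t addr)
{
	(void) seg; (void) addr;
	return (1);
}

static int sketch_dev_fault(struct sketch_seg *seg, uintptr_t addr)
{
	(void) seg; (void) addr;
	return (2);
}

static const struct sketch_seg_ops sketch_vn_ops = { sketch_vn_fault };
static const struct sketch_seg_ops sketch_dev_ops = { sketch_dev_fault };

/* Find the segment containing addr; segs must be sorted by base. */
static struct sketch_seg *
sketch_seg_find(struct sketch_seg *segs, size_t nsegs, uintptr_t addr)
{
	size_t lo = 0, hi = nsegs;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < segs[mid].base)
			hi = mid;
		else if (addr - segs[mid].base >= segs[mid].size)
			lo = mid + 1;
		else
			return (&segs[mid]);
	}
	return (NULL);
}

/* Generic fault path: look up the segment, then let its driver decide. */
static int
sketch_as_fault(struct sketch_seg *segs, size_t nsegs, uintptr_t addr)
{
	struct sketch_seg *seg = sketch_seg_find(segs, nsegs, addr);

	if (seg == NULL)
		return (-1);	/* no segment covers this address */
	return (seg->ops->fault(seg, addr));
}
```

The generic code never needs to know what kind of memory it is dealing with; the function pointer in the segment takes care of that.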

(Solaris Internals lists 12 and 11 segment drivers, respectively, in the two editions.) In Illumos, the seven common segment drivers are:

seg_dev
Most of the time, userspace processes do not need to map devices into their address space. In the rare case when a process does want a device mapped (e.g., Xorg), the dev segment driver maintains that mapping.
seg_kmem
This segment driver maps the kernel heap, module text, and all early boot memory.
seg_kp
In general, kernel memory is not pageable. In the rare case that something can be in kernel pageable memory, this segment is what maintains the anonymous page mappings.
seg_kpm
If possible (you’re on a 64-bit system), the kpm segment driver maps all physical memory into the kernel’s address space. This allows the kernel to not have to set up temporary mappings to operate on physical memory.
seg_map
The map segment driver is a kernel-only higher performance version of the vn segment driver. (See below.)
seg_spt
This segment driver is responsible for maintaining SysV shared memory segments. (Not to be confused with POSIX shared memory.)
seg_vn
Memory mapped files are handled by the vn segment driver. This includes both regular files as well as anonymous memory.

There are also two platform specific segment drivers:

seg_mf (i86xpv only)
This segment driver is only used by dom0 processes (read: Xen) to map pages from other domains.
seg_nf (sparc v9 only)
The header for the file says that it is for non-faulting loads. I don’t actually know what exactly it is for. (And I don’t care enough to dig deeper given that it is Sparc specific.)

The Reality

This is a lot of different segment drivers. Are all of them used all the time? Well, sort of. The mdb output earlier shows that the (amd64) kernel uses only four different segment drivers (kpm, kmem, kp, and map). A typical userspace process is very boring — it is only made up of vn segments. There are, however, exceptions. For instance, Xorg uses vn and dev. This accounts for six of the seven drivers. The last common segment driver is spt, which provides System V shared memory. (I talked about SysV shared memory previously.) So, on a 64-bit x86 system, all seven common segment drivers are in use.

The story is a bit different on 32-bit kernels. Since a 32-bit system has much smaller address space, the kernel tries to eliminate a number of mappings. Here is the list of segments in a 32-bit kernel:

> kas::print a_segtree | ::walk avl | ::printf "%p %a\n" "struct seg" s_base s_ops
b5802000 segmap_ops
b6800000 segkmem_ops
ef400000 segkmem_ops
fe800000 segkmem_ops
ff000000 segkmem_ops

As you can see, the kp and kpm segments went away. While at first this is surprising, it actually makes perfect sense. When thinking about memory there are two “types” to consider: physical and virtual. In theory, one can have more virtual than physical thanks to the MMU, but in reality this is only true on 64-bit systems. Physical memory sizes outgrew 4 GB a number of years ago, and therefore a 32-bit address space can trivially be 100% backed by physical memory. In other words, 32-bit address spaces are tight on virtual memory, while 64-bit address spaces are “tight” on physical memory.

Let’s consider the disappearance of the kp segment on 32-bits. What does kp let us do? It lets us oversubscribe physical memory by backing some virtual memory with disk space. On 32-bit systems we have enough physical memory to back all the virtual memory in the kernel, so we don’t need to back some of it by disk. So we have no use for it. (Yes, the kernel still could have paged parts of itself out, but kernel text and data are generally considered important enough to keep in non-pageable memory. The memory utilization will more than pay for itself by the performance improvement of not having the kernel paged out.)

As I stated before, kpm segments map physical memory into the kernel’s address space for performance reasons (without it the kernel would have to temporarily map a page to access the contents). Therefore, they are good candidates for removal when it comes to slimming down the kernel’s address space demands. (Well, the actual story is the other way… the introduction of 64-bit capable hardware allowed kpm segments to exist to improve kernel performance.)

Bugs in Time

Recently, I blahgd about GCC optimizing code interestingly. There, I mentioned a couple of bugs I’ve stumbled across. I’m going to talk more about them in this post.


It all started when I got assigned a bug at work. “The installer hangs while checking available disks.” That’s the extent of the information I was given along with a test system. It didn’t take long to figure that devfsadm -c disk was waiting on a kernel thread that didn’t seem to be making any progress:


The function of interest here is ibdm_ibnex_port_settle, but before I talk about it I need to mention that the ibdm kmod stashes a ddi_get_time timestamp of when the HCA attached. Now, ibdm_ibnex_port_settle calls ibdm_get_waittime to get a delay to feed to cv_reltimedwait. The delay is (more or less) calculated as: ddi_get_time() - hca_attach_time. This works fine as long as ddi_get_time continues incrementing at a constant rate (1 sec/sec).

You may already see where this is going. The problem is that ddi_get_time returns a Unix timestamp based on the current time-of-day clock. If the TOD setting changes for whatever reason (daylight saving time adjustments, NTP, etc.), the value returned by ddi_get_time may change non-monotonically. This makes it unsuitable for calculating timeouts and wait times. Converting ibdm_get_waittime to use a monotonic clock source (like gethrtime or ddi_get_lbolt) fixes this bug. (Illumos bug 4777)
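To see why a clock that can step is such a problem, consider the arithmetic behind a "wait out the rest of the settle window" calculation. The helper below is hypothetical (its name, arguments, and exact formula are mine, and the real ibdm_get_waittime may well differ); it only shows what happens to this kind of subtraction when the clock feeding it jumps.

```c
#include <stdint.h>

/*
 * Hypothetical helper: how many seconds are left of a settle window of
 * 'settle' seconds that started at 'start'. Both timestamps are assumed
 * to come from the same clock.
 */
static int64_t
sketch_wait_remaining(int64_t now, int64_t start, int64_t settle)
{
	int64_t elapsed = now - start;

	if (elapsed >= settle)
		return (0);		/* the window is already over */
	return (settle - elapsed);	/* grows if 'now' steps backwards */
}
```

With a monotonic clock, elapsed can never be negative, so the result never exceeds settle: a 30 second window that started at t=100 and is checked at t=110 has 20 seconds left. If the timestamps come from the time-of-day clock and it gets set back by an hour in between, the same check sees now=-3490, and the thread is told to wait 3620 seconds.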

Things get a bit worse. While figuring out what ddi_get_time does, I noticed that the man page actively encouraged developers to use it for timeouts. (Illumos bug 4776)

Of course, once I knew about this potential abuse, I had to check that there weren’t similar issues elsewhere in the kernel… and so I got to file bugs for iprb (4778), vhci (4779), COMSTAR iSCSI target (4780), sd (4781), usba (4782), emlxs (4786), ipf (4787), mac (4788), amr (4789), arcmsr (4790), aac (4791), and heci (4792).

I’m fixing all except: amr, arcmsr, aac, and heci.


While developing the series of fixes mentioned in the previous section, I ran into the fact that NANOSEC was defined as 1000000000. This made it an int — a 32-bit signed integer (on both ILP32 and LP64).

If NANOSEC (defined this way) is used to convert seconds to nanoseconds (by multiplying), the naive approach will fail with quantities larger than 2 seconds. For example (hrtime_t is a 64-bit signed int):

hrtime_t convert(int secs)
{
        return (secs * NANOSEC);
}

Since both secs and NANOSEC are integers, the compiler will compute the product as a 32-bit int and only then sign extend the result to 64 bits. If you look around the Illumos codebase, you’ll see plenty of places that cast or use a ULL or LL suffix to make the compiler do the right thing. Why not just change the definition of NANOSEC to include an LL suffix, relieving the users of this tedious (and error prone!) duty? Well, now you know what Illumos bug 4809 is about. :)
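For comparison, here is a small sketch of the conversion with the suffix in place. The sketch_ names are stand-ins I made up so that the example does not collide with the real hrtime_t and NANOSEC definitions.

```c
#include <stdint.h>

typedef int64_t sketch_hrtime_t;	/* stand-in for hrtime_t */

#define	SKETCH_NANOSEC	1000000000LL	/* note the LL suffix */

static sketch_hrtime_t
sketch_convert(int secs)
{
	/*
	 * SKETCH_NANOSEC is a long long, so secs gets converted to
	 * long long before the multiplication and the product is
	 * computed in 64 bits.
	 */
	return (secs * SKETCH_NANOSEC);
}
```

Three seconds is 3,000,000,000 ns, which is more than the 2,147,483,647 that a 32-bit signed int can hold. With the LL suffix, the caller no longer has to remember to cast.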

So, I changed the definition and rebuilt everything. Then, using wsdiff (think: recursive diff that understands how to compare ELF files) I found two places where the before and after binaries differed for non-trivial reasons. (I define a trivial reason as “the compiler decided to use registers differently, but the result is the same”.) Each non-trivial difference implies that there was an expression that changed — it used to be busted!

The first difference was in ZFS (Illumos bug 4810). There, spa_async_tasks_pending miscalculated a timeout making the condition always true.

The second difference was in in.mpathd (Illumos bug 4811). This daemon has a utility function to convert a struct timeval into a hrtime_t. You can read more about it in my previous post.

Before the NANOSEC change, I would have needed casts to fix this. With the change in definition, I don’t have to change a thing! And that’s how a one liner closed three bugs at the same time:

commit b59e2127f21675e88c58a4dd924bc55eeb83c7a6
Author: Josef 'Jeff' Sipek <>
Date:   Mon Apr 28 15:53:04 2014 -0400

    4809 NANOSEC should be 'long long' to avoid integer overflow bugs
    4810 spa_async_tasks_pending suffers from an integer overflow bug
    4811 in.mpathd: tv2ns suffers from an integer overflow bug
    Reviewed by: Marcel Telka <>
    Reviewed by: Dan McDonald <>
    Approved by: Robert Mustacchi <>

Greetings from Nexenta

In case you missed it, back in mid-2011 I discovered Illumos and OpenIndiana. At that point, I already missed hacking on the (Linux) kernel. Based on my blahg posts [1,2], it shouldn’t surprise you that it didn’t take long before I wanted to hack on the Illumos kernel…and so I did.

If you ever contributed to an open source project in your free time while employed full-time, you understand that there’s only so much time you can devote to the open source project and therefore there is only so much you can do.

A couple of months ago, I decided to explore the possibility of working full-time on Illumos. There are only a handful of companies that visibly participate in the Illumos ecosystem, but their use of Illumos is pretty varied (from public clouds to virtualized databases to SAN/NAS appliances). As of this past Tuesday (Monday was a holiday), I’m at Nexenta. At least for now, I’m working remotely (from Ann Arbor) with the fine folks in the Lowell office. It feels great to work on open source again.

x2APIC, IOMMU, Illumos

About a week ago, I hinted at a boot hang I was debugging. I’ve made some progress with it, and along the way I found some interesting things about which I’ll blog over the next few days. Today, I’m going to talk about the APIC, xAPIC, and x2APIC and how they’re handled in Illumos.

I strongly suggest you become at least a little familiar with APIC architecture before reading on. The Wikipedia articles on the APIC and x2APIC are a good start.

First things first, we need some definitions. APIC can refer to either the architecture or to a very old (pre-Pentium 4) implementation. Since I’m working with a Sandy Bridge, I’m going to use APIC to refer to the architecture and completely ignore that these chips existed. Everything they do is a subset of xAPIC. xAPIC is an extension to APIC. xAPIC chips started showing up in NetBurst architecture Intel CPUs (i.e., Pentium 4). xAPIC included some goodies such as upping the limit on the number of CPUs to 256 (from 16). x2APIC is an extension to xAPIC. x2APIC chips started appearing around the same time Sandy Bridge systems started showing up. It is a major update to how interrupts are handled, but as with many things in the PC industry the x2APIC is fully backwards compatible with xAPICs. x2APIC includes some goodies such as upping the limit on the number of CPUs to $2^{32}$.

Regardless of which exact flavor you happen to use, you will find two components: the local APIC and the I/O APIC. Each processor gets its own local APIC and I/O buses get I/O APICs. I/O APICs can service more than one device, and in fact many systems have only one I/O APIC.

The xAPIC uses MMIO to program the local and I/O APICs.

x2APIC has two modes of operation. First, there is the xAPIC compatibility mode which makes the x2APIC behave just like an xAPIC. This mode doesn’t give you all the new bells and whistles. Second, there is the new x2APIC mode. In this mode, the APIC is programmed using MSRs.

One interesting fact about x2APIC is that it requires an iommu. My Sandy Bridge laptop has an Intel iommu as part of the VT-d feature.

Illumos /etc/mach

Illumos has two APIC drivers. First, there is pcplusmp which knows how to handle APIC and xAPIC. Second, there is apix which targets x2APIC, but knows how to operate it in both modes. On boot, the kernel consults /etc/mach to get a list of machine specific modules to try to load. Currently, the default contents (trimmed for display here) are:

# CAUTION!  The order of modules specified here is very important. If the
# order is not correct it can result in unexpected system behavior. The
# loading of modules is in the reverse order specified here (i.e. the last
# entry is loaded first and the first entry loaded last).

Since I’m not running Xen, xpv_psm will fail to load, and apix gets its chance to load.

pcplusmp + apix Code Sharing

The code in these two modules can be summarized with a word: mess. Following what happens when would be enough of an adventure. The code for the two modules lives in four directories: usr/src/uts/i86pc/io, usr/src/uts/i86pc/io/psm, usr/src/uts/i86pc/io/pcplusmp, and usr/src/uts/i86pc/io/apix. But the sharing isn’t as straightforward as one would hope.

Directory         | pcplusmp                                              | apix
i86pc/io          | mp_platform_common.c, mp_platform_misc.c, hpet_acpi.c | mp_platform_common.c, hpet_acpi.c
i86pc/io/psm      | psm_common.c                                          | psm_common.c
i86pc/io/pcplusmp | *                                                     | apic_regops.c, apic_common.c, apic_timer.c
i86pc/io/apix     | -                                                     | *

This is of course not clear at all when you look at the code. (Reality is a bit messier because of the i86xpv platform which uses some of the i86pc source.)


When the apix module gets loaded, its probe function (apix_probe) is called. This is the place where the module decides if the hardware is worthy. Specifically, if it finds that the CPU reports x2APIC support via cpuid, it goes on to call the common APIC probe code (apic_probe_common). Unless that fails, the system will use the apix module, even if there is no iommu and therefore the x2APIC needs to operate in xAPIC mode.

What mode are you using? Easy, just check the apic_mode global in the kernel:

# echo apic_mode::whatis | mdb -k
fffffffffbd0ee4c is apic_mode, in apix's data segment
# echo apic_mode::print | mdb -k

2 (LOCAL_APIC) indicates xAPIC mode, while 3 (LOCAL_X2APIC) indicates x2APIC mode.

Because this part is as clear as mud, I made a table that tells you what module and mode to expect given your hardware, what CPUID says, and the presence and state of the iommu.

APIC hw | CPUID | IOMMU   | IOMMU state | Module   | apic_mode
xAPIC   | off   | -       | -           | pcplusmp | LOCAL_APIC
x2APIC  | off   | -       | -           | pcplusmp | LOCAL_APIC
x2APIC  | on    | absent  | -           | apix     | LOCAL_APIC
x2APIC  | on    | present | off         | apix     | LOCAL_APIC
x2APIC  | on    | present | on          | apix     | LOCAL_X2APIC
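For those who prefer code to tables, the same decision can be written as a small function. This is my own restatement of the table and not code from either module: the sketch_ names are invented, and it assumes that apic_probe_common succeeds and that the BIOS did not already switch the APIC into x2APIC mode.

```c
enum sketch_psm { SKETCH_PCPLUSMP, SKETCH_APIX };

#define	SKETCH_LOCAL_APIC	2	/* xAPIC mode */
#define	SKETCH_LOCAL_X2APIC	3	/* x2APIC mode */

struct sketch_apic_choice {
	enum sketch_psm module;
	int apic_mode;
};

/* Which module and mode to expect, given CPUID and the iommu state. */
static struct sketch_apic_choice
sketch_expected_apic(int cpuid_x2apic, int iommu_present, int iommu_on)
{
	struct sketch_apic_choice c = { SKETCH_PCPLUSMP, SKETCH_LOCAL_APIC };

	if (!cpuid_x2apic)
		return (c);	/* apix does not claim this hardware */

	c.module = SKETCH_APIX;
	if (iommu_present && iommu_on)
		c.apic_mode = SKETCH_LOCAL_X2APIC;
	return (c);
}
```

Note that the iommu arguments only matter once CPUID reports x2APIC support, which matches the first two rows of the table.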


I’ve never seen apic_mode equal to LOCAL_X2APIC in the wild. This was very puzzling. Yesterday, I discovered why. As I mentioned earlier, in order for the x2APIC to operate in x2APIC mode an iommu is required. Long story short, the default config that Illumos ships disables iommus on boot. Specifically:

$ cat /platform/i86pc/kernel/drv/rootnex.conf | grep -v '^\(#.*\|\)$'

In order to get LOCAL_X2APIC mode, you need to set:


Once you put those into the config file, update your boot archive and reboot. You should be set… except the iommu support in Illumos is… shall we say… poor.

(I should point out that it is possible for the BIOS to enable x2APIC mode before handing control off to the OS. This is pretty rare unless you have a really big x86 system.)


It would seem that the hci1394 driver doesn’t quite know how to deal with an iommu “messing” with its I/Os and its interrupt service routine shuts down the driver. (On a debug build it throws in an ASSERT(0) for good measure.) I just disabled 1394 in the BIOS since I don’t have any Firewire devices handy and therefore no use for the port at the moment.

immu-enable Details

In case you want to know how iommu initialization affects the apix initialization…

During boot, immu_init gets called to initialize iommus. If the config option (immu-enable) is not true, the function just returns instead of calling immu_subsystems_setup, which calls immu_intrmap_setup, which sets psm_vt_ops to a non-NULL value.

Later on, when apix is loaded and is initializing itself in apix_picinit, it calls apic_intrmap_init. This function does nothing if psm_vt_ops is NULL.
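Boiled down to a toy model, the dependency looks something like the following. None of this is the real code: the sketch_ functions only mimic the control flow described above, the ops structure is a dummy, and the return value of the second function exists only so the effect is visible.

```c
#include <stddef.h>

struct sketch_vt_ops { int dummy; };

static struct sketch_vt_ops sketch_ops;
static struct sketch_vt_ops *sketch_psm_vt_ops = NULL;

/* Models immu_init: bail out early unless immu-enable is set. */
static void
sketch_immu_init(int immu_enable)
{
	if (!immu_enable)
		return;
	/* Models the side effect of immu_intrmap_setup. */
	sketch_psm_vt_ops = &sketch_ops;
}

/* Models apic_intrmap_init: returns 1 if it actually did something. */
static int
sketch_apic_intrmap_init(void)
{
	if (sketch_psm_vt_ops == NULL)
		return (0);	/* iommu never set up, nothing to do */
	return (1);
}
```

So with the iommu disabled in the config, the pointer is never set and the later interrupt remapping setup quietly turns into a no-op.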

The Hang

I might as well tell you a bit about my progress on tracking down the hang. It happens only if I’m using the apix module and I allow deep C states in the idle thread (technically, it could also be an mwait related issue since I cannot disable just mwait without disabling deep C states). It does not matter if the apic_mode is LOCAL_APIC or LOCAL_X2APIC.

Assorted Documentation

  1. Intel 64 Architecture x2APIC Specification
  2. Intel MP Spec 1.4

Haskell Kernel Modules

Insanity! Someone has made it possible to write kernel modules in Haskell. (FYI, Haskell is a functional language with very strong typing.) Currently, they support only x86, but I wouldn’t be surprised if some other architectures got a port soonish.

OLS 2008 - Day 3

Yeah, I really wanted to write this yesterday — since it is about yesterday, but I was too tired when I got to the hotel. Either way, here it is.

The day started at 10am again - I love it. In previous years, presentations started at 9am (except the first day, which started at 10am). The first talk I attended was about kernel documentation: where it resides, and why the current state is bad. The talk was a bit confusing. At one point, the presenter decided to read some text right from an HTML file, opening it in a text editor instead of a browser. He also seemed to contradict himself a bit … at one point he seemed to have said that HTML was better than plaintext docs, and then some time later, he said the opposite: plaintext docs were better than HTML. I kinda gave up understanding what his point was.

I decided to be lazy, and stayed in the same room for the next talk: On submitting kernel features. I zoned out for quite a bit — I knew a bunch of things already, and it was a bit hard to lex what Andi Kleen (the speaker) was saying.

I was going to go to the ext4 talk. Unfortunately, I got distracted by people on my way to the talk, and before I knew it, I missed most of it. I guess I’ll just have to read the paper.

After lunch, I went to Virtualization of Linux servers: a comparative study. The talk was interesting, and I will read the paper. It showed exactly how much x86 virtualization sucks (at least compared to what’s on the mainframe). I can’t wait to have some time to hack on HVF some more. :)

Then, I got distracted by people, preparation of slides for my BOF about Guilt, pondering about trying SELinux again, etc., etc.

Anyway, I’m going to finish a summary of what happened yesterday later today. Until then…

Linux Kernel Developers Go Insane

This is a continuation of the lguest: The New Kid on the Block post I made the other day.

In response to Rusty’s patches, Linus Torvalds and Alan Cox attempt poetry.

First, Linus…

There’s a reason for [not having enough poetry in the kernel].

There once was a lad from Braidwood
With a wife and a hatred for FUD
He hacked kernels for fun,
couldn’t get them to run.
But he always felt that he should.


So when you say "there’s not enough poetry", next time you’ll know why. You *really* don’t want want poetry.

Then Alan Cox replied with modified lyrics to Eleanor Rigby:

Ah look at all the laundered pages
Ah look at all the laundered pages

Handling Pages
Pick up the list and the link where kswap has been
A paging scheme
Runs down the I/O
Watching the queues that now keep me a list of the store
Who is it for

All the laundered pages
Where do they all come from
All the laundered pages
Where do they all belong

Meeting bdflush
Writing the pages of a disk file that no one will clear
No task comes near
Look at it working
Sleeping a lot in the night when there’s no pressure there
What does it care

All the laundered pages
Where do they all come from
All the laundered pages
Where do they all belong

Ah look at all the laundered pages
Ah look at all the laundered pages

Oracle DB
Died under load and was freed along with its name
No admin came
Good old bdflush
Wiping the dirt from the pages as it walks down the chain
Nothing was aged

All the laundered pages
(Ah look at all the laundered pages)
Where do they all come from
All the laundered pages
(Ah look at all the laundered pages)
Where do they all belong

Then, there was an exchange of limericks between Rusty and Alan…


There once was a virtualization coder,
Whose patches kept getting older,
Each time upstream would drop,
His documentation would slightly rot,


There once was a man they called rusty
Who patches were terribly crusty
Though his patches were right
And Linus was bright
They sat on the list getting dusty.


There was a poetic infection
Which distorted the kernel’s direction,
The code got no time
As they all tried to rhyme
And it shipped needing lots of correction.

And finally, Alan:

Dear Rusty I think that we know
Your code has good things to show
But an unreliable guide
To the poetic aside
Would probably steal the show

Either way, these are the people that write your operating system. :)

lguest: The New Kid on the Block

As most of you know, virtualization doesn’t really interest me, so me writing about lguest is rather unusual. For those who don’t know, lguest is Rusty Russell’s way of saying virtualization sucks and I can make it better (don’t quote me on that).

Yesterday, Rusty sent out a 7-patch series (1, 2, 3, 4, 5, 6, 7) that contains most of the documentation for lguest. This is not the normal style of documentation you’ll find in the kernel. Here’s Rusty’s description…

Lguest is an adventure, with you, the reader, as Hero. I can’t think of many 5000-line projects which offer both such capability and glimpses of future potential; it is an exciting time to be delving into the source!

But be warned; this is an arduous journey of several hours or more! And as we know, all true Heroes are driven by a Noble Goal. Thus I offer a Beer (or equivalent) to anyone I meet who has completed this documentation.

So get comfortable and keep your wits about you (both quick and humorous). Along your way to the Noble Goal, you will also gain masterly insight into lguest, and hypervisors and x86 virtualization in general.

There is a very large number of totally hilarious comments. It looks like one doesn’t have to be an x86 expert to get a laugh out of them, but knowing a thing or two about the architecture makes it all the more enjoyable.

I can’t help but include a few excerpts here…

Intel provided a special instruction to clear the TS bit for people too cool to use write_cr0() to do it. This "clts" instruction is faster, because all the vowels have been optimized out.

I’m told there are only two stories in the world worth telling: love and hate. So there used to be a love scene here like this:

Launcher: We could make beautiful I/O together, you and I.
Guest: My, that’s a big disk!

Unfortunately, it was just too raunchy for our otherwise-gentle tale.

Just read the patches. They are really amusing :)

Looking up Files, Part II

So, here are more updates about my adventures within the realm of unionfs_lookup (I suggest you read part I first). After my first post about lookup code, I went back to coding, and I had the pleasure to try to figure out why I was hitting a BUG_ON() with my new code, but not with the old code.

I made a simple test case: in one terminal I’d run fsx (a POSIX compliance tester program) on unionfs:

mount -t unionfs -o dirs=/mnt/foo/b0:/mnt/foo/b1=ro none unionfs/
cd unionfs/
fsx -l 104060000 -q foo

And then mid-way through, I’d insert a branch as the new branch index 0:

mount -o remount,add=/mnt/foo/b0:/mnt/foo/b2=rw /mnt/unionfs

The remount command immediately caused the BUG_ON (that tests for dentry validity) in unionfs_setattr to trigger. It seemed rather odd that the lookup code replacement would do something that’d cause the unionfs dentry to be invalid. I pondered for a bit, and then I tried to insert a number of branches quickly with the old code. Eureka! The same BUG_ON() got triggered. Some lxr-ing later, it became apparent that we need to potentially revalidate inside the inode ops (like unionfs_setattr). Seems kinda obvious now, oh well. I’m also pondering about the possibility of changing the VFS to call d_revalidate, but I’m still not sure if that’s the Right Thing(tm) to do.

Until next time!
