Josef “Jeff” Sipek

Delegating mount/umount Privileges

Recently, I was doing some file system changes. Obviously, I wanted to run them as an unprivileged user. Unfortunately, the test involved mounting and unmounting a filesystem (tmpfs to be specific). At first I was going to set up a sudo rule to allow mount and umount to run without asking for a password. Then I remembered that I should be able to give the unprivileged user the additional privileges. It turns out that there is only one privilege (sys_mount) necessary to delegate…and it is easy to do!

$ usermod -K defaultpriv=basic,sys_mount jeffpc

Then it’s a matter of logging out and back in. We can check using ppriv:

$ ppriv $$
925:    bash
flags = <none>
        E: basic,sys_mount
        I: basic,sys_mount
        P: basic,sys_mount
        L: all

At this point, mounting and unmounting works without sudo or similar user switching:

$ mkdir tmp
$ mount -F tmpfs none /tmp/tmp
$ df -h /tmp/tmp
Filesystem      Size  Used Avail Use% Mounted on
swap            2.6G     0  2.6G   0% /tmp/tmp

Serial Console in a Zone

In the past, I’ve talked about serial consoles. I have described how to set up a serial console on Solaris/OpenIndiana. I’ve talked about Grub’s composite console in Illumos-based distros. This time, I’m going do describe the one trick necessary to get tip(1) in a zone working.

In my case, I am using SmartOS to run my zones. Sadly, SmartOS doesn’t support device pass-through of this sort, so I have to tweak the zone config after I create the zone with vmadm.

Let’s assume that the serial port I want to pass through is /dev/term/a. Passing it through into a zone is as easy as:

[root@isis ~]# zonecfg -z 7cff99f6-2b01-464d-9f72-d0ef16ce48af
zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af> add device
zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af:device> set match=/dev/term/a
zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af:device> end
zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af> commit

At this point, you’ll probably want to reboot the zone (I don’t remember if it is strictly necessary). Once it is back up, you’ll want to get into the zone and point your software of choice at /dev/term/a. It doesn’t matter that you are in a zone. The same configuration rules apply — in my case, it’s the same change to /etc/remote as I described previously.

Inlining Atomic Operations

One of the items on my ever growing TODO list (do these ever shrink?) was to see if inlining Illumos’s atomic_* functions would make any difference. (For the record, these functions atomically manipulate variables. You can read more about them in the various man pages — atomic_add, atomic_and, atomic_bits, atomic_cas, atomic_dec, atomic_inc, atomic_or, atomic_swap.) Of course once I looked at the issue deeply enough, I ended up with five cleanup patches. The gist of it is, inlining them caused not only about 1% kernel performance improvement on the benchmarks, but also reduced the kernel size by a couple of kilobytes. You can read all about it in the associated bugs (5042, 5043, 5044, 5045, 5046, 5047) and the patch 0/6 email I sent to the developer list. In this blahg post, I want to talk about how exactly Illumos presents these atomic functions in a stable ABI but at the same time allows for inlines.

Genesis

It should come as no surprise that the “content” of these functions really needs to be written in assembly. The functions are 100% implemented in assembly in usr/src/common/atomic. There, you will find a directory per architecture. For example, in the amd64 directory, we’ll find the code for a 64-bit atomic increment:

	ENTRY(atomic_inc_64)
	ALTENTRY(atomic_inc_ulong)
	lock
	incq	(%rdi)
	ret
	SET_SIZE(atomic_inc_ulong)
	SET_SIZE(atomic_inc_64)

The ENTRY, ALTENTRY, and SET_SIZE macros are C preprocessor macros to make writing assembly functions semi-sane. Anyway, this code is used by both the kernel as well as userspace. I am going to ignore the userspace side of the picture and talk about the kernel only.

These assembly functions, get mangled by the C preprocessor, and then are fed into the assembler. The object file is then linked into the rest of the kernel. When a module binary references these functions the krtld (linker-loader) wires up those references to this code.

Inline

Replacing these function with inline functions (using the GNU definition) would be fine as far as all the code in Illumos is concerned. However doing so would remove the actual functions (as well as the symbol table entries) and so the linker would not be able to wire up any references from modules. Since Illumos cares about not breaking existing external modules (both open source and closed source), this simple approach would be a no-go.

Inline v2

Before I go into the next and final approach, I’m going to make a small detour through C land.

extern inline

First off, let’s say that we have a simple function, add, that returns the sum of the two integer arguments, and we keep it in a file called add.c:

#include "add.h"

int add(int x, int y)
{
	return x + y;
}

In the associated header file, add.h, we may include a prototype like the following to let the compiler know that add exists elsewhere and what types to expect.

extern int add(int, int);

Then, we attempt to call it from a function in, say, test.c:

#include "add.h"

int test()
{
	return add(5, 7);
}

Now, let’s turn these two .c files into a .so. We get the obvious result — test calls add:

test()
    test:     be 07 00 00 00     movl   $0x7,%esi
    test+0x5: bf 05 00 00 00     movl   $0x5,%edi
    test+0xa: e9 b1 fe ff ff     jmp    -0x14f	<0xc90>

And the binary contains both functions:

$ /usr/bin/nm test.so | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[74]	|                3520|                   4|FUNC |GLOB |0    |13   |add
[65]	|                3536|                  15|FUNC |GLOB |0    |13   |test

Now suppose that we modify the header file to include the following (assuming GCC’s inline definition):

extern int add(int, int);

extern inline int add(int a, int b)
{
	return a + b;
}

If we compile and link the same .so the same way, that is we feed in the object file with the previously used implementation of add, we’ll get a slightly different binary. The invocation of add will use the inlined version:

test()
    test:     b8 0c 00 00 00     movl   $0xc,%eax
    test+0x5: c3                 ret    

But the binary will still include the symbol:

$ /usr/bin/nm test.so | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[72]	|                3408|                   4|FUNC |GLOB |0    |11   |add
[63]	|                3424|                   6|FUNC |GLOB |0    |11   |test

Neat, eh?

extern inline atomic what?

How does this apply to the atomic functions? Pretty simply. As I pointed out, usr/src/common/atomic contains the pure assembly implementations — these are the functions you’ll always find in the symbol table.

The common header file that defines extern prototypes is usr/src/uts/common/sys/atomic.h.

Now, the trick. If you look carefully at the header file, you’ll spot a check on line 39. If all the conditions are true (kernel code, GCC, inline assembly is allowed, and x86), we include asm/atomic.h — which lives at usr/src/uts/intel/asm/atomic.h. This is where the extern inline versions of the atomic functions get defined.

So, kernel code simply includes <sys/atomic.h>, and if the stars align properly, any atomic function use will get inlined.

Phew! This ended up being longer than I expected. :)

Grub Composite Console

In the past, I’ve described how to get a serial console going on Illumos based systems. If you ever used a serial console in Grub (regardless of the OS you ended up booting), you probably know that telling Grub to output to a serial port causes the VGA console to become totally useless — it’s blank.

Well, if you are using Illumos, you are in luck. About 5 months ago, Joyent integrated a “composite console” in Grub. You can read the full description in the bug report/feature request. The short version is: all grub output can be sent to both the VGA console as well as over a serial port.

It is very easy to configure. In your menu.lst, change the terminal to composite. For example, this comes from my test box’s config file (omitting the uninteresting bits):

serial --unit=0 --speed=115200
terminal composite

Note the use of composite instead of serial. That’s all there is to it.

Segment Drivers

Lately, I started poking around the Illumos memory management code. As I’ve done in the past, I decided to use this blahg as a place to document some of my discoveries.

Memory Layout

In Illumos (and Solaris), address spaces are managed as sets of segments. Each segment has a base address, length, and a number of other properties. This is true for both process memory as well as kernel memory. Do not confuse these segments with Wikipedia article: memory segmentation that processors like Wikipedia article: x86 provide.

Each process has its own struct as:

> ::pgrep vim
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  10852  10777  10850  10777    101 0x4a004000 ffffff0411e1c0a0 vim
> ffffff0411e1c0a0::print proc_t p_as | ::print struct as a_segtree
a_segtree = {
    a_segtree.avl_root = 0xffffff03f7c62ea8
    a_segtree.avl_compar = as_segcompar
    a_segtree.avl_offset = 0x20
    a_segtree.avl_numnodes = 0x18
    a_segtree.avl_size = 0x60
}

The kernel address space is maintained in the kas global:

> kas::print a_segtree
a_segtree = {
    a_segtree.avl_root = kvseg+0x20
    a_segtree.avl_compar = as_segcompar
    a_segtree.avl_offset = 0x20
    a_segtree.avl_numnodes = 0x9
    a_segtree.avl_size = 0x60
}

(Once upon a time this set of segments was a linked list, but for a long while now it has been an AVL tree indexed by the base address.)

Regardless of which address space we’re dealing with, the same rules apply: segments represent contiguous regions within the address space. Each segment can represent a different type of memory. For example, walking the kernel address space segment tree yields nine different segments of four different types (kpm, kmem, kp, and map):

> kas::print a_segtree | ::walk avl | ::printf "%p.%016x %a\n" "struct seg" s_base s_size s_ops
fffffe0000000000.000000031e000000 segkpm_ops
ffffff0000000000.0000000017000000 segkmem_ops
ffffff0017000000.0000000080000000 segkp_ops
ffffff0097000000.00000002fca00000 segkmem_ops
ffffff03d3a00000.0000000004000000 segmap_ops
ffffff03d7a00000.000000fbe8600000 segkmem_ops
ffffffffc0000000.000000003b7fb000 segkmem_ops
fffffffffb800000.0000000000550000 segkmem_ops
ffffffffff800000.0000000000400000 segkmem_ops

Segment Drivers

Illumos comes with seven different architecture- and platform-independent segment drivers. A segment driver is a “driver” that implements a couple of functions to manage a segment of memory. That is, each segment type can handle page faults, page locking, sync operations, etc. differently.

For example, suppose that a page fault occurs because a process tried to load a value from a page that lacks a page table entry. The platform specific (assembly) fault handling code gets invoked by the processor. After doing a little bit of work, it calls into the generic (C) fault handling code, as_fault. There, the segtree AVL tree is consulted and the corresponding segment’s fault operation gets invoked.

(Solaris Internals lists 12 and 11 segment drivers, respectively, in the two editions.) In Illumos, the seven common segment drivers are:

seg_dev
Most of the time, userspace processes do not need to map devices into their address space. In the rare case when a process does want a device mapped (e.g., Xorg), the dev segment driver maintains that mapping.
seg_kmem
This segment driver maps the kernel heap, module text, and all early boot memory. (code)
seg_kp
In general, kernel memory is not pageable. In the rare case that something can be in kernel pageable memory, this segment is what maintains the anonymous page mappings.
seg_kpm
If possible (you’re on a 64-bit system), the kpm segment driver maps all physical memory into the kernel’s address space. This allows the kernel to not have to set up temporary mappings to operate on physical memory. (code)
seg_map
The map segment driver is a kernel-only higher performance version of the vn segment driver. (See below.)
seg_spt
This segment driver is responsible for maintaining SysV shared memory segments. (Not to be confused with POSIX shared memory.)
seg_vn
Memory mapped files are handled by the vn segment driver. This includes both regular files as well as anonymous memory.

There are also two platform specific segment drivers:

seg_mf (i86xpv only)
This segment driver is only used by dom0 processes (read: Xen) to map pages from other domains.
seg_nf (sparc v9 only)
The header for the file says that it is for non-faulting loads. I don’t actually know what exactly it is for. (And I don’t care enough to dig deeper given that it is Sparc specific.)

The Reality

This is a lot of different segment drivers. Are all of them used all the time? Well, sort of. The mdb output earlier shows that the (amd64) kernel uses only four different segment drivers (kpm, kmem, kp, and map). A typical userspace process is very boring — it is only made up of vn segments. There are, however, exceptions. For instance, Xorg uses vn and dev. This accounts for six of the seven drivers. The last common segment driver is spt, which provides System V shared memory. (I talked about SysV shared memory previously.) So, on a 64-bit x86 system, all seven common segment drivers are in use.

The story is a bit different on 32-bit kernels. Since a 32-bit system has much smaller address space, the kernel tries to eliminate a number of mappings. Here is the list of segments in a 32-bit kernel:

> kas::print a_segtree | ::walk avl | ::printf "%p %a\n" "struct seg" s_base s_ops
b5802000 segmap_ops
b6800000 segkmem_ops
ef400000 segkmem_ops
fe800000 segkmem_ops
ff000000 segkmem_ops

As you can see, the kp and kpm segments went away. While at first this is surprising, it actually makes perfect sense. When thinking about memory there are two “types” to consider: physical and virtual. In theory, one can have more virtual than physical thanks to the MMU but in reality this is only true on 64-bit systems. The physical memory sizes have outgrown 4 GB a number of years ago and therefore a 32-bit address space can trivially be 100% backed by physical memory. In other words, 32-bit address spaces are tight on virtual memory, while 64-bit address spaces are “tight” on physical memory.

Let’s consider the disappearance of the kp segment on 32-bits. What does kp let us do? It lets us oversubscribe physical memory by backing some virtual memory with disk space. On 32-bit systems we have enough physical memory to back all the virtual memory in the kernel so we don’t need to back some of it by disk. So we have no use for it. (Yes, the kernel still could have paged parts of itself out, but kernel text and data is generally considered important enough to keep it in non-pageable memory. The memory utilization will more than pay for itself by the performance improvement of not having the kernel paged out.)

As I stated before, kpm segments map physical memory into the kernel’s address space for performance reasons (without it the kernel would have to temporarily map a page to access the contents). Therefore, they are good candidates for removal when it comes to slimming down the kernel’s address space demands. (Well, the actual story is the other way… the introduction of 64-bit capable hardware allowed kpm segments to exist to improve kernel performance.)

Unix Shared Memory

While investigating whether some memory management code was still in use (I’ll blahg about this in the future), I ended up learning quite a bit about shared memory on Unix systems. Since I managed to run into a couple of non-obvious snags while trying to get a simple test program running, I thought I’d share my findings here for my future self.

All in all, there are three ways to share memory between processes on a modern Unix system.

System V shm

This is the oldest of the three. First you call shmget to set up a shared memory segment and then you call shmat to map it into your address space. Here’s a quick example that does not do any error checking or cleanup:

void sysv_shm()
{
        int ret;
        void *ptr;

        ret = shmget(0x1234, 4096, IPC_CREAT);
        printf("shmget returned %d (%d: %s)\n", ret, errno,
               strerror(errno));

        ptr = shmat(ret, NULL, SHM_PAGEABLE | SHM_RND);
        printf("shmat returned %p (%d: %s)\n", ptr, errno, strerror(errno));
}

What’s so tricky about this? Well, by default Illumos’s shmat will return EPERM unless you are root. This sort of makes sense given how this flavor of shared memory is implemented. (Hint: it’s all in the kernel)

POSIX shm

As is frequently the case, POSIX came up with a different interface and different semantics for shared memory. Here’s the POSIX shm version of the above function:

void posix_shm()
{
	int fd;
	void *ptr;

	fd = shm_open("/blah", O_RDWR | O_CREAT, 0666);
	printf("shm_open returned %d (%d: %s)\n", fd, errno,
	       strerror(errno));

	ftruncate(fd, 4096); /* IMPORTANT! */

	ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	printf("mmap returned %p (%d: %s)\n", ptr, errno, strerror(errno));
}

The very important part here is the ftruncate call. Without it, shm_open may create an empty file and mmaping an empty file won’t work very well. (Well, on Illumos mmap succeeds, but you effectively have a 0-length mapping so any loads or stores will result in a SIGBUS. I haven’t tried other OSes.)

Aside from the funny looking path (it must start with a slash, but cannot contain any other slashes), shm_open looks remarkably like the open system call. It turns out that at least on Illumos, shm_open is implemented entirely in libc. The implementation creates a file in /tmp based on the path provided and the file descriptor that it returns is actually a file descriptor for this file in /tmp. For example, “/blah” input translates into “/tmp/.SHMDblah”. (There is a second file “/tmp/.SHMLblah” that doesn’t live very long. I think it is a lock file.) The subsequent mmap call doesn’t have any idea that this file is special in any way.

Does this mean that you can reach around shm_open and manipulate the object directly? Not exactly. POSIX states: “It is unspecified whether the name appears in the file system and is visible to other functions that take pathnames as arguments.”

The big difference between POSIX and SysV shared memory is how you refer to the segment — SysV uses a numeric key, while POSIX uses a path.

mmap

The last way of sharing memory involves no specialized APIs. It’s just plain ol’ mmap on an open file. For completeness, here’s the function:

void mmap_shm()
{
	int fd;
	void *ptr;

	fd = open("/tmp/blah", O_RDWR | O_CREAT, 0666);
	printf("open returned %d (%d: %s)\n", fd, errno, strerror(errno));

	ftruncate(fd, 4096); /* IMPORTANT! */

	ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	printf("mmap returned %p (%d: %s)\n", ptr, errno, strerror(errno));
}

It is very similar to the POSIX shm code example. As before, we need the ftruncate to make the shared file non-empty.

pmap

In case you’ve wondered what SysV or POSIX shm segments look like on Illumos, here’s the pmap output for a process that basically runs the first two examples above.

6343:	./a.out
0000000000400000          8K r-x--  /storage/home/jeffpc/src/shm/a.out
0000000000411000          4K rw---  /storage/home/jeffpc/src/shm/a.out
0000000000412000         16K rw---    [ heap ]
FFFFFD7FFF160000          4K rwxs-    [ dism shmid=0x13 ]
FFFFFD7FFF170000          4K rw-s-  /tmp/.SHMDblah
FFFFFD7FFF180000         24K rwx--    [ anon ]
FFFFFD7FFF190000          4K rwx--    [ anon ]
FFFFFD7FFF1A0000       1596K r-x--  /lib/amd64/libc.so.1
FFFFFD7FFF33F000         52K rw---  /lib/amd64/libc.so.1
FFFFFD7FFF34C000          8K rw---  /lib/amd64/libc.so.1
FFFFFD7FFF350000          4K rwx--    [ anon ]
FFFFFD7FFF360000          4K rwx--    [ anon ]
FFFFFD7FFF370000          4K rw---    [ anon ]
FFFFFD7FFF380000          4K rw---    [ anon ]
FFFFFD7FFF390000          4K rwx--    [ anon ]
FFFFFD7FFF393000        348K r-x--  /lib/amd64/ld.so.1
FFFFFD7FFF3FA000         12K rwx--  /lib/amd64/ld.so.1
FFFFFD7FFF3FD000          8K rwx--  /lib/amd64/ld.so.1
FFFFFD7FFFDFD000         12K rw---    [ stack ]
         total         2120K

You can see that the POSIX shm file got mapped in the standard way (address FFFFFD7FFF170000). The SysV shm segment is special — it is not a plain old memory map (address FFFFFD7FFF160000).

That’s it for today. I’m going to talk about segment types in the different post in the near future.

Bugs in Time

Recently, I blahgd about GCC optimizing code interestingly. There, I mentioned a couple of bugs I’ve stumbled across. I’m going to talk more about them in this post.

ddi_get_time

It all started when I got assigned a bug at work. “The installer hangs while checking available disks.” That’s the extent of the information I was given along with a test system. It didn’t take long to figure that devfsadm -c disk was waiting on a kernel thread that didn’t seem to be making any progress:

swtch+0x141
cv_timedwait_hires+0xec
cv_reltimedwait+0x51
ibdm`ibdm_ibnex_port_settle_wait+0x5f
ib`ibnex_bus_config+0x1e8
devi_config_common+0xa5
mt_config_thread+0x58
thread_start+8

The function of interest here is ibdm_ibnex_port_settle, but before I talk about it I need to mention that the ibdm kmod stashes a ddi_get_time timestamp of when the HCA attached. Now, ibdm_ibnex_port_settle calls ibdm_get_waittime to get a delay to feed to cv_reltimedwait. The delay is (more or less) calculated as: ddi_get_time() - hca_attach_time. This works fine as long as ddi_get_time continues incrementing at a constant rate (1 sec/sec).

You may already see where this is going. The problem is that ddi_get_time returns a Unix timestamp based on the current time-of-day clock. If the TOD setting changes for whatever reason (daylight saving time adjustments, NTP, etc.), the value returned by ddi_get_time may change non-monotonically. This makes it unsuitable for calculating timeouts and wait times. Converting ibdm_get_waittime to use a monotonic clock source (like gethrtime or ddi_get_lbolt) fixes this bug. (Illumos bug 4777)

Things get a bit worse. While figuring out what ddi_get_time does, I noticed that the man page actively encouraged developers to use it for timeouts. (Illumos bug 4776)

Of course, once I knew about this potential abuse, I had to check that there weren’t similar issues elsewhere in the kernel… and so I got to file bugs for iprb (4778), vhci (4779), COMSTAR iSCSI target (4780), sd (4781), usba (4782), emlxs (4786), ipf (4787), mac (4788), amr (4789), arcmsr (4790), aac (4791), and heci (4792).

I’m fixing all except: amr, arcmsr, aac, and heci.

NANOSEC

While developing the series of fixes mentioned in the previous section, I ran into the fact that NANOSEC was defined as 1000000000. This made it an int — a 32-bit signed integer (on both ILP32 and LP64).

If NANOSEC (defined this way) is used to convert seconds to nanoseconds (by multiplying), the naive approach will fail with quantities larger than 2 seconds. For example (hrtime_t is a 64-bit signed int):

hrtime_t convert(int secs)
{
        return (secs * NANOSEC);
}

Since both secs and NANOSEC are integers, the compiler will compute the product and then sign extend the result to 64-bits. If you look around the Illumos codebase, you’ll see plenty of places that cast or use ULL or LL suffix to make the compiler do the right thing. Why not just change the definition of NANOSEC to include a LL suffix releaving the users of this tedious (and error prone!) duty? Well, now you know what Illumos bug 4809 is about. :)

So, I changed the definition and rebuilt everything. Then, using wsdiff (think: recursive diff that understands how to compare ELF files) I found two places where the before and after binaries differed for non-trivial reasons. (I define a trivial reason as “the compiler decided to use registers differently, but the result is the same”.) Each non-trivial difference implies that there was an expression that changed — it used to be busted!

The first difference was in ZFS (Illumos bug 4810). There, spa_async_tasks_pending miscalculated a timeout making the condition always true.

The second difference was in in.mpathd. 4811). This daemon has a utility function to convert a struct timeval into a hrtime_t. You can read more about it in my previous post.

Before the NANOSEC change, I would have needed casts to fix this. With the change in definition, I don’t have to change a thing! And that’s how a one liner closed three bugs at the same time:

commit b59e2127f21675e88c58a4dd924bc55eeb83c7a6
Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Date:   Mon Apr 28 15:53:04 2014 -0400

    4809 NANOSEC should be 'long long' to avoid integer overflow bugs
    4810 spa_async_tasks_pending suffers from an integer overflow bug
    4811 in.mpathd: tv2ns suffers from an integer overflow bug
    Reviewed by: Marcel Telka <marcel.telka@nexenta.com>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Approved by: Robert Mustacchi <rm@joyent.com>

GCC Optimizations

Recently, I’ve been given a hang bug to work on. This lead me to a another bug related to timing which pushed me to clean up a time related #define which uncovered at least two bugs. Got all that? Good. The rest of this post is going to talk about the changed define, and one of the “at least two bugs”. When I talk about GCC, I’m talking about the Illumos-specific GCC version 4.4.4. (Illumos needs a couple of features that stock GCC doesn’t provide.)

The #define change I’m hoping to make is very simple:

diff --git a/usr/src/uts/common/sys/time.h b/usr/src/uts/common/sys/time.h
--- a/usr/src/uts/common/sys/time.h
+++ b/usr/src/uts/common/sys/time.h
@@ -234,7 +234,7 @@ struct itimerval32 {
 #define        SEC             1
 #define        MILLISEC        1000
 #define        MICROSEC        1000000
-#define        NANOSEC         1000000000
+#define        NANOSEC         1000000000ll
 
 #define        MSEC2NSEC(m)    ((hrtime_t)(m) * (NANOSEC / MILLISEC))
 #define        NSEC2MSEC(n)    ((n) / (NANOSEC / MILLISEC))

Without it, multiplying by NANOSEC will cause integer overflow issues on IPL32 and LP64 systems (read: basically everywhere).

One of the “at least two bugs“ involves a simple (buggy) function aptly named tv2ns as it converts a struct timeval to a 64-bit nanosecond count:

static int64_t
tv2ns(struct timeval *tvp)
{
	return (tvp->tv_sec * NANOSEC + tvp->tv_usec * 1000);
}

At first glance, this function looks correct. The only flaw with it is that first portion of the expression multiplies a time_t (32-bit signed int) with an int (also 32-bit signed) making the result of that subexpression 32-bit signed expression. With NANOSEC changed to a long long, everything works as expected. Now, the fun part… disassembling this function without the fix. You don’t have to be an expert to see that this function is strangely repetitive. I’ve annotated the assembly.

tv2ns:          movl   0x4(%esp),%eax     ; eax = tvp
tv2ns+4:        movl   0x4(%eax),%edx     ; edx = tvp->tv_usec
tv2ns+7:        leal   (%edx,%edx,4),%edx ; edx = edx + 4 * edx
tv2ns+0xa:      leal   (%edx,%edx,4),%edx ;     = 5 * edx
tv2ns+0xd:      leal   (%edx,%edx,4),%edx
;
; at this point:  edx = 5 * 5 * 5 * tvp->tv_usec,
; which is the same as: 125 * tvp->tv_usec
;
tv2ns+0x10:     movl   (%eax),%eax        ; eax = tvp->tv_sec
tv2ns+0x12:     leal   (%eax,%eax,4),%eax ; eax = eax + 4 * eax
tv2ns+0x15:     leal   (%eax,%eax,4),%eax ;     = 5 * eax
tv2ns+0x18:     leal   (%eax,%eax,4),%eax
tv2ns+0x1b:     leal   (%eax,%eax,4),%eax
tv2ns+0x1e:     leal   (%eax,%eax,4),%eax
tv2ns+0x21:     leal   (%eax,%eax,4),%eax
tv2ns+0x24:     leal   (%eax,%eax,4),%eax
tv2ns+0x27:     leal   (%eax,%eax,4),%eax
tv2ns+0x2a:     leal   (%eax,%eax,4),%eax
;
; at this point,  eax = 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * tvp->tv_sec,
; which is the same as: 1953125 * tvp->tv_sec
;
tv2ns+0x2d:     shll   $0x9,%eax          ; eax <<= 9
;
; eax = (1953125 * tvp->tv_sec) << 9,
; which suprprisingly ends up being the same as: 1000000000 * tvp->tv_sec
;
; so, now we have 'eax' with the tv_sec converted to nanoseconds and 'edx'
; with 125 * tv_usec
;
tv2ns+0x30:     leal   (%eax,%edx,8),%eax ; eax = eax + 8 * edx
;
; 8 * 125 = 1000, which is the factor to convert tv_usec to nanoseconds!
;
tv2ns+0x33:     cltd                      ; sign-extend eax to edx:eax
tv2ns+0x34:     ret    

I found it interesting that GCC decided to emit leal instructions to multiply by 5 and then finish it off with a shift and another leal. This is another one of those times when I realize that the compiler is smarter than me. (The sign-extension of course happens too late — all the math needs to happen as 64-bit arithmetic, but that’s not GCC’s fault.)

For the record, with the #define changed, the function looks like the following — sorry, no comments on this one:

tv2ns:          pushl  %edi
tv2ns+1:        pushl  %esi
tv2ns+2:        pushl  %ebx
tv2ns+3:        subl   $0x8,%esp
tv2ns+6:        movl   0x18(%esp),%ecx
tv2ns+0xa:      movl   0x4(%ecx),%eax
tv2ns+0xd:      leal   (%eax,%eax,4),%eax
tv2ns+0x10:     leal   (%eax,%eax,4),%eax
tv2ns+0x13:     leal   (%eax,%eax,4),%ebx
tv2ns+0x16:     shll   $0x3,%ebx
tv2ns+0x19:     movl   %ebx,%esi
tv2ns+0x1b:     sarl   $0x1f,%esi
tv2ns+0x1e:     movl   $0x3b9aca00,%edi
tv2ns+0x23:     movl   (%ecx),%eax
tv2ns+0x25:     imull  %edi
tv2ns+0x27:     movl   %eax,(%esp)
tv2ns+0x2a:     movl   %edx,0x4(%esp)
tv2ns+0x2e:     addl   %ebx,%eax
tv2ns+0x30:     adcl   %esi,%edx
tv2ns+0x32:     addl   $0x8,%esp
tv2ns+0x35:     popl   %ebx
tv2ns+0x36:     popl   %esi
tv2ns+0x37:     popl   %edi
tv2ns+0x38:     ret    

Maybe one day I’ll rummage through my brain and dig up other times that GCC is outsmarted me and blahg about them. :)

Greetings from Nexenta

In case you missed it, back in mid-2011 I discovered Illumos and OpenIndiana. At that point, I already missed hacking on the (Linux) kernel. Based on my blahg posts [1,2], it shouldn’t surprise you that it didn’t take long before I wanted to hack on the Illumos kernel…and so I did.

If you ever contributed to an open source project in your free time while employed full-time, you understand that there’s only so much time you can devote to the open source project and therefore there is only so much you can do.

A couple of months ago, I decided to explore the possibility of working full-time on Illumos. There are only a handful of companies that visibly participate in the Illumos ecosystem, but their use of Illumos is pretty varied (from public clouds to virtualized databases to SAN/NAS appliances). As of this past Tuesday (Monday was a holiday), I’m at Nexenta. At least for now, I’m working remotely (from Ann Arbor) with the fine folks in the Wikipedia article: Lowell office. It feels great to work on open source again.

x2APIC, IOMMU, Illumos

About a week ago, I hinted at a boot hang I was debugging. I’ve made some progress with it, and along the way I found some interesting things about which I’ll blog over the next few days. Today, I’m going to talk about the Wikipedia article: APIC, xAPIC, and Wikipedia article: x2APIC and how they’re handled in Illumos.

APIC, xAPIC, x2APIC

I strongly suggest you become at least a little familiar with APIC architecture before reading on. The Wikipedia articles above are a good start.

First things first, we need some definitions. APIC can refer to either the architecture or to very old (pre-Pentium 4) implementation. Since I’m working with a Sandy Bridge, I’m going to use APIC to refer to the architecture and completely ignore that these chips existed. Everything they do is a subset of xAPIC. xAPIC is an extension to APIC. xAPIC chips started showed up in NetBurst architecture Intel CPUs (i.e., Pentium 4). xAPIC included some goodies such as upping the limit on the number of CPUs to 256 (from 16). x2APIC is an extension to xAPIC. x2APIC chips started appearing around the same time Sandy Bridge systems started showing up. It is a major update to how interrupts are handled, but as with many things in the PC industry the x2APIC is fully backwards compatible with xAPICs. x2APIC includes some goodies such as upping the limit on the number of CPUs to 232.

Regardless of which exact flavor you happen to use, you will find two components: the local APIC and I/O APIC. Each processor gets their own local APIC and I/O buses get I/O APICs. I/O APICs can service more than one device, and in fact many systems have only one I/O APIC.

The xAPIC uses Wikipedia article: MMIO to program the local and I/O APICs.

x2APIC has two mode of operation. First, there is the xAPIC compatibility mode which makes the x2APIC behave just like an xAPIC. This mode doesn’t give you all the new bells and whistles. Second, there is the new x2APIC mode. In this mode, the APIC is programmed using Wikipedia article: MSRs.

One interesting fact about x2APIC is that it requires an Wikipedia article: iommu. My Sandy Bridge laptop has an Intel iommu as part of the VT-d feature.

Illumos /etc/mach

x2APIC in Illumos has two APIC drivers. First, there is pcplusmp which knows how to handle APIC and xAPIC. Second, there is apix which targets x2APIC, but knows how to operate it in both modes. On boot, the kernel consults /etc/mach to get a list of machine specific modules to try to load. Currently, the default contents (trimmed for display here) are:

#
# CAUTION!  The order of modules specified here is very important. If the
# order is not correct it can result in unexpected system behavior. The
# loading of modules is in the reverse order specified here (i.e. the last
# entry is loaded first and the first entry loaded last).
#
pcplusmp
apix
xpv_psm

Since I’m not running Xen, xpv_psm will fail to load, and apix gets its chance to load.

pcplusmp + apix Code Sharing

The code in these two modules can be summarized with a word: mess. Following what happens when would be enough of an adventure. The code for the two modules lives in four directories: usr/src/uts/i86pc/io, usr/src/uts/i86pc/io/psm, usr/src/uts/i86pc/io/pcplusmp, and usr/src/uts/i86pc/io/apix. But the sharing isn’t as straight forward as one would hope.

Directory pcplusmp apix
i86pc/io mp_platform_common.c, mp_platform_misc.c, hpet_acpi.c mp_platform_common.c, hpet_acpi.c
i86pc/io/psm psm_common.c psm_common.c
i86pc/io/pcplusmp * apic_regops.c, apic_common.c, apic_timer.c
i86pc/io/apix *

This is of course not clear at all when you look at the code. (Reality is a bit messier because of the i86xpv platform which uses some of the i86pc source.)

apix_probe

When the apix module gets loaded, its probe function (apix_probe) is called. This is the place where the module decides if the hardware is worthy. Specifically, if it finds that the CPU reports x2APIC support via Wikipedia article: cpuid, it goes on to call the common APIC probe code (apic_probe_common). Unless that fails, the system will use the apix module — even if there is no iommu and therefore the x2APIC needs to operate in xAPIC mode.

What mode are you using? Easy, just check the apic_mode global in the kernel:

# echo apic_mode::whatis | mdb -k
fffffffffbd0ee4c is apic_mode, in apix's data segment
# echo apic_mode::print | mdb -k
0x2

2 (LOCAL_APIC) indicates xAPIC mode, while 3 (LOCAL_X2APIC) indicates x2APIC mode.

Because this part is as clear as mud, I made a table that tells you what module and mode to expect given your hardware, what CPUID says, and the presence and state of the iommu.

APIC hw CPUID IOMMU IOMMU state Module apic_mode
xAPIC off pcplusmp LOCAL_APIC
x2APIC off pcplusmp LOCAL_APIC
x2APIC on absent apix LOCAL_APIC
x2APIC on present off apix LOCAL_APIC
x2APIC on present on apix LOCAL_X2APIC

Defaults

I’ve never seen apic_mode equal to LOCAL_X2APIC in the wild. This was very puzzling. Yesterday, I discovered why. As I mentioned earlier, in order for the x2APIC to operate in x2APIC mode an iommu is required. Long story short, the default config that Illumos ships disables iommus on boot. Specifically:

$ cat /platform/i86pc/kernel/drv/rootnex.conf | grep -v '^\(#.*\|\)$'
immu-enable="false";

In order to get LOCAL_X2APIC mode, you need to set:

immu-enable="true";
immu-intrmap-enable="true";

Once you put those into the config file, update you boot archive and reboot. You should be set… except the iommu support in Illumos is… shall we say… poor.

(I should point out that it is possible for the BIOS to enable x2APIC mode before handing control off to the OS. This is pretty rare unless you have a really big x86 system.)

1394

It would seem that the hci1394 driver doesn’t quite know how to deal with an iommu “messing” with its I/Os and its interrupt service routine shuts down the driver. (On a debug build it throws is ASSERT(0) for good measure.) I just disabled 1394 in the BIOS since I don’t have any Firewire devices handy and therefore no use for the port at the moment.

immu-enable Details

In case you want to know how iommu initialization affects the apix initialization…

During boot, immu_init gets called to initialize iommus. If the config option (immu-enable) is not true, the function just returns instead of calling immu_subsystems_setup which calls immu_intrmap_setup which sets psm_vt_ops to non-NULL value.

Later on, when apix is loaded and is initializing itself in apix_picinit, it calls apic_intrmap_init. This function does nothing if psm_vt_ops are NULL.

The Hang

I might as well tell you a bit about my progress on tracking down the hang. It happens only if I’m using the apix module and I allow deep C states in the idle thread (technically, it could also be an mwait related issue since I cannot disable just mwait without disabling deep C states). It does not matter if the apic_mode is LOCAL_APIC or LOCAL_X2APIC.

Assorted Documentation

  1. Intel 64 Architecture x2APIC Specification
  2. Intel MP Spec 1.4

Powered by blahgd