Josef “Jeff” Sipek

Creative xor Use

Last month at work I got to try to optimize a function that takes a number and rounds it up to the next power of 2. The previous implementation used a simple loop. I didn’t dive into obscure bit twiddling, but rather used a helper function that is already in the codebase. Yes, I let the compiler do the heavy lifting of turning easy-to-understand code into good machine code. The x86 binary that gcc 6.3 produced has an interesting idiom, and that’s why I’m writing this entry.
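The old code isn’t shown here, but a loop-based version of this function typically looks something like the following (my reconstruction for illustration, not the code that actually got replaced):

#include <stdint.h>

uint32_t nearest_power_loop(uint32_t num)
{
	uint32_t val = 1;

	/*
	 * Keep doubling until we reach or exceed num.  Like the
	 * simplified code below, this glosses over inputs above 1U << 31.
	 */
	while (val < num)
		val <<= 1;

	return val;
}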

The new code:

#include <stdint.h>

static inline unsigned int bits_required32(uint32_t num)
{
	return num == 0 ? 0 : 32 - __builtin_clz(num);
}

/* Returns x, such that x is the smallest power of 2 >= num. */
uint32_t nearest_power(uint32_t num)
{
	if (num == 0)
		return 1;

	return 1U << bits_required32(num - 1);
}

This is a slightly simplified version of the code, but it demonstrates the optimization quite well.

The nearest_power function disassembles as:

nearest_power()
    nearest_power:      8b 54 24 04        movl   0x4(%esp),%edx
    nearest_power+0x4:  b8 01 00 00 00     movl   $0x1,%eax
    nearest_power+0x9:  85 d2              testl  %edx,%edx
    nearest_power+0xb:  74 14              je     +0x14	<nearest_power+0x21>
    nearest_power+0xd:  83 ea 01           subl   $0x1,%edx
    nearest_power+0x10: 74 0f              je     +0xf	<nearest_power+0x21>
    nearest_power+0x12: 0f bd d2           bsrl   %edx,%edx
    nearest_power+0x15: b9 20 00 00 00     movl   $0x20,%ecx
    nearest_power+0x1a: 83 f2 1f           xorl   $0x1f,%edx
    nearest_power+0x1d: 29 d1              subl   %edx,%ecx
    nearest_power+0x1f: d3 e0              shll   %cl,%eax
    nearest_power+0x21: c3                 ret    

The first 6 instructions contain the prologue and deal with num being zero or one—both cases produce the result 1. The remaining 6 instructions make up the epilogue and are where the calculation happens. I’m going to ignore the first half of the function, since the second half is where the interesting things happen.

First, we find the index of the highest set bit in num - 1 (that’s what bsrl computes; gcc uses it to implement __builtin_clz) and stash the value 32 in a register:

    nearest_power+0x12: 0f bd d2           bsrl   %edx,%edx
    nearest_power+0x15: b9 20 00 00 00     movl   $0x20,%ecx

The bit index (%edx) is in the range 0–31.

Here is the really interesting bit:

    nearest_power+0x1a: 83 f2 1f           xorl   $0x1f,%edx

This xors the bit index (i.e., 0–31) with 31, which converts it into the number of leading zeros (for a 32-bit value, clz(x) = 31 - bsr(x)). To decipher what the xor does, I find it easier to consider the top 27 bits and the bottom 5 bits separately.

operand  binary
0x1f     00000000 00000000 00000000 000 11111
edx      00000000 00000000 00000000 000 xxxxx

The xor of the top bits produces 0 since both the constant 31 and the register containing any of the numbers 0–31 have zeros there.

The xor of the bottom bits negates them since the constant has ones there.

When combined, the xor has the same effect as this C expression:

out = (~in) & 0x1f;

This seems very weird and useless, but it is far from it. It turns out that for inputs 0–31 the above expression is the same as:

out = 31 - in;
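If you want to convince yourself of the equivalence, a quick brute-force check over all 32 possible inputs does the trick (a standalone snippet, not from the original code):

#include <assert.h>

int main(void)
{
	/* for every in in 0-31: xor with 31 == complement-and-mask == 31 - in */
	for (unsigned int in = 0; in < 32; in++) {
		assert((in ^ 0x1f) == ((~in) & 0x1f));
		assert((in ^ 0x1f) == 31 - in);
	}
	return 0;
}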

I think it is really cool that gcc produced this xor instead of a less optimal multi-instruction version.

The remainder of the disassembly just subtracts and shifts to produce the return value.

Why xor?

I think the reason gcc (and clang for that matter) produce this sort of xor instruction instead of a subtraction is very simple: x86’s sub is a two-operand instruction, so the left hand side of the subtraction and the destination must be the same location. That is, on x86 the sub instruction works as:

x -= y;

Since the minuend gets overwritten, it isn’t possible to express out = 31 - in using just one sub.
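A subtraction-based version would need to load the constant into the destination first, something like:

    movl   $0x1f,%ecx
    subl   %edx,%ecx

leaving 31 - %edx in %ecx: two instructions (and an extra register) where the xorl needs just one.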

Anyway, that’s it for today. I hope you enjoyed this as much as I did.

2015-11-16

Tail Call Optimization

I just had an interesting realization about tail call optimization. Often when people talk about it, they simply describe it as an optimization that the compiler does whenever you end a function with a function call whose return value is propagated up as is. Technically this is true. Practically, people use examples like this:

int foo(int increment)
{
	if (increment)
		return bar() + 1; /* NOT a tail-call */
	
	return bar(); /* a tail-call */
}

It sounds like a very solid example of a tail-call vs. not: if you “post process” the value before returning, it is not a tail-call.

Going back to my realization, I think people often forget about one type of “post processing” — casts. Consider the following code:

extern short bar();

int foo()
{
        return bar();
}

Did you spot it? This is not a tail-call.

The integer promotion from short to int is done after bar returns but before foo returns.
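In other words, the compiler must act as if the function had been written like this (an equivalent rewrite for illustration):

extern short bar();

int foo()
{
	short tmp = bar();	/* the call itself */
	return (int)tmp;	/* the promotion, written out explicitly;
				   this is the "post processing" that
				   kills the tail-call */
}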

For fun, here’s the disassembly:

$ gcc -Wall -O2 -c test.c
$ dis test.o
...
foo()
    foo:     55                 pushl  %ebp
    foo+0x1: 89 e5              movl   %esp,%ebp
    foo+0x3: 83 ec 08           subl   $0x8,%esp
    foo+0x6: e8 fc ff ff ff     call   -0x4	<foo+0x7>
    foo+0xb: c9                 leave  
    foo+0xc: 98                 cwtl   
    foo+0xd: c3                 ret    

For completeness, if we change the return value of bar to int:

$ gcc -Wall -O2 -c test.c
$ dis test.o
...
foo()
    foo:       e9 fc ff ff ff     jmp    -0x4	<foo+0x1>

I wonder how many people think they are tail-call optimizing, but in reality they are “wasting” a stack frame because of this silly reason.

Practical and Portable x86 Recompilation

About a month ago, I stumbled across this fascinating blog post. I finally got around to sharing it here on my blahg.

Practical and Portable x86 Recompilation

Inline Assembly & clang

Recently I talked about inline assembly with GCC and clang where I pointed out that LLVM seems to produce rather silly machine code. In a comment, a reader asked if this was LLVM’s IR doing this or if it was the machine code generator being silly. I was going to reply there, but the reply got long enough to deserve its own post.

I dealt with LLVM’s IR for a couple of months during the fall of 2010. It was both interesting and quite painful.

The IR is at the Wikipedia article: static single assignment level. It assumes that stack space is cheap and infinite. Since it is an SSA form, it has no notion of registers. The optimization passes transform the IR quite a bit, and at the end there is very little (if any!) useless code. In other words, I think it is the machine code generator that is responsible for the unnecessary stack frame push and pop. With that said, it is time to experiment.

Using the same test program as before, of course:

#define _KERNEL
#define _ASM_INLINES
#include <sys/atomic.h>

void test(uint32_t *x)
{
	atomic_inc_32(x);
}

Emitting LLVM IR

Let’s compile it with clang, passing in the -emit-llvm option to have it generate a test.ll file with the LLVM IR:

$ clang -S -emit-llvm -Wall -O2 -m64 test.c

There is a fair amount of “stuff” in the file, but the relevant portions are (line-wrapped by me):

; Function Attrs: nounwind
define void @test(i32* %x) #0 {
entry:
  tail call void asm sideeffect "lock; incl $0",
    "=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1
  ret void
}

attributes #0 = { nounwind uwtable "less-precise-fpmad"="false"
  "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"
  "no-infs-fp-math"="false" "no-nans-fp-math"="false"
  "stack-protector-buffer-size"="8" "unsafe-fp-math"="false"
  "use-soft-float"="false" }

LLVM’s IR happens to be very short and to the point. The function prologue and epilogue are not expressed as part of the IR blob that gets passed to the machine code generator. Note the function attribute no-frame-pointer-elim being true (meaning frame pointer elimination will not happen).

Now, let’s add in the -fomit-frame-pointer option.

$ clang -S -emit-llvm -Wall -O2 -m64 -fomit-frame-pointer test.c

Now, the relevant IR pieces are:

; Function Attrs: nounwind
define void @test(i32* %x) #0 {
entry:
  tail call void asm sideeffect "lock; incl $0",
    "=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1
  ret void
}

attributes #0 = { nounwind uwtable "less-precise-fpmad"="false"
  "no-frame-pointer-elim"="false" "no-infs-fp-math"="false"
  "no-nans-fp-math"="false" "stack-protector-buffer-size"="8"
  "unsafe-fp-math"="false" "use-soft-float"="false" }

The no-frame-pointer-elim attribute changed (from true to false), but the IR of the function itself did not change. (The no-frame-pointer-elim-non-leaf attribute disappeared as well, but it really makes sense since -fomit-frame-pointer is a rather large hammer that just forces frame pointer elimination everywhere and so it doesn’t make sense to differentiate between leaf and non-leaf functions.)

So, to answer Steve’s question, the LLVM IR does not include the function prologue and epilogue. This actually makes a lot of sense given that the IR is architecture independent and the exact details of what the prologue has to do are defined by the ABI.

IR to Assembly

We can of course use llc to convert the IR into real 64-bit x86 assembly code.

$ llc --march=x86-64 test.ll
$ gas -o test.o --64 test.s

Here is the disassembly for clang invocation without -fomit-frame-pointer:

test()
    test:     55                 pushq  %rbp
    test+0x1: 48 89 e5           movq   %rsp,%rbp
    test+0x4: f0 ff 07           lock incl (%rdi)
    test+0x7: 5d                 popq   %rbp
    test+0x8: c3                 ret    

And here is the disassembly for clang invocation with -fomit-frame-pointer:

test()
    test:     f0 ff 07           lock incl (%rdi)
    test+0x3: c3                 ret    

Conclusion

So, it turns out that my previous post simply stumbled across the fact that GCC and clang enable different sets of optimizations at -O2. GCC includes -fomit-frame-pointer by default, while clang does not.
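For what it’s worth, you can ask GCC itself which optimizations are in effect at a given level and look for -fomit-frame-pointer in the output (this is stock GCC functionality; the exact listing varies by version):

$ gcc -Q -O2 --help=optimizers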

Inline Assembly & GCC, clang

Recently, I got to write a bit of inline assembly. In the process, I tested my changes by making a small C file that defined a test function calling the inline function from the header. Then I could look at the disassembly to verify all was well.

#define _KERNEL
#define _ASM_INLINES
#include <sys/atomic.h>

void test(uint32_t *x)
{
	atomic_inc_32(x);
}

GCC has been my go-to compiler for a long time now. So, at first I was using it to debug my inline assembly. I compiled the test program using:

$ gcc -Wall -O2 -m64 -c test.c

Disassembling the object file yields the rather obvious:

test()
    test:     f0 ff 07           lock incl (%rdi)
    test+0x3: c3                 ret    

I can’t think of any way to make it better :)

Then, at some point I remembered that Clang/LLVM are pretty good as well. I compiled the same file with clang:

$ clang -Wall -O2 -m64 -c test.c

The result was rather disappointing:

test()
    test:     55                 pushq  %rbp
    test+0x1: 48 89 e5           movq   %rsp,%rbp
    test+0x4: f0 ff 07           lock incl (%rdi)
    test+0x7: 5d                 popq   %rbp
    test+0x8: c3                 ret    

For whatever reason, Clang feels the need to push/pop the frame pointer. I did a little bit of searching, and I couldn’t find a way to disable this behavior.

The story for 32-bit output is very similar (just drop the -m64 from the compiler invocation). GCC produced the superior output:

test()
    test:     8b 44 24 04        movl   0x4(%esp),%eax
    test+0x4: f0 ff 00           lock incl (%eax)
    test+0x7: c3                 ret    

Clang, meanwhile, still wanted to muck around with the frame pointer:

test()
    test:     55                 pushl  %ebp
    test+0x1: 89 e5              movl   %esp,%ebp
    test+0x3: 8b 45 08           movl   0x8(%ebp),%eax
    test+0x6: f0 ff 00           lock incl (%eax)
    test+0x9: 5d                 popl   %ebp
    test+0xa: c3                 ret    

For the curious ones, I’m using GCC 4.8.3 and Clang 3.4.2.

I realize this is a bit of a special case (how often do you make a function that simply calls an inline function?), but it makes me worried about what sort of sub-optimal code Clang produces in other cases.

Segment Drivers

Lately, I started poking around the Illumos memory management code. As I’ve done in the past, I decided to use this blahg as a place to document some of my discoveries.

Memory Layout

In Illumos (and Solaris), address spaces are managed as sets of segments. Each segment has a base address, a length, and a number of other properties. This is true for process memory as well as kernel memory. Do not confuse these segments with the Wikipedia article: memory segmentation that processors like Wikipedia article: x86 provide.

Each process has its own struct as:

> ::pgrep vim
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  10852  10777  10850  10777    101 0x4a004000 ffffff0411e1c0a0 vim
> ffffff0411e1c0a0::print proc_t p_as | ::print struct as a_segtree
a_segtree = {
    a_segtree.avl_root = 0xffffff03f7c62ea8
    a_segtree.avl_compar = as_segcompar
    a_segtree.avl_offset = 0x20
    a_segtree.avl_numnodes = 0x18
    a_segtree.avl_size = 0x60
}

The kernel address space is maintained in the kas global:

> kas::print a_segtree
a_segtree = {
    a_segtree.avl_root = kvseg+0x20
    a_segtree.avl_compar = as_segcompar
    a_segtree.avl_offset = 0x20
    a_segtree.avl_numnodes = 0x9
    a_segtree.avl_size = 0x60
}

(Once upon a time this set of segments was a linked list, but for a long while now it has been an AVL tree indexed by the base address.)

Regardless of which address space we’re dealing with, the same rules apply: segments represent contiguous regions within the address space. Each segment can represent a different type of memory. For example, walking the kernel address space segment tree yields nine different segments of four different types (kpm, kmem, kp, and map):

> kas::print a_segtree | ::walk avl | ::printf "%p.%016x %a\n" "struct seg" s_base s_size s_ops
fffffe0000000000.000000031e000000 segkpm_ops
ffffff0000000000.0000000017000000 segkmem_ops
ffffff0017000000.0000000080000000 segkp_ops
ffffff0097000000.00000002fca00000 segkmem_ops
ffffff03d3a00000.0000000004000000 segmap_ops
ffffff03d7a00000.000000fbe8600000 segkmem_ops
ffffffffc0000000.000000003b7fb000 segkmem_ops
fffffffffb800000.0000000000550000 segkmem_ops
ffffffffff800000.0000000000400000 segkmem_ops
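The s_base, s_size, and s_ops fields used in that ::printf invocation are the heart of struct seg. As a rough sketch (abbreviated and from memory; see usr/src/uts/common/vm/seg.h for the authoritative definition):

struct seg {
	caddr_t		s_base;		/* base virtual address */
	size_t		s_size;		/* size in bytes */
	struct as	*s_as;		/* address space this segment belongs to */
	avl_node_t	s_tree;		/* node in the address space's segment tree */
	struct seg_ops	*s_ops;		/* segment driver operations vector */
	void		*s_data;	/* per-driver private data */
};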

Segment Drivers

Illumos comes with seven different architecture- and platform-independent segment drivers. A segment driver is a “driver” that implements a set of operations for managing a segment of memory. That is, each segment type can handle page faults, page locking, sync operations, etc. differently.

For example, suppose that a page fault occurs because a process tried to load a value from a page that lacks a page table entry. The platform specific (assembly) fault handling code gets invoked by the processor. After doing a little bit of work, it calls into the generic (C) fault handling code, as_fault. There, the segtree AVL tree is consulted and the corresponding segment’s fault operation gets invoked.
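A segment driver, then, is mostly a filled-in operations vector. A heavily abbreviated sketch (names from memory of the Illumos sources; the full vector in seg.h has many more entries):

struct seg_ops {
	int		(*dup)(struct seg *, struct seg *);
	int		(*unmap)(struct seg *, caddr_t, size_t);
	void		(*free)(struct seg *);
	faultcode_t	(*fault)(struct hat *, struct seg *, caddr_t, size_t,
			    enum fault_type, enum seg_rw);
	int		(*setprot)(struct seg *, caddr_t, size_t, uint_t);
	int		(*sync)(struct seg *, caddr_t, size_t, int, uint_t);
	/* ... locking, swap, pagesize, policy ops, etc. ... */
};

So, for the page fault above, as_fault effectively ends up doing seg->s_ops->fault(hat, seg, addr, len, type, rw).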

(Solaris Internals lists 12 and 11 segment drivers, respectively, in the two editions.) In Illumos, the seven common segment drivers are:

seg_dev
  Most of the time, userspace processes do not need to map devices into their address space. In the rare case when a process does want a device mapped (e.g., Xorg), the dev segment driver maintains that mapping.
seg_kmem
  This segment driver maps the kernel heap, module text, and all early boot memory. (code)
seg_kp
  In general, kernel memory is not pageable. In the rare case that something can live in pageable kernel memory, this segment driver maintains the anonymous page mappings.
seg_kpm
  If possible (i.e., on a 64-bit system), the kpm segment driver maps all physical memory into the kernel’s address space. This lets the kernel avoid setting up temporary mappings to operate on physical memory. (code)
seg_map
  The map segment driver is a kernel-only, higher performance version of the vn segment driver. (See below.)
seg_spt
  This segment driver is responsible for maintaining SysV shared memory segments. (Not to be confused with POSIX shared memory.)
seg_vn
  Memory mapped files are handled by the vn segment driver. This includes both regular files and anonymous memory.

There are also two platform specific segment drivers:

seg_mf (i86xpv only)
  This segment driver is only used by dom0 processes (read: Xen) to map pages from other domains.
seg_nf (sparc v9 only)
  The header for the file says that it is for non-faulting loads. I don’t actually know what exactly it is for. (And I don’t care enough to dig deeper given that it is Sparc specific.)

The Reality

This is a lot of different segment drivers. Are all of them used all the time? Well, sort of. The mdb output earlier shows that the (amd64) kernel uses only four different segment drivers (kpm, kmem, kp, and map). A typical userspace process is very boring — it is only made up of vn segments. There are, however, exceptions. For instance, Xorg uses vn and dev. This accounts for six of the seven drivers. The last common segment driver is spt, which provides System V shared memory. (I talked about SysV shared memory previously.) So, on a 64-bit x86 system, all seven common segment drivers are in use.

The story is a bit different on 32-bit kernels. Since a 32-bit system has a much smaller address space, the kernel tries to eliminate a number of mappings. Here is the list of segments in a 32-bit kernel:

> kas::print a_segtree | ::walk avl | ::printf "%p %a\n" "struct seg" s_base s_ops
b5802000 segmap_ops
b6800000 segkmem_ops
ef400000 segkmem_ops
fe800000 segkmem_ops
ff000000 segkmem_ops

As you can see, the kp and kpm segments went away. While at first this is surprising, it actually makes perfect sense. When thinking about memory there are two “types” to consider: physical and virtual. In theory, the MMU lets you have more virtual memory than physical memory, but in reality this is only true on 64-bit systems. Physical memory sizes outgrew 4 GB a number of years ago, and therefore a 32-bit address space can trivially be 100% backed by physical memory. In other words, 32-bit address spaces are tight on virtual memory, while 64-bit address spaces are “tight” on physical memory.

Let’s consider the disappearance of the kp segment on 32 bits. What does kp let us do? It lets us oversubscribe physical memory by backing some virtual memory with disk space. On 32-bit systems we have enough physical memory to back all the virtual memory in the kernel, so we have no use for it. (Yes, the kernel could still page parts of itself out, but kernel text and data are generally considered important enough to keep in non-pageable memory. The memory utilization more than pays for itself through the performance improvement of not having the kernel paged out.)

As I stated before, kpm segments map physical memory into the kernel’s address space for performance reasons (without it the kernel would have to temporarily map a page to access the contents). Therefore, they are good candidates for removal when it comes to slimming down the kernel’s address space demands. (Well, the actual story is the other way… the introduction of 64-bit capable hardware allowed kpm segments to exist to improve kernel performance.)

GCC Optimizations

Recently, I’ve been given a hang bug to work on. This led me to another bug related to timing, which pushed me to clean up a time related #define, which uncovered at least two bugs. Got all that? Good. The rest of this post is going to talk about the changed define, and one of the “at least two bugs”. When I talk about GCC, I’m talking about the Illumos-specific GCC version 4.4.4. (Illumos needs a couple of features that stock GCC doesn’t provide.)

The #define change I’m hoping to make is very simple:

diff --git a/usr/src/uts/common/sys/time.h b/usr/src/uts/common/sys/time.h
--- a/usr/src/uts/common/sys/time.h
+++ b/usr/src/uts/common/sys/time.h
@@ -234,7 +234,7 @@ struct itimerval32 {
 #define        SEC             1
 #define        MILLISEC        1000
 #define        MICROSEC        1000000
-#define        NANOSEC         1000000000
+#define        NANOSEC         1000000000ll
 
 #define        MSEC2NSEC(m)    ((hrtime_t)(m) * (NANOSEC / MILLISEC))
 #define        NSEC2MSEC(n)    ((n) / (NANOSEC / MILLISEC))

Without it, multiplying by NANOSEC will cause integer overflow issues on ILP32 and LP64 systems (read: basically everywhere).
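To see the difference the suffix makes, here is a tiny standalone demonstration (not Illumos code, just the suffix change in miniature):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int32_t sec = 5;	/* stand-in for a 32-bit time_t */

	/*
	 * int * int: the multiply happens in 32 bits and overflows
	 * (formally undefined behavior; in practice it wraps).
	 */
	int64_t bad  = sec * 1000000000;

	/* int * long long: the multiply happens in 64 bits */
	int64_t good = sec * 1000000000ll;

	printf("%lld vs %lld\n", (long long)bad, (long long)good);
	return 0;
}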

One of the “at least two bugs” involves a simple (buggy) function aptly named tv2ns, as it converts a struct timeval to a 64-bit nanosecond count:

static int64_t
tv2ns(struct timeval *tvp)
{
	return (tvp->tv_sec * NANOSEC + tvp->tv_usec * 1000);
}

At first glance, this function looks correct. The flaw is that the first portion of the expression multiplies a time_t (32-bit signed int) by an int (also 32-bit signed), making that subexpression a 32-bit signed value. With NANOSEC changed to a long long, everything works as expected. Now, the fun part… disassembling this function without the fix. You don’t have to be an expert to see that this function is strangely repetitive. I’ve annotated the assembly.

tv2ns:          movl   0x4(%esp),%eax     ; eax = tvp
tv2ns+4:        movl   0x4(%eax),%edx     ; edx = tvp->tv_usec
tv2ns+7:        leal   (%edx,%edx,4),%edx ; edx = edx + 4 * edx
tv2ns+0xa:      leal   (%edx,%edx,4),%edx ;     = 5 * edx
tv2ns+0xd:      leal   (%edx,%edx,4),%edx
;
; at this point:  edx = 5 * 5 * 5 * tvp->tv_usec,
; which is the same as: 125 * tvp->tv_usec
;
tv2ns+0x10:     movl   (%eax),%eax        ; eax = tvp->tv_sec
tv2ns+0x12:     leal   (%eax,%eax,4),%eax ; eax = eax + 4 * eax
tv2ns+0x15:     leal   (%eax,%eax,4),%eax ;     = 5 * eax
tv2ns+0x18:     leal   (%eax,%eax,4),%eax
tv2ns+0x1b:     leal   (%eax,%eax,4),%eax
tv2ns+0x1e:     leal   (%eax,%eax,4),%eax
tv2ns+0x21:     leal   (%eax,%eax,4),%eax
tv2ns+0x24:     leal   (%eax,%eax,4),%eax
tv2ns+0x27:     leal   (%eax,%eax,4),%eax
tv2ns+0x2a:     leal   (%eax,%eax,4),%eax
;
; at this point,  eax = 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * tvp->tv_sec,
; which is the same as: 1953125 * tvp->tv_sec
;
tv2ns+0x2d:     shll   $0x9,%eax          ; eax <<= 9
;
; eax = (1953125 * tvp->tv_sec) << 9,
; which surprisingly ends up being the same as: 1000000000 * tvp->tv_sec
;
; so, now we have 'eax' with the tv_sec converted to nanoseconds and 'edx'
; with 125 * tv_usec
;
tv2ns+0x30:     leal   (%eax,%edx,8),%eax ; eax = eax + 8 * edx
;
; 8 * 125 = 1000, which is the factor to convert tv_usec to nanoseconds!
;
tv2ns+0x33:     cltd                      ; sign-extend eax to edx:eax
tv2ns+0x34:     ret    

I found it interesting that GCC decided to emit leal instructions to multiply by 5 and then finish it off with a shift and another leal. This is another one of those times when I realize that the compiler is smarter than me. (The sign-extension of course happens too late — all the math needs to happen as 64-bit arithmetic, but that’s not GCC’s fault.)

For the record, with the #define changed, the function looks like the following — sorry, no comments on this one:

tv2ns:          pushl  %edi
tv2ns+1:        pushl  %esi
tv2ns+2:        pushl  %ebx
tv2ns+3:        subl   $0x8,%esp
tv2ns+6:        movl   0x18(%esp),%ecx
tv2ns+0xa:      movl   0x4(%ecx),%eax
tv2ns+0xd:      leal   (%eax,%eax,4),%eax
tv2ns+0x10:     leal   (%eax,%eax,4),%eax
tv2ns+0x13:     leal   (%eax,%eax,4),%ebx
tv2ns+0x16:     shll   $0x3,%ebx
tv2ns+0x19:     movl   %ebx,%esi
tv2ns+0x1b:     sarl   $0x1f,%esi
tv2ns+0x1e:     movl   $0x3b9aca00,%edi
tv2ns+0x23:     movl   (%ecx),%eax
tv2ns+0x25:     imull  %edi
tv2ns+0x27:     movl   %eax,(%esp)
tv2ns+0x2a:     movl   %edx,0x4(%esp)
tv2ns+0x2e:     addl   %ebx,%eax
tv2ns+0x30:     adcl   %esi,%edx
tv2ns+0x32:     addl   $0x8,%esp
tv2ns+0x35:     popl   %ebx
tv2ns+0x36:     popl   %esi
tv2ns+0x37:     popl   %edi
tv2ns+0x38:     ret    

Maybe one day I’ll rummage through my brain, dig up other times that GCC has outsmarted me, and blahg about them. :)

x2APIC, IOMMU, Illumos

About a week ago, I hinted at a boot hang I was debugging. I’ve made some progress with it, and along the way I found some interesting things about which I’ll blog over the next few days. Today, I’m going to talk about the Wikipedia article: APIC, xAPIC, and Wikipedia article: x2APIC and how they’re handled in Illumos.

APIC, xAPIC, x2APIC

I strongly suggest you become at least a little familiar with APIC architecture before reading on. The Wikipedia articles above are a good start.

First things first, we need some definitions. APIC can refer to either the architecture or to the very old (pre-Pentium 4) implementation. Since I’m working with a Sandy Bridge, I’m going to use APIC to refer to the architecture and completely ignore that the old chips existed; everything they do is a subset of xAPIC. xAPIC is an extension to APIC. xAPIC chips started showing up in NetBurst architecture Intel CPUs (i.e., Pentium 4). xAPIC included some goodies such as upping the limit on the number of CPUs to 256 (from 16). x2APIC is an extension to xAPIC. x2APIC chips started appearing around the same time Sandy Bridge systems started showing up. It is a major update to how interrupts are handled, but as with many things in the PC industry, the x2APIC is fully backwards compatible with xAPIC. x2APIC includes some goodies such as upping the limit on the number of CPUs to 2^32 (from 256).

Regardless of which exact flavor you happen to use, you will find two components: the local APIC and the I/O APIC. Each processor gets its own local APIC, and I/O buses get I/O APICs. An I/O APIC can service more than one device; in fact, many systems have only one I/O APIC.

The xAPIC uses Wikipedia article: MMIO to program the local and I/O APICs.

x2APIC has two modes of operation. First, there is the xAPIC compatibility mode, which makes the x2APIC behave just like an xAPIC. This mode doesn’t give you all the new bells and whistles. Second, there is the new x2APIC mode. In this mode, the APIC is programmed using Wikipedia article: MSRs.
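To make the MMIO-vs-MSR difference concrete, here is roughly what reading the local APIC ID looks like in each mode (a sketch based on the Intel SDM, not Illumos code; rdmsr() stands in for whatever MSR-read helper the kernel provides, and the MMIO page is assumed mapped at its physical address):

#include <stdint.h>

#define XAPIC_MMIO_BASE	0xfee00000UL	/* default local APIC base */
#define XAPIC_ID_OFFSET	0x20
#define X2APIC_ID_MSR	0x802

extern uint64_t rdmsr(uint32_t msr);	/* hypothetical helper */

/* xAPIC mode: the ID lives in the top byte of an MMIO register */
static uint32_t xapic_id(void)
{
	volatile uint32_t *reg =
	    (volatile uint32_t *)(XAPIC_MMIO_BASE + XAPIC_ID_OFFSET);
	return *reg >> 24;
}

/* x2APIC mode: the same information comes from an MSR */
static uint32_t x2apic_id(void)
{
	return (uint32_t)rdmsr(X2APIC_ID_MSR);
}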

One interesting fact about x2APIC is that it requires an Wikipedia article: iommu. My Sandy Bridge laptop has an Intel iommu as part of the VT-d feature.

Illumos /etc/mach

Illumos has two APIC drivers. First, there is pcplusmp, which knows how to handle APIC and xAPIC. Second, there is apix, which targets x2APIC but knows how to operate it in both modes. On boot, the kernel consults /etc/mach to get a list of machine specific modules to try to load. Currently, the default contents (trimmed for display here) are:

#
# CAUTION!  The order of modules specified here is very important. If the
# order is not correct it can result in unexpected system behavior. The
# loading of modules is in the reverse order specified here (i.e. the last
# entry is loaded first and the first entry loaded last).
#
pcplusmp
apix
xpv_psm

Since I’m not running Xen, xpv_psm will fail to load, and apix gets its chance to load.

pcplusmp + apix Code Sharing

The code in these two modules can be summarized with one word: mess. Just following what happens when is enough of an adventure. The code for the two modules lives in four directories: usr/src/uts/i86pc/io, usr/src/uts/i86pc/io/psm, usr/src/uts/i86pc/io/pcplusmp, and usr/src/uts/i86pc/io/apix. But the sharing isn’t as straightforward as one would hope.

Directory          pcplusmp                           apix
i86pc/io           mp_platform_common.c,              mp_platform_common.c,
                   mp_platform_misc.c, hpet_acpi.c    hpet_acpi.c
i86pc/io/psm       psm_common.c                       psm_common.c
i86pc/io/pcplusmp  *                                  apic_regops.c, apic_common.c,
                                                      apic_timer.c
i86pc/io/apix                                         *

This is of course not clear at all when you look at the code. (Reality is a bit messier because of the i86xpv platform which uses some of the i86pc source.)

apix_probe

When the apix module gets loaded, its probe function (apix_probe) is called. This is the place where the module decides if the hardware is worthy. Specifically, if it finds that the CPU reports x2APIC support via Wikipedia article: cpuid, it goes on to call the common APIC probe code (apic_probe_common). Unless that fails, the system will use the apix module — even if there is no iommu and therefore the x2APIC needs to operate in xAPIC mode.

What mode are you using? Easy, just check the apic_mode global in the kernel:

# echo apic_mode::whatis | mdb -k
fffffffffbd0ee4c is apic_mode, in apix's data segment
# echo apic_mode::print | mdb -k
0x2

2 (LOCAL_APIC) indicates xAPIC mode, while 3 (LOCAL_X2APIC) indicates x2APIC mode.

Because this part is as clear as mud, I made a table that tells you what module and mode to expect given your hardware, what CPUID says, and the presence and state of the iommu.

APIC hw   CPUID   IOMMU     IOMMU state   Module     apic_mode
xAPIC     off     -         -             pcplusmp   LOCAL_APIC
x2APIC    off     -         -             pcplusmp   LOCAL_APIC
x2APIC    on      absent    -             apix       LOCAL_APIC
x2APIC    on      present   off           apix       LOCAL_APIC
x2APIC    on      present   on            apix       LOCAL_X2APIC

Defaults

I’ve never seen apic_mode equal to LOCAL_X2APIC in the wild. This was very puzzling. Yesterday, I discovered why. As I mentioned earlier, in order for the x2APIC to operate in x2APIC mode an iommu is required. Long story short, the default config that Illumos ships disables iommus on boot. Specifically:

$ cat /platform/i86pc/kernel/drv/rootnex.conf | grep -v '^\(#.*\|\)$'
immu-enable="false";

In order to get LOCAL_X2APIC mode, you need to set:

immu-enable="true";
immu-intrmap-enable="true";

Once you put those into the config file, update your boot archive and reboot. You should be set… except the iommu support in Illumos is… shall we say… poor.

(I should point out that it is possible for the BIOS to enable x2APIC mode before handing control off to the OS. This is pretty rare unless you have a really big x86 system.)

1394

It would seem that the hci1394 driver doesn’t quite know how to deal with an iommu “messing” with its I/Os, and its interrupt service routine shuts down the driver. (On a debug build, it trips an ASSERT(0) for good measure.) I just disabled 1394 in the BIOS since I don’t have any Firewire devices handy and therefore have no use for the port at the moment.

immu-enable Details

In case you want to know how iommu initialization affects the apix initialization…

During boot, immu_init gets called to initialize iommus. If the config option (immu-enable) is not true, the function just returns instead of calling immu_subsystems_setup, which calls immu_intrmap_setup, which sets psm_vt_ops to a non-NULL value.

Later on, when apix is loaded and is initializing itself in apix_picinit, it calls apic_intrmap_init. This function does nothing if psm_vt_ops are NULL.

The Hang

I might as well tell you a bit about my progress on tracking down the hang. It happens only if I’m using the apix module and I allow deep C states in the idle thread (technically, it could also be an mwait-related issue, since I cannot disable just mwait without disabling deep C states). It does not matter whether apic_mode is LOCAL_APIC or LOCAL_X2APIC.

Assorted Documentation

  1. Intel 64 Architecture x2APIC Specification
  2. Intel MP Spec 1.4

CPU Pause Threads

Recently, I ended up debugging a boot hang. (I’m still working on it, so I don’t have a resolution to it yet.) The hang seems to occur during the mp startup. That is, when the boot CPU tries to online all the other CPUs on the system. As a result, I spent a fair amount of time reading the code and poking around with mdb. Given the effort I put in, I decided to document my understanding of how CPUs get brought online during boot in Illumos. In this post, I’ll talk about the CPU pause threads.

Each CPU has a special thread — the pause thread. It is a very high priority thread that’s supposed to preempt everything else on the CPU. If all CPUs are executing this high-priority thread, then we know for a fact that nothing can possibly be dereferencing the CPU structures’ (cpu_t) pointers. Why is this useful? Here’s a comment from right above cpu_pause — the function pause threads execute:

/*
 * This routine is called to place the CPUs in a safe place so that
 * one of them can be taken off line or placed on line.  What we are
 * trying to do here is prevent a thread from traversing the list
 * of active CPUs while we are changing it or from getting placed on
 * the run queue of a CPU that has just gone off line.  We do this by
 * creating a thread with the highest possible prio for each CPU and
 * having it call this routine.  The advantage of this method is that
 * we can eliminate all checks for CPU_ACTIVE in the disp routines.
 * This makes disp faster at the expense of making p_online() slower
 * which is a good trade off.
 */

The pause thread is pointed to by the CPU structure’s cpu_pause_thread member. A new CPU does not have a pause thread until after it has been added to the list of existing CPUs. (cpu_pause_alloc does the actual allocation.)

CPU pausing is pretty strange. First of all, let’s call the CPU requesting other CPUs to pause the controlling CPU, and all online CPUs that will pause the pausing CPUs. (The controlling CPU does not pause itself.) Second, there are two global structures: (1) a global array called safe_list, which contains an 8-bit integer for each possible CPU, where each element holds a value ranging from 0 to 4 (the PAUSE_* states) denoting the state of that CPU’s pause thread, and (2) cpu_pause_info, which contains some additional goodies used for synchronization.

Pausing

To pause CPUs, the controlling CPU calls pause_cpus (which uses cpu_pause_start), where it iterates over all the pausing CPUs setting their safe_list entries to PAUSE_IDLE and queueing up (using setbackdq) their pause threads.

Now, just because the pause threads got queued doesn’t mean that they’ll get to execute immediately. That is why the controlling CPU then waits for each of the pause threads to up a semaphore in the cpu_pause_info structure. Once all the pause threads have upped the semaphore, the controlling CPU sets the cp_go flag to let the pause threads know that it’s time for them to go to sleep. Then the controlling CPU waits for each pause thread to signal (via the safe_list) that they have disabled just about all interrupts and that they are spinning (mach_cpu_pause). At this point, pause_cpus knows that all online CPUs are in a safe place.

Starting

Starting the CPUs back up is pretty easy. The controlling CPU just needs to set all the CPUs’ safe_list entries to PAUSE_IDLE. That causes the pausing CPUs to break out of their spin loop. Once out of the spin loop, interrupts are re-enabled and control of the CPU is relinquished (via swtch). The controlling CPU does some cleanup of its own, but that’s all there is to it.
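Putting the two halves together, the pausing CPU’s side of the protocol boils down to something like this (a loose paraphrase of cpu_pause/mach_cpu_pause based on the description above; field and constant names are approximate, and the real code is careful about volatile accesses and interrupt state):

/* runs as the highest-priority thread on each pausing CPU */
static void
cpu_pause_sketch(int cpun)
{
	volatile char *safe = &safe_list[cpun];

	/* let the controlling CPU know this pause thread got to run */
	sema_v(&cpu_pause_info.cp_sem);

	/* wait for the controlling CPU to set the cp_go flag */
	while (!cpu_pause_info.cp_go)
		;

	/* disable interrupts, then announce that we are spinning */
	*safe = PAUSE_WAIT;

	/* spin until the controlling CPU sets us back to PAUSE_IDLE */
	while (*safe != PAUSE_IDLE)
		;

	/* re-enable interrupts and relinquish the CPU (swtch) */
}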

Synchronization

Why not use a mutex or semaphore for everything? The problem lies in the fact that we are in a really fragile state. We don’t want to lose the CPU because we blocked on a semaphore. That’s why this code uses custom synchronization primitives.
