Josef “Jeff” Sipek

Creative xor Use

Last month at work I got to try to optimize a function that takes a number and rounds it up to the next power of 2. The previous implementation used a simple loop. I didn’t dive into obscure bit twiddling, but rather used a helper function that is already in the codebase. Yes, I let the compiler do the heavy lifting of turning easy to understand code into good machine code. The x86 binary that gcc 6.3 produced has an interesting idiom, and that’s why I’m writing this entry.

The new code:

static inline unsigned int bits_required32(uint32_t num)
        return num == 0 ? 0 : 32 - __builtin_clz(num);

/* Returns x, such that x is the smallest power of 2 >= num. */
uint32_t nearest_power(uint32_t num)
	if (num == 0)
		return 1;

        return 1U << bits_required32(num - 1);

This is a slightly simplified version of the code, but it demonstrates the optimization quite well.

The nearest_power function disassembles as:

    nearest_power:      8b 54 24 04        movl   0x4(%esp),%edx
    nearest_power+0x4:  b8 01 00 00 00     movl   $0x1,%eax
    nearest_power+0x9:  85 d2              testl  %edx,%edx
    nearest_power+0xb:  74 14              je     +0x14	<nearest_power+0x21>
    nearest_power+0xd:  83 ea 01           subl   $0x1,%edx
    nearest_power+0x10: 74 0f              je     +0xf	<nearest_power+0x21>
    nearest_power+0x12: 0f bd d2           bsrl   %edx,%edx
    nearest_power+0x15: b9 20 00 00 00     movl   $0x20,%ecx
    nearest_power+0x1a: 83 f2 1f           xorl   $0x1f,%edx
    nearest_power+0x1d: 29 d1              subl   %edx,%ecx
    nearest_power+0x1f: d3 e0              shll   %cl,%eax
    nearest_power+0x21: c3                 ret    

The first 6 instructions contain the prologue and deal with num being zero or one—both cases produce the result 1. The remaining 6 instructions make up the epilogue and are where the calculation happens. I’m going to ignore the first half of the function, since the second half is where the interesting things happen.

First, we get the number of leading zeros in num - 1 and stash the value 32 in a register:

    nearest_power+0x12: 0f bd d2           bsrl   %edx,%edx
    nearest_power+0x15: b9 20 00 00 00     movl   $0x20,%ecx

The number of leading zeros (%edx) is in the range 0–31.

Here is the really interesting bit:

    nearest_power+0x1a: 83 f2 1f           xorl   $0x1f,%edx

This xors the number of leading zeros (i.e., 0–31) with 31. To decipher what this does, I find it easier to consider the top 27 bits and the bottom 5 bits separately.

operand binary
0x1f 00000000 00000000 00000000 000 11111
edx 00000000 00000000 00000000 000 xxxxx

The xor of the top bits produces 0 since both the constant 31 and the register containing any of the numbers 0–31 have zeros there.

The xor of the bottom bits negates them since the constant has ones there.

When combined, the xor has the same effect as this C expression:

out = (~in) & 0x1f;

This seems very weird and useless, but it is far from it. It turns out that for inputs 0–31 the above expression is the same as:

out = 31 - in;

I think it is really cool that gcc produced this xor instead of a less optimal multi-instruction version.

The remainder of the disassembly just subtracts and shifts to produce the return value.

Why xor?

I think the reason gcc (and clang for that matter) produce this sort of xor instruction instead of a subtraction is very simple: on x86 the sub instruction’s left hand side and the destination must be the same register. That is, on x86 the sub instruction works as:

x -= y;

Since the destination must be a register, it isn’t possible to express out = 31 - in using just one sub.

Anyway, that’s it for today. I hope you enjoyed this as much as I did.

GNU inline vs. C99 inline

Recently, I’ve been looking at inline functions in C. However instead of just the usual static inlines, I’ve been looking at all the variants. This used to be a pretty straightforward GNU C extension and then C99 introduced the inline keyword officially. Sadly, for whatever reason decided that the semantics would be just different enough to confuse me and everyone else.

GCC documentation has the following to say:

GCC implements three different semantics of declaring a function inline. One is available with -std=gnu89 or -fgnu89-inline or when gnu_inline attribute is present on all inline declarations, another when -std=c99, -std=c11, -std=gnu99 or -std=gnu11 (without -fgnu89-inline), and the third is used when compiling C++.

Dang! Ok, I don’t really care about C++, so there are only two ways inline can behave.

Before diving into the two different behaviors, there are two cases to consider: the use of an inline function, and the inline function itself. The good news is that the use of an inline function behaves the same in both C90 and C99. Where the behavior changes is how the compiler deals with the inline function itself.

After reading the GCC documentation and skimming the C99 standard, I have put it all into the following table. It lists the different ways of using the inline keyword and for each use whether or not a symbol is produced in C90 (with inline extension) and in C99.

Emit (C90) Emit (C99)
inline always never
static inline maybe maybe
extern inline never always

(“always” means that a global symbol is always produced regardless of if all the uses of it are inlined. “maybe” means that a local symbol will be produced if and only if some uses cannot be inlined. “never” means that no symbols are produced and any non-inlined uses will be dealt with via relocations to an external symbol.)

Note that C99 “switched” the meaning of inline and extern inline. The good news is, static inline is totally unaffected (and generally the most useful).

For whatever reason, I cannot ever remember this difference. My hope is that this post will help me in the future.

Trying it Out

We can verify this experimentally. We can compile the following C file with -std=gnu89 and -std=gnu99 and compare what symbols the compiler produces:

static inline void si(int x)

extern inline void ei(int x)

inline void i(int x)

And here’s what nm has to say about them:

00000000 T i

00000000 T ei

This is an extremely simple example where the “never” and “maybe” cases all skip generating a symbol. In a more involved program that has inline functions that use features of C that prevent inlining (e.g., VLAs) we would see either relocations to external symbols or local symbols.

Tail Call Optimization

I just had an interesting realization about tail call optimization. Often when people talk about it, they simply describe it as an optimization that the compiler does whenever you end a function with a function call whose return value is propagated up as is. Technically this is true. Practically, people use examples like this:

int foo(int increment)
	if (increment)
		return bar() + 1; /* NOT a tail-call */
	return bar(); /* a tail-call */

It sounds like a very solid example of a tail-call vs. not. I.e., if you “post process” the value before returning it is not a tail-call.

Going back to my realization, I think people often forget about one type of “post processing” — casts. Consider the following code:

extern short bar();

int foo()
        return bar();

Did you spot it? This is not a tail-call.

The integer promotion from short to int is done after bar returns but before foo returns.

For fun, here’s the disassembly:

$ gcc -Wall -O2 -c test.c
$ dis test.o
    foo:     55                 pushl  %ebp
    foo+0x1: 89 e5              movl   %esp,%ebp
    foo+0x3: 83 ec 08           subl   $0x8,%esp
    foo+0x6: e8 fc ff ff ff     call   -0x4	<foo+0x7>
    foo+0xb: c9                 leave  
    foo+0xc: 98                 cwtl   
    foo+0xd: c3                 ret    

For completeness, if we change the return value of bar to int:

$ gcc -Wall -O2 -c test.c
$ dis test.o
    foo:       e9 fc ff ff ff     jmp    -0x4	<foo+0x1>

I wonder how many people think they are tail-call optimizing, but in reality they are “wasting” a stack frame because of this silly reason.

Inline Assembly & clang

Recently I talked about inline assembly with GCC and clang where I pointed out that LLVM seems to produce rather silly machine code. In a comment, a reader asked if this was LLVM’s IR doing this or if it was the machine code generator being silly. I was going to reply there, but the reply got long enough to deserve its own post.

I’ve dealt with LLVM’s IR for a couple of months during the fall of 2010. It was both interesting and quite painful.

The IR is at the Wikipedia article: single static assignment level. It assumes that stack space is cheap and infinite. Since it is a SSA form, it has no notion of registers. The optimization passes transform the IR quite a bit and at the end there is very little (if any!) useless code. In other words, I think it is the machine code generation that is responsible for the unnecessary stack frame push and pop. With that said, it is time to experiment.

Using the same test program as before, of course:

#define _KERNEL
#define _ASM_INLINES
#include <sys/atomic.h>

void test(uint32_t *x)

Emitting LLVM IR

Let’s compile it with clang passing in the -emit-llvm option to have it generate test.ll file with the LLVM IR:

$ clang -S -emit-llvm -Wall -O2 -m64 test.c

There is a fair amount of “stuff” in the file, but the relevant portions are (line-wrapped by me):

; Function Attrs: nounwind
define void @test(i32* %x) #0 {
  tail call void asm sideeffect "lock; incl $0",
    "=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1
  ret void

attributes #0 = { nounwind uwtable "less-precise-fpmad"="false"
  "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"
  "no-infs-fp-math"="false" "no-nans-fp-math"="false"
  "stack-protector-buffer-size"="8" "unsafe-fp-math"="false"
  "use-soft-float"="false" }

LLVM’s IR happens to be very short and to the point. The function prologue and epilogue are not expressed as part of IR blob that gets passed to the machine code generator. Note the function attribute no-frame-pointer-elim being true (meaning frame pointer elimination will not happen).

Now, let’s add in the -fomit-frame-pointer option.

$ clang -S -emit-llvm -Wall -O2 -m64 -fomit-frame-pointer test.c

Now, the relevant IR pieces are:

; Function Attrs: nounwind
define void @test(i32* %x) #0 {
  tail call void asm sideeffect "lock; incl $0",
    "=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1
  ret void

attributes #0 = { nounwind uwtable "less-precise-fpmad"="false"
  "no-frame-pointer-elim"="false" "no-infs-fp-math"="false"
  "no-nans-fp-math"="false" "stack-protector-buffer-size"="8"
  "unsafe-fp-math"="false" "use-soft-float"="false" }

The no-frame-pointer-elim attribute changed (from true to false), but the IR of the function itself did not change. (The no-frame-pointer-elim-non-leaf attribute disappeared as well, but it really makes sense since -fomit-frame-pointer is a rather large hammer that just forces frame pointer elimination everywhere and so it doesn’t make sense to differentiate between leaf and non-leaf functions.)

So, to answer Steve’s question, the LLVM IR does not include the function prologue and epilogue. This actually makes a lot of sense given that the IR is architecture independent and the exact details of what the prologue has to do are define by the ABIs.

IR to Assembly

We can of course use llc to convert the IR into real 64-bit x86 assembly code.

$ llc --march=x86-64 test.ll
$ gas -o test.o --64 test.s

Here is the disassembly for clang invocation without -fomit-frame-pointer:

    test:     55                 pushq  %rbp
    test+0x1: 48 89 e5           movq   %rsp,%rbp
    test+0x4: f0 ff 07           lock incl (%rdi)
    test+0x7: 5d                 popq   %rbp
    test+0x8: c3                 ret    

And here is the disassembly for clang invocation with -fomit-frame-pointer:

    test:     f0 ff 07           lock incl (%rdi)
    test+0x3: c3                 ret    


So, it turns out that my previous post simply stumbled across the fact that GCC and clang have different set of optimizations for -O2. GCC includes -fomit-frame-pointer by default, while clang does not.

Powered by blahgd