Josef “Jeff” Sipek

bool bitfield:1

This is the first of hopefully many posts related to interesting pieces of code I’ve stumbled across in the dovecot repository.

Back in 1999, C99 added the bool type. This is old news. The thing I’ve never seen before is what amounts to:

struct foo {
	bool	a:1;
	bool	b:1;
};

Sure, I’ve seen bitfields before—just never with booleans. Since this is C, the obvious thing happens here. The compiler packs the two bool bits into a single byte. In other words, sizeof(struct foo) is 1 (instead of 2 had we not used bitfields).

The compiler emits pretty compact code as well. For example, suppose we have this simple function:

void set(struct foo *x)
{
	x->b = true;
}

We compile it and disassemble:

$ gcc -c -O2 -Wall -m64 test.c
$ dis -F set test.o
disassembly for test.o

set()
    set:     80 0f 02           orb    $0x2,(%rdi)
    set+0x3: c3                 ret

Had we used non-bitfield booleans, the resulting code would be:

set()
    set:     c6 47 01 01        movb   $0x1,0x1(%rdi)
    set+0x4: c3                 ret

There’s not much of a difference in these simple examples, but in more complicated structures with many boolean flags the structure size difference may be significant.

Of course, the usual caveats about bitfields apply (e.g., the machine’s endian matters).

GNU inline vs. C99 inline

Recently, I’ve been looking at inline functions in C. However instead of just the usual static inlines, I’ve been looking at all the variants. This used to be a pretty straightforward GNU C extension and then C99 introduced the inline keyword officially. Sadly, for whatever reason decided that the semantics would be just different enough to confuse me and everyone else.

GCC documentation has the following to say:

GCC implements three different semantics of declaring a function inline. One is available with -std=gnu89 or -fgnu89-inline or when gnu_inline attribute is present on all inline declarations, another when -std=c99, -std=c11, -std=gnu99 or -std=gnu11 (without -fgnu89-inline), and the third is used when compiling C++.

Dang! Ok, I don’t really care about C++, so there are only two ways inline can behave.

Before diving into the two different behaviors, there are two cases to consider: the use of an inline function, and the inline function itself. The good news is that the use of an inline function behaves the same in both C90 and C99. Where the behavior changes is how the compiler deals with the inline function itself.

After reading the GCC documentation and skimming the C99 standard, I have put it all into the following table. It lists the different ways of using the inline keyword and for each use whether or not a symbol is produced in C90 (with inline extension) and in C99.

Emit (C90) Emit (C99)
inline always never
static inline maybe maybe
extern inline never always

(“always” means that a global symbol is always produced regardless of if all the uses of it are inlined. “maybe” means that a local symbol will be produced if and only if some uses cannot be inlined. “never” means that no symbols are produced and any non-inlined uses will be dealt with via relocations to an external symbol.)

Note that C99 “switched” the meaning of inline and extern inline. The good news is, static inline is totally unaffected (and generally the most useful).

For whatever reason, I cannot ever remember this difference. My hope is that this post will help me in the future.

Trying it Out

We can verify this experimentally. We can compile the following C file with -std=gnu89 and -std=gnu99 and compare what symbols the compiler produces:

static inline void si(int x)
{
}

extern inline void ei(int x)
{
}

inline void i(int x)
{
}

And here’s what nm has to say about them:

test-gcc89:
00000000 T i

test-gcc99:
00000000 T ei

This is an extremely simple example where the “never” and “maybe” cases all skip generating a symbol. In a more involved program that has inline functions that use features of C that prevent inlining (e.g., VLAs) we would see either relocations to external symbols or local symbols.

Tail Call Optimization

I just had an interesting realization about tail call optimization. Often when people talk about it, they simply describe it as an optimization that the compiler does whenever you end a function with a function call whose return value is propagated up as is. Technically this is true. Practically, people use examples like this:

int foo(int increment)
{
	if (increment)
		return bar() + 1; /* NOT a tail-call */
	
	return bar(); /* a tail-call */
}

It sounds like a very solid example of a tail-call vs. not. I.e., if you “post process” the value before returning it is not a tail-call.

Going back to my realization, I think people often forget about one type of “post processing” — casts. Consider the following code:

extern short bar();

int foo()
{
        return bar();
}

Did you spot it? This is not a tail-call.

The integer promotion from short to int is done after bar returns but before foo returns.

For fun, here’s the disassembly:

$ gcc -Wall -O2 -c test.c
$ dis test.o
...
foo()
    foo:     55                 pushl  %ebp
    foo+0x1: 89 e5              movl   %esp,%ebp
    foo+0x3: 83 ec 08           subl   $0x8,%esp
    foo+0x6: e8 fc ff ff ff     call   -0x4	<foo+0x7>
    foo+0xb: c9                 leave  
    foo+0xc: 98                 cwtl   
    foo+0xd: c3                 ret    

For completeness, if we change the return value of bar to int:

$ gcc -Wall -O2 -c test.c
$ dis test.o
...
foo()
    foo:       e9 fc ff ff ff     jmp    -0x4	<foo+0x1>

I wonder how many people think they are tail-call optimizing, but in reality they are “wasting” a stack frame because of this silly reason.

Inline Assembly & clang

Recently I talked about inline assembly with GCC and clang where I pointed out that LLVM seems to produce rather silly machine code. In a comment, a reader asked if this was LLVM’s IR doing this or if it was the machine code generator being silly. I was going to reply there, but the reply got long enough to deserve its own post.

I’ve dealt with LLVM’s IR for a couple of months during the fall of 2010. It was both interesting and quite painful.

The IR is at the Wikipedia article: single static assignment level. It assumes that stack space is cheap and infinite. Since it is a SSA form, it has no notion of registers. The optimization passes transform the IR quite a bit and at the end there is very little (if any!) useless code. In other words, I think it is the machine code generation that is responsible for the unnecessary stack frame push and pop. With that said, it is time to experiment.

Using the same test program as before, of course:

#define _KERNEL
#define _ASM_INLINES
#include <sys/atomic.h>

void test(uint32_t *x)
{
	atomic_inc_32(x);
}

Emitting LLVM IR

Let’s compile it with clang passing in the -emit-llvm option to have it generate test.ll file with the LLVM IR:

$ clang -S -emit-llvm -Wall -O2 -m64 test.c

There is a fair amount of “stuff” in the file, but the relevant portions are (line-wrapped by me):

; Function Attrs: nounwind
define void @test(i32* %x) #0 {
entry:
  tail call void asm sideeffect "lock; incl $0",
    "=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1
  ret void
}

attributes #0 = { nounwind uwtable "less-precise-fpmad"="false"
  "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"
  "no-infs-fp-math"="false" "no-nans-fp-math"="false"
  "stack-protector-buffer-size"="8" "unsafe-fp-math"="false"
  "use-soft-float"="false" }

LLVM’s IR happens to be very short and to the point. The function prologue and epilogue are not expressed as part of IR blob that gets passed to the machine code generator. Note the function attribute no-frame-pointer-elim being true (meaning frame pointer elimination will not happen).

Now, let’s add in the -fomit-frame-pointer option.

$ clang -S -emit-llvm -Wall -O2 -m64 -fomit-frame-pointer test.c

Now, the relevant IR pieces are:

; Function Attrs: nounwind
define void @test(i32* %x) #0 {
entry:
  tail call void asm sideeffect "lock; incl $0",
    "=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1
  ret void
}

attributes #0 = { nounwind uwtable "less-precise-fpmad"="false"
  "no-frame-pointer-elim"="false" "no-infs-fp-math"="false"
  "no-nans-fp-math"="false" "stack-protector-buffer-size"="8"
  "unsafe-fp-math"="false" "use-soft-float"="false" }

The no-frame-pointer-elim attribute changed (from true to false), but the IR of the function itself did not change. (The no-frame-pointer-elim-non-leaf attribute disappeared as well, but it really makes sense since -fomit-frame-pointer is a rather large hammer that just forces frame pointer elimination everywhere and so it doesn’t make sense to differentiate between leaf and non-leaf functions.)

So, to answer Steve’s question, the LLVM IR does not include the function prologue and epilogue. This actually makes a lot of sense given that the IR is architecture independent and the exact details of what the prologue has to do are define by the ABIs.

IR to Assembly

We can of course use llc to convert the IR into real 64-bit x86 assembly code.

$ llc --march=x86-64 test.ll
$ gas -o test.o --64 test.s

Here is the disassembly for clang invocation without -fomit-frame-pointer:

test()
    test:     55                 pushq  %rbp
    test+0x1: 48 89 e5           movq   %rsp,%rbp
    test+0x4: f0 ff 07           lock incl (%rdi)
    test+0x7: 5d                 popq   %rbp
    test+0x8: c3                 ret    

And here is the disassembly for clang invocation with -fomit-frame-pointer:

test()
    test:     f0 ff 07           lock incl (%rdi)
    test+0x3: c3                 ret    

Conclusion

So, it turns out that my previous post simply stumbled across the fact that GCC and clang have different set of optimizations for -O2. GCC includes -fomit-frame-pointer by default, while clang does not.

Inline Assembly & GCC, clang

Recently, I got to write a bit of inline assembly. In the process I got to test my changes by making a small C file which defined test function that called the inline function from the header. Then, I could look at the disassembly to verify all was well.

#define _KERNEL
#define _ASM_INLINES
#include <sys/atomic.h>

void test(uint32_t *x)
{
	atomic_inc_32(x);
}

GCC has been my go to complier for a long time now. So, at first I was using it to debug my inline assembly. I compiled the test programs using:

$ gcc -Wall -O2 -m64 -c test.c

Disassembling the object file yields the rather obvious:

test()
    test:     f0 ff 07           lock incl (%rdi)
    test+0x3: c3                 ret    

I can’t think of any way to make it better :)

Then, at some point I remembered that Clang/LLVM are pretty good as well. I compiled the same file with clang:

$ clang -Wall -O2 -m64 -c test.c

The result was rather disappointing:

test()
    test:     55                 pushq  %rbp
    test+0x1: 48 89 e5           movq   %rsp,%rbp
    test+0x4: f0 ff 07           lock incl (%rdi)
    test+0x7: 5d                 popq   %rbp
    test+0x8: c3                 ret    

For whatever reason, Clang feels the need to push/pop the frame pointer. I did a little bit of searching, and I couldn’t find a way to disable this behavior.

The story for 32-bit output is very similar (just drop the -m64 from the compiler invocation). GCC produced the superior output:

test()
    test:     8b 44 24 04        movl   0x4(%esp),%eax
    test+0x4: f0 ff 00           lock incl (%eax)
    test+0x7: c3                 ret    

While Clang still wanted to muck around with the frame pointer.

test()
    test:     55                 pushl  %ebp
    test+0x1: 89 e5              movl   %esp,%ebp
    test+0x3: 8b 45 08           movl   0x8(%ebp),%eax
    test+0x6: f0 ff 00           lock incl (%eax)
    test+0x9: 5d                 popl   %ebp
    test+0xa: c3                 ret    

For the curious ones, I’m using GCC 4.8.3 and Clang 3.4.2.

I realize this is a bit of a special case (how often to you make a function that simply calls an inline function?), but it makes me worried about what sort of sub-optimal code Clang produces in other cases.

Inlining Atomic Operations

One of the items on my ever growing TODO list (do these ever shrink?) was to see if inlining Illumos’s atomic_* functions would make any difference. (For the record, these functions atomically manipulate variables. You can read more about them in the various man pages — atomic_add, atomic_and, atomic_bits, atomic_cas, atomic_dec, atomic_inc, atomic_or, atomic_swap.) Of course once I looked at the issue deeply enough, I ended up with five cleanup patches. The gist of it is, inlining them caused not only about 1% kernel performance improvement on the benchmarks, but also reduced the kernel size by a couple of kilobytes. You can read all about it in the associated bugs (5042, 5043, 5044, 5045, 5046, 5047) and the patch 0/6 email I sent to the developer list. In this blahg post, I want to talk about how exactly Illumos presents these atomic functions in a stable ABI but at the same time allows for inlines.

Genesis

It should come as no surprise that the “content” of these functions really needs to be written in assembly. The functions are 100% implemented in assembly in usr/src/common/atomic. There, you will find a directory per architecture. For example, in the amd64 directory, we’ll find the code for a 64-bit atomic increment:

	ENTRY(atomic_inc_64)
	ALTENTRY(atomic_inc_ulong)
	lock
	incq	(%rdi)
	ret
	SET_SIZE(atomic_inc_ulong)
	SET_SIZE(atomic_inc_64)

The ENTRY, ALTENTRY, and SET_SIZE macros are C preprocessor macros to make writing assembly functions semi-sane. Anyway, this code is used by both the kernel as well as userspace. I am going to ignore the userspace side of the picture and talk about the kernel only.

These assembly functions, get mangled by the C preprocessor, and then are fed into the assembler. The object file is then linked into the rest of the kernel. When a module binary references these functions the krtld (linker-loader) wires up those references to this code.

Inline

Replacing these function with inline functions (using the GNU definition) would be fine as far as all the code in Illumos is concerned. However doing so would remove the actual functions (as well as the symbol table entries) and so the linker would not be able to wire up any references from modules. Since Illumos cares about not breaking existing external modules (both open source and closed source), this simple approach would be a no-go.

Inline v2

Before I go into the next and final approach, I’m going to make a small detour through C land.

extern inline

First off, let’s say that we have a simple function, add, that returns the sum of the two integer arguments, and we keep it in a file called add.c:

#include "add.h"

int add(int x, int y)
{
	return x + y;
}

In the associated header file, add.h, we may include a prototype like the following to let the compiler know that add exists elsewhere and what types to expect.

extern int add(int, int);

Then, we attempt to call it from a function in, say, test.c:

#include "add.h"

int test()
{
	return add(5, 7);
}

Now, let’s turn these two .c files into a .so. We get the obvious result — test calls add:

test()
    test:     be 07 00 00 00     movl   $0x7,%esi
    test+0x5: bf 05 00 00 00     movl   $0x5,%edi
    test+0xa: e9 b1 fe ff ff     jmp    -0x14f	<0xc90>

And the binary contains both functions:

$ /usr/bin/nm test.so | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[74]	|                3520|                   4|FUNC |GLOB |0    |13   |add
[65]	|                3536|                  15|FUNC |GLOB |0    |13   |test

Now suppose that we modify the header file to include the following (assuming GCC’s inline definition):

extern int add(int, int);

extern inline int add(int a, int b)
{
	return a + b;
}

If we compile and link the same .so the same way, that is we feed in the object file with the previously used implementation of add, we’ll get a slightly different binary. The invocation of add will use the inlined version:

test()
    test:     b8 0c 00 00 00     movl   $0xc,%eax
    test+0x5: c3                 ret    

But the binary will still include the symbol:

$ /usr/bin/nm test.so | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[72]	|                3408|                   4|FUNC |GLOB |0    |11   |add
[63]	|                3424|                   6|FUNC |GLOB |0    |11   |test

Neat, eh?

extern inline atomic what?

How does this apply to the atomic functions? Pretty simply. As I pointed out, usr/src/common/atomic contains the pure assembly implementations — these are the functions you’ll always find in the symbol table.

The common header file that defines extern prototypes is usr/src/uts/common/sys/atomic.h.

Now, the trick. If you look carefully at the header file, you’ll spot a check on line 39. If all the conditions are true (kernel code, GCC, inline assembly is allowed, and x86), we include asm/atomic.h — which lives at usr/src/uts/intel/asm/atomic.h. This is where the extern inline versions of the atomic functions get defined.

So, kernel code simply includes <sys/atomic.h>, and if the stars align properly, any atomic function use will get inlined.

Phew! This ended up being longer than I expected. :)

GCC Optimizations

Recently, I’ve been given a hang bug to work on. This lead me to a another bug related to timing which pushed me to clean up a time related #define which uncovered at least two bugs. Got all that? Good. The rest of this post is going to talk about the changed define, and one of the “at least two bugs”. When I talk about GCC, I’m talking about the Illumos-specific GCC version 4.4.4. (Illumos needs a couple of features that stock GCC doesn’t provide.)

The #define change I’m hoping to make is very simple:

diff --git a/usr/src/uts/common/sys/time.h b/usr/src/uts/common/sys/time.h
--- a/usr/src/uts/common/sys/time.h
+++ b/usr/src/uts/common/sys/time.h
@@ -234,7 +234,7 @@ struct itimerval32 {
 #define        SEC             1
 #define        MILLISEC        1000
 #define        MICROSEC        1000000
-#define        NANOSEC         1000000000
+#define        NANOSEC         1000000000ll
 
 #define        MSEC2NSEC(m)    ((hrtime_t)(m) * (NANOSEC / MILLISEC))
 #define        NSEC2MSEC(n)    ((n) / (NANOSEC / MILLISEC))

Without it, multiplying by NANOSEC will cause integer overflow issues on IPL32 and LP64 systems (read: basically everywhere).

One of the “at least two bugs“ involves a simple (buggy) function aptly named tv2ns as it converts a struct timeval to a 64-bit nanosecond count:

static int64_t
tv2ns(struct timeval *tvp)
{
	return (tvp->tv_sec * NANOSEC + tvp->tv_usec * 1000);
}

At first glance, this function looks correct. The only flaw with it is that first portion of the expression multiplies a time_t (32-bit signed int) with an int (also 32-bit signed) making the result of that subexpression 32-bit signed expression. With NANOSEC changed to a long long, everything works as expected. Now, the fun part… disassembling this function without the fix. You don’t have to be an expert to see that this function is strangely repetitive. I’ve annotated the assembly.

tv2ns:          movl   0x4(%esp),%eax     ; eax = tvp
tv2ns+4:        movl   0x4(%eax),%edx     ; edx = tvp->tv_usec
tv2ns+7:        leal   (%edx,%edx,4),%edx ; edx = edx + 4 * edx
tv2ns+0xa:      leal   (%edx,%edx,4),%edx ;     = 5 * edx
tv2ns+0xd:      leal   (%edx,%edx,4),%edx
;
; at this point:  edx = 5 * 5 * 5 * tvp->tv_usec,
; which is the same as: 125 * tvp->tv_usec
;
tv2ns+0x10:     movl   (%eax),%eax        ; eax = tvp->tv_sec
tv2ns+0x12:     leal   (%eax,%eax,4),%eax ; eax = eax + 4 * eax
tv2ns+0x15:     leal   (%eax,%eax,4),%eax ;     = 5 * eax
tv2ns+0x18:     leal   (%eax,%eax,4),%eax
tv2ns+0x1b:     leal   (%eax,%eax,4),%eax
tv2ns+0x1e:     leal   (%eax,%eax,4),%eax
tv2ns+0x21:     leal   (%eax,%eax,4),%eax
tv2ns+0x24:     leal   (%eax,%eax,4),%eax
tv2ns+0x27:     leal   (%eax,%eax,4),%eax
tv2ns+0x2a:     leal   (%eax,%eax,4),%eax
;
; at this point,  eax = 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * tvp->tv_sec,
; which is the same as: 1953125 * tvp->tv_sec
;
tv2ns+0x2d:     shll   $0x9,%eax          ; eax <<= 9
;
; eax = (1953125 * tvp->tv_sec) << 9,
; which suprprisingly ends up being the same as: 1000000000 * tvp->tv_sec
;
; so, now we have 'eax' with the tv_sec converted to nanoseconds and 'edx'
; with 125 * tv_usec
;
tv2ns+0x30:     leal   (%eax,%edx,8),%eax ; eax = eax + 8 * edx
;
; 8 * 125 = 1000, which is the factor to convert tv_usec to nanoseconds!
;
tv2ns+0x33:     cltd                      ; sign-extend eax to edx:eax
tv2ns+0x34:     ret    

I found it interesting that GCC decided to emit leal instructions to multiply by 5 and then finish it off with a shift and another leal. This is another one of those times when I realize that the compiler is smarter than me. (The sign-extension of course happens too late — all the math needs to happen as 64-bit arithmetic, but that’s not GCC’s fault.)

For the record, with the #define changed, the function looks like the following — sorry, no comments on this one:

tv2ns:          pushl  %edi
tv2ns+1:        pushl  %esi
tv2ns+2:        pushl  %ebx
tv2ns+3:        subl   $0x8,%esp
tv2ns+6:        movl   0x18(%esp),%ecx
tv2ns+0xa:      movl   0x4(%ecx),%eax
tv2ns+0xd:      leal   (%eax,%eax,4),%eax
tv2ns+0x10:     leal   (%eax,%eax,4),%eax
tv2ns+0x13:     leal   (%eax,%eax,4),%ebx
tv2ns+0x16:     shll   $0x3,%ebx
tv2ns+0x19:     movl   %ebx,%esi
tv2ns+0x1b:     sarl   $0x1f,%esi
tv2ns+0x1e:     movl   $0x3b9aca00,%edi
tv2ns+0x23:     movl   (%ecx),%eax
tv2ns+0x25:     imull  %edi
tv2ns+0x27:     movl   %eax,(%esp)
tv2ns+0x2a:     movl   %edx,0x4(%esp)
tv2ns+0x2e:     addl   %ebx,%eax
tv2ns+0x30:     adcl   %esi,%edx
tv2ns+0x32:     addl   $0x8,%esp
tv2ns+0x35:     popl   %ebx
tv2ns+0x36:     popl   %esi
tv2ns+0x37:     popl   %edi
tv2ns+0x38:     ret    

Maybe one day I’ll rummage through my brain and dig up other times that GCC is outsmarted me and blahg about them. :)

Designated Initializers

Designated initializers are a neat feature in C99 that I’ve used for about 6 years. I can’t fathom why anyone would not use them if C99 is available. (Of course if you have to support pre-C99 compilers, you’re very sad.) In case you’ve never seen them, consider this example that’s perfectly valid C99:

int abc[7] = {
	[1] = 0xabc,
	[2] = 0x12345678,
	[3] = 0x12345678,
	[4] = 0x12345678,
	[5] = 0xdef,
};

As you may have guessed, indices 1–5 will have the specified value. Indices 0 and 6 will be zero. Cool, eh?

GCC Extensions

Today I learned about a neat GNU extension in GCC to designated initializers. Consider this code snippet:

int abc[7] = {
	[1] = 0xabc,
	[2 ... 5] = 0x12345678,
	[5] = 0xdef,
};

Mind blowing, isn’t it?

Beware, however… GCC’s -std=c99 will not error out if you use ranges! You need to throw in -pedantic to get a warning.

$ gcc -c -Wall -std=c99 test.c
$ gcc -c -Wall -pedantic -std=c99 test.c
test.c:2:5: warning: ISO C forbids specifying range of elements to initialize [-pedantic]

Useless reinterpret_cast in C++

A few months ago (for whatever reason, I didn’t publish this post earlier), I happened to stumble on some C++ code that I had to modify. While trying to make things work, I happened to get code that essentially was:

uintptr_t x = ...;
uintptr_t y = reinterpret_cast<uintptr_t>(x);

Yes, the cast is useless. The actual code I had was much more complicated and it wasn’t immediately obvious that ‘x’ was already a uintptr_t. Thinking about it now, I would expect GCC to give a warning about a useless cast. What I did not expect was what I got:

foo.cpp:189:3: error: invalid cast from type "uintptr_t {aka long unsigned int}"
    to type "uintptr_t {aka long unsigned int}"

Huh? To me it seems a bit silly that the compiler does not know how to convert from one type to the same type. (For what it’s worth, this is GCC 4.6.2.)

Can anyone who knows more about GCC and/or C++ shed some light on this?

Recursion with Sun Studio

I mentioned my post about recursion & DTrace in the #dtrace, and one of the folks there asked about Sun’s/Oracle’s compiler — Studio. I happen to have version 12u1, so why not have some more fun? :)

First things first, the results. With -O1, both variants recurse — 6 calls followed by 6 returns. With -O2, the naive code still recurses, while the explicitly-coded-as-tail-recursive code tail-recurses.

Here are the disassemblies (for -O2). Naive-recursive:

08050aa0 <fac>:
 8050aa0:	55                   	push   %ebp
 8050aa1:	8b ec                	mov    %esp,%ebp
 8050aa3:	53                   	push   %ebx
 8050aa4:	83 ec 04             	sub    $0x4,%esp
 8050aa7:	83 e4 f0             	and    $0xfffffff0,%esp
 8050aaa:	8b 5d 08             	mov    0x8(%ebp),%ebx
 8050aad:	83 fb 01             	cmp    $0x1,%ebx
 8050ab0:	7f 07                	jg     8050ab9 <fac+0x19>
 8050ab2:	b8 01 00 00 00       	mov    $0x1,%eax
 8050ab7:	eb 12                	jmp    8050acb <fac+0x2b>
 8050ab9:	83 ec 0c             	sub    $0xc,%esp
 8050abc:	8d 43 ff             	lea    -0x1(%ebx),%eax
 8050abf:	50                   	push   %eax
 8050ac0:	e8 db ff ff ff       	call   8050aa0 <fac>
 8050ac5:	83 c4 10             	add    $0x10,%esp
 8050ac8:	0f af c3             	imul   %ebx,%eax
 8050acb:	8d 65 fc             	lea    -0x4(%ebp),%esp
 8050ace:	5b                   	pop    %ebx
 8050acf:	c9                   	leave  
 8050ad0:	c3                   	ret    

and tail-recursive:

08050aa0 <fac>:
 8050aa0:	55                   	push   %ebp
 8050aa1:	8b ec                	mov    %esp,%ebp
 8050aa3:	8b 55 08             	mov    0x8(%ebp),%edx
 8050aa6:	8b 45 0c             	mov    0xc(%ebp),%eax
 8050aa9:	83 fa 01             	cmp    $0x1,%edx
 8050aac:	7e 0d                	jle    8050abb <fac+0x1b>
 8050aae:	8d 4a ff             	lea    -0x1(%edx),%ecx
 8050ab1:	0f af c2             	imul   %edx,%eax
 8050ab4:	8b d1                	mov    %ecx,%edx
 8050ab6:	83 f9 01             	cmp    $0x1,%ecx
 8050ab9:	7f f3                	jg     8050aae <fac+0xe>
 8050abb:	89 45 0c             	mov    %eax,0xc(%ebp)
 8050abe:	89 55 08             	mov    %edx,0x8(%ebp)
 8050ac1:	c9                   	leave  
 8050ac2:	c3                   	ret 

I find it fascinating that Studio produced nearly identical code to gcc. The big differences that you can see are: (1) Studio defaults to saving the frame pointer while gcc omitted it, (2) gcc inserted a no-op to 16-byte align the loop, and (3) Studio uses an extra register to perform the subtraction (the n-1).

The subtraction surprised me, and so I examined the code more closely. My guess is that Studio attempts to move the subtraction as far up as possible to start it early. This way, by the time we actually need the subtracted value (the cmp instruction) it’s already available. Compare that to gcc’s sub followed immediately by cmp. Some processors are designed in such a way that the result of sub will be available by the time cmp wants to execute. Others are not.

Your Turn

Have you noticed any interesting compiler optimizations? Share them in a comment!

Powered by blahgd