Last night while reading about DTrace, I came across an example that involved tracing a simple recursive factorial program. I pointed it out to my girlfriend, Holly, since I thought that she’d find it interesting — the class she teaches has a section about recursion.
Here’s the original code:
extern int fac (int n);
int main(int argc, char **argv)
{
int f;
f = fac(6);
return 0;
}
int fac (int n)
{
if (n <= 1)
return 1;
else
return n * fac (n - 1);
}
Pretty simple. I compiled it, and ran it with DTrace:
$ gcc -o fac orig.c
$ cat fac.d
pid$target::fac:entry
{
trace (arg0);
}
pid$target::fac:return
{
trace (arg1);
}
$ sudo dtrace -Fs fac.d -c ./fac
dtrace: script 'fac.d' matched 2 probes
dtrace: pid 1146 has exited
CPU FUNCTION
0 -> fac 6
0 -> fac 5
0 -> fac 4
0 -> fac 3
0 -> fac 2
0 -> fac 1
0 <- fac 1
0 <- fac 2
0 <- fac 6
0 <- fac 24
0 <- fac 120
0 <- fac 720
Cool! I thought I was done. Holly asked whether it would work with tail-recursion. Interesting, I thought…it might - depending on how gcc handles the function prologue and epilogue. The D script is a little different to dump both of the arguments. Here’s the result:
$ gcc -o fact tail.c
$ cat fact.d
pid$target::fac:entry
{
trace (arg0); trace(arg1);
}
pid$target::fac:return
{
trace (arg1);
}
$ sudo dtrace -Fs fact.d -c ./fact
dtrace: script 'fact.d' matched 2 probes
dtrace: pid 1233 has exited
CPU FUNCTION
2 -> fac 6 1
2 -> fac 5 6
2 -> fac 4 30
2 -> fac 3 120
2 -> fac 2 360
2 -> fac 1 720
2 <- fac 720
2 <- fac 720
2 <- fac 720
2 <- fac 720
2 <- fac 720
2 <- fac 720
You can see the values getting calculated and passed down versus them getting calculated on during the return. That was fun to see. But wait! If this function is tail recursive, why are we seeing all these returns? It should be just one return! Doh! I didn’t compile with optimizations so gcc emitted the stupidest possible code. Easy enough to fix:
$ gcc -o fact -O2 tail.c
$ sudo dtrace -Fs fact.d -c ./fact
dtrace: script 'fact.d' matched 2 probes
dtrace: pid 1304 has exited
$
Huh… nothing happened! fac was never called. Let’s see what gcc emitted:
$ objdump -S fact
...
Disassembly of section .text.startup:
08050d10 <main>:
8050d10: 31 c0 xor %eax,%eax
8050d12: c3 ret
Great, the main function turned into a return 0 because the value returned by fac() was never used. Easy enough, let’s just return the value from main.
$ cat tail.c
extern int fac (int n);
int main(int argc, char **argv)
{
return fac(6);
}
int fac (int n)
{
if (n <= 1)
return 1;
else
return n * fac (n - 1);
}
$ gcc -o fact -O2 tail.c
$ objdump -S fact
...
Disassembly of section .text.startup:
08050d10 <main>:
8050d10: b8 d0 02 00 00 mov $0x2d0,%eax
8050d15: c3 ret
Argh! Thanks gcc, you replaced my fac(6) function call with the value 720 — that is the factorial of 6. Fine, let’s do this the hard way: get an int from the first argument and print it out. Also, to prevent inlining, let’s put it in a separate file. So, now we have:
$ cat fac.c
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
extern int fac (int n);
int main(int argc, char **argv)
{
printf("%d\n", fac(atoi(argv[1])));
return 0;
}
$ cat fac_2.c
int fac (int n)
{
if (n <= 1)
return 1;
else
return n * fac (n - 1);
}
$ cat fact.c
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
extern int fac (int n, int v);
int main(int argc, char **argv)
{
printf("%d\n", fac(atoi(argv[1]), 1));
return 0;
}
$ cat fact_2.c
int fac (int n, int v)
{
if (n <= 1)
return v;
else
return fac(n-1, n*v);
}
Now, if we run the two variants compiled with -O2, we get:
$ sudo dtrace -Fs fac.d -c "./fac 6"
dtrace: script 'fac.d' matched 2 probes
720
dtrace: pid 1397 has exited
CPU FUNCTION
2 -> fac 6
2 <- fac 720
$ sudo dtrace -Fs fact.d -c "./fact 6"
dtrace: script 'fact.d' matched 2 probes
720
dtrace: pid 1399 has exited
CPU FUNCTION
5 -> fac 6 1
5 <- fac 720
Weird, for both we can see only one function entry and return. Let’s try with -O1:
$ sudo dtrace -Fs fac.d -c "./fac 6"
dtrace: script 'fac.d' matched 2 probes
720
dtrace: pid 1393 has exited
CPU FUNCTION
6 -> fac 6
6 -> fac 5
6 -> fac 4
6 -> fac 3
6 -> fac 2
6 -> fac 1
6 <- fac 1
6 <- fac 2
6 <- fac 6
6 <- fac 24
6 <- fac 120
6 <- fac 720
$ sudo dtrace -Fs fact.d -c "./fact 6"
dtrace: script 'fact.d' matched 2 probes
720
dtrace: pid 1395 has exited
CPU FUNCTION
2 -> fac 6 1
2 -> fac 5 6
2 -> fac 4 30
2 -> fac 3 120
2 -> fac 2 360
2 -> fac 1 720
2 <- fac 720
2 <- fac 720
2 <- fac 720
2 <- fac 720
2 <- fac 720
2 <- fac 720
Ok, now were back to having call and return instructions for both cases — the tail recursive function is not actually tail recursing when it should. So, first moral of the story is: -O1 is not enough to make tail recursive functions tail recurse. The odd behavior of the non-tail recursive code with -O2 is still weird. Let’s disassemble it; first the simple recursive code:
08050d00 <fac>:
8050d00: 8b 54 24 04 mov 0x4(%esp),%edx
8050d04: b8 01 00 00 00 mov $0x1,%eax
8050d09: 83 fa 01 cmp $0x1,%edx
8050d0c: 7e 0d jle 8050d1b <fac+0x1b>
8050d0e: 66 90 xchg %ax,%ax
8050d10: 0f af c2 imul %edx,%eax
8050d13: 83 ea 01 sub $0x1,%edx
8050d16: 83 fa 01 cmp $0x1,%edx
8050d19: 75 f5 jne 8050d10 <fac+0x10>
8050d1b: f3 c3 repz ret
8050d1d: 90 nop
8050d1e: 90 nop
8050d1f: 90 nop
Whoa! gcc turned the plain recursive code into a tail-recursive one. For comparison, here’s the disassembly of the explicitly-coded-as-tail-recursive function:
08050d00 <fac>:
8050d00: 8b 54 24 04 mov 0x4(%esp),%edx
8050d04: 8b 44 24 08 mov 0x8(%esp),%eax
8050d08: 83 fa 01 cmp $0x1,%edx
8050d0b: 7e 0e jle 8050d1b <fac+0x1b>
8050d0d: 8d 76 00 lea 0x0(%esi),%esi
8050d10: 0f af c2 imul %edx,%eax
8050d13: 83 ea 01 sub $0x1,%edx
8050d16: 83 fa 01 cmp $0x1,%edx
8050d19: 75 f5 jne 8050d10 <fac+0x10>
8050d1b: f3 c3 repz ret
8050d1d: 90 nop
8050d1e: 90 nop
8050d1f: 90 nop
Do you see it? It’s virtually identical to what gcc emitted for the naive code.
So there you have it folks. The compiler is smarter than you, more consistent than you, and less likely to screw up compared to you when converting a recursive function into a tail-recursive one. In general, you should not prematurely optimize.
In case you care, I’ve used gcc 4.6.1 for these experiments on OpenIndiana.
Your Turn
Do you have an interesting compiler optimization story? Share it in a comment!