I mentioned my post about recursion & DTrace in the #dtrace, and one of the folks there asked about Sun’s/Oracle’s compiler — Studio. I happen to have version 12u1, so why not have some more fun? :)
First things first, the results. With -O1, both variants recurse — 6 calls followed by 6 returns. With -O2, the naive code still recurses, while the explicitly-coded-as-tail-recursive code tail-recurses.
Here are the disassemblies (for -O2). Naive-recursive:
08050aa0 <fac>:
8050aa0: 55 push %ebp
8050aa1: 8b ec mov %esp,%ebp
8050aa3: 53 push %ebx
8050aa4: 83 ec 04 sub $0x4,%esp
8050aa7: 83 e4 f0 and $0xfffffff0,%esp
8050aaa: 8b 5d 08 mov 0x8(%ebp),%ebx
8050aad: 83 fb 01 cmp $0x1,%ebx
8050ab0: 7f 07 jg 8050ab9 <fac+0x19>
8050ab2: b8 01 00 00 00 mov $0x1,%eax
8050ab7: eb 12 jmp 8050acb <fac+0x2b>
8050ab9: 83 ec 0c sub $0xc,%esp
8050abc: 8d 43 ff lea -0x1(%ebx),%eax
8050abf: 50 push %eax
8050ac0: e8 db ff ff ff call 8050aa0 <fac>
8050ac5: 83 c4 10 add $0x10,%esp
8050ac8: 0f af c3 imul %ebx,%eax
8050acb: 8d 65 fc lea -0x4(%ebp),%esp
8050ace: 5b pop %ebx
8050acf: c9 leave
8050ad0: c3 ret
and tail-recursive:
08050aa0 <fac>:
8050aa0: 55 push %ebp
8050aa1: 8b ec mov %esp,%ebp
8050aa3: 8b 55 08 mov 0x8(%ebp),%edx
8050aa6: 8b 45 0c mov 0xc(%ebp),%eax
8050aa9: 83 fa 01 cmp $0x1,%edx
8050aac: 7e 0d jle 8050abb <fac+0x1b>
8050aae: 8d 4a ff lea -0x1(%edx),%ecx
8050ab1: 0f af c2 imul %edx,%eax
8050ab4: 8b d1 mov %ecx,%edx
8050ab6: 83 f9 01 cmp $0x1,%ecx
8050ab9: 7f f3 jg 8050aae <fac+0xe>
8050abb: 89 45 0c mov %eax,0xc(%ebp)
8050abe: 89 55 08 mov %edx,0x8(%ebp)
8050ac1: c9 leave
8050ac2: c3 ret
I find it fascinating that Studio produced nearly identical code to gcc. The big differences that you can see are: (1) Studio defaults to saving the frame pointer while gcc omitted it, (2) gcc inserted a no-op to 16-byte align the loop, and (3) Studio uses an extra register to perform the subtraction (the n-1).
The subtraction surprised me, and so I examined the code more closely. My guess is that Studio attempts to move the subtraction as far up as possible to start it early. This way, by the time we actually need the subtracted value (the cmp instruction) it’s already available. Compare that to gcc’s sub followed immediately by cmp. Some processors are designed in such a way that the result of sub will be available by the time cmp wants to execute. Others are not.
Your Turn
Have you noticed any interesting compiler optimizations? Share them in a comment!