Recursion with Sun Studio
I mentioned my post about recursion & DTrace in the #dtrace, and one of the folks there asked about Sun’s/Oracle’s compiler — Studio. I happen to have version 12u1, so why not have some more fun? :)
First things first, the results. With -O1, both variants recurse — 6 calls followed by 6 returns. With -O2, the naive code still recurses, while the explicitly-coded-as-tail-recursive code tail-recurses.
Here are the disassemblies (for -O2). Naive-recursive:
08050aa0 <fac>: 8050aa0: 55 push %ebp 8050aa1: 8b ec mov %esp,%ebp 8050aa3: 53 push %ebx 8050aa4: 83 ec 04 sub $0x4,%esp 8050aa7: 83 e4 f0 and $0xfffffff0,%esp 8050aaa: 8b 5d 08 mov 0x8(%ebp),%ebx 8050aad: 83 fb 01 cmp $0x1,%ebx 8050ab0: 7f 07 jg 8050ab9 <fac+0x19> 8050ab2: b8 01 00 00 00 mov $0x1,%eax 8050ab7: eb 12 jmp 8050acb <fac+0x2b> 8050ab9: 83 ec 0c sub $0xc,%esp 8050abc: 8d 43 ff lea -0x1(%ebx),%eax 8050abf: 50 push %eax 8050ac0: e8 db ff ff ff call 8050aa0 <fac> 8050ac5: 83 c4 10 add $0x10,%esp 8050ac8: 0f af c3 imul %ebx,%eax 8050acb: 8d 65 fc lea -0x4(%ebp),%esp 8050ace: 5b pop %ebx 8050acf: c9 leave 8050ad0: c3 ret
and tail-recursive:
08050aa0 <fac>: 8050aa0: 55 push %ebp 8050aa1: 8b ec mov %esp,%ebp 8050aa3: 8b 55 08 mov 0x8(%ebp),%edx 8050aa6: 8b 45 0c mov 0xc(%ebp),%eax 8050aa9: 83 fa 01 cmp $0x1,%edx 8050aac: 7e 0d jle 8050abb <fac+0x1b> 8050aae: 8d 4a ff lea -0x1(%edx),%ecx 8050ab1: 0f af c2 imul %edx,%eax 8050ab4: 8b d1 mov %ecx,%edx 8050ab6: 83 f9 01 cmp $0x1,%ecx 8050ab9: 7f f3 jg 8050aae <fac+0xe> 8050abb: 89 45 0c mov %eax,0xc(%ebp) 8050abe: 89 55 08 mov %edx,0x8(%ebp) 8050ac1: c9 leave 8050ac2: c3 ret
I find it fascinating that Studio produced nearly identical code to gcc. The big differences that you can see are: (1) Studio defaults to saving the frame pointer while gcc omitted it, (2) gcc inserted a no-op to 16-byte align the loop, and (3) Studio uses an extra register to perform the subtraction (the n-1).
The subtraction surprised me, and so I examined the code more closely. My guess is that Studio attempts to move the subtraction as far up as possible to start it early. This way, by the time we actually need the subtracted value (the cmp instruction) it’s already available. Compare that to gcc’s sub followed immediately by cmp. Some processors are designed in such a way that the result of sub will be available by the time cmp wants to execute. Others are not.
Your Turn
Have you noticed any interesting compiler optimizations? Share them in a comment!