Josef “Jeff” Sipek

Recursion with Sun Studio

I mentioned my post about recursion & DTrace in the #dtrace, and one of the folks there asked about Sun’s/Oracle’s compiler — Studio. I happen to have version 12u1, so why not have some more fun? :)

First things first, the results. With -O1, both variants recurse — 6 calls followed by 6 returns. With -O2, the naive code still recurses, while the explicitly-coded-as-tail-recursive code tail-recurses.

Here are the disassemblies (for -O2). Naive-recursive:

08050aa0 <fac>:
 8050aa0:	55                   	push   %ebp
 8050aa1:	8b ec                	mov    %esp,%ebp
 8050aa3:	53                   	push   %ebx
 8050aa4:	83 ec 04             	sub    $0x4,%esp
 8050aa7:	83 e4 f0             	and    $0xfffffff0,%esp
 8050aaa:	8b 5d 08             	mov    0x8(%ebp),%ebx
 8050aad:	83 fb 01             	cmp    $0x1,%ebx
 8050ab0:	7f 07                	jg     8050ab9 <fac+0x19>
 8050ab2:	b8 01 00 00 00       	mov    $0x1,%eax
 8050ab7:	eb 12                	jmp    8050acb <fac+0x2b>
 8050ab9:	83 ec 0c             	sub    $0xc,%esp
 8050abc:	8d 43 ff             	lea    -0x1(%ebx),%eax
 8050abf:	50                   	push   %eax
 8050ac0:	e8 db ff ff ff       	call   8050aa0 <fac>
 8050ac5:	83 c4 10             	add    $0x10,%esp
 8050ac8:	0f af c3             	imul   %ebx,%eax
 8050acb:	8d 65 fc             	lea    -0x4(%ebp),%esp
 8050ace:	5b                   	pop    %ebx
 8050acf:	c9                   	leave  
 8050ad0:	c3                   	ret    

and tail-recursive:

08050aa0 <fac>:
 8050aa0:	55                   	push   %ebp
 8050aa1:	8b ec                	mov    %esp,%ebp
 8050aa3:	8b 55 08             	mov    0x8(%ebp),%edx
 8050aa6:	8b 45 0c             	mov    0xc(%ebp),%eax
 8050aa9:	83 fa 01             	cmp    $0x1,%edx
 8050aac:	7e 0d                	jle    8050abb <fac+0x1b>
 8050aae:	8d 4a ff             	lea    -0x1(%edx),%ecx
 8050ab1:	0f af c2             	imul   %edx,%eax
 8050ab4:	8b d1                	mov    %ecx,%edx
 8050ab6:	83 f9 01             	cmp    $0x1,%ecx
 8050ab9:	7f f3                	jg     8050aae <fac+0xe>
 8050abb:	89 45 0c             	mov    %eax,0xc(%ebp)
 8050abe:	89 55 08             	mov    %edx,0x8(%ebp)
 8050ac1:	c9                   	leave  
 8050ac2:	c3                   	ret 

I find it fascinating that Studio produced nearly identical code to gcc. The big differences that you can see are: (1) Studio defaults to saving the frame pointer while gcc omitted it, (2) gcc inserted a no-op to 16-byte align the loop, and (3) Studio uses an extra register to perform the subtraction (the n-1).

The subtraction surprised me, and so I examined the code more closely. My guess is that Studio attempts to move the subtraction as far up as possible to start it early. This way, by the time we actually need the subtracted value (the cmp instruction) it’s already available. Compare that to gcc’s sub followed immediately by cmp. Some processors are designed in such a way that the result of sub will be available by the time cmp wants to execute. Others are not.

Your Turn

Have you noticed any interesting compiler optimizations? Share them in a comment!


Atom feed for comments on this post.

Leave a comment

Powered by blahgd