Creative xor Use
Last month at work I got to try to optimize a function that takes a number and rounds it up to the next power of 2. The previous implementation used a simple loop. I didn’t dive into obscure bit twiddling, but rather used a helper function that is already in the codebase. Yes, I let the compiler do the heavy lifting of turning easy to understand code into good machine code. The x86 binary that gcc 6.3 produced has an interesting idiom, and that’s why I’m writing this entry.
The new code:
static inline unsigned int bits_required32(uint32_t num) { return num == 0 ? 0 : 32 - __builtin_clz(num); } /* Returns x, such that x is the smallest power of 2 >= num. */ uint32_t nearest_power(uint32_t num) { if (num == 0) return 1; return 1U << bits_required32(num - 1); }
This is a slightly simplified version of the code, but it demonstrates the optimization quite well.
The nearest_power function disassembles as:
nearest_power() nearest_power: 8b 54 24 04 movl 0x4(%esp),%edx nearest_power+0x4: b8 01 00 00 00 movl $0x1,%eax nearest_power+0x9: 85 d2 testl %edx,%edx nearest_power+0xb: 74 14 je +0x14 <nearest_power+0x21> nearest_power+0xd: 83 ea 01 subl $0x1,%edx nearest_power+0x10: 74 0f je +0xf <nearest_power+0x21> nearest_power+0x12: 0f bd d2 bsrl %edx,%edx nearest_power+0x15: b9 20 00 00 00 movl $0x20,%ecx nearest_power+0x1a: 83 f2 1f xorl $0x1f,%edx nearest_power+0x1d: 29 d1 subl %edx,%ecx nearest_power+0x1f: d3 e0 shll %cl,%eax nearest_power+0x21: c3 ret
The first 6 instructions contain the prologue and deal with num being zero or one—both cases produce the result 1. The remaining 6 instructions make up the epilogue and are where the calculation happens. I’m going to ignore the first half of the function, since the second half is where the interesting things happen.
First, we get the number of leading zeros in num - 1 and stash the value 32 in a register:
nearest_power+0x12: 0f bd d2 bsrl %edx,%edx nearest_power+0x15: b9 20 00 00 00 movl $0x20,%ecx
The number of leading zeros (%edx) is in the range 0–31.
Here is the really interesting bit:
nearest_power+0x1a: 83 f2 1f xorl $0x1f,%edx
This xors the number of leading zeros (i.e., 0–31) with 31. To decipher what this does, I find it easier to consider the top 27 bits and the bottom 5 bits separately.
operand | binary | |
0x1f | 00000000 00000000 00000000 000 | 11111 |
edx | 00000000 00000000 00000000 000 | xxxxx |
The xor of the top bits produces 0 since both the constant 31 and the register containing any of the numbers 0–31 have zeros there.
The xor of the bottom bits negates them since the constant has ones there.
When combined, the xor has the same effect as this C expression:
out = (~in) & 0x1f;
This seems very weird and useless, but it is far from it. It turns out that for inputs 0–31 the above expression is the same as:
out = 31 - in;
I think it is really cool that gcc produced this xor instead of a less optimal multi-instruction version.
The remainder of the disassembly just subtracts and shifts to produce the return value.
Why xor?
I think the reason gcc (and clang for that matter) produce this sort of xor instruction instead of a subtraction is very simple: on x86 the sub instruction’s left hand side and the destination must be the same register. That is, on x86 the sub instruction works as:
x -= y;
Since the destination must be a register, it isn’t possible to express out = 31 - in using just one sub.
Anyway, that’s it for today. I hope you enjoyed this as much as I did.