Josef “Jeff” Sipek

Creative xor Use

Last month at work I got to try to optimize a function that takes a number and rounds it up to the next power of 2. The previous implementation used a simple loop. I didn’t dive into obscure bit twiddling, but rather used a helper function that is already in the codebase. Yes, I let the compiler do the heavy lifting of turning easy to understand code into good machine code. The x86 binary that gcc 6.3 produced has an interesting idiom, and that’s why I’m writing this entry.

The new code:

static inline unsigned int bits_required32(uint32_t num)
        return num == 0 ? 0 : 32 - __builtin_clz(num);

/* Returns x, such that x is the smallest power of 2 >= num. */
uint32_t nearest_power(uint32_t num)
	if (num == 0)
		return 1;

        return 1U << bits_required32(num - 1);

This is a slightly simplified version of the code, but it demonstrates the optimization quite well.

The nearest_power function disassembles as:

    nearest_power:      8b 54 24 04        movl   0x4(%esp),%edx
    nearest_power+0x4:  b8 01 00 00 00     movl   $0x1,%eax
    nearest_power+0x9:  85 d2              testl  %edx,%edx
    nearest_power+0xb:  74 14              je     +0x14	<nearest_power+0x21>
    nearest_power+0xd:  83 ea 01           subl   $0x1,%edx
    nearest_power+0x10: 74 0f              je     +0xf	<nearest_power+0x21>
    nearest_power+0x12: 0f bd d2           bsrl   %edx,%edx
    nearest_power+0x15: b9 20 00 00 00     movl   $0x20,%ecx
    nearest_power+0x1a: 83 f2 1f           xorl   $0x1f,%edx
    nearest_power+0x1d: 29 d1              subl   %edx,%ecx
    nearest_power+0x1f: d3 e0              shll   %cl,%eax
    nearest_power+0x21: c3                 ret    

The first 6 instructions contain the prologue and deal with num being zero or one—both cases produce the result 1. The remaining 6 instructions make up the epilogue and are where the calculation happens. I’m going to ignore the first half of the function, since the second half is where the interesting things happen.

First, we get the number of leading zeros in num - 1 and stash the value 32 in a register:

    nearest_power+0x12: 0f bd d2           bsrl   %edx,%edx
    nearest_power+0x15: b9 20 00 00 00     movl   $0x20,%ecx

The number of leading zeros (%edx) is in the range 0–31.

Here is the really interesting bit:

    nearest_power+0x1a: 83 f2 1f           xorl   $0x1f,%edx

This xors the number of leading zeros (i.e., 0–31) with 31. To decipher what this does, I find it easier to consider the top 27 bits and the bottom 5 bits separately.

operand binary
0x1f 00000000 00000000 00000000 000 11111
edx 00000000 00000000 00000000 000 xxxxx

The xor of the top bits produces 0 since both the constant 31 and the register containing any of the numbers 0–31 have zeros there.

The xor of the bottom bits negates them since the constant has ones there.

When combined, the xor has the same effect as this C expression:

out = (~in) & 0x1f;

This seems very weird and useless, but it is far from it. It turns out that for inputs 0–31 the above expression is the same as:

out = 31 - in;

I think it is really cool that gcc produced this xor instead of a less optimal multi-instruction version.

The remainder of the disassembly just subtracts and shifts to produce the return value.

Why xor?

I think the reason gcc (and clang for that matter) produce this sort of xor instruction instead of a subtraction is very simple: on x86 the sub instruction’s left hand side and the destination must be the same register. That is, on x86 the sub instruction works as:

x -= y;

Since the destination must be a register, it isn’t possible to express out = 31 - in using just one sub.

Anyway, that’s it for today. I hope you enjoyed this as much as I did.

Exclusive Or Character

A couple of years ago I blogged about the CCS instruction in the Apollo Guidance Computer. Today I want to tell you about the XC instruction from the System/360 ISA.

Many ISAs have some sort of xor instruction. The 360 is no different. It offers several different xor instructions which differ in the type of operands that they operate on. In all cases, the operation they perform could be summarized as (using C syntax):

A ^= B;

That is one of the operands is used as both a source and a destination.

There are the boring X (reg ^= memory), XR (reg ^= reg), and XI (reg ^= immediate). Then there is XC which is what inspired this post. XC, or Exclusive Or Character, takes two memory locations and a length and performs what appears as byte by byte xor of the two buffers. (The hardware is smart enough to operate on bigger chunks of memory but the effect is as if it was done byte at a time.) In assembly XC looks like:

XC d1(l,b1),d2(b2)

The d are 12-bit unsigned displacements while the b specify the registers with the base address. For each of the operands the actual address is dX plus the value of the bX register. The l is a length field which encodes a length between 1 and 256.

To use more C pseudocode, XC does:

void XC(unsigned char *op1, size_t len, unsigned char *op2)
	while (len--) {
		*op1 ^= *op2;

(This pseudo code ignores the condition code calculation and exception generation which are not relevant to the discussion.)

This by itself is neat but not every exciting…until you remember that xor can be used to zero out a register. You can use XC to zero out up to 256 bytes of memory. It turns out this idiom is used pretty often in handwritten assembly, and compilers such as gcc even produce such instructions without any special effort on the programmer’s behalf.

For example, in HVF I have this line:

memset(&psw, 0, sizeof(struct psw));

Which GCC helpfully turns into (struct psw is 16 bytes in size):

xc      160(16,%r15),160(%r15)

When I first saw that line in the disassembly of HVF years ago, it blew my mind. It is elegant, fast thanks to the microarchitecture optimizations, and once you are used to the idiom it is clear about what it does. I hope your mind was as blown as mine. Till next time!

bool bitfield:1

This is the first of hopefully many posts related to interesting pieces of code I’ve stumbled across in the dovecot repository.

Back in 1999, C99 added the bool type. This is old news. The thing I’ve never seen before is what amounts to:

struct foo {
	bool	a:1;
	bool	b:1;

Sure, I’ve seen bitfields before—just never with booleans. Since this is C, the obvious thing happens here. The compiler packs the two bool bits into a single byte. In other words, sizeof(struct foo) is 1 (instead of 2 had we not used bitfields).

The compiler emits pretty compact code as well. For example, suppose we have this simple function:

void set(struct foo *x)
	x->b = true;

We compile it and disassemble:

$ gcc -c -O2 -Wall -m64 test.c
$ dis -F set test.o
disassembly for test.o

    set:     80 0f 02           orb    $0x2,(%rdi)
    set+0x3: c3                 ret

Had we used non-bitfield booleans, the resulting code would be:

    set:     c6 47 01 01        movb   $0x1,0x1(%rdi)
    set+0x4: c3                 ret

There’s not much of a difference in these simple examples, but in more complicated structures with many boolean flags the structure size difference may be significant.

Of course, the usual caveats about bitfields apply (e.g., the machine’s endian matters).

dis(1): support for System/370, System/390, and z/Architecture ELF bins

A few months ago, I came to the conclusion that it would be both fun and educational to add a new disassembler backend to libdisasm—the disassembler library in Illumos. Being a mainframe fan, I decided that implementing a System/390 and z/Architecture disassembler would be fun (I’ve done it before in HVF).

At first, I was targetting only the 390 and z/Architecture, but given that the System/370 is a trivial (almost) subset of the 390 (and there is a spec for 370 ELF files!), I ended up including the 370 support as well.

It took a while to get the code written (z/Architecture has so many instructions!) and reviewed, but it finally happened… the commit just landed in the repository.

If you get the latest Illumos bits, you’ll be able to disassemble 370, 390, and z/Architecture binaries with style. For example:

$ dis -F strcmp hvf             
disassembly for hvf

    strcmp:      a7 19 00 00        lghi    %r1,0
    strcmp+0x4:  a7 f4 00 08        j       0x111aec
    strcmp+0x8:  a7 1b 00 01        aghi    %r1,1
    strcmp+0xc:  b9 02 00 55        ltgr    %r5,%r5
    strcmp+0x10: a7 84 00 17        je      0x111b16
    strcmp+0x14: e3 51 20 00 00 90  llgc    %r5,0(%r1,%r2)
    strcmp+0x1a: e3 41 30 00 00 90  llgc    %r4,0(%r1,%r3)
    strcmp+0x20: 18 05              lr      %r0,%r5
    strcmp+0x22: 1b 04              sr      %r0,%r4
    strcmp+0x24: 18 40              lr      %r4,%r0
    strcmp+0x26: a7 41 00 ff        tmll    %r4,255
    strcmp+0x2a: a7 84 ff ef        je      0x111ae0
    strcmp+0x2e: 18 20              lr      %r2,%r0
    strcmp+0x30: 89 20 00 18        sll     %r2,%r0,24(%r0)
    strcmp+0x34: 8a 20 00 18        sra     %r2,%r0,24(%r0)
    strcmp+0x38: b9 14 00 22        lgfr    %r2,%r2
    strcmp+0x3c: 07 fe              br      %r14
    strcmp+0x3e: a7 28 00 00        lhi     %r2,0
    strcmp+0x42: b9 14 00 22        lgfr    %r2,%r2
    strcmp+0x46: 07 fe              br      %r14

I am hoping that this will help document all the places needed to change when adding support for a new ISA to libdisasm.

Happy disassembling!

Interactivity During nightly(1), part 2

Back in May, I talked about how I increase the priority of Firefox in order to get decent response times while killing my laptop with a nightly build of Illumos. Specifically, I have been increasing the priority of Firefox so that it would get to run in a timely manner. I have been doing this by setting it to the real-time (RT) scheduling class which has higher priority than most things on the system. This, of course, requires extra privileges.

Today, I realized that I was thinking about the problem the wrong way. What I really should be doing is lowering the priority of the build. This requires no special privileges. How do I do this? In my environment file, I include the following line:

priocntl -s -c FX -p 0 $$

This sets the nightly build script’s scheduling class to fixed (FX) and manually sets the priority to 0. From that point on, the nightly script and any processes it spawns run with a lower priority (zero) than everything else (which tends to be in the 40-59 range).

GNU inline vs. C99 inline

Recently, I’ve been looking at inline functions in C. However instead of just the usual static inlines, I’ve been looking at all the variants. This used to be a pretty straightforward GNU C extension and then C99 introduced the inline keyword officially. Sadly, for whatever reason decided that the semantics would be just different enough to confuse me and everyone else.

GCC documentation has the following to say:

GCC implements three different semantics of declaring a function inline. One is available with -std=gnu89 or -fgnu89-inline or when gnu_inline attribute is present on all inline declarations, another when -std=c99, -std=c11, -std=gnu99 or -std=gnu11 (without -fgnu89-inline), and the third is used when compiling C++.

Dang! Ok, I don’t really care about C++, so there are only two ways inline can behave.

Before diving into the two different behaviors, there are two cases to consider: the use of an inline function, and the inline function itself. The good news is that the use of an inline function behaves the same in both C90 and C99. Where the behavior changes is how the compiler deals with the inline function itself.

After reading the GCC documentation and skimming the C99 standard, I have put it all into the following table. It lists the different ways of using the inline keyword and for each use whether or not a symbol is produced in C90 (with inline extension) and in C99.

Emit (C90) Emit (C99)
inline always never
static inline maybe maybe
extern inline never always

(“always” means that a global symbol is always produced regardless of if all the uses of it are inlined. “maybe” means that a local symbol will be produced if and only if some uses cannot be inlined. “never” means that no symbols are produced and any non-inlined uses will be dealt with via relocations to an external symbol.)

Note that C99 “switched” the meaning of inline and extern inline. The good news is, static inline is totally unaffected (and generally the most useful).

For whatever reason, I cannot ever remember this difference. My hope is that this post will help me in the future.

Trying it Out

We can verify this experimentally. We can compile the following C file with -std=gnu89 and -std=gnu99 and compare what symbols the compiler produces:

static inline void si(int x)

extern inline void ei(int x)

inline void i(int x)

And here’s what nm has to say about them:

00000000 T i

00000000 T ei

This is an extremely simple example where the “never” and “maybe” cases all skip generating a symbol. In a more involved program that has inline functions that use features of C that prevent inlining (e.g., VLAs) we would see either relocations to external symbols or local symbols.

Dumping Memory in MDB

It doesn’t take much reading of documentation, other people’s blogs, and other random web search results to learn how to dump a piece of memory in mdb.

In the following examples, I’ll use the address fffffffffbc30a70. This just so happens to be an avl_tree_t on my system. We can use the ::dump command:

> fffffffffbc30a70::dump
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc30a70:  801dc3fb ffffffff 0087b1fb ffffffff  ................

Or we can use the adb-style /B command:

> fffffffffbc30a70/B
kas+0x50:       80      

We can even specify the amount of data we want to dump. ::dump takes how many bytes to dump, while /B takes how many 1-byte integers to dump (while for example, /X takes how many 4-byte integers to dump):

> fffffffffbc30a70,20::dump
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc30a70:  801dc3fb ffffffff 0087b1fb ffffffff  ................
fffffffffbc30a80:  20000000 00000000 09000000 00000000   ...............
> fffffffffbc30a70,20/B
kas+0x50:       80      1d      c3      fb      ff      ff      ff      ff      0       87      b1      
                fb      ff      ff      ff      ff      20      0       0       0       0       0       
                0       0       9       0       0       0       0       0       0       0       

Things break down if we want to use a walker and pipe the output to ::dump or /B:

> fffffffffbc30a70::walk avl | ::dump
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc6d2e0:  00000000 00feffff 0000001e 03000000  ................
> fffffffffbc30a70::walk avl | /B
kpmseg:         0       0       0       0       0       0       0       0       0       

Even though there are 9 entries in the AVL tree, ::dump dumps only the first one. /B does a bit better and it does print what appears to be the first byte of each. What if we want to dump more than just the first byte? Say, the first 32? ::dump is of no use already. Let’s see what we can make /B do:

> fffffffffbc30a70::walk avl | 20/B
mdb: syntax error near "20"
> fffffffffbc30a70::walk avl | ,20/B
mdb: syntax error near ","

No luck.


Ok, it’s time for the trick that makes it all work. You have to use the ::eval function. For example:

> fffffffffbc30a70::walk avl | ::eval .,20::dump
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc6d2e0:  00000000 00feffff 0000001e 03000000  ................
fffffffffbc6d2f0:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc34960:  00000000 00ffffff 00000017 00000000  ................
fffffffffbc34970:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc31ce0:  00000017 00ffffff 00000080 00000000  ................
fffffffffbc31cf0:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc35a80:  00000097 00ffffff 0000a0fc 02000000  ................
fffffffffbc35a90:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc34880:  0000a0d3 03ffffff 00000004 00000000  ................
fffffffffbc34890:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc31d60:  0000a0d7 03ffffff 000060e8 fb000000  ..........`.....
fffffffffbc31d70:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc7f3a0:  000000c0 ffffffff 00b07f3b 00000000  ...........;....
fffffffffbc7f3b0:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc7de60:  000080fb ffffffff 00105500 00000000  ..........U.....
fffffffffbc7de70:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc7e000:  000080ff ffffffff 00004000 00000000  ..........@.....
fffffffffbc7e010:  00000000 00000000 200ac3fb ffffffff  ........ .......

Perfect! ::eval makes repetition with /B work as well:

> fffffffffbc30a70::walk avl | ::eval .,8/B
kpmseg:         0       0       0       0       0       fe      ff      ff
kvalloc:        0       0       0       0       0       ff      ff      ff
kpseg:          0       0       0       17      0       ff      ff      ff
kzioseg:        0       0       0       97      0       ff      ff      ff
kmapseg:        0       0       a0      d3      3       ff      ff      ff
kvseg:          0       0       a0      d7      3       ff      ff      ff
kvseg_core:     0       0       0       c0      ff      ff      ff      ff
ktextseg:       0       0       80      fb      ff      ff      ff      ff
kdebugseg:      0       0       80      ff      ff      ff      ff      ff


There is one more trick I want to share in this post. Suppose you have a mostly useless core file, and you want to dump the stack. Not as hex, but rather as a symbol + offset (if possible). The magic command you want is /nap. ‘/’ for printing, ‘n’ for a newline, ‘a’ for symbol + offset (of the value at “dot”), and ‘p’ for symbol (or address) of “dot”. (Formatting differences aside, ‘p’ prints the pointer—“dot”, and ‘a’ prints the value being pointed to—*“dot”.)

For example:

> fd94e3a8,8/nap
0xfd94e3a8:     0xfd94f5a8      
0xfd94e3b0:     0xfdd6ce28      
0xfd94e3b4:     0xfdd6a423      
0xfd94e3b8:     0xcc            
0xfd94e3c0:     0xfdd6ce00      
0xfd94e3c4:     0xfdd6ce28      

Since the memory happens to be part of the stack, there are no symbols associated with it and therefore the ‘p’ prints a raw hex value.

So, remember: if you have a core file and you think that you need to dump the stack to scavenge for hopefully useful values, you want to…nap. :)


Wikipedia article: Interchange — This definitely reminds me of xkcd: Highway Engineer Pranks. At the same time, it is fascinating how there is a whole set of standard interchanges.

DxOMark — Very in-depth reviews of SLR lenses and bodies.

Lisp as the Maxwell’s equations of software — Reading this has rekindled my interest in Lisp and Scheme.

This Man Has Been Trying to Live Life as a Goat

What’s going on with a Python assignment puzzle — As a C programmer, this is totally counter-intuitive to me.

Internet Mainframes Project — Screenshots of Wikipedia article: 3270 login screens of tons of internet facing mainframes.

The Case for Teaching Ignorance


Cap’n Proto — an insanely fast data interchange format.

The First Port of Unix

International Obfuscated C Code Contest winning entries — list of all the winning entries for all 23 years of IOCCC.

How to destroy Programmer Productivity

The Open-Office Trap — Having spent a couple of years in an open-office environment, I can tell you first hand that open-office is a bad idea. It was very annoying dealing with one set of light switches for about 70 people. The people near the window wanted them off, while us far away from the window wanted them on. The noise was also annoying — the chatty people (e.g., sales and support) were the worst. They were not malicious, just doing their job or chatting between phone calls.

Inside The Secret World of Russia’s Cold War Mapmakers

Wikipedia article: Underwater rugby — It is more like underwater basketball. I find it fascinating that player positions are a 3D location not a 2D location like in “normal” sports.

The Memory Sinkhole — an x86 design flaw allowing ring -2 privilege escalation.

Simple File System

On three or four occasions over the past 4 years, I had a use for simple file system spec. Either to teach people about file systems, or to have a simple file system to implement to learn the idiosyncracies of an operating system’s VFS layer. This is what I came up with back in 2011 when helping a friend learn about file systems.

Simple File System

The structure is really simple. All multi-byte integers are stored as big endian.

A disk is a linear sequence of blocks. Each block is 512 bytes long. You can read/write a block at a time. The first block on the disk is number 0, the second is 1, etc.

The following is the file system structure. First of all, the file system uses 1024 byte blocks, and therefore you need to issue two disk I/Os to process a file system block worth of data.

The first fs block (disk blocks 0 & 1) is reserved, you should not change it in any way.

The second block contains the superblock:

struct superblock {
        uint32_t magic;        /* == 0x42420374 */
        uint32_t root_inode;   /* the disk block containing the root inode */
        uint32_t nblocks;      /* number of block on the disk */
        uint32_t _pad[253];    /* unused (should be '\0' filled) */

Starting at the third block is the block allocation map. The most significant bit of the first byte of this block represents fs block 0. The next bit represents block 1, etc.

Each file is represented by an inode. The inode contains a number of data block pointers. The first pointer (blocks[0]) contains the first 1024 bytes of the file, blocks[1] the second, etc. The timestamps are in microseconds since 00:00:00 Jan 1, 1900 UTC.

sturct inode {
        uint32_t size;         /* file length in bytes */
        uint32_t _pad0;        /* unused (should be 0) */
        uint64_t ctime;        /* creation time stamp */
        uint64_t mtime;        /* last modification time stamp */
        uint16_t nblocks;      /* number of data blocks in this file */
        uint16_t _pad1;        /* unused (should be 0) */
        uint32_t _pad2;        /* unused (should be 0) */
        uint32_t blocks[248];  /* file block ptrs */

The root directory is represented by an inode, the data pointed to by this inode’s blocks[] have a special format. They should be treated as arrays of directory entries. The filename is space padded (so, “foo.txt” would be stored as “foo.txt                     ”).

struct direntry {
        char fname[28];        /* the filename */
        uint32_t inode;        /* the inode block ptr */

So, graphically, the file system looks something like:

sb -> rootinode
        |-> direntries
        |     |-> <"foo.txt", inodeptr>
        |     |                 \-> inode
        |     |                       |-> data
        |     |                       |-> data
        |     |                       \-> data
        |     |-> <"bar.txt", inodeptr>
        |     |                 \-> inode
        |     |                       |-> data
        |     |                       |-> data
        |     |                       \-> data
        |     |             .
        |     |             .
        |     |             .
        |     \-> <"xyz.txt", inodeptr>
        |                       \-> inode
        |                             |-> data
        |                             |-> data
        |                             \-> data
        |                   .
        |                   .
        |                   .
        \-> direntries
              |-> <"aaa.txt", inodeptr>
              |                 \-> inode
              |                       |-> data
              |                       |-> data
              |                       \-> data
              |-> <"bbb.txt", inodeptr>
              |                 \-> inode
              |                       |-> data
              |                       |-> data
              |                       \-> data
              |             .
              |             .
              |             .
              \-> <"ccc.txt", inodeptr>
                                \-> inode
                                      |-> data
                                      |-> data
                                      \-> data

Powered by blahgd