Josef “Jeff” Sipek

2015-08-10

Cap’n Proto — an insanely fast data interchange format.

The First Port of Unix

International Obfuscated C Code Contest winning entries — list of all the winning entries for all 23 years of IOCCC.

How to destroy Programmer Productivity

The Open-Office Trap — Having spent a couple of years in an open-office environment, I can tell you first hand that open-office is a bad idea. It was very annoying dealing with one set of light switches for about 70 people. The people near the window wanted them off, while us far away from the window wanted them on. The noise was also annoying — the chatty people (e.g., sales and support) were the worst. They were not malicious, just doing their job or chatting between phone calls.

Inside The Secret World of Russia’s Cold War Mapmakers

Wikipedia article: Underwater rugby — It is more like underwater basketball. I find it fascinating that player positions are a 3D location not a 2D location like in “normal” sports.

The Memory Sinkhole — an x86 design flaw allowing ring -2 privilege escalation.

Simple File System

On three or four occasions over the past 4 years, I had a use for simple file system spec. Either to teach people about file systems, or to have a simple file system to implement to learn the idiosyncracies of an operating system’s VFS layer. This is what I came up with back in 2011 when helping a friend learn about file systems.

Simple File System

The structure is really simple. All multi-byte integers are stored as big endian.

A disk is a linear sequence of blocks. Each block is 512 bytes long. You can read/write a block at a time. The first block on the disk is number 0, the second is 1, etc.

The following is the file system structure. First of all, the file system uses 1024 byte blocks, and therefore you need to issue two disk I/Os to process a file system block worth of data.

The first fs block (disk blocks 0 & 1) is reserved, you should not change it in any way.

The second block contains the superblock:

struct superblock {
        uint32_t magic;        /* == 0x42420374 */
        uint32_t root_inode;   /* the disk block containing the root inode */
        uint32_t nblocks;      /* number of block on the disk */
        uint32_t _pad[253];    /* unused (should be '\0' filled) */
};

Starting at the third block is the block allocation map. The most significant bit of the first byte of this block represents fs block 0. The next bit represents block 1, etc.

Each file is represented by an inode. The inode contains a number of data block pointers. The first pointer (blocks[0]) contains the first 1024 bytes of the file, blocks[1] the second, etc. The timestamps are in microseconds since 00:00:00 Jan 1, 1900 UTC.

sturct inode {
        uint32_t size;         /* file length in bytes */
        uint32_t _pad0;        /* unused (should be 0) */
        uint64_t ctime;        /* creation time stamp */
        uint64_t mtime;        /* last modification time stamp */
        uint16_t nblocks;      /* number of data blocks in this file */
        uint16_t _pad1;        /* unused (should be 0) */
        uint32_t _pad2;        /* unused (should be 0) */
        uint32_t blocks[248];  /* file block ptrs */
};

The root directory is represented by an inode, the data pointed to by this inode’s blocks[] have a special format. They should be treated as arrays of directory entries. The filename is space padded (so, “foo.txt” would be stored as “foo.txt                     ”).

struct direntry {
        char fname[28];        /* the filename */
        uint32_t inode;        /* the inode block ptr */
};

So, graphically, the file system looks something like:

sb -> rootinode
        |-> direntries
        |     |-> <"foo.txt", inodeptr>
        |     |                 \-> inode
        |     |                       |-> data
        |     |                       |-> data
        |     |                       \-> data
        |     |-> <"bar.txt", inodeptr>
        |     |                 \-> inode
        |     |                       |-> data
        |     |                       |-> data
        |     |                       \-> data
        |     |             .
        |     |             .
        |     |             .
        |     \-> <"xyz.txt", inodeptr>
        |                       \-> inode
        |                             |-> data
        |                             |-> data
        |                             \-> data
        |                   .
        |                   .
        |                   .
        \-> direntries
              |-> <"aaa.txt", inodeptr>
              |                 \-> inode
              |                       |-> data
              |                       |-> data
              |                       \-> data
              |-> <"bbb.txt", inodeptr>
              |                 \-> inode
              |                       |-> data
              |                       |-> data
              |                       \-> data
              |             .
              |             .
              |             .
              \-> <"ccc.txt", inodeptr>
                                \-> inode
                                      |-> data
                                      |-> data
                                      \-> data

Tail Call Optimization

I just had an interesting realization about tail call optimization. Often when people talk about it, they simply describe it as an optimization that the compiler does whenever you end a function with a function call whose return value is propagated up as is. Technically this is true. Practically, people use examples like this:

int foo(int increment)
{
	if (increment)
		return bar() + 1; /* NOT a tail-call */
	
	return bar(); /* a tail-call */
}

It sounds like a very solid example of a tail-call vs. not. I.e., if you “post process” the value before returning it is not a tail-call.

Going back to my realization, I think people often forget about one type of “post processing” — casts. Consider the following code:

extern short bar();

int foo()
{
        return bar();
}

Did you spot it? This is not a tail-call.

The integer promotion from short to int is done after bar returns but before foo returns.

For fun, here’s the disassembly:

$ gcc -Wall -O2 -c test.c
$ dis test.o
...
foo()
    foo:     55                 pushl  %ebp
    foo+0x1: 89 e5              movl   %esp,%ebp
    foo+0x3: 83 ec 08           subl   $0x8,%esp
    foo+0x6: e8 fc ff ff ff     call   -0x4	<foo+0x7>
    foo+0xb: c9                 leave  
    foo+0xc: 98                 cwtl   
    foo+0xd: c3                 ret    

For completeness, if we change the return value of bar to int:

$ gcc -Wall -O2 -c test.c
$ dis test.o
...
foo()
    foo:       e9 fc ff ff ff     jmp    -0x4	<foo+0x1>

I wonder how many people think they are tail-call optimizing, but in reality they are “wasting” a stack frame because of this silly reason.

Interactivity During nightly(1)

Every so often, I do a nightly build of illumos on my laptop. This is a long and very CPU intensive process. During the build (which takes about 2.75 hours), the load average is rarely below 8 and most of the time it hovers in the low twenties. (This is a full debug and non-debug build with lint and all the other checking. If I need a build quickly, I can limit it to just what I need and then we’re talking minutes or seconds.)

Anyway, as you might imagine this sort of load puts some pressure on the CPUs. As a result, some interactive processes suffer a bit. Specifically, Firefox doesn’t get enough cycles to render the modern web (read: JavaScript cesspool). Instead of suffering for almost three hours, I just change Firefox’s scheduling class from IA (interactive) to RT (real time):

# priocntl -s -c RT `pgrep firefox`

This allows Firefox to preempt just about everything on my laptop. This works because Firefox actually yields the CPU properly. This will probably bite me sometime in the future when I end up on a page with such a busted JavaScript turd that it takes over a CPU and won’t let go. Till then, I have a pretty good workaround.

Debugging with mdb

Recently, Theo Schlossnagle posted two interesting articles about debugging on Illumos using mdb. They are MDB, CTF, DWARF, and other angelic things, and mdb custom dmods.

Inline Assembly & clang

Recently I talked about inline assembly with GCC and clang where I pointed out that LLVM seems to produce rather silly machine code. In a comment, a reader asked if this was LLVM’s IR doing this or if it was the machine code generator being silly. I was going to reply there, but the reply got long enough to deserve its own post.

I’ve dealt with LLVM’s IR for a couple of months during the fall of 2010. It was both interesting and quite painful.

The IR is at the Wikipedia article: single static assignment level. It assumes that stack space is cheap and infinite. Since it is a SSA form, it has no notion of registers. The optimization passes transform the IR quite a bit and at the end there is very little (if any!) useless code. In other words, I think it is the machine code generation that is responsible for the unnecessary stack frame push and pop. With that said, it is time to experiment.

Using the same test program as before, of course:

#define _KERNEL
#define _ASM_INLINES
#include <sys/atomic.h>

void test(uint32_t *x)
{
	atomic_inc_32(x);
}

Emitting LLVM IR

Let’s compile it with clang passing in the -emit-llvm option to have it generate test.ll file with the LLVM IR:

$ clang -S -emit-llvm -Wall -O2 -m64 test.c

There is a fair amount of “stuff” in the file, but the relevant portions are (line-wrapped by me):

; Function Attrs: nounwind
define void @test(i32* %x) #0 {
entry:
  tail call void asm sideeffect "lock; incl $0",
    "=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1
  ret void
}

attributes #0 = { nounwind uwtable "less-precise-fpmad"="false"
  "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"
  "no-infs-fp-math"="false" "no-nans-fp-math"="false"
  "stack-protector-buffer-size"="8" "unsafe-fp-math"="false"
  "use-soft-float"="false" }

LLVM’s IR happens to be very short and to the point. The function prologue and epilogue are not expressed as part of IR blob that gets passed to the machine code generator. Note the function attribute no-frame-pointer-elim being true (meaning frame pointer elimination will not happen).

Now, let’s add in the -fomit-frame-pointer option.

$ clang -S -emit-llvm -Wall -O2 -m64 -fomit-frame-pointer test.c

Now, the relevant IR pieces are:

; Function Attrs: nounwind
define void @test(i32* %x) #0 {
entry:
  tail call void asm sideeffect "lock; incl $0",
    "=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1
  ret void
}

attributes #0 = { nounwind uwtable "less-precise-fpmad"="false"
  "no-frame-pointer-elim"="false" "no-infs-fp-math"="false"
  "no-nans-fp-math"="false" "stack-protector-buffer-size"="8"
  "unsafe-fp-math"="false" "use-soft-float"="false" }

The no-frame-pointer-elim attribute changed (from true to false), but the IR of the function itself did not change. (The no-frame-pointer-elim-non-leaf attribute disappeared as well, but it really makes sense since -fomit-frame-pointer is a rather large hammer that just forces frame pointer elimination everywhere and so it doesn’t make sense to differentiate between leaf and non-leaf functions.)

So, to answer Steve’s question, the LLVM IR does not include the function prologue and epilogue. This actually makes a lot of sense given that the IR is architecture independent and the exact details of what the prologue has to do are define by the ABIs.

IR to Assembly

We can of course use llc to convert the IR into real 64-bit x86 assembly code.

$ llc --march=x86-64 test.ll
$ gas -o test.o --64 test.s

Here is the disassembly for clang invocation without -fomit-frame-pointer:

test()
    test:     55                 pushq  %rbp
    test+0x1: 48 89 e5           movq   %rsp,%rbp
    test+0x4: f0 ff 07           lock incl (%rdi)
    test+0x7: 5d                 popq   %rbp
    test+0x8: c3                 ret    

And here is the disassembly for clang invocation with -fomit-frame-pointer:

test()
    test:     f0 ff 07           lock incl (%rdi)
    test+0x3: c3                 ret    

Conclusion

So, it turns out that my previous post simply stumbled across the fact that GCC and clang have different set of optimizations for -O2. GCC includes -fomit-frame-pointer by default, while clang does not.

Inline Assembly & GCC, clang

Recently, I got to write a bit of inline assembly. In the process I got to test my changes by making a small C file which defined test function that called the inline function from the header. Then, I could look at the disassembly to verify all was well.

#define _KERNEL
#define _ASM_INLINES
#include <sys/atomic.h>

void test(uint32_t *x)
{
	atomic_inc_32(x);
}

GCC has been my go to complier for a long time now. So, at first I was using it to debug my inline assembly. I compiled the test programs using:

$ gcc -Wall -O2 -m64 -c test.c

Disassembling the object file yields the rather obvious:

test()
    test:     f0 ff 07           lock incl (%rdi)
    test+0x3: c3                 ret    

I can’t think of any way to make it better :)

Then, at some point I remembered that Clang/LLVM are pretty good as well. I compiled the same file with clang:

$ clang -Wall -O2 -m64 -c test.c

The result was rather disappointing:

test()
    test:     55                 pushq  %rbp
    test+0x1: 48 89 e5           movq   %rsp,%rbp
    test+0x4: f0 ff 07           lock incl (%rdi)
    test+0x7: 5d                 popq   %rbp
    test+0x8: c3                 ret    

For whatever reason, Clang feels the need to push/pop the frame pointer. I did a little bit of searching, and I couldn’t find a way to disable this behavior.

The story for 32-bit output is very similar (just drop the -m64 from the compiler invocation). GCC produced the superior output:

test()
    test:     8b 44 24 04        movl   0x4(%esp),%eax
    test+0x4: f0 ff 00           lock incl (%eax)
    test+0x7: c3                 ret    

While Clang still wanted to muck around with the frame pointer.

test()
    test:     55                 pushl  %ebp
    test+0x1: 89 e5              movl   %esp,%ebp
    test+0x3: 8b 45 08           movl   0x8(%ebp),%eax
    test+0x6: f0 ff 00           lock incl (%eax)
    test+0x9: 5d                 popl   %ebp
    test+0xa: c3                 ret    

For the curious ones, I’m using GCC 4.8.3 and Clang 3.4.2.

I realize this is a bit of a special case (how often to you make a function that simply calls an inline function?), but it makes me worried about what sort of sub-optimal code Clang produces in other cases.

Inlining Atomic Operations

One of the items on my ever growing TODO list (do these ever shrink?) was to see if inlining Illumos’s atomic_* functions would make any difference. (For the record, these functions atomically manipulate variables. You can read more about them in the various man pages — atomic_add, atomic_and, atomic_bits, atomic_cas, atomic_dec, atomic_inc, atomic_or, atomic_swap.) Of course once I looked at the issue deeply enough, I ended up with five cleanup patches. The gist of it is, inlining them caused not only about 1% kernel performance improvement on the benchmarks, but also reduced the kernel size by a couple of kilobytes. You can read all about it in the associated bugs (5042, 5043, 5044, 5045, 5046, 5047) and the patch 0/6 email I sent to the developer list. In this blahg post, I want to talk about how exactly Illumos presents these atomic functions in a stable ABI but at the same time allows for inlines.

Genesis

It should come as no surprise that the “content” of these functions really needs to be written in assembly. The functions are 100% implemented in assembly in usr/src/common/atomic. There, you will find a directory per architecture. For example, in the amd64 directory, we’ll find the code for a 64-bit atomic increment:

	ENTRY(atomic_inc_64)
	ALTENTRY(atomic_inc_ulong)
	lock
	incq	(%rdi)
	ret
	SET_SIZE(atomic_inc_ulong)
	SET_SIZE(atomic_inc_64)

The ENTRY, ALTENTRY, and SET_SIZE macros are C preprocessor macros to make writing assembly functions semi-sane. Anyway, this code is used by both the kernel as well as userspace. I am going to ignore the userspace side of the picture and talk about the kernel only.

These assembly functions, get mangled by the C preprocessor, and then are fed into the assembler. The object file is then linked into the rest of the kernel. When a module binary references these functions the krtld (linker-loader) wires up those references to this code.

Inline

Replacing these function with inline functions (using the GNU definition) would be fine as far as all the code in Illumos is concerned. However doing so would remove the actual functions (as well as the symbol table entries) and so the linker would not be able to wire up any references from modules. Since Illumos cares about not breaking existing external modules (both open source and closed source), this simple approach would be a no-go.

Inline v2

Before I go into the next and final approach, I’m going to make a small detour through C land.

extern inline

First off, let’s say that we have a simple function, add, that returns the sum of the two integer arguments, and we keep it in a file called add.c:

#include "add.h"

int add(int x, int y)
{
	return x + y;
}

In the associated header file, add.h, we may include a prototype like the following to let the compiler know that add exists elsewhere and what types to expect.

extern int add(int, int);

Then, we attempt to call it from a function in, say, test.c:

#include "add.h"

int test()
{
	return add(5, 7);
}

Now, let’s turn these two .c files into a .so. We get the obvious result — test calls add:

test()
    test:     be 07 00 00 00     movl   $0x7,%esi
    test+0x5: bf 05 00 00 00     movl   $0x5,%edi
    test+0xa: e9 b1 fe ff ff     jmp    -0x14f	<0xc90>

And the binary contains both functions:

$ /usr/bin/nm test.so | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[74]	|                3520|                   4|FUNC |GLOB |0    |13   |add
[65]	|                3536|                  15|FUNC |GLOB |0    |13   |test

Now suppose that we modify the header file to include the following (assuming GCC’s inline definition):

extern int add(int, int);

extern inline int add(int a, int b)
{
	return a + b;
}

If we compile and link the same .so the same way, that is we feed in the object file with the previously used implementation of add, we’ll get a slightly different binary. The invocation of add will use the inlined version:

test()
    test:     b8 0c 00 00 00     movl   $0xc,%eax
    test+0x5: c3                 ret    

But the binary will still include the symbol:

$ /usr/bin/nm test.so | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[72]	|                3408|                   4|FUNC |GLOB |0    |11   |add
[63]	|                3424|                   6|FUNC |GLOB |0    |11   |test

Neat, eh?

extern inline atomic what?

How does this apply to the atomic functions? Pretty simply. As I pointed out, usr/src/common/atomic contains the pure assembly implementations — these are the functions you’ll always find in the symbol table.

The common header file that defines extern prototypes is usr/src/uts/common/sys/atomic.h.

Now, the trick. If you look carefully at the header file, you’ll spot a check on line 39. If all the conditions are true (kernel code, GCC, inline assembly is allowed, and x86), we include asm/atomic.h — which lives at usr/src/uts/intel/asm/atomic.h. This is where the extern inline versions of the atomic functions get defined.

So, kernel code simply includes <sys/atomic.h>, and if the stars align properly, any atomic function use will get inlined.

Phew! This ended up being longer than I expected. :)

Lua Compatibility

Phew! Yesterday afternoon, I decided to upgrade my laptop’s OpenIndiana from 151a9 to “Hipster”. I did it in a bit convoluted way, and hopefully I’ll write about that some other day. In the end, I ended up with a fresh install of the OS with X11 and Gnome. If you’ve ever seen my monitors, you know that I do not use Gnome — I use Notion. So, of course I had it install it. Sadly, OpenIndiana doesn’t ship it so it was up to me to compile it. After the usual fight to get a piece of software to compile on Illumos (a number of the Solaris-isms are still visible), I got it installed. A quick gdm login later, Notion threw me into a minimal environment because something was exploding.

After far too many hours of fighting it, searching online, and trying random things, I concluded that it was not Notion’s fault. Rather, it was something on the system. Eventually, I figured it out. Lua 5.2 (which is standard on Hipster) is not compatible with Lua 5.1 (which is standard on 151a9)! Specifically, a number of functions have been removed and the behavior of other functions changed. Not being a Lua expert (I just deal with it whevever I need to change my window manager’s configuration), it took longer than it should but eventually I managed to get Notion working like it should be.

So, what sort of incompatibilies did I have to work around?

loadstring

loadstring got renamed to load. This is an easy to fix thing, but still a headache especially if you want to support multiple versions of Lua.

table.maxn

table.maxn got removed. This function returned the largest positive integer key in an associative array (aka. a table) or 0 if there aren’t any. (Lua indexes arrays starting at 1.) The developers decided that it’s so simple that those that want it can write it themselves. Here’s my version:

local function table_maxn(t)
    local mn = 0
    for k, v in pairs(t) do
        if mn < k then
            mn = k
        end
    end
    return mn
end

table.insert

table.insert now checks bounds. There doesn’t appear to be any specific way to get old behavior. In my case, I was lucky. The index/positition for the insertion was one higher than table_maxn returned. So, I could replace:

table.insert(ret, pos, newscreen)

with:

ret[pos] = newscreen

Final Thougths

I can understand wanting to deprecate old crufty interfaces, but I’m not sure that the Lua developers did it right. I really think they should have marked those interfaces as obsolete, make any use spit out a warning, and then in a couple of years remove it. I think that not doing this, will hurt Lua 5.2’s adoption.

Yes, I believe there is some sort of a compile time option for Lua to get legacy interfaces, but not everyone wants to recompile Lua because the system installed version wasn’t compiled quite the way that would make things Just Work™.

Generating Random Data

Over the years, there have been occasions when I needed to generate random data to feed into whatever system. At times, simply using /dev/random or /dev/urandom was sufficient. At other times, I needed to generate random data at a rate that exceeded what I could get out of /dev/random. This morning, I read Chris’s blog entry about his need for generating lots of random data. I decided that I should write my favorite approach so that others can benefit.

The approach is very simple. There are two phases. First, we set up our own random pool. Second we use the random pool. I am going to use an example throughout the rest of this post. Suppose that we want to make repeated 128 kB writes to a block device and we want the data to be random so that the device can’t do anything clever (e.g., compress or dedup). Say that during this benchmark we want to write out 64 GB total. (In other words, we will issue 524288 writes.)

Setup Phase

During the setup phase, we create a pool of random data. The easiest way is to just read /dev/urandom. Here, we want to read enough data so that the pool is large enough. For our 128kB write example, we’d want at least 1 MB. (I’ll explain the sizing later. I would probably go with something like 8 MB because unless I’m on some sort of limited system, the extra 7 MB of RAM won’t be missed.)

“Using the Pool” Phase

Now that we have the pool, we can use it to generate random buffers. The basic idea is to pick a random offset into the pool and just grab the bytes starting at that location. In our example, we’d pick a random offset between zero and pool size minus 128 kB, and use the 128 kB at that offset.

In pseudo code:

#define BUF_SIZE	(128 * 1024)
#define POOL_SIZE	(1024 * 1024)

static char pool[POOL_SIZE];

char *generate()
{
	return &pool[rand() % (POOL_SIZE - BUF_SIZE)];
}

That’s it! You can of course make it more general and let the caller tell you how many bytes they want:

#define POOL_SIZE	(1024 * 1024)

static char pool[POOL_SIZE];

char *generate(size_t len)
{
	return &pool[rand() % (POOL_SIZE - len)];
}

It takes a pretty simple argument to show that even a modest sized pool will be able to generate lots of different random buffers. Let’s say we’re dealing with the 128 kB buffer and 1 MB pool case. The pool can return 128 kB starting at offset 0, or offset 1, or offset 2, … or offset 9175043 ($1~MB - 128~kB - 1B$). This means that there are 917504 possible outputs. Recall, that in our example we were planning on writing out 64 GB in total which was 524288 writes.

$\frac{524288}{917504} = 0.571$

In other words, we are planning on using less than 58% of the possible outputs from our 1 MB pool to write out 64 GB of random data! (An 8 MB pool would yield 6.3% usage.)

If the length is variable, the math gets more complicated, but in a way we get even better results (i.e., lower usage) because to generate the same buffer we would need have the same offset and length. If the caller supplies random (pseudo-random or based on some distribution) lengths, we’re very unlikely to get the same buffer out of the pool.

Observations

Some of you may have noticed that we traded generating 128 kB (or a user supplied length) of random data for generating a random integer. There are two options there, either you can use a fast pseudo-random number generator (like the Wikipedia article: Mersenne twister), or you can reuse same pool! In other words:

#define POOL_SIZE	(1024 * 1024)

static char pool[POOL_SIZE];
static size_t *ridx = (size_t *) pool;

char *generate(size_t len)
{
	if ((uintptr_t) ridx == (uintptr_t) &pool[POOL_SIZE])
		ridx = (size_t *) pool;

	ridx++;

	return &pool[*ridx % (POOL_SIZE - len)];
}

I leave it as an exercise for the reader to either make it multi-thread safe, or to make the index passed in similarly to how rand_r takes an argument.

rand_r Considered Harmful

Since we’re on the topic of random number generation, I thought I’d mention what is already rather widely known fact — libc’s rand and rand_r implementations are terrible. At one point, I tried using them with this approach to generate random buffers, but it didn’t take very long before I got repeats! Caveat emptor.

Powered by blahgd