# Josef “Jeff” Sipek

## Rust Pointers for C Programmers

I’ve been eyeing Rust for about a year now. Here and there, I tried to use it to make a silly little program, or to implement some simple function in it to see for myself how ergonomic it really was, and what sort of machine code rustc spit out. But last weekend I found a need for a tool to clean up some preprocessor mess, and so instead of hacking together some combination of shell and Python, I decided to write it in Rust.

From my earlier attempts, I knew that there are a lot of different “pointers” but I found all the descriptions of them lacking or confusing. Specifically, Rust calls itself a systems programming language, yet I found no clear description of how the different pointers map to C—the systems programming language. Eventually, I stumbled across The Periodic Table of Rust Types, which made things a bit clearer, but I still didn’t feel like I truly understood.

During my weekend expedition to Rust land, I think I’ve grokked things enough to write this explanation of how Rust does things. As always, feedback is welcomed.

I’ll describe what happens in terms of C. To keep things simple, I will:

• assume that you are well-versed in C
• assume that you can read Rust (any intro will teach you enough)
• not bother with const for the C snippets

In the following text, I assume that we have some struct T. The actual contents don’t matter. In other words:

```
struct T {
	/* some members */
};
```

With that out of the way, let’s dive in!

### *const T and *mut T

These are raw pointers. In general, you shouldn’t use them since only unsafe code can dereference them, and the whole point of Rust is to write as much safe code as possible.

Raw pointers are just like what you have in C. If you make a pointer, you end up using sizeof(struct T *) bytes for the pointer. In other words:

```
struct T *ptr;
```

### &T and &mut T

These are borrowed references. They use the same amount of space as raw pointers, and in the generated machine code they behave in exactly the same way. Consider this trivial example:

```
#[no_mangle]
pub fn raw(p: *mut usize) {
	unsafe {
		*p = 5;
	}
}

#[no_mangle]
pub fn safe(p: &mut usize) {
	*p = 5;
}
```

A rustc invocation later, we have:

```
raw()
raw:     55                 pushq  %rbp
raw+0x1: 48 89 e5           movq   %rsp,%rbp
raw+0x4: 48 c7 07 05 00 00  movq   $0x5,(%rdi)
         00
raw+0xb: 5d                 popq   %rbp
raw+0xc: c3                 ret

safe()
safe:     55                 pushq  %rbp
safe+0x1: 48 89 e5           movq   %rsp,%rbp
safe+0x4: 48 c7 07 05 00 00  movq   $0x5,(%rdi)
          00
safe+0xb: 5d                 popq   %rbp
safe+0xc: c3                 ret
```

Note that the two functions are bit-for-bit identical.

The only differences between borrowed references and raw pointers are:

1. references will never point at bogus addresses (i.e., they are never NULL or uninitialized),
2. the compiler doesn’t let you do arbitrary pointer arithmetic on references,
3. the borrow checker will make you question your life choices for a while.

(#3 gets better over time.)

### Box<T>

These are owned “pointers”. If you are a C++ programmer, you are already familiar with them. Never having truly worked with C++, I had to think about this a bit until it clicked, but it is really easy.

No matter what all the documentation and tutorials out there say, Box<T> is not a pointer but rather a structure containing a pointer to heap allocated memory just big enough to hold T. The heap allocation and freeing is handled automatically. (Allocation is done in the Box::new function, while freeing is done via the Drop trait, but that’s not relevant as far as the memory layout is concerned.) In other words, Box<T> is something like:

```
struct box_of_T {
	struct T *heap_ptr;
};
```

Then, when you make a new box you end up putting only what amounts to sizeof(struct T *) on the stack and it magically starts pointing to somewhere on the heap. In other words, the Rust code like this:

```
let x = Box::new(T { ... });
```

is roughly equivalent to:

```
struct box_of_T x;

x.heap_ptr = malloc(sizeof(struct T));
if (!x.heap_ptr)
	oom();

*x.heap_ptr = ...;
```

### &[T] and &mut [T]

These are borrowed slices. This is where things get interesting. Even though it looks like they are just references (which, as stated earlier, translate into simple C-style pointers), they are much more. These types of references use fat pointers—that is, a combination of a pointer and a length.

```
struct fat_pointer_to_T {
	struct T *ptr;
	size_t nelem;
};
```

This is incredibly powerful, since it allows bounds checking at runtime and getting a subset of a slice is essentially free!

### &[T; n] and &mut [T; n]

These are borrowed references to arrays. They are different from borrowed slices. Since the length of an array is a compile-time constant (the compiler will yell at you if n is not a constant), all the bounds checking can be performed statically. And therefore there is no need to pass around the length in a fat pointer. So, they are passed around as plain ol’ pointers.

```
struct T *ptr;
```

### T, [T; n], and [T]

While these aren’t pointers, I thought I’d include them here for completeness’s sake.

#### T

Just like in C, a struct uses as much space as its type requires (i.e., sum of the sizes of its members plus padding).

#### [T; n]

Just like in C, an array of structs uses n times the size of the struct.

#### [T]

The simple answer here is that you cannot make a [T]. That actually makes perfect sense when you consider what that type means. It is saying that we have some variable sized slice of memory that we want to access as elements of type T. Since this is variable sized, the compiler cannot possibly reserve space for it at compile time and so we get a compiler error.

The more complicated answer involves the Sized trait, which I’ve skillfully managed to avoid thus far and so you are on your own.

### Summary

That was a lot of text, so I decided to compact it and make the following table. In the table, I assume that our T struct is 100 bytes in size. In other words:

```
/* Rust */
struct T {
	stuff: [u8; 100],
}

/* C */
struct T {
	uint8_t stuff[100];
};
```

Now, the table in its full glory:

| | Rust | C | Size on ILP32/LP64 (bytes) |
|---|---|---|---|
| Value | `let x: T;` | `struct T x;` | 100/100 |
| Raw pointer | `let x: *const T;` / `let x: *mut T;` | `struct T *x;` | 4/8 |
| Reference | `let x: &T;` / `let x: &mut T;` | `struct T *x;` | 4/8 |
| Box | `let x: Box<T>;` | `struct box_of_T x;` | 4/8 |
| Array of 2 | `let x: [T; 2];` | `struct T x[2];` | 200/200 |
| Reference to an array of 2 | `let x: &[T; 2];` | `struct T *x;` | 4/8 |
| A slice | `let x: [T];` | `struct T x[];` | unknown at compile time |
| A reference to a slice | `let x: &[T];` | `struct fat_pointer_to_T x;` | 8/16 |

(`struct box_of_T` and `struct fat_pointer_to_T` are the structs defined earlier.)

A word of caution: I assume that the sizes of the various pointers are actually implementation details and shouldn’t be relied on to be that way. (Well, with the exception of raw pointers; without those being fixed, FFI would be unnecessarily complicated.)

I didn’t cover str, &str, String, and Vec<T> since I don’t consider them fundamental types, but rather convenience types built on top of slices, structs, references, and boxes.

Anyway, I hope you found this useful. If you have any feedback (good or bad), let me know.

## Creative xor Use

Last month at work I got to try to optimize a function that takes a number and rounds it up to the next power of 2. The previous implementation used a simple loop. I didn’t dive into obscure bit twiddling, but rather used a helper function that is already in the codebase. Yes, I let the compiler do the heavy lifting of turning easy to understand code into good machine code. The x86 binary that gcc 6.3 produced has an interesting idiom, and that’s why I’m writing this entry.

The new code:

```
static inline unsigned int bits_required32(uint32_t num)
{
	return num == 0 ? 0 : 32 - __builtin_clz(num);
}

/* Returns x, such that x is the smallest power of 2 >= num. */
uint32_t nearest_power(uint32_t num)
{
	if (num == 0)
		return 1;

	return 1U << bits_required32(num - 1);
}
```

This is a slightly simplified version of the code, but it demonstrates the optimization quite well.

The nearest_power function disassembles as:

```
nearest_power()
nearest_power:      8b 54 24 04        movl   0x4(%esp),%edx
nearest_power+0x4:  b8 01 00 00 00     movl   $0x1,%eax
nearest_power+0x9:  85 d2              testl  %edx,%edx
nearest_power+0xb:  74 14              je     +0x14	<nearest_power+0x21>
nearest_power+0xd:  83 ea 01           subl   $0x1,%edx
nearest_power+0x10: 74 0f              je     +0xf	<nearest_power+0x21>
nearest_power+0x12: 0f bd d2           bsrl   %edx,%edx
nearest_power+0x15: b9 20 00 00 00     movl   $0x20,%ecx
nearest_power+0x1a: 83 f2 1f           xorl   $0x1f,%edx
nearest_power+0x1d: 29 d1              subl   %edx,%ecx
nearest_power+0x1f: d3 e0              shll   %cl,%eax
nearest_power+0x21: c3                 ret
```

The first six instructions contain the prologue and deal with num being zero or one (both cases produce the result 1). The remaining six instructions are where the actual calculation happens. I’m going to ignore the first half of the function, since the second half is where the interesting things happen.

First, we find the index of the most significant set bit in num - 1 (that is what bsr does) and stash the value 32 in a register:

```
    nearest_power+0x12: 0f bd d2           bsrl   %edx,%edx
    nearest_power+0x15: b9 20 00 00 00     movl   $0x20,%ecx
```

The bit index (%edx) is in the range 0–31.

Here is the really interesting bit:

```
    nearest_power+0x1a: 83 f2 1f           xorl   $0x1f,%edx
```

This xors the bit index (i.e., 0–31) with 31, which turns it into the number of leading zeros. To decipher why, I find it easier to consider the top 27 bits and the bottom 5 bits separately.

```
operand  binary
0x1f     00000000 00000000 00000000 000 11111
edx      00000000 00000000 00000000 000 xxxxx
```

The xor of the top bits produces 0 since both the constant 31 and the register containing any of the numbers 0–31 have zeros there.

The xor of the bottom bits inverts them since the constant has ones there.

When combined, the xor has the same effect as this C expression:

```
out = (~in) & 0x1f;
```

This seems very weird and useless, but it is far from it. It turns out that for inputs 0–31 the above expression is the same as:

```
out = 31 - in;
```

I think it is really cool that gcc produced this xor instead of a less optimal multi-instruction version.

The remainder of the disassembly just subtracts and shifts to produce the return value.

### Why xor?

I think the reason gcc (and clang for that matter) produce this sort of xor instruction instead of a subtraction is very simple: on x86 the sub instruction’s left hand side and the destination must be the same register. That is, on x86 the sub instruction works as:

```
x -= y;
```

Since sub overwrites its left hand side, it isn’t possible to express out = 31 - in with just one sub—you would first have to mov the constant 31 into a register.

Anyway, that’s it for today. I hope you enjoyed this as much as I did.

## 2017-04-22

• How the PC Industry Screws Things Up
• C to Rust translator
• Csmith — a random generator of C programs
• ACME Mapper — high-precision general purpose mapping application

## bool bitfield:1

This is the first of hopefully many posts related to interesting pieces of code I’ve stumbled across in the dovecot repository.

Back in 1999, C99 added the bool type. This is old news. The thing I’ve never seen before is what amounts to:

```
struct foo {
	bool	a:1;
	bool	b:1;
};
```

Sure, I’ve seen bitfields before—just never with booleans. Since this is C, the obvious thing happens here. The compiler packs the two bool bits into a single byte. In other words, sizeof(struct foo) is 1 (instead of 2 had we not used bitfields).

The compiler emits pretty compact code as well. For example, suppose we have this simple function:

```
void set(struct foo *x)
{
	x->b = true;
}
```

We compile it and disassemble:

```
$ gcc -c -O2 -Wall -m64 test.c
$ dis -F set test.o
disassembly for test.o

set()
set:     80 0f 02           orb    $0x2,(%rdi)
set+0x3: c3                 ret
```

Had we used non-bitfield booleans, the resulting code would be:

```
set()
set:     c6 47 01 01        movb   $0x1,0x1(%rdi)
set+0x4: c3                 ret
```

There’s not much of a difference in these simple examples, but in more complicated structures with many boolean flags the structure size difference may be significant.

Of course, the usual caveats about bitfields apply (e.g., the machine’s endianness matters).

## GNU inline vs. C99 inline

Recently, I’ve been looking at inline functions in C. However, instead of just the usual static inlines, I’ve been looking at all the variants. This used to be a pretty straightforward GNU C extension, and then C99 introduced the inline keyword officially. Sadly, for whatever reason, the committee decided that the semantics would be just different enough to confuse me and everyone else.

GCC documentation has the following to say:

GCC implements three different semantics of declaring a function inline. One is available with -std=gnu89 or -fgnu89-inline or when gnu_inline attribute is present on all inline declarations, another when -std=c99, -std=c11, -std=gnu99 or -std=gnu11 (without -fgnu89-inline), and the third is used when compiling C++.

Dang! Ok, I don’t really care about C++, so there are only two ways inline can behave.

Before diving into the two different behaviors, there are two cases to consider: the use of an inline function, and the inline function itself. The good news is that the use of an inline function behaves the same in both C90 and C99. Where the behavior changes is how the compiler deals with the inline function itself.

After reading the GCC documentation and skimming the C99 standard, I have put it all into the following table. It lists the different ways of using the inline keyword and for each use whether or not a symbol is produced in C90 (with inline extension) and in C99.

| | Emit (C90) | Emit (C99) |
|---|---|---|
| inline | always | never |
| static inline | maybe | maybe |
| extern inline | never | always |

(“always” means that a global symbol is always produced regardless of if all the uses of it are inlined. “maybe” means that a local symbol will be produced if and only if some uses cannot be inlined. “never” means that no symbols are produced and any non-inlined uses will be dealt with via relocations to an external symbol.)

Note that C99 “switched” the meaning of inline and extern inline. The good news is, static inline is totally unaffected (and generally the most useful).

For whatever reason, I cannot ever remember this difference. My hope is that this post will help me in the future.

### Trying it Out

We can verify this experimentally. We can compile the following C file with -std=gnu89 and -std=gnu99 and compare what symbols the compiler produces:

```
static inline void si(int x)
{
}

extern inline void ei(int x)
{
}

inline void i(int x)
{
}
```

And here’s what nm has to say about them:

```
test-gcc89:
00000000 T i

test-gcc99:
00000000 T ei
```

This is an extremely simple example where the “never” and “maybe” cases all skip generating a symbol. In a more involved program that has inline functions that use features of C that prevent inlining (e.g., VLAs) we would see either relocations to external symbols or local symbols.

## Tail Call Optimization

I just had an interesting realization about tail call optimization. Often when people talk about it, they simply describe it as an optimization that the compiler does whenever you end a function with a function call whose return value is propagated up as is. Technically this is true. Practically, people use examples like this:

```
int foo(int increment)
{
	if (increment)
		return bar() + 1; /* NOT a tail-call */

	return bar(); /* a tail-call */
}
```

It sounds like a very solid example of a tail-call vs. not a tail-call. I.e., if you “post process” the value before returning, it is not a tail-call.

Going back to my realization, I think people often forget about one type of “post processing” — casts. Consider the following code:

```
extern short bar();

int foo()
{
	return bar();
}
```

Did you spot it? This is not a tail-call.

The integer promotion from short to int is done after bar returns but before foo returns.

For fun, here’s the disassembly:

```
$ gcc -Wall -O2 -c test.c
$ dis test.o
...
foo()
foo:     55                 pushl  %ebp
foo+0x1: 89 e5              movl   %esp,%ebp
foo+0x3: 83 ec 08           subl   $0x8,%esp
foo+0x6: e8 fc ff ff ff     call   -0x4	<foo+0x7>
foo+0xb: c9                 leave
foo+0xc: 98                 cwtl
foo+0xd: c3                 ret
```

For completeness, if we change the return value of bar to int:

```
$ gcc -Wall -O2 -c test.c
$ dis test.o
...
foo()
foo:       e9 fc ff ff ff     jmp    -0x4	<foo+0x1>
```

I wonder how many people think they are tail-call optimizing, but in reality they are “wasting” a stack frame because of this silly reason.

## Inline Assembly & clang

Recently I talked about inline assembly with GCC and clang where I pointed out that LLVM seems to produce rather silly machine code. In a comment, a reader asked if this was LLVM’s IR doing this or if it was the machine code generator being silly. I was going to reply there, but the reply got long enough to deserve its own post.

I’ve dealt with LLVM’s IR for a couple of months during the fall of 2010. It was both interesting and quite painful.

The IR is in static single assignment (SSA) form. It assumes that stack space is cheap and infinite. Since it is an SSA form, it has no notion of registers. The optimization passes transform the IR quite a bit and at the end there is very little (if any!) useless code. In other words, I think it is the machine code generation that is responsible for the unnecessary stack frame push and pop. With that said, it is time to experiment.

Using the same test program as before, of course:

```
#define _KERNEL
#define _ASM_INLINES
#include <sys/atomic.h>

void test(uint32_t *x)
{
	atomic_inc_32(x);
}
```

### Emitting LLVM IR

Let’s compile it with clang passing in the -emit-llvm option to have it generate test.ll file with the LLVM IR:

```
$ clang -S -emit-llvm -Wall -O2 -m64 test.c
```

There is a fair amount of “stuff” in the file, but the relevant portions are (line-wrapped by me):

```
; Function Attrs: nounwind
define void @test(i32* %x) #0 {
entry:
  tail call void asm sideeffect "lock; incl $0",
      "=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1
  ret void
}

attributes #0 = { nounwind uwtable "less-precise-fpmad"="false"
  "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"
  "no-infs-fp-math"="false" "no-nans-fp-math"="false"
  "stack-protector-buffer-size"="8" "unsafe-fp-math"="false"
  "use-soft-float"="false" }
```

LLVM’s IR happens to be very short and to the point. The function prologue and epilogue are not expressed as part of IR blob that gets passed to the machine code generator. Note the function attribute no-frame-pointer-elim being true (meaning frame pointer elimination will not happen).

Now, let’s add in the -fomit-frame-pointer option.

```
$ clang -S -emit-llvm -Wall -O2 -m64 -fomit-frame-pointer test.c
```

Now, the relevant IR pieces are:

```
; Function Attrs: nounwind
define void @test(i32* %x) #0 {
entry:
  tail call void asm sideeffect "lock; incl $0",
      "=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1
  ret void
}

attributes #0 = { nounwind uwtable "less-precise-fpmad"="false"
  "no-frame-pointer-elim"="false" "no-infs-fp-math"="false"
  "no-nans-fp-math"="false" "stack-protector-buffer-size"="8"
  "unsafe-fp-math"="false" "use-soft-float"="false" }
```

The no-frame-pointer-elim attribute changed (from true to false), but the IR of the function itself did not change. (The no-frame-pointer-elim-non-leaf attribute disappeared as well, but it really makes sense since -fomit-frame-pointer is a rather large hammer that just forces frame pointer elimination everywhere and so it doesn’t make sense to differentiate between leaf and non-leaf functions.)

So, to answer Steve’s question, the LLVM IR does not include the function prologue and epilogue. This actually makes a lot of sense given that the IR is architecture independent and the exact details of what the prologue has to do are defined by the ABI.

### IR to Assembly

We can of course use llc to convert the IR into real 64-bit x86 assembly code.

```
$ llc --march=x86-64 test.ll
$ gas -o test.o --64 test.s
```

Here is the disassembly for clang invocation without -fomit-frame-pointer:

```
test()
test:     55                 pushq  %rbp
test+0x1: 48 89 e5           movq   %rsp,%rbp
test+0x4: f0 ff 07           lock incl (%rdi)
test+0x7: 5d                 popq   %rbp
test+0x8: c3                 ret
```

And here is the disassembly for clang invocation with -fomit-frame-pointer:

```
test()
test:     f0 ff 07           lock incl (%rdi)
test+0x3: c3                 ret
```

### Conclusion

So, it turns out that my previous post simply stumbled across the fact that GCC and clang enable different sets of optimizations at -O2. GCC includes -fomit-frame-pointer by default, while clang does not.

## Working with Wide Characters

Two weekends ago, I happened to stumble into a situation where I had a use for wide characters. Since I’ve never dealt with them before, it was an interesting experience. I’m hoping to document some of my thoughts and discoveries in this post.

As you may have guessed, I am using OpenIndiana for development so excuse me if I happen to stray from straight up POSIX in favor of Illumos-flavored POSIX.

The program I was working with happens to read a bunch of strings. It then does some mangling on these strings — specifically, it (1) converts these strings between Unicode and EBCDIC, and (2) at times it needs to uppercase a Unicode character. (Yes, technically the Unicode to EBCDIC conversion is lossy since EBCDIC doesn’t have all possible Unicode characters. Practically, the program only cares about a subset of Unicode characters and those all appear in EBCDIC.)

In the past, most of the code I wrote dealt with Unicode by just assuming the world was ASCII. This approach allows UTF-8 to just work in most cases. Assuming you don’t want to mangle the strings in any major way, you’ll be just fine. Concatenation (strcat), ASCII character search (strchr), and substring search (strstr) all work perfectly fine. Other functions, however, will do the wrong thing (e.g., strlen will return the number of bytes, not the number of characters).

Converting an ASCII string to EBCDIC is pretty easy. For each input character (aka. each input byte), do a lookup in a 256-element array. The output is just a concatenation of all the looked up values.

This simple approach falls apart if the input is UTF-8. There, some characters (e.g., ö) take up multiple bytes (e.g., c3 b6). Iterating over the input bytes won’t work. One way to deal with this is to process as many bytes as necessary to get a full character (1 for ASCII characters, 2–4 for non-ASCII Unicode characters), and then convert/uppercase/whatever that character instead of the raw bytes. This sort of hoop-jumping is necessary whenever one wants to process characters instead of bytes.

### wchar_t

Another way to deal with this is to store the string as something other than UTF-8. I took this approach. When the program reads in a (UTF-8) string, it promptly converts it into a wide character string. In other words, instead of my strings being char *, they are wchar_t *. On my system, wchar_t is a 32-bit unsigned integer. This trivially makes all Unicode characters the same length — 32 bits. I can go back to assuming that one element of my string corresponds to a single character. I just need to keep in mind that a single character is not one byte. In practice, this means remembering to malloc more memory than before. In other words:

```
wchar_t *a, *b;

a = malloc(MAX_LEN);                   /* WRONG */
b = malloc(sizeof(wchar_t) * MAX_LEN); /* CORRECT */
```

Uppercasing a character becomes just as easy as it was with plain ol’ ASCII. For example, to uppercase the i-th letter in a string (using towupper from wctype.h, the wide-character counterpart of toupper):

```
#include <wctype.h>

void uppercase_nth(wchar_t *str, int i)
{
	str[i] = towupper(str[i]);
}
```

There are, however, some downsides. First and foremost, if you are dealing mostly with ASCII, then your memory footprint may have just quadrupled. (In my case, the program is so small that I don’t care about the memory footprint increase.) Second, you have to use a bit of “silly” syntax to make the (C99) compiler realize what it is you are attempting to do.

```
const wchar_t *msg = L"Lorem ipsum";
const wchar_t letter = L'x';
```

### “str” functions

Arguably, the most visible change involves the “str” functions. With plain old ASCII strings, you use functions like strlen, strcpy, and strcat to, respectively, get the length, copy a string, and concatenate two strings. These functions assume that each byte is a character and that the string is terminated by a null (8-bit 0), so they do not work in the world of wide characters. (Keep in mind that ASCII characters have values less than 128, so for ASCII text each 32-bit element of a wide string contains three null bytes. If you call a “str” function on such a string anyway, then on big endian systems you’ll end up with an empty string, while on little endian systems you’ll end up with a string consisting of just the first character.) Thankfully, there are alternatives to the “str” functions that know how to deal with wide character strings — the “ws” functions. Instead of using strlen, strcpy, and strcat, you want to call wslen, wscpy, and wscat. There are of course more. On Illumos, you can look at the wcstring(3c) manpage for many (but not all!) of them.

### printf & friends

Manipulating strings solely with the “str” functions is tedious. Often enough, it is so much simpler to reach for the venerable printf. This is where things get really interesting. The printf family of functions knows how to convert between char * strings and wchar_t * strings. First of all, let’s take a look at snprintf (the same applies to printf and sprintf). Here’s a simple code snippet that dumps a string into a char array. The output is char *, the format string is char *, and the string input is also char *.

```
char output[1024];
char *s = "abc";

snprintf(output, sizeof(output), "foo %s bar\n", s);
```

One can use %ls to let snprintf know that the corresponding input string is a wide character string. snprintf will do everything the same, except it transparently converts the wide character string into a regular string before outputting it. For example:

```
char output[1024];
wchar_t *s = L"abc";

snprintf(output, sizeof(output), "foo %ls bar\n", s);
```

Will produce the same output as the previous code snippet.

Now, what if you want the output to be a wide character string? Simple, use the wprintf functions! There are fwprintf, wprintf, and swprintf which correspond to fprintf, printf, and snprintf. Do note that the wide-character versions want the format string to be a wide character string. As far as the format string is concerned, the same rules apply as before — %s for char * input and %ls for wchar_t * input:

```
wchar_t output[1024];
wchar_t *s1 = L"abc";
char *s2 = "abc";

/* note: swprintf takes the size in wide characters, not bytes */
swprintf(output, sizeof(output) / sizeof(wchar_t),
    L"foo %ls %s bar\n", s1, s2);
```

Caution! In addition to swprintf there is also wsprintf. This one takes the format string in char * but outputs into a wchar_t * buffer.

Here’s the same information, in a tabular form. The input string type is always determined by the format string contents — %s for char * input and %ls for wchar_t * input:

| Function | Output | Format string |
|---|---|---|
| printf, sprintf, snprintf, fprintf | char * | char * |
| wprintf, swprintf, fwprintf | wchar_t * | wchar_t * |
| wsprintf | wchar_t * | char * |

### setlocale and Summary

Oh, I almost forgot! You should call setlocale before you start using all these features.

So, to conclude, it is pretty easy to use wide character strings.

• #include <wchar.h>
• #include <widec.h>
• call setlocale in your main
• use wchar_t instead of char
• use %ls in format strings instead of %s
• use L string literal prefix
• beware of wsprintf and swprintf

I wouldn’t want to deal with this sort of code on daily basis, but for a random side project it isn’t so bad. I do like the ability to not worry about the encoding — the 1:1 mapping of characters to array elements is really convenient.

## Inline Assembly & GCC, clang

Recently, I got to write a bit of inline assembly. In the process, I got to test my changes by making a small C file which defined a test function that called the inline function from the header. Then, I could look at the disassembly to verify all was well.

```
#define _KERNEL
#define _ASM_INLINES
#include <sys/atomic.h>

void test(uint32_t *x)
{
	atomic_inc_32(x);
}
```

GCC has been my go-to compiler for a long time now. So, at first I was using it to debug my inline assembly. I compiled the test programs using:

```
$ gcc -Wall -O2 -m64 -c test.c
```

Disassembling the object file yields the rather obvious:

```
test()
test:     f0 ff 07           lock incl (%rdi)
test+0x3: c3                 ret
```

I can’t think of any way to make it better :)

Then, at some point I remembered that Clang/LLVM are pretty good as well. I compiled the same file with clang:

```
$ clang -Wall -O2 -m64 -c test.c
```

The result was rather disappointing:

```
test()
test:     55                 pushq  %rbp
test+0x1: 48 89 e5           movq   %rsp,%rbp
test+0x4: f0 ff 07           lock incl (%rdi)
test+0x7: 5d                 popq   %rbp
test+0x8: c3                 ret
```

For whatever reason, Clang feels the need to push/pop the frame pointer. I did a little bit of searching, and I couldn’t find a way to disable this behavior.

The story for 32-bit output is very similar (just drop the -m64 from the compiler invocation). GCC produced the superior output:

```
test()
test:     8b 44 24 04        movl   0x4(%esp),%eax
test+0x4: f0 ff 00           lock incl (%eax)
test+0x7: c3                 ret
```

While Clang still wanted to muck around with the frame pointer.

```
test()
test:     55                 pushl  %ebp
test+0x1: 89 e5              movl   %esp,%ebp
test+0x3: 8b 45 08           movl   0x8(%ebp),%eax
test+0x6: f0 ff 00           lock incl (%eax)
test+0x9: 5d                 popl   %ebp
test+0xa: c3                 ret
```

For the curious ones, I’m using GCC 4.8.3 and Clang 3.4.2.

I realize this is a bit of a special case (how often do you make a function that simply calls an inline function?), but it makes me worried about what sort of sub-optimal code Clang produces in other cases.

## Inlining Atomic Operations

One of the items on my ever growing TODO list (do these ever shrink?) was to see if inlining Illumos’s atomic_* functions would make any difference. (For the record, these functions atomically manipulate variables. You can read more about them in the various man pages — atomic_add, atomic_and, atomic_bits, atomic_cas, atomic_dec, atomic_inc, atomic_or, atomic_swap.) Of course once I looked at the issue deeply enough, I ended up with five cleanup patches. The gist of it is, inlining them caused not only about 1% kernel performance improvement on the benchmarks, but also reduced the kernel size by a couple of kilobytes. You can read all about it in the associated bugs (5042, 5043, 5044, 5045, 5046, 5047) and the patch 0/6 email I sent to the developer list. In this blahg post, I want to talk about how exactly Illumos presents these atomic functions in a stable ABI but at the same time allows for inlines.

### Genesis

It should come as no surprise that the “content” of these functions really needs to be written in assembly. The functions are 100% implemented in assembly in usr/src/common/atomic. There, you will find a directory per architecture. For example, in the amd64 directory, we’ll find the code for a 64-bit atomic increment:

```	ENTRY(atomic_inc_64)
	ALTENTRY(atomic_inc_ulong)
	lock
	incq	(%rdi)
	ret
	SET_SIZE(atomic_inc_ulong)
	SET_SIZE(atomic_inc_64)
```

The ENTRY, ALTENTRY, and SET_SIZE macros are C preprocessor macros that make writing assembly functions semi-sane. Anyway, this code is used by both the kernel and userspace. I am going to ignore the userspace side of the picture and talk about the kernel only.
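Roughly speaking (this is a paraphrased sketch, not a quote of the actual asm_linkage.h definitions), the macros expand along these lines:

```
/* paraphrased sketch, not the real asm_linkage.h */
#define	ENTRY(x) \
	.text; \
	.align	ASM_ENTRY_ALIGN; \
	.globl	x; \
	.type	x, @function; \
x:

#define	ALTENTRY(x) \
	.globl	x; \
	.type	x, @function; \
x:

#define	SET_SIZE(x) \
	.size	x, [.-x]
```

In other words, ENTRY and ALTENTRY emit global labels (so atomic_inc_64 and atomic_inc_ulong are two names for the same code), and SET_SIZE records the symbol’s size in the ELF symbol table.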

These assembly functions get mangled by the C preprocessor and are then fed into the assembler. The object file is then linked into the rest of the kernel. When a module binary references these functions, krtld (the linker-loader) wires up those references to this code.

### Inline

Replacing these functions with inline functions (using the GNU definition) would be fine as far as all the code in Illumos itself is concerned. However, doing so would remove the actual function bodies (as well as the symbol table entries), and so the linker would not be able to wire up any references from modules. Since Illumos cares about not breaking existing external modules (both open source and closed source), this simple approach is a no-go.

### Inline v2

Before I go into the next and final approach, I’m going to make a small detour through C land.

#### extern inline

First off, let’s say that we have a simple function, add, that returns the sum of the two integer arguments, and we keep it in a file called add.c:

```#include "add.h"

int add(int x, int y)
{
	return x + y;
}
```

In the associated header file, add.h, we may include a prototype like the following to let the compiler know that add exists elsewhere and what types to expect.

```extern int add(int, int);
```

Then, we attempt to call it from a function in, say, test.c:

```#include "add.h"

int test()
{
	return add(5, 7);
}
```

Now, let’s turn these two .c files into a .so. We get the obvious result — test calls add:

```test()
test:     be 07 00 00 00     movl   $0x7,%esi
test+0x5: bf 05 00 00 00     movl   $0x5,%edi
test+0xa: e9 b1 fe ff ff     jmp    -0x14f	<0xc90>
```

And the binary contains both functions:

```$ /usr/bin/nm test.so | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[74]	|                3520|                   4|FUNC |GLOB |0    |13   |add
[65]	|                3536|                  15|FUNC |GLOB |0    |13   |test
```

Now suppose that we modify the header file to contain the following (note that this relies on GCC’s traditional, pre-C99, definition of inline):

```extern int add(int, int);

extern inline int add(int a, int b)
{
return a + b;
}
```

If we compile and link the .so the same way, that is, still feeding in the object file with the out-of-line implementation of add, we get a slightly different binary. The call in test now uses the inlined version:

```test()
test:     b8 0c 00 00 00     movl   $0xc,%eax
test+0x5: c3                 ret
```

But the binary will still include the symbol:

```$ /usr/bin/nm test.so | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[72]	|                3408|                   4|FUNC |GLOB |0    |11   |add
[63]	|                3424|                   6|FUNC |GLOB |0    |11   |test
```

Neat, eh?

#### extern inline atomic what?

How does this apply to the atomic functions? Pretty simply. As I pointed out, usr/src/common/atomic contains the pure assembly implementations — these are the functions you’ll always find in the symbol table.

The common header file that defines extern prototypes is usr/src/uts/common/sys/atomic.h.

Now, the trick. If you look carefully at the header file, you’ll spot a check on line 39. If all the conditions are true (kernel code, GCC, inline assembly is allowed, and x86), we include asm/atomic.h — which lives at usr/src/uts/intel/asm/atomic.h. This is where the extern inline versions of the atomic functions get defined.
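The guard looks something like this (paraphrased; check the actual header for the precise conditions):

```
/* paraphrased from sys/atomic.h */
#if defined(_KERNEL) && defined(__GNUC__) && defined(_ASM_INLINES) && \
	(defined(__i386) || defined(__amd64))
#include <asm/atomic.h>
#endif
```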

So, kernel code simply includes <sys/atomic.h>, and if the stars align properly, any atomic function use will get inlined.

Phew! This ended up being longer than I expected. :)