Josef “Jeff” Sipek

Working with Wide Characters

Two weekends ago, I happened to stumble into a situation where I had a use for wide characters. Since I’ve never dealt with them before, it was an interesting experience. I’m hoping to document some of my thoughts and discoveries in this post.

As you may have guessed, I am using OpenIndiana for development so excuse me if I happen to stray from straight up POSIX in favor of Illumos-flavored POSIX.

The program I was working with happens to read a bunch of strings. It then does some mangling on these strings — specifically, it (1) converts these strings between Unicode and EBCDIC, and (2) at times it needs to uppercase a Unicode character. (Yes, technically the Unicode to EBCDIC conversion is lossy since EBCDIC doesn’t have all possible Unicode characters. Practically, the program only cares about a subset of Unicode characters and those all appear in EBCDIC.)

In the past, most of the code I wrote dealt with Unicode by just assuming the world was ASCII. This approach allows UTF-8 to just work in most cases. Assuming you don’t want to mangle the strings in any major way, you’ll be just fine. Concatenation (strcat), ASCII character search (strchr), and substring search (strstr) all work perfectly fine. While other functions will do the wrong thing (e.g., strlen will return number of bytes, not number of characters).

Converting an ASCII string to EBCDIC is pretty easy. For each input character (aka. each input byte), do a lookup in a 256-element array. The output is just a concatenation of all the looked up values.

This simple approach falls apart if the input is UTF-8. There, some characters (e.g., ö) take up multiple bytes (e.g., c3 b6). Iterating over the input bytes won’t work. One way to deal with this is to process as many bytes as necessary to get a full character (1 for ASCII characters, 2–6 for “non-ASCII” Unicode characters), and then covert/uppercase/whatever it instead of the raw bytes. This sort of hoop-jumping is necessary whenever one wants to process characters instead of bytes.

wchar_t

Another way to deal with this is to store the string as something other than UTF-8. I took this approach. When the program reads in a (UTF-8) string, it promptly converts it into a wide character string. In other words, instead of my strings being char *, they are wchar_t *. On my system, wchar_t is a 32-bit unsigned integer. This trivially makes all Unicode characters the same length — 32 bits. I can go back to assuming that one element of my string corresponds to a single character. I just need to keep in mind that a single character is not one byte. In practice, this means remembering to malloc more memory than before. In other words:

wchar_t *a, *b;

a = malloc(MAX_LEN);                   /* WRONG */
b = malloc(sizeof(wchar_t) * MAX_LEN); /* CORRECT */

Uppercasing a character becomes just as easy as it was with plain ol’ ASCII. For example, to uppercase the $i^{th}$ letter in a string:

void uppercase_nth(wchar_t *str, int i)
{
	str[i] = toupper(str[i]);
}

There are however some downsides. First and foremost, if you are dealing mostly with ASCII, then your memory footprint may have just quadrupled. (In my case, the program is so small that I don’t care about the memory footprint increase.) Second, you have to deal with a couple of “silly” syntax to make the (C99) compiler realize what it is you are attempting to do.

const wchar_t *msg = L"Lorem ipsum";
const wchar_t *letter = L'x';

“str” functions

Arguably, the most visible change involves the “str” functions. With plain old ASCII strings, you use functions like strlen, strcpy, and strcat to, respectively, get the length, copy a string, and concatenate two strings. These functions assume that each byte is a character and that the string is terminated by a null (8-bit 0) so they do not work in the world of wide characters. (Keep in mind that since ASCII consists of characters with values less than 128, a 32-bit integer with that value will have three null bytes in most characters (assuming ASCII text). On big endian systems, you’ll end up with the empty string, while on little endian systems you’ll end up with a string consisting of just the first character.) Thankfully, there are alternatives to the “str” functions that know how to deal with wide character strings — the “ws” functions. Instead of using strlen, strcpy, and strcat, you want to call wslen, wscpy, and wscat. There are of course more. On Illumos, you can look at the wcstring(3c) manpage for many (but not all!) of them.

printf & friends

Manipulating strings solely with the “str” functions is tedious. Often enough, it is so much simpler to reach for the venerable printf. This is where things get really interesting. The printf family of functions knows how to convert between char * strings and wchar_t * strings. First of all, let’s take a look at snprintf (the same applies to printf and sprintf). Here’s a simple code snippet that dumps a string into a char array. The output is char *, the format string is char *, and the string input is also char *.

char output[1024];
char *s = "abc";

snprintf(output, sizeof(output), "foo %s bar\n", s);

One can use %ls to let snprintf know that the corresponding input string is a wide character string. snprintf will do everything the same, except it transparently converts the wide character string into a regular string before outputting it. For example:

char output[1024];
wchar_t *s = L"abc";

snprintf(output, sizeof(output), "foo %ls bar\n", s);

Will produce the same output as the previous code snippet.

Now, what if you want the output to be a wide character string? Simple, use the wprintf functions! There are fwprintf, wprintf, and swprintf which correspond to fprintf, printf, and snprintf. Do note that the wide-character versions want the format string to be a wide character string. As far as the format string is concerned, the same rules apply as before — %s for char * input and %ls for wchar_t * input:

wchar_t output[1024];
wchar_t *s1 = L"abc";
char *s2 = "abc";

swprintf(output, sizeof(output), L"foo %ls %s bar\n", s1, s2);

Caution! In addition to swprintf there is also wsprintf. This one takes the format string in char * but outputs into a wchar_t * buffer.

Here’s the same information, in a tabular form. The input string type is always determined by the format string contents — %s for char * input and %ls for wchar_t * input:

Function Output Format string
printf, sprintf, snprintf, fprintf char * char *
wprintf, swprintf, fwprintf wchar_t * wchar_t *
wsprintf wchar_t * char *

setlocale and Summary

Oh, I almost forgot! You should call setlocale before you start using all these features.

So, to conclude, it is pretty easy to use wide character strings.

  • #include <wchar.h>
  • #include <widec.h>
  • call setlocale in your main
  • use wchar_t instead of char
  • use %ls in format strings instead of %s
  • use L string literal prefix
  • beware of wsprintf and swprintf

I wouldn’t want to deal with this sort of code on daily basis, but for a random side project it isn’t so bad. I do like the ability to not worry about the encoding — the 1:1 mapping of characters to array elements is really convenient.

Generating Random Data

Over the years, there have been occasions when I needed to generate random data to feed into whatever system. At times, simply using /dev/random or /dev/urandom was sufficient. At other times, I needed to generate random data at a rate that exceeded what I could get out of /dev/random. This morning, I read Chris’s blog entry about his need for generating lots of random data. I decided that I should write my favorite approach so that others can benefit.

The approach is very simple. There are two phases. First, we set up our own random pool. Second we use the random pool. I am going to use an example throughout the rest of this post. Suppose that we want to make repeated 128 kB writes to a block device and we want the data to be random so that the device can’t do anything clever (e.g., compress or dedup). Say that during this benchmark we want to write out 64 GB total. (In other words, we will issue 524288 writes.)

Setup Phase

During the setup phase, we create a pool of random data. The easiest way is to just read /dev/urandom. Here, we want to read enough data so that the pool is large enough. For our 128kB write example, we’d want at least 1 MB. (I’ll explain the sizing later. I would probably go with something like 8 MB because unless I’m on some sort of limited system, the extra 7 MB of RAM won’t be missed.)

“Using the Pool” Phase

Now that we have the pool, we can use it to generate random buffers. The basic idea is to pick a random offset into the pool and just grab the bytes starting at that location. In our example, we’d pick a random offset between zero and pool size minus 128 kB, and use the 128 kB at that offset.

In pseudo code:

#define BUF_SIZE	(128 * 1024)
#define POOL_SIZE	(1024 * 1024)

static char pool[POOL_SIZE];

char *generate()
{
	return &pool[rand() % (POOL_SIZE - BUF_SIZE)];
}

That’s it! You can of course make it more general and let the caller tell you how many bytes they want:

#define POOL_SIZE	(1024 * 1024)

static char pool[POOL_SIZE];

char *generate(size_t len)
{
	return &pool[rand() % (POOL_SIZE - len)];
}

It takes a pretty simple argument to show that even a modest sized pool will be able to generate lots of different random buffers. Let’s say we’re dealing with the 128 kB buffer and 1 MB pool case. The pool can return 128 kB starting at offset 0, or offset 1, or offset 2, … or offset 9175043 ($1~MB - 128~kB - 1B$). This means that there are 917504 possible outputs. Recall, that in our example we were planning on writing out 64 GB in total which was 524288 writes.

$\frac{524288}{917504} = 0.571$

In other words, we are planning on using less than 58% of the possible outputs from our 1 MB pool to write out 64 GB of random data! (An 8 MB pool would yield 6.3% usage.)

If the length is variable, the math gets more complicated, but in a way we get even better results (i.e., lower usage) because to generate the same buffer we would need have the same offset and length. If the caller supplies random (pseudo-random or based on some distribution) lengths, we’re very unlikely to get the same buffer out of the pool.

Observations

Some of you may have noticed that we traded generating 128 kB (or a user supplied length) of random data for generating a random integer. There are two options there, either you can use a fast pseudo-random number generator (like the Wikipedia article: Mersenne twister), or you can reuse same pool! In other words:

#define POOL_SIZE	(1024 * 1024)

static char pool[POOL_SIZE];
static size_t *ridx = (size_t *) pool;

char *generate(size_t len)
{
	if ((uintptr_t) ridx == (uintptr_t) &pool[POOL_SIZE])
		ridx = (size_t *) pool;

	ridx++;

	return &pool[*ridx % (POOL_SIZE - len)];
}

I leave it as an exercise for the reader to either make it multi-thread safe, or to make the index passed in similarly to how rand_r takes an argument.

rand_r Considered Harmful

Since we’re on the topic of random number generation, I thought I’d mention what is already rather widely known fact — libc’s rand and rand_r implementations are terrible. At one point, I tried using them with this approach to generate random buffers, but it didn’t take very long before I got repeats! Caveat emptor.

Unix Shared Memory

While investigating whether some memory management code was still in use (I’ll blahg about this in the future), I ended up learning quite a bit about shared memory on Unix systems. Since I managed to run into a couple of non-obvious snags while trying to get a simple test program running, I thought I’d share my findings here for my future self.

All in all, there are three ways to share memory between processes on a modern Unix system.

System V shm

This is the oldest of the three. First you call shmget to set up a shared memory segment and then you call shmat to map it into your address space. Here’s a quick example that does not do any error checking or cleanup:

void sysv_shm()
{
        int ret;
        void *ptr;

        ret = shmget(0x1234, 4096, IPC_CREAT);
        printf("shmget returned %d (%d: %s)\n", ret, errno,
               strerror(errno));

        ptr = shmat(ret, NULL, SHM_PAGEABLE | SHM_RND);
        printf("shmat returned %p (%d: %s)\n", ptr, errno, strerror(errno));
}

What’s so tricky about this? Well, by default Illumos’s shmat will return EPERM unless you are root. This sort of makes sense given how this flavor of shared memory is implemented. (Hint: it’s all in the kernel)

POSIX shm

As is frequently the case, POSIX came up with a different interface and different semantics for shared memory. Here’s the POSIX shm version of the above function:

void posix_shm()
{
	int fd;
	void *ptr;

	fd = shm_open("/blah", O_RDWR | O_CREAT, 0666);
	printf("shm_open returned %d (%d: %s)\n", fd, errno,
	       strerror(errno));

	ftruncate(fd, 4096); /* IMPORTANT! */

	ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	printf("mmap returned %p (%d: %s)\n", ptr, errno, strerror(errno));
}

The very important part here is the ftruncate call. Without it, shm_open may create an empty file and mmaping an empty file won’t work very well. (Well, on Illumos mmap succeeds, but you effectively have a 0-length mapping so any loads or stores will result in a SIGBUS. I haven’t tried other OSes.)

Aside from the funny looking path (it must start with a slash, but cannot contain any other slashes), shm_open looks remarkably like the open system call. It turns out that at least on Illumos, shm_open is implemented entirely in libc. The implementation creates a file in /tmp based on the path provided and the file descriptor that it returns is actually a file descriptor for this file in /tmp. For example, “/blah” input translates into “/tmp/.SHMDblah”. (There is a second file “/tmp/.SHMLblah” that doesn’t live very long. I think it is a lock file.) The subsequent mmap call doesn’t have any idea that this file is special in any way.

Does this mean that you can reach around shm_open and manipulate the object directly? Not exactly. POSIX states: “It is unspecified whether the name appears in the file system and is visible to other functions that take pathnames as arguments.”

The big difference between POSIX and SysV shared memory is how you refer to the segment — SysV uses a numeric key, while POSIX uses a path.

mmap

The last way of sharing memory involves no specialized APIs. It’s just plain ol’ mmap on an open file. For completeness, here’s the function:

void mmap_shm()
{
	int fd;
	void *ptr;

	fd = open("/tmp/blah", O_RDWR | O_CREAT, 0666);
	printf("open returned %d (%d: %s)\n", fd, errno, strerror(errno));

	ftruncate(fd, 4096); /* IMPORTANT! */

	ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	printf("mmap returned %p (%d: %s)\n", ptr, errno, strerror(errno));
}

It is very similar to the POSIX shm code example. As before, we need the ftruncate to make the shared file non-empty.

pmap

In case you’ve wondered what SysV or POSIX shm segments look like on Illumos, here’s the pmap output for a process that basically runs the first two examples above.

6343:	./a.out
0000000000400000          8K r-x--  /storage/home/jeffpc/src/shm/a.out
0000000000411000          4K rw---  /storage/home/jeffpc/src/shm/a.out
0000000000412000         16K rw---    [ heap ]
FFFFFD7FFF160000          4K rwxs-    [ dism shmid=0x13 ]
FFFFFD7FFF170000          4K rw-s-  /tmp/.SHMDblah
FFFFFD7FFF180000         24K rwx--    [ anon ]
FFFFFD7FFF190000          4K rwx--    [ anon ]
FFFFFD7FFF1A0000       1596K r-x--  /lib/amd64/libc.so.1
FFFFFD7FFF33F000         52K rw---  /lib/amd64/libc.so.1
FFFFFD7FFF34C000          8K rw---  /lib/amd64/libc.so.1
FFFFFD7FFF350000          4K rwx--    [ anon ]
FFFFFD7FFF360000          4K rwx--    [ anon ]
FFFFFD7FFF370000          4K rw---    [ anon ]
FFFFFD7FFF380000          4K rw---    [ anon ]
FFFFFD7FFF390000          4K rwx--    [ anon ]
FFFFFD7FFF393000        348K r-x--  /lib/amd64/ld.so.1
FFFFFD7FFF3FA000         12K rwx--  /lib/amd64/ld.so.1
FFFFFD7FFF3FD000          8K rwx--  /lib/amd64/ld.so.1
FFFFFD7FFFDFD000         12K rw---    [ stack ]
         total         2120K

You can see that the POSIX shm file got mapped in the standard way (address FFFFFD7FFF170000). The SysV shm segment is special — it is not a plain old memory map (address FFFFFD7FFF160000).

That’s it for today. I’m going to talk about segment types in the different post in the near future.

nftw(3)

I just found out about nftw — a libc function to walk a file tree. I did not realize that libc had such a high-level function. I bet it’ll end up saving me time (and code) at some point in the future.

int nftw(const char *path, int (*fn) (const char *, const struct stat *,
				      int, struct FTW *), int depth,
	 int flags);

Given a path, it executes the callback for each file and directory in that tree. Very cool.

Ok, as I write this post, I am told that nftw is a great way to write dangerous code. Aside from easily writing dangerous things equivalent to rm -rf, I could see not specifying the FTW_PHYS to be dangerous as symlinks will get followed without any notification.

I guess I’ll play around with it a little to see what the right way (if any?) to use it is.

Powered by blahgd