Josef “Jeff” Sipek

OpenIndiana: UFS Root and FreeBSD Loader

Recently, I had a couple of uses for an illumos-based system that does not use ZFS for the root file system. The first time around, I was trying to test a ZFS code change and it is significantly easier to do when the root file system is not the same as the file system likely broken by my changes. The second time around, I did not want to deal with the ARC eating up all the RAM on the test system.

It is still possible to use UFS for the root file system, it just requires a bit of manual labor as illumos distros do not let you install directly to a UFS volume these days. I am using this post to write down the steps necessary to get a VM with UFS root file system up and running.

Virtual Hardware

First of all, let’s talk about the virtual hardware necessary. We will need a NIC that we can dedicate to the VM so it can reach the internet. Second, we will need two disks. Here, I am using a VNIC and two zvols to accomplish this. Finally, we will need an ISO with OpenIndiana Hipster. I am using the ISO from October 2015.

Let’s create all the virtual hardware:

host# dladm create-vnic -l e1000g0 ufsroot0
host# zfs create storage/kvm/ufsroot
host# zfs create -V 10g -s storage/kvm/ufsroot/root 
host# zfs create -V 10g -s storage/kvm/ufsroot/swap

The overall process will involve borrowing the swap disk to install Hipster on it normally (with ZFS as the root file system), and then we will copy everything over to the other disk where we will use UFS. Once we are up and running from the UFS disk, we will nuke the borrowed disk contents and set it up as a swap device. So, let’s fire up the VM. To make it easier to distinguish between the two disks, I am setting up the root disk as a virtio device while the other as an IDE device. (We will change the swap device to be a virtio device once we are ready to reclaim it for swap.)

host# /usr/bin/qemu-kvm \
	-enable-kvm \
	-vnc 0.0.0.0:42 \
	-m 2048 \
	-no-hpet \
	-drive file=/dev/zvol/rdsk/storage/kvm/ufsroot/root,if=virtio,index=0 \
	-drive file=/dev/zvol/rdsk/storage/kvm/ufsroot/swap,if=ide,index=1 \
	-net nic,vlan=0,name=net0,model=virtio,macaddr=2:8:20:a:46:54 \
	-net vnic,vlan=0,name=net0,ifname=ufsroot0,macaddr=2:8:20:a:46:54 \
	-cdrom OI-hipster-text-20151003.iso -boot once=d \
	-smp 2 \
	-vga std \
	-serial stdio

Installing

I am not going to go step by step through the installation. All I am going to say is that you should install it on the IDE disk. For me it shows up as c0d1. (c2t0d0 is the virtio disk.)

Once the system is installed, boot it. (From this point on, we do not need the ISO, so you can remove the -cdrom from the command line.) After Hipster boots, configure networking and ssh.

Updating

Now that we have a very boring stock Hipster install, we should at the very least update it to the latest packages (via pkg update). I am updating to “jeffix” which includes a number of goodies like Toomas Soome’s port of the FreeBSD loader to illumos. If you are using stock Hipster, you will have to figure out how to convince GRUB to do the right thing on your own.

ufsroot# pkg set-publisher --non-sticky openindiana.org
ufsroot# pkg set-publisher -P -g http://pkg.31bits.net/jeffix/2015/ \
	jeffix.31bits.net
ufsroot# pkg update
            Packages to remove:  31
           Packages to install:  14
            Packages to update: 518
           Mediators to change:   1
       Create boot environment: Yes
Create backup boot environment:  No

DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                            563/563     8785/8785  239.7/239.7  1.3M/s

PHASE                                          ITEMS
Removing old actions                       7292/7292
Installing new actions                     5384/5384
Updating modified actions                10976/10976
Updating package state database                 Done 
Updating package cache                       549/549 
Updating image state                            Done 
Creating fast lookup database                   Done 

A clone of openindiana exists and has been updated and activated.
On the next boot the Boot Environment openindiana-1 will be
mounted on '/'.  Reboot when ready to switch to this updated BE.


---------------------------------------------------------------------------
NOTE: Please review release notes posted at:

http://wiki.openindiana.org/display/oi/oi_hipster
---------------------------------------------------------------------------

Reboot into the new boot environment and double check that the update really updated everything it was supposed to.

ufsroot# uname -a
SunOS ufsroot 5.11 jeffix-20160219T162922Z i86pc i386 i86pc Solaris

Great!

Partitioning

First, we need to partition the virtio disk. Let’s be fancy and use a GPT partition table. The easiest way to create one is to create a whole-disk zfs pool on the virtio disk and immediately destroy it.

ufsroot# zpool create temp-pool c2t0d0
ufsroot# zpool destroy temp-pool

This creates an (almost) empty GPT partition table. We need to add two partitions—one tiny partition for the boot block and one for the UFS file system.

ufsroot# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0d1 <QEMU HARDDISK=QM00002-QM00002-0001-10.00GB>
          /pci@0,0/pci-ide@1,1/ide@0/cmdk@1,0
       1. c2t0d0 <Virtio-Block Device-0000-10.00GB>
          /pci@0,0/pci1af4,2@4/blkdev@0,0
Specify disk (enter its number): 1
selecting c2t0d0
No defect list found
[disk formatted, no defect list found]
...
format> partition

I like to align partitions to 1MB boundaries, that is why I specified 2048 as the starting sector. 1MB is plenty of space for the boot block, and it automatically makes the next partition 1MB aligned. It is important to specify “boot” for the partition id tag. Without it, we will end up getting an error when we try to install the loader’s boot block.

partition> 0
Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm               256       9.99GB         20955102    

Enter partition id tag[usr]: boot
Enter partition permission flags[wm]: 
Enter new starting Sector[256]: 2048
Enter partition size[20954847b, 20956894e, 10231mb, 9gb, 0tb]: 1m

Since I am planning on using a separate disk for swap, I am using the rest of this disk for the root partition.

partition> 1
Part      Tag    Flag     First Sector        Size        Last Sector
  1 unassigned    wm                 0          0              0    

Enter partition id tag[usr]: root
Enter partition permission flags[wm]: 
Enter new starting Sector[4096]: 
Enter partition size[0b, 4095e, 0mb, 0gb, 0tb]: 20955102e
partition> print
Current partition table (unnamed):
Total disk sectors available: 20955069 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size        Last Sector
  0       boot    wm              2048       1.00MB         4095    
  1       root    wm              4096       9.99GB         20955102    
  2 unassigned    wm                 0          0              0    
  3 unassigned    wm                 0          0              0    
  4 unassigned    wm                 0          0              0    
  5 unassigned    wm                 0          0              0    
  6 unassigned    wm                 0          0              0    
  8   reserved    wm          20955103       8.00MB         20971486    

When done, do not forget to run the label command:

partition> label
Ready to label disk, continue? yes

Format and Copy

Now that we have the partitions all set up, we can start using them.

ufsroot# newfs /dev/rdsk/c2t0d0s1
newfs: construct a new file system /dev/rdsk/c2t0d0s1: (y/n)? y
Warning: 34 sector(s) in last cylinder unallocated
/dev/rdsk/c2t0d0s1:     20951006 sectors in 3410 cylinders of 48 tracks, 128 sectors
        10230.0MB in 214 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
 20055584, 20154016, 20252448, 20350880, 20449312, 20547744, 20646176,
 20744608, 20843040, 20941472
ufsroot# mkdir /a
ufsroot# mount /dev/dsk/c2t0d0s1 /a

To get a consistent snapshot of the current (ZFS) root, I just make a new boot environment. I could take a recursive ZFS snapshot manually, but why do that if beadm can do it for me automatically? (The warning about missing menu.lst is a consequence of me using jeffix which includes the FreeBSD loader. It can be safely ignored.)

ufsroot# beadm create tmp
WARNING: menu.lst file /rpool/boot/menu.lst does not exist,
         generating a new menu.lst file
Created successfully
ufsroot# beadm mount tmp
Mounted successfully on: '/tmp/tmp._haOBb'

Now, we need to copy everything to the UFS file system. I use rsync but all that matters is that the program you use will preserve permissions and can cope with all the file types found on the root file system.

In addition to copying the mounted boot environment, we want to copy /export. (Recall that /export is outside of the boot environment, and therefore it will not appear under the temporary mount.)

ufsroot# rsync -a /tmp/tmp._haOBb/ /a/
ufsroot# rsync -a /export/ /a/export/

At this point, we are done with the temporary boot environment. Let’s at least unmount it. We could destroy it too, but it does not matter since we will eventually throw away the entire ZFS root disk anyway.

ufsroot# beadm umount tmp

Configuration Tweaks

Before we can use our shiny new UFS root file system to boot the system, we need to do a couple of cleanups.

First, we need to nuke the ZFS zpool.cache:

ufsroot# rm /a/etc/zfs/zpool.cache 

Second, we need to modify vfstab to indicate the root file system and comment out the zvol based swap device.

ufsroot# vim /a/etc/vfstab

So, we add this line:

/dev/dsk/c2t0d0s1       -       /               ufs     -       yes     -

and either remove or comment out this line:

#/dev/zvol/dsk/rpool/swap       -               -               swap    -       no      -

Third, we need to update boot properties file (bootenv.rc) to tell the kernel where to find the root file system (i.e., the boot path) and the type of the root file system. To find the boot path, I like to use good ol’ ls:

ufsroot# ls -lh /dev/dsk/c2t0d0s1
lrwxrwxrwx 1 root root 46 Feb 21 17:43 /dev/dsk/c2t0d0s1 -> ../../devices/pci@0,0/pci1af4,2@4/blkdev@0,0:b

The symlink target is the boot path—well, you need to strip the fluff at the beginning.

So, we need to add these two lines to /a/boot/solaris/bootenv.rc:

setprop fstype 'ufs'
setprop bootpath '/pci@0,0/pci1af4,2@4/blkdev@0,0:b'

Ok! Everything is configured properly; we have to rebuild the boot archive. This should result in /a/platform/i86pc/boot_archive getting updated and a /a/platform/i86pc/boot_archive.hash getting created.

ufsroot# bootadm update-archive -R /a
ufsroot# ls -lh /a/platform/i86pc/
total 33M
drwxr-xr-x  4 root sys  512 Feb 21 18:38 amd64
drwxr-xr-x  6 root root 512 Feb 21 17:44 archive_cache
-rw-r--r--  1 root root 33M Feb 21 18:37 boot_archive
-rw-r--r--  1 root root  41 Feb 21 18:37 boot_archive.hash
drwxr-xr-x 10 root sys  512 Feb 21 18:00 kernel
-rwxr-xr-x  1 root sys  44K Feb 21 18:00 multiboot
drwxr-xr-x  4 root sys  512 Feb 21 17:44 ucode
drwxr-xr-x  2 root root 512 Feb 21 18:04 updates

Installing Boot Blocks

We have one step left! We need to install the boot block to the boot partition on the new disk. That one simple-ish command. Note that the device we are giving is the UFS partition device. Further note that the installboot finds the boot partition automatically based on the “boot” partition id tag.

ufsroot# installboot -m /a/boot/pmbr /a/boot/gptzfsboot /dev/rdsk/c2t0d0s1
Updating master boot sector destroys existing boot managers (if any).
continue (y/n)? y
bootblock written for /dev/rdsk/c2t0d0s0, 214 sectors starting at 1 (abs 2049)
stage1 written to slice 0 sector 0 (abs 2048)
stage1 written to slice 1 sector 0 (abs 4096)
stage1 written to master boot sector

If you are using GRUB instead, you will want to install GRUB on the disk…somehow.

Booting from UFS

Now we can shut down, change the swap disk’s type to virtio and boot back up.

host# /usr/bin/qemu-kvm \
	-enable-kvm \
	-vnc 0.0.0.0:42 \
	-m 2048 \
	-no-hpet \
	-drive file=/dev/zvol/rdsk/storage/kvm/ufsroot/root,if=virtio,index=0 \
	-drive file=/dev/zvol/rdsk/storage/kvm/ufsroot/swap,if=virtio,index=1 \
	-net nic,vlan=0,name=net0,model=virtio,macaddr=2:8:20:a:46:54 \
	-net vnic,vlan=0,name=net0,ifname=ufsroot0,macaddr=2:8:20:a:46:54 \
	-smp 2 \
	-vga std \
	-serial stdio

Once the VM comes back up, we can add a swap device. The swap disk shows up for me as c3t0d0.

ufsroot# swap -a /dev/dsk/c3t0d0p0
operating system crash dump was previously disabled --
invoking dumpadm(1M) -d swap to select new dump device

We also need to add the description of the swap device to /etc/vfstab. So, fire up vim and add the following line:

/dev/dsk/c3t0d0p0       -       -               swap    -       no      -

That’s it! Now you can bask in the glory that is UFS root!

ufsroot$ df -h
Filesystem         Size  Used Avail Use% Mounted on
/dev/dsk/c2t0d0s1  9.9G  2.9G  6.9G  30% /
swap               1.5G 1016K  1.5G   1% /etc/svc/volatile
swap               1.5G  4.0K  1.5G   1% /tmp
swap               1.5G   52K  1.5G   1% /var/run
ufsroot$ zfs list
no datasets available

Caveat

Unfortunately, pkg assumes that the root file system is ZFS. So, updating certain packages (anything that would normally create a new boot environment) will likely fail.

Interactivity During nightly(1), part 2

Back in May, I talked about how I increase the priority of Firefox in order to get decent response times while killing my laptop with a nightly build of Illumos. Specifically, I have been increasing the priority of Firefox so that it would get to run in a timely manner. I have been doing this by setting it to the real-time (RT) scheduling class which has higher priority than most things on the system. This, of course, requires extra privileges.

Today, I realized that I was thinking about the problem the wrong way. What I really should be doing is lowering the priority of the build. This requires no special privileges. How do I do this? In my environment file, I include the following line:

priocntl -s -c FX -p 0 $$

This sets the nightly build script’s scheduling class to fixed (FX) and manually sets the priority to 0. From that point on, the nightly script and any processes it spawns run with a lower priority (zero) than everything else (which tends to be in the 40-59 range).

Interactivity During nightly(1)

Every so often, I do a nightly build of illumos on my laptop. This is a long and very CPU intensive process. During the build (which takes about 2.75 hours), the load average is rarely below 8 and most of the time it hovers in the low twenties. (This is a full debug and non-debug build with lint and all the other checking. If I need a build quickly, I can limit it to just what I need and then we’re talking minutes or seconds.)

Anyway, as you might imagine this sort of load puts some pressure on the CPUs. As a result, some interactive processes suffer a bit. Specifically, Firefox doesn’t get enough cycles to render the modern web (read: JavaScript cesspool). Instead of suffering for almost three hours, I just change Firefox’s scheduling class from IA (interactive) to RT (real time):

# priocntl -s -c RT `pgrep firefox`

This allows Firefox to preempt just about everything on my laptop. This works because Firefox actually yields the CPU properly. This will probably bite me sometime in the future when I end up on a page with such a busted JavaScript turd that it takes over a CPU and won’t let go. Till then, I have a pretty good workaround.

Working with Wide Characters

Two weekends ago, I happened to stumble into a situation where I had a use for wide characters. Since I’ve never dealt with them before, it was an interesting experience. I’m hoping to document some of my thoughts and discoveries in this post.

As you may have guessed, I am using OpenIndiana for development so excuse me if I happen to stray from straight up POSIX in favor of Illumos-flavored POSIX.

The program I was working with happens to read a bunch of strings. It then does some mangling on these strings — specifically, it (1) converts these strings between Unicode and EBCDIC, and (2) at times it needs to uppercase a Unicode character. (Yes, technically the Unicode to EBCDIC conversion is lossy since EBCDIC doesn’t have all possible Unicode characters. Practically, the program only cares about a subset of Unicode characters and those all appear in EBCDIC.)

In the past, most of the code I wrote dealt with Unicode by just assuming the world was ASCII. This approach allows UTF-8 to just work in most cases. Assuming you don’t want to mangle the strings in any major way, you’ll be just fine. Concatenation (strcat), ASCII character search (strchr), and substring search (strstr) all work perfectly fine. While other functions will do the wrong thing (e.g., strlen will return number of bytes, not number of characters).

Converting an ASCII string to EBCDIC is pretty easy. For each input character (aka. each input byte), do a lookup in a 256-element array. The output is just a concatenation of all the looked up values.

This simple approach falls apart if the input is UTF-8. There, some characters (e.g., ö) take up multiple bytes (e.g., c3 b6). Iterating over the input bytes won’t work. One way to deal with this is to process as many bytes as necessary to get a full character (1 for ASCII characters, 2–6 for “non-ASCII” Unicode characters), and then covert/uppercase/whatever it instead of the raw bytes. This sort of hoop-jumping is necessary whenever one wants to process characters instead of bytes.

wchar_t

Another way to deal with this is to store the string as something other than UTF-8. I took this approach. When the program reads in a (UTF-8) string, it promptly converts it into a wide character string. In other words, instead of my strings being char *, they are wchar_t *. On my system, wchar_t is a 32-bit unsigned integer. This trivially makes all Unicode characters the same length — 32 bits. I can go back to assuming that one element of my string corresponds to a single character. I just need to keep in mind that a single character is not one byte. In practice, this means remembering to malloc more memory than before. In other words:

wchar_t *a, *b;

a = malloc(MAX_LEN);                   /* WRONG */
b = malloc(sizeof(wchar_t) * MAX_LEN); /* CORRECT */

Uppercasing a character becomes just as easy as it was with plain ol’ ASCII. For example, to uppercase the $i^{th}$ letter in a string:

void uppercase_nth(wchar_t *str, int i)
{
	str[i] = toupper(str[i]);
}

There are however some downsides. First and foremost, if you are dealing mostly with ASCII, then your memory footprint may have just quadrupled. (In my case, the program is so small that I don’t care about the memory footprint increase.) Second, you have to deal with a couple of “silly” syntax to make the (C99) compiler realize what it is you are attempting to do.

const wchar_t *msg = L"Lorem ipsum";
const wchar_t *letter = L'x';

“str” functions

Arguably, the most visible change involves the “str” functions. With plain old ASCII strings, you use functions like strlen, strcpy, and strcat to, respectively, get the length, copy a string, and concatenate two strings. These functions assume that each byte is a character and that the string is terminated by a null (8-bit 0) so they do not work in the world of wide characters. (Keep in mind that since ASCII consists of characters with values less than 128, a 32-bit integer with that value will have three null bytes in most characters (assuming ASCII text). On big endian systems, you’ll end up with the empty string, while on little endian systems you’ll end up with a string consisting of just the first character.) Thankfully, there are alternatives to the “str” functions that know how to deal with wide character strings — the “ws” functions. Instead of using strlen, strcpy, and strcat, you want to call wslen, wscpy, and wscat. There are of course more. On Illumos, you can look at the wcstring(3c) manpage for many (but not all!) of them.

printf & friends

Manipulating strings solely with the “str” functions is tedious. Often enough, it is so much simpler to reach for the venerable printf. This is where things get really interesting. The printf family of functions knows how to convert between char * strings and wchar_t * strings. First of all, let’s take a look at snprintf (the same applies to printf and sprintf). Here’s a simple code snippet that dumps a string into a char array. The output is char *, the format string is char *, and the string input is also char *.

char output[1024];
char *s = "abc";

snprintf(output, sizeof(output), "foo %s bar\n", s);

One can use %ls to let snprintf know that the corresponding input string is a wide character string. snprintf will do everything the same, except it transparently converts the wide character string into a regular string before outputting it. For example:

char output[1024];
wchar_t *s = L"abc";

snprintf(output, sizeof(output), "foo %ls bar\n", s);

Will produce the same output as the previous code snippet.

Now, what if you want the output to be a wide character string? Simple, use the wprintf functions! There are fwprintf, wprintf, and swprintf which correspond to fprintf, printf, and snprintf. Do note that the wide-character versions want the format string to be a wide character string. As far as the format string is concerned, the same rules apply as before — %s for char * input and %ls for wchar_t * input:

wchar_t output[1024];
wchar_t *s1 = L"abc";
char *s2 = "abc";

swprintf(output, sizeof(output), L"foo %ls %s bar\n", s1, s2);

Caution! In addition to swprintf there is also wsprintf. This one takes the format string in char * but outputs into a wchar_t * buffer.

Here’s the same information, in a tabular form. The input string type is always determined by the format string contents — %s for char * input and %ls for wchar_t * input:

Function Output Format string
printf, sprintf, snprintf, fprintf char * char *
wprintf, swprintf, fwprintf wchar_t * wchar_t *
wsprintf wchar_t * char *

setlocale and Summary

Oh, I almost forgot! You should call setlocale before you start using all these features.

So, to conclude, it is pretty easy to use wide character strings.

  • #include <wchar.h>
  • #include <widec.h>
  • call setlocale in your main
  • use wchar_t instead of char
  • use %ls in format strings instead of %s
  • use L string literal prefix
  • beware of wsprintf and swprintf

I wouldn’t want to deal with this sort of code on daily basis, but for a random side project it isn’t so bad. I do like the ability to not worry about the encoding — the 1:1 mapping of characters to array elements is really convenient.

Delegating mount/umount Privileges

Recently, I was doing some file system changes. Obviously, I wanted to run them as an unprivileged user. Unfortunately, the test involved mounting and unmounting a filesystem (tmpfs to be specific). At first I was going to set up a sudo rule to allow mount and umount to run without asking for a password. Then I remembered that I should be able to give the unprivileged user the additional privileges. It turns out that there is only one privilege (sys_mount) necessary to delegate…and it is easy to do!

$ usermod -K defaultpriv=basic,sys_mount jeffpc

Then it’s a matter of logging out and back in. We can check using ppriv:

$ ppriv $$
925:    bash
flags = <none>
        E: basic,sys_mount
        I: basic,sys_mount
        P: basic,sys_mount
        L: all

At this point, mounting and unmounting works without sudo or similar user switching:

$ mkdir tmp
$ mount -F tmpfs none /tmp/tmp
$ df -h /tmp/tmp
Filesystem      Size  Used Avail Use% Mounted on
swap            2.6G     0  2.6G   0% /tmp/tmp

Serial Console in a Zone

In the past, I’ve talked about serial consoles. I have described how to set up a serial console on Solaris/OpenIndiana. I’ve talked about Grub’s composite console in Illumos-based distros. This time, I’m going do describe the one trick necessary to get tip(1) in a zone working.

In my case, I am using SmartOS to run my zones. Sadly, SmartOS doesn’t support device pass-through of this sort, so I have to tweak the zone config after I create the zone with vmadm.

Let’s assume that the serial port I want to pass through is /dev/term/a. Passing it through into a zone is as easy as:

[root@isis ~]# zonecfg -z 7cff99f6-2b01-464d-9f72-d0ef16ce48af
zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af> add device
zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af:device> set match=/dev/term/a
zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af:device> end
zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af> commit

At this point, you’ll probably want to reboot the zone (I don’t remember if it is strictly necessary). Once it is back up, you’ll want to get into the zone and point your software of choice at /dev/term/a. It doesn’t matter that you are in a zone. The same configuration rules apply — in my case, it’s the same change to /etc/remote as I described previously.

Inlining Atomic Operations

One of the items on my ever growing TODO list (do these ever shrink?) was to see if inlining Illumos’s atomic_* functions would make any difference. (For the record, these functions atomically manipulate variables. You can read more about them in the various man pages — atomic_add, atomic_and, atomic_bits, atomic_cas, atomic_dec, atomic_inc, atomic_or, atomic_swap.) Of course once I looked at the issue deeply enough, I ended up with five cleanup patches. The gist of it is, inlining them caused not only about 1% kernel performance improvement on the benchmarks, but also reduced the kernel size by a couple of kilobytes. You can read all about it in the associated bugs (5042, 5043, 5044, 5045, 5046, 5047) and the patch 0/6 email I sent to the developer list. In this blahg post, I want to talk about how exactly Illumos presents these atomic functions in a stable ABI but at the same time allows for inlines.

Genesis

It should come as no surprise that the “content” of these functions really needs to be written in assembly. The functions are 100% implemented in assembly in usr/src/common/atomic. There, you will find a directory per architecture. For example, in the amd64 directory, we’ll find the code for a 64-bit atomic increment:

	ENTRY(atomic_inc_64)
	ALTENTRY(atomic_inc_ulong)
	lock
	incq	(%rdi)
	ret
	SET_SIZE(atomic_inc_ulong)
	SET_SIZE(atomic_inc_64)

The ENTRY, ALTENTRY, and SET_SIZE macros are C preprocessor macros to make writing assembly functions semi-sane. Anyway, this code is used by both the kernel as well as userspace. I am going to ignore the userspace side of the picture and talk about the kernel only.

These assembly functions, get mangled by the C preprocessor, and then are fed into the assembler. The object file is then linked into the rest of the kernel. When a module binary references these functions the krtld (linker-loader) wires up those references to this code.

Inline

Replacing these function with inline functions (using the GNU definition) would be fine as far as all the code in Illumos is concerned. However doing so would remove the actual functions (as well as the symbol table entries) and so the linker would not be able to wire up any references from modules. Since Illumos cares about not breaking existing external modules (both open source and closed source), this simple approach would be a no-go.

Inline v2

Before I go into the next and final approach, I’m going to make a small detour through C land.

extern inline

First off, let’s say that we have a simple function, add, that returns the sum of the two integer arguments, and we keep it in a file called add.c:

#include "add.h"

int add(int x, int y)
{
	return x + y;
}

In the associated header file, add.h, we may include a prototype like the following to let the compiler know that add exists elsewhere and what types to expect.

extern int add(int, int);

Then, we attempt to call it from a function in, say, test.c:

#include "add.h"

int test()
{
	return add(5, 7);
}

Now, let’s turn these two .c files into a .so. We get the obvious result — test calls add:

test()
    test:     be 07 00 00 00     movl   $0x7,%esi
    test+0x5: bf 05 00 00 00     movl   $0x5,%edi
    test+0xa: e9 b1 fe ff ff     jmp    -0x14f	<0xc90>

And the binary contains both functions:

$ /usr/bin/nm test.so | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[74]	|                3520|                   4|FUNC |GLOB |0    |13   |add
[65]	|                3536|                  15|FUNC |GLOB |0    |13   |test

Now suppose that we modify the header file to include the following (assuming GCC’s inline definition):

extern int add(int, int);

extern inline int add(int a, int b)
{
	return a + b;
}

If we compile and link the same .so the same way, that is we feed in the object file with the previously used implementation of add, we’ll get a slightly different binary. The invocation of add will use the inlined version:

test()
    test:     b8 0c 00 00 00     movl   $0xc,%eax
    test+0x5: c3                 ret    

But the binary will still include the symbol:

$ /usr/bin/nm test.so | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[72]	|                3408|                   4|FUNC |GLOB |0    |11   |add
[63]	|                3424|                   6|FUNC |GLOB |0    |11   |test

Neat, eh?

extern inline atomic what?

How does this apply to the atomic functions? Pretty simply. As I pointed out, usr/src/common/atomic contains the pure assembly implementations — these are the functions you’ll always find in the symbol table.

The common header file that defines extern prototypes is usr/src/uts/common/sys/atomic.h.

Now, the trick. If you look carefully at the header file, you’ll spot a check on line 39. If all the conditions are true (kernel code, GCC, inline assembly is allowed, and x86), we include asm/atomic.h — which lives at usr/src/uts/intel/asm/atomic.h. This is where the extern inline versions of the atomic functions get defined.

So, kernel code simply includes <sys/atomic.h>, and if the stars align properly, any atomic function use will get inlined.

Phew! This ended up being longer than I expected. :)

Grub Composite Console

In the past, I’ve described how to get a serial console going on Illumos based systems. If you ever used a serial console in Grub (regardless of the OS you ended up booting), you probably know that telling Grub to output to a serial port causes the VGA console to become totally useless — it’s blank.

Well, if you are using Illumos, you are in luck. About 5 months ago, Joyent integrated a “composite console” in Grub. You can read the full description in the bug report/feature request. The short version is: all grub output can be sent to both the VGA console as well as over a serial port.

It is very easy to configure. In your menu.lst, change the terminal to composite. For example, this comes from my test box’s config file (omitting the uninteresting bits):

serial --unit=0 --speed=115200
terminal composite

Note the use of composite instead of serial. That’s all there is to it.

Lua Compatibility

Phew! Yesterday afternoon, I decided to upgrade my laptop’s OpenIndiana from 151a9 to “Hipster”. I did it in a bit convoluted way, and hopefully I’ll write about that some other day. In the end, I ended up with a fresh install of the OS with X11 and Gnome. If you’ve ever seen my monitors, you know that I do not use Gnome — I use Notion. So, of course I had it install it. Sadly, OpenIndiana doesn’t ship it so it was up to me to compile it. After the usual fight to get a piece of software to compile on Illumos (a number of the Solaris-isms are still visible), I got it installed. A quick gdm login later, Notion threw me into a minimal environment because something was exploding.

After far too many hours of fighting it, searching online, and trying random things, I concluded that it was not Notion’s fault. Rather, it was something on the system. Eventually, I figured it out. Lua 5.2 (which is standard on Hipster) is not compatible with Lua 5.1 (which is standard on 151a9)! Specifically, a number of functions have been removed and the behavior of other functions changed. Not being a Lua expert (I just deal with it whevever I need to change my window manager’s configuration), it took longer than it should but eventually I managed to get Notion working like it should be.

So, what sort of incompatibilies did I have to work around?

loadstring

loadstring got renamed to load. This is an easy to fix thing, but still a headache especially if you want to support multiple versions of Lua.

table.maxn

table.maxn got removed. This function returned the largest positive integer key in an associative array (aka. a table) or 0 if there aren’t any. (Lua indexes arrays starting at 1.) The developers decided that it’s so simple that those that want it can write it themselves. Here’s my version:

local function table_maxn(t)
    local mn = 0
    for k, v in pairs(t) do
        if mn < k then
            mn = k
        end
    end
    return mn
end

table.insert

table.insert now checks bounds. There doesn’t appear to be any specific way to get old behavior. In my case, I was lucky. The index/positition for the insertion was one higher than table_maxn returned. So, I could replace:

table.insert(ret, pos, newscreen)

with:

ret[pos] = newscreen

Final Thougths

I can understand wanting to deprecate old crufty interfaces, but I’m not sure that the Lua developers did it right. I really think they should have marked those interfaces as obsolete, make any use spit out a warning, and then in a couple of years remove it. I think that not doing this, will hurt Lua 5.2’s adoption.

Yes, I believe there is some sort of a compile time option for Lua to get legacy interfaces, but not everyone wants to recompile Lua because the system installed version wasn’t compiled quite the way that would make things Just Work™.

Rebooter

I briefly mentioned that I was debugging a boot hang. Since the hang does not happen every time I try to boot, it may take a couple of reboots to get the kernel to hang. Doing this manually is tedious. Thankfully it can be scripted. Therefore, I made a simple script and a SMF manifest that runs the script at the end of boot. If the system boots fine, my script reboots it. If the system hangs mid-boot, well my script never executes leaving the system in a hung state. Then, I can break into the kernel debugger (mdb) and investigate.

I’m sharing the two here mostly for my benefit… in case one day in the future I decide that I need my system automatically rebooted over and over again.

The script is pretty simple. Hopefully, 60 seconds is long enough to log in and disable the service if necessary. (In reality, I setup a separate boot environment that’s the default choice in Grub. I can just select my normal boot environment and get back to non-timebomb system.)

#!/bin/sh

sleep 60

reboot -p

The tricky part is of course in the manifest. Not because it is hard, but because XML is … verbose.

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='rebooter'>
	<service name='site/rebooter' type='service' version='1'>
		<dependency name='booted'
		    grouping='require_all'
		    restart_on='none'
		    type='service'>
			<service_fmri
			    value='svc:/milestone/multi-user-server:default'/>
		</dependency>

		<property_group name="startd" type="framework">
			<propval name="duration" type="astring" value="child"/>
			<propval name="ignore_error" type="astring"
				value="core,signal"/>
		</property_group>

		<instance name='system' enabled='true'>
			<exec_method
				type='method'
				name='start'
				exec='/home/jeffpc/illumos/rebooter/script.sh'
				timeout_seconds='0' />

			<exec_method
				type='method'
				name='stop'
				exec=':true'
				timeout_seconds='0' />
		</instance>

		<stability value='Unstable' />
	</service>
</service_bundle>

That’s all, carry on what you were doing. :)

Powered by blahgd