Josef “Jeff” Sipek

Rebooter

I briefly mentioned that I was debugging a boot hang. Since the hang does not happen every time I try to boot, it may take a couple of reboots to get the kernel to hang. Doing this manually is tedious. Thankfully it can be scripted. Therefore, I made a simple script and an SMF manifest that runs the script at the end of boot. If the system boots fine, my script reboots it. If the system hangs mid-boot, well, my script never executes, leaving the system in a hung state. Then, I can break into the kernel debugger (mdb) and investigate.

I’m sharing the two here mostly for my benefit… in case one day in the future I decide that I need my system automatically rebooted over and over again.

The script is pretty simple. Hopefully, 60 seconds is long enough to log in and disable the service if necessary. (In reality, I set up a separate boot environment that’s the default choice in Grub. I can just select my normal boot environment and get back to a non-timebomb system.)

#!/bin/sh

sleep 60

reboot -p
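If 60 seconds turns out to be too short, one option is to have the script look for a sentinel file before rebooting. The following is a minimal sketch of that idea, not the script above; the sentinel path /var/tmp/no-reboot, the --run argument, and the function names are all assumptions made for illustration.

```shell
#!/bin/sh

# Hypothetical sentinel file: create it (e.g. "touch /var/tmp/no-reboot")
# to stop the reboot loop without having to disable the service in time.
SENTINEL=${SENTINEL:-/var/tmp/no-reboot}
DELAY=${DELAY:-60}

# Succeeds (returns 0) only when the sentinel file is absent.
should_reboot() {
	[ ! -e "$SENTINEL" ]
}

main() {
	sleep "$DELAY"
	if should_reboot; then
		reboot -p
	fi
}

# Only act when explicitly asked to, so the functions can be tried out
# without rebooting the machine.
if [ "${1:-}" = "--run" ]; then
	main
fi
```

With this variant, the start method would have to invoke the script with the (made-up) --run argument.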

The tricky part is of course in the manifest. Not because it is hard, but because XML is … verbose.

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='rebooter'>
	<service name='site/rebooter' type='service' version='1'>
		<dependency name='booted'
		    grouping='require_all'
		    restart_on='none'
		    type='service'>
			<service_fmri
			    value='svc:/milestone/multi-user-server:default'/>
		</dependency>

		<property_group name="startd" type="framework">
			<propval name="duration" type="astring" value="child"/>
			<propval name="ignore_error" type="astring"
				value="core,signal"/>
		</property_group>

		<instance name='system' enabled='true'>
			<exec_method
				type='method'
				name='start'
				exec='/home/jeffpc/illumos/rebooter/script.sh'
				timeout_seconds='0' />

			<exec_method
				type='method'
				name='stop'
				exec=':true'
				timeout_seconds='0' />
		</instance>

		<stability value='Unstable' />
	</service>
</service_bundle>
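Getting the manifest into SMF, and defusing it again, could look roughly like the sketch below. The file name rebooter.xml is an assumption, and the commands are only printed unless DRY_RUN=0 is set, since running them for real starts the reboot loop.

```shell
#!/bin/sh

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
	if [ "${DRY_RUN:-1}" = 0 ]; then
		"$@"
	else
		echo "+ $*"
	fi
}

# Assumed file name for the manifest shown above.
MANIFEST=${MANIFEST:-rebooter.xml}

# Check the manifest, load it into the repository, and show its state.
install_rebooter() {
	run svccfg validate "$MANIFEST"
	run svccfg import "$MANIFEST"
	run svcs site/rebooter
}

# Stop the reboot loop; this has to happen before the script's sleep ends.
defuse_rebooter() {
	run svcadm disable svc:/site/rebooter:system
}

install_rebooter
defuse_rebooter
```

Because the instance is marked enabled='true', importing the manifest should be enough to start the countdown once the dependency is satisfied.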

That’s all, carry on what you were doing. :)

iSCSI boot - Success

In my previous post, I documented some steps necessary to get OpenIndiana to boot from iSCSI.

I finally managed to get it to work cleanly. So, here are the remaining details necessary to boot your OI box from iSCSI.

Installation

First, boot from one of the OI installation media. I used a USB flash drive. Then, before starting the installer, drop into a shell and connect to the target.

# iscsiadm add discovery-address 172.16.0.1
# iscsiadm modify discovery -t enable

At this point, you should have all the LUs accessible:

# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c5t600144F000000000000052A4B4CE0002d0 <SUN-COMSTAR-1.0 cyl 13052 alt 2 hd 255 sec 63>
          /scsi_vhci/disk@g600144f000000000000052a4b4ce0002
Specify disk (enter its number): 

Exit the shell and start the installer.

Now, the tricky part… When you get to the network configuration page, you must select the “None” option. Selecting “Automatically” will cause nwam to try to start on boot and it’ll step onto the already configured network interface. That’s it. Finish installation normally. Once you’re ready to reboot, either configure your network card or use iPXE as I’ve shared before.

e1000g

For the curious, here’s what the iSCSI booted (from the e1000g NIC) system looks like:

# svcs network/physical
STATE          STIME    FMRI
disabled       17:13:10 svc:/network/physical:nwam
online         17:13:15 svc:/network/physical:default
# dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
e1000g0     phys      1500   up       --         --
# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
e1000g0/?         static   ok           172.16.0.179/24
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128

nge

Does switching back to the on-board nge NICs work now? No. We still get a lovely panic:

WARNING: Cannot plumb network device 19

panic[cpu0]/thread=fffffffffbc2f400: vfs_mountroot: cannot mount root

Warning - stack not written to the dump buffer
fffffffffbc71ae0 genunix:vfs_mountroot+75 ()
fffffffffbc71b10 genunix:main+136 ()
fffffffffbc71b20 unix:_locore_start+90 ()

iSCSI boot

I decided a couple of days ago to try to see if OpenIndiana would still fail to boot from iSCSI like it did about two years ago. This post exists to remind me later what I did. If you find it helpful, great.

First, I had to set up the target. There is a bunch of documentation on how to use COMSTAR to export a LU, so I won’t explain. I made a 100 GB LU.

I dug up an older system to act as my test box and disconnected its SATA disk. Booting from the OI USB image was uneventful. Before starting the installer, I dropped into a shell and connected to the target (using iscsiadm). Then I installed OI onto the LU. Then, I dropped back into the shell to modify Grub’s menu.lst to use the serial port for the Grub menu and to make the kernel direct console output there as well.

Since the two on-board NICs can’t boot off iSCSI, I ended up using iPXE to boot off iSCSI. First, I made a script file:

#!ipxe

dhcp
sanboot iscsi:172.16.0.1:::0:iqn.2010-08.org.illumos:02:oi-test
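The run of colons in the sanboot line is easier to read once the fields are named. The root path has the form iscsi:<server>:<protocol>:<port>:<LUN>:<target>, and empty fields fall back to defaults (TCP and port 3260). A small sketch that assembles the same string; the function name iscsi_uri is made up for illustration:

```shell
#!/bin/sh

# Build an iSCSI root path: iscsi:<server>:<protocol>:<port>:<LUN>:<target>
# Empty protocol and port fields mean "use the default".
iscsi_uri() {
	server=$1 protocol=$2 port=$3 lun=$4 target=$5
	printf 'iscsi:%s:%s:%s:%s:%s\n' \
	    "$server" "$protocol" "$port" "$lun" "$target"
}

# Same values as in the iPXE script above: default protocol and port, LUN 0.
iscsi_uri 172.16.0.1 "" "" 0 iqn.2010-08.org.illumos:02:oi-test
```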

Then it was time to grab the source and build it. I did run into a simple problem in a test file, so I patched it trivially.

$ git clone git://git.ipxe.org/ipxe.git
$ cd ipxe
$ cat /tmp/ipxe.patch
diff --git a/src/tests/vsprintf_test.c b/src/tests/vsprintf_test.c
index 11512ec..2231574 100644
--- a/src/tests/vsprintf_test.c
+++ b/src/tests/vsprintf_test.c
@@ -66,7 +66,7 @@ static void vsprintf_test_exec ( void ) {
 	/* Basic format specifiers */
 	snprintf_ok ( 16, "%", "%%" );
 	snprintf_ok ( 16, "ABC", "%c%c%c", 'A', 'B', 'C' );
-	snprintf_ok ( 16, "abc", "%lc%lc%lc", L'a', L'b', L'c' );
+	//snprintf_ok ( 16, "abc", "%lc%lc%lc", L'a', L'b', L'c' );
 	snprintf_ok ( 16, "Hello world", "%s %s", "Hello", "world" );
 	snprintf_ok ( 16, "Goodbye world", "%ls %s", L"Goodbye", "world" );
 	snprintf_ok ( 16, "0x1234abcd", "%p", ( ( void * ) 0x1234abcd ) );
$ patch -p1 < /tmp/ipxe.patch
$ make bin/ipxe.usb EMBED=/tmp/ipxe.script
$ sudo dd if=bin/ipxe.usb of=/dev/rdsk/c8t0d0p0 bs=1M

Now, I had a USB flash drive with iPXE that’d get a DHCP lease and then proceed to boot from my iSCSI target.

Did the system boot? Partially. iPXE did everything right: DHCP, storing the iSCSI information in the iBFT, reading from the LU, and handing control over to Grub. Grub did the right thing too. Sadly, once within the kernel, things didn’t quite work out the way they should.

iBFT

Was the iBFT getting parsed properly? After reading the code for a while and using mdb to examine the state, I found a convenient tunable (read: global int that can be set using the debugger) that will cause the iSCSI boot parameters to be dumped to the console. It is called iscsi_print_bootprop. Setting it to non-zero will produce nice output:

Welcome to kmdb
kmdb: unable to determine terminal type: assuming `vt100'
Loaded modules: [ unix krtld genunix ]
[0]> iscsi_print_bootprop/W 1
iscsi_print_bootprop:           0               =       0x1
[0]> :c
OpenIndiana Build oi_151a7 64-bit (illumos 13815:61cf2631639d)
SunOS Release 5.11 - Copyright 1983-2010 Oracle and/or its affiliates.
All rights reserved. Use is subject to license terms.
Initiator Name : iqn.2010-04.org.ipxe:00020003-0004-0005-0006-000700080009
Local IP addr  : 172.16.0.179
Local gateway  : 172.16.0.1
Local DHCP     : 0.0.0.0
Local MAC      : 00:02:b3:a8:66:0c
Target Name    : iqn.2010-08.org.illumos:02:oi-test
Target IP      : 172.16.0.1
Target Port    : 3260
Boot LUN       : 0000-0000-0000-0000

nge vs. e1000g

So, the iBFT was getting parsed properly. The only “error” message to indicate that something was wrong was the “Cannot plumb network device 19”. Searching the code reveals that this is in the rootconf function. After more tracing, it became apparent that the kernel was trying to set up the NIC but was failing to find a device with the MAC address iBFT indicated. (19 is ENODEV)

At this point, it dawned on me that the on-board NICs are mere nge devices. I popped in a PCI-X e1000g, moved the cable over, and rebooted. Things got a lot farther!

unable to connect

Currently, I’m looking at this output.

NOTICE: Configuring iSCSI boot session...
NOTICE: iscsi connection(5) unable to connect to target iqn.2010-08.org.illumos:02:oi-test
Loading smf(5) service descriptions: 171/171
Hostname: oi-test
Configuring devices.
Loading smf(5) service descriptions: 6/6
NOTICE: iscsi connection(12) unable to connect to target iqn.2010-08.org.illumos:02:oi-test

The odd thing is, while these appear, SMF is busy loading manifests, and tracing the iSCSI traffic to the target shows that the kernel is doing a bunch of reads and writes. I suspect that all the successful I/O was done over one connection and then something happens and we lose the link. This is where I am now.

Meili upgrades

A couple of months ago, I decided to update my almost two and a half year old laptop. Twice.

First, I got more RAM. This upped it to 12 GB. While still on the low side for a box which actually gets to see some heavy usage (compiling illumos takes a couple of hours and generates a couple of GB of binaries), it was better than the 4 GB I used for way too long.

Second, I decided to bite the bullet and replaced the 320 GB disk with a 256 GB SSD (Samsung 840 Pro). Sadly, in the process I had the pleasure of reinstalling the system — both Windows 7 and OpenIndiana. Overall, the installation was uneventful as my Windows partition has no user data and my OI storage is split into two pools (one for system and one for my data).

The nice thing about reinstalling OI was getting back to a stock OI setup. A while ago, I managed to play with software packaging a bit too much and before I knew it I was using a customized fork of OI that I had no intention of maintaining. Of course, I didn’t realize this until it was too late to roll back. Oops. (Specifically, I had a custom pkg build which was incompatible with all versions of OI ever released.)

One of the painful things about my messed-up-OI install was that I was running a debug build of illumos. This made some things pretty slow. One such thing was boot. The ZFS related pieces took about a minute alone to complete. The whole boot procedure took about 2.5 minutes. Currently, with a non-debug build and an SSD, my laptop goes from Grub prompt to gdm login in about 40 seconds. I realize that this is an apples to oranges comparison.

I knew SSDs were supposed to be blazing fast, but I resisted getting one for the longest time mostly due to reliability concerns. What changed my mind? I got to use a couple of SSDs in my workstation at work. I saw the performance and I figured that ZFS would take care of alerting me of any corruption. Since most of my work is version controlled, chances are that I wouldn’t lose anything. Lastly, SSDs got a fair amount of improvements over the past few years.

Isis

After several years of having a desktop at home that’s been unplugged and unused, I decided that it was time to make a home server to do some of my development on and just to keep files stored safely and redundantly. This was in August 2011. A lot has happened since then. First of all, I rebuilt the OpenIndiana (an Illumos-based distribution) setup with SmartOS (another Illumos-based distribution). Since I wrote most of this a long time ago, some of the information below is obsolete. I am sharing it anyway since others may find it useful. Toward the end of the post, I’ll go over the SmartOS rebuild. As you may have guessed, the hostname for this box ended up being Isis.

First of all, I should list my goals.

storage box
The obvious mix of digital photos, source code repositories, assorted documents, and email backup is easy enough to store. It however becomes a nightmare if you need to keep track of where they are (i.e., which of the two external disks, public server (Odin), laptop drives, desktop drives they are on). Since none of them are explicitly public, it makes sense to keep them near home instead of on my public server that’s in a data-center with a fairly slow uplink (1 Mbit/s burstable to 10 Mbit/s, billed at 95th percentile).
dev box
I have a fast enough laptop (Thinkpad T520), but a beefier system that I can let compile large amounts of code is always nice. It will also let me run several virtual machines and zones comfortably — for development, system administration experiments, and other fun stuff.
router
I have an old Linksys WRT54G (rev. 3) that has served me well over the years. Sadly, it is getting a bit in my way: IPv6 tunneling over IPv4 is difficult, the 100 Mbit/s switch makes it harder to transfer files between computers, etc. If I am making a server that will be always on, it should effortlessly handle NAT’ing my Comcast internet connection. Having a full-fledged server doing the routing will also let me do better traffic shaping & filtering to make the connection feel better.

Now that you know what sort of goals I have, let’s take a closer look at the requirements for the hardware.

  1. reliable
  2. friendly to OpenIndiana and ZFS
  3. low-power
  4. fast
  5. virtualization assists (to support running virtual machines at reasonable speed)
  6. cheap
  7. quiet
  8. spacious (storage-wise)

While each one of them is pretty easy to accomplish, their combination is much harder to achieve. Also note that it is ordered from most to least important. As you will see, reliability dictated many of my choices.

The Shopping List

CPU
Intel Xeon E3-1230 Sandy Bridge 3.2GHz LGA 1155 80W Quad-Core Server Processor BX80623E31230
RAM (4)
Kingston ValueRAM 4GB 240-Pin DDR3 SDRAM DDR3 1333 ECC Unbuffered Server Memory Model KVR1333D3E9S/4G
Motherboard
SUPERMICRO MBD-X9SCL-O LGA 1155 Intel C202 Micro ATX Intel Xeon E3 Server Motherboard
Case
SUPERMICRO CSE-743T-500B Black Pedestal Server Case
Data Drives (3)
Seagate Barracuda Green ST2000DL003 2TB 5900 RPM SATA 6.0Gb/s 3.5"
System Drives (2)
Western Digital WD1600BEVT 160 GB 5400RPM SATA 8 MB 2.5-Inch Notebook Hard Drive
Additional NIC
Intel EXPI9301CT 10/100/1000Mbps PCI-Express Desktop Adapter Gigabit CT

To measure the power utilization, I got a P3 International P4400 Kill A Watt Electricity Usage Monitor. All my power usage numbers are based on watching the digital display.

Intel vs. AMD

I’ve read Constantin’s OpenSolaris ZFS Home Server Reference Design and I couldn’t help but agree that ECC should be a standard feature on all processors. Constantin pointed out that many more AMD processors support ECC and that as long as you got a motherboard that supported it as well you are set. I started looking around at AMD processors but my search was derailed by Joyent’s announcement that they ported KVM to Illumos — the core of OpenIndiana including the kernel. Unfortunately for AMD, this port supports only Intel CPUs. I switched gears and started looking at Intel CPUs.

In a way I wish I had a better reason for choosing Intel over AMD but that’s the truth. I didn’t want to wait for AMD’s processors to be supported by the KVM port.

So, why did I get a 3.2GHz Xeon (E3-1230)? I actually started by looking for motherboards. At first, I looked at desktop (read: cheap) motherboards. Sadly, none of the Intel-based boards I’ve seen supported ECC memory. Looking at server-class boards made the search for ECC support trivial. I was surprised to find a Supermicro motherboard (MBD-X9SCL-O) for $160. It supports up to 32 GB of ECC RAM (4x 8 GB DIMMs). Rather cheap, ECC memory, dual gigabit LAN (even though one of the LAN ports uses the Intel 82579 which was unsupported by OpenIndiana at the time), 6 SATA II ports: a nice board by any standard. This motherboard uses the LGA 1155 socket. That more or less means that I was “stuck” with getting a Sandy Bridge processor. :-D The E3-1230 is one of the slower E3 series processors, but it is still very fast compared to most of the other processors in the same price range. Additionally, it’s “only” an 80 Watt chip compared to many 95 or even 130 Watt chips from the previous series.

There you have it. The processor was more or less determined by the motherboard choice. Well, that’s being rather unfair. It just ended up being a good combination of processor and motherboard — a cheap server board and near-bottom-of-the-line processor that happens to be really sweet.

Now that I had a processor and a motherboard picked out, it was time to get RAM. In the past, I’ve had good luck with Kingston, and since it happened to be the cheapest ECC 4 GB DIMMs on NewEgg, I got 4 — for a grand total of 16 GB.

Case

I will let you know a secret. I love hotswap drive bays. They just make your life easier — from being able to lift a case up high to put it on a shelf without having to lift all those heavy drives at the same time, to quickly replacing a dead drive without taking the whole system down.

I like my public server’s case (Supermicro CSE-743T-645B) but the 645 Watt power supply is really an overkill for my needs. The four 5000 RPM fans on the midplane are pretty loud when they go full speed. I looked around, and I found a 500 Watt (80%+ efficiency) variant of the case (CSE-743-500B). Still a beefy power supply but closer to what one sees in high end desktops. With this case, I get eight 3.5" hot-swap bays, and three 5.25" external (non-hotswap) bays. This case shouldn’t be a limiting factor in any way.

I intended to move my DVD+RW drive from my desktop but that didn’t work out as well as I hoped.

Storage

At the time I was constructing Isis, I was experimenting with ZFS on OpenIndiana. I was more than impressed, and I wanted it to manage the storage on my home server. ZFS is more than just a filesystem, it is also a volume manager. In other words, you can give it multiple disks and tell it to put your data on them in several different ways that closely resemble RAID levels. It can stripe, mirror, or calculate one to three parities. Wikipedia has a nice article outlining ZFS’s features. Anyway, I strongly support ZFS’s attitude toward losing data: do everything to prevent it in the first place.

Hard drives are very interesting devices. Their reliability varies with so many variables (e.g., manufacturing defects, firmware bugs). In general, manufacturers give you fairly meaningless looking, yet impressive sounding numbers about their drives’ reliability. Richard Elling made a great blog post where he analyzed ZFS RAID space versus Mean-Time-To-Data-Loss, or MTTDL for short. (Later, he analyzed a different MTTDL model.)

The short version of the story is nicely summed up by this graph (taken from Richard’s blog):

While this scatter plot is for a specific model of a high-end server, it applies to storage in general. I like how the various types of redundancy clump up.

Anyway, how much do I care about my files? Most of my code lives in distributed version control systems, so losing one machine wouldn’t be a problem for those. The other files would be a bigger problem. While it wouldn’t be a complete end of the world if I lost all my photos, I’d rather not lose them. This goes back to the requirements list — I prefer reliable over spacious. That’s why I went with 3-way mirror of 2 TB Seagate Barracuda Green drives. It gets me only 2 TB of usable space, but at the same time I should be able to keep my files forever. These are the data drives. I also got two 2.5" 160 GB Western Digital laptop drives to hold the system files — mirrored of course.

Around the same time I was discovering that the only sane way to keep your files was mirroring, I stumbled across Constantin’s RAID Greed post. He basically says the same thing — use 3-way mirror and your files will be happy.

Now, you might be asking… 2 TB, that’s not a lot of space. What if you outgrow it? My answer is simple: ZFS handles that for me. I can easily buy three more drives, plug them in and add them as a second 3-way mirror and ZFS will happily stripe across the two mirrors. I considered buying 6 disks right away, but realized that it’ll probably be at least 6-9 months before I’ll have more than 2 TB of data. So, if I postpone the purchase of the 3 additional drives, I can save money. It turns out that a year and a half later, I’m still below 70% of the 2 TB.
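Growing the pool that way comes down to a single zpool add. A rough sketch, assuming a pool called storage and three placeholder device names (c2t6d0 through c2t8d0 are not real disks); it only prints the command, plus the capacity arithmetic behind it:

```shell
#!/bin/sh

# Print the command that would add another mirror vdev to a pool.
# Nothing is executed; the device names passed in are placeholders.
grow_pool_cmd() {
	pool=$1; shift
	echo "zpool add $pool mirror $*"
}

# Usable capacity in TB: ZFS stripes across the mirrors, and each mirror
# contributes the capacity of a single disk regardless of how wide it is.
usable_tb() {
	mirrors=$1 disk_tb=$2
	echo $((mirrors * disk_tb))
}

grow_pool_cmd storage c2t6d0 c2t7d0 c2t8d0
echo "usable now: $(usable_tb 1 2) TB, after growing: $(usable_tb 2 2) TB"
```

Existing data stays on the first mirror; only new writes get spread across both.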

Miscellaneous

I knew that one of the on-board LAN ports was not yet supported by Illumos, and so I threw a PCI-e Gigabit ethernet card into the shopping cart. I went with an Intel gigabit card. Illumos has since gained support for 82579-based NICs, but I’m lazy and so I’m still using the PCI-e NIC.

Base System

As the ordered components started showing up, I started assembling them. Thankfully, the CPU, RAM, motherboard, and case showed up at the same time preventing me from going crazy. The CPU came with a stock Intel heatsink.

The system started up fine. I went into the BIOS and did the usual new-system tweaking — make sure SATA ports are in AHCI mode, stagger the disk spinup to prevent unnecessary load peaks at boot, change the boot order to skip PXE, etc. While roaming around the menu options, I discovered that the motherboard can boot from iSCSI. Pretty neat, but useless for me on this system.

The BIOS has a menu screen that displays the fan speeds and the system and processor temperatures. With the fan on the heatsink and only one midplane fan connected the system ran at about 1°C higher than room temperature and the CPU was about 7°C higher than room temperature.

OS Installation

Anyway, it was time to install OpenIndiana. I put my desktop’s DVD+RW in the case and then realized that the motherboard doesn’t have any IDE ports! Oh well, time to use a USB flash drive instead. At this point, I had only the 2 system drives. I connected one to the first SATA port, put a 151 development snapshot (text installer) on my only USB flash drive. The installer booted just fine. Installation was uneventful. The one potentially out of the ordinary thing I did was to not configure any networking. Instead, I set it up manually after the first boot, but more about that later.

With OI installed on one disk, it was time to set up the rpool mirror. I used Constantin’s Mirroring Your ZFS Root Pool as the general guide even though it is pretty straightforward: duplicate the partition (and slice) scheme on the second disk, add the new slice to the root pool, and then install grub on it. Everything worked out nicely.

# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h5m with 0 errors on Sun Sep 18 14:15:24 2011
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c2t0d0s0  ONLINE       0     0     0
            c2t1d0s0  ONLINE       0     0     0

errors: No known data errors
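The mirroring steps described above can be sketched as follows. These are the usual commands for this procedure on a legacy-GRUB system rather than a transcript of what was actually typed; the disk names come from the zpool status output, and nothing is executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh

# Disk names match the zpool status output above; adjust for other systems.
SRC=c2t0d0
DST=c2t1d0

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
	if [ "${DRY_RUN:-1}" = 0 ]; then
		"$@"
	else
		echo "+ $*"
	fi
}

mirror_rpool() {
	# Put a single Solaris fdisk partition spanning the new disk on it.
	run fdisk -B "/dev/rdsk/${DST}p0"
	# Copy the slice table from the first disk to the second.
	run sh -c "prtvtoc /dev/rdsk/${SRC}s2 | fmthard -s - /dev/rdsk/${DST}s2"
	# Attach the new slice, turning the single-disk pool into a mirror.
	run zpool attach rpool "${SRC}s0" "${DST}s0"
	# Make the second disk bootable (legacy GRUB).
	run installgrub /boot/grub/stage1 /boot/grub/stage2 "/dev/rdsk/${DST}s0"
}

mirror_rpool
```

It is worth waiting for the resilver to finish (zpool status rpool) before trusting the second disk to boot.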

Networking

Since I wanted this box to act as a router, the network setup was a bit more…complicated (and quite possibly over-engineered). This is why I elected to do all the network setup by hand later rather than having to “fix” whatever damage the installer did. :)

I powered it off, put in the extra ethernet card I got, and powered it back on. To my surprise, the new device didn’t show up in dladm. I remembered that I should trigger the device reconfiguration. A short touch /reconfigure && reboot later, dladm listed two physical NICs.

network diagram

As you can see, I decided that the routing should be done in a zone. This way, all the routing settings are nicely contained in a single place that does nothing else.

Setting up the virtual interfaces was pretty easy thanks to dladm. Setting the static IP on the global zone was equally trivial.

# dladm create-vlan -l e1000g0 -v 11 vlan11
# dladm create-vnic -l e1000g0 vlan0
# dladm create-vnic -l e1000g0 internal0
# dladm create-vnic -l e1000g1 isp0
# dladm create-etherstub zoneswitch0
# dladm create-vnic -l zoneswitch0 zone_router0

# ipadm create-if internal0
# ipadm create-addr -T static -a local=10.0.0.2/24 internal0/v4

You might be wondering about the vlan11 interface that’s on a separate VLAN. The idea was to have my WRT54G continue serving as a wifi access point, but have all the traffic end up on VLAN #11. The router zone would then get to decide whether the user is worthy of LAN or Internet access. I never finished poking around the WRT54G to figure out how to have it dump everything on VLAN #11 instead of the default #0.

Router zone

OpenSolaris (and therefore all Illumos derivatives) has a wonderful feature called zones. It is essentially a super-lightweight virtualization mechanism. While talking to a couple of people on IRC, I decided that I, like them, would use a dedicated zone as a router.

Just before I set up the router zone, the storage disks arrived. The router zone ended up being stored on this array. See the storage section below for details about this storage pool.

After installing the zone via zonecfg and zoneadm, it was time to set up the routing and firewalling. First, install the ipfilter package (pkg install pkg:/network/ipfilter). Now, it is time to configure the NAT and filter rules.
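For reference, a zone definition along these lines could be fed to zonecfg. This is a guess at the shape of the configuration, not the one actually used: the zone name router, the zonepath under /storage/zones, and the choice of which links to hand to the zone (isp0 and zone_router0) are all assumptions.

```shell
#!/bin/sh

# Emit a zonecfg command file for a hypothetical "router" zone.
# An exclusive IP stack lets the zone run its own routing and ipfilter.
router_zonecfg() {
	cat <<'EOF'
create
set zonepath=/storage/zones/router
set ip-type=exclusive
add net
set physical=isp0
end
add net
set physical=zone_router0
end
commit
EOF
}

router_zonecfg
```

Saved to a file (say /tmp/router.cfg), this would be applied with zonecfg -z router -f /tmp/router.cfg, followed by zoneadm -z router install and zoneadm -z router boot.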

NAT is easy to set up. Just plop a couple of lines into /etc/ipf/ipnat.conf:

map isp0 10.0.0.0/24 -> 0/32 proxy port ftp ftp/tcp
map isp0 10.0.0.0/24 -> 0/32 portmap tcp/udp auto
map isp0 10.0.0.0/24 -> 0/32

map isp0 10.11.0.0/24 -> 0/32 proxy port ftp ftp/tcp
map isp0 10.11.0.0/24 -> 0/32 portmap tcp/udp auto
map isp0 10.11.0.0/24 -> 0/32

map isp0 10.1.0.0/24 -> 0/32 proxy port ftp ftp/tcp
map isp0 10.1.0.0/24 -> 0/32 portmap tcp/udp auto
map isp0 10.1.0.0/24 -> 0/32
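Each subnet gets the same three rules in the same order: the FTP proxy first, then port mapping for TCP and UDP, then a catch-all for everything else (ICMP, for example). Since the pattern is so regular, the file can be generated; a small sketch, with nat_rules being a made-up helper name:

```shell
#!/bin/sh

# Emit the three ipnat map rules for one internal subnet.
# Order matters: ipnat uses the first rule that matches.
nat_rules() {
	ifname=$1 net=$2
	echo "map $ifname $net -> 0/32 proxy port ftp ftp/tcp"
	echo "map $ifname $net -> 0/32 portmap tcp/udp auto"
	echo "map $ifname $net -> 0/32"
}

# Same interface and subnets as in the listing above.
for net in 10.0.0.0/24 10.11.0.0/24 10.1.0.0/24; do
	nat_rules isp0 "$net"
	echo
done
```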

IPFilter is a bit trickier to set up. The rules need to handle more cases. In general, I tried to be a bit paranoid about the rules. For example, I drop all traffic for IP addresses that don’t belong on that interface (I should never see 10.0.0.0/24 addresses on my ISP interface). The only snag was in the defaults for the ipfilter SMF service. By default, it expects you to put your rules into SMF properties. I wanted to use the more old-school approach of using a config file. Thankfully, I quickly found a blog post which helped me with it.
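The "addresses that don’t belong on that interface" check could look something like the rules generated below. This is an illustrative sketch and not the actual ruleset: it assumes the same interface name and subnets as the NAT rules, and antispoof_rules is a made-up helper.

```shell
#!/bin/sh

# Emit ipf rules that drop packets which claim an internal source address
# but arrive on the outside interface. "quick" stops rule processing at
# the first match instead of letting a later rule override it.
antispoof_rules() {
	ifname=$1; shift
	for net in "$@"; do
		echo "block in quick on $ifname from $net to any"
	done
}

antispoof_rules isp0 10.0.0.0/24 10.11.0.0/24 10.1.0.0/24
```

The output would go near the top of the rules file, ahead of any pass rules for that interface.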

Storage, part 2

As the list of components implies, I wanted to make two arrays. I already mentioned the rpool mirror. Once the three 2 TB disks arrived, I hooked them up and created a 3-way mirror (zpool create storage mirror c2t3d0 c2t4d0 c2t5d0).

# zpool status storage
  pool: storage
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Sep 18 14:10:22 2011
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0

errors: No known data errors

Deduplication & Compression

I suspected that there would be enough files that would be stored several times — system binaries for zones, clones of source trees, etc. ZFS has built-in online deduplication. This stores each unique block only once. It’s easy enough to turn on: zfs set dedup=on storage.

Additionally, ZFS has transparent data (and metadata) compression featuring LZJB and gzip algorithms.

I enabled dedup and kept compression off. Dedup did take care of the duplicate binaries between all the zones. It even took care of duplicates in my photo stash. (At some point, I managed to end up with several diverged copies of my photo stash. One of the first things I did with Isis, was to dump all of them in the same place and start sorting them. Adobe Lightroom helped here quite a bit.)

After a while, I came to the realization that for most workloads I run, dedup was wasteful and I would be better off disabling dedup and enabling light compression (i.e., LZJB).
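Making that switch is a matter of two property changes. A sketch, assuming the properties are set at the top of the storage pool so that child datasets inherit them; the commands are only printed unless DRY_RUN=0 is set:

```shell
#!/bin/sh

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
	if [ "${DRY_RUN:-1}" = 0 ]; then
		"$@"
	else
		echo "+ $*"
	fi
}

dedup_to_lzjb() {
	pool=$1
	# Both settings only affect data written from now on; existing
	# blocks stay deduplicated and uncompressed until rewritten.
	run zfs set dedup=off "$pool"
	run zfs set compression=lzjb "$pool"
	# Check the result and keep an eye on the achieved ratio.
	run zfs get dedup,compression,compressratio "$pool"
}

dedup_to_lzjb storage
```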

$HOME

The installer puts the non-privileged user’s home directory onto the root pool. I did not want to keep it there since I now had the storage pool. After a bit of thought, I decided to zfs create storage/home and then transfer over the current home directory. I could have used cp(1) or rsync(1), but I thought it would be more fun (and a learning experience) to use zfs send and zfs recv. It went something like this:

# zfs snapshot rpool/export/home/jeffpc@snap
# zfs send rpool/export/home/jeffpc@snap | zfs recv storage/home/jeffpc

In theory, any modifications to my home directory after the snapshot got lost, but since I was just ssh’d in there wasn’t much that changed. (I am ok with losing the last update to .bash_history this one time.) The last thing that needed changing is /etc/auto_home — which tells the automounter where my $HOME really is. This is the resulting file after the change (without the copyright comment):

jeffpc	localhost:/storage/home/&
+auto_home

For good measure, I rebooted to make sure things would come up properly — they did.
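Changes made after the first snapshot do not have to be lost: a second snapshot and an incremental stream would carry them over. A sketch of that follow-up step, which was not part of the original migration; the snapshot name snap2 is made up, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
	if [ "${DRY_RUN:-1}" = 0 ]; then
		"$@"
	else
		echo "+ $*"
	fi
}

SRC=rpool/export/home/jeffpc
DST=storage/home/jeffpc

catch_up() {
	# Capture whatever changed since the first snapshot.
	run zfs snapshot "$SRC@snap2"
	# Send only the delta between the two snapshots. The -F on the
	# receiving side rolls the target back to @snap first, in case it
	# was touched in the meantime.
	run sh -c "zfs send -i $SRC@snap $SRC@snap2 | zfs recv -F $DST"
}

catch_up
```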

Since the server is not intended just for me, I created the other user account with a home directory in storage/home/holly.

Zones

I intend to use zones extensively. To keep their files out of the way, I decided on storage/zones/$ZONE_NAME. I’ll talk more about the zones I set up later in the Zones section.

SMB

Local storage is great, but there is only so much you can do with it. Sooner or later, you will want to access it from a different computer. There are many different ways to “export” your data, but as one might expect, they all have their benefits and drawbacks. ZFS makes it really easy to export data via NFS and SMB. After a lot of thought, I decided that SMB would work a bit better. The major benefit of SMB over NFS is that it Just Works™ on all the major operating systems. That’s not to say that NFS does not work, but rather that it needs a bit more…convincing at times. This is especially true on Windows.

I followed the documentation for enabling SMB on Solaris 11. Yes, I know, OpenIndiana isn’t Solaris 11, but this aspect was the same. This ended with me enabling sharing of several datasets like this:

# zfs set sharesmb=name=photos storage/photos

ACLs

The home directory shares are all done. The photos share, however, needs a bit more work. Specifically, it should be fully accessible to the users that are supposed to have access (i.e., jeffpc & holly). The easiest way I can find is to use ZFS ACLs.

First, I set the aclmode to passthrough (zfs set aclmode=passthrough storage). This will prevent a chmod(1) on a file or directory from blowing away all the ACEs (Access Control Entries?). Then on the share directory, I added two ACL entries that allow everything.

# /usr/bin/ls -dV /share/photos
drwxr-xr-x   2 jeffpc   root           4 Sep 23 09:12 /share/photos
                 owner@:rwxp--aARWcCos:-------:allow
                 group@:r-x---a-R-c--s:-------:allow
              everyone@:r-x---a-R-c--s:-------:allow
# /usr/bin/chmod A+user:jeffpc:rwxpdDaARWcCos:fd:allow /share/photos
# /usr/bin/chmod A+user:holly:rwxpdDaARWcCos:fd:allow /share/photos
# /usr/bin/chmod A2- /share/photos # get rid of user
# /usr/bin/chmod A2- /share/photos # get rid of group
# /usr/bin/chmod A2- /share/photos # get rid of everyone
# /usr/bin/ls -dV /share/photos
drwx------+  2 jeffpc   root           4 Sep 23 09:12 /share/photos
            user:jeffpc:rwxpdDaARWcCos:fd-----:allow
             user:holly:rwxpdDaARWcCos:fd-----:allow

The first two chmod commands prepend two ACEs. The next three remove ACE number 2 (the third entry). Since the directory started off with three ACEs (representing the standard Unix permissions), the second set of chmods removes those, leaving only the two user ACEs behind.

Clients

That was easy! In case you are wondering, the Solaris/Illumos SMB service does not allow guest access. You must log in to use any of the shares.

Anyway, here’s the end result:

Pretty neat, eh?

Zones

Aside from the router zone, there were a number of other zones. Most of them were for Illumos and OpenIndiana development.

I don’t remember much of the details since this predates the SmartOS conversion.

Power

When I first measured the system, it was drawing about 40-45 Watts while idle. Now, I have Isis along with the WRT54G and a gigabit switch on a UPS that tells me that I’m using about 60 Watts when idle. The load can spike up quite a bit if I put load on the 4 Xeon cores and give the disks something to do. (After all, it is an 80 Watt CPU!) While this is by no means super low-power, it is low enough and at the same time I have the capability to actually get work done instead of waiting for hours for something to compile.

SmartOS

As I already mentioned, I ended up rebuilding the system with SmartOS. SmartOS is not a general purpose distro. Rather, it strives to be a hypervisor with utilities that make guest management trivial. Guests can either be zones, or KVM-powered virtual machines. Here are the major changes from the OpenIndiana setup.

Storage — pools

SmartOS is one of those distros you do not install. It always netboots, or boots from a USB stick or a CD. As a result, you do not need a system drive. This immediately obsoleted the two laptop drives. Conveniently, around the same time, Holly’s laptop suffered from a disk failure so Isis got to donate one of the unused 2.5" system disks.

SmartOS calls its data pool “zones”, which took a little bit of getting used to. There’s a way to import other pools, but I wanted to keep the settings as vanilla as possible.

At some point, I threw in an Intel 160 GB SSD to use for L2ARC and ZIL.

Here’s what the pool looks like:

# zpool status
  pool: zones
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0 in 2h59m with 0 errors on Sun Jan 13 08:37:37 2013
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
        logs
          c1t1d0s0  ONLINE       0     0     0
        cache
          c1t1d0s1  ONLINE       0     0     0

errors: No known data errors

In case you are wondering about the features related status message, I created the zones pool way back when Illumos (and therefore SmartOS) had only two ZFS features. Since then, Illumos added one and Joyent added one to SmartOS.

# zpool get all zones | /usr/xpg4/bin/grep -E '(PROP|feature)'
NAME   PROPERTY                   VALUE                      SOURCE
zones  feature@async_destroy      enabled                    local
zones  feature@empty_bpobj        active                     local
zones  feature@lz4_compress       disabled                   local
zones  feature@filesystem_limits  disabled                   local

I haven’t experimented with either enough to enable it on a production system I rely on so much.

Storage — deduplication & compression

The rebuild gave me a chance to start with a clean slate. Specifically, it gave me a chance to get rid of the dedup table. (The dedup table, DDT, is built as writes happen to the filesystem with dedup enabled.) Data deduplication relies on some form of data structure (the most trivial one is a hash table) that maps the hash of the data to the data. In ZFS, the DDT maps the SHA-256 of the block to the block address.

The reason I stopped using dedup on my systems was pretty straightforward (and not specific to ZFS). Every entry in the DDT has an overhead. So, ideally, every entry in the DDT is referenced at least twice. If a block is referenced only once, then one would be better off without the block taking up an entry in the DDT. Additionally, every time a reference is taken or released, the DDT needs to be updated. This causes very nasty random I/O under which spinning disks want to weep. It turns out that a “normal” user will have mostly unique data, rendering deduplication impractical.
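To make the overhead argument a bit more concrete, here is a minimal sketch of a toy dedup table in C. It is not how ZFS implements the DDT: the 64-bit checksum stands in for the real SHA-256, the linear search stands in for a proper hash lookup, and the names ddt_entry, ddt_build, and ddt_shared are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One toy DDT entry: a block checksum and how many times it is referenced. */
struct ddt_entry {
	uint64_t checksum;	/* stand-in for the real SHA-256 */
	unsigned refcnt;
};

/*
 * Build a toy dedup table from a list of block checksums.
 * Returns the number of entries written to tbl (at most cap).
 */
size_t
ddt_build(const uint64_t *blocks, size_t nblocks,
    struct ddt_entry *tbl, size_t cap)
{
	size_t used = 0;

	for (size_t i = 0; i < nblocks; i++) {
		size_t j;

		/* linear search keeps the sketch short */
		for (j = 0; j < used; j++) {
			if (tbl[j].checksum == blocks[i]) {
				tbl[j].refcnt++;	/* duplicate block */
				break;
			}
		}
		if (j == used && used < cap) {
			/* first time we see this block: new entry */
			tbl[used].checksum = blocks[i];
			tbl[used].refcnt = 1;
			used++;
		}
	}
	return used;
}

/* Count the entries that pay for themselves (referenced twice or more). */
size_t
ddt_shared(const struct ddt_entry *tbl, size_t used)
{
	size_t shared = 0;

	for (size_t i = 0; i < used; i++)
		if (tbl[i].refcnt >= 2)
			shared++;
	return shared;
}
```

Feeding it the six blocks 1, 2, 3, 2, 4, 5 produces five entries, of which only one is shared. The other four are pure overhead, which is roughly what mostly-unique data does to a real dedup table.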

That’s why I stopped using dedup. Instead, I became convinced that most of the time light compression is the way to go. Lightly compressing the data will result in I/O bandwidth savings as well as capacity savings with little overhead given today’s processor speeds versus I/O latencies. Since I haven’t had time to experiment with the recently integrated LZ4, I still use LZJB.

IPS: The Manifest

In the past, I have mentioned that IPS is great. I think it is about time I gave you more information about it. This time, I’ll talk about the manifest and some core IPS ideals.

IPS, Image Packaging System, has some really neat ideas. Each package contains a manifest. The manifest is a file that lists actions. Some very common actions are “install a file at path X,” “create a symlink from X to Y,” as well as “create user account X.” The great thing about this is that the manifest completely describes what needs to be done to the system to install a package. Uninstalling a package simply undoes the actions: delete files, symlinks, users. (This is where the “image” in IPS comes from: you can assemble the system image from the manifests.)

For example, here is the (slightly hand edited) manifest for OpenIndiana’s rsync package:

set name=pkg.fmri value=pkg://openindiana.org/network/rsync@3.0.9,5.11-0.151.1.7:20121003T221151Z
set name=org.opensolaris.consolidation value=sfw
set name=variant.opensolaris.zone value=global value=nonglobal
set name=description value="rsync - faster, flexible replacement for rcp"
set name=variant.arch value=i386
set name=pkg.summary value="rsync - faster, flexible replacement for rcp"
set name=pkg.description value="rsync - A utility that provides fast incremental file transfer and copy."
set name=info.classification value="org.opensolaris.category.2008:Applications/System Utilities"
dir group=sys mode=0755 owner=root path=usr
dir group=bin mode=0755 owner=root path=usr/bin
dir group=sys mode=0755 owner=root path=usr/share
dir group=bin mode=0755 owner=root path=usr/share/man
dir group=bin mode=0755 owner=root path=usr/share/man/man1
dir group=bin mode=0755 owner=root path=usr/share/man/man5
license 88142ae0b65e59112954efdf152bb2342e43f5e7
	chash=3b72b91c9315427c1994ebc5287dbe451c0731dc
	license=SUNWrsync.copyright pkg.csize=12402 pkg.size=35791
file 02f1be6412dd2c47776a62f6e765ad04d4eb328c
	chash=945deb12b17a9fd37461df4db7e2551ad814f88b
	elfarch=i386 elfbits=32
	elfhash=1d3feb5e8532868b099e8ec373dbe0bea4f218f1
	group=bin mode=0555 owner=root path=usr/bin/rsync
	pkg.csize=191690 pkg.size=395556
file 7bc01c64331c5937d2d552fd93268580d5dd7c66
	chash=328e86655be05511b2612c7b5504091756ef7e61
	group=bin mode=0444 owner=root
	path=usr/share/man/man1/rsync.1 pkg.csize=50628
	pkg.size=165934
file 006fa773e1be3fecf7bbfb6c708ba25ddcb0005e
	chash=9e403b4965ec233c5e734e6fcf829a034d22aba9
	group=bin mode=0444 owner=root
	path=usr/share/man/man5/rsyncd.conf.5
	pkg.csize=12610 pkg.size=37410
depend fmri=consolidation/sfw/sfw-incorporation type=require
depend fmri=system/library@0.5.11-0.151.1.7 type=require

The manifest is very easily readable. It is obvious that there are several sets of actions:

metadata
specifies the FMRI, description, and architecture among others
directories
lists all the directories that need to be created/deleted during installation/removal
license
specifies the file with the text of the license for the package
files
in general, most actions are file actions — each installs a file
dependencies
lastly, rsync depends on system/library and sfw-incorporation

The above example is missing symlinks, hardlinks, user accounts, services, and device driver related actions.

Many package management systems have the ability to execute arbitrary scripts after installation or prior to removal. IPS does not allow this since it would violate the idea that the manifest completely describes the package. This means (in theory), that one can tell IPS to install the base packages into a directory somewhere, and at the end one has a working system.

It all sounds good, doesn’t it? As always, the devil is in the details.

First of all, sometimes there’s just no clean way to perform all package setup at install time. One just needs a script to run to take care of the post-install configuration. Since IPS doesn’t support this, package developers often create a transient SMF manifest and let SMF run the script after the installation completes. This is just ugly, but not the end of the world.

Requests?

I’m going to try something new. Instead of posting a random thought every so often, I’m going to take requests. What do you want me to talk about next?

Serial Console

Over the past couple of days, I’ve been testing my changes to the crashdump core in Illumos. (Here’s why.) I do most of my development on my laptop — either directly, or I use it to ssh into a dev box. For Illumos development, I use the ssh approach. Often, I end up using my ancient desktop (pre-HyperThreading era 2GHz Pentium 4) as a test machine. It gets pretty annoying to have a physical keyboard and monitor to deal with when the system crashes. The obvious solution is to use a serial console. Sadly, all the “Solaris serial console howtos” leave a lot to be desired. As a result, I am going to document the steps here. I’m connecting from Solaris to Solaris. If you use Linux on one of the boxes, you will have to do it a little differently.

Test Box

First, let’s change the console speed from the default 9600 to a more reasonable 115200. In /etc/ttydefs change the console line to:

console:115200 hupcl opost onlcr:115200::console

Second, we need to tell the kernel to use the serial port as a console. Here, I’m going to assume that you are using the first serial port (i.e., ttya). So, open up your Grub config (/rpool/boot/grub/menu.lst assuming your root pool is rpool) and find the currently active entry.

You’ll see something like this:

title openindiana-8
findroot (pool_rpool,0,a)
bootfs rpool/ROOT/openindiana-8
splashimage /boot/splashimage.xpm
foreground FF0000
background A8A8A8
kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS
module$ /platform/i86pc/$ISADIR/boot_archive

We need to add two options. One to tell the kernel to use the serial port as a console, and one to tell it the serial config (rate, parity, etc.).

You’ll want to change the kernel$ line to:

kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS,console=ttya,ttya-mode="115200,8,n,1,-" -k

Note that we appended the options with commas to the existing -B. If you do not already have a -B, just add it and the two new options. The -k will make the kernel drop into the debugger when bad things happen. You can omit it if you just want a serial console without the debugger getting loaded.

There’s one last thing left to do. Let’s tell grub to use the same serial port and not use a splash image. This can be done by adding these lines to the top of your menu.lst:

serial --unit=0 --speed=115200
terminal serial

and removing (commenting out) the splashimage line.

So, what happens if you make all these changes and then beadm creates a new BE? The right thing! beadm will copy over all the kernel options so your new BE will just work.

Dev Box

I use OpenIndiana on my dev box. I could have used minicom, but I find minicom to be a huge pain unless you have a modem you want to talk to. I’m told that screen can talk to serial ports as well. I decided to keep things super-simple and configured tip.

First, one edits /etc/remote. I just changed the definition for hardwire to point to the first serial port (/dev/term/a) and use the right speed (115200):

hardwire:\
	:dv=/dev/term/a:br#115200:el=^C^S^Q^U^D:ie=%$:oe=^D:

Then, I can just run a simple command to get the other system:

$ tip hardwire

DTrace: The utmp_update Debugger

For the past 2 years, I have been a happy user of rxvt-unicode, aka urxvt. Recently, I noticed that my logs contained rather mysterious error messages:

Nov  3 22:46:03 meili utmp_update[1613]: [ID 845426 user.error] Wrong number of
	arguments or invalid user 

Sometimes, there were a dozen of these. Of course I filed a bug with the Illumos folks. Rich Lowe suggested using DTrace to figure out what is actually going on. It was time to look at the exit codes for utmp_update.

syscall::rexit:entry
/execname=="utmp_update"/
{
	printf("utmp_update exited with code %d", arg0);
	@[arg0] = count();
}

tick-60sec
{
	printa(@);
}

Since utmp is involved, it had something to do with terminals, so I tried to open some terminals and close them. That did it!

# dtrace -s catch-errors.d 
dtrace: script 'catch-errors.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0     49                      rexit:entry utmp_update exited with code 0
  1     49                      rexit:entry utmp_update exited with code 0
  0     49                      rexit:entry utmp_update exited with code 7
  0     49                      rexit:entry utmp_update exited with code 7
  1     49                      rexit:entry utmp_update exited with code 0
  5     49                      rexit:entry utmp_update exited with code 0
  1     49                      rexit:entry utmp_update exited with code 0
  2     49                      rexit:entry utmp_update exited with code 0
  3     49                      rexit:entry utmp_update exited with code 7
  1  67549                      :tick-60sec 
                7                3
                0                6

It turns out that every time I closed a terminal, utmp_update exited with error 7. A quick glance at usr/src/cmd/utmp_update/utmp_update.c reveals:

/*
 * Return codes
 */
#define	NORMAL_EXIT		0
#define	BAD_ARGS		1
#define	PUTUTXLINE_FAILURE	2
#define	FORK_FAILURE		3
#define	SETSID_FAILURE		4
#define	ALREADY_DEAD		5
#define	ENTRY_NOTFOUND		6
#define	ILLEGAL_ARGUMENT	7
#define	DEVICE_ERROR		8
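
Staring at a table of #defines gets old quickly, so a tiny lookup helper can turn the numbers from the DTrace output into names. This is only a sketch: the names come from the list above, but utmp_update_code_name is a made-up helper, not something that exists in utmp_update.c.

```c
#include <stddef.h>

/* Names of the utmp_update return codes listed above, indexed by code. */
static const char *const utmp_update_codes[] = {
	"NORMAL_EXIT",		/* 0 */
	"BAD_ARGS",		/* 1 */
	"PUTUTXLINE_FAILURE",	/* 2 */
	"FORK_FAILURE",		/* 3 */
	"SETSID_FAILURE",	/* 4 */
	"ALREADY_DEAD",		/* 5 */
	"ENTRY_NOTFOUND",	/* 6 */
	"ILLEGAL_ARGUMENT",	/* 7 */
	"DEVICE_ERROR",		/* 8 */
};

/* Hypothetical helper: map an exit code to its name, or "UNKNOWN". */
const char *
utmp_update_code_name(int code)
{
	size_t n = sizeof (utmp_update_codes) / sizeof (utmp_update_codes[0]);

	if (code < 0 || (size_t)code >= n)
		return "UNKNOWN";
	return utmp_update_codes[code];
}
```

With this, the exit code 7 seen in the trace maps to ILLEGAL_ARGUMENT.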

Aha! It really is an invalid argument. At this point, Rich pointed me to pututxline in libc.so. Sadly, for whatever reason, the probe pid*:libc.so.1:pututxline:entry didn’t work (it didn’t match anything). Rich suggested the following DTrace script:

proc:::exec
/strstr(args[0], "utmp") != NULL/
{
	trace(execname);
}

Pretty straightforward — the output told me that it was urxvt causing all this trouble.

Now, I knew to watch out for pututxline in urxvt. I tried to set a probe pid$target::pututxline:entry and use the new print function in DTrace, but due to a user error (read: sometimes I write stupid code) it didn’t work. Rich helped me navigate through mdb to get a print-out of the utx structure. At this point, it was a bit too late in the night and so I went to bed.

The next morning, I tried the print function again and this time I used it right and it printed out the structure:

struct utmpx {
    char [32] ut_user = [ "" ]
    char [4] ut_id = [ 'v', 't', '0', '2' ]
    char [32] ut_line = [ "pts/2" ]
    pid_t ut_pid = 0x193ec
    short ut_type = 0x8
    struct exit_status ut_exit = {
        short e_termination = 0
        short e_exit = 0
    }
    struct timeval ut_tv = {
        time_t tv_sec = 0x4eb55d2a
        suseconds_t tv_usec = 0x18adc
    }
    int ut_session = 0
    int [5] pad = [ 0, 0, 0, 0, 0 ]
    short ut_syslen = 0
    char [257] ut_host = [ "" ]
}

Everything looks right, except that the ut_user field is blank. I wonder if this could be the cause of it. Time to look at the urxvt code! (The ustack() action in a DTrace probe for pututxline:entry will tell you where to look.) Here’s a snippet from rxvt-unicode-9.12/libptytty/src/logging.C:

/*
 * remove utmp and wtmp entries
 */
void
ptytty_unix::logout ()
{
  ...
#ifdef HAVE_STRUCT_UTMPX
  setutxent ();
  tmputx = getutxid (utx);
  if (tmputx && tmputx->ut_pid == cmd_pid)
    pututxline (utx);
  endutxent ();
#endif
  ...
}

Ok, so it gets a utx struct and then it puts a different one. Let’s see how different those two are:

# cat getutxid.d
#include <utmpx.h>

pid$target::getutxid:return
{
	ustack();
	print(*(struct utmpx*)copyin(arg1, sizeof(struct utmpx)));
}
# dtrace -Cs getutxid.d -p 103403
dtrace: script 'getutxid.d' matched 1 probes
dtrace: pid 103403 has exited
CPU     ID                    FUNCTION:NAME
  4  67555                  getutxid:return 
              libc.so.1`getutxid+0xf1
              urxvt`_ZN11ptytty_unixD0Ev+0x16
              urxvt`_ZN9rxvt_termD1Ev+0x59b
              urxvt`_ZN9rxvt_term10destroy_cbERN2ev4idleEi+0x70
              urxvt`_ZN2ev4baseI7ev_idleNS_4idleEE12method_thunkI9rxvt_termXadL_ZNS5_10destroy_cbERS2_iEEEEvPS1_i+0x27
              urxvt`ev_invoke_pending+0x35
              urxvt`ev_run+0x520
              urxvt`main+0x29b
              urxvt`_start+0x83
struct utmpx {
    char [32] ut_user = [ "jeffpc" ]
    char [4] ut_id = [ 'v', 't', '0', '2' ]
    char [32] ut_line = [ "pts/2" ]
    pid_t ut_pid = 0x193ec
    short ut_type = 0x7
    struct exit_status ut_exit = {
        short e_termination = 0
        short e_exit = 0x2
    }
    struct timeval ut_tv = {
        time_t tv_sec = 0x4eb55d1f
        suseconds_t tv_usec = 0x18adc
    }
    int ut_session = 0
    int [5] pad = [ 0, 0, 0, 0x303a0005, 0x302e ]
    short ut_syslen = 0
    char [257] ut_host = [ "" ]
}

Ok, it’s more or less the same. It does, however, have a username filled in. I wonder what would happen if I filled in the username. (The urxvt code seems to fill it in only on login updates and it leaves the field empty on logout updates.) Now, I had 3 choices…

  1. Change the urxvt code to fill in the username on logout updates.
  2. Set a breakpoint in gdb or mdb and then tweak the structure before it is passed to utmp_update.
  3. Use DTrace’s “destructive” option to allow me to modify the process’s memory.

I chose #3.

Here’s the script in all its glory:

#pragma D option destructive

#include <utmpx.h>

pid$target::getutxid:return
{
	ustack();
	print(*(struct utmpx*)copyin(arg1, sizeof(struct utmpx)));
}

pid$target::pututxline:entry
{
	ustack();
	print(*(struct utmpx*)copyin(arg0, sizeof(struct utmpx)));
	printf("\nFIXING...\n");
	copyout("jeffpc\0", (uintptr_t)&((struct utmpx*)arg0)->ut_user[0], 7);
	print(*(struct utmpx*)copyin(arg0, sizeof(struct utmpx)));
}

pid$target::pututxline:return
{
	printf("pututxline returned %p", arg1);
}

And here’s the output:

# dtrace -Cs foo.d -p 103403
dtrace: script 'foo.d' matched 3 probes
dtrace: allowing destructive actions
dtrace: pid 103403 has exited
CPU     ID                    FUNCTION:NAME
  4  67555                  getutxid:return 
              libc.so.1`getutxid+0xf1
              urxvt`_ZN11ptytty_unixD0Ev+0x16
              urxvt`_ZN9rxvt_termD1Ev+0x59b
              urxvt`_ZN9rxvt_term10destroy_cbERN2ev4idleEi+0x70
              urxvt`_ZN2ev4baseI7ev_idleNS_4idleEE12method_thunkI9rxvt_termXadL_ZNS5_10destroy_cbERS2_iEEEEvPS1_i+0x27
              urxvt`ev_invoke_pending+0x35
              urxvt`ev_run+0x520
              urxvt`main+0x29b
              urxvt`_start+0x83
struct utmpx {
    char [32] ut_user = [ "jeffpc" ]
    char [4] ut_id = [ 'v', 't', '0', '2' ]
    char [32] ut_line = [ "pts/2" ]
    pid_t ut_pid = 0x193ec
    short ut_type = 0x7
    struct exit_status ut_exit = {
        short e_termination = 0
        short e_exit = 0x2
    }
    struct timeval ut_tv = {
        time_t tv_sec = 0x4eb55d1f
        suseconds_t tv_usec = 0x18adc
    }
    int ut_session = 0
    int [5] pad = [ 0, 0, 0, 0x303a0005, 0x302e ]
    short ut_syslen = 0
    char [257] ut_host = [ "" ]
}
  4  67556                 pututxline:entry 
              libc.so.1`pututxline
              urxvt`_ZN11ptytty_unix6logoutEv+0x15c
              urxvt`_ZN11ptytty_unixD0Ev+0x16
              urxvt`_ZN9rxvt_termD1Ev+0x59b
              urxvt`_ZN9rxvt_term10destroy_cbERN2ev4idleEi+0x70
              urxvt`_ZN2ev4baseI7ev_idleNS_4idleEE12method_thunkI9rxvt_termXadL_ZNS5_10destroy_cbERS2_iEEEEvPS1_i+0x27
              urxvt`ev_invoke_pending+0x35
              urxvt`ev_run+0x520
              urxvt`main+0x29b
              urxvt`_start+0x83
struct utmpx {
    char [32] ut_user = [ "" ]
    char [4] ut_id = [ 'v', 't', '0', '2' ]
    char [32] ut_line = [ "pts/2" ]
    pid_t ut_pid = 0x193ec
    short ut_type = 0x8
    struct exit_status ut_exit = {
        short e_termination = 0
        short e_exit = 0
    }
    struct timeval ut_tv = {
        time_t tv_sec = 0x4eb55d2a
        suseconds_t tv_usec = 0x18adc
    }
    int ut_session = 0
    int [5] pad = [ 0, 0, 0, 0, 0 ]
    short ut_syslen = 0
    char [257] ut_host = [ "" ]
}
FIXING...
struct utmpx {
    char [32] ut_user = [ "jeffpc" ]
    char [4] ut_id = [ 'v', 't', '0', '2' ]
    char [32] ut_line = [ "pts/2" ]
    pid_t ut_pid = 0x193ec
    short ut_type = 0x8
    struct exit_status ut_exit = {
        short e_termination = 0
        short e_exit = 0
    }
    struct timeval ut_tv = {
        time_t tv_sec = 0x4eb55d2a
        suseconds_t tv_usec = 0x18adc
    }
    int ut_session = 0
    int [5] pad = [ 0, 0, 0, 0, 0 ]
    short ut_syslen = 0
    char [257] ut_host = [ "" ]
}
  4  67557                pututxline:return pututxline returned fef6ecf0

We can see getutxid return a reasonable utmpx. Then we see pututxline get a utmpx without a username set. Then there is the fixed-up utmpx. Finally, we see that pututxline returned a non-NULL pointer. (It returns NULL on error, which does indeed happen without the fix-up.)

There you have it, folks. DTrace let me debug an issue without annoying change-compile-install cycles. Now, all I have to do is fix up urxvt in OpenIndiana and possibly, if it applies to other systems, push the fix upstream.
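
For the curious, here is roughly what the non-DTrace fix could look like. This is a sketch under assumptions, not the actual urxvt change: fill_logout_user is a made-up helper, and the idea is simply to copy the username from the entry that getutxid found into the logout record before handing it to pututxline.

```c
#include <string.h>
#include <utmpx.h>

/*
 * Hypothetical helper, not the actual urxvt code: before writing a logout
 * record, copy the username from the entry that getutxid() returned.
 */
void
fill_logout_user(struct utmpx *logout, const struct utmpx *found)
{
	/* nothing to copy from, or the username is already filled in */
	if (found == NULL || logout->ut_user[0] != '\0')
		return;

	/* both fields are the same fixed-size array, so copy it whole */
	memcpy(logout->ut_user, found->ut_user, sizeof (logout->ut_user));
}
```

In the logout function shown earlier, the call would go right before pututxline (utx), inside the same if that checks tmputx and cmd_pid.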

Timesavers: ZFS & BE

I’ve mentioned Boot Environments before. Well, earlier this week BEs and ZFS snapshots saved me a bunch of time. Here’s what happened.

I was in the middle of installing some package (pkg install foo) when my laptop locked up. I had to power cycle it the hard way. When it booted back up, I retried the install, but pkg complained that some state file was corrupted and it didn’t want to do anything. Uh oh. I’ve had similar issues happen to me on Debian with aptitude, so I knew that the hard way of fixing this issue was going to take more time than I’d like to dedicate to it (read: none). Thankfully, I use OpenIndiana which has ZFS and BEs.

  1. Reboot into a BE from a while ago (openindiana-3). The latest BE (openindiana-4) was created by pkg about a month ago as a clone of openindiana-3 during a major upgrade.
  2. Figure out which automatic ZFS snapshot I want to revert to. A matter of running zfs list -t all rpool/ROOT/openindiana-4 | tail -5 and picking the latest snapshot which I believe is from before pkg messed it all up. I ended up going an hour back just to make sure.
  3. Revert the BE. beadm rollback openindiana-4@zfs-auto-snap_hourly-2011-10-25-19h11
  4. Reboot back into openindiana-4.
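
Step 2 is easy to do by eye, but it can also be sketched in code. The following C fragment is only an illustration, and it assumes every snapshot name ends in a zero-padded timestamp of the form YYYY-MM-DD-HHhMM like the one in step 3, so that plain string comparison orders the snapshots by time. The names snap_stamp and latest_before are made up.

```c
#include <stddef.h>
#include <string.h>

#define	STAMP_LEN	16	/* strlen("2011-10-25-19h11") */

/* Return the trailing timestamp of a snapshot name, or NULL if too short. */
const char *
snap_stamp(const char *name)
{
	size_t len = strlen(name);

	return (len < STAMP_LEN) ? NULL : name + len - STAMP_LEN;
}

/*
 * Pick the newest snapshot that is strictly older than cutoff, where cutoff
 * uses the same timestamp format.  Returns NULL if none qualifies.
 */
const char *
latest_before(const char *const *names, size_t n, const char *cutoff)
{
	const char *best = NULL;

	for (size_t i = 0; i < n; i++) {
		const char *s = snap_stamp(names[i]);

		/* skip malformed names and anything at or after the cutoff */
		if (s == NULL || strcmp(s, cutoff) >= 0)
			continue;
		if (best == NULL || strcmp(s, snap_stamp(best)) > 0)
			best = names[i];
	}
	return best;
}
```

Going "an hour back just to make sure" then amounts to choosing a cutoff an hour before the moment things went wrong.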

After the final reboot, everything worked just fine. (Since the home directories are on a different dataset, they were left untouched.)

Total downtime: 5 minutes
Ease of repair: trivial

Your Turn

Do you have a corrupt package manager war story? Did you just restore from backup? Let me know in a comment.

Recursion

Last night while reading about DTrace, I came across an example that involved tracing a simple recursive factorial program. I pointed it out to my girlfriend, Holly, since I thought that she’d find it interesting — the class she teaches has a section about recursion.

Here’s the original code:

extern int fac (int n);

int main(int argc, char **argv)
{
	int f;
	f = fac(6);
	return 0;
}

int fac (int n)
{
	if (n <= 1)
		return 1;
	else
		return n * fac (n - 1);
}

Pretty simple. I compiled it, and ran it with DTrace:

$ gcc -o fac orig.c 
$ cat fac.d 
pid$target::fac:entry
{
        trace (arg0);
}
pid$target::fac:return
{
        trace (arg1);
}
$ sudo dtrace -Fs fac.d -c ./fac
dtrace: script 'fac.d' matched 2 probes
dtrace: pid 1146 has exited
CPU FUNCTION                                 
  0  -> fac                                                   6
  0    -> fac                                                 5
  0      -> fac                                               4
  0        -> fac                                             3
  0          -> fac                                           2
  0            -> fac                                         1
  0            <- fac                                         1
  0          <- fac                                           2
  0        <- fac                                             6
  0      <- fac                                              24
  0    <- fac                                               120
  0  <- fac                                                 720

Cool! I thought I was done. Holly asked whether it would work with tail-recursion. Interesting, I thought…it might, depending on how gcc handles the function prologue and epilogue. The D script is a little different to dump both of the arguments. Here’s the result:

$ gcc -o fact tail.c
$ cat fact.d 
pid$target::fac:entry
{
        trace (arg0); trace(arg1);
}
pid$target::fac:return
{
        trace (arg1);
}
$ sudo dtrace -Fs fact.d -c ./fact
dtrace: script 'fact.d' matched 2 probes
dtrace: pid 1233 has exited
CPU FUNCTION                                 
  2  -> fac                                                   6                1
  2    -> fac                                                 5                6
  2      -> fac                                               4               30
  2        -> fac                                             3              120
  2          -> fac                                           2              360
  2            -> fac                                         1              720
  2            <- fac                                       720
  2          <- fac                                         720
  2        <- fac                                           720
  2      <- fac                                             720
  2    <- fac                                               720
  2  <- fac                                                 720

You can see the values getting calculated and passed down versus them getting calculated during the return. That was fun to see. But wait! If this function is tail recursive, why are we seeing all these returns? It should be just one return! Doh! I didn’t compile with optimizations so gcc emitted the stupidest possible code. Easy enough to fix:

$ gcc -o fact -O2 tail.c 
$ sudo dtrace -Fs fact.d -c ./fact
dtrace: script 'fact.d' matched 2 probes
dtrace: pid 1304 has exited

$

Huh… nothing happened! fac was never called. Let’s see what gcc emitted:

$ objdump -S fact
...
Disassembly of section .text.startup:

08050d10 <main>:
 8050d10:       31 c0                   xor    %eax,%eax
 8050d12:       c3                      ret  

Great, the main function turned into a return 0 because the value returned by fac() was never used. Easy enough, let’s just return the value from main.

$ cat tail.c 
extern int fac (int n);

int main(int argc, char **argv)
{
        return fac(6);
}

int fac (int n)
{
        if (n <= 1)
                return 1;
        else
                return n * fac (n - 1);
}
$ gcc -o fact -O2 tail.c 
$ objdump -S fact
...
Disassembly of section .text.startup:

08050d10 <main>:
 8050d10:       b8 d0 02 00 00          mov    $0x2d0,%eax
 8050d15:       c3                      ret    

Argh! Thanks gcc, you replaced my fac(6) function call with the value 720 — that is the factorial of 6. Fine, let’s do this the hard way: get an int from the first argument and print it out. Also, to prevent inlining, let’s put it in a separate file. So, now we have:

$ cat fac.c
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

extern int fac (int n);

int main(int argc, char **argv)
{
        printf("%d\n", fac(atoi(argv[1])));
        return 0;
}
$ cat fac_2.c
int fac (int n)
{
        if (n <= 1)
                return 1;
        else
                return n * fac (n - 1);
}
$ cat fact.c
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

extern int fac (int n, int v);

int main(int argc, char **argv)
{
        printf("%d\n", fac(atoi(argv[1]), 1));
        return 0;
}
$ cat fact_2.c
int fac (int n, int v)
{
        if (n <= 1)
                return v;
        else
                return fac(n-1, n*v);
}

Now, if we run the two variants compiled with -O2, we get:

$ sudo dtrace -Fs fac.d -c "./fac 6"
dtrace: script 'fac.d' matched 2 probes
720
dtrace: pid 1397 has exited
CPU FUNCTION                                 
  2  -> fac                                                   6
  2  <- fac                                                 720

$ sudo dtrace -Fs fact.d -c "./fact 6"
dtrace: script 'fact.d' matched 2 probes
720
dtrace: pid 1399 has exited
CPU FUNCTION                                 
  5  -> fac                                                   6                1
  5  <- fac                                                 720

Weird, for both we can see only one function entry and return. Let’s try with -O1:

$ sudo dtrace -Fs fac.d -c "./fac 6"
dtrace: script 'fac.d' matched 2 probes
720
dtrace: pid 1393 has exited
CPU FUNCTION                                 
  6  -> fac                                                   6
  6    -> fac                                                 5
  6      -> fac                                               4
  6        -> fac                                             3
  6          -> fac                                           2
  6            -> fac                                         1
  6            <- fac                                         1
  6          <- fac                                           2
  6        <- fac                                             6
  6      <- fac                                              24
  6    <- fac                                               120
  6  <- fac                                                 720

$ sudo dtrace -Fs fact.d -c "./fact 6"
dtrace: script 'fact.d' matched 2 probes
720
dtrace: pid 1395 has exited
CPU FUNCTION                                 
  2  -> fac                                                   6                1
  2    -> fac                                                 5                6
  2      -> fac                                               4               30
  2        -> fac                                             3              120
  2          -> fac                                           2              360
  2            -> fac                                         1              720
  2            <- fac                                       720
  2          <- fac                                         720
  2        <- fac                                           720
  2      <- fac                                             720
  2    <- fac                                               720
  2  <- fac                                                 720

Ok, now we’re back to having call and return instructions for both cases; the tail recursive function is not actually tail recursing when it should. So, the first moral of the story is: -O1 is not enough to make tail recursive functions tail recurse. The odd behavior of the non-tail recursive code with -O2 is still weird. Let’s disassemble it; first the simple recursive code:

08050d00 <fac>:
 8050d00:       8b 54 24 04             mov    0x4(%esp),%edx
 8050d04:       b8 01 00 00 00          mov    $0x1,%eax
 8050d09:       83 fa 01                cmp    $0x1,%edx
 8050d0c:       7e 0d                   jle    8050d1b <fac+0x1b>
 8050d0e:       66 90                   xchg   %ax,%ax
 8050d10:       0f af c2                imul   %edx,%eax
 8050d13:       83 ea 01                sub    $0x1,%edx
 8050d16:       83 fa 01                cmp    $0x1,%edx
 8050d19:       75 f5                   jne    8050d10 <fac+0x10>
 8050d1b:       f3 c3                   repz ret 
 8050d1d:       90                      nop    
 8050d1e:       90                      nop    
 8050d1f:       90                      nop    

Whoa! gcc turned the plain recursive code into a loop, exactly as if it had been written tail-recursively. There is no call instruction left at all. For comparison, here’s the disassembly of the explicitly-coded-as-tail-recursive function:

08050d00 <fac>:
 8050d00:       8b 54 24 04             mov    0x4(%esp),%edx
 8050d04:       8b 44 24 08             mov    0x8(%esp),%eax
 8050d08:       83 fa 01                cmp    $0x1,%edx
 8050d0b:       7e 0e                   jle    8050d1b <fac+0x1b>
 8050d0d:       8d 76 00                lea    0x0(%esi),%esi
 8050d10:       0f af c2                imul   %edx,%eax
 8050d13:       83 ea 01                sub    $0x1,%edx
 8050d16:       83 fa 01                cmp    $0x1,%edx
 8050d19:       75 f5                   jne    8050d10 <fac+0x10>
 8050d1b:       f3 c3                   repz ret 
 8050d1d:       90                      nop    
 8050d1e:       90                      nop    
 8050d1f:       90                      nop

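Do you see it? It’s virtually identical to what gcc emitted for the naive code. The only differences are the initial value of %eax (loaded from the second argument instead of the constant 1) and the padding instruction in front of the loop.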
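If x86 is not your thing, here is a rough C sketch of what the two listings boil down to. The names fac_naive, fac_tail, and fac_loop are made up for this illustration (the experiments above call both variants fac), and fac_loop is a hand translation of the disassembly, not something gcc produced. Treat it as an approximation of the idea rather than the exact source used above.

```c
/* Naive recursion: the multiply happens after the recursive call returns. */
int fac_naive(int n)
{
	if (n <= 1)
		return 1;
	return n * fac_naive(n - 1);
}

/* Accumulator-passing form: the recursive call is the very last thing done. */
int fac_tail(int n, int acc)
{
	if (n <= 1)
		return acc;
	return fac_tail(n - 1, n * acc);
}

/*
 * Hand translation of the loop in both -O2 listings (hypothetical name).
 * For the naive code, gcc starts the accumulator at the constant 1.
 */
int fac_loop(int n, int acc)
{
	if (n <= 1)		/* cmp $0x1,%edx ; jle */
		return acc;
	do {
		acc *= n;	/* imul %edx,%eax */
		n -= 1;		/* sub $0x1,%edx */
	} while (n != 1);	/* cmp $0x1,%edx ; jne */
	return acc;
}
```

Calling fac_naive(6), fac_tail(6, 1), and fac_loop(6, 1) all give 720, which matches the traces above.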

So there you have it, folks. The compiler is smarter than you, more consistent than you, and less likely than you to screw up when converting a recursive function into a tail-recursive one. In general, you should not prematurely optimize.

In case you care, I used gcc 4.6.1 on OpenIndiana for these experiments.

Your Turn

Do you have an interesting compiler optimization story? Share it in a comment!
