Josef “Jeff” Sipek

OpenIndiana: UFS Root and FreeBSD Loader

Recently, I had a couple of uses for an illumos-based system that does not use ZFS for the root file system. The first time around, I was trying to test a ZFS code change, which is significantly easier to do when the root file system is not the one my changes are likely to break. The second time around, I did not want to deal with the ARC eating up all the RAM on the test system.

It is still possible to use UFS for the root file system; it just requires a bit of manual labor since illumos distros no longer let you install directly onto a UFS volume. I am using this post to write down the steps necessary to get a VM with a UFS root file system up and running.

Virtual Hardware

First of all, let’s talk about the necessary virtual hardware. We will need a NIC that we can dedicate to the VM so it can reach the internet, and we will need two disks. Here, I am using a VNIC and two zvols to accomplish this. Finally, we will need an ISO with OpenIndiana Hipster; I am using the ISO from October 2015.

Let’s create all the virtual hardware:

host# dladm create-vnic -l e1000g0 ufsroot0
host# zfs create storage/kvm/ufsroot
host# zfs create -V 10g -s storage/kvm/ufsroot/root 
host# zfs create -V 10g -s storage/kvm/ufsroot/swap

The overall process will involve borrowing the swap disk to install Hipster on it normally (with ZFS as the root file system), and then we will copy everything over to the other disk where we will use UFS. Once we are up and running from the UFS disk, we will nuke the borrowed disk’s contents and set it up as a swap device. So, let’s fire up the VM. To make it easier to distinguish between the two disks, I am setting up the root disk as a virtio device and the other as an IDE device. (We will change the swap device to be a virtio device once we are ready to reclaim it for swap.)

host# /usr/bin/qemu-kvm \
	-enable-kvm \
	-vnc 0.0.0.0:42 \
	-m 2048 \
	-no-hpet \
	-drive file=/dev/zvol/rdsk/storage/kvm/ufsroot/root,if=virtio,index=0 \
	-drive file=/dev/zvol/rdsk/storage/kvm/ufsroot/swap,if=ide,index=1 \
	-net nic,vlan=0,name=net0,model=virtio,macaddr=2:8:20:a:46:54 \
	-net vnic,vlan=0,name=net0,ifname=ufsroot0,macaddr=2:8:20:a:46:54 \
	-cdrom OI-hipster-text-20151003.iso -boot once=d \
	-smp 2 \
	-vga std \
	-serial stdio

Installing

I am not going to go step by step through the installation. All I am going to say is that you should install it on the IDE disk. For me it shows up as c0d1. (c2t0d0 is the virtio disk.)

Once the system is installed, boot it. (From this point on, we do not need the ISO, so you can remove the -cdrom from the command line.) After Hipster boots, configure networking and ssh.
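
For completeness, the networking and ssh bits amount to something like the following. (This is a sketch; the in-guest interface name vioif0 and the use of DHCP are assumptions based on my virtio NIC setup, so adjust to taste.)

ufsroot# ipadm create-if vioif0
ufsroot# ipadm create-addr -T dhcp vioif0/v4
ufsroot# cp /etc/nsswitch.dns /etc/nsswitch.conf
ufsroot# svcadm enable network/ssh

A static address (ipadm create-addr -T static -a …) works just as well, and you may need to populate /etc/resolv.conf by hand.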

Updating

Now that we have a very boring stock Hipster install, we should at the very least update it to the latest packages (via pkg update). I am updating to “jeffix” which includes a number of goodies like Toomas Soome’s port of the FreeBSD loader to illumos. If you are using stock Hipster, you will have to figure out how to convince GRUB to do the right thing on your own.

ufsroot# pkg set-publisher --non-sticky openindiana.org
ufsroot# pkg set-publisher -P -g http://pkg.31bits.net/jeffix/2015/ \
	jeffix.31bits.net
ufsroot# pkg update
            Packages to remove:  31
           Packages to install:  14
            Packages to update: 518
           Mediators to change:   1
       Create boot environment: Yes
Create backup boot environment:  No

DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                            563/563     8785/8785  239.7/239.7  1.3M/s

PHASE                                          ITEMS
Removing old actions                       7292/7292
Installing new actions                     5384/5384
Updating modified actions                10976/10976
Updating package state database                 Done 
Updating package cache                       549/549 
Updating image state                            Done 
Creating fast lookup database                   Done 

A clone of openindiana exists and has been updated and activated.
On the next boot the Boot Environment openindiana-1 will be
mounted on '/'.  Reboot when ready to switch to this updated BE.


---------------------------------------------------------------------------
NOTE: Please review release notes posted at:

http://wiki.openindiana.org/display/oi/oi_hipster
---------------------------------------------------------------------------

Reboot into the new boot environment and double check that the update really updated everything it was supposed to.

ufsroot# uname -a
SunOS ufsroot 5.11 jeffix-20160219T162922Z i86pc i386 i86pc Solaris

Great!

Partitioning

First, we need to partition the virtio disk. Let’s be fancy and use a GPT partition table. The easiest way to create one is to create a whole-disk zfs pool on the virtio disk and immediately destroy it.

ufsroot# zpool create temp-pool c2t0d0
ufsroot# zpool destroy temp-pool

This creates an (almost) empty GPT partition table. We need to add two partitions—one tiny partition for the boot block and one for the UFS file system.

ufsroot# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0d1 <QEMU HARDDISK=QM00002-QM00002-0001-10.00GB>
          /pci@0,0/pci-ide@1,1/ide@0/cmdk@1,0
       1. c2t0d0 <Virtio-Block Device-0000-10.00GB>
          /pci@0,0/pci1af4,2@4/blkdev@0,0
Specify disk (enter its number): 1
selecting c2t0d0
No defect list found
[disk formatted, no defect list found]
...
format> partition

I like to align partitions to 1MB boundaries, which is why I specified 2048 as the starting sector. 1MB is plenty of space for the boot block, and it automatically makes the next partition 1MB aligned. It is important to specify “boot” for the partition id tag. Without it, we will end up getting an error when we try to install the loader’s boot block.

partition> 0
Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm               256       9.99GB         20955102    

Enter partition id tag[usr]: boot
Enter partition permission flags[wm]: 
Enter new starting Sector[256]: 2048
Enter partition size[20954847b, 20956894e, 10231mb, 9gb, 0tb]: 1m

Since I am planning on using a separate disk for swap, I am using the rest of this disk for the root partition.

partition> 1
Part      Tag    Flag     First Sector        Size        Last Sector
  1 unassigned    wm                 0          0              0    

Enter partition id tag[usr]: root
Enter partition permission flags[wm]: 
Enter new starting Sector[4096]: 
Enter partition size[0b, 4095e, 0mb, 0gb, 0tb]: 20955102e
partition> print
Current partition table (unnamed):
Total disk sectors available: 20955069 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector        Size        Last Sector
  0       boot    wm              2048       1.00MB         4095    
  1       root    wm              4096       9.99GB         20955102    
  2 unassigned    wm                 0          0              0    
  3 unassigned    wm                 0          0              0    
  4 unassigned    wm                 0          0              0    
  5 unassigned    wm                 0          0              0    
  6 unassigned    wm                 0          0              0    
  8   reserved    wm          20955103       8.00MB         20971486    

When done, do not forget to run the label command:

partition> label
Ready to label disk, continue? yes

Format and Copy

Now that we have the partitions all set up, we can start using them.

ufsroot# newfs /dev/rdsk/c2t0d0s1
newfs: construct a new file system /dev/rdsk/c2t0d0s1: (y/n)? y
Warning: 34 sector(s) in last cylinder unallocated
/dev/rdsk/c2t0d0s1:     20951006 sectors in 3410 cylinders of 48 tracks, 128 sectors
        10230.0MB in 214 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
 20055584, 20154016, 20252448, 20350880, 20449312, 20547744, 20646176,
 20744608, 20843040, 20941472
ufsroot# mkdir /a
ufsroot# mount /dev/dsk/c2t0d0s1 /a

To get a consistent snapshot of the current (ZFS) root, I just make a new boot environment. I could take a recursive ZFS snapshot manually, but why do that if beadm can do it for me automatically? (The warning about missing menu.lst is a consequence of me using jeffix which includes the FreeBSD loader. It can be safely ignored.)

ufsroot# beadm create tmp
WARNING: menu.lst file /rpool/boot/menu.lst does not exist,
         generating a new menu.lst file
Created successfully
ufsroot# beadm mount tmp
Mounted successfully on: '/tmp/tmp._haOBb'

Now, we need to copy everything to the UFS file system. I use rsync, but all that matters is that the program you use preserves permissions and can cope with all the file types found on the root file system.

In addition to copying the mounted boot environment, we want to copy /export. (Recall that /export is outside of the boot environment, and therefore it will not appear under the temporary mount.)

ufsroot# rsync -a /tmp/tmp._haOBb/ /a/
ufsroot# rsync -a /export/ /a/export/

At this point, we are done with the temporary boot environment. Let’s at least unmount it. We could destroy it too, but it does not matter since we will eventually throw away the entire ZFS root disk anyway.

ufsroot# beadm umount tmp

Configuration Tweaks

Before we can use our shiny new UFS root file system to boot the system, we need to do a couple of cleanups.

First, we need to nuke the ZFS zpool.cache:

ufsroot# rm /a/etc/zfs/zpool.cache 

Second, we need to modify vfstab to indicate the root file system and comment out the zvol based swap device.

ufsroot# vim /a/etc/vfstab

So, we add this line:

/dev/dsk/c2t0d0s1       -       /               ufs     -       yes     -

and either remove or comment out this line:

#/dev/zvol/dsk/rpool/swap       -               -               swap    -       no      -

Third, we need to update the boot properties file (bootenv.rc) to tell the kernel where to find the root file system (i.e., the boot path) and the type of the root file system. To find the boot path, I like to use good ol’ ls:

ufsroot# ls -lh /dev/dsk/c2t0d0s1
lrwxrwxrwx 1 root root 46 Feb 21 17:43 /dev/dsk/c2t0d0s1 -> ../../devices/pci@0,0/pci1af4,2@4/blkdev@0,0:b

The symlink target is the boot path—well, after you strip the leading ../../devices.

So, we need to add these two lines to /a/boot/solaris/bootenv.rc:

setprop fstype 'ufs'
setprop bootpath '/pci@0,0/pci1af4,2@4/blkdev@0,0:b'

Ok! Now that everything is configured properly, we have to rebuild the boot archive. This should result in /a/platform/i86pc/boot_archive getting updated and a /a/platform/i86pc/boot_archive.hash getting created.

ufsroot# bootadm update-archive -R /a
ufsroot# ls -lh /a/platform/i86pc/
total 33M
drwxr-xr-x  4 root sys  512 Feb 21 18:38 amd64
drwxr-xr-x  6 root root 512 Feb 21 17:44 archive_cache
-rw-r--r--  1 root root 33M Feb 21 18:37 boot_archive
-rw-r--r--  1 root root  41 Feb 21 18:37 boot_archive.hash
drwxr-xr-x 10 root sys  512 Feb 21 18:00 kernel
-rwxr-xr-x  1 root sys  44K Feb 21 18:00 multiboot
drwxr-xr-x  4 root sys  512 Feb 21 17:44 ucode
drwxr-xr-x  2 root root 512 Feb 21 18:04 updates

Installing Boot Blocks

We have one step left! We need to install the boot block to the boot partition on the new disk. That is one simple-ish command. Note that the device we are giving it is the UFS partition device. Further note that installboot finds the boot partition automatically based on the “boot” partition id tag.

ufsroot# installboot -m /a/boot/pmbr /a/boot/gptzfsboot /dev/rdsk/c2t0d0s1
Updating master boot sector destroys existing boot managers (if any).
continue (y/n)? y
bootblock written for /dev/rdsk/c2t0d0s0, 214 sectors starting at 1 (abs 2049)
stage1 written to slice 0 sector 0 (abs 2048)
stage1 written to slice 1 sector 0 (abs 4096)
stage1 written to master boot sector

If you are using GRUB instead, you will want to install GRUB on the disk…somehow.

Booting from UFS

Now we can shut down, change the swap disk’s type to virtio and boot back up.

host# /usr/bin/qemu-kvm \
	-enable-kvm \
	-vnc 0.0.0.0:42 \
	-m 2048 \
	-no-hpet \
	-drive file=/dev/zvol/rdsk/storage/kvm/ufsroot/root,if=virtio,index=0 \
	-drive file=/dev/zvol/rdsk/storage/kvm/ufsroot/swap,if=virtio,index=1 \
	-net nic,vlan=0,name=net0,model=virtio,macaddr=2:8:20:a:46:54 \
	-net vnic,vlan=0,name=net0,ifname=ufsroot0,macaddr=2:8:20:a:46:54 \
	-smp 2 \
	-vga std \
	-serial stdio

Once the VM comes back up, we can add a swap device. The swap disk shows up for me as c3t0d0.

ufsroot# swap -a /dev/dsk/c3t0d0p0
operating system crash dump was previously disabled --
invoking dumpadm(1M) -d swap to select new dump device

We also need to add the description of the swap device to /etc/vfstab. So, fire up vim and add the following line:

/dev/dsk/c3t0d0p0       -       -               swap    -       no      -

That’s it! Now you can bask in the glory that is UFS root!

ufsroot$ df -h
Filesystem         Size  Used Avail Use% Mounted on
/dev/dsk/c2t0d0s1  9.9G  2.9G  6.9G  30% /
swap               1.5G 1016K  1.5G   1% /etc/svc/volatile
swap               1.5G  4.0K  1.5G   1% /tmp
swap               1.5G   52K  1.5G   1% /var/run
ufsroot$ zfs list
no datasets available

Caveat

Unfortunately, pkg assumes that the root file system is ZFS. So, updating certain packages (anything that would normally create a new boot environment) will likely fail.

Meili upgrades

A couple of months ago, I decided to update my almost two and a half year old laptop. Twice.

First, I got more RAM. This upped it to 12 GB. While still on the low side for a box which actually gets to see some heavy usage (compiling illumos takes a couple of hours and generates a couple of GB of binaries), it was better than the 4 GB I used for way too long.

Second, I decided to bite the bullet and replaced the 320 GB disk with a 256 GB SSD (Samsung 840 Pro). Sadly, in the process I had the pleasure of reinstalling the system — both Windows 7 and OpenIndiana. Overall, the installation was uneventful as my Windows partition has no user data and my OI storage is split into two pools (one for system and one for my data).

The nice thing about reinstalling OI was getting back to a stock OI setup. A while ago, I managed to play with software packaging a bit too much and before I knew it I was using a customized fork of OI that I had no intention of maintaining. Of course, I didn’t realize this until it was too late to rollback. Oops. (Specifically, I had a custom pkg build which was incompatible with all versions OI ever released.)

One of the painful things about my messed-up-OI install was that I was running a debug build of illumos. This made some things pretty slow. One such thing was boot. The ZFS-related pieces alone took about a minute to complete. The whole boot procedure took about 2.5 minutes. Currently, with a non-debug build and an SSD, my laptop goes from Grub prompt to gdm login in about 40 seconds. I realize that this is an apples to oranges comparison.

I knew SSDs were supposed to be blazing fast, but I resisted getting one for the longest time mostly due to reliability concerns. What changed my mind? I got to use a couple of SSDs in my workstation at work. I saw the performance and I figured that ZFS would take care of alerting me of any corruption. Since most of my work is version controlled, chances are that I wouldn’t lose anything. Lastly, SSDs got a fair amount of improvements over the past few years.

Isis

After several years of having a desktop at home that’s been unplugged and unused, I decided that it was time to make a home server to do some of my development on and just to keep files stored safely and redundantly. This was in August 2011. A lot has happened since then. First of all, I rebuilt the OpenIndiana (an Illumos-based distribution) setup with SmartOS (another Illumos-based distribution). Since I wrote most of this a long time ago, some of the information below is obsolete. I am sharing it anyway since others may find it useful. Toward the end of the post, I’ll go over the SmartOS rebuild. As you may have guessed, the hostname for this box ended up being Wikipedia article: Isis.

First of all, I should list my goals.

storage box
The obvious mix of digital photos, source code repositories, assorted documents, and email backups is easy enough to store. It does, however, become a nightmare if you need to keep track of where everything is (i.e., which of the two external disks, the public server (Odin), the laptop drives, or the desktop drives it lives on). Since none of it is explicitly public, it makes sense to keep it near home instead of on my public server, which sits in a data-center with a fairly slow uplink (1 Mbit/s burstable to 10 Mbit/s, billed at the 95th percentile).
dev box
I have a fast enough laptop (Thinkpad T520), but a beefier system that I can let compile large amounts of code is always nice. It will also let me run several virtual machines and zones comfortably — for development, system administration experiments, and other fun stuff.
router
I have an old Linksys WRT54G (rev. 3) that has served me well over the years. Sadly, it is getting a bit in my way — IPv6 tunneling over IPv4 is difficult, the 100 Mbit/s switch makes it harder to transfer files between computers, etc. If I am making a server that will always be on, it should effortlessly handle NAT’ing my Comcast internet connection. Having a full-fledged server doing the routing will also let me do better traffic shaping & filtering to make the connection feel better.

Now that you know what sort of goals I have, let’s take a closer look at the requirements for the hardware.

  1. reliable
  2. friendly to OpenIndiana and ZFS
  3. low-power
  4. fast
  5. virtualization assists (to run virtual machines at a reasonable speed)
  6. cheap
  7. quiet
  8. spacious (storage-wise)

While each one of them is pretty easy to accomplish, their combination is much harder to achieve. Also note that it is ordered from most to least important. As you will see, reliability dictated many of my choices.

The Shopping List

CPU
Intel Xeon E3-1230 Sandy Bridge 3.2GHz LGA 1155 80W Quad-Core Server Processor BX80623E31230
RAM (4)
Kingston ValueRAM 4GB 240-Pin DDR3 SDRAM DDR3 1333 ECC Unbuffered Server Memory Model KVR1333D3E9S/4G
Motherboard
SUPERMICRO MBD-X9SCL-O LGA 1155 Intel C202 Micro ATX Intel Xeon E3 Server Motherboard
Case
SUPERMICRO CSE-743T-500B Black Pedestal Server Case
Data Drives (3)
Seagate Barracuda Green ST2000DL003 2TB 5900 RPM SATA 6.0Gb/s 3.5"
System Drives (2)
Western Digital WD1600BEVT 160 GB 5400RPM SATA 8 MB 2.5-Inch Notebook Hard Drive
Additional NIC
Intel EXPI9301CT 10/100/1000Mbps PCI-Express Desktop Adapter Gigabit CT

To measure the power utilization, I got a P3 International P4400 Kill A Watt Electricity Usage Monitor. All my power usage numbers are based on watching the digital display.

Intel vs. AMD

I’ve read Constantin’s OpenSolaris ZFS Home Server Reference Design and I couldn’t help but agree that ECC should be a standard feature on all processors. Constantin pointed out that many more AMD processors support ECC and that as long as you got a motherboard that supported it as well you are set. I started looking around at AMD processors but my search was derailed by Joyent’s announcement that they ported KVM to Illumos — the core of OpenIndiana including the kernel. Unfortunately for AMD, this port supports only Intel CPUs. I switched gears and started looking at Intel CPUs.

In a way I wish I had a better reason for choosing Intel over AMD but that’s the truth. I didn’t want to wait for AMD’s processors to be supported by the KVM port.

So, why did I get a 3.2GHz Xeon (E3-1230)? I actually started by looking for motherboards. At first, I looked at desktop (read: cheap) motherboards. Sadly, none of the Intel-based boards I’ve seen supported ECC memory. Looking at server-class boards made the search for ECC support trivial. I was surprised to find a Supermicro motherboard (MBD-X9SCL-O) for $160. It supports up to 32 GB of ECC RAM (4x 8 GB DIMMs). Rather cheap, ECC memory, dual gigabit LAN (even though one of the LAN ports uses the Intel 82579 which was unsupported by OpenIndiana at the time), 6 SATA II ports — a nice board by any standard. This motherboard uses the LGA 1155 socket. That more or less means that I was “stuck” with getting a Sandy Bridge processor. :-D The E3-1230 is one of the slower E3 series processors, but it is still very fast compared to most of the other processors in the same price range. Additionally, it’s “only” an 80 Watt chip compared to many 95 or even 130 Watt chips from the previous series.

There you have it. The processor was more or less determined by the motherboard choice. Well, that’s being rather unfair. It just ended up being a good combination of processor and motherboard — a cheap server board and near-bottom-of-the-line processor that happens to be really sweet.

Now that I had a processor and a motherboard picked out, it was time to get RAM. In the past, I’ve had good luck with Kingston, and since they happened to be the cheapest ECC 4 GB DIMMs on NewEgg, I got four — for a grand total of 16 GB.

Case

I will let you in on a secret. I love hotswap drive bays. They just make your life easier — from being able to lift a case up high to put it on a shelf without having to lift all those heavy drives at the same time, to quickly replacing a dead drive without taking the whole system down.

I like my public server’s case (Supermicro CSE-743T-645B) but the 645 Watt power supply is really overkill for my needs. The four 5000 RPM fans on the midplane are pretty loud when they go full speed. I looked around, and I found a 500 Watt (80%+ efficiency) variant of the case (CSE-743T-500B). Still a beefy power supply, but closer to what one sees in high end desktops. With this case, I get eight 3.5" hot-swap bays, and three 5.25" external (non-hotswap) bays. This case shouldn’t be a limiting factor in any way.

I intended to move my DVD+RW drive from my desktop but that didn’t work out as well as I hoped.

Storage

At the time I was constructing Isis, I was experimenting with Wikipedia article: ZFS on OpenIndiana. I was more than impressed, and I wanted it to manage the storage on my home server. ZFS is more than just a filesystem; it is also a volume manager. In other words, you can give it multiple disks and tell it to put your data on them in several different ways that closely resemble RAID levels. It can stripe, mirror, or calculate one to three parities. Wikipedia has a nice article outlining ZFS’s features. Anyway, I strongly support ZFS’s attitude toward losing data — do everything to prevent it in the first place.

Hard drives are very interesting devices. Their reliability varies with so many variables (e.g., manufacturing defects, firmware bugs). In general, manufacturers give you fairly meaningless looking, yet impressive sounding numbers about their drives’ reliability. Richard Elling made a great blog post where he analyzed ZFS RAID space versus Mean-Time-To-Data-Loss, or MTTDL for short. (Later, he analyzed a different MTTDL model.)

The short version of the story is nicely summed up by this graph (taken from Richard’s blog):

While this scatter plot is for a specific model of a high-end server, it applies to storage in general. I like how the various types of redundancy clump up.

Anyway, how much do I care about my files? Most of my code lives in distributed version control systems, so losing one machine wouldn’t be a problem for those. The other files would be a bigger problem. While it wouldn’t be a complete end of the world if I lost all my photos, I’d rather not lose them. This goes back to the requirements list — I prefer reliable over spacious. That’s why I went with a 3-way mirror of 2 TB Seagate Barracuda Green drives. It gets me only 2 TB of usable space, but at the same time I should be able to keep my files forever. These are the data drives. I also got two 2.5" 160 GB Western Digital laptop drives to hold the system files — mirrored of course.

Around the same time I was discovering that the only sane way to keep your files was mirroring, I stumbled across Constantin’s RAID Greed post. He basically says the same thing — use 3-way mirror and your files will be happy.

Now, you might be asking… 2 TB, that’s not a lot of space. What if you outgrow it? My answer is simple: ZFS handles that for me. I can easily buy three more drives, plug them in, and add them as a second 3-way mirror and ZFS will happily stripe across the two mirrors. I considered buying 6 disks right away, but realized that it’ll probably be at least 6-9 months before I’ll have more than 2 TB of data. So, if I postpone the purchase of the 3 additional drives, I can save money. It turns out that a year and a half later, I’m still below 70% of the 2 TB.
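
If that day ever comes, growing the pool should be a single command along these lines (the device names here are made up for illustration):

# zpool add storage mirror c2t6d0 c2t7d0 c2t8d0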

Miscellaneous

I knew that one of the on-board LAN ports was not yet supported by Illumos, and so I threw a PCI-e Gigabit ethernet card into the shopping cart. I went with an Intel gigabit card. Illumos has since gained support for 82579-based NICs, but I’m lazy and so I’m still using the PCI-e NIC.

Base System

As the ordered components started showing up, I started assembling them. Thankfully, the CPU, RAM, motherboard, and case showed up at the same time preventing me from going crazy. The CPU came with a stock Intel heatsink.

The system started up fine. I went into the BIOS and did the usual new-system tweaking — make sure SATA ports are in AHCI mode, stagger the disk spinup to prevent unnecessary load peaks at boot, change the boot order to skip PXE, etc. While roaming around the menu options, I discovered that the motherboard can boot from iSCSI. Pretty neat, but useless for me on this system.

The BIOS has a menu screen that displays the fan speeds and the system and processor temperatures. With the fan on the heatsink and only one midplane fan connected, the system ran at about 1°C higher than room temperature and the CPU at about 7°C higher than room temperature.

OS Installation

Anyway, it was time to install OpenIndiana. I put my desktop’s DVD+RW in the case and then realized that the motherboard doesn’t have any IDE ports! Oh well, time to use a USB flash drive instead. At this point, I had only the 2 system drives. I connected one to the first SATA port and put a 151 development snapshot (text installer) on my only USB flash drive. The installer booted just fine. Installation was uneventful. The one potentially out of the ordinary thing I did was to not configure any networking. Instead, I set it up manually after the first boot, but more about that later.

With OI installed on one disk, it was time to set up the rpool mirror. I used Constantin’s Mirroring Your ZFS Root Pool as the general guide even though it is pretty straightforward — duplicate the partition (and slice) scheme on the second disk, add the new slice to the root pool, and then install grub on it. Everything worked out nicely.

# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h5m with 0 errors on Sun Sep 18 14:15:24 2011
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c2t0d0s0  ONLINE       0     0     0
            c2t1d0s0  ONLINE       0     0     0

errors: No known data errors

Networking

Since I wanted this box to act as a router, the network setup was a bit more…complicated (and quite possibly over-engineered). This is why I elected to do all the network setup by hand later, rather than having to “fix” whatever damage the installer did. :)

I powered it off, put in the extra ethernet card I got, and powered it back on. To my surprise, the new device didn’t show up in dladm. I remembered that I should trigger the device reconfiguration. A short touch /reconfigure && reboot later, dladm listed two physical NICs.

network diagram

As you can see, I decided that the routing should be done in a zone. This way, all the routing settings are nicely contained in a single place that does nothing else.

Setting up the virtual interfaces was pretty easy thanks to dladm. Setting the static IP on the global zone was equally trivial.

# dladm create-vlan -l e1000g0 -v 11 vlan11
# dladm create-vnic -l e1000g0 vlan0
# dladm create-vnic -l e1000g0 internal0
# dladm create-vnic -l e1000g1 isp0
# dladm create-etherstub zoneswitch0
# dladm create-vnic -l zoneswitch0 zone_router0

# ipadm create-if internal0
# ipadm create-addr -T static -a local=10.0.0.2/24 internal/v4

You might be wondering about the vlan11 interface that’s on a separate Wikipedia article: VLAN. The idea was to have my WRT54G continue serving as a wifi access point, but have all the traffic end up on VLAN #11. The router zone would then get to decide whether the user is worthy of LAN or Internet access. I never finished poking around the WRT54G to figure out how to have it dump everything on VLAN #11 instead of the default #0.

Router zone

OpenSolaris (and therefore all Illumos derivatives) has a wonderful feature called Wikipedia article: zones. It is essentially a super-lightweight virtualization mechanism. While talking to a couple of people on IRC, I decided that I, like them, would use a dedicated zone as a router.

Just before I set up the router zone, the storage disks arrived. The router zone ended up being stored on this array. See the storage section below for details about this storage pool.
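
For reference, the zone creation itself boils down to something like this (a sketch from memory rather than a transcript; the zone name and the exact set of VNICs handed to the zone are assumptions based on the diagram above):

# zonecfg -z router
zonecfg:router> create
zonecfg:router> set zonepath=/storage/zones/router
zonecfg:router> set ip-type=exclusive
zonecfg:router> add net
zonecfg:router:net> set physical=isp0
zonecfg:router:net> end
zonecfg:router> add net
zonecfg:router:net> set physical=zone_router0
zonecfg:router:net> end
zonecfg:router> commit
zonecfg:router> exit
# zoneadm -z router install
# zoneadm -z router boot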

After installing the zone via zonecfg and zoneadm, it was time to set up the routing and firewalling. First, install the ipfilter package (pkg install pkg:/network/ipfilter). Now, it is time to configure the NAT and filter rules.

NAT is easy to set up. Just plop a couple of lines into /etc/ipf/ipnat.conf:

map isp0 10.0.0.0/24 -> 0/32 proxy port ftp ftp/tcp
map isp0 10.0.0.0/24 -> 0/32 portmap tcp/udp auto
map isp0 10.0.0.0/24 -> 0/32

map isp0 10.11.0.0/24 -> 0/32 proxy port ftp ftp/tcp
map isp0 10.11.0.0/24 -> 0/32 portmap tcp/udp auto
map isp0 10.11.0.0/24 -> 0/32

map isp0 10.1.0.0/24 -> 0/32 proxy port ftp ftp/tcp
map isp0 10.1.0.0/24 -> 0/32 portmap tcp/udp auto
map isp0 10.1.0.0/24 -> 0/32

IPFilter is a bit trickier to set up. The rules need to handle more cases. In general, I tried to be a bit paranoid about the rules. For example, I drop all traffic for IP addresses that don’t belong on that interface (I should never see 10.0.0.0/24 addresses on my ISP interface). The only snag was in the defaults for the ipfilter Wikipedia article: SMF service. By default, it expects you to put your rules into SMF properties. I wanted to use the more old-school approach of using a config file. Thankfully, I quickly found a blog post which helped me with it.
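
To give you an idea of the flavor, the /etc/ipf/ipf.conf rules for the ISP-facing interface look roughly like this (an illustrative sketch, not my actual ruleset; the internal-facing interfaces get their own pass rules):

# drop spoofed traffic on the ISP side
block in log quick on isp0 from 10.0.0.0/8 to any
block in log quick on isp0 from 192.168.0.0/16 to any

# let outbound traffic through, keeping state so the replies make it back
pass out quick on isp0 proto tcp from any to any keep state
pass out quick on isp0 proto udp from any to any keep state
pass out quick on isp0 proto icmp from any to any keep state

# default deny for anything unsolicited arriving from the ISP
block in on isp0 all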

Storage, part 2

As the list of components implies, I wanted to make two arrays. I already mentioned the rpool mirror. Once the three 2 TB disks arrived, I hooked them up and created a 3-way mirror (zpool create storage mirror c2t3d0 c2t4d0 c2t5d0).

# zpool status storage
  pool: storage
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Sep 18 14:10:22 2011
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0

errors: No known data errors

Deduplication & Compression

I suspected that there would be enough files that would be stored several times — system binaries for zones, clones of source trees, etc. ZFS has built-in online Wikipedia article: deduplication. This stores each unique block only once. It’s easy enough to turn on: zfs set dedup=on storage.

Additionally, ZFS has transparent data (and metadata) compression featuring Wikipedia article: LZJB and gzip algorithms.

I enabled dedup and kept compression off. Dedup did take care of the duplicate binaries between all the zones. It even took care of duplicates in my photo stash. (At some point, I managed to end up with several diverged copies of my photo stash. One of the first things I did with Isis, was to dump all of them in the same place and start sorting them. Adobe Lightroom helped here quite a bit.)

After a while, I came to the realization that for most workloads I run, dedup was wasteful and I would be better off disabling dedup and enabling light compression (i.e., LZJB).
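
In ZFS terms, that switch is just two property changes. (Properties only affect newly written data; existing blocks stay deduped and uncompressed until they are rewritten.)

# zfs set dedup=off storage
# zfs set compression=lzjb storage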

$HOME

The installer puts the non-privileged user’s home directory onto the root pool. I did not want to keep it there since I now had the storage pool. After a bit of thought, I decided to zfs create storage/home and then transfer over the current home directory. I could have used cp(1) or rsync(1), but I thought it would be more fun (and a learning experience) to use zfs send and zfs recv. It went something like this:

# zfs snapshot rpool/export/home/jeffpc@snap
# zfs send rpool/export/home/jeffpc@snap | zfs recv storage/home/jeffpc

In theory, any modifications made to my home directory after the snapshot were lost, but since I was just ssh’d in, there wasn’t much that changed. (I am ok with losing the last update to .bash_history this one time.) The last thing that needed changing is /etc/auto_home — which tells the automounter where my $HOME really is. This is the resulting file after the change (without the copyright comment):

jeffpc	localhost:/storage/home/&
+auto_home

For good measure, I rebooted to make sure things would come up properly — they did.

Since the server is not intended just for me, I created the other user account with a home directory in storage/home/holly.

Zones

I intend to use zones extensively. To keep their files out of the way, I decided on storage/zones/$ZONE_NAME. I’ll talk more about the zones I set up later in the Zones section.

SMB

Local storage is great, but there is only so much you can do with it. Sooner or later, you will want to access it from a different computer. There are many different ways to “export” your data, but as one might expect, they all have their benefits and drawbacks. ZFS makes it really easy to export data via NFS and SMB. After a lot of thought, I decided that SMB would work a bit better. The major benefit of SMB over NFS is that it Just Works™ on all the major operating systems. That’s not to say that NFS does not work, but rather that it needs a bit more…convincing at times. This is especially true on Windows.

I followed the documentation for enabling SMB on Solaris 11. Yes, I know, OpenIndiana isn’t Solaris 11, but this aspect was the same. This ended with me enabling sharing of several datasets like this:

# zfs set sharesmb=name=photos storage/photos
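
The service side of it boils down to roughly the following (paraphrasing the docs from memory; the workgroup name is an assumption):

# svcadm enable -r smb/server
# smbadm join -w WORKGROUP

Plus the PAM tweak described in the docs (adding pam_smb_passwd.so.1 to the password stack) and re-setting each user’s password with passwd(1) so that SMB password hashes get generated.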

ACLs

The home directory shares are all done. The photos share, however, needs a bit more work. Specifically, it should be fully accessible to the users that are supposed to have access (i.e., jeffpc & holly). The easiest way I could find is to use ZFS ACLs.

First, I set the aclmode to passthrough (zfs set aclmode=passthrough storage). This will prevent a chmod(1) on a file or directory from blowing away all the ACEs (Access Control Entries). Then on the share directory, I added two ACL entries that allow everything.

# /usr/bin/ls -dV /share/photos
drwxr-xr-x   2 jeffpc   root           4 Sep 23 09:12 /share/photos
                 owner@:rwxp--aARWcCos:-------:allow
                 group@:r-x---a-R-c--s:-------:allow
              everyone@:r-x---a-R-c--s:-------:allow
# /usr/bin/chmod A+user:jeffpc:rwxpdDaARWcCos:fd:allow /share/photos
# /usr/bin/chmod A+user:holly:rwxpdDaARWcCos:fd:allow /share/photos
# /usr/bin/chmod A2- /share/photos # get rid of user
# /usr/bin/chmod A2- /share/photos # get rid of group
# /usr/bin/chmod A2- /share/photos # get rid of everyone
# /usr/bin/ls -dV /share/photos
drwx------+  2 jeffpc   root           4 Sep 23 09:12 /share/photos
            user:jeffpc:rwxpdDaARWcCos:fd-----:allow
             user:holly:rwxpdDaARWcCos:fd-----:allow

The first two chmod commands prepend two ACEs. The next three remove ACE number 2 (the third entry). Since the directory started off with three ACEs (representing the standard Unix permissions), the second set of chmods removes those, leaving only the two user ACEs behind.

Clients

That was easy! In case you are wondering, the Solaris/Illumos SMB service does not allow guest access. You must log in to use any of the shares.

Anyway, here’s the end result:

Pretty neat, eh?

Zones

Aside from the router zone, there were a number of other zones. Most of them were for Illumos and OpenIndiana development.

I don’t remember much of the details since this predates the SmartOS conversion.

Power

When I first measured the system, it was drawing about 40-45 Watts while idle. Now, I have Isis along with the WRT54G and a gigabit switch on a UPS that tells me that I’m using about 60 Watts when idle. The power draw can spike up quite a bit if I put load on the 4 Xeon cores and give the disks something to do. (After all, it is an 80 Watt CPU!) While this is by no means super low-power, it is low enough, and at the same time I have the capability to actually get work done instead of waiting for hours for something to compile.

SmartOS

As I already mentioned, I ended up rebuilding the system with SmartOS. SmartOS is not a general purpose distro. Rather, it strives to be a hypervisor with utilities that make guest management trivial. Guests can either be zones, or KVM-powered virtual machines. Here are the major changes from the OpenIndiana setup.

Storage — pools

SmartOS is one of those distros you do not install. It always netboots, or boots from a USB stick or a CD. As a result, you do not need a system drive. This immediately obsoleted the two laptop drives. Conveniently, around the same time, Holly’s laptop suffered a disk failure, so Isis got to donate one of the unused 2.5" system disks.

SmartOS calls its data pool “zones”, which took a little bit of getting used to. There’s a way to import other pools, but I wanted to keep the settings as vanilla as possible.

At some point, I threw in an Intel 160 GB SSD to use for L2ARC and Wikipedia article: ZIL.
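
After slicing up the SSD, wiring it into the pool is one command per role, roughly (the slice names match the status output below):

# zpool add zones log c1t1d0s0
# zpool add zones cache c1t1d0s1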

Here’s what the pool looks like:

# zpool status
  pool: zones
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0 in 2h59m with 0 errors on Sun Jan 13 08:37:37 2013
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
        logs
          c1t1d0s0  ONLINE       0     0     0
        cache
          c1t1d0s1  ONLINE       0     0     0

errors: No known data errors

In case you are wondering about the feature-related status message, I created the zones pool way back when Illumos (and therefore SmartOS) had only two ZFS features. Since then, Illumos added one and Joyent added one to SmartOS.

# zpool get all zones | /usr/xpg4/bin/grep -E '(PROP|feature)'
NAME   PROPERTY                   VALUE                      SOURCE
zones  feature@async_destroy      enabled                    local
zones  feature@empty_bpobj        active                     local
zones  feature@lz4_compress       disabled                   local
zones  feature@filesystem_limits  disabled                   local

I haven’t experimented with either one enough to enable them on a production system I rely on so much.

Storage — deduplication & compression

The rebuild gave me a chance to start with a clean slate. Specifically, it gave me a chance to get rid of the dedup table. (The dedup table, DDT, is built as writes happen to the filesystem with dedup enabled.) Data deduplication relies on some form of data structure (the most trivial one is a hash table) that maps the hash of the data to the data. In ZFS, the DDT maps the Wikipedia article: SHA-256 of the block to the block address.

The reason I stopped using dedup on my systems was pretty straightforward (and not specific to ZFS). Every entry in the DDT has an overhead. So, ideally, every entry in the DDT is referenced at least twice. If a block is referenced only once, then one would be better off without the block taking up an entry in the DDT. Additionally, every time a reference is taken or released, the DDT needs to be updated. This causes very nasty random I/O under which spinning disks want to weep. It turns out that a “normal” user will have mostly unique data, rendering deduplication impractical.
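
Incidentally, ZFS will tell you whether dedup is paying for itself: the dedupratio pool property is the ratio of data referenced to unique data actually stored, so a value close to 1.00x means the data is mostly unique and the DDT is pure overhead. (zdb -DD <pool> prints a detailed DDT histogram if you want the gory details.)

# zpool get dedupratio zones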

That’s why I stopped using dedup. Instead, I became convinced that most of the time light compression is the way to go. Lightly compressing the data will result in I/O bandwidth savings as well as capacity savings with little overhead given today’s processor speeds versus I/O latencies. Since I haven’t had time to experiment with the recently integrated LZ4, I still use LZJB.
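
Similarly, the compressratio property shows what the light compression buys you in practice:

# zfs get compressratio zones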

Benchmark Assumptions

Today I came across a blog post about Running PostgreSQL on Compression-enabled ZFS. I found the article because (1) I am a fan of Wikipedia article: ZFS, and (2) transparent storage compression interests me. (Maybe I’ll talk about the latter in the future.)

Whoever ran the benchmark decided to compare ZFS with Wikipedia article: lzjb and ZFS with gzip against ext3. Their analysis states that ZFS-gzip is faster than ZFS-lzjb, which is faster than ext3. They admit that the benchmark is I/O bound. Then they state that compression effectively speeds up the disk I/O by making every byte transferred contain more information. The analysis goes down the drain right after that:

“While doing background research for this blog post we also got a chance to investigate some of the other features besides compression that differentiate ZFS from older file system architectures like ext3. One of the biggest differences is ZFS’s approach to scheduling disk IOs which employs explicit IO priorities, IOP reordering, and deadline scheduling in order to avoid flooding the request queues of disk controllers with pending requests.”

Anyone who’s benchmarked a system should have a red flag going off after reading those sentences. My reaction was something along the lines: “What?! You know that there are at least three major differences between ZFS and ext3 in addition to compression and you still try to draw conclusions about compression effectiveness by comparing ZFS with compression against ext3?!”

All they had to do to make their analysis so much more interesting and keep me quiet was to include another set of numbers — ZFS without compression. That way, one can compare ext3 with ZFS-uncompressed to see how much difference the radically different filesystem design makes. Then one could compare ZFS-uncompressed with the lzjb and gzip data to see if compression helps. Based on the data presented, we have no idea if compression helps — we just know that compression and ZFS outperform ext3. What if ZFS without compression is 5x faster than ext3? Then using gzip (~4x faster than ext3) is actually not the fastest.

To be fair, knowing how modern disk drives behave, chances are that compressed ZFS is faster than uncompressed ZFS. Since CPU cycles are so plentiful these days, all my systems have lzjb compression enabled everywhere. I do this mostly to conserve space, but also in hopes of transferring less data to disk. Yes, this is exactly what their benchmark attempts to show. (I haven’t had a chance to experiment with the new-ish lz4 compression algorithm in ZFS.) My point here is solely about benchmark analysis and unfounded (or at least unstated) assumptions found in just about every benchmark out there.

Adobe Lightroom 4

In my previous post I mentioned that I have more or less settled on using Adobe Lightroom 4 for my photo management and editing needs. After getting a comment from someone about his trouble with image management software, I decided to write a blahg post just about why I decided to go with Lightroom.

Adobe
Some people love Adobe while some hate it. Regardless of your feelings for the company, you’ve got to agree, they have a lot of experience when it comes to making photo editing and management software. If you want to get serious with digital photo management and editing, they probably got it right. It turns out that a fair amount of professional photographers use Lightroom during their workflow. So, if this program is good enough for people that rely on it for their livelihood, it is probably good enough for me. :)
catalog
I talked about this a bit in my previous post already, but I will repeat it here. Lightroom lets me do most things the same way I did before it except for the parts I didn’t like. So, I get to keep my <year>/<event> directory structure that I like, but all the photos are indexed and searchable. The catalog stores all the metadata and lets me quickly see thumbnails of only the photos I want to see.
tagging
This is a very common feature of photo management software, but I am including it here since I did not have anything like it with my previous workflow. One can associate arbitrary text strings with a photo and then filter based on that.
geo-tagging
One special metadata field that Lightroom handles is the GPS location. It also lets you select photos based on location on a nice map (pulled from Google Maps).
captions
Every photo can have a title and a caption. I haven’t experimented with this feature all that much but it is rather self explanatory.
ratings
Lightroom offers several ways to “rate” photographs. There is a very straightforward 5-star scale (you can set zero to five stars), and each photo can also have a “pick” flag or a “reject” flag. After I first import photos from a new event, I display all the thumbnails to get a quick glance at what I shot. Then I view every photo individually and mark every photo that is utterly useless (e.g., blurry) as a “reject” and every photo that seems promising as a “pick”. Then, I look at just the rejects and delete them completely. Now, I just have ok photos (un-flagged) and good photos (flagged as “pick”). I use the stars to rate post-processed photos from mediocre (1-star) to ones I am proud of (5 stars).
DNG
Lightroom supports a variety of image formats — JPGs, TIFFs, PSDs, even various camera raw formats. I used to occasionally shoot in raw (NEFs) but viewing and editing them was a pain. With Lightroom, I can use them just like any other file format. They just work. Interestingly, I no longer store NEFs. Instead of importing them and storing them as is, I let Lightroom convert them to Wikipedia article: DNGs. I won’t go into NEF vs. DNG, but what tipped the scale in DNG’s favor in my case were sidecar files.
Wikipedia article: sidecars
I do not know what most photo management applications do, but Lightroom stores all the metadata changes in its (SQLite) database. Additionally, it lets you tell it to store all metadata changes along with the original images as well. For every NEF file, it creates a sidecar Wikipedia article: XMP file. That is, next to foo.nef it will create foo.xmp which contains all the metadata changes. JPGs store the metadata in the EXIF tags. DNGs also store the metadata internally. So, if you want raw files because of their quality, you can either use the camera manufacturer’s native raw format and have to keep the XMP files around, or convert them to DNG (which is lossless conversion by default) and then not worry about sidecars.
misc
There are many other cool features that Lightroom offers — from being able to quickly batch process hundreds of photos to being able to generate web galleries and upload them via sftp with a single click. The list of features is way too long, and I am certain that I haven’t found them all yet.

So there you have it. That is what I do with Lightroom. Other software packages had various deficiencies. As an added benefit, with Lightroom I get to use more open formats (DNG & XMP) than without it.

As a technical side-note, all my photos are on a ZFS dataset that I access via CIFS. Yes, compression is enabled (Wikipedia article: lzjb).

Timesavers: ZFS & BE

I’ve mentioned Boot Environments before. Well, earlier this week BEs and ZFS snapshots saved me a bunch of time. Here’s what happened.

I was in the middle of installing some package (pkg install foo) when my laptop locked up. I had to power cycle it the hard way. When it booted back up, I retried the install, but pkg complained that some state file was corrupted and it didn’t want to do anything. Uh oh. I’ve had a similar issue happen to me on Debian with aptitude, so I knew that the hard way of fixing this issue was going to take more time than I’d like to dedicate to it (read: none). Thankfully, I use OpenIndiana which has ZFS and BEs.

  1. Reboot into a BE from a while ago (openindiana-3). The latest BE (openindiana-4) was created by pkg about a month ago as a clone of openindiana-3 during a major upgrade.
  2. Figure out which automatic ZFS snapshot I want to revert to. A matter of running zfs list -t all rpool/ROOT/openindiana-4 | tail -5 and picking the latest snapshot which I believe is from before pkg messed it all up. I ended up going an hour back just to make sure.
  3. Revert the BE. beadm rollback openindiana-4@zfs-auto-snap_hourly-2011-10-25-19h11
  4. Reboot back into openindiana-4.

After the final reboot, everything worked just fine. (Since the home directories are on a different dataset, they were left untouched.)

Total downtime: 5 minutes
Ease of repair: trivial

Your Turn

Do you have a corrupt package manager war story? Did you just restore from backup? Let me know in a comment.

OpenIndiana: The What and Why

You have seen me publish two posts about OpenIndiana, but neither of them really says what it is and why you should use it.

The What

OpenIndiana started off as a fork of OpenSolaris. At first, its aim was to provide an alternative to Oracle’s soon-to-be-released Solaris 11, but lately its aim shifted to “an enterprise-quality OS alternative to Linux.”

OpenIndiana is much like a distro in the Linux world. It relies on the Illumos project for the kernel and basic userspace utilities (the shell, etc.). In September 2010, Illumos forked the OpenSolaris kernel and utilities, and OpenIndiana forked the surrounding userspace (the build system for all the packages that make the system usable).

The Why

The technology is the reason I started using OI. Here are some of the features that either drew me in to try OI, or made me stay.

Crossbow
Crossbow was the name of the project that consisted of a major revamp of the network stack. With this revamp (which was available in OpenSolaris), you can create virtual network interfaces, vlans, bridges, switches (called etherstubs), as well as aggregate links with simple commands — quickly, and all the configuration is persistent. You can dedicate both physical and virtual links to zones (see below) to create entire network topologies within one computer. (see dladm(1M) and ipadm(1M))
Zones
These days, everyone is happily setting up virtual machines whenever they need an environment they can tweak without affecting the stability of other services. Solaris zones are a great virtualization technology. They allow you to set up multiple Solaris instances (called zones) that have a separate root filesystem (much like chroot). Unlike chrooted environments, having root access in a zone does not give you unrestricted access to the kernel. Zones combined with Crossbow are a great way to consolidate separate systems onto a single Solaris host. (I am currently writing a post about using zones and crossbow on a home server/router.)
Boot Environments (BE) & IPS
Long story short, if the package manager (IPS) detects that a potentially major change is going to occur during an update (e.g., a driver or kernel upgrade), it clones the current root filesystem (easy to do thanks to ZFS) and applies the updates there. It then adds a menu entry to grub to boot into this new environment. The current environment is unchanged. At your leisure, you just reboot into the new environment. If everything works — great. If, however, things break, you can just reboot into the previous BE, mount the new BE’s root, and fix things up. This means that the only downtime the system sees is the reboot or two. (See the short beadm example after this list.)
ZFS
There’s plenty of ZFS discussion elsewhere. My favorite features about it are (in no particular order): snapshots, deduplication, integrated volume management, and checksumming.
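
To make the BE workflow above concrete, the recovery path is only a handful of commands (a sketch; the BE names are made up):

# beadm list
# beadm mount openindiana-1 /mnt
# beadm activate openindiana-1

beadm list shows which boot environments exist and which one is active, beadm mount lets you poke at (and fix up) the new BE’s root from the old one, and beadm activate picks the BE to boot into on the next reboot.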

So there you have it. Sure, many of Solaris’s features are available in some shape or form on Linux, but they tend to be either horribly crippled, or if you are “lucky,” lacking a sane management interface.

If you want to see what all this fuss is about, I suggest you grab the Live DVD (or Live USB) image on the download page and give it a try.
