# Josef “Jeff” Sipek

As I mentioned previously, I decided to do some hardware hacking and as a result I ended up with a Raspberry Pi B+. After playing with Raspbian for a bit, I started trying to read up on how the Pi boots. That’s what today’s post is about.

### Standalone UART

While searching for information about how the Pi boots, I stumbled across a git repo with a standalone UART program. Conveniently, the repo includes ELF files as well as raw binary images. (This actually made me go “ewww” at first.)

Before even running it, I looked at the code and saw that it prints out 0x12345678 followed by the PC of one of the first instructions of the program. Pretty minimal, but quite enough.

### Boot Process

Just about everyone on the internet (and their grandmothers!) knows that the Pi has a multi stage boot. First of all, it is the GPU that starts executing first. The ARM core just sits there waiting for a reset.

The Pi requires an SD card to boot — on it must be a FAT formatted partition with a couple of files that you can download from the Raspberry Pi Foundation.

Here’s the short version of what happens:

1. Some sort of baked-in firmware loads bootcode.bin from the SD card and executes it.
2. bootcode.bin does a bit of setup, and then loads and executes start.elf.
3. start.elf does a whole lot of setup.

1. Reads config.txt which is a text file with a bunch of options.
2. Splits the RAM between the GPU and CPU (with the help of fixup.dat).
3. Loads the kernel and the ramdisk.
4. Loads the device tree file or sets up ATAGs.
5. Resets the ARM core. The ARM core then begins executing the kernel.

This sounds pretty nice, right? For the most part, it is. But as they say, the devil’s in the details.

### Booting ELF Files

It doesn’t take a lot of reading to figure out that start.elf can deal with kernel files in ELF format. This made me quite happy since I’ve written an ELF loader before for HVF and while it wasn’t difficult, I didn’t want to go through the exercise just to get something booting.

So, I copied uart02.elf to the SD card, and made a minimal config file:

```kernel=uart02.elf
```

A power-cycle later, I saw the two hex numbers. (Ok, this is a bit of a distortion of reality. It took far too many reboots because I was playing with multiple things at the same time — including U-Boot which turned out to be a total waste of my time.)

The PC was not what I expected it to be. It was 0x8080 instead of 0x800c. After a lot of trial and error, I realized that it just so happened that the .text section is 0x74 bytes into the ELF file. Then I had a horrifying thought… start.elf understands ELF files enough to jump to the right instruction but it does nothing to make the contents of the ELF file properly laid out in memory. It just reads the file into memory starting at address 0x8000, gets the start address from the ELF header, converts it into a file offset and jumps to that location in memory. This is wrong.

Sure enough, booting uart02.bin printed the right number. So much for avoiding writing my own ELF loader.

### Ramdisk

Once I had a reliable way to get code to execute and communicate via the serial port, I started playing with ramdisks. The code I was booting parsed the ATAGs and looked for ATAG_INITRD2 which describes the location of the ramdisk.

So, I extended my config to include the ramfsfile parameter to specify the filename of the ramdisk:

```kernel=foo
ramfsfile=initrd
```

Reboot, aaaand…the code panicked because it couldn’t find ATAG_INITRD2. It was weird, the file was present, no misspellings, etc. After a while of rebooting and trying different things, I decide to use the initramfs option and just pick some arbitrary high address for the ramdisk to be loaded at. The config changed to:

```kernel=foo
initramfs initrd 0x800000
```

Another reboot later, I was looking at a dump of the ATAG_INITRD2. Everything worked as expected! So, it turns out that the boot loader on the Pi is not capable of picking an address for the initial ramdisk by itself.

### Command line

Finally, I just had to experiment with passing a command line string. I created a file called cmdline.txt on the SD card with some text. A reboot later, I saw that the dump of the ATAG_CMDLINE had some cruft prepended. The entire value of the ATAG looked like (with some spaces replaced by line breaks):

```dma.dmachans=0x7f35 bcm2708_fb.fbwidth=656 bcm2708_fb.fbheight=416
bcm2708.boardrev=0x10 bcm2708.serial=0xbd225074
bcm2708.disk_led_gpio=47 bcm2708.disk_led_active_low=0
sdhci-bcm2708.emmc_clock_freq=250000000 vc_mem.mem_base=0x1ec00000
vc_mem.mem_size=0x20000000  foo
```

This isn’t exactly the worst thing, but it does mean that the option parsing code has to handle cruft prepended to what it would expect the command line to look. I wish that these settings were passed via a separate ATAG.

## x2APIC, IOMMU, Illumos

About a week ago, I hinted at a boot hang I was debugging. I’ve made some progress with it, and along the way I found some interesting things about which I’ll blog over the next few days. Today, I’m going to talk about the  APIC, xAPIC, and  x2APIC and how they’re handled in Illumos.

### APIC, xAPIC, x2APIC

I strongly suggest you become at least a little familiar with APIC architecture before reading on. The Wikipedia articles above are a good start.

First things first, we need some definitions. APIC can refer to either the architecture or to very old (pre-Pentium 4) implementation. Since I’m working with a Sandy Bridge, I’m going to use APIC to refer to the architecture and completely ignore that these chips existed. Everything they do is a subset of xAPIC. xAPIC is an extension to APIC. xAPIC chips started showed up in NetBurst architecture Intel CPUs (i.e., Pentium 4). xAPIC included some goodies such as upping the limit on the number of CPUs to 256 (from 16). x2APIC is an extension to xAPIC. x2APIC chips started appearing around the same time Sandy Bridge systems started showing up. It is a major update to how interrupts are handled, but as with many things in the PC industry the x2APIC is fully backwards compatible with xAPICs. x2APIC includes some goodies such as upping the limit on the number of CPUs to ${2}^{32}$.

Regardless of which exact flavor you happen to use, you will find two components: the local APIC and I/O APIC. Each processor gets their own local APIC and I/O buses get I/O APICs. I/O APICs can service more than one device, and in fact many systems have only one I/O APIC.

The xAPIC uses  MMIO to program the local and I/O APICs.

x2APIC has two mode of operation. First, there is the xAPIC compatibility mode which makes the x2APIC behave just like an xAPIC. This mode doesn’t give you all the new bells and whistles. Second, there is the new x2APIC mode. In this mode, the APIC is programmed using  MSRs.

One interesting fact about x2APIC is that it requires an  iommu. My Sandy Bridge laptop has an Intel iommu as part of the VT-d feature.

### Illumos /etc/mach

x2APIC in Illumos has two APIC drivers. First, there is pcplusmp which knows how to handle APIC and xAPIC. Second, there is apix which targets x2APIC, but knows how to operate it in both modes. On boot, the kernel consults /etc/mach to get a list of machine specific modules to try to load. Currently, the default contents (trimmed for display here) are:

```#
# CAUTION!  The order of modules specified here is very important. If the
# order is not correct it can result in unexpected system behavior. The
# loading of modules is in the reverse order specified here (i.e. the last
#
pcplusmp
apix
xpv_psm
```

Since I’m not running Xen, xpv_psm will fail to load, and apix gets its chance to load.

### pcplusmp + apix Code Sharing

The code in these two modules can be summarized with a word: mess. Following what happens when would be enough of an adventure. The code for the two modules lives in four directories: usr/src/uts/i86pc/io, usr/src/uts/i86pc/io/psm, usr/src/uts/i86pc/io/pcplusmp, and usr/src/uts/i86pc/io/apix. But the sharing isn’t as straight forward as one would hope.

 Directory pcplusmp apix i86pc/io mp_platform_common.c, mp_platform_misc.c, hpet_acpi.c mp_platform_common.c, hpet_acpi.c i86pc/io/psm psm_common.c psm_common.c i86pc/io/pcplusmp * apic_regops.c, apic_common.c, apic_timer.c i86pc/io/apix — *

This is of course not clear at all when you look at the code. (Reality is a bit messier because of the i86xpv platform which uses some of the i86pc source.)

### apix_probe

When the apix module gets loaded, its probe function (apix_probe) is called. This is the place where the module decides if the hardware is worthy. Specifically, if it finds that the CPU reports x2APIC support via  cpuid, it goes on to call the common APIC probe code (apic_probe_common). Unless that fails, the system will use the apix module — even if there is no iommu and therefore the x2APIC needs to operate in xAPIC mode.

What mode are you using? Easy, just check the apic_mode global in the kernel:

```# echo apic_mode::whatis | mdb -k
fffffffffbd0ee4c is apic_mode, in apix's data segment
# echo apic_mode::print | mdb -k
0x2
```

2 (LOCAL_APIC) indicates xAPIC mode, while 3 (LOCAL_X2APIC) indicates x2APIC mode.

Because this part is as clear as mud, I made a table that tells you what module and mode to expect given your hardware, what CPUID says, and the presence and state of the iommu.

 APIC hw CPUID IOMMU IOMMU state Module apic_mode xAPIC off — — pcplusmp LOCAL_APIC x2APIC off — — pcplusmp LOCAL_APIC x2APIC on absent — apix LOCAL_APIC x2APIC on present off apix LOCAL_APIC x2APIC on present on apix LOCAL_X2APIC

### Defaults

I’ve never seen apic_mode equal to LOCAL_X2APIC in the wild. This was very puzzling. Yesterday, I discovered why. As I mentioned earlier, in order for the x2APIC to operate in x2APIC mode an iommu is required. Long story short, the default config that Illumos ships disables iommus on boot. Specifically:

```\$ cat /platform/i86pc/kernel/drv/rootnex.conf | grep -v '^\(#.*\|\)\$'
immu-enable="false";
```

In order to get LOCAL_X2APIC mode, you need to set:

```immu-enable="true";
immu-intrmap-enable="true";
```

Once you put those into the config file, update you boot archive and reboot. You should be set… except the iommu support in Illumos is… shall we say… poor.

(I should point out that it is possible for the BIOS to enable x2APIC mode before handing control off to the OS. This is pretty rare unless you have a really big x86 system.)

#### 1394

It would seem that the hci1394 driver doesn’t quite know how to deal with an iommu “messing” with its I/Os and its interrupt service routine shuts down the driver. (On a debug build it throws is ASSERT(0) for good measure.) I just disabled 1394 in the BIOS since I don’t have any Firewire devices handy and therefore no use for the port at the moment.

#### immu-enable Details

In case you want to know how iommu initialization affects the apix initialization…

During boot, immu_init gets called to initialize iommus. If the config option (immu-enable) is not true, the function just returns instead of calling immu_subsystems_setup which calls immu_intrmap_setup which sets psm_vt_ops to non-NULL value.

Later on, when apix is loaded and is initializing itself in apix_picinit, it calls apic_intrmap_init. This function does nothing if psm_vt_ops are NULL.

### The Hang

I might as well tell you a bit about my progress on tracking down the hang. It happens only if I’m using the apix module and I allow deep C states in the idle thread (technically, it could also be an mwait related issue since I cannot disable just mwait without disabling deep C states). It does not matter if the apic_mode is LOCAL_APIC or LOCAL_X2APIC.

## Rebooter

I briefly mentioned that I was debugging a boot hang. Since the hang does not happen every time I try to boot, it may take a couple of reboots to get the kernel to hang. Doing this manually is tedious. Thankfully it can be scripted. Therefore, I made a simple script and a SMF manifest that runs the script at the end of boot. If the system boots fine, my script reboots it. If the system hangs mid-boot, well my script never executes leaving the system in a hung state. Then, I can break into the kernel debugger (mdb) and investigate.

I’m sharing the two here mostly for my benefit… in case one day in the future I decide that I need my system automatically rebooted over and over again.

The script is pretty simple. Hopefully, 60 seconds is long enough to log in and disable the service if necessary. (In reality, I setup a separate boot environment that’s the default choice in Grub. I can just select my normal boot environment and get back to non-timebomb system.)

```#!/bin/sh

sleep 60

reboot -p
```

The tricky part is of course in the manifest. Not because it is hard, but because XML is … verbose.

```<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='rebooter'>
<service name='site/rebooter' type='service' version='1'>
<dependency name='booted'
grouping='require_all'
restart_on='none'
type='service'>
<service_fmri
value='svc:/milestone/multi-user-server:default'/>
</dependency>

<property_group name="startd" type="framework">
<propval name="duration" type="astring" value="child"/>
<propval name="ignore_error" type="astring"
value="core,signal"/>
</property_group>

<instance name='system' enabled='true'>
<exec_method
type='method'
name='start'
exec='/home/jeffpc/illumos/rebooter/script.sh'
timeout_seconds='0' />

<exec_method
type='method'
name='stop'
exec=':true'
timeout_seconds='0' />
</instance>

<stability value='Unstable' />
</service>
</service_bundle>
```

That’s all, carry on what you were doing. :)

Recently, I ended up debugging a boot hang. (I’m still working on it, so I don’t have a resolution to it yet.) The hang seems to occur during the mp startup. That is, when the boot CPU tries to online all the other CPUs on the system. As a result, I spent a fair amount of time reading the code and poking around with mdb. Given the effort I put in, I decided to document my understanding of how CPUs get brought online during boot in Illumos. In this post, I’ll talk about the CPU pause threads.

Each CPU has a special thread — the pause thread. It is a very high priority thread that’s supposed to preempt everything on the CPU. If all CPUs are executing this high-priority thread, then we know for fact that nothing can possibly be dereferencing the CPU structures’ (cpu_t) pointers. Why is this useful? Here’s a comment from right above cpu_pause — the function pause threads execute:

```/*
* This routine is called to place the CPUs in a safe place so that
* one of them can be taken off line or placed on line.  What we are
* trying to do here is prevent a thread from traversing the list
* of active CPUs while we are changing it or from getting placed on
* the run queue of a CPU that has just gone off line.  We do this by
* creating a thread with the highest possible prio for each CPU and
* having it call this routine.  The advantage of this method is that
* we can eliminate all checks for CPU_ACTIVE in the disp routines.
* This makes disp faster at the expense of making p_online() slower
* which is a good trade off.
*/
```

The pause thread is pointed to by the CPU structure’s cpu_pause_thread member. A new CPU does not have a pause thread until after it has been added to the list of existing CPUs. (cpu_pause_alloc does the actual allocation.)

CPU pausing is pretty strange. First of all, let’s call the CPU requesting other CPUs to pause the controlling CPU and all online CPUs that will pause the pausing CPUs. (The controlling CPU does not pause itself.) Second, there are two global structures: (1) a global array called safe_list which contains a 8-bit integer for each possible CPU where each element holds a value ranging from 0 to 4 (PAUSE_*) denoting the state of that CPU’s pause thread, and (2) cpu_pause_info which contains some additional goodies used for synchronization.

### Pausing

To pause CPUs, the controlling CPU calls pause_cpus (which uses cpu_pause_start), where it iterates over all the pausing CPUs setting their safe_list entries to PAUSE_IDLE and queueing up (using setbackdq) their pause threads.

Now, just because the pause threads got queued doesn’t mean that they’ll get to execute immediately. That is why the controlling CPU then waits for each of the pause threads to up a semaphore in the cpu_pause_info structure. Once all the pause threads have upped the semaphore, the controlling CPU sets the cp_go flag to let the pause threads know that it’s time for them to go to sleep. Then the controlling CPU waits for each pause thread to signal (via the safe_list) that they have disabled just about all interrupts and that they are spinning (mach_cpu_pause). At this point, pause_cpus knows that all online CPUs are in a safe place.

### Starting

Starting the CPUs back up is pretty easy. The controlling CPU just needs to set all the CPU’s safe_list to a PAUSE_IDLE. That will cause the pausing CPUs to break out of their spin-loop. Once out of the spin loop, interrupts are re-enabled and a CPU control relinquished (via swtch). The controlling CPU does some cleanup of its own, but that’s all that is to it.

### Synchronization

Why not use a mutex or semaphore for everything? The problem lies in the fact that we are in a really fragile state. We don’t want to lose the CPU because we blocked on a semaphore. That’s why this code uses a custom synchronization primitives.

## iSCSI boot - Success

In my previous post, I documented some steps necessary to get OpenIndiana to boot from iSCSI.

I finally managed to get it to work cleanly. So, here are the remaining details necessary to boot your OI box from iSCSI.

### Installation

First, boot from one of the OI installation media. I used a USB flash drive. Then, before starting the installer, drop into a shell and connect to the target.

```# iscsiadm add discovery-address 172.16.0.1
# iscsiadm modify discovery -t enable
```

At this point, you should have all the LUs accessible:

```# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
0. c5t600144F000000000000052A4B4CE0002d0 <SUN-COMSTAR-1.0 cyl 13052 alt 2 hd 255 sec 63>
/scsi_vhci/disk@g600144f000000000000052a4b4ce0002
Specify disk (enter its number):
```

Exit the shell and start the installer.

Now, the tricky part… When you get to the network configuration page, you must select the “None” option. Selecting “Automatically” will cause nwam to try to start on boot and it’ll step onto the already configured network interface. That’s it. Finish installation normally. Once you’re ready to reboot, either configure your network card or use iPXE as I’ve shared before.

### e1000g

For the curious, here’s what the iSCSI booted (from the e1000g NIC) system looks like:

```# svcs network/physical
STATE          STIME    FMRI
disabled       17:13:10 svc:/network/physical:nwam
online         17:13:15 svc:/network/physical:default
LINK        CLASS     MTU    STATE    BRIDGE     OVER
e1000g0     phys      1500   up       --         --
e1000g0/?         static   ok           172.16.0.179/24
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
```

### nge

Does switching back to the on-board nge NICs work now? No. We still get a lovely panic:

```WARNING: Cannot plumb network device 19

Warning - stack not written to the dump buffer
fffffffffbc71ae0 genunix:vfs_mountroot+75 ()
fffffffffbc71b10 genunix:main+136 ()
fffffffffbc71b20 unix:_locore_start+90 ()
```

## iSCSI boot

I decided a couple of days ago to try to see if OpenIndiana would still fail to boot from iSCSI like it did about two years ago. This post exists to remind me later what I did. If you find it helpful, great.

First, I got to set up the target. There is a bunch of documentation how to use COMSTAR to export a LU, so I won’t explain. I made a 100 GB LU.

I dug up an older system to act as my test box and disconnected its SATA disk. Booting from the OI USB image was uneventful. Before starting the installer, dropped into a shell and connected to the target (using iscsiadm). Then I installed OI onto the LU. Then, I dropped back into the shell to modify Grub’s menu.lst to use the serial port for both the Grub menu as well as make the kernel direct console output there.

Since the two on-board NICs can’t boot off iSCSI, I ended up using iPXE to boot off iSCSI. First, I made a script file:

```#!ipxe

dhcp
sanboot iscsi:172.16.0.1:::0:iqn.2010-08.org.illumos:02:oi-test
```

Then it was time to grab the source and build it. I did run into a simple problem in a test file, so I patched it trivially.

```\$ git clone git://git.ipxe.org/ipxe.git
\$ cd ipxe
\$ cat /tmp/ipxe.patch
diff --git a/src/tests/vsprintf_test.c b/src/tests/vsprintf_test.c
index 11512ec..2231574 100644
--- a/src/tests/vsprintf_test.c
+++ b/src/tests/vsprintf_test.c
@@ -66,7 +66,7 @@ static void vsprintf_test_exec ( void ) {
/* Basic format specifiers */
snprintf_ok ( 16, "%", "%%" );
snprintf_ok ( 16, "ABC", "%c%c%c", 'A', 'B', 'C' );
-	snprintf_ok ( 16, "abc", "%lc%lc%lc", L'a', L'b', L'c' );
+	//snprintf_ok ( 16, "abc", "%lc%lc%lc", L'a', L'b', L'c' );
snprintf_ok ( 16, "Hello world", "%s %s", "Hello", "world" );
snprintf_ok ( 16, "Goodbye world", "%ls %s", L"Goodbye", "world" );
snprintf_ok ( 16, "0x1234abcd", "%p", ( ( void * ) 0x1234abcd ) );
\$ patch -p1 < /tmp/ipxe.patch
\$ make bin/ipxe.usb EMBED=/tmp/ipxe.script
\$ sudo dd if=bin/ipxe.usb of=/dev/rdsk/c8t0d0p0 bs=1M
```

Now, I had a USB flash drive with iPXE that’d get a DHCP lease and then proceed to boot from my iSCSI target.

Did the system boot? Partially. iPXE did everything right — DHCP, storing the iSCSI information in the  iBFT, reading from the LU and handing control over to Grub. Grub did the right thing too. Sadly, once within kernel, things didn’t quite work out the way they should.

### iBFT

Was the iBFT getting parsed properly? After reading the code for a while and using mdb to examine the state, I found a convenient tunable (read: global int that can be set using the debugger) that will cause the iSCSI boot parameters to be dumped to the console. It is called iscsi_print_bootprop. Setting it to non-zero will produce nice output:

```Welcome to kmdb
kmdb: unable to determine terminal type: assuming `vt100'
Loaded modules: [ unix krtld genunix ]
[0]> iscsi_print_bootprop/W 1
iscsi_print_bootprop:           0               =       0x1
[0]> :c
OpenIndiana Build oi_151a7 64-bit (illumos 13815:61cf2631639d)
SunOS Release 5.11 - Copyright 1983-2010 Oracle and/or its affiliates.
Initiator Name : iqn.2010-04.org.ipxe:00020003-0004-0005-0006-000700080009
Local gateway  : 172.16.0.1
Local DHCP     : 0.0.0.0
Local MAC      : 00:02:b3:a8:66:0c
Target Name    : iqn.2010-08.org.illumos:02:oi-test
Target IP      : 172.16.0.1
Target Port    : 3260
Boot LUN       : 0000-0000-0000-0000
```

### nge vs. e1000g

So, the iBFT was getting parsed properly. The only “error” message to indicate that something was wrong was the “Cannot plumb network device 19”. Searching the code reveals that this is in the rootconf function. After more tracing, it became apparent that the kernel was trying to set up the NIC but was failing to find a device with the MAC address iBFT indicated. (19 is ENODEV)

At this point, it dawned on me that the on-board NICs are mere nge devices. I popped in a PCI-X e1000g moved the cable over and rebooted. Things got a lot farther!

### unable to connect

Currently, I’m looking at this output.

```NOTICE: Configuring iSCSI boot session...
NOTICE: iscsi connection(5) unable to connect to target iqn.2010-08.org.illumos:02:oi-test
Hostname: oi-test
Configuring devices.
NOTICE: iscsi connection(12) unable to connect to target iqn.2010-08.org.illumos:02:oi-test
```

The odd thing is, while these appear SMF is busy loading manifests and tracing the iSCSI traffic to the target shows that the kernel is doing a bunch of reads and writes. I suspect that all the successful I/O was done over one connection and then something happens and we lose the link. This is where I am now.

## Timesavers: ZFS & BE

I’ve mentioned Boot Environments before. Well, earlier this week BEs and ZFS snapshots saved me a bunch of time. Here’s what happened.

I was in the middle of installing some package (pkg install foo) when my laptop locked up. I had to power cycle it the hard way. When it booted back up, I retried the install, but pkg complained that some state file was corrupted and it didn’t want to do anything. Uh oh. I’ve had similar issue happen to me on Debian with aptitude, so I knew that the hard way of fixing this issue was going to take more time than I’d like to dedicate to it (read: none). Thankfully, I use OpenIndiana which has ZFS and BEs.

1. Reboot into a BE from a while ago (openindiana-3). The latest BE (openindiana-4) was created by pkg about a month ago as a clone of openindiana-3 during a major upgrade.
2. Figure out which automatic ZFS snapshot I want to revert to. A matter of running zfs list -t all rpool/ROOT/openindiana-4 | tail -5 and picking the latest snapshot which I believe is from before pkg messed it all up. I ended up going an hour back just to make sure.
3. Revert the BE. beadm rollback openindiana-4@zfs-auto-snap_hourly-2011-10-25-19h11
4. Reboot back into openindiana-4.

After the final reboot, everything worked just fine. (Since the home directories are on a different dataset, they were left untouched.)

Total downtime: 5 minutes
Ease of repair: trivial