Josef “Jeff” Sipek

Interactivity During nightly(1), part 2

Back in May, I talked about how I increase the priority of Firefox in order to get decent response times while killing my laptop with a nightly build of Illumos. Specifically, I have been increasing the priority of Firefox so that it would get to run in a timely manner. I have been doing this by setting it to the real-time (RT) scheduling class which has higher priority than most things on the system. This, of course, requires extra privileges.

Today, I realized that I was thinking about the problem the wrong way. What I really should be doing is lowering the priority of the build. This requires no special privileges. How do I do this? In my environment file, I include the following line:

priocntl -s -c FX -p 0 $$

This sets the nightly build script’s scheduling class to fixed (FX) and manually sets the priority to 0. From that point on, the nightly script and any processes it spawns run with a lower priority (zero) than everything else (which tends to be in the 40-59 range).

Interactivity During nightly(1)

Every so often, I do a nightly build of illumos on my laptop. This is a long and very CPU intensive process. During the build (which takes about 2.75 hours), the load average is rarely below 8 and most of the time it hovers in the low twenties. (This is a full debug and non-debug build with lint and all the other checking. If I need a build quickly, I can limit it to just what I need and then we’re talking minutes or seconds.)

Anyway, as you might imagine this sort of load puts some pressure on the CPUs. As a result, some interactive processes suffer a bit. Specifically, Firefox doesn’t get enough cycles to render the modern web (read: JavaScript cesspool). Instead of suffering for almost three hours, I just change Firefox’s scheduling class from IA (interactive) to RT (real time):

# priocntl -s -c RT `pgrep firefox`

This allows Firefox to preempt just about everything on my laptop. This works because Firefox actually yields the CPU properly. This will probably bite me sometime in the future when I end up on a page with such a busted JavaScript turd that it takes over a CPU and won’t let go. Till then, I have a pretty good workaround.

Lua Compatibility

Phew! Yesterday afternoon, I decided to upgrade my laptop’s OpenIndiana from 151a9 to “Hipster”. I did it in a bit convoluted way, and hopefully I’ll write about that some other day. In the end, I ended up with a fresh install of the OS with X11 and Gnome. If you’ve ever seen my monitors, you know that I do not use Gnome — I use Notion. So, of course I had it install it. Sadly, OpenIndiana doesn’t ship it so it was up to me to compile it. After the usual fight to get a piece of software to compile on Illumos (a number of the Solaris-isms are still visible), I got it installed. A quick gdm login later, Notion threw me into a minimal environment because something was exploding.

After far too many hours of fighting it, searching online, and trying random things, I concluded that it was not Notion’s fault. Rather, it was something on the system. Eventually, I figured it out. Lua 5.2 (which is standard on Hipster) is not compatible with Lua 5.1 (which is standard on 151a9)! Specifically, a number of functions have been removed and the behavior of other functions changed. Not being a Lua expert (I just deal with it whevever I need to change my window manager’s configuration), it took longer than it should but eventually I managed to get Notion working like it should be.

So, what sort of incompatibilies did I have to work around?

loadstring

loadstring got renamed to load. This is an easy to fix thing, but still a headache especially if you want to support multiple versions of Lua.

table.maxn

table.maxn got removed. This function returned the largest positive integer key in an associative array (aka. a table) or 0 if there aren’t any. (Lua indexes arrays starting at 1.) The developers decided that it’s so simple that those that want it can write it themselves. Here’s my version:

local function table_maxn(t)
    local mn = 0
    for k, v in pairs(t) do
        if mn < k then
            mn = k
        end
    end
    return mn
end

table.insert

table.insert now checks bounds. There doesn’t appear to be any specific way to get old behavior. In my case, I was lucky. The index/positition for the insertion was one higher than table_maxn returned. So, I could replace:

table.insert(ret, pos, newscreen)

with:

ret[pos] = newscreen

Final Thougths

I can understand wanting to deprecate old crufty interfaces, but I’m not sure that the Lua developers did it right. I really think they should have marked those interfaces as obsolete, make any use spit out a warning, and then in a couple of years remove it. I think that not doing this, will hurt Lua 5.2’s adoption.

Yes, I believe there is some sort of a compile time option for Lua to get legacy interfaces, but not everyone wants to recompile Lua because the system installed version wasn’t compiled quite the way that would make things Just Work™.

x2APIC, IOMMU, Illumos

About a week ago, I hinted at a boot hang I was debugging. I’ve made some progress with it, and along the way I found some interesting things about which I’ll blog over the next few days. Today, I’m going to talk about the Wikipedia article: APIC, xAPIC, and Wikipedia article: x2APIC and how they’re handled in Illumos.

APIC, xAPIC, x2APIC

I strongly suggest you become at least a little familiar with APIC architecture before reading on. The Wikipedia articles above are a good start.

First things first, we need some definitions. APIC can refer to either the architecture or to very old (pre-Pentium 4) implementation. Since I’m working with a Sandy Bridge, I’m going to use APIC to refer to the architecture and completely ignore that these chips existed. Everything they do is a subset of xAPIC. xAPIC is an extension to APIC. xAPIC chips started showed up in NetBurst architecture Intel CPUs (i.e., Pentium 4). xAPIC included some goodies such as upping the limit on the number of CPUs to 256 (from 16). x2APIC is an extension to xAPIC. x2APIC chips started appearing around the same time Sandy Bridge systems started showing up. It is a major update to how interrupts are handled, but as with many things in the PC industry the x2APIC is fully backwards compatible with xAPICs. x2APIC includes some goodies such as upping the limit on the number of CPUs to $2^{32}$.

Regardless of which exact flavor you happen to use, you will find two components: the local APIC and I/O APIC. Each processor gets their own local APIC and I/O buses get I/O APICs. I/O APICs can service more than one device, and in fact many systems have only one I/O APIC.

The xAPIC uses Wikipedia article: MMIO to program the local and I/O APICs.

x2APIC has two mode of operation. First, there is the xAPIC compatibility mode which makes the x2APIC behave just like an xAPIC. This mode doesn’t give you all the new bells and whistles. Second, there is the new x2APIC mode. In this mode, the APIC is programmed using Wikipedia article: MSRs.

One interesting fact about x2APIC is that it requires an Wikipedia article: iommu. My Sandy Bridge laptop has an Intel iommu as part of the VT-d feature.

Illumos /etc/mach

x2APIC in Illumos has two APIC drivers. First, there is pcplusmp which knows how to handle APIC and xAPIC. Second, there is apix which targets x2APIC, but knows how to operate it in both modes. On boot, the kernel consults /etc/mach to get a list of machine specific modules to try to load. Currently, the default contents (trimmed for display here) are:

#
# CAUTION!  The order of modules specified here is very important. If the
# order is not correct it can result in unexpected system behavior. The
# loading of modules is in the reverse order specified here (i.e. the last
# entry is loaded first and the first entry loaded last).
#
pcplusmp
apix
xpv_psm

Since I’m not running Xen, xpv_psm will fail to load, and apix gets its chance to load.

pcplusmp + apix Code Sharing

The code in these two modules can be summarized with a word: mess. Following what happens when would be enough of an adventure. The code for the two modules lives in four directories: usr/src/uts/i86pc/io, usr/src/uts/i86pc/io/psm, usr/src/uts/i86pc/io/pcplusmp, and usr/src/uts/i86pc/io/apix. But the sharing isn’t as straight forward as one would hope.

Directory pcplusmp apix
i86pc/io mp_platform_common.c, mp_platform_misc.c, hpet_acpi.c mp_platform_common.c, hpet_acpi.c
i86pc/io/psm psm_common.c psm_common.c
i86pc/io/pcplusmp * apic_regops.c, apic_common.c, apic_timer.c
i86pc/io/apix *

This is of course not clear at all when you look at the code. (Reality is a bit messier because of the i86xpv platform which uses some of the i86pc source.)

apix_probe

When the apix module gets loaded, its probe function (apix_probe) is called. This is the place where the module decides if the hardware is worthy. Specifically, if it finds that the CPU reports x2APIC support via Wikipedia article: cpuid, it goes on to call the common APIC probe code (apic_probe_common). Unless that fails, the system will use the apix module — even if there is no iommu and therefore the x2APIC needs to operate in xAPIC mode.

What mode are you using? Easy, just check the apic_mode global in the kernel:

# echo apic_mode::whatis | mdb -k
fffffffffbd0ee4c is apic_mode, in apix's data segment
# echo apic_mode::print | mdb -k
0x2

2 (LOCAL_APIC) indicates xAPIC mode, while 3 (LOCAL_X2APIC) indicates x2APIC mode.

Because this part is as clear as mud, I made a table that tells you what module and mode to expect given your hardware, what CPUID says, and the presence and state of the iommu.

APIC hw CPUID IOMMU IOMMU state Module apic_mode
xAPIC off pcplusmp LOCAL_APIC
x2APIC off pcplusmp LOCAL_APIC
x2APIC on absent apix LOCAL_APIC
x2APIC on present off apix LOCAL_APIC
x2APIC on present on apix LOCAL_X2APIC

Defaults

I’ve never seen apic_mode equal to LOCAL_X2APIC in the wild. This was very puzzling. Yesterday, I discovered why. As I mentioned earlier, in order for the x2APIC to operate in x2APIC mode an iommu is required. Long story short, the default config that Illumos ships disables iommus on boot. Specifically:

$ cat /platform/i86pc/kernel/drv/rootnex.conf | grep -v '^\(#.*\|\)$'
immu-enable="false";

In order to get LOCAL_X2APIC mode, you need to set:

immu-enable="true";
immu-intrmap-enable="true";

Once you put those into the config file, update you boot archive and reboot. You should be set… except the iommu support in Illumos is… shall we say… poor.

(I should point out that it is possible for the BIOS to enable x2APIC mode before handing control off to the OS. This is pretty rare unless you have a really big x86 system.)

1394

It would seem that the hci1394 driver doesn’t quite know how to deal with an iommu “messing” with its I/Os and its interrupt service routine shuts down the driver. (On a debug build it throws is ASSERT(0) for good measure.) I just disabled 1394 in the BIOS since I don’t have any Firewire devices handy and therefore no use for the port at the moment.

immu-enable Details

In case you want to know how iommu initialization affects the apix initialization…

During boot, immu_init gets called to initialize iommus. If the config option (immu-enable) is not true, the function just returns instead of calling immu_subsystems_setup which calls immu_intrmap_setup which sets psm_vt_ops to non-NULL value.

Later on, when apix is loaded and is initializing itself in apix_picinit, it calls apic_intrmap_init. This function does nothing if psm_vt_ops are NULL.

The Hang

I might as well tell you a bit about my progress on tracking down the hang. It happens only if I’m using the apix module and I allow deep C states in the idle thread (technically, it could also be an mwait related issue since I cannot disable just mwait without disabling deep C states). It does not matter if the apic_mode is LOCAL_APIC or LOCAL_X2APIC.

Assorted Documentation

  1. Intel 64 Architecture x2APIC Specification
  2. Intel MP Spec 1.4

CPU Pause Threads

Recently, I ended up debugging a boot hang. (I’m still working on it, so I don’t have a resolution to it yet.) The hang seems to occur during the mp startup. That is, when the boot CPU tries to online all the other CPUs on the system. As a result, I spent a fair amount of time reading the code and poking around with mdb. Given the effort I put in, I decided to document my understanding of how CPUs get brought online during boot in Illumos. In this post, I’ll talk about the CPU pause threads.

Each CPU has a special thread — the pause thread. It is a very high priority thread that’s supposed to preempt everything on the CPU. If all CPUs are executing this high-priority thread, then we know for fact that nothing can possibly be dereferencing the CPU structures’ (cpu_t) pointers. Why is this useful? Here’s a comment from right above cpu_pause — the function pause threads execute:

/*
 * This routine is called to place the CPUs in a safe place so that
 * one of them can be taken off line or placed on line.  What we are
 * trying to do here is prevent a thread from traversing the list
 * of active CPUs while we are changing it or from getting placed on
 * the run queue of a CPU that has just gone off line.  We do this by
 * creating a thread with the highest possible prio for each CPU and
 * having it call this routine.  The advantage of this method is that
 * we can eliminate all checks for CPU_ACTIVE in the disp routines.
 * This makes disp faster at the expense of making p_online() slower
 * which is a good trade off.
 */

The pause thread is pointed to by the CPU structure’s cpu_pause_thread member. A new CPU does not have a pause thread until after it has been added to the list of existing CPUs. (cpu_pause_alloc does the actual allocation.)

CPU pausing is pretty strange. First of all, let’s call the CPU requesting other CPUs to pause the controlling CPU and all online CPUs that will pause the pausing CPUs. (The controlling CPU does not pause itself.) Second, there are two global structures: (1) a global array called safe_list which contains a 8-bit integer for each possible CPU where each element holds a value ranging from 0 to 4 (PAUSE_*) denoting the state of that CPU’s pause thread, and (2) cpu_pause_info which contains some additional goodies used for synchronization.

Pausing

To pause CPUs, the controlling CPU calls pause_cpus (which uses cpu_pause_start), where it iterates over all the pausing CPUs setting their safe_list entries to PAUSE_IDLE and queueing up (using setbackdq) their pause threads.

Now, just because the pause threads got queued doesn’t mean that they’ll get to execute immediately. That is why the controlling CPU then waits for each of the pause threads to up a semaphore in the cpu_pause_info structure. Once all the pause threads have upped the semaphore, the controlling CPU sets the cp_go flag to let the pause threads know that it’s time for them to go to sleep. Then the controlling CPU waits for each pause thread to signal (via the safe_list) that they have disabled just about all interrupts and that they are spinning (mach_cpu_pause). At this point, pause_cpus knows that all online CPUs are in a safe place.

Starting

Starting the CPUs back up is pretty easy. The controlling CPU just needs to set all the CPU’s safe_list to a PAUSE_IDLE. That will cause the pausing CPUs to break out of their spin-loop. Once out of the spin loop, interrupts are re-enabled and a CPU control relinquished (via swtch). The controlling CPU does some cleanup of its own, but that’s all that is to it.

Synchronization

Why not use a mutex or semaphore for everything? The problem lies in the fact that we are in a really fragile state. We don’t want to lose the CPU because we blocked on a semaphore. That’s why this code uses a custom synchronization primitives.

Meili upgrades

A couple of months ago, I decided to update my almost two and a half year old laptop. Twice.

First, I got more RAM. This upped it to 12 GB. While still on the low side for a box which actually gets to see some heavy usage (compiling illumos takes a couple of hours and generates a couple of GB of binaries), it was better than the 4 GB I used for way too long.

Second, I decided to bite the bullet and replaced the 320 GB disk with a 256 GB SSD (Samsung 840 Pro). Sadly, in the process I had the pleasure of reinstalling the system — both Windows 7 and OpenIndiana. Overall, the installation was uneventful as my Windows partition has no user data and my OI storage is split into two pools (one for system and one for my data).

The nice thing about reinstalling OI was getting back to a stock OI setup. A while ago, I managed to play with software packaging a bit too much and before I knew it I was using a customized fork of OI that I had no intention of maintaining. Of course, I didn’t realize this until it was too late to rollback. Oops. (Specifically, I had a custom pkg build which was incompatible with all versions OI ever released.)

One of the painful things about my messed-up-OI install was that I was running a debug build of illumos. This made some things pretty slow. One such thing was boot. The ZFS related pieces took about a minute alone to complete. The whole boot procedure took about 2.5 minutes. Currently, with a non-debug build and an SSD, my laptop goes from Grub prompt to gdm login in about 40 seconds. I realize that this is an apples to oranges comparison.

I knew SSDs were supposed to be blazing fast, but I resisted getting one for the longest time mostly due to reliability concerns. What changed my mind? I got to use a couple of SSDs in my workstation at work. I saw the performance and I figured that ZFS would take care of alerting me of any corruption. Since most of my work is version controlled, chances are that I wouldn’t lose anything. Lastly, SSDs got a fair amount of improvements over the past few years.

Meili

Back in June I got myself a new laptop — Thinkpad T520. As always, it’s a solid design (yes, I know, Thinkpads aren’t what they used to be). The unfortunate news is that the hardware was just a bit too new to be supported well. It came with Windows 7, which of course knew how to deal with all the devices in the system.

Debian

I tried installing Debian, but anything but the latest testing snapshot didn’t recognize my Intel 82579LM ethernet chip. The latest development snapshot installed just fine, but when I tried to boot into the installed system, everything got stuck in the middle of the initramfs. Booting with init=/bin/bash got me a shell, but anything and everything I tried didn’t fix the problem in the end. I searched the bug tracker for similar issues — no luck. In a last ditch effort, I tried to ask #debian. In has been my experience in the past that this channel is useless, but I asked anyway. I got precisely zero responses.

OpenIndiana

Unlike Debian, OpenIndiana installed and booted just fine. Sadly, both my wifi (Intel Wifi Link 1000) and my wired ethernet were not supported. I ended up installing VirtualBox in Windows and OpenIndiana underneath it. It worked reasonably well. At the same time, I started pestering some of the Illumos developers that mentioned that they were working on an update to the e1000g driver — the driver for my wired network interface.

Yesterday, one of them updated the bug related to the driver update with binaries for people to try. Well, guess what? I’m writing this entry in Vim over ssh.

The driver install was a simple matter of overwriting the existing e1000g files, and then running update_drv -a -i ’"pci8086,1502"’ e1000g and then rebooting. (I could have used devfsadm instead, but I wanted to make sure things would come up on boot anyway.)

I still need to switch Xorg to the proprietary NVidia driver.

Windows

I’m pretty sure I’ll end up rebooting into Windows every so often anyway…if only to play Wikipedia article: Age of Mythology. :)

P.S. In case you haven’t guessed it yet, Meili is my laptop’s hostname.

Powered by blahgd