Josef “Jeff” Sipek

Unleashed: The Birth (and Death) of an Operating System Fork

I realize that I haven’t mentioned Unleashed on my blahg before. The major reason for it was that whenever I thought of doing that, I ended up working on Unleashed instead. Oops! :)

For those that don’t know, Unleashed was an operating system fork created by me and Lauri in 2016. A fork that we’ve maintained till now. I said was, because earlier this month, we made the last release. Instead of trying to summarize everything again, let me just quote the relevant parts of the announcement email I sent out on April 4th:

This is the fifth and final release of Unleashed—an operating system fork of illumos. For more information about Unleashed itself and the download links, see our website [1].

That is right, we are pulling the plug on this project. What began as a hobby project for two developers never grew much beyond being a hobby project for two developers. But after nearly 4 years of work, we proved that the illumos code base can be cleaned up significantly, its APIs modernized, and the user experience improved in general.

While we’ve made too many changes [2] to list them all here, I’d like to highlight what I think are some of our major accomplishments:

  • shrinking the codebase by about 25% (~5.6 million lines of code) even though we imported major components (e.g., libressl, openbsd ksh)
  • reducing build times drastically (from ~40 minutes to ~16 minutes on a 2012-era 4 core CPU)
  • changing uname to Unleashed and amd64 (from SunOS 5.11 i86pc)

In addition to the projects we finished, we were well on the way to several other improvements that we simply haven’t gotten around to completing. Some of the more notable ones include:

  • page cache rewrite (~3/5 done)
  • modernizing the build system with bmake / removing dmake (~1/5 done)
  • everything 64-bit only (~4/5 done)

All that we’ve accomplished is just the tip of the iceberg of what remains to be done. Unfortunately for Unleashed, we both decided that we wanted to spend our free time on different projects.

I know I’m biased, but I think we’ve made some good changes in Unleashed and I’d love to see at least some of them make their way into illumos-gate or one of the illumos distros. So, I hope that someone will have the time and interest to integrate some of these changes.

Finally, I’d like to encourage anyone who may be interested not to be afraid of forking an open source project. Even if it doesn’t work out, it is extremely educational.

Happy hacking!

Jeff.

[1] https://www.unleashed-os.org
[2] https://repo.or.cz/unleashed.git/blob/HEAD:/FEATURES.txt

What Unleashed was or wasn’t is described on the website, the README, the features file, and the mailing list archives. The history leading up to Unleashed is essentially undocumented. I am dedicating the rest of this post to that purpose.

Jeffix (2015–2016)

Before Unleashed there was Jeffix.

I made an extremely indirect mention of Jeffix on Twitter in 2015, and one direct mention in a past blahg post in 2016.

So, what exactly was Jeffix? Was it a distro? Not quite. It was more of an overlay for OpenIndiana Hipster. You weren’t able to install Jeffix. Instead, you had to install Hipster and then add the Jeffix package repository (at a higher priority), followed by an upgrade.

Why make it? At the time I used OpenIndiana (an illumos distro) on my laptop. While that was great for everyday work, being an illumos developer meant that I wanted to test my changes before they made it upstream. For about a year, I was running various pre-review and pre-RTI bits. This meant that the set of improvements I had available changed over time. Sometimes this got frustrating, since I would be missing certain changes simply because they hadn’t made it through the RTI process yet while I was testing a different change.

So, in October 2015, I decided to change that. I made a repo with an assortment of changes. Some were mine, while others were authored by other developers in the community. I called this modified illumos Jeffix. Very creative, I know.

I kept this up until May 2016. Today, I have no idea why I stopped building it then.

To Fork or Not To Fork

For the three years leading up to Unleashed, I spent a considerable amount of time thinking about what would make illumos better—not just on the technical side but also the community side.

It has always been relatively easy to see that the illumos community was not (and still is not) very big. By itself this isn’t a problem, but it becomes one when you combine it with the other issues in the community.

The biggest problem is the lack of clear vision for where the project should go.

Over the years, the only idea that I’ve seen come up consistently could be summarized as “don’t break compatibility.” What exactly does that mean? Well, ask 10 people and you’ll get 12 different opinions.

Provably hostile

There have been several times where I tried to clean up an interface in the illumos kernel, and the review feedback amounted to “doesn’t this break on $some-sparc-hardware running kernel modules from Solaris?”

In one instance (in August 2015), when I tried to remove an ancient specialized kernel module binary compatibility workaround that Sun added in Solaris 9 (released in 2002), I was asked about what turned out to be a completely ridiculous situation—you’d need to try to load a kernel module built against Solaris 8 that used this rather specialized API, an API that has been changed twice in incompatible ways since it was added (once in 2005 by Sun, and once in April 2015 by Joyent).

Unsurprisingly, that “feedback” completely derailed a trivial cleanup. To this day, illumos has that ugly workaround for Solaris 8 that does absolutely nothing but confuse anyone that looks at that code. (Of course, this cleanup was one of the first commits in Unleashed.)

While this was a relatively simple (albeit ridiculous) case—I just had to find a copy of old Solaris headers to see when things changed—it nicely demonstrates the “before we take your change, please prove that it doesn’t break anything for anyone on any system” approach the community takes.

Note that this is well beyond the typical “please do due diligence” where the reviewers tend to help out with the reasoning or even testing. The approach the illumos community takes feels more like a malicious “let’s throw every imaginable thought at the contributor, maybe some of them stick.” Needless to say, this is a huge motivation killer that makes contributors leave—something that a small-to-begin-with community cannot afford.

Toxic

In the past there have been a number of people in the illumos community that were, in my opinion, outright toxic for the project. I’m happy to say that a number of them have left the community. Good riddance.

What do I mean by toxic?

Well, for instance, there was the time in 2014 when someone decided to contribute to a thread about removing SunOS 4.x compatibility code (that is binary compatibility with an OS whose last release was in 1994) with only one sentence: “Removing stuff adds no value.”

Elsewhere in the same thread, another person chimed in with his typical verbiage that could be summarized as “why don’t you do something productive with your time instead, and work on issues that I think are important.” While his list of projects was valid, being condescending to anyone willing to spend their free time to help out your project or telling them that they’re wasting their time unless they work on something that scratches your or your employer’s itch is…well…stupid. Yet, this has happened many times on the mailing list and on IRC over the years.

Both of these examples come from the same thread simply because I happened to stumble across it while looking for another email, but rest assured that there have been plenty of other instances of this sort of toxic behavior.

The Peanut Gallery

Every project with enough people on the mailing list ends up with some kind of a peanut gallery. The one in illumos is especially bad. Every time a removal of something antique is mentioned, people that haven’t contributed a single line of code come out of the woodwork with various forms of “I use it” or even a hypothetical “someone could be using it to do $foo”.

It takes a decent amount of effort to deal with those. For new contributors it is even worse as they have no idea if the feedback is coming from someone that has spent a lot of time developing the project (and should be taken seriously) or if it is coming from an obnoxiously loud user or even a troll (and should be ignored).

2020

All this combined results in a potent mix that drives contributors away. Over the years, I’ve seen people come, put in reasonable effort to attempt to contribute, hit this wall of insanity, and quietly leave.

As far as I can tell, some of the insanity is better now—many of the toxic people left, some of the peanut gallery members started to contribute patches to remove dead code, etc.—but a lot of problems still remain (e.g., changes still seem to get stuck in RTI).

So, why did I write so many negative things about the illumos community? Well, it documents the motivation for Unleashed. :) Aside from that, I think there is some good code in illumos that should live on, but it can only do that if there is a community taking care of it—a community that can survive long term. Maybe this post will help with opening some eyes.

Unleashed

In July 2016, I visited Helsinki for a job interview at Dovecot. Before the visit, I contacted Lauri to see if he had any suggestions for what to see in Helsinki. In addition to a variety of sightseeing ideas, he suggested that we meet up for a beer.

At this point, the story continues as Lauri described it on the mailing list — while we were lamenting, Jouni suggested we fork it. It was obvious that he was (at least partially) joking, but since I had already been considering a fork, the idea resonated with me. After I got home, I thought about it for a while and ultimately decided that it was a good enough idea.

That’s really all there is to the beginning of Unleashed itself. While the decision to fork was definitely instigated by Jouni, the thought was certainly on my mind for some time before that.

Next

With Unleashed over, what am I going to do next?

I have plenty of fun projects to work on—ranging from assorted DSP attempts to file system experiments. They’ll get developed on my FreeBSD laptop. And no, I will not resume contributing to illumos. Maybe I’m older and wiser (and probably grumpier), but I want to spend my time working on code that is appreciated.

With all that said, I made some friends in illumos and I will continue chatting with them about anything and everything.

CBOR vs. JSON vs. libnvpair

My blahg uses nvlists for logging extra information about its operation. Historically, it used Sun libnvpair. That is, it used its data structures as well as the XDR encoding to serialize the data to disk.
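
For illustration, the old logging code looked roughly like this. (A minimal sketch against the public libnvpair API; the key names and the log_event() helper are made up, not the actual blahg code.)

#include <libnvpair.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Build an nvlist and serialize it with the XDR encoding.
 * (The keys are hypothetical.)
 */
static int
log_event(int fd, const char *page, uint64_t status)
{
	nvlist_t *nvl;
	char *buf = NULL;
	size_t len = 0;

	if (nvlist_alloc(&nvl, NV_UNIQUE_NAME, 0) != 0)
		return (-1);

	(void) nvlist_add_string(nvl, "page", page);
	(void) nvlist_add_uint64(nvl, "status", status);

	/* serialize using XDR; libnvpair allocates buf for us */
	if (nvlist_pack(nvl, &buf, &len, NV_ENCODE_XDR, 0) != 0) {
		nvlist_free(nvl);
		return (-1);
	}

	(void) write(fd, buf, len);

	free(buf);
	nvlist_free(nvl);
	return (0);
}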

A few months ago, I decided to replace libnvpair with my own nvlist implementation—one that was more flexible and better integrated with my code. (It is still a bit of a work-in-progress, but it is looking good.) The code conversion went smoothly, and since then all the new information was logged in JSON.

Last night, I decided to convert a bunch of the previously accumulated libnvpair data files into the new JSON-based format. After whipping up a quick conversion program, I ran it on the data. The result surprised me—the JSON version was about 55% of the size of the libnvpair encoded input!

This piqued my interest. I re-ran the conversion but with CBOR (RFC 7049) as the output format. The result was even better with the output being 45% of libnvpair’s encoding.

This made me realize just how inefficient libnvpair is when serialized. At least part of it is because XDR (the way libnvpair serializes data) uses a lot of padding, while both JSON and CBOR use a more compact encoding for many data types (e.g., an unsigned number in CBOR uses 1 byte for the type and 0, 1, 2, 4, or 8 additional bytes based on its magnitude, while libnvpair always encodes a uint64_t as 8 bytes plus 4 bytes for the type).
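
To make that concrete, here is a sketch of CBOR’s unsigned integer encoding, written from the RFC 7049 description (not taken from any particular library):

#include <stdint.h>
#include <stddef.h>

/*
 * Encode a CBOR unsigned integer (major type 0) into buf, returning
 * the number of bytes used.  Small values fit entirely in the type
 * byte, while libnvpair spends 12 bytes on any uint64_t.
 */
static size_t
cbor_pack_uint(uint64_t v, uint8_t *buf)
{
	size_t extra, i;

	if (v < 24) {
		buf[0] = (uint8_t)v;	/* the value fits in the type byte */
		return (1);
	} else if (v <= UINT8_MAX) {
		buf[0] = 0x18;		/* 1 additional byte follows */
		extra = 1;
	} else if (v <= UINT16_MAX) {
		buf[0] = 0x19;		/* 2 additional bytes follow */
		extra = 2;
	} else if (v <= UINT32_MAX) {
		buf[0] = 0x1a;		/* 4 additional bytes follow */
		extra = 4;
	} else {
		buf[0] = 0x1b;		/* 8 additional bytes follow */
		extra = 8;
	}

	/* the value itself follows in big endian byte order */
	for (i = 0; i < extra; i++)
		buf[1 + i] = (uint8_t)(v >> ((extra - 1 - i) * 8));

	return (1 + extra);
}

A value like 200 therefore takes two bytes (0x18 0xc8) instead of libnvpair’s twelve.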

Since CBOR is 79% of JSON’s size (and significantly less underspecified compared to the minefield that is JSON), I am hoping to convert everything that makes sense to CBOR. (CBOR being a binary format makes it harder for people to hand-edit it. If hand-editing is desirable, then it makes sense to stick with JSON or other text-based formats.)

The Data & Playing with Compression

The blahg-generated dataset that I converted consisted of 230866 files, each containing an nvlist. The following byte counts are for a simple concatenation of the files. (A more complicated format like tar would add a significant enough overhead to make the encoding efficiency comparison flawed.)

Format   Size     % of nvpair
nvpair   471 MB   100%
JSON     257 MB   54.6%
CBOR     203 MB   45.1%

I also took each of the concatenated files and compressed it with gzip, bzip2, and xz. In each case, I used the most aggressive compression by using -9. The percentages in parentheses are comparing the compressed size to the same format’s uncompressed size. The results:

Format   Uncomp.   gzip              bzip2             xz
nvpair   471 MB    37.4 MB (7.9%)    21.0 MB (4.5%)    15.8 MB (3.3%)
JSON     257 MB    28.7 MB (11.1%)   17.9 MB (7.0%)    14.5 MB (5.6%)
CBOR     203 MB    26.8 MB (13.2%)   16.9 MB (8.3%)    13.7 MB (6.7%)

(The compression ratios are likely artificially better than normal since each of the 230k files has the same nvlist keys.)

Since tables like this are hard to digest, I turned the same data into a graph.

CBOR does very well uncompressed. Even after compressing it with a general purpose compression algorithm, it outperforms JSON with the same algorithm by about 5%.

I look forward to using CBOR everywhere I can.

setuid/setgid & coreadm

While core files can be a huge nuisance (i.e., “why are there core files all over my file system?!”), it is hard to deny the fact that they often are invaluable when trying to diagnose a bug. Illumos systems allow the system administrator to control the creation of core files in an easy but powerful way—via the coreadm utility. Recently, I had a bit of a run-in with coreadm because I misunderstood one of its features. As I often do, I’m using this space to document my findings for my and others’ benefit.

A Small Introduction

First of all, let me introduce coreadm a little bit. Simply running coreadm without any arguments will print the current settings. For example, these are the default Illumos settings (keep in mind that some distros use different defaults):

# coreadm
     global core file pattern: 
     global core file content: default
       init core file pattern: core
       init core file content: default
            global core dumps: disabled
       per-process core dumps: enabled
      global setid core dumps: disabled
 per-process setid core dumps: disabled
     global core dump logging: disabled

When I first came across coreadm a couple of years ago, I expected a boolean (“should cores be created”) and a pattern for core file names. As you can see above, there are more options than that. While wearing both developer and system administrator hats at the same time, I was very happy to see that it is possible for a misbehaving process to create two cores—a “per-process” core and a “global” one. Since most are familiar with the “per-process” cores that end up in whatever the current working directory was, I’m going to talk only about the global ones. (The term “global” in this context has nothing to do with the global zone. All these settings are per zone.)

The global core settings allow the system administrator to save a copy of every core ever made in a central location for later analysis. So, for example we can make all normal as well as setuid processes dump core in a specified location (e.g., /cores/%f.%p):

# coreadm -e global -e global-setid -g /cores/%f.%p

Running coreadm without any arguments again shows us the new configuration:

# coreadm 
     global core file pattern: /cores/%f.%p
     global core file content: default
       init core file pattern: core
       init core file content: default
            global core dumps: enabled
       per-process core dumps: enabled
      global setid core dumps: enabled
 per-process setid core dumps: disabled
     global core dump logging: disabled

setid

The difference between plain ol’ global core dumps and global setid core dumps is pretty easy to guess: global core dumps are exactly what I described above…except a setid process will not generate a core file unless the setid core dumps are also enabled. The handwavy reasoning here is that setid cores can include sensitive information that could be leaked by sharing the core.

The aspect that bit me recently is that if a process gives up its privileges via setuid(2), it is treated the same way as if it were setuid. This behavior is documented in the setuid syscall handler in uid.c:121:

	/*
	 * A privileged process that gives up its privilege
	 * must be marked to produce no core dump.
	 */

So, when I tried to make the daemon more secure by instructing it to change to a non-privileged user, I was accidentally disabling core dumps for that process. Needless to say, this made debugging the SIGSEGV I was running into significantly harder. Now that I know this, I make sure both global and global setid cores are always enabled.
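
For the record, the privilege drop itself was the completely standard pattern, roughly this sketch (the daemon_uid and daemon_gid arguments are hypothetical placeholders):

#include <unistd.h>

/*
 * Drop privileges the usual way.  After the setuid(2) call below, the
 * kernel marks the process just like a setid binary, so it produces no
 * core unless coreadm's (global-)setid core dumps are enabled.
 */
static int
drop_privs(uid_t daemon_uid, gid_t daemon_gid)
{
	/* drop the group first, while still privileged enough to do so */
	if (setgid(daemon_gid) != 0)
		return (-1);
	if (setuid(daemon_uid) != 0)
		return (-1);
	return (0);
}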

MACHINE_THAT_GOES_PING

Given that my first UNIX experience was on Linux, I’ve gotten used to the way certain commands work. When I switched from Linux to OpenIndiana (an Illumos-based distro), I had to get used to the fact that some commands worked slightly (or in some cases extremely) differently. One such command is ping.

On Linux, invoking ping without any arguments, I would get the very familiar output:

linux$ ping powerdns.com
PING powerdns.com (82.94.213.34) 56(84) bytes of data.
64 bytes from xs.powerdns.com (82.94.213.34): icmp_req=1 ttl=55 time=98.0 ms
64 bytes from xs.powerdns.com (82.94.213.34): icmp_req=2 ttl=55 time=99.2 ms
64 bytes from xs.powerdns.com (82.94.213.34): icmp_req=3 ttl=55 time=100 ms
^C
--- powerdns.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 98.044/99.170/100.188/0.950 ms

I was very surprised when I first ran ping on an OpenIndiana box since it outputted something very different:

oi$ ping powerdns.com
powerdns.com is alive

No statistics! Just a boolean indicating “has the host responded to a single ping.” When I run ping, I want to see the statistics—that’s why I run ping to begin with. The manpage helpfully points out that I can get statistics by using the -s option:

oi$ ping -s powerdns.com
PING powerdns.com: 56 data bytes
64 bytes from xs.powerdns.com (82.94.213.34): icmp_seq=0. time=98.955 ms
64 bytes from xs.powerdns.com (82.94.213.34): icmp_seq=1. time=99.597 ms
64 bytes from xs.powerdns.com (82.94.213.34): icmp_seq=2. time=99.546 ms
^C
----powerdns.com PING Statistics----
3 packets transmitted, 3 packets received, 0% packet loss
round-trip (ms)  min/avg/max/stddev = 98.955/99.366/99.597/0.357

For the past few years, I’ve just been getting used to adding -s. It was a little annoying, but it wasn’t the end of the world because I don’t use ping that much and when I do, the two extra characters don’t matter.

Recently, I was looking through the source for Illumos’s ping when I discovered that statistics can be enabled not just by the -s option but also with the MACHINE_THAT_GOES_PING environment variable!

A quick test later, I added the variable to my environment scripts and never looked back.

This is what it looks like:

oi$ export MACHINE_THAT_GOES_PING=1
oi$ ping powerdns.com
PING powerdns.com: 56 data bytes
64 bytes from xs.powerdns.com (82.94.213.34): icmp_seq=0. time=98.704 ms
64 bytes from xs.powerdns.com (82.94.213.34): icmp_seq=1. time=99.062 ms
64 bytes from xs.powerdns.com (82.94.213.34): icmp_seq=2. time=99.156 ms
^C
----powerdns.com PING Statistics----
3 packets transmitted, 3 packets received, 0% packet loss
round-trip (ms)  min/avg/max/stddev = 98.704/98.974/99.156/0.239

In conclusion, if you are a Linux refugee and you miss the way ping worked on Linux, just add MACHINE_THAT_GOES_PING to your environment and don’t look back.

dis(1): support for System/370, System/390, and z/Architecture ELF bins

A few months ago, I came to the conclusion that it would be both fun and educational to add a new disassembler backend to libdisasm—the disassembler library in Illumos. Being a mainframe fan, I decided that implementing a System/390 and z/Architecture disassembler would be fun (I’ve done it before in HVF).

At first, I was targeting only the 390 and z/Architecture, but given that the System/370 is a trivial (almost) subset of the 390 (and there is a spec for 370 ELF files!), I ended up including the 370 support as well.

It took a while to get the code written (z/Architecture has so many instructions!) and reviewed, but it finally happened… the commit just landed in the repository.

If you get the latest Illumos bits, you’ll be able to disassemble 370, 390, and z/Architecture binaries with style. For example:

$ dis -F strcmp hvf             
disassembly for hvf

strcmp()
    strcmp:      a7 19 00 00        lghi    %r1,0
    strcmp+0x4:  a7 f4 00 08        j       0x111aec
    strcmp+0x8:  a7 1b 00 01        aghi    %r1,1
    strcmp+0xc:  b9 02 00 55        ltgr    %r5,%r5
    strcmp+0x10: a7 84 00 17        je      0x111b16
    strcmp+0x14: e3 51 20 00 00 90  llgc    %r5,0(%r1,%r2)
    strcmp+0x1a: e3 41 30 00 00 90  llgc    %r4,0(%r1,%r3)
    strcmp+0x20: 18 05              lr      %r0,%r5
    strcmp+0x22: 1b 04              sr      %r0,%r4
    strcmp+0x24: 18 40              lr      %r4,%r0
    strcmp+0x26: a7 41 00 ff        tmll    %r4,255
    strcmp+0x2a: a7 84 ff ef        je      0x111ae0
    strcmp+0x2e: 18 20              lr      %r2,%r0
    strcmp+0x30: 89 20 00 18        sll     %r2,%r0,24(%r0)
    strcmp+0x34: 8a 20 00 18        sra     %r2,%r0,24(%r0)
    strcmp+0x38: b9 14 00 22        lgfr    %r2,%r2
    strcmp+0x3c: 07 fe              br      %r14
    strcmp+0x3e: a7 28 00 00        lhi     %r2,0
    strcmp+0x42: b9 14 00 22        lgfr    %r2,%r2
    strcmp+0x46: 07 fe              br      %r14

I am hoping that this will help document all the places that need to change when adding support for a new ISA to libdisasm.

Happy disassembling!

Interactivity During nightly(1), part 2

Back in May, I talked about how I increase the priority of Firefox in order to get decent response times while killing my laptop with a nightly build of Illumos. Specifically, I have been setting Firefox to the real-time (RT) scheduling class, which has a higher priority than most things on the system, so that it gets to run in a timely manner. This, of course, requires extra privileges.

Today, I realized that I was thinking about the problem the wrong way. What I really should be doing is lowering the priority of the build. This requires no special privileges. How do I do this? In my environment file, I include the following line:

priocntl -s -c FX -p 0 $$

This sets the nightly build script’s scheduling class to fixed (FX) and manually sets the priority to 0. From that point on, the nightly script and any processes it spawns run with a lower priority (zero) than everything else (which tends to be in the 40-59 range).

Dumping Memory in MDB

It doesn’t take much reading of documentation, other people’s blogs, and other random web search results to learn how to dump a piece of memory in mdb.

In the following examples, I’ll use the address fffffffffbc30a70. This just so happens to be an avl_tree_t on my system. We can use the ::dump command:

> fffffffffbc30a70::dump
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc30a70:  801dc3fb ffffffff 0087b1fb ffffffff  ................

Or we can use the adb-style /B command:

> fffffffffbc30a70/B
kas+0x50:       80      

We can even specify the amount of data we want to dump. ::dump takes how many bytes to dump, while /B takes how many 1-byte integers to dump (while for example, /X takes how many 4-byte integers to dump):

> fffffffffbc30a70,20::dump
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc30a70:  801dc3fb ffffffff 0087b1fb ffffffff  ................
fffffffffbc30a80:  20000000 00000000 09000000 00000000   ...............
> fffffffffbc30a70,20/B
kas+0x50:       80      1d      c3      fb      ff      ff      ff      ff      0       87      b1      
                fb      ff      ff      ff      ff      20      0       0       0       0       0       
                0       0       9       0       0       0       0       0       0       0       

Things break down if we want to use a walker and pipe the output to ::dump or /B:

> fffffffffbc30a70::walk avl | ::dump
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc6d2e0:  00000000 00feffff 0000001e 03000000  ................
> fffffffffbc30a70::walk avl | /B
kpmseg:
kpmseg:         0       0       0       0       0       0       0       0       0       

Even though there are 9 entries in the AVL tree, ::dump dumps only the first one. /B does a bit better: it prints what appears to be the first byte of each. What if we want to dump more than just the first byte? Say, the first 32? ::dump has already proven useless, so let’s see what we can make /B do:

> fffffffffbc30a70::walk avl | 20/B
mdb: syntax error near "20"
> fffffffffbc30a70::walk avl | ,20/B
mdb: syntax error near ","

No luck.

Solution

Ok, it’s time for the trick that makes it all work. You have to use the ::eval dcmd. For example:

> fffffffffbc30a70::walk avl | ::eval .,20::dump
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc6d2e0:  00000000 00feffff 0000001e 03000000  ................
fffffffffbc6d2f0:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc34960:  00000000 00ffffff 00000017 00000000  ................
fffffffffbc34970:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc31ce0:  00000017 00ffffff 00000080 00000000  ................
fffffffffbc31cf0:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc35a80:  00000097 00ffffff 0000a0fc 02000000  ................
fffffffffbc35a90:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc34880:  0000a0d3 03ffffff 00000004 00000000  ................
fffffffffbc34890:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc31d60:  0000a0d7 03ffffff 000060e8 fb000000  ..........`.....
fffffffffbc31d70:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc7f3a0:  000000c0 ffffffff 00b07f3b 00000000  ...........;....
fffffffffbc7f3b0:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc7de60:  000080fb ffffffff 00105500 00000000  ..........U.....
fffffffffbc7de70:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc7e000:  000080ff ffffffff 00004000 00000000  ..........@.....
fffffffffbc7e010:  00000000 00000000 200ac3fb ffffffff  ........ .......

Perfect! ::eval makes repetition with /B work as well:

> fffffffffbc30a70::walk avl | ::eval .,8/B
kpmseg:
kpmseg:         0       0       0       0       0       fe      ff      ff
kvalloc:
kvalloc:        0       0       0       0       0       ff      ff      ff
kpseg:
kpseg:          0       0       0       17      0       ff      ff      ff
kzioseg:
kzioseg:        0       0       0       97      0       ff      ff      ff
kmapseg:
kmapseg:        0       0       a0      d3      3       ff      ff      ff
kvseg:
kvseg:          0       0       a0      d7      3       ff      ff      ff
kvseg_core:
kvseg_core:     0       0       0       c0      ff      ff      ff      ff
ktextseg:
ktextseg:       0       0       80      fb      ff      ff      ff      ff
kdebugseg:
kdebugseg:      0       0       80      ff      ff      ff      ff      ff

/nap

There is one more trick I want to share in this post. Suppose you have a mostly useless core file, and you want to dump the stack. Not as hex, but rather as a symbol + offset (if possible). The magic command you want is /nap. ‘/’ for printing, ‘n’ for a newline, ‘a’ for the symbol + offset of “dot”, and ‘p’ for the symbol (or address) of the value at “dot”. (Formatting differences aside, ‘a’ prints the pointer—“dot”, and ‘p’ prints the value being pointed to—*“dot”.)

For example:

> fd94e3a8,8/nap
0xfd94e3a8:     
0xfd94e3a8:     0xfd94f5a8      
0xfd94e3ac:     libzfs.so.1`namespace_reload+0x394
0xfd94e3b0:     0xfdd6ce28      
0xfd94e3b4:     0xfdd6a423      
0xfd94e3b8:     0xcc            
0xfd94e3bc:     libzfs.so.1`__func__.16928
0xfd94e3c0:     0xfdd6ce00      
0xfd94e3c4:     0xfdd6ce28      

Since the memory happens to be part of the stack, there are no symbols associated with it and therefore the ‘p’ prints a raw hex value.

So, remember: if you have a core file and you think that you need to dump the stack to scavenge for hopefully useful values, you want to…nap. :)

Interactivity During nightly(1)

Every so often, I do a nightly build of illumos on my laptop. This is a long and very CPU intensive process. During the build (which takes about 2.75 hours), the load average is rarely below 8 and most of the time it hovers in the low twenties. (This is a full debug and non-debug build with lint and all the other checking. If I need a build quickly, I can limit it to just what I need and then we’re talking minutes or seconds.)

Anyway, as you might imagine this sort of load puts some pressure on the CPUs. As a result, some interactive processes suffer a bit. Specifically, Firefox doesn’t get enough cycles to render the modern web (read: JavaScript cesspool). Instead of suffering for almost three hours, I just change Firefox’s scheduling class from IA (interactive) to RT (real time):

# priocntl -s -c RT `pgrep firefox`

This allows Firefox to preempt just about everything on my laptop. This works because Firefox actually yields the CPU properly. This will probably bite me sometime in the future when I end up on a page with such a busted JavaScript turd that it takes over a CPU and won’t let go. Till then, I have a pretty good workaround.

Raspberry Pi Bootloader

As I mentioned previously, I decided to do some hardware hacking and as a result I ended up with a Raspberry Pi B+. After playing with Raspbian for a bit, I started trying to read up on how the Pi boots. That’s what today’s post is about.

Standalone UART

While searching for information about how the Pi boots, I stumbled across a git repo with a standalone UART program. Conveniently, the repo includes ELF files as well as raw binary images. (This actually made me go “ewww” at first.)

Before even running it, I looked at the code and saw that it prints out 0x12345678 followed by the PC of one of the first instructions of the program. Pretty minimal, but quite enough.

Boot Process

Just about everyone on the internet (and their grandmothers!) knows that the Pi has a multi stage boot. First of all, it is the GPU that starts executing first. The ARM core just sits there waiting for a reset.

The Pi requires an SD card to boot — on it must be a FAT formatted partition with a couple of files that you can download from the Raspberry Pi Foundation.

Here’s the short version of what happens:

  1. Some sort of baked-in firmware loads bootcode.bin from the SD card and executes it.
  2. bootcode.bin does a bit of setup, and then loads and executes start.elf.
  3. start.elf does a whole lot of setup.

    1. Reads config.txt which is a text file with a bunch of options.
    2. Splits the RAM between the GPU and CPU (with the help of fixup.dat).
    3. Loads the kernel and the ramdisk.
    4. Loads the device tree file or sets up ATAGs.
    5. Resets the ARM core. The ARM core then begins executing the kernel.

This sounds pretty nice, right? For the most part, it is. But as they say, the devil’s in the details.

Booting ELF Files

It doesn’t take a lot of reading to figure out that start.elf can deal with kernel files in ELF format. This made me quite happy since I’ve written an ELF loader before for HVF and while it wasn’t difficult, I didn’t want to go through the exercise just to get something booting.

So, I copied uart02.elf to the SD card, and made a minimal config file:

kernel=uart02.elf

A power-cycle later, I saw the two hex numbers. (Ok, this is a bit of a distortion of reality. It took far too many reboots because I was playing with multiple things at the same time — including U-Boot which turned out to be a total waste of my time.)

The PC was not what I expected it to be. It was 0x8080 instead of 0x800c. After a lot of trial and error, I realized that it just so happened that the .text section is 0x74 bytes into the ELF file. (The entry point, 0x800c, is 0xc bytes into .text, so with the raw file loaded at 0x8000 those bytes end up at 0x8000 + 0x74 + 0xc = 0x8080.) Then I had a horrifying thought… start.elf understands ELF files enough to jump to the right instruction, but it does nothing to lay the contents of the ELF file out properly in memory. It just reads the file into memory starting at address 0x8000, gets the start address from the ELF header, converts it into a file offset, and jumps to that location in memory. This is wrong.

Sure enough, booting uart02.bin printed the right number. So much for avoiding writing my own ELF loader.
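
For comparison, a loader that honors the program headers is not much code. A minimal sketch for 32-bit images (error checking omitted; this is not the actual HVF loader):

#include <elf.h>
#include <stdint.h>
#include <string.h>

/*
 * 'image' points at the raw ELF file in memory.  Copy each PT_LOAD
 * segment to the address it was linked for, zero the BSS portion, and
 * return the real entry point.
 */
static void *
elf32_load(const char *image)
{
	const Elf32_Ehdr *eh = (const Elf32_Ehdr *)image;
	const Elf32_Phdr *ph = (const Elf32_Phdr *)(image + eh->e_phoff);
	int i;

	for (i = 0; i < eh->e_phnum; i++) {
		if (ph[i].p_type != PT_LOAD)
			continue;

		/* place the segment at its virtual address */
		memcpy((void *)(uintptr_t)ph[i].p_vaddr,
		    image + ph[i].p_offset, ph[i].p_filesz);

		/* zero-fill whatever the file doesn't cover (.bss) */
		memset((char *)(uintptr_t)ph[i].p_vaddr + ph[i].p_filesz,
		    0, ph[i].p_memsz - ph[i].p_filesz);
	}

	return ((void *)(uintptr_t)eh->e_entry);
}

(In practice the image has to be read into an area that does not overlap the load addresses, otherwise the copies would stomp on the source.)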

Ramdisk

Once I had a reliable way to get code to execute and communicate via the serial port, I started playing with ramdisks. The code I was booting parsed the ATAGs and looked for ATAG_INITRD2 which describes the location of the ramdisk.
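
The ATAG list is just a sequence of (size, tag) headers with payloads, so the lookup is a short loop. Roughly (a sketch, not the exact code I was booting; the tag values are the standard ARM ones):

#include <stdint.h>

#define	ATAG_NONE	0x00000000	/* terminates the list */
#define	ATAG_INITRD2	0x54420005	/* physical address & size of ramdisk */

struct atag_hdr {
	uint32_t size;	/* tag size in 32-bit words, including this header */
	uint32_t tag;
};

static const struct atag_hdr *
find_atag(const struct atag_hdr *h, uint32_t want)
{
	while (h->tag != ATAG_NONE) {
		if (h->tag == want)
			return (h);

		/* advance by the tag's size, which is given in words */
		h = (const struct atag_hdr *)((const uint32_t *)h + h->size);
	}

	return (NULL);
}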

So, I extended my config to include the ramfsfile parameter to specify the filename of the ramdisk:

kernel=foo
ramfsfile=initrd

Reboot, aaaand…the code panicked because it couldn’t find ATAG_INITRD2. It was weird: the file was present, no misspellings, etc. After a while of rebooting and trying different things, I decided to use the initramfs option and just pick some arbitrary high address for the ramdisk to be loaded at. The config changed to:

kernel=foo
initramfs initrd 0x800000

Another reboot later, I was looking at a dump of the ATAG_INITRD2. Everything worked as expected! So, it turns out that the boot loader on the Pi is not capable of picking an address for the initial ramdisk by itself.

Command line

Finally, I just had to experiment with passing a command line string. I created a file called cmdline.txt on the SD card with some text. A reboot later, I saw that the dump of the ATAG_CMDLINE had some cruft prepended. The entire value of the ATAG looked like (with some spaces replaced by line breaks):

dma.dmachans=0x7f35 bcm2708_fb.fbwidth=656 bcm2708_fb.fbheight=416
bcm2708.boardrev=0x10 bcm2708.serial=0xbd225074
smsc95xx.macaddr=B8:27:EB:45:74:FA bcm2708_fb.fbswap=1
bcm2708.disk_led_gpio=47 bcm2708.disk_led_active_low=0
sdhci-bcm2708.emmc_clock_freq=250000000 vc_mem.mem_base=0x1ec00000
vc_mem.mem_size=0x20000000  foo

This isn’t exactly the worst thing, but it does mean that the option parsing code has to handle cruft prepended to what it would expect the command line to look like. I wish that these settings were passed via a separate ATAG.

Debugging with mdb

Recently, Theo Schlossnagle posted two interesting articles about debugging on Illumos using mdb. They are “MDB, CTF, DWARF, and other angelic things” and “mdb custom dmods”.
