Josef “Jeff” Sipek

Explicit, Automatic, Magical, and Manual

On several occasions, I expressed opinions about what I consider good and bad ideas as far as sysadmin-friendly interfaces are concerned. Recently, I had a reason to try to organize those thoughts a bit more, and so I decided to write them down in this post.

You may have heard me say that I don’t like “magical behavior”. Instead, I want everything system software does to be explicit. The rest of this post is going to be about these words and my mantra when designing sysadmin-friendly interfaces.

First of all, explicit does not mean manual. Explicit behavior is simply something the sysadmin has to ask for. A manual behavior is something the sysadmin has to do by hand.

The opposite of explicit is magical. That is, something whose behavior varies depending on subtle differences in “some state” of something related.

The opposite of manual is automatic. That is, repetitive actions are performed by the computer instead of the human operator.

For the tl;dr crowd:

explicit & automatic = good

magical & manual = bad

To “prove” this by example, let me analyze the good ol’ Unix rm command.

By default, it will refuse to remove a directory. You have to explicitly tell it that it is ok to do so by using the -d flag (either directly or implicitly via the -r flag).

The command does not try to guess what you likely intended to do—that’d be magical behavior.

At the same time, rm can delete many files without manually listing every single one. In other words, rm has automation built in, and therefore it isn’t manual.
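
To make this concrete, here is a sketch of a typical session (the exact error text varies between rm implementations):

$ mkdir scratch
$ rm scratch
rm: cannot remove 'scratch': Is a directory
$ rm -d scratch    # explicit: directory removal was asked for
$ rm *.tmp         # automatic: many files, none listed by hand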

Makes sense? Good.

What does this mean for more complicated software than rm? Well, I came up with these “rules” to guide your design:

  1. Avoid Magical Behavior
  2. Error Out when Uncertain
  3. Provide Interfaces and Tools
  4. Create Low-level Primitives
  5. Avoid Commitment
  6. Be Consistent

Let me go through each of these rules and explain what I mean. I jump between examples of APIs and user (sysadmin) interfaces. In many ways, the same ideas apply to both and so I reach for whichever is easier to talk about at the time.

1. Avoid Magical Behavior

Avoid magical behavior by not guessing what the user may have intended.

Just like installing a second web browser shouldn’t change all your settings, installing a new RDBMS shouldn’t just randomly find some disk space and reformat it for its use. Similarly, when a new host in a cluster starts up, the software has no way of knowing what the intent is. Is it supposed to go into production immediately? What if it is Friday at 5pm? Would it make sense to wait till Monday morning? Is it supposed to be a spare?

The software should give the sysadmin the tools to perform whatever actions may be needed but leave it up to the sysadmin to decide when to do them. This is very similar to the idea of Wikipedia article: separation of mechanism and policy.

So, won’t this create a lot of work for the sysadmin? Glad you asked! Keep reading to find out why it doesn’t ;)

2. Error Out when Uncertain

Err on the side of caution and error out if the user’s intent isn’t clear.

Error out if you aren’t sure what exactly the user intended. This is really a form of the first rule—avoid magical behavior.

It is better to be (slightly) annoying to use than to misinterpret the user’s intentions and lose data. “Annoying to use” can be addressed in the future with new commands and APIs. Lost data cannot be “unlost” by code changes.
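
As a sketch of what erroring out looks like in practice (the command and snapshot names below are invented for illustration):

$ restore data@2021-01
error: "data@2021-01-03" and "data@2021-01-17" both match; specify the full snapshot name

Slightly annoying, yes, but nothing was guessed and nothing was lost.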

3. Provide Interfaces and Tools

Provide interfaces and tools to encapsulate implementation details.

If the installation instructions for an operating system included disk byte offsets and values to store there, you’d either think that it is insane or that you are living in the 1970s and you just got a super fancy 8-bit computer with a (floppy) disk drive.

Any modern OS installer will encapsulate all these disk writes by several layers of abstractions. Disk driver, file system, some sort of mkfs utility, and so on. Depending on the intended users’ skill level, the highest abstraction visible may be a fully functional shell or just a single “Install now” button.

Similarly, a program that requires a database should provide some (explicit) “initialize the database” command instead of requiring the user to run manual queries. (Yes, there is software requiring setup steps like that!) Or in the “new host in a cluster” scenario, the new host should have an “add self to cluster” command for the sysadmin.

With these interfaces and commands, it is possible to automate tasks if the need arises. For example, since the cluster admin already has some form of provisioning or configuration management tool, it is rather easy to add the “add self to cluster” command invocation to the existing tooling. Whether or not to automate this (as well as when exactly to run the command) is a matter of policy and therefore shouldn’t be dictated by the developer.
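
For example (the command name and flag below are hypothetical, for illustration only), the same explicit command serves both the by-hand case and the automated case:

$ cluster-admin add-self --role spare                # run by hand today...
$ ssh $newhost cluster-admin add-self --role spare   # ...or from the provisioning tooling tomorrow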

4. Create Low-level Primitives

Err on the side of caution and create (reasonably) low-level primitives.

Different tasks benefit from different levels of abstraction. The higher the abstraction level, the less flexible it is, but the easier it is to use—provided that it does exactly what you want. If what you want to do is not quite what the level of abstraction provides, it can be very difficult (or outright impossible) to accomplish what you are after.

Therefore, it is better to have a solid lower-level abstraction that can be built on rather than a higher-level abstraction that you have to fight with.

Note that these two aren’t mutually exclusive; it is possible to have a low-level abstraction with a few higher-level primitives that help with common tasks.

Consider a simple file access API. We could implement functions to delete a single file, delete a set of files, delete recursively, and so on. This would take a lot of effort and would create a lot of code that needs to be maintained. If you are uncertain what the users will need, do the simplest thing you expect them to need. For example, give them a way to delete one file and to list files. Users are clever, and before long they’ll script ways to delete multiple files or to delete recursively.
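
For example, given nothing more than hypothetical fs list, fs is-dir, and fs delete primitives (all invented for illustration; assume fs list prints one full path per line), a recursive delete is a few lines of shell:

rmtree() {
	fs list "$1" | while read f; do
		if fs is-dir "$f"; then
			rmtree "$f"	# descend into subdirectories first
		else
			fs delete "$f"
		fi
	done
	fs delete "$1"	# the directory itself is empty by now
}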

Then, when you have some user feedback (“I always end up writing the same complicated command over and over”), you can revisit the set of provided primitives, and add or modify as needed.

It doesn’t matter if you are providing a file API, a cluster management API, or something else: providing some form of Wikipedia article: create, read, update, delete, and list API for each “thing” you expect the users to operate on is sufficient to get going. Of course, the type of object will dictate the exact set of operations. For example, better command names may be add/remove (cluster node) instead of create/delete.

5. Avoid Commitment

Err on the side of caution and do not commit to support APIs and other interfaces.

It is essentially impossible to predict which APIs or other interfaces will actually end up being useful. Maintaining seldom-used interfaces that have only a handful of users can become a huge burden (in time and cost). Unfortunately, users like being able to rely on functionality not going away. Therefore, for your own sanity, make it clear:

  1. Which interfaces are supported and which may change or disappear without any warning.
  2. When supported interfaces may change (e.g., major versions may break all APIs).
  3. What behavior of a supported interface is supported and what is merely an implementation detail.

The first two items are self-explanatory, but the last one requires a few extra words.

It is tempting to say that “function foo is supported”, but that is the wrong way to do it. Rather, you want to say “function foo, which does only bar, is supported”.

For example, suppose that we have a function which returns an array of names (strings). Let’s also assume that it is convenient to keep track of those names internally using a balanced binary search tree. When we implement this get-names function, we are likely to simply iterate the tree appending all the names to the output array. This results in the output being sorted because of the tree-based implementation.

Now, consider these two possible statements of what is supported. First, a bad one:

Function get-names is supported. The function returns all names.

And now a better one:

Function get-names is supported. The function returns all names in an unspecified order.

Using the first (bad) description, it is completely reasonable for someone to start relying on the fact that the returned names are sorted. This ties our hands in multiple ways.

The second (better) description makes it clear that the order of the names can be anything and therefore the users shouldn’t rely on a particular order. It better communicates our intention of what is and what isn’t supported.

Now, say that we’d like to replace the tree with a hash table. (Maybe the tree insertion cost is too high for what we are trying to do.) This is a pretty simple change, but now our get-names output is unsorted. If we used the first description and some major consumer relied on the sorted behavior, we would have to add an O(n log n) sort to the end of get-names, making everyone pay the penalty instead of just the consumers that want sorted output. The second description, on the other hand, lets us make the hash table change freely.
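
With the better description, a consumer that wants order imposes it itself, and only that consumer pays for it (treating get-names as a utility for the sake of illustration):

$ get-names | sort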

So, be very explicit about what is and what isn’t supported.

I used a function in the above example, but the same applies to utilities and other tools. For example, it is perfectly reasonable to make the default output of a command implementation-dependent, but provide arguments that force certain columns, values, or units for the consumers that care.
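
This is the same convention followed by, for example, zfs(1M): human-friendly output by default, plus -H and -o flags for stable, scriptable output. A hypothetical command doing the same:

$ nodes list                    # default output; columns and order may change
$ nodes list -H -o name,state   # stable, tab-separated output for scripts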

Real World Example

A project I worked on a number of years ago had very simple rules for what was supported. Everything was unsupported, unless otherwise stated.

In short, we supported any API function and utility command that had a manpage that didn’t explicitly say that it was a developer-only interface or that the specific behavior should not be relied upon.

Yes, there have been a few instances where our users assumed something was supported when it wasn’t, and inevitably they filed bugs. However, we were able to tell them that there was a better (and supported) way of accomplishing their task. Once they fixed their assumptions, their systems became more robust and significantly less likely to break in the future.

6. Be Consistent

Finally, make your interfaces consistent.

Not only internally consistent (e.g., always use the same option name for the same behavior) but also consistent with the rest of the system (e.g., use the same option names that commonly used software uses). This is a rephrasing of the Wikipedia article: principle of least astonishment.

Summary

There are other ways of approaching interface design, but the only one I’ve seen that doesn’t turn into a mess (or doesn’t cause a user revolt) is what I tried to outline here. Specifically, start with a small, simple, consistent, and explicit interface that you barely support and evolve over time.

This was a long post with a lot of information, but it still barely scratches the surface of what could be said. Really, most of the rules could easily turn into lengthy posts of their own. Maybe if there is interest (and I find the time), I’ll elaborate on some of these.

2021-01-25

The Confusing World of USB

Amateur Radio License Map

Topographic maps

VHF/UHF Line of Sight Calculator

Hey What’s That — Line of sight calculator

Radio Mobile Online — RF coverage estimator

Telnet Access to DX Clusters and Reverse Beacon Network

FCC M3 Map of Effective Ground Conductivity in the US for AM Broadcast Stations

The Official U.S. Time

Retiring Guilt

It took me about 3 years to write this post. Partly because I had other things I wanted to work on and partly because I hoped that it wouldn’t be needed. Well, I finally decided that I really need to write this.

In short, I’m officially stopping work on Guilt.

Practically speaking, I haven’t touched it (as a developer) in over two years and as a user in about as long. So really, nothing will change.

What is Guilt?

I started writing Guilt in fall 2006 because I was working on unionfs and needed to maintain patches on top of the Linux kernel git repository—much like what the mq extension did with Mercurial repositories.

It all started with:

commit 664e5a7d7f8d2c2726f03a239de11fa00127cf84
Author: Josef Sipek <jsipek@thor.fsl.cs.sunysb.edu>
Date:   Mon Nov 6 13:08:30 2006 -0500

    Initial commit

That’s right, 14 years to the day.

Technically, the first few versions were called “gq” (which stood for “git quilt”) until someone pointed out that “GQ” was a well established GTK-based LDAP client.

Artifacts

If anyone wishes to resurrect this project, then by all means go for it. If not, the old content will remain online for as long as I have a web server. :)

Specifically, you can find everything up to and including the last release (v0.37-rc1) at the following locations:

Users

I know that Guilt has served a number of people quite well over the years. It’s been quite stable and mostly feature-complete since at least 2008, so I haven’t really been hearing from people, aside from the occasional patch or an occasional “oh yeah, I use that”.

To those users: I hope the last release works well enough for you until someone starts to maintain Guilt again or you find a different tool that suits your needs.

Email vs. Tool du Jour

TL;DR: Just because email is decades old doesn’t mean that it cannot serve a vital role in modern project management, research, development, and support.

Ultimately, working on a project requires communication—and lots of it. Communication with peers, with managers, with other departments within the company, and even with customers. It is tempting to grab the Tool du Jour and add it to your ever-growing arsenal of tools believing it will make communication easier. Often, it does not.

For example, let’s consider these tools: Jira, Confluence, Slack, Zoom, GitHub/GitLab, phone, and email.

Does your company use these tools or their equivalents? Isn’t it a bit overkill to have 7 different channels of communication? Sure, often one tool is better at a particular mode of communication than the others but there is a significant overlap.

Do you want a video chat? Do you use Slack or Zoom?

Voice chat? Slack, Zoom, or phone?

Do you want to ask a question related to a bug? Do you use Jira, Slack, or just set up a call? Voice or video? Or would email be best?

Do you keep track of your project via high-level Jira issues? Or do you use a set of Confluence pages where you include various semi-autogenerated plots?

Wikipedia article: Decision fatigue is real. Do you want your (rather expensive) employees to waste their cognitive capacity deciding which tool to use? Or do you want them to make the product better?

It is painful how many times over the past ten years I’ve witnessed conversations that went much like this:

A: Can you answer the question I left in Jira?
B: <B reads Jira question> Oh, that is answered on the ABC123 Confluence page.
A: Ah. Can you make a note of that in Jira? Thanks.

This example involves three communication channels—Jira, Confluence, and some chat system.

This sort of communication fragmentation is really bad. Not only does it waste a lot of time with exchanges like this example, it also makes searching for information essentially impossible. Who in their right mind would search half a dozen tools (with various degrees of search capability) for something? It is simply easier to just ask your coworkers. After all, their time is less valuable to you than your own time and sanity.

So, what can be done to improve things?

Well, if at all possible do not use tools that have duplicate functionality. If you have to, hopefully you can disable the duplicate functionality. If there is no way to disable it, then you must make it painfully clear where such communication should go. Hopefully this can be done via automated hooks that somehow notify the user. For example, automatically closing issues opened in the wrong bug tracker (e.g., opened in GitHub instead of Jira), or automatically responding to wiki commenters directing them to the proper medium for wiki discussion. Finally, if all else fails, have someone (ideally manager or team lead so the notification has some weight to it) manually make sure that anyone that uses the functionality is told not to.
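
As a minimal sketch of such a hook (this assumes an authenticated gh CLI running from a scheduled job inside the repository; the message would be adjusted to point at the correct tracker):

$ gh issue list --state open --json number --jq '.[].number' |
	while read n; do
		gh issue close "$n" --comment "Please file bugs in Jira, not GitHub."
	done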

This reduction in the number of tools should also help with responsiveness. It is no secret that Wikipedia article: the average human can hold only about 7 things in working memory at the same time. How many of those do you want to dedicate to tooling? If I have to remember to check 7 different tools periodically, one of two things happens: either I manage to check them all but accomplish nothing else, or I get things done but only remember about 2 or 3 tools.

That should help with quite a bit of the fragmentation. Now “all” that’s left to do is decide which communication channel is used for what.

I have concluded that there are four major levels of communication:

  1. important, synchronous
  2. important, asynchronous
  3. unimportant, synchronous
  4. unimportant, asynchronous

I’m using the term “synchronous” to mean that you want the back-and-forth latency to be low, and “important” to mean that you must have an answer. Note that “unimportant” does not mean off-topic, but rather lower priority.

Why make the synchronous/asynchronous distinction? For multiple reasons. First of all, being interrupted in the middle of something is costly. It takes a significant amount of time to get back “into the zone” but only a fraction of a minute to get out of it. Would you rather pay your employees to try to work or to actually work? And second, asynchrony makes communication across time zones easier. Not easy, but easier.

So, let me go through the four major levels of communication one by one and share my opinion about what works and why.

important, synchronous
If you want to have a (relatively) rapid back-and-forth, you pretty much have to use an in-person meeting or a voice/video call. A one-to-one (i.e., non-group) chat can also possibly work, but there will be temptation to multi-task. This desire to multi-task implies that the chat isn’t actually that important.
important, asynchronous
When you don’t require having the answers immediately or when it simply isn’t possible to get everyone in the same “room” at the same time for a meeting (in person, voice, or video), email is probably the best communication method. Each person can read it and possibly reply at the most convenient time for them.
unimportant, synchronous
This is the form of communication that includes various chit-chat, sanity-checking polls (e.g., “would anyone object if I tried xyz?”), and so on. It lets you quickly get bits of information, but in a way it is unreliable. Not everyone is reading the chat when you say something and when it scrolls off the screen it is as if you never said it. In other words, do not expect anyone to read the group chat messages from when they were away. If you want someone specific or even everyone to see a particular message, it is not an unimportant message. One-to-one chat is a little different since it is more “reliable”, but usually anything substantial that is important will end up with a call instead.
unimportant, asynchronous
Finally, all the things you’d like others to see at some point in the future should be sent as an email. The recipients will read it when they get to it, and since it isn’t important it probably doesn’t even require a reply.

These four levels are, of course, not set in stone. It is possible (and I’d even encourage it) to upgrade or downgrade your communication as needed. For example, it is perfectly reasonable to ask in chat if there are objections or obvious issues with a particular approach, function, or workload. Then, if the responses don’t make it obviously a terrible idea but a more definitive discussion is desired, a similar (but more detailed) version can be sent via email. In essence, upgrading it from “unimportant synchronous” to “important asynchronous”. (Caution: don’t overdo these upgrades/downgrades.)

As you can see, I think that email is a good choice for any asynchronous communication. That’s for good reasons. Everyone has an email address, everyone knows how to use it (at least a little), and the free-form format allows you to use the most appropriate content type to get your point across—be it ASCII art, images, or even Excel spreadsheets. In other words:

Email is ubiquitous.

Email works remarkably well.

Email is extremely flexible.

As a real world example, consider that pretty much every company-wide announcement (important or not) has been made either in a huge meeting or via email. Often the meeting-time announcements are followed up by an email anyway! It’s not a chat message. It’s not a Confluence page. It’s not a Jira issue. It’s an email.

Before I conclude, I’d like to address two slightly more specific cases.

First, what about issue tracking? How does that tie into my email-centric world? Well, you can keep your issue tracker, but in my opinion, the comments feature should not be used. If a ticket needs to be discussed, send an email, set up a conference call, whatever works for you—just don’t use the comments on the issue. If you look at any issue in your issue tracker, the comments will fall into one of two categories. Either there are none or there are many, and it is painfully clear that people don’t read them and ask the same questions over and over. Instead of burying progress reports or updates to the understanding of the issue in a comment that nobody will ever read, that information should be used to reword the issue description. The same largely applies to other tools’ comments sections as well.

Second is a concern that people will not read all those emails. I think this is only a problem if there are too many tools and email isn’t viewed as an important one. If (unrealistically) all communication happens through email, then right after communicating with someone, the person is already in the right tool to handle the next communication. If code reviews, support requests, and everything else were to go to the same place, there is nearly zero context switching cost. Even if the person goes to use a different tool (e.g., a text editor or an IDE), when that work is done, they’ll return to their email client. In other words, if you make the email communication channel important, your emails will get read. If you don’t make it important, then you (individually) are better off using a channel the recipient considers important. In an environment with too many tools, each recipient may have a different preference.

Just to make it painfully clear, I am not advocating killing off everything except email. Instead, I’m advocating killing off tools that duplicate functionality, and shifting all asynchronous communication to a single medium. In my experience, the most efficient (and least disruptive) asynchronous communication medium is email. And therefore it should not only be one of the tools that survives the culling, but it also should be the one that is embraced afterwards.

That’s it for today. In the next post, I’ll talk about what I consider the ideal code review workflow.

FreeBSD Sound: ALSA & Qt

Sound in FreeBSD is somewhat complicated because of the various portability and compatibility shims. Last week, I hit an annoying-to-diagnose situation: I plugged in a USB sound card, and while the kernel and some applications detected it just fine, other applications didn’t seem to notice it at all.

At first, I thought it was a Qt issue since only Qt applications appeared broken. But then, mere minutes before emailing a FreeBSD mailing list, I managed to find a hint that it was likely an ALSA on FreeBSD issue. Some searching later, I learned that in order for ALSA to see the device, it needed a mapping to the actual OSS device.

So, after adding the following to ~/.asoundrc, any ALSA application (and therefore any Qt application) that tries to list the sound devices will see a “ft991a” device:

pcm.ft991a {
	# map an ALSA PCM named “ft991a” to the OSS device node
	type oss
	device /dev/dsp3
}

To make it more explicit, without adding the above stanza to .asoundrc:

  1. OSS applications work fine.
  2. PortAudio applications work fine.
  3. ALSA applications do not see the device.

With the stanza, everything seems to work.

Installing Debian under FreeBSD's bhyve

This weekend I tried to move my work dev vm to a slightly beefier vm host. This meant moving the vm from kvm on OmniOS to bhyve on FreeBSD 12.1. After moving the disk images over, I tried to configure vm-bhyve to use them as-is. Unfortunately, grub2-bhyve fails to boot XFS root file systems with CRC support and there was no good way to convert the disk image to get it to boot via UEFI instead (to avoid grub2-bhyve completely). So, I decided to simply reinstall it.

In theory, this shouldn’t have been too difficult since I had the foresight to have /home on a separate virtual disk. In practice, I spent several hours reinstalling Debian Buster over and over, trying to figure out why it’d install and reboot fine, but subsequent boots wouldn’t make it past the EFI firmware.

It turns out that Debian tries to make multi-booting possible and puts its EFI binary into a non-standard location. That, combined with bhyve not persisting EFI variables after shutdown, results in any boot after the first poweroff not even trying to look at the Debian-specific path.

This is not a Debian bug, but rather bhyve’s EFI support being incomplete. The easiest way around this is to copy the Debian binary into the standard location immediately after installation. In other words:

# cd /boot/efi/EFI
# mkdir BOOT
# cp debian/grubx64.efi BOOT/bootx64.efi

That’s all that’s needed. The downside to this is that the copy will not get automatically upgraded when grub gets an update.

For completeness, here are the relevant bits of the vm’s config:

loader="uefi"
graphics="yes"
cpu="6"
memory="1G"
network0_type="virtio-net"
network0_switch="public"
zfs_zvol_opts="volblocksize=4096"
disk0_dev="sparse-zvol"
disk0_type="virtio-blk"
disk0_name="disk0"
disk1_dev="sparse-zvol"
disk1_type="virtio-blk"
disk1_name="disk1"

Flight Planning My Cruise Power

When I was working on my private pilot certificate, there was one thing that was never satisfactorily explained to me: how to select the “right” line of the cruise performance table in the POH. Now that I’m a few years older and wiser, I thought I’d write up an explanation for those who, like me six years ago, aren’t getting a good enough answer from their CFIs.

I did my training in a Cessna 172SP, and so the table was relatively simple:

Reading it is trivial. Pick your cruise altitude, then pick the RPM that the instructor told you to use for cruising (e.g., 2200). Now, read across to figure out what your true airspeed and fuel flow will be. That is all there is to it.

When I got checked out in the club’s 182T, things got more confusing. The table itself got split across multiple pages of the POH because of the addition of a new variable: manifold pressure (MP).

The table works much the same way as before. First, select the table based on which altitude you’ll be cruising at, then pick the RPM and manifold pressure, and read across the true airspeed and fuel flow.

On the surface (bad pun intended), this seems like a reasonable explanation. But if you look closely, there are multiple combinations of RPM and MP which give you the same performance. For example, in the above table both 2200/21” and 2400/20” give more or less the same performance. When I asked how to choose between them, all I got was a reminder to “keep the MP at or below the RPM.” It was thoroughly unsatisfying. So, I stuck with something simple like 2300/23”.

Fast forward to today. I fly a fixed gear Cessna Cardinal (177B). Its manual contains a table much like the one above for a 182. Here is a sample for 4000’:

As before, I started with something simple like 2300/23”, but eventually I had a moment of clarity. When flying the 172 and 182, I paid for Wikipedia article: Hobbs time. In other words, it was in my best interest to cruise as fast as possible without much regard for which exact RPM/MP combination I used (all within club and manufacturer limitations, of course).

My bill for the Cardinal is different—it is based on Wikipedia article: tach time. This means that the lower the RPM, the more slowly I’m spending money. So, like any other optimization problem, I want to find the right spot where my bill, my cruise speed, and my fuel flow (and therefore endurance) are all acceptable.

If the tach timer is calibrated to run at full speed at 2700 RPM, running the engine at only 2300 RPM equates to 85.2% while using 2400 equates to 88.9%.

So, say I’m flying for two hours. If I use 2400 RPM, I’ll be paying for 1.78 hours. On the other hand, if I use 2300 RPM at the same power output, I’ll be paying for 1.70 hours. Not a big difference, but after 24 hours at 2300 instead of 2400, I would have saved nearly a full hour of tach time.
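
The underlying ratios, as a quick bc(1) check:

$ echo 'scale=4; 2300/2700' | bc
.8518
$ echo 'scale=4; 2400/2700' | bc
.8888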

I don’t yet have enough data to verify these figures, but collecting it is on my todo list.

While composing this post, I happened to find an article by Mike Busch about why lower RPM is better. He makes a number of compelling points—reduced noise, better propeller efficiency, and fewer revolutions the engine has to make (which should improve the engine’s lifetime and therefore the overall cost). I have to admit that Mike’s points seem more compelling than the small savings I’ve calculated above.

2020-05-06

OpenMCT — While I’m not a fan of web-based UIs, this is a rather neat “dashboard” framework by NASA.

Wideband spectrum received in JO32KF — Over 5 years of HF spectrum waterfall in Enschede, NL.

10 Most(ly dead) Influential Programming Languages

Wikipedia article: PACELC theorem — An extension of the Wikipedia article: CAP theorem.

Learn Rust the Dangerous Way — Finally a Rust tutorial that speaks to people comfortable in C.

Interferometry and Synthesis in Radio Astronomy — An open access book.

Aviation Formulary — Great circle math applied to various aviation problems for those too lazy to derive the formulas themselves.

Papírová platidla Československa 1918-1993, České republiky a Slovenské republiky 1993-2016 — Complete list of all bank notes used in Czechoslovakia, Czech Republic, and Slovak Republic.

NOAA GOES Image Viewer — Wikipedia article: GOES weather satellite imagery.

Building FreeBSD Binary Packages

On my laptop, I use the binary packages provided by FreeBSD ports. Sometimes however, I want to rebuild a package because I want to change an option (for example, recently I wanted to set DEBUG=on for mutt).

While this is very easy, for whatever reason I can never find a doc with a concise set of steps to accomplish it.

So, for the next time I need to do this:

# portsnap fetch
# portsnap update
# cd /usr/ports/some/thing
# make showconfig  # view the current options
# make config      # change options, if needed
# make rmconfig    # or reset them to defaults
# make clean       # as needed
# make package
# pkg install work/pkg/*.txz

That’s all there is to it.

Unleashed: The Birth (and Death) of an Operating System Fork

I realize that I haven’t mentioned Unleashed on my blahg before. The major reason for it was that whenever I thought of doing that, I ended up working on Unleashed instead. Oops! :)

For those that don’t know, Unleashed was an operating system fork created by me and Lauri in 2016, and one that we maintained until now. I said “was” because earlier this month we made the last release. Instead of trying to summarize everything again, let me just quote the relevant parts of the announcement email I sent out on April 4th:

This is the fifth and final release of Unleashed—an operating system fork of illumos. For more information about Unleashed itself and the download links, see our website [1].

That is right, we are pulling the plug on this project. What began as a hobby project for two developers never grew much beyond being a hobby project for two developers. But after nearly 4 years of work, we proved that the illumos code base can be cleaned up significantly, its APIs modernized, and the user experience improved in general.

While we’ve made too many changes [2] to list them all here, I’d like to highlight what I think are some of our major accomplishments:

  • shrinking the codebase by about 25% (~5.6 million lines of code) even though we imported major components (e.g., libressl, openbsd ksh)
  • reducing build times drastically (from ~40 minutes to ~16 minutes on a 2012-era 4 core CPU)
  • changing uname to Unleashed and amd64 (from SunOS 5.11 i86pc)

In addition to the projects we finished, we were well on the way to several other improvements that we simply haven’t gotten around to completing. Some of the more notable ones include:

  • page cache rewrite (~3/5 done)
  • modernizing the build system with bmake / removing dmake (~1/5 done)
  • everything 64-bit only (~4/5 done)

All that we’ve accomplished is just the tip of the iceberg of what remains to be done. Unfortunately for Unleashed, we both decided that we wanted to spend our free time on different projects.

I know I’m biased, but I think we’ve made some good changes in Unleashed and I’d love to see at least some of them make their way into illumos-gate or one of the illumos distros. So, I hope that someone will have the time and interest to integrate some of these changes.

Finally, I’d like to encourage anyone who may be interested not to be afraid of forking an open source project. Even if it doesn’t work out, it is extremely educational.

Happy hacking!

Jeff.

[1] https://www.unleashed-os.org
[2] https://repo.or.cz/unleashed.git/blob/HEAD:/FEATURES.txt

What Unleashed was or wasn’t is described on the website, the README, the features file, and the mailing list archives. The history leading up to Unleashed is essentially undocumented. I am dedicating the rest of this post to that purpose.

Jeffix (2015–2016)

Before Unleashed there was Jeffix.

I made an extremely indirect mention of Jeffix on Twitter in 2015, and one direct mention in a past blahg post in 2016.

So, what exactly was Jeffix? Was it a distro? Not quite. It was more of an overlay for OpenIndiana Hipster. You weren’t able to install Jeffix. Instead, you had to install Hipster and then add the Jeffix package repository (at a higher priority), followed by an upgrade.

Why make it? At the time I used OpenIndiana (an illumos distro) on my laptop. While that was great for everyday work, being an illumos developer meant that I wanted to test my changes before they made it upstream. For about a year, I was running various pre-review and pre-RTI bits. This meant that the set of improvements I had available changed over time. Sometimes this got frustrating since I wouldn’t have certain changes just because they didn’t make it through the RTI process yet and I was testing a different change.

So, in October 2015, I decided to change that. I made a repo with an assortment of changes: some were mine, while others were authored by other developers in the community. I called this modified illumos Jeffix. Very creative, I know.

I kept this up until May 2016. Today, I have no idea why I stopped building it then.

To Fork or Not To Fork

For the three years leading up to Unleashed, I spent a considerable amount of time thinking about what would make illumos better—not just on the technical side but also the community side.

It has always been relatively easy to see that the illumos community was not (and still is not) very big. By itself this isn’t a problem, but it becomes one when you combine it with the other issues in the community.

The biggest problem is the lack of clear vision for where the project should go.

Over the years, the only idea that I’ve seen come up consistently could be summarized as “don’t break compatibility.” What exactly does that mean? Well, ask 10 people and you’ll get 12 different opinions.

Provably hostile

There have been several times where I tried to clean up an interface in the illumos kernel, and the review feedback amounted to “doesn’t this break on $some-sparc-hardware running kernel modules from Solaris?”

In one instance (in August 2015), when I tried to remove an ancient specialized kernel module binary compatibility workaround that Sun added in Solaris 9 (released in 2002), I was asked about what turned out to be a completely ridiculous situation—you’d need to try to load a kernel module built against Solaris 8 that used this rather specialized API, an API that has been changed twice in incompatible ways since it was added (once in 2005 by Sun, and once in April 2015 by Joyent).

Unsurprisingly, that “feedback” completely derailed a trivial cleanup. To this day, illumos has that ugly workaround for Solaris 8 that does absolutely nothing but confuse anyone that looks at that code. (Of course, this cleanup was one of the first commits in Unleashed.)

While this was a relatively simple (albeit ridiculous) case—I just had to find a copy of old Solaris headers to see when things changed—it nicely demonstrates the “before we take your change, please prove that it doesn’t break anything for anyone on any system” approach the community takes.

Note that this is well beyond the typical “please do due diligence” where the reviewers tend to help out with the reasoning or even testing. The approach the illumos community takes feels more like a malicious “let’s throw every imaginable thought at the contributor, maybe some of them stick.” Needless to say, this is a huge motivation killer that makes contributors leave—something that a small-to-begin-with community cannot afford.

Toxic

In the past, there have been a number of people in the illumos community that were, in my opinion, outright toxic for the project. I’m happy to say that a number of them have since left the community. Good riddance.

What do I mean by toxic?

Well, for instance, there was the time in 2014 when someone decided to contribute to a thread about removing SunOS 4.x compatibility code (that is, binary compatibility with an OS whose last release was in 1994) with only one sentence: “Removing stuff adds no value.”

Elsewhere in the same thread, another person chimed in with his typical verbiage that could be summarized as “why don’t you do something productive with your time instead, and work on issues that I think are important.” While his list of projects was valid, being condescending to anyone willing to spend their free time to help out your project or telling them that they’re wasting their time unless they work on something that scratches your or your employer’s itch is…well…stupid. Yet, this has happened many times on the mailing list and on IRC over the years.

Both of these examples come from the same thread simply because I happened to stumble across it while looking for another email, but rest assured that there have been plenty of other instances of this sort of toxic behavior.

The Peanut Gallery

Every project with enough people on the mailing list ends up with some kind of a Wikipedia article: peanut gallery. The one in illumos is especially bad. Every time a removal of something antique is mentioned, people that haven’t contributed a single line of code would come out of the woodwork with various forms of “I use it” or even a hypothetical “someone could be using it to do $foo”.

It takes a decent amount of effort to deal with those. For new contributors it is even worse as they have no idea if the feedback is coming from someone that has spent a lot of time developing the project (and should be taken seriously) or if it is coming from an obnoxiously loud user or even a troll (and should be ignored).

2020

All this combined results in a potent mix that drives contributors away. Over the years, I’ve seen people come, put in reasonable effort to attempt to contribute, hit this wall of insanity, and quietly leave.

As far as I can tell, some of the insanity is better now—many of the toxic people left, some of the peanut gallery members started to contribute patches to remove dead code, etc.—but a lot of problems still remain (e.g., changes still seem to get stuck in RTI).

So, why did I write so many negative things about the illumos community? Well, it documents the motivation for Unleashed. :) Aside from that, I think there is some good code in illumos that should live on, but it can only do that if there is a community taking care of it—a community that can survive long term. Maybe this post will help with opening some eyes.

Unleashed

In July 2016, I visited Helsinki for a job interview at Dovecot. Before the visit, I contacted Lauri to see if he had any suggestions for what to see in Helsinki. In addition to a variety of sightseeing ideas, he suggested that we meet up for a beer.

At this point, the story continues as Lauri described it on the mailing list — while we were lamenting, Jouni suggested we fork it. It was obvious that he was (at least partially) joking, but since I had already been considering a fork, the idea resonated with me. After I got home, I thought about it for a while and ultimately decided that it was a good enough idea.

That’s really all there is to the beginning of Unleashed itself. While the decision to fork was definitely instigated by Jouni, the thought was certainly on my mind for some time before that.

Next

With Unleashed over, what am I going to do next?

I have plenty of fun projects to work on—ranging from assorted DSP attempts to file system experiments. They’ll get developed on my FreeBSD laptop. And no, I will not resume contributing to illumos. Maybe I’m older and wiser (and probably grumpier), but I want to spend my time working on code that is appreciated.

With all that said, I made some friends in illumos and I will continue chatting with them about anything and everything.
