Explicit, Automatic, Magical, and Manual
On several occasions, I have expressed opinions about what I consider good and bad ideas as far as sysadmin-friendly interfaces are concerned. Recently, I had a reason to organize those thoughts a bit more, and so I decided to write them down in this post.
You may have heard me say that I don’t like “magical behavior”. Instead, I want everything system software does to be explicit. The rest of this post is about those words and my mantra when designing sysadmin-friendly interfaces.
First of all, explicit does not mean manual. Explicit behavior is simply something the sysadmin has to ask for. A manual behavior is something the sysadmin has to do by hand.
The opposite of explicit is magical. That is, something whose behavior varies depending on subtle differences in “some state” of something related.
The opposite of manual is automatic. That is, repetitive actions are performed by the computer instead of the human operator.
For the tl;dr crowd:
explicit & automatic = good
magical & manual = bad
To “prove” this by example, let me analyze the good ol’ Unix rm command.
By default, it will refuse to remove a directory. You have to explicitly tell it that it is OK to do so by using the -d flag (either directly or implicitly via the -r flag).
The command does not try to guess what you likely intended to do—that’d be magical behavior.
At the same time, rm can delete many files without manually listing every single one. In other words, rm has automation built in, and therefore it isn’t manual.
Makes sense? Good.
What does this mean for more complicated software than rm? Well, I came up with these “rules” to guide your design:
- Avoid Magical Behavior
- Error Out when Uncertain
- Provide Interfaces and Tools
- Create Low-level Primitives
- Avoid Commitment
- Be Consistent
Let me go through each of these rules and explain what I mean. I jump between examples of APIs and user (sysadmin) interfaces. In many ways, the same ideas apply to both and so I reach for whichever is easier to talk about at the time.
1. Avoid Magical Behavior
Avoid magical behavior by not guessing what the user may have intended.
Just like installing a second web browser shouldn’t change all your settings, installing a new RDBMS shouldn’t just randomly find some disk space and reformat it for its use. Similarly, when a new host in a cluster starts up, the software has no way of knowing what the intent is. Is it supposed to go into production immediately? What if it is Friday at 5pm? Would it make sense to wait till Monday morning? Is it supposed to be a spare?
The software should give the sysadmin the tools to perform whatever actions may be needed but leave it up to the sysadmin to decide when to do them. This is very similar to the idea of separation of mechanism and policy.
So, won’t this create a lot of work for the sysadmin? Glad you asked! Keep reading to find out why it doesn’t ;)
2. Error Out when Uncertain
Err on the side of caution and error out if the user’s intent isn’t clear.
Error out if you aren’t sure what exactly the user intended. This is really a form of the first rule—avoid magical behavior.
It is better to be (slightly) annoying to use, than to misinterpret the user’s intentions and lose data. “Annoying to use” can be addressed in the future with new commands and APIs. Lost data cannot be “unlost” by code changes.
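To make this rule a bit more concrete, here is a minimal Python sketch of a removal helper that errors out instead of guessing. The names remove, allow_dir, and AmbiguousRequest are made up for this illustration; they are not taken from any real tool.

```python
import os


class AmbiguousRequest(Exception):
    """Raised when the caller's intent cannot be determined safely."""


def remove(path, allow_dir=False):
    """Remove a file; remove an empty directory only when explicitly allowed."""
    if os.path.isdir(path) and not os.path.islink(path):
        if not allow_dir:
            # Do not guess: refuse, and tell the caller how to ask explicitly.
            raise AmbiguousRequest(
                f"{path} is a directory; pass allow_dir=True to remove it"
            )
        os.rmdir(path)  # still fails if the directory is not empty
    else:
        os.unlink(path)
```

The refusal is mildly annoying, but a later version can always add a friendlier command on top. The reverse (recovering a directory that was deleted by a wrong guess) is not something a code change can fix.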
3. Provide Interfaces and Tools
Provide interfaces and tools to encapsulate implementation details.
If the installation instructions for an operating system included disk byte offsets and values to store there, you’d either think that it is insane or that you are living in the 1970s and you just got a super fancy 8-bit computer with a (floppy) disk drive.
Any modern OS installer will encapsulate all these disk writes by several layers of abstractions. Disk driver, file system, some sort of mkfs utility, and so on. Depending on the intended users’ skill level, the highest abstraction visible may be a fully functional shell or just a single “Install now” button.
Similarly, a program that requires a database should provide some (explicit) “initialize the database” command instead of requiring the user to run manual queries. (Yes, there is software requiring setup steps like that!) Or in the “new host in a cluster” scenario, the new host should have an “add self to cluster” command for the sysadmin.
With these interfaces and commands, it is possible to automate tasks if the need arises. For example, since the cluster admin already has some form of provisioning or configuration management tool, it is rather easy to add the “add self to cluster” command invocation to the existing tooling. Whether or not to automate this (as well as when exactly to run the command) is a matter of policy and therefore shouldn’t be dictated by the developer.
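A rough sketch of that split, in Python: the software ships the mechanism (an explicit add operation), and the decision about when to call it lives in the admin's own tooling. Cluster, add_node, provision, and join_now are all hypothetical names invented for this example.

```python
class Cluster:
    """Mechanism: provided by the software, never triggered implicitly."""

    def __init__(self):
        self.members = set()

    def add_node(self, host):
        """Add exactly one host; error out rather than silently ignore a repeat."""
        if host in self.members:
            raise ValueError(f"{host} is already a cluster member")
        self.members.add(host)


def provision(cluster, host, join_now):
    """Policy: written by the sysadmin, e.g. inside existing provisioning tooling."""
    # ... install packages, push configuration, and so on ...
    if join_now:
        cluster.add_node(host)
```

Whether join_now is true on a Friday at 5pm is then a decision made by the people running the cluster, not by the developer.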
4. Create Low-level Primitives
Err on the side of caution and create (reasonably) low-level primitives.
Different tasks benefit from different levels of abstraction. The higher the abstraction level, the less flexible it is, but the easier it is to use—but only if that’s exactly what you want to do. If what you want to do is not quite what the level of abstraction provides, it can be very difficult (or outright impossible) to accomplish what you are after.
Therefore, it is better to have a solid lower-level abstraction that can be built on rather than a higher-level abstraction that you have to fight with.
Note that these two aren’t mutually exclusive; it is possible to have a low-level abstraction with a few higher-level primitives that help with common tasks.
Consider a simple file access API. We could implement functions to delete a single file, delete a set of files, delete recursively, and so on. This would take a lot of effort and would create a lot of code that needs to be maintained. If you are uncertain what the users will need, do the simplest thing you expect them to need. For example, give them a way to delete one file and to list files. Users are clever, and before long they’ll script ways to delete multiple files or to delete recursively.
Then, when you have some user feedback (“I always end up writing the same complicated command over and over”), you can revisit the set of provided primitives, and add or modify as needed.
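As an illustration of the file API example, here is a small sketch of what “two primitives plus a clever user” might look like. TinyStore is a hypothetical in-memory stand-in for a file API, and delete_tree is the kind of helper a user could write on their own; neither comes from a real library.

```python
class TinyStore:
    """Hypothetical file API exposing only two primitives: list and delete."""

    def __init__(self, paths):
        self._paths = set(paths)

    def list(self, prefix=""):
        """Return every stored path starting with prefix, in unspecified order."""
        return [p for p in self._paths if p.startswith(prefix)]

    def delete(self, path):
        """Delete exactly one path; error out if it does not exist."""
        if path not in self._paths:
            raise FileNotFoundError(path)
        self._paths.remove(path)


def delete_tree(store, prefix):
    """User-written helper built purely from the two primitives above."""
    for path in store.list(prefix):
        store.delete(path)
```

If enough users end up writing their own delete_tree, that is the feedback that justifies promoting it to a supported primitive.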
It doesn’t matter if you are providing a file API, a cluster management API, or something else; providing some form of create, read, update, delete, and list API for each “thing” you expect the users to operate on is sufficient to get going. Of course, the type of object will dictate the exact set of operations. For example, better command names may be add/remove (cluster node) instead of create/delete.
5. Avoid Commitment
Err on the side of caution and do not commit to support APIs and other interfaces.
It is essentially impossible to predict what APIs or other interfaces will actually end up being useful. It can become a huge maintenance burden (in time and cost) to maintain seldom used interfaces that have only a handful of users. Unfortunately, users like being able to rely on functionality not going away. Therefore, for your own sanity, make it clear:
- Which interfaces are supported and which may change or disappear without any warning.
- When supported interfaces may change (e.g., major versions may break all APIs).
- What behavior of a supported interface is supported and what is merely an implementation detail.
The first two items are self-explanatory, but the last one requires a few extra words.
It is tempting to say that “function foo is supported”, but that is the wrong way to do it. Rather, you want to say “function foo, which does only bar, is supported”.
For example, suppose that we have a function which returns an array of names (strings). Let’s also assume that it is convenient to keep track of those names internally using a balanced binary search tree. When we implement this get-names function, we are likely to simply iterate the tree appending all the names to the output array. This results in the output being sorted because of the tree-based implementation.
Now, consider these two possible statements of what is supported. First, a bad one:
Function get-names is supported. The function returns all names.
And now a better one:
Function get-names is supported. The function returns all names in an unspecified order.
Using the first (bad) description, it is completely reasonable for someone to start relying on the fact that the returned names are sorted. This ties our hands in multiple ways.
The second (better) description makes it clear that the order of the names can be anything and therefore the users shouldn’t rely on a particular order. It better communicates our intention of what is and what isn’t supported.
Now, say that we’d like to replace the tree with a hash table. (Maybe the tree insertion cost is too high for what we are trying to do.) This is a pretty simple change, but now our get-names output is unsorted. If we used the first description and some major consumer relied on the sorted behavior, we have to add an O(n log n) sort to the end of get-names, making everyone pay the penalty instead of just the consumers that want sorted output. On the other hand, the second description lets us make the hash table change freely.
So, be very explicit about what is and what isn’t supported.
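Here is a small Python sketch of the get-names example under the second (better) contract. Registry and its methods are hypothetical names; the internal set simply stands in for “a hash table”.

```python
class Registry:
    def __init__(self):
        # Implementation detail: could be a tree today and a hash table tomorrow.
        self._names = set()

    def add(self, name):
        self._names.add(name)

    def get_names(self):
        """Supported: returns all names in an unspecified order."""
        return list(self._names)


reg = Registry()
for name in ("carol", "alice", "bob"):
    reg.add(name)

# Only the consumer that needs sorted output pays for the sort.
sorted_names = sorted(reg.get_names())
```

Because the contract never promised an order, swapping the internal container does not break any consumer that followed it.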
I used a function in the above example, but the same applies to utilities and other tools. For example, it is perfectly reasonable to make the default output of a command implementation-dependent, but provide arguments that force certain columns, values, or units for the consumers that care.
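For a tool, the same idea might look something like the following sketch. format_nodes and its columns parameter are assumptions made up for this example; the point is only that the default layout is documented as unstable while the explicitly requested layout is the supported one.

```python
def format_nodes(nodes, columns=None, sep="\t"):
    """Render node records (dicts) as lines of text.

    With columns=None the layout is an implementation detail and may change.
    With an explicit column list, exactly those fields are emitted, in that order.
    """
    if columns is None:
        # Unsupported default: whatever fields the first record happens to have.
        columns = list(nodes[0].keys()) if nodes else []
    return [sep.join(str(node[c]) for c in columns) for node in nodes]


nodes = [{"name": "n1", "state": "up", "mem_mb": 2048}]
stable = format_nodes(nodes, columns=["state", "name"])
```

A script that parses the output should always pass columns; a human glancing at a terminal can take the default.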
Real World Example
A project I worked on a number of years ago had very simple rules for what was supported. Everything was unsupported, unless otherwise stated.
In short, we supported any API function and utility command that had a manpage that didn’t explicitly say that it was a developer-only interface or that the specific behavior should not be relied upon.
Yes, there were a few instances where our users assumed something was supported when it wasn’t, and inevitably they filed bugs. However, we were able to tell them that there was a better (and supported) way of accomplishing their task. Once they fixed their assumptions, their systems became more robust and significantly less likely to break in the future.
6. Be Consistent
Finally, make your interfaces consistent.
Not only internally consistent (e.g., always use the same option name for the same behavior) but also consistent with the rest of the system (e.g., use the same option names that commonly used software uses). This is a rephrasing of the principle of least astonishment.
Summary
There are other ways of approaching interface design, but the only one I’ve seen that doesn’t turn into a mess (or doesn’t cause a user revolt) is what I tried to outline here. Specifically, start with a small, simple, consistent, and explicit interface that you barely support and evolve over time.
This was a long post with a lot of information, but it still barely scratches the surface of what could be said. Really, most of the rules could easily turn into lengthy posts of their own. Maybe if there is interest (and I find the time), I’ll elaborate on some of these.