Josef “Jeff” Sipek

2017-03-23

The million dollar engineering problem — Scaling infrastructure in the cloud is easy, so it’s easy to fall into the trap of scaling infrastructure instead of improving efficiency.

Some Notes on the “Who wrote Linux” Kerfuffle

The Ghosts of Internet Time

How a personal project became an exhibition of the most beautifully photographed and detailed bugs you ever saw — Amazing photos of various bugs.

Calculator for Field of View of a Camera and Lens

The Megaprocessor — A microprocessor built from discrete transistors.

Why Pascal is Not My Favorite Programming Language

EAA Video — An assortment of EAA produced videos related to just about anything aircraft related (from homebuilding to aerobatics to history).

The Unreasonable Effectiveness of Recurrent Neural Networks

MACHINE_THAT_GOES_PING

Given that my first UNIX experience was on Linux, I’ve gotten used to the way certain commands work. When I switched from Linux to OpenIndiana (an Illumos-based distro), I had to get used to the fact that some commands worked slightly (or in some cases extremely) differently. One such command is ping.

On Linux, invoking ping without any options, I would get the very familiar output:

linux$ ping powerdns.com
PING powerdns.com (82.94.213.34) 56(84) bytes of data.
64 bytes from xs.powerdns.com (82.94.213.34): icmp_req=1 ttl=55 time=98.0 ms
64 bytes from xs.powerdns.com (82.94.213.34): icmp_req=2 ttl=55 time=99.2 ms
64 bytes from xs.powerdns.com (82.94.213.34): icmp_req=3 ttl=55 time=100 ms
^C
--- powerdns.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 98.044/99.170/100.188/0.950 ms

I was very surprised when I first ran ping on an OpenIndiana box since it outputted something very different:

oi$ ping powerdns.com
powerdns.com is alive

No statistics! Just a boolean indicating “has the host responded to a single ping.” When I run ping, I want to see the statistics—that’s why I run ping to begin with. The manpage helpfully points out that I can get statistics by using the -s option:

oi$ ping -s powerdns.com
PING powerdns.com: 56 data bytes
64 bytes from xs.powerdns.com (82.94.213.34): icmp_seq=0. time=98.955 ms
64 bytes from xs.powerdns.com (82.94.213.34): icmp_seq=1. time=99.597 ms
64 bytes from xs.powerdns.com (82.94.213.34): icmp_seq=2. time=99.546 ms
^C
----powerdns.com PING Statistics----
3 packets transmitted, 3 packets received, 0% packet loss
round-trip (ms)  min/avg/max/stddev = 98.955/99.366/99.597/0.357

For the past few years, I’ve just been getting used to adding -s. It was a little annoying, but it wasn’t the end of the world because I don’t use ping that much and when I do, the two extra characters don’t matter.

Recently, I was looking through the source for Illumos’s ping when I discovered that statistics can be enabled not just by the -s option but also with the MACHINE_THAT_GOES_PING environment variable!

A quick test later, I added the variable to my environment scripts and never looked back.

This is what it looks like:

oi$ export MACHINE_THAT_GOES_PING=1
oi$ ping powerdns.com
PING powerdns.com: 56 data bytes
64 bytes from xs.powerdns.com (82.94.213.34): icmp_seq=0. time=98.704 ms
64 bytes from xs.powerdns.com (82.94.213.34): icmp_seq=1. time=99.062 ms
64 bytes from xs.powerdns.com (82.94.213.34): icmp_seq=2. time=99.156 ms
^C
----powerdns.com PING Statistics----
3 packets transmitted, 3 packets received, 0% packet loss
round-trip (ms)  min/avg/max/stddev = 98.704/98.974/99.156/0.239

In conclusion, if you are a Linux refugee and you miss the way ping worked on Linux, just add MACHINE_THAT_GOES_PING to your environment and don’t look back.
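For the curious, here is a minimal sketch of what such a check could look like in C. This is not the actual Illumos code: want_stats is a made-up name, and I am assuming that only the presence of the variable matters, not its value.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Decide whether to print per-packet statistics.
 *
 * Hypothetical helper, not taken from the Illumos source: statistics
 * are enabled if -s was given on the command line, or (assumption) if
 * MACHINE_THAT_GOES_PING is present in the environment with any value.
 */
bool
want_stats(bool s_flag)
{
	return (s_flag || getenv("MACHINE_THAT_GOES_PING") != NULL);
}
```

With something like this in place, the option parser only has to remember whether it saw -s, and the environment lookup happens once.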

Raspberry Pi

Two weeks ago, I decided to do some hardware hacking. After a bit of reading up on embedded boards, I ended up buying a Raspberry Pi B+. It’s essentially a slightly smaller form factor version of the B, that has more GPIO pins and uses microSD cards instead of SD cards.

I hooked it up to the TV and played with Raspbian and RiscOS a little bit. As you may have guessed by now, that was not enough fun for me. I just had to boot a custom OS that talked over serial. :) This of course required some way to connect the Pi to something that can talk serial. But that’s a post for another day. :P This post is going to be about my impression of the Pi, as well as a cute little use I found for it over the past week.

Impressions

The Pi is a rather small board. The B+ is even smaller. A lot has been written about the technical side, so I won’t bother.

I was rather impressed with how much punch this little board packs. The hardest part about getting it going was putting it in the case (I got one of those kits because it was cheaper than buying everything separately). The built-in 4-port USB hub ended up quite useful. It allowed me to plug in both a keyboard and a mouse and have NOOBS installing Raspbian and RiscOS within minutes. A quick reboot later, I was at a shell prompt. That’s where the “new toy high” wore off a little. (I know I’ve talked about this with people before — it’s cool to be portable, but it’s also boring since the architecture becomes irrelevant.) I had a shell, and the most creative thing I could think of was to look at /proc/cpuinfo and /proc/meminfo.

I do have some thoughts about where the Pi B+ could have been better. The B version used an SD card. The B+ uses a microSD card. I consider this a bit of a regression. I have a bunch of older SD cards and an SD card reader that works well with SD cards. Sadly, this card reader (using a microSD adapter) fails to play nice with the SDXC modernization of SD that all microSD cards seem to use. I have the same issue with other microSD cards, so I’m pretty sure it’s the card reader. This makes updating a bit more of a pain.

The other thing I wish the Pi had is a DB9 RS232 connector. I have USB serial dongles that work well, but to talk serial to the Pi one needs to either get a level converter or a TTL serial to USB cable. I ended up getting a cheap USB cable with a fake Prolific chip inside. It works, but I hear Windows users are having a terrible time with evil drivers from Prolific.

Storm Timelapse

A little over a week after getting the Pi in the mail, we got a large storm heading our way. I got the brilliant idea to set up a webcam in an upstairs window. Previously, this would involve digging up an old computer, setting it up by the window, etc. This time, I reached for the Pi. I connected a webcam to one of the USB ports and a cheap WiFi USB adapter to another. A short config later, Raspbian was on the network even though there’s no network drop in sight.

I didn’t want to abuse the microSD card for storage of images, so I mounted an NFS share from the storage server in the basement. I had to use the nolock option to make the mount happen. I probably could have figured out why the lock manager was not running, but it was a temporary setup so a “quick hack” was all I did.

To capture images from the webcam, I ended up installing fswebcam, a small program that does one thing and does it well. I started up screen, and ran fswebcam with the following config.

device /dev/video0
input 0
loop 5
resolution 800x600
timestamp "%Y-%m-%d %H:%M:%S %Z"
jpeg 95
save /mnt/webcam/%Y%m%d/%H/0_%Y%m%d_%H%M%S.jpg
palette YUYV

Then, downstairs on my laptop, I mounted the same share and watched the files appear every five seconds. I ended up running the webcam for two days.

Here’s a couple of stills from the 27th:

And here’s a couple from the 28th:

I did make a quick timelapse, but I haven’t tried to figure out a reasonable set of codec options to not end up with 300 MB of video. Maybe one day I’ll find a good set of options and upload the video here. Here’s what I used:

ffmpeg -framerate 30 -pattern_type glob -i '20150128/*/0_*.jpg' \
	-b:v 5000k -g 300 /tmp/out.mp4
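
As a rough sanity check on the file size, the arithmetic can be sketched in a few lines of C. The function name is made up, and the sketch assumes a full day of frames with no gaps (one every five seconds, per the fswebcam config) and an encoder that actually hits the requested average bitrate.

```c
#include <stdint.h>

/*
 * Estimate the size (in bytes) of a timelapse encoded at a fixed
 * average bitrate.
 *
 *   capture_secs - how long the camera was capturing, in seconds
 *   interval     - seconds between frames ("loop 5" => 5)
 *   fps          - playback frame rate ("-framerate 30" => 30)
 *   bitrate      - video bitrate in bits/second ("-b:v 5000k" => 5000000)
 *
 * Returns 0 if interval or fps is 0.
 */
uint64_t
timelapse_bytes(uint64_t capture_secs, uint64_t interval, uint64_t fps,
    uint64_t bitrate)
{
	uint64_t frames, video_secs;

	if (interval == 0 || fps == 0)
		return (0);

	frames = capture_secs / interval;
	video_secs = frames / fps;	/* rounds down */

	return (video_secs * bitrate / 8);
}
```

For one day, timelapse_bytes(24 * 60 * 60, 5, 30, 5000000) works out to 17280 frames, 576 seconds of video, and 360000000 bytes. The size scales linearly with the -b:v value, so that is the first option to revisit.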

Anyway, that’s it for today. I’ll write again about the Pi in the near future — from an OS developer’s perspective.

Performance Co-Pilot: Part 2, Enabling PMDAs

In my previous post, I introduced Performance Co-Pilot (PCP). I know, I promised the next post to be about logging, but I thought I’d make a short detour and show how to install more PMDAs.

After installing PCP on a Linux system, you will have access to somewhere around 850 various metrics from the three basic PMDAs (pmcd, linux, and mmv). There are many more metrics that you can get at if you enable some of the non-default PMDAs.

I pondered what the best way to present a simple howto would be, and then I realized that simply copying & pasting a session where I install a PMDA will do.

First of all, all the PMDAs live in /var/lib/pcp/pmdas/.

# cd /var/lib/pcp/pmdas/
# ls
apache	 gpsd	    lustrecomm	mounts	   news     process   sendmail	systemtap  vmware
bonding  kvm	    mailq	mysql	   pdns     roomtemp  shping	trace	   weblog
cisco	 linux	    memcache	named	   pmcd     samba     simple	trivial    zimbra
dbping	 lmsensors  mmv		netfilter  postfix  sample    summary	txmon

In this post, I will use the PowerDNS PMDA as an example, but the steps are the same for the other PMDAs.

# cd pdns/
# ls
Install  Remove  pmdapdns.pl

As you can see, there are three files in this directory. We are interested in the Install script. Simply run it as root, and when it asks whether you want a collector, a monitor, or both, answer appropriately. If you are running the daemon on the same host, answering both is your best bet. (I never had the need to answer anything else.)

# ./Install 
You will need to choose an appropriate configuration for installation of
the "pdns" Performance Metrics Domain Agent (PMDA).

  collector	collect performance statistics on this system
  monitor	allow this system to monitor local and/or remote systems
  both		collector and monitor configuration for this system

Please enter c(ollector) or m(onitor) or b(oth) [b] 
Updating the Performance Metrics Name Space (PMNS) ...
Compiled PMNS contains
	  197 hash table entries
	  847 leaf nodes
	  132 non-leaf nodes
	 8149 bytes of symbol table
Terminate PMDA if already installed ...
Updating the PMCD control file, and notifying PMCD ...
Check pdns metrics have appeared ... 22 warnings, 60 metrics and 42 values

At this point, the PMDA has been installed (take a look at /etc/pmcd/pmcd.conf to see the new config line there enabling the new PMDA). Now, we can see the new metrics using pminfo (there are many more, I just pruned the list for brevity):

# pminfo pdns
pdns.packetcache_hit
pdns.tcp_answers
pdns.packetcache_miss

We are done!

If you decide to uninstall a PMDA, just cd into the directory and run the Remove script.

CJK

During an experiment, I needed to install Fedora 12. I made a few mistakes:

  1. I went with the netinstall. Unlike Debian’s netinstall, Fedora’s is very slow.
  2. The installer was a bit sluggish under KVM, and so I accidentally clicked through the window that let me unselect Gnome. So it’s installing the whole shebang.
  3. For whatever reason, it is installing CJK fonts. I do not speak any of those languages, and therefore they are useless to me. Furthermore, I’ve been told that something in the neighborhood of 20% of Fedora users make use of CJK. That just sounds wrong. Why install a package by default that only 20% of your userbase will benefit from? Aren’t there more useful packages?

TurboHercules

A few days ago, a new company was created: TurboHercules.

As the name implies, they package up Hercules (an IBM mainframe emulator), and provide support for it. They are targeting the platform as a disaster recovery solution.

It shouldn’t directly affect the open source project in a negative way (just like Red Hat cannot prevent people from continuing their work on the Linux Kernel). At the same time, it’ll change the way people look at Hercules.

PAPI - Getting at Hardware Performance Counters

Recently, I wanted to figure out whether an application I was analyzing was memory bound. While on this quest, I was introduced to the Performance Application Programming Interface (PAPI).

There is a rather good HOWTO that shows step-by-step instructions on getting it all running on Debian. The text below is more or less just a short version of that HOWTO, with my thoughts interspersed.

PAPI is a library that hooks into the hardware performance counters, and presents them in a uniform way. Installation is rather simple if you pay attention to the installation instructions.

  1. Get the kernel source
  2. Get the perfctr tarball
  3. Extract the sources, and run the update-kernel script. I really mean this: if you try to be clever and apply the patch by hand, you’ll have a broken source tree. (The script runs patch to fix up some existing kernel files, and then it copies a whole bunch of other files into the kernel tree.)
  4. Configure, build, install, and reboot into the new kernel
  5. You can modprobe perfctr and see spew in dmesg

That’s it for perfctr. Now PAPI itself…

  1. Get & extract the source
  2. ./configure, make, make fulltest, make install-all

That’s it for PAPI. The make fulltest will run the tests. Chances are that they will all either pass or all fail. If they fail, then something is wrong (probably with perfctr). If they pass, then you are all set.

There are some examples in the src/examples directory. Those should get you started with using PAPI. It takes about 100 lines of C to get an arbitrary counter going.

Some other time, I’ll talk more about PAPI, and how I used it in my experiments.

O_PONIES & Other Assorted Wishes

You might have already heard about ext4 “eating” people’s data. That’s simply not true.

While I am far from being a fan of ext4, I feel an obligation to set the record straight. But first, let me give you some references with an approximate timeline. I’m sure I managed to leave out a ton of details.

In mid-January, a bug titled Ext4 data loss showed up in the Ubuntu bug tracker. The complaining users apparently were losing data on system crashes when using ext4. (The fact that Ubuntu likes to include every unstable & crappy driver into their kernels doesn’t help at all.) As part of the discussion, Ted Ts’o explained that the problem wasn’t with ext4 but with applications that did not ensure that the data they wrote was actually safe. The people did not like hearing that.

Things went pretty quiet until mid-March. That’s when a Slashdot article made it painfully obvious that many of today’s apps are buggy. Some applications (KDE being a whole suite of applications) had gotten used to the fact that ext3 was a very common filesystem used by Linux installations. More specifically, they got used to the behavior that ext3’s default mount option (data=ordered) provided. This is really the issue. The application developers assumed that the POSIX interface gave them more guarantees than it did! To make matters worse, the one way to ensure that the contents of a file get to the disk (the fsync system call) is very expensive on ext3. So over the past (almost) decade that ext3 has been around, application developers have been “trained” (think Wikipedia article: Pavlov reflexes) to not use fsync: on ext3, it’s expensive and the likelihood of you losing data is much lower due to the default mount options. ext4’s fsync implementation, much like other filesystems’ implementations (e.g., XFS), does not suffer from this. (You may have heard about fsync on ext3 being expensive almost a year ago when Firefox was hit by this: Fsyncers and curveballs (the Firefox 3 fsync() problem). Note that in this case, as Ted Ts’o points out, the problem is that Firefox uses the same thread to draw the UI and do IO. That’s plain stupid.)

Over the next few days, Ted Ts’o posted two blog entries about delayed allocation (people seem to like to blame it for dataloss): Delayed allocation and the zero-length file problem, Don’t fear the fsync!.

About the same time, Eric Sandeen wrote a blurb about the state of affairs: fsync, sigh. He points out that XFS faced the same issue years ago. When the application developers were confronted about their application being broken, they just put fingers in their ears, hummed loudly, and yelled “I can’t hear you!” There is a word for that, and here’s the OED definition for it:

denial,

The asserting (of anything) to be untrue or untenable; contradiction of a statement or allegation as untrue or invalid; also, the denying of the existence or reality of a thing.

The problem is application developers not wanting to believe that it’s an application problem. Well, it really is! Not only are those apps broken, but they are not portable. AIX, IRIX, or Solaris will not give you the same guarantees as ext3!

(Eric is also trying to fight the common misconception that XFS nulls files: XFS does not null files, and requires no flux, which I assure you is not the case.)

About a week later, on an episode of Free Software Round Table, the problem was discussed a bit. They got most of it right :) (Here’s a 55MB mp3 of the show: 2009-03-21.)

When April 1st came about, the linux-fsdevel mailing list got a patch from yours truly: [PATCH] fs: point out any processes using O_PONIES. (The pony thing…it’s a bit of an inside joke among the Linux filesystem developers.) The idea of having O_PONIES first came up in #linuxfs on OFTC. While I don’t remember who first thought of it (my guess would be Eric), I know for sure that it wasn’t me. At the same time, I couldn’t help it, and considering that the patch took only a minute to make (and compile test), it was well worth it.

A few days later, during the Linux Storage and Filesystem workshop, the whole fsync issue got some discussion time. (See “Rename, fsync, and ponies” at Linux Storage and Filesystem workshop, day 1.) The part that really amused me:

Prior to Ted Ts’o’s session on fsync() and rename(), some joker filled the room with coloring-book pages depicting ponies. These pages reflected the sentiment that Ted has often expressed: application developers are asking too much of the filesystem, so they might as well request a pony while they’re at it.

In the comments for that article you can find Ted Ts’o saying:

Actually, it was Josef ’Jeff’ Sipek who deserves the first mention of application programmers asking for pones, when he posted an April Fools patch submission for the new open flag, O_PONIES — unreasonable file system assumptions desired.

Another file system developer who had worked on two major filesystems (ext4 and XFS) had a t-shirt on that had O_PONIES written on the front. And the joker who distributed the colouring book pages with pictures of ponies was another file system developer working yet another next generation file system.

Application programmers, while they were questioning my competence, judgement, and even my paternity, didn’t quite believe me when I told them that I was the moderate on these issues, but it’s safe to say that most of the file system developers in the room were utterly unsympathetic to the idea that it was a good idea to encourage application programmers to avoid the use of fsync(). About the only one who was also a moderate in the room was Val Aurora (formerly Henson). Both of us recognize that ext3’s data=ordered mode was responsible for people deciding that fsync() was harmful, and I’ve said already that if we had known how badly it would encourage application writers to Do The Wrong Thing, I would have pushed hard not to make data=ordered the default. Unfortunately, memory wasn’t as plentiful in those days, and so the associated page writeback latencies wasn’t nearly as bad ten years ago.

Hrm, I’m not sure how to take it…he makes it sound like I’m an extremist. Jeff — a freedom fighter for sanity of filesystem interfaces! :) As I said, I can’t take credit for the idea of O_PONIES. As I was writing this entry, I mentioned it to Eric and he promptly wrote an entry of his own: Coming clean on O_PONIES. It looks like he isn’t sure that he was the one to invent it! I’ll give him credit for it anyway.

The next day, a group photo of the attendees was taken… You can clearly see Val Aurora wearing an O_PONIES shirt. The idea was Eric’s, and as far as I know, he had his shirt the first day.

Fedora 11 is supposedly going to use ext4 as the default filesystem. When Ars Technica published an article about it (First look: Fedora 11 beta shows promise), some misguided people thinking that ext4 eats your data left a bunch of comments….*sigh*

Well, there you have it. That’s the summary of events with some of my thoughts interleaved. If you are writing a userspace application that does file IO, do the right thing: fsync the data you care about (or at least fdatasync).
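
To make “the right thing” concrete, here is a minimal sketch of the usual write-to-a-temporary-file, fsync, then rename pattern. The function name and the “.tmp” suffix are made up for illustration, and EINTR handling is left out to keep it short.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace `path` with `len` bytes from `buf`.  The new contents are
 * written to a temporary file and fsync'd before the rename, so a
 * crash should leave either the old file or the complete new one.
 *
 * Hypothetical helper for illustration.  Returns 0 on success and -1
 * on error.
 */
int
safe_replace(const char *path, const void *buf, size_t len)
{
	char tmp[4096];
	const char *p = buf;
	int fd;

	if (snprintf(tmp, sizeof (tmp), "%s.tmp", path) >= (int)sizeof (tmp))
		return (-1);	/* path too long for our buffer */

	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return (-1);

	/* write() may be short, so loop until everything is written */
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			close(fd);
			unlink(tmp);
			return (-1);
		}
		p += n;
		len -= (size_t)n;
	}

	/* get the data to stable storage *before* the rename */
	if (fsync(fd) < 0) {
		close(fd);
		unlink(tmp);
		return (-1);
	}

	if (close(fd) < 0) {
		unlink(tmp);
		return (-1);
	}

	/* atomically swap the new file into place */
	if (rename(tmp, path) < 0) {
		unlink(tmp);
		return (-1);
	}

	return (0);
}
```

A call such as safe_replace("settings.conf", data, data_len) is all the application needs. A stricter version would also fsync the containing directory after the rename so that the rename itself is durable.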

Linux Kernel Developers Go Insane

This is a continuation of the lguest: The New Kid on the Block post I made the other day.

In responses to Rusty’s patches, Linus Torvalds and Alan Cox attempt poetry.

First, Linus…

There’s a reason for [not having enough poetry in the kernel].

There once was a lad from Braidwood
With a wife and a hatred for FUD
He hacked kernels for fun,
couldn’t get them to run.
But he always felt that he should.

See?

So when you say "there’s not enough poetry", next time you’ll know why. You *really* don’t want want poetry.

Then Alan Cox replied with modified lyrics to Eleanor Rigby:

Ah look at all the laundered pages
Ah look at all the laundered pages

Handling Pages
Pick up the list and the link where kswap has been
A paging scheme
Runs down the I/O
Watching the queues that now keep me a list of the store
Who is it for

All the laundered pages
Where do they all come from
All the laundered pages
Where do they all belong

Meeting bdflush
Writing the pages of a disk file that no one will clear
No task comes near
Look at it working
Sleeping a lot in the night when there’s no pressure there
What does it care

All the laundered pages
Where do they all come from
All the laundered pages
Where do they all belong

Ah look at all the laundered pages
Ah look at all the laundered pages

Oracle DB
Died under load and was freed along with its name
No admin came
Good old bdflush
Wiping the dirt from the pages as it walks down the chain
Nothing was aged

All the laundered pages
(Ah look at all the laundered pages)
Where do they all come from
All the laundered pages
(Ah look at all the laundered pages)
Where do they all belong

Then, there was an exchange of limericks between Rusty and Alan…

Rusty:

There once was a virtualization coder,
Whose patches kept getting older,
Each time upstream would drop,
His documentation would slightly rot,
SO APPLY MY FUCKING PATCHES OR I’LL KEEP WRITING LIMERICKS.

Alan:

There once was a man they called rusty
Who patches were terribly crusty
Though his patches were right
And Linus was bright
They sat on the list getting dusty.

Rusty:

There was a poetic infection
Which distorted the kernel’s direction,
The code got no time
As they all tried to rhyme
And it shipped needing lots of correction.

And finally, Alan:

Dear Rusty I think that we know
Your code has good things to show
But an unreliable guide
To the poetic aside
Would probably steal the show

Either way, these are the people that write your operating system. :)

lguest: The New Kid on the Block

As most of you know, virtualization doesn’t really interest me, so me writing about lguest is rather unusual. For those who don’t know, lguest is Rusty Russell’s way of saying virtualization sucks and I can make it better (don’t quote me on that).

Yesterday, Rusty sent out a 7-patch series (1, 2, 3, 4, 5, 6, 7) that contains most of the documentation for lguest. This is not the normal style of documentation you’ll find in the kernel. Here’s Rusty’s description…

Lguest is an adventure, with you, the reader, as Hero. I can’t think of many 5000-line projects which offer both such capability and glimpses of future potential; it is an exciting time to be delving into the source!

But be warned; this is an arduous journey of several hours or more! And as we know, all true Heroes are driven by a Noble Goal. Thus I offer a Beer (or equivalent) to anyone I meet who has completed this documentation.

So get comfortable and keep your wits about you (both quick and humorous). Along your way to the Noble Goal, you will also gain masterly insight into lguest, and hypervisors and x86 virtualization in general.

There is a very large number of totally hilarious comments. It looks like one doesn’t have to be an x86 expert to get a laugh out of them, but knowing a thing or two about the architecture makes it all the more enjoyable.

I can’t help but include a few excerpts here…

Intel provided a special instruction to clear the TS bit for people too cool to use write_cr0() to do it. This "clts" instruction is faster, because all the vowels have been optimized out.

I’m told there are only two stories in the world worth telling: love and hate. So there used to be a love scene here like this:

Launcher: We could make beautiful I/O together, you and I.
Guest: My, that’s a big disk!

Unfortunately, it was just too raunchy for our otherwise-gentle tale.

Just read the patches. They are really amusing :)
