Josef “Jeff” Sipek

Inlining Atomic Operations

One of the items on my ever growing TODO list (do these ever shrink?) was to see if inlining Illumos’s atomic_* functions would make any difference. (For the record, these functions atomically manipulate variables. You can read more about them in the various man pages — atomic_add, atomic_and, atomic_bits, atomic_cas, atomic_dec, atomic_inc, atomic_or, atomic_swap.) Of course, once I looked at the issue deeply enough, I ended up with five cleanup patches. The gist of it is, inlining them not only bought about a 1% kernel performance improvement on the benchmarks, but also reduced the kernel size by a couple of kilobytes. You can read all about it in the associated bugs (5042, 5043, 5044, 5045, 5046, 5047) and the patch 0/6 email I sent to the developer list. In this blahg post, I want to talk about how exactly Illumos presents these atomic functions in a stable ABI but at the same time allows for inlines.

Genesis

It should come as no surprise that the “content” of these functions really needs to be written in assembly. The functions are 100% implemented in assembly in usr/src/common/atomic. There, you will find a directory per architecture. For example, in the amd64 directory, we’ll find the code for a 64-bit atomic increment:

	ENTRY(atomic_inc_64)
	ALTENTRY(atomic_inc_ulong)
	lock
	incq	(%rdi)
	ret
	SET_SIZE(atomic_inc_ulong)
	SET_SIZE(atomic_inc_64)

The ENTRY, ALTENTRY, and SET_SIZE macros are C preprocessor macros that make writing assembly functions semi-sane. Anyway, this code is used by both the kernel and userspace. I am going to ignore the userspace side of the picture and talk about the kernel only.
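To give you an idea of what the macros do, here is roughly what ENTRY and SET_SIZE expand to. This is a simplified sketch from memory (the real definitions live in usr/src/uts/intel/sys/asm_linkage.h and also deal with entry alignment and profiling):

#define	ENTRY(x) \
	.text; \
	.align	16; \
	.globl	x; \
	.type	x, @function; \
x:

#define	SET_SIZE(x) \
	.size	x, .-x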

These assembly functions get mangled by the C preprocessor and then fed into the assembler. The resulting object file is linked into the rest of the kernel. When a module binary references these functions, krtld (the kernel linker-loader) wires up those references to this code.

Inline

Replacing these functions with inline functions (using the GNU definition) would be fine as far as all the code in Illumos is concerned. However, doing so would remove the actual functions (as well as the symbol table entries), and so the linker would not be able to wire up any references from modules. Since Illumos cares about not breaking existing external modules (both open source and closed source), this simple approach would be a no-go.

Inline v2

Before I go into the next and final approach, I’m going to make a small detour through C land.

extern inline

First off, let’s say that we have a simple function, add, that returns the sum of the two integer arguments, and we keep it in a file called add.c:

#include "add.h"

int add(int x, int y)
{
	return x + y;
}

In the associated header file, add.h, we may include a prototype like the following to let the compiler know that add exists elsewhere and what types to expect.

extern int add(int, int);

Then, we attempt to call it from a function in, say, test.c:

#include "add.h"

int test()
{
	return add(5, 7);
}

Now, let’s turn these two .c files into a .so. We get the obvious result — test calls add:

test()
    test:     be 07 00 00 00     movl   $0x7,%esi
    test+0x5: bf 05 00 00 00     movl   $0x5,%edi
    test+0xa: e9 b1 fe ff ff     jmp    -0x14f	<0xc90>

And the binary contains both functions:

$ /usr/bin/nm test.so | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[74]	|                3520|                   4|FUNC |GLOB |0    |13   |add
[65]	|                3536|                  15|FUNC |GLOB |0    |13   |test

Now suppose that we modify the header file to include the following (assuming GCC’s inline definition):

extern int add(int, int);

extern inline int add(int a, int b)
{
	return a + b;
}

If we compile and link the same .so the same way (that is, we feed in the object file with the previously used implementation of add), we’ll get a slightly different binary. The invocation of add will use the inlined version:

test()
    test:     b8 0c 00 00 00     movl   $0xc,%eax
    test+0x5: c3                 ret    

But the binary will still include the symbol:

$ /usr/bin/nm test.so | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[72]	|                3408|                   4|FUNC |GLOB |0    |11   |add
[63]	|                3424|                   6|FUNC |GLOB |0    |11   |test

Neat, eh?
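(A caveat of my own, not part of the original example: the behavior shown above is GCC’s old GNU89 inline semantics. Newer GCC releases default to the C99 rules, under which extern inline means nearly the opposite. To reproduce the experiment on such a compiler, either build with -fgnu89-inline or mark the function explicitly:)

/* request the GNU89 extern-inline behavior on a C99-inline compiler */
extern inline __attribute__((gnu_inline)) int add(int a, int b)
{
	return a + b;
}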

extern inline atomic what?

How does this apply to the atomic functions? Pretty simply. As I pointed out, usr/src/common/atomic contains the pure assembly implementations — these are the functions you’ll always find in the symbol table.

The common header file that defines extern prototypes is usr/src/uts/common/sys/atomic.h.

Now, the trick. If you look carefully at the header file, you’ll spot a check on line 39. If all the conditions are true (kernel code, GCC, inline assembly is allowed, and x86), we include asm/atomic.h — which lives at usr/src/uts/intel/asm/atomic.h. This is where the extern inline versions of the atomic functions get defined.
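For a concrete flavor, the inline version of the earlier atomic_inc_64 looks roughly like this. (A sketch from memory, not the verbatim header; the real code spells the extern inline part via an __GNU_INLINE compatibility macro:)

extern inline void
atomic_inc_64(volatile uint64_t *target)
{
	__asm__ __volatile__(
	    "lock; incq %0"
	    : "+m" (*target));
}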

So, kernel code simply includes <sys/atomic.h>, and if the stars align properly, any atomic function use will get inlined.

Phew! This ended up being longer than I expected. :)

Segment Drivers

Lately, I started poking around the Illumos memory management code. As I’ve done in the past, I decided to use this blahg as a place to document some of my discoveries.

Memory Layout

In Illumos (and Solaris), address spaces are managed as sets of segments. Each segment has a base address, length, and a number of other properties. This is true for both process memory as well as kernel memory. Do not confuse these segments with Wikipedia article: memory segmentation that processors like Wikipedia article: x86 provide.

Each process has its own struct as:

> ::pgrep vim
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  10852  10777  10850  10777    101 0x4a004000 ffffff0411e1c0a0 vim
> ffffff0411e1c0a0::print proc_t p_as | ::print struct as a_segtree
a_segtree = {
    a_segtree.avl_root = 0xffffff03f7c62ea8
    a_segtree.avl_compar = as_segcompar
    a_segtree.avl_offset = 0x20
    a_segtree.avl_numnodes = 0x18
    a_segtree.avl_size = 0x60
}

The kernel address space is maintained in the kas global:

> kas::print a_segtree
a_segtree = {
    a_segtree.avl_root = kvseg+0x20
    a_segtree.avl_compar = as_segcompar
    a_segtree.avl_offset = 0x20
    a_segtree.avl_numnodes = 0x9
    a_segtree.avl_size = 0x60
}

(Once upon a time this set of segments was a linked list, but for a long while now it has been an AVL tree indexed by the base address.)

Regardless of which address space we’re dealing with, the same rules apply: segments represent contiguous regions within the address space. Each segment can represent a different type of memory. For example, walking the kernel address space segment tree yields nine different segments of four different types (kpm, kmem, kp, and map):

> kas::print a_segtree | ::walk avl | ::printf "%p.%016x %a\n" "struct seg" s_base s_size s_ops
fffffe0000000000.000000031e000000 segkpm_ops
ffffff0000000000.0000000017000000 segkmem_ops
ffffff0017000000.0000000080000000 segkp_ops
ffffff0097000000.00000002fca00000 segkmem_ops
ffffff03d3a00000.0000000004000000 segmap_ops
ffffff03d7a00000.000000fbe8600000 segkmem_ops
ffffffffc0000000.000000003b7fb000 segkmem_ops
fffffffffb800000.0000000000550000 segkmem_ops
ffffffffff800000.0000000000400000 segkmem_ops

Segment Drivers

Illumos comes with seven different architecture- and platform-independent segment drivers. A segment driver is a “driver” that implements a couple of functions to manage a segment of memory. That is, each segment type can handle page faults, page locking, sync operations, etc. differently.

For example, suppose that a page fault occurs because a process tried to load a value from a page that lacks a page table entry. The platform specific (assembly) fault handling code gets invoked by the processor. After doing a little bit of work, it calls into the generic (C) fault handling code, as_fault. There, the segtree AVL tree is consulted and the corresponding segment’s fault operation gets invoked.
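Each segment driver supplies its operations in a struct seg_ops vector, which is what s_ops in struct seg points at. Here is a heavily abbreviated sketch from memory (the real structure in usr/src/uts/common/vm/seg.h has many more entry points):

struct seg_ops {
	int		(*dup)(struct seg *, struct seg *);	/* fork */
	int		(*unmap)(struct seg *, caddr_t, size_t);
	void		(*free)(struct seg *);
	faultcode_t	(*fault)(struct hat *, struct seg *, caddr_t,
			    size_t, enum fault_type, enum seg_rw);
	int		(*sync)(struct seg *, caddr_t, size_t, int, uint_t);
	/* ... locking, protection, advice, pagelock, etc. ... */
};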

(Solaris Internals lists 12 and 11 segment drivers, respectively, in the two editions.) In Illumos, the seven common segment drivers are:

seg_dev
Most of the time, userspace processes do not need to map devices into their address space. In the rare case when a process does want a device mapped (e.g., Xorg), the dev segment driver maintains that mapping.
seg_kmem
This segment driver maps the kernel heap, module text, and all early boot memory. (code)
seg_kp
In general, kernel memory is not pageable. In the rare case that something can be in kernel pageable memory, this segment is what maintains the anonymous page mappings.
seg_kpm
If possible (you’re on a 64-bit system), the kpm segment driver maps all physical memory into the kernel’s address space. This allows the kernel to not have to set up temporary mappings to operate on physical memory. (code)
seg_map
The map segment driver is a kernel-only higher performance version of the vn segment driver. (See below.)
seg_spt
This segment driver is responsible for maintaining SysV shared memory segments. (Not to be confused with POSIX shared memory.)
seg_vn
Memory mapped files are handled by the vn segment driver. This includes both regular files as well as anonymous memory.

There are also two platform specific segment drivers:

seg_mf (i86xpv only)
This segment driver is only used by dom0 processes (read: Xen) to map pages from other domains.
seg_nf (sparc v9 only)
The header for the file says that it is for non-faulting loads. I don’t actually know what exactly it is for. (And I don’t care enough to dig deeper given that it is Sparc specific.)

The Reality

This is a lot of different segment drivers. Are all of them used all the time? Well, sort of. The mdb output earlier shows that the (amd64) kernel uses only four different segment drivers (kpm, kmem, kp, and map). A typical userspace process is very boring — it is only made up of vn segments. There are, however, exceptions. For instance, Xorg uses vn and dev. This accounts for six of the seven drivers. The last common segment driver is spt, which provides System V shared memory. (I talked about SysV shared memory previously.) So, on a 64-bit x86 system, all seven common segment drivers are in use.

The story is a bit different on 32-bit kernels. Since a 32-bit system has much smaller address space, the kernel tries to eliminate a number of mappings. Here is the list of segments in a 32-bit kernel:

> kas::print a_segtree | ::walk avl | ::printf "%p %a\n" "struct seg" s_base s_ops
b5802000 segmap_ops
b6800000 segkmem_ops
ef400000 segkmem_ops
fe800000 segkmem_ops
ff000000 segkmem_ops

As you can see, the kp and kpm segments went away. While at first this is surprising, it actually makes perfect sense. When thinking about memory, there are two “types” to consider: physical and virtual. In theory, one can have more virtual than physical memory thanks to the MMU, but in reality this is only true on 64-bit systems. Physical memory sizes outgrew 4 GB a number of years ago, and therefore a 32-bit address space can trivially be 100% backed by physical memory. In other words, 32-bit address spaces are tight on virtual memory, while 64-bit address spaces are “tight” on physical memory.

Let’s consider the disappearance of the kp segment on 32 bits. What does kp let us do? It lets us oversubscribe physical memory by backing some virtual memory with disk space. On a 32-bit system, we have enough physical memory to back all the virtual memory in the kernel, so we have no use for it. (Yes, the kernel still could page parts of itself out, but kernel text and data are generally considered important enough to keep in non-pageable memory. The memory utilization will more than pay for itself via the performance improvement of not having the kernel paged out.)

As I stated before, kpm segments map physical memory into the kernel’s address space for performance reasons (without it the kernel would have to temporarily map a page to access the contents). Therefore, they are good candidates for removal when it comes to slimming down the kernel’s address space demands. (Well, the actual story is the other way… the introduction of 64-bit capable hardware allowed kpm segments to exist to improve kernel performance.)

Bugs in Time

Recently, I blahgd about GCC optimizing code interestingly. There, I mentioned a couple of bugs I’ve stumbled across. I’m going to talk more about them in this post.

ddi_get_time

It all started when I got assigned a bug at work. “The installer hangs while checking available disks.” That’s the extent of the information I was given, along with a test system. It didn’t take long to figure out that devfsadm -c disk was waiting on a kernel thread that didn’t seem to be making any progress:

swtch+0x141
cv_timedwait_hires+0xec
cv_reltimedwait+0x51
ibdm`ibdm_ibnex_port_settle_wait+0x5f
ib`ibnex_bus_config+0x1e8
devi_config_common+0xa5
mt_config_thread+0x58
thread_start+8

The function of interest here is ibdm_ibnex_port_settle_wait, but before I talk about it I need to mention that the ibdm kmod stashes a ddi_get_time timestamp of when the HCA attached. Now, ibdm_ibnex_port_settle_wait calls ibdm_get_waittime to get a delay to feed to cv_reltimedwait. The delay is (more or less) calculated as: ddi_get_time() - hca_attach_time. This works fine as long as ddi_get_time keeps incrementing at a constant rate (1 sec/sec).

You may already see where this is going. The problem is that ddi_get_time returns a Unix timestamp based on the current time-of-day clock. If the TOD setting changes for whatever reason (daylight saving time adjustments, NTP, etc.), the value returned by ddi_get_time may change non-monotonically. This makes it unsuitable for calculating timeouts and wait times. Converting ibdm_get_waittime to use a monotonic clock source (like gethrtime or ddi_get_lbolt) fixes this bug. (Illumos bug 4777)
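The shape of the fix, as a sketch of mine (not the actual ibdm patch; hca_attach_hrtime is a hypothetical timestamp taken with gethrtime() at attach time):

#include <sys/time.h>

/*
 * gethrtime() is monotonic (nanoseconds since boot), so this
 * difference cannot jump backwards the way a ddi_get_time() delta
 * can when the TOD clock is adjusted.
 */
static hrtime_t
settle_elapsed(hrtime_t hca_attach_hrtime)
{
	return (gethrtime() - hca_attach_hrtime);
}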

Things get a bit worse. While figuring out what ddi_get_time does, I noticed that the man page actively encouraged developers to use it for timeouts. (Illumos bug 4776)

Of course, once I knew about this potential abuse, I had to check that there weren’t similar issues elsewhere in the kernel… and so I got to file bugs for iprb (4778), vhci (4779), COMSTAR iSCSI target (4780), sd (4781), usba (4782), emlxs (4786), ipf (4787), mac (4788), amr (4789), arcmsr (4790), aac (4791), and heci (4792).

I’m fixing all except: amr, arcmsr, aac, and heci.

NANOSEC

While developing the series of fixes mentioned in the previous section, I ran into the fact that NANOSEC was defined as 1000000000. This made it an int — a 32-bit signed integer (on both ILP32 and LP64).

If NANOSEC (defined this way) is used to convert seconds to nanoseconds (by multiplying), the naive approach will fail with quantities larger than 2 seconds. For example (hrtime_t is a 64-bit signed int):

hrtime_t convert(int secs)
{
        return (secs * NANOSEC);
}

Since both secs and NANOSEC are integers, the compiler computes a 32-bit product and only then sign extends the result to 64 bits. If you look around the Illumos codebase, you’ll see plenty of places that cast or use a ULL or LL suffix to make the compiler do the right thing. Why not just change the definition of NANOSEC to include an LL suffix, relieving the users of this tedious (and error prone!) duty? Well, now you know what Illumos bug 4809 is about. :)
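Conceptually, the change amounts to this (modulo the exact spelling in usr/src/uts/common/sys/time.h):

/* before: a plain int, so sec-to-nsec products are computed in 32 bits */
#define	NANOSEC		1000000000

/* after: a long long, so the arithmetic is done in 64 bits */
#define	NANOSEC		1000000000ll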

So, I changed the definition and rebuilt everything. Then, using wsdiff (think: recursive diff that understands how to compare ELF files) I found two places where the before and after binaries differed for non-trivial reasons. (I define a trivial reason as “the compiler decided to use registers differently, but the result is the same”.) Each non-trivial difference implies that there was an expression that changed — it used to be busted!

The first difference was in ZFS (Illumos bug 4810). There, spa_async_tasks_pending miscalculated a timeout making the condition always true.

The second difference was in in.mpathd (Illumos bug 4811). This daemon has a utility function to convert a struct timeval into a hrtime_t. You can read more about it in my previous post.
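The conversion in question looks something like this (a sketch from memory; see the in.mpathd source for the real tv2ns):

#include <sys/time.h>

static hrtime_t
tv2ns(const struct timeval *tvp)
{
	/*
	 * With the old int NANOSEC, the tv_sec multiplication was
	 * performed in 32 bits and overflowed; with NANOSEC defined
	 * as a long long, the whole expression is evaluated in 64 bits.
	 */
	return (tvp->tv_sec * NANOSEC + tvp->tv_usec * 1000);
}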

Before the NANOSEC change, I would have needed casts to fix this. With the change in definition, I don’t have to change a thing! And that’s how a one-liner closed three bugs at the same time:

commit b59e2127f21675e88c58a4dd924bc55eeb83c7a6
Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Date:   Mon Apr 28 15:53:04 2014 -0400

    4809 NANOSEC should be 'long long' to avoid integer overflow bugs
    4810 spa_async_tasks_pending suffers from an integer overflow bug
    4811 in.mpathd: tv2ns suffers from an integer overflow bug
    Reviewed by: Marcel Telka <marcel.telka@nexenta.com>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Approved by: Robert Mustacchi <rm@joyent.com>

Greetings from Nexenta

In case you missed it, back in mid-2011 I discovered Illumos and OpenIndiana. At that point, I already missed hacking on the (Linux) kernel. Based on my blahg posts [1,2], it shouldn’t surprise you that it didn’t take long before I wanted to hack on the Illumos kernel…and so I did.

If you ever contributed to an open source project in your free time while employed full-time, you understand that there’s only so much time you can devote to the open source project and therefore there is only so much you can do.

A couple of months ago, I decided to explore the possibility of working full-time on Illumos. There are only a handful of companies that visibly participate in the Illumos ecosystem, but their use of Illumos is pretty varied (from public clouds to virtualized databases to SAN/NAS appliances). As of this past Tuesday (Monday was a holiday), I’m at Nexenta. At least for now, I’m working remotely (from Ann Arbor) with the fine folks in the Wikipedia article: Lowell office. It feels great to work on open source again.

x2APIC, IOMMU, Illumos

About a week ago, I hinted at a boot hang I was debugging. I’ve made some progress with it, and along the way I found some interesting things about which I’ll blog over the next few days. Today, I’m going to talk about the Wikipedia article: APIC, xAPIC, and Wikipedia article: x2APIC and how they’re handled in Illumos.

APIC, xAPIC, x2APIC

I strongly suggest you become at least a little familiar with APIC architecture before reading on. The Wikipedia articles above are a good start.

First things first, we need some definitions. APIC can refer either to the architecture or to the very old (pre-Pentium 4) implementation. Since I’m working with a Sandy Bridge, I’m going to use APIC to refer to the architecture and completely ignore that those old chips existed. Everything they do is a subset of xAPIC. xAPIC is an extension to APIC. xAPIC chips started showing up in NetBurst architecture Intel CPUs (i.e., Pentium 4). xAPIC included some goodies such as upping the limit on the number of CPUs to 256 (from 16). x2APIC is an extension to xAPIC. x2APIC chips started appearing around the same time Sandy Bridge systems started showing up. It is a major update to how interrupts are handled, but as with many things in the PC industry, the x2APIC is fully backwards compatible with xAPICs. x2APIC includes some goodies such as upping the limit on the number of CPUs to 2^32.

Regardless of which exact flavor you happen to use, you will find two components: the local APIC and the I/O APIC. Each processor gets its own local APIC, and I/O buses get I/O APICs. I/O APICs can service more than one device, and in fact many systems have only one I/O APIC.

The xAPIC uses Wikipedia article: MMIO to program the local and I/O APICs.

x2APIC has two modes of operation. First, there is the xAPIC compatibility mode, which makes the x2APIC behave just like an xAPIC. This mode doesn’t give you all the new bells and whistles. Second, there is the new x2APIC mode. In this mode, the APIC is programmed using Wikipedia article: MSRs.
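To make the difference concrete, here is what an end-of-interrupt write looks like in each mode. This is my own sketch, not Illumos code; it relies on the facts that the xAPIC EOI register sits at offset 0xB0 of the MMIO page, and that x2APIC maps the registers to MSRs starting at 0x800 (so EOI becomes MSR 0x80B):

#include <stdint.h>

/* xAPIC: a store into the mapped local APIC page (typically 0xFEE00000) */
static volatile uint32_t *apic_mmio;

static void
xapic_eoi(void)
{
	apic_mmio[0xB0 / 4] = 0;	/* EOI register, offset 0xB0 */
}

/* x2APIC: a wrmsr; EOI is MSR 0x80B */
static void
x2apic_eoi(void)
{
	__asm__ __volatile__("wrmsr" : :
	    "c" (0x80B), "a" (0), "d" (0));
}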

One interesting fact about x2APIC is that it requires an Wikipedia article: iommu. My Sandy Bridge laptop has an Intel iommu as part of the VT-d feature.

Illumos /etc/mach

Illumos has two APIC drivers. First, there is pcplusmp, which knows how to handle APIC and xAPIC. Second, there is apix, which targets the x2APIC but knows how to operate it in both modes. On boot, the kernel consults /etc/mach to get a list of machine specific modules to try to load. Currently, the default contents (trimmed for display here) are:

#
# CAUTION!  The order of modules specified here is very important. If the
# order is not correct it can result in unexpected system behavior. The
# loading of modules is in the reverse order specified here (i.e. the last
# entry is loaded first and the first entry loaded last).
#
pcplusmp
apix
xpv_psm

Since I’m not running Xen, xpv_psm will fail to load, and apix gets its chance to load.

pcplusmp + apix Code Sharing

The code in these two modules can be summarized with one word: mess. Merely following what happens when is an adventure in itself. The code for the two modules lives in four directories: usr/src/uts/i86pc/io, usr/src/uts/i86pc/io/psm, usr/src/uts/i86pc/io/pcplusmp, and usr/src/uts/i86pc/io/apix. But the sharing isn’t as straightforward as one would hope.

Directory          pcplusmp                         apix
i86pc/io           mp_platform_common.c,            mp_platform_common.c,
                   mp_platform_misc.c, hpet_acpi.c  hpet_acpi.c
i86pc/io/psm       psm_common.c                     psm_common.c
i86pc/io/pcplusmp  *                                apic_regops.c, apic_common.c,
                                                    apic_timer.c
i86pc/io/apix                                       *

This is of course not clear at all when you look at the code. (Reality is a bit messier because of the i86xpv platform which uses some of the i86pc source.)

apix_probe

When the apix module gets loaded, its probe function (apix_probe) is called. This is the place where the module decides if the hardware is worthy. Specifically, if it finds that the CPU reports x2APIC support via Wikipedia article: cpuid, it goes on to call the common APIC probe code (apic_probe_common). Unless that fails, the system will use the apix module — even if there is no iommu and therefore the x2APIC needs to operate in xAPIC mode.
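In case you are curious, the x2APIC feature bit is CPUID leaf 1, ECX bit 21. A standalone sketch of the check (mine, not the apix code):

#include <stdint.h>

static int
cpu_has_x2apic(void)
{
	uint32_t eax, ebx, ecx, edx;

	__asm__ __volatile__("cpuid"
	    : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
	    : "a" (1));		/* basic feature leaf */

	return ((ecx >> 21) & 1);	/* ECX bit 21 = x2APIC */
}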

What mode are you using? Easy, just check the apic_mode global in the kernel:

# echo apic_mode::whatis | mdb -k
fffffffffbd0ee4c is apic_mode, in apix's data segment
# echo apic_mode::print | mdb -k
0x2

2 (LOCAL_APIC) indicates xAPIC mode, while 3 (LOCAL_X2APIC) indicates x2APIC mode.

Because this part is as clear as mud, I made a table that tells you what module and mode to expect given your hardware, what CPUID says, and the presence and state of the iommu.

APIC hw  CPUID  IOMMU    IOMMU state  Module    apic_mode
xAPIC    off    -        -            pcplusmp  LOCAL_APIC
x2APIC   off    -        -            pcplusmp  LOCAL_APIC
x2APIC   on     absent   -            apix      LOCAL_APIC
x2APIC   on     present  off          apix      LOCAL_APIC
x2APIC   on     present  on           apix      LOCAL_X2APIC

Defaults

I’ve never seen apic_mode equal to LOCAL_X2APIC in the wild. This was very puzzling. Yesterday, I discovered why. As I mentioned earlier, in order for the x2APIC to operate in x2APIC mode an iommu is required. Long story short, the default config that Illumos ships disables iommus on boot. Specifically:

$ cat /platform/i86pc/kernel/drv/rootnex.conf | grep -v '^\(#.*\|\)$'
immu-enable="false";

In order to get LOCAL_X2APIC mode, you need to set:

immu-enable="true";
immu-intrmap-enable="true";

Once you put those into the config file, update your boot archive and reboot. You should be set… except the iommu support in Illumos is… shall we say… poor.
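(For reference, rebuilding the boot archive is the standard bootadm dance:)

# bootadm update-archive
# reboot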

(I should point out that it is possible for the BIOS to enable x2APIC mode before handing control off to the OS. This is pretty rare unless you have a really big x86 system.)

1394

It would seem that the hci1394 driver doesn’t quite know how to deal with an iommu “messing” with its I/Os, and its interrupt service routine ends up shutting down the driver. (On a debug build, it trips an ASSERT(0) for good measure.) I just disabled 1394 in the BIOS since I don’t have any Firewire devices handy and therefore no use for the port at the moment.

immu-enable Details

In case you want to know how iommu initialization affects the apix initialization…

During boot, immu_init gets called to initialize iommus. If the config option (immu-enable) is not true, the function just returns instead of calling immu_subsystems_setup, which calls immu_intrmap_setup, which sets psm_vt_ops to a non-NULL value.

Later on, when apix is loaded and is initializing itself in apix_picinit, it calls apic_intrmap_init. This function does nothing if psm_vt_ops are NULL.
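Paraphrasing the control flow in a sketch (mine, not the literal source):

extern int immu_enable;			/* from rootnex.conf */
extern void *psm_vt_ops;		/* NULL: no interrupt remapping */
extern void immu_subsystems_setup(void);	/* -> immu_intrmap_setup() */

void
immu_init(void)
{
	if (!immu_enable)
		return;			/* psm_vt_ops is left NULL */

	immu_subsystems_setup();	/* eventually sets psm_vt_ops */
}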

The Hang

I might as well tell you a bit about my progress on tracking down the hang. It happens only if I’m using the apix module and I allow deep C states in the idle thread (technically, it could also be an mwait related issue since I cannot disable just mwait without disabling deep C states). It does not matter if the apic_mode is LOCAL_APIC or LOCAL_X2APIC.

Assorted Documentation

  1. Intel 64 Architecture x2APIC Specification
  2. Intel MP Spec 1.4

Think!

Alright, it ain’t rocket science. When you are trying to decide which filesystem to use, and you see a 7-year-old article which talks about people having problems with the fs on Red Hat 7.x (running 2.4.18 kernels), are you going to assume that nothing has changed? What if all the developers tell you that things have changed? Are you still going to believe the slashdot article? Grrr… No one is forcing you to use this filesystem, so if you believe a 7-year-old /. article, then go away and don’t waste the developers’ & others’ time.

Haskell Kernel Modules

Insanity! Someone has made it possible to write kernel modules in Haskell. (FYI, Haskell is a functional language with very strong typing.) Currently, they support only x86, but I wouldn’t be surprised if some other architectures got a port soonish.

O_PONIES & Other Assorted Wishes

You might have already heard about ext4 “eating” people’s data. That’s simply not true.

While I am far from being a fan of ext4, I feel an obligation to set the record straight. But first, let me give you some references with an approximate timeline. I’m sure I managed to leave out a ton of details.

In mid-January, a bug titled Ext4 data loss showed up in the Ubuntu bug tracker. The complaining users apparently were losing data on system crashes when using ext4. (The fact that Ubuntu likes to include every unstable & crappy driver in their kernels doesn’t help at all.) As part of the discussion, Ted Ts’o explained that the problem wasn’t with ext4 but with applications that did not ensure that the data they wrote was actually safe. The people did not like hearing that.

Things went pretty quiet until mid-March. That’s when a slashdot article made it painfully obvious that many of today’s apps are buggy. Some applications (KDE being a whole suite of applications) had gotten used to the fact that ext3 was a very common filesystem used by Linux installations. More specifically, they got used to the behavior that ext3’s default mount option (data=ordered) provided. This is really the issue. The application developers assumed that the POSIX interface gave them more guarantees than it did! To make matters worse, the one way to ensure that the contents of a file get to the disk (the fsync system call) is very expensive on ext3. So over the past (almost) decade that ext3 has been around, application developers have been “trained” (think Wikipedia article: Pavlov reflexes) to not use fsync — on ext3, it’s expensive, and the likelihood of you losing data is much lower due to the default mount options. ext4’s fsync implementation, much like other filesystems’ implementations (e.g., XFS), does not suffer from this. (You may have heard about fsync on ext3 being expensive almost a year ago when Firefox was hit by this: Fsyncers and curveballs (the Firefox 3 fsync() problem). Note that in this case, as Ted Ts’o points out, the problem is that Firefox uses the same thread to draw the UI and do IO. That’s plain stupid.)

Over the next few days, Ted Ts’o posted two blog entries about delayed allocation (people seem to like to blame it for dataloss): Delayed allocation and the zero-length file problem, Don’t fear the fsync!.

About the same time, Eric Sandeen wrote a blurb about the state of affairs: fsync, sigh. He points out that XFS faced the same issue years ago. When application developers were confronted about their applications being broken, they just put their fingers in their ears, hummed loudly, and yelled “I can’t hear you!” There is a word for that, and here’s the OED definition for it:

denial,

The asserting (of anything) to be untrue or untenable; contradiction of a statement or allegation as untrue or invalid; also, the denying of the existence or reality of a thing.

The problem is application developers not wanting to believe that it’s an application problem. Well, it really is! Not only are those apps broken, but they are not portable. AIX, IRIX, or Solaris will not give you the same guarantees as ext3!

(Eric is also trying to fight the common misconception that XFS nulls files, which I assure you is not the case: XFS does not null files, and requires no flux.)

About a week later, on an episode of Free Software Round Table, the problem was discussed a bit. They got most of it right :) (Here’s a 55MB mp3 of the show: 2009-03-21.)

When April 1st came about, the linux-fsdevel mailing list got a patch from yours truly: [PATCH] fs: point out any processes using O_PONIES. (The pony thing…it’s a bit of an inside joke among the Linux filesystem developers.) The idea of having O_PONIES first came up in #linuxfs on OFTC. While I don’t remember who first thought of it (my guess would be Eric), I know for sure that it wasn’t me. At the same time, I couldn’t help it, and considering that the patch took only a minute to make (and compile test), it was well worth it.

A few days later, during the Linux Storage and Filesystem workshop, the whole fsync issue got some discussion time. (See “Rename, fsync, and ponies” at Linux Storage and Filesystem workshop, day 1.) The part that really amused me:

Prior to Ted Ts’o’s session on fsync() and rename(), some joker filled the room with coloring-book pages depicting ponies. These pages reflected the sentiment that Ted has often expressed: application developers are asking too much of the filesystem, so they might as well request a pony while they’re at it.

In the comments for that article you can find Ted Ts’o saying:

Actually, it was Josef ’Jeff’ Sipek who deserves the first mention of application programmers asking for ponies, when he posted an April Fools patch submission for the new open flag, O_PONIES — unreasonable file system assumptions desired.

Another file system developer who had worked on two major filesystems (ext4 and XFS) had a t-shirt on that had O_PONIES written on the front. And the joker who distributed the colouring book pages with pictures of ponies was another file system developer working yet another next generation file system.

Application programmers, while they were questioning my competence, judgement, and even my paternity, didn’t quite believe me when I told them that I was the moderate on these issues, but it’s safe to say that most of the file system developers in the room were utterly unsympathetic to the idea that it was a good idea to encourage application programmers to avoid the use of fsync(). About the only one who was also a moderate in the room was Val Aurora (formerly Henson). Both of us recognize that ext3’s data=ordered mode was responsible for people deciding that fsync() was harmful, and I’ve said already that if we had known how badly it would encourage application writers to Do The Wrong Thing, I would have pushed hard not to make data=ordered the default. Unfortunately, memory wasn’t as plentiful in those days, and so the associated page writeback latencies wasn’t nearly as bad ten years ago.

Hrm, I’m not sure how to take it…he makes it sound like I’m an extremist. Jeff — a freedom fighter for sanity of filesystem interfaces! :) As I said, I can’t take credit for the idea of O_PONIES. As I was writing this entry, I mentioned it to Eric and he promptly wrote an entry of his own: Coming clean on O_PONIES. It looks like he isn’t sure that he was the one to invent it! I’ll give him credit for it anyway.

The next day, a group photo of the attendees was taken… You can clearly see Val Aurora wearing an O_PONIES shirt. The idea was Eric’s, and as far as I know, he had his shirt the first day.

Fedora 11 is supposedly going to use ext4 as the default filesystem. When Ars Technica published an article about it (First look: Fedora 11 beta shows promise), some misguided people thinking that ext4 eats your data left a bunch of comments….*sigh*

Well, there you have it. That’s the summary of events with some of my thoughts interleaved. If you are writing a userspace application that does file IO, do the right thing, fsync the data you care about (or at least fdatasync).
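Since this whole saga is about applications not writing data safely, here is a minimal sketch of the write-fsync-rename pattern (my example, with error handling trimmed to the bare minimum; a real application should check every call):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
save_file(const char *path, const char *tmppath,
    const void *buf, size_t len)
{
	int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return (-1);

	write(fd, buf, len);	/* the data lands in the page cache... */
	fsync(fd);		/* ...and this forces it to stable storage */
	close(fd);

	/* only now is it safe to atomically replace the old file */
	return (rename(tmppath, path));
}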

OLS 2008 - Day 3

Yeah, I really wanted to write this yesterday — since it is about yesterday, but I was too tired when I got to the hotel. Either way, here it is.

The day started at 10am again - I love it. In previous years, presentations started at 9am (except on the first day, which started at 10am). The first talk I attended was about kernel documentation — where it resides, and why the current state is bad. The talk was a bit confusing. At one point, the presenter decided to read some text right from an HTML file — opening it in a text editor instead of a browser. He also seemed to contradict himself a bit … at one point he seemed to say that HTML was better than plaintext docs, and some time later he said the opposite — that plaintext docs were better than HTML. I kinda gave up on understanding what his point was.

I decided to be lazy, and stayed in the same room for the next talk: On submitting kernel features. I zoned out for quite a bit — I knew a bunch of things already, and it was a bit hard to lex what Andi Kleen (the speaker) was saying.

I was going to go to the ext4 talk. Unfortunately, I got distracted by people on my way to the talk, and before I knew it, I missed most of it. I guess I’ll just have to read the paper.

After lunch, I went to Virtualization of Linux servers: a comparative study. The talk was interesting, and I will read the paper. It showed exactly how much x86 virtualization sucks (at least compared to what’s on the mainframe). I can’t wait to have some time to hack on HVF some more. :)

Then, I got distracted by people, preparation of slides for my BOF about Guilt, pondering about trying SELinux again, etc., etc.

Anyway, I’m going to finish a summary of what happened yesterday later today. Until then…

Memory Leaks

Alright, I think I’ve had just about enough. Why does Amarok eat up 22% of my RAM (1GB) after 4 days of running (and playing music for maybe 18 hours of those 4 days)? Why does Firefox use up 33% of my RAM in 4 days?

Why is it that when I shut down the app, and restart it, the usage is 4–5 times less?

                    Amarok  Firefox 2
before app restart  225 MB  338 MB
after app restart   58 MB   72 MB

The only reasons I can think of are the application being buggy, or having really crappy defaults.

Buggy Applications

Dear developers, believe it or not, when you allocate memory, you also have to free it when you are done with it. If you don’t, you are committing a crime against humanity known as a “memory leak”. This memory is unusable, and essentially becomes dead weight the process carries around. Since it is not used, the OS may swap it out, and before long, your swap file/partition becomes full of memory that has been leaked.

Contrary to popular belief, freeing memory is really simple.

For you C++ coders (yes, that includes you Amarok folks), you simply use the delete keyword followed by a pointer of what you want to free. For example,

delete some_pointer;

If you are using C, the free function is your friend. Just call it, and make the one argument you give it the pointer to what you want to free. For example,

free(some_pointer);

Now, if you are working on a larger project, there might be wrappers around the memory management (malloc/free, new/delete) functions, but whatever the “free this memory” function is called, USE IT.

I can almost hear all the managed languages fans yell: “Just use a language that does garbage collection, and you won’t have to worry about freeing memory.” Well, you are WRONG!

Garbage collectors maintain graphs of memory allocations, and whenever they notice that some piece of memory is unreachable, they mark it as garbage, and free it. Here’s my favorite example for causing leaks in a garbage collected language:

Suppose that you have implemented a class that works as a stack. You implemented it as an array of elements and an index into the array to mark the top of the stack. Pushing an element is trivial: you just increment the index and set the reference in the array to the object you want to store. Popping is really easy: you just decrement the index, and you’re done. Right? WRONG! Decrementing the index changes that one integer variable, but the reference in the array is still valid, and therefore the object is still reachable as far as the garbage collector is concerned. Sure, the next time you push into that slot, the previous reference will get broken, and the previous allocation will get freed (assuming that there are no other references). But what if you never push that many elements back onto the stack? What if you experienced some high-load spike? You’ll have a large number of objects incorrectly referenced, tying up memory, and quite possibly making the entire system slower.

How can you solve this? Pretty simple, just reset the reference to some “null” quantity. In Java, that means using the null literal. For example,

some_reference = null;

In Python, None is the proper keyword to use:

some_reference = None

The lesson is, free the memory you allocated when you are done using it.

Crappy Defaults

Many large applications (Firefox included), have many options you can set that affect its behavior. The default options should cover 95% or more of the users (or at least the greatest majority possible). Why such a high number? Well, suppose you settle for making 90% of your users happy out of the box…that means that 1 in 10 people that try your app will not be happy with the defaults. How many will bother checking if there even are knobs they can turn to make it work the way they want? Not all. Some will just try to install another open source app written by someone else that does pretty much the same thing. So, the default options should make as many people happy as possible.

How does this tie into a third of my RAM being used by Firefox? Simple: I do not know if there are any knobs that would “fix” the problem I am seeing. For all I know, someone decided that it was a great idea to be really aggressive about caching web page content in memory — something that’s fine if you have 16GB of RAM, but guess what, most people don’t.

Whatever it is (defaults that don’t make sense or memory leaks), Firefox and Amarok have problems that must get addressed. What is one of the reasons people complain about Microsoft Word? It takes up tons of memory. Well, I don’t feel like throwing over 200 MB of RAM at an application that plays MP3s and displays a playlist and cover art.

And before someone suggests that I use Firefox 3… I realize that it is all super-duper-better-than-ever, but let’s think for a second. When the original Firefox was released, it was hailed as the non-leaky, light-weight Mozilla. Then, things started to get slow again. Firefox 2 was supposed to be the super-fast, non-leaky browser. What happened? What happened to my >300 MB of RAM? Now, Firefox 3 is all the rage…do you see the pattern yet?

I think this brings up a larger issue. It’s no secret that I do some Linux kernel coding from time to time. In the kernel, there are leaks at times, but kernel leaks are effectively non-existent compared to applications like Firefox. Don’t believe me? How come you can have a server run for over a year, and it responds just as well after that year as it did when you booted it? Imagine running Firefox for a year without restarting it. Can you even imagine that? The Linux kernel doesn’t seem to be the only “non-leaky” project (there are leaks, but they are very rare, and probably mostly in the ugliest parts of the kernel — device drivers): Apache performs quite well even after running for a while, PostgreSQL too, and the list goes on and on.

Why is it that Firefox and other projects seem to have so many problems? The only thing I can think of is the quality control that goes into checking new code before it’s committed. In the kernel community, a patch may get rewritten a dozen times, submitted to mailing lists for review, and get comments from people familiar with the subsystem, but also from other developers (and budding developers trying to understand the existing code). It takes a lot of effort to get a piece of code into the kernel, but in the end, that code is well written, well reviewed, and it should benefit most users. Do the Firefox et al. communities do this? I do not know, but somehow, I suspect that it isn’t the case.
