Josef “Jeff” Sipek

dis(1): support for System/370, System/390, and z/Architecture ELF bins

A few months ago, I came to the conclusion that it would be both fun and educational to add a new disassembler backend to libdisasm—the disassembler library in Illumos. Being a mainframe fan, I decided that implementing a System/390 and z/Architecture disassembler would be fun (I’ve done it before in HVF).

At first, I was targetting only the 390 and z/Architecture, but given that the System/370 is a trivial (almost) subset of the 390 (and there is a spec for 370 ELF files!), I ended up including the 370 support as well.

It took a while to get the code written (z/Architecture has so many instructions!) and reviewed, but it finally happened… the commit just landed in the repository.

If you get the latest Illumos bits, you’ll be able to disassemble 370, 390, and z/Architecture binaries with style. For example:

$ dis -F strcmp hvf             
disassembly for hvf

strcmp()
    strcmp:      a7 19 00 00        lghi    %r1,0
    strcmp+0x4:  a7 f4 00 08        j       0x111aec
    strcmp+0x8:  a7 1b 00 01        aghi    %r1,1
    strcmp+0xc:  b9 02 00 55        ltgr    %r5,%r5
    strcmp+0x10: a7 84 00 17        je      0x111b16
    strcmp+0x14: e3 51 20 00 00 90  llgc    %r5,0(%r1,%r2)
    strcmp+0x1a: e3 41 30 00 00 90  llgc    %r4,0(%r1,%r3)
    strcmp+0x20: 18 05              lr      %r0,%r5
    strcmp+0x22: 1b 04              sr      %r0,%r4
    strcmp+0x24: 18 40              lr      %r4,%r0
    strcmp+0x26: a7 41 00 ff        tmll    %r4,255
    strcmp+0x2a: a7 84 ff ef        je      0x111ae0
    strcmp+0x2e: 18 20              lr      %r2,%r0
    strcmp+0x30: 89 20 00 18        sll     %r2,%r0,24(%r0)
    strcmp+0x34: 8a 20 00 18        sra     %r2,%r0,24(%r0)
    strcmp+0x38: b9 14 00 22        lgfr    %r2,%r2
    strcmp+0x3c: 07 fe              br      %r14
    strcmp+0x3e: a7 28 00 00        lhi     %r2,0
    strcmp+0x42: b9 14 00 22        lgfr    %r2,%r2
    strcmp+0x46: 07 fe              br      %r14

I am hoping that this will help document all the places needed to change when adding support for a new ISA to libdisasm.

Happy disassembling!

2015-09-23

The Apple ISA — An interesting view of what Apple could aim for instruction set architecture-wise.

Internetová jazyková příručka — A reference book with grammar and dictionary detailing how to conjugate each Czech word.

Java is Magic: the Gathering (or Poker) and Haskell is Go (the game)

An Interview with Brian Kernighan (July 2000)

Booting a Raspberry Pi2, with u-boot and HYP enabled

The SmPL Grammar — Description of the grammar used by Coccinelle.

Netbooting Debian Squeeze — A link I had sitting around for a couple of years when I last set up a NFS-root netbooting Linux system.

Are there any 3 dimensional items wwe can’t print layer by layer — A humorous story about Wikipedia article: Fubini’s theorem and its relation to 3D printing.

The Diagnosis of Mistakes in Programmes on the EDSAC — In some ways, debugging hasn’t changed much since 1951.

Dumping Memory in MDB

It doesn’t take much reading of documentation, other people’s blogs, and other random web search results to learn how to dump a piece of memory in mdb.

In the following examples, I’ll use the address fffffffffbc30a70. This just so happens to be an avl_tree_t on my system. We can use the ::dump command:

> fffffffffbc30a70::dump
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc30a70:  801dc3fb ffffffff 0087b1fb ffffffff  ................

Or we can use the adb-style /B command:

> fffffffffbc30a70/B
kas+0x50:       80      

We can even specify the amount of data we want to dump. ::dump takes how many bytes to dump, while /B takes how many 1-byte integers to dump (while for example, /X takes how many 4-byte integers to dump):

> fffffffffbc30a70,20::dump
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc30a70:  801dc3fb ffffffff 0087b1fb ffffffff  ................
fffffffffbc30a80:  20000000 00000000 09000000 00000000   ...............
> fffffffffbc30a70,20/B
kas+0x50:       80      1d      c3      fb      ff      ff      ff      ff      0       87      b1      
                fb      ff      ff      ff      ff      20      0       0       0       0       0       
                0       0       9       0       0       0       0       0       0       0       

Things break down if we want to use a walker and pipe the output to ::dump or /B:

> fffffffffbc30a70::walk avl | ::dump
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc6d2e0:  00000000 00feffff 0000001e 03000000  ................
> fffffffffbc30a70::walk avl | /B
kpmseg:
kpmseg:         0       0       0       0       0       0       0       0       0       

Even though there are 9 entries in the AVL tree, ::dump dumps only the first one. /B does a bit better and it does print what appears to be the first byte of each. What if we want to dump more than just the first byte? Say, the first 32? ::dump is of no use already. Let’s see what we can make /B do:

> fffffffffbc30a70::walk avl | 20/B
mdb: syntax error near "20"
> fffffffffbc30a70::walk avl | ,20/B
mdb: syntax error near ","

No luck.

Solution

Ok, it’s time for the trick that makes it all work. You have to use the ::eval function. For example:

> fffffffffbc30a70::walk avl | ::eval .,20::dump
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc6d2e0:  00000000 00feffff 0000001e 03000000  ................
fffffffffbc6d2f0:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc34960:  00000000 00ffffff 00000017 00000000  ................
fffffffffbc34970:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc31ce0:  00000017 00ffffff 00000080 00000000  ................
fffffffffbc31cf0:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc35a80:  00000097 00ffffff 0000a0fc 02000000  ................
fffffffffbc35a90:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc34880:  0000a0d3 03ffffff 00000004 00000000  ................
fffffffffbc34890:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc31d60:  0000a0d7 03ffffff 000060e8 fb000000  ..........`.....
fffffffffbc31d70:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc7f3a0:  000000c0 ffffffff 00b07f3b 00000000  ...........;....
fffffffffbc7f3b0:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc7de60:  000080fb ffffffff 00105500 00000000  ..........U.....
fffffffffbc7de70:  00000000 00000000 200ac3fb ffffffff  ........ .......
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
fffffffffbc7e000:  000080ff ffffffff 00004000 00000000  ..........@.....
fffffffffbc7e010:  00000000 00000000 200ac3fb ffffffff  ........ .......

Perfect! ::eval makes repetition with /B work as well:

> fffffffffbc30a70::walk avl | ::eval .,8/B
kpmseg:
kpmseg:         0       0       0       0       0       fe      ff      ff
kvalloc:
kvalloc:        0       0       0       0       0       ff      ff      ff
kpseg:
kpseg:          0       0       0       17      0       ff      ff      ff
kzioseg:
kzioseg:        0       0       0       97      0       ff      ff      ff
kmapseg:
kmapseg:        0       0       a0      d3      3       ff      ff      ff
kvseg:
kvseg:          0       0       a0      d7      3       ff      ff      ff
kvseg_core:
kvseg_core:     0       0       0       c0      ff      ff      ff      ff
ktextseg:
ktextseg:       0       0       80      fb      ff      ff      ff      ff
kdebugseg:
kdebugseg:      0       0       80      ff      ff      ff      ff      ff

/nap

There is one more trick I want to share in this post. Suppose you have a mostly useless core file, and you want to dump the stack. Not as hex, but rather as a symbol + offset (if possible). The magic command you want is /nap. ‘/’ for printing, ‘n’ for a newline, ‘a’ for symbol + offset (of the value at “dot”), and ‘p’ for symbol (or address) of “dot”. (Formatting differences aside, ‘p’ prints the pointer—“dot”, and ‘a’ prints the value being pointed to—*“dot”.)

For example:

> fd94e3a8,8/nap
0xfd94e3a8:     
0xfd94e3a8:     0xfd94f5a8      
0xfd94e3ac:     libzfs.so.1`namespace_reload+0x394
0xfd94e3b0:     0xfdd6ce28      
0xfd94e3b4:     0xfdd6a423      
0xfd94e3b8:     0xcc            
0xfd94e3bc:     libzfs.so.1`__func__.16928
0xfd94e3c0:     0xfdd6ce00      
0xfd94e3c4:     0xfdd6ce28      

Since the memory happens to be part of the stack, there are no symbols associated with it and therefore the ‘p’ prints a raw hex value.

So, remember: if you have a core file and you think that you need to dump the stack to scavenge for hopefully useful values, you want to…nap. :)

Debugging with mdb

Recently, Theo Schlossnagle posted two interesting articles about debugging on Illumos using mdb. They are MDB, CTF, DWARF, and other angelic things, and mdb custom dmods.

x2APIC, IOMMU, Illumos

About a week ago, I hinted at a boot hang I was debugging. I’ve made some progress with it, and along the way I found some interesting things about which I’ll blog over the next few days. Today, I’m going to talk about the Wikipedia article: APIC, xAPIC, and Wikipedia article: x2APIC and how they’re handled in Illumos.

APIC, xAPIC, x2APIC

I strongly suggest you become at least a little familiar with APIC architecture before reading on. The Wikipedia articles above are a good start.

First things first, we need some definitions. APIC can refer to either the architecture or to very old (pre-Pentium 4) implementation. Since I’m working with a Sandy Bridge, I’m going to use APIC to refer to the architecture and completely ignore that these chips existed. Everything they do is a subset of xAPIC. xAPIC is an extension to APIC. xAPIC chips started showed up in NetBurst architecture Intel CPUs (i.e., Pentium 4). xAPIC included some goodies such as upping the limit on the number of CPUs to 256 (from 16). x2APIC is an extension to xAPIC. x2APIC chips started appearing around the same time Sandy Bridge systems started showing up. It is a major update to how interrupts are handled, but as with many things in the PC industry the x2APIC is fully backwards compatible with xAPICs. x2APIC includes some goodies such as upping the limit on the number of CPUs to $2^{32}$.

Regardless of which exact flavor you happen to use, you will find two components: the local APIC and I/O APIC. Each processor gets their own local APIC and I/O buses get I/O APICs. I/O APICs can service more than one device, and in fact many systems have only one I/O APIC.

The xAPIC uses Wikipedia article: MMIO to program the local and I/O APICs.

x2APIC has two mode of operation. First, there is the xAPIC compatibility mode which makes the x2APIC behave just like an xAPIC. This mode doesn’t give you all the new bells and whistles. Second, there is the new x2APIC mode. In this mode, the APIC is programmed using Wikipedia article: MSRs.

One interesting fact about x2APIC is that it requires an Wikipedia article: iommu. My Sandy Bridge laptop has an Intel iommu as part of the VT-d feature.

Illumos /etc/mach

x2APIC in Illumos has two APIC drivers. First, there is pcplusmp which knows how to handle APIC and xAPIC. Second, there is apix which targets x2APIC, but knows how to operate it in both modes. On boot, the kernel consults /etc/mach to get a list of machine specific modules to try to load. Currently, the default contents (trimmed for display here) are:

#
# CAUTION!  The order of modules specified here is very important. If the
# order is not correct it can result in unexpected system behavior. The
# loading of modules is in the reverse order specified here (i.e. the last
# entry is loaded first and the first entry loaded last).
#
pcplusmp
apix
xpv_psm

Since I’m not running Xen, xpv_psm will fail to load, and apix gets its chance to load.

pcplusmp + apix Code Sharing

The code in these two modules can be summarized with a word: mess. Following what happens when would be enough of an adventure. The code for the two modules lives in four directories: usr/src/uts/i86pc/io, usr/src/uts/i86pc/io/psm, usr/src/uts/i86pc/io/pcplusmp, and usr/src/uts/i86pc/io/apix. But the sharing isn’t as straight forward as one would hope.

Directory pcplusmp apix
i86pc/io mp_platform_common.c, mp_platform_misc.c, hpet_acpi.c mp_platform_common.c, hpet_acpi.c
i86pc/io/psm psm_common.c psm_common.c
i86pc/io/pcplusmp * apic_regops.c, apic_common.c, apic_timer.c
i86pc/io/apix *

This is of course not clear at all when you look at the code. (Reality is a bit messier because of the i86xpv platform which uses some of the i86pc source.)

apix_probe

When the apix module gets loaded, its probe function (apix_probe) is called. This is the place where the module decides if the hardware is worthy. Specifically, if it finds that the CPU reports x2APIC support via Wikipedia article: cpuid, it goes on to call the common APIC probe code (apic_probe_common). Unless that fails, the system will use the apix module — even if there is no iommu and therefore the x2APIC needs to operate in xAPIC mode.

What mode are you using? Easy, just check the apic_mode global in the kernel:

# echo apic_mode::whatis | mdb -k
fffffffffbd0ee4c is apic_mode, in apix's data segment
# echo apic_mode::print | mdb -k
0x2

2 (LOCAL_APIC) indicates xAPIC mode, while 3 (LOCAL_X2APIC) indicates x2APIC mode.

Because this part is as clear as mud, I made a table that tells you what module and mode to expect given your hardware, what CPUID says, and the presence and state of the iommu.

APIC hw CPUID IOMMU IOMMU state Module apic_mode
xAPIC off pcplusmp LOCAL_APIC
x2APIC off pcplusmp LOCAL_APIC
x2APIC on absent apix LOCAL_APIC
x2APIC on present off apix LOCAL_APIC
x2APIC on present on apix LOCAL_X2APIC

Defaults

I’ve never seen apic_mode equal to LOCAL_X2APIC in the wild. This was very puzzling. Yesterday, I discovered why. As I mentioned earlier, in order for the x2APIC to operate in x2APIC mode an iommu is required. Long story short, the default config that Illumos ships disables iommus on boot. Specifically:

$ cat /platform/i86pc/kernel/drv/rootnex.conf | grep -v '^\(#.*\|\)$'
immu-enable="false";

In order to get LOCAL_X2APIC mode, you need to set:

immu-enable="true";
immu-intrmap-enable="true";

Once you put those into the config file, update you boot archive and reboot. You should be set… except the iommu support in Illumos is… shall we say… poor.

(I should point out that it is possible for the BIOS to enable x2APIC mode before handing control off to the OS. This is pretty rare unless you have a really big x86 system.)

1394

It would seem that the hci1394 driver doesn’t quite know how to deal with an iommu “messing” with its I/Os and its interrupt service routine shuts down the driver. (On a debug build it throws is ASSERT(0) for good measure.) I just disabled 1394 in the BIOS since I don’t have any Firewire devices handy and therefore no use for the port at the moment.

immu-enable Details

In case you want to know how iommu initialization affects the apix initialization…

During boot, immu_init gets called to initialize iommus. If the config option (immu-enable) is not true, the function just returns instead of calling immu_subsystems_setup which calls immu_intrmap_setup which sets psm_vt_ops to non-NULL value.

Later on, when apix is loaded and is initializing itself in apix_picinit, it calls apic_intrmap_init. This function does nothing if psm_vt_ops are NULL.

The Hang

I might as well tell you a bit about my progress on tracking down the hang. It happens only if I’m using the apix module and I allow deep C states in the idle thread (technically, it could also be an mwait related issue since I cannot disable just mwait without disabling deep C states). It does not matter if the apic_mode is LOCAL_APIC or LOCAL_X2APIC.

Assorted Documentation

  1. Intel 64 Architecture x2APIC Specification
  2. Intel MP Spec 1.4

Powered by blahgd