Josef “Jeff” Sipek

Exclusive Or Character

A couple of years ago I blogged about the CCS instruction in the Apollo Guidance Computer. Today I want to tell you about the XC instruction from the System/360 ISA.

Many ISAs have some sort of xor instruction. The 360 is no different. It offers several different xor instructions which differ in the type of operands that they operate on. In all cases, the operation they perform could be summarized as (using C syntax):

A ^= B;

That is, one of the operands is used as both a source and a destination.

There are the boring X (reg ^= memory), XR (reg ^= reg), and XI (reg ^= immediate). Then there is XC, which is what inspired this post. XC, or Exclusive Or Character, takes two memory locations and a length and performs what appears to be a byte-by-byte xor of the two buffers. (The hardware is smart enough to operate on bigger chunks of memory, but the effect is as if it were done a byte at a time.) In assembly, XC looks like:

XC d1(l,b1),d2(b2)

The d fields are 12-bit unsigned displacements, while the b fields specify the registers holding the base addresses. For each operand, the actual address is dX plus the value of register bX. The l is a length field which encodes a length between 1 and 256.
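In the machine encoding (the SS format), the length field actually holds the length minus one, which is how 1–256 fits into a byte. Here is a sketch of the decode in C; the helper names are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Base register number 0 means "no base" in S/360 addressing,
 * not the contents of register 0. */
static uint32_t effective_addr(uint16_t disp, uint8_t base,
			       const uint32_t *regs)
{
	return (base ? regs[base] : 0) + (disp & 0xfff);
}

/* The 8-bit machine length field stores length minus one. */
static unsigned int decoded_length(uint8_t l_field)
{
	return (unsigned int)l_field + 1;	/* 0..255 encodes 1..256 */
}
```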

To use more C pseudocode, XC does:

void XC(unsigned char *op1, size_t len, unsigned char *op2)
{
	while (len--) {
		*op1 ^= *op2;
		op1++;
		op2++;
	}
}

(This pseudo code ignores the condition code calculation and exception generation which are not relevant to the discussion.)

This by itself is neat but not very exciting…until you remember that xor can be used to zero out a register. You can use XC to zero out up to 256 bytes of memory. It turns out this idiom is used pretty often in handwritten assembly, and compilers such as gcc even produce such instructions without any special effort on the programmer’s part.
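To see why the overlapping form zeroes memory, here is the same operation in plain C: when both operands name the same buffer, every byte is xored with itself and becomes zero. (The `xc_zero` helper name is mine, not anything from the ISA.)

```c
#include <stddef.h>

/* Same byte-by-byte operation as the XC pseudocode above. */
static void xc(unsigned char *op1, size_t len, unsigned char *op2)
{
	while (len--)
		*op1++ ^= *op2++;
}

/* Both operands name the same buffer: x ^ x == 0 for every byte. */
static void xc_zero(unsigned char *buf, size_t len)
{
	xc(buf, len, buf);
}
```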

For example, in HVF I have this line:

memset(&psw, 0, sizeof(struct psw));

Which GCC helpfully turns into (struct psw is 16 bytes in size):

xc      160(16,%r15),160(%r15)

When I first saw that line in the disassembly of HVF years ago, it blew my mind. It is elegant, fast thanks to microarchitectural optimizations, and, once you are used to the idiom, clear about what it does. I hope your mind was as blown as mine. Till next time!

2017-03-23

The million dollar engineering problem — Scaling infrastructure in the cloud is easy, so it’s easy to fall into the trap of scaling infrastructure instead of improving efficiency.

Some Notes on the “Who wrote Linux” Kerfuffle

The Ghosts of Internet Time

How a personal project became an exhibition of the most beautifully photographed and detailed bugs you ever saw — Amazing photos of various bugs.

Calculator for Field of View of a Camera and Lens

The Megaprocessor — A microprocessor built from discrete transistors.

Why Pascal is Not My Favorite Programming Language

EAA Video — An assortment of EAA produced videos related to just about anything aircraft related (from homebuilding to aerobatics to history).

The Unreasonable Effectiveness of Recurrent Neural Networks

Boston

Two weeks ago I ended up going to Wikipedia article: Boston for a day. I spent my day in three places—the Wikipedia article: Boston Public Library, the Wikipedia article: Massachusetts State House, and the Wikipedia article: Boston Common.

In this post, I will share my photos of anything that did not fit into the other two posts—the post with the Boston Public Library photos and the post with the Massachusetts State House. (All three posts share the same gallery.)

This is a view of the eastern end of Boston Common. There was a window at the State House that offered a good view, so I snapped it.

The weather was quite nice—low 20s °C, sunny, light breeze—and so the Common was full of people enjoying the day. Both passively:

and actively:

Industry by Wikipedia article: Adio diBiccari:

Heading back toward Copley Square and the Boston Public Library, we encounter the John Hancock tower:

At this point, it was time to start heading back to Harvard where I left my car. I noticed an interesting ad at the bus stop right by the library. It had three panels filled with water and bubbles. I realize it isn’t the sharpest photo.

When I got off the red line at Harvard, I tried some long exposures of the trains. It turns out that unless the trains are packed, they keep their doors open for only about 10 seconds. In this 13 second exposure, you can see that the door was closed for a part of it.

An 8 second exposure worked quite well. (Unfortunately, I like the first composition better.)

So, this concludes the three post series about my one day excursion to Boston. I certainly learned a couple of things about photography in the 401 shots I took. First of all, tripods are amazingly useful indoors. Second, anyone can take a shot of a subject—it takes the “know what you’re doing” to consistently get an image that is not just good but better than average. Third, I need to read up on architecture photography before my next excursion so I know what I am doing. :)

2016-05-01

The resolution of the Bitcoin experiment

Needs washed — A (strange) regional syntactic construct.

The Inside Story of an Aid Worker’s Secret Weapon: The Tarp

3L: The Operating System for the Future — An OS written in Scheme.

Dear Headhunters… — Joerg Moellenkamp’s rant directed at headhunters.

Princeton University Computer Architecture — Computer architecture course from Princeton offered through Coursera.

Glendix — Plan 9 userspace on top of the Linux kernel.

awk-raycaster — Pseudo-3D shooter written completely in awk using raycasting.

2015-09-23

The Apple ISA — An interesting view of what Apple could aim for instruction set architecture-wise.

Internetová jazyková příručka — A reference book with grammar and dictionary detailing how to conjugate each Czech word.

Java is Magic: the Gathering (or Poker) and Haskell is Go (the game)

An Interview with Brian Kernighan (July 2000)

Booting a Raspberry Pi2, with u-boot and HYP enabled

The SmPL Grammar — Description of the grammar used by Coccinelle.

Netbooting Debian Squeeze — A link I had sitting around for a couple of years when I last set up a NFS-root netbooting Linux system.

Are there any 3 dimensional items we can’t print layer by layer — A humorous story about Wikipedia article: Fubini’s theorem and its relation to 3D printing.

The Diagnosis of Mistakes in Programmes on the EDSAC — In some ways, debugging hasn’t changed much since 1951.

2015-06-11

O(n) binary search for when O(log n) is just too fast.

A Constructive Look At TempleOS

D3 gallery — chock full of neat visualizations.

The Apollo Guidance Computer: Architecture and Operation seems to be a nice book.

Detecting In-Flight Page Changes with Web Tripwires

Count, Compare, Skip

I just remembered a fun fact about the Wikipedia article: Apollo Guidance Computer (AGC).

A few years ago, I was tinkering with emulating one. It kind of worked, but that’s not important.

The AGC was an accumulator architecture, with 15-bit words (the accumulator (A) and a few other “registers” were 16 bits), and 1’s complement arithmetic.

Now, the fun fact. There was an instruction called ‘CCS’. It took an address, loaded the accumulator with the value at the address, and then performed a 4-way branch. The easiest way to explain it is with some C-style pseudocode (A = accumulator, Z = program counter):

Z = Z + 1;
A = mem[operand];

switch(A) {
        case POSITIVE:
                A = A - 1;
                /* Z is already incremented */
                break;
        case POS_ZERO:
                /* A is already a zero */
                Z = Z + 1;
                break;
        case NEGATIVE:
                A = (~A) - 1;
                Z = Z + 2;
                break;
        case NEG_ZERO:
                /* A is already a zero */
                Z = Z + 3;
                break;
}

So, in a program, you could see something like:

CCS addr1
TC  addr2
TC  addr3
TC  addr4
TC  addr5

So, data from addr1 was loaded, and then one of the TC instructions (TC = Transfer Control, a branch instruction) was jumped to depending on the value; when that TC executed, it unconditionally branched to some other address.

Of course, you didn’t have to use TC; you could use any valid instruction and CCS would happily jump to it.
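The dispatch can be modeled with a toy C function. This is a simplified 15-bit ones’ complement model following the pseudocode above, not an exact AGC emulation: the return value says which of the four following words executes next (0–3), and the new accumulator value lands in *a.

```c
#include <assert.h>

/* 15-bit ones' complement: 0x0000 is +0, 0x7fff is -0,
 * bit 0x4000 is the sign bit, negation is ~x & 0x7fff. */
static int ccs(unsigned short word, unsigned short *a)
{
	if (word == 0x0000) {		/* +0 */
		*a = 0;
		return 1;
	}
	if (word == 0x7fff) {		/* -0 */
		*a = 0;
		return 3;
	}
	if (word & 0x4000) {		/* negative: A = (~A) - 1 */
		*a = (~word & 0x7fff) - 1;
		return 2;
	}
	*a = word - 1;			/* positive: A = A - 1 */
	return 0;
}
```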

My understanding is that sometimes when CCS was used, some of the 4 possible targets were impossible. So as not to waste memory, a committee was organized to keep track of these holes and fill them in with useful constants.

Compare and Swap Manifesto

In the fall semester, I was taking a computer architecture course, and during it I came to the conclusion that compare-and-swap is better than the load-locked/store-conditional combination some architectures offer instead. As the class was very Alpha ISA-centric, the text below reflects that.

As a result, I wrote this manifesto. I see no reason to keep it to myself, so here it is. (Note: I haven’t really read through it in a couple of months, and overall I didn’t take too long to write it, so there might be some nasty corner cases.)

Definition of Compare and Swap

Let us assume for the moment, that the compare and swap instruction is:

        CS      R1,R2,D3(R3)

where Rn specifies a register used as the n-th operand, and Dn specifies an immediate used as a displacement in the n-th operand.

The first and third operands are compared. If they are equal, the second operand is stored at the third-operand location and the first operand is set to the value zero. If they are unequal, the first operand is set to the value one and the third-operand location remains unchanged.

The fetch of the third operand for purposes of comparison and the store into the third-operand location appear to be a block-concurrent interlocked-update reference as observed by other CPUs.
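A minimal sketch of this definition in plain, non-atomic C (a real implementation must make the fetch and store a single interlocked update):

```c
#include <assert.h>

/* CS R1,R2,D3(R3): r1 is the first operand (in/out), r2 the second,
 * and *mem the third-operand location. Returns the new r1 value. */
static unsigned long cs(unsigned long *r1, unsigned long r2,
			unsigned long *mem)
{
	if (*r1 == *mem) {
		*mem = r2;	/* equal: store second operand */
		*r1 = 0;	/* report success */
	} else {
		*r1 = 1;	/* unequal: memory untouched */
	}
	return *r1;
}
```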

Sample Code

This section illustrates how certain atomic primitives can be implemented with compare and swap. The examples assume an Alpha-style instruction set and ignore Alpha’s weak memory ordering.

spinlock_lock:
        lda     r4,0x0
        lda     r5,0x1
        cs      r4,r5,lock_var
        bne     r4,spinlock_lock

spinlock_unlock:
        stq     r31,lock_var

atomic_<op>:
        ldq     r4,atomic_var
        <some operations that yield r5, and do not modify r4>
        cs      r4,r5,atomic_var
        bne     r4,atomic_<op>
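Assuming C11 atomics as a stand-in for the proposed instruction, the same loops can be sketched as follows (the function and variable names are illustrative, not from the manifesto). Note that C11’s compare-exchange overwrites the expected value with the current one on failure, so the retry loop needs no explicit reload:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_ulong lock_var;
static atomic_ulong atomic_var;

static void spinlock_lock(void)
{
	unsigned long expected = 0;

	/* retry until we swap 0 -> 1 */
	while (!atomic_compare_exchange_weak(&lock_var, &expected, 1))
		expected = 0;	/* failure overwrote it with the old value */
}

static void spinlock_unlock(void)
{
	atomic_store(&lock_var, 0);
}

static void cas_add(unsigned long n)
{
	unsigned long old = atomic_load(&atomic_var);

	/* on failure, "old" already holds the fresh value; just retry */
	while (!atomic_compare_exchange_weak(&atomic_var, &old, old + n))
		;
}
```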

Instruction Encoding

So far, we have ignored all aspects of how a compare and swap instruction may be encoded.

Before looking at the 4 instruction formats present in Alpha, let us consider the requirements of compare and swap.

Compare and swap has three operands:

  • the expected ("old") value
  • the new value
  • memory address

Additionally, it also returns a zero or one value in the first operand. Obviously, the instruction requires a minimum of three registers. While memory references are traditionally represented as <base-register, displacement> pairs (or as <base-reg, index-reg, displacement>, or even <base-reg, index-reg, scale-factor, displacement>) only a single register is necessary to specify an arbitrary memory location. Of the 4 instruction formats present on Alpha, only the Operate Instruction Format can encode 3 registers.

        +--------+----+----+---------+----+
        | Opcode | Ra | Rb | /////// | Rc |
        +--------+----+----+---------+----+
         31     26   21   16         5    0

Not having an immediate value to use as a displacement is unfortunate; however, it does not make it impossible to implement compare and swap.

The actual mapping of the 3 registers to the Ra, Rb, and Rc instruction fields is trivial and can be essentially arbitrary. It is worth noting that the mapping should stay consistent with the other Operate format instructions; namely, Rc should encode the destination register. For compare and swap, the only register that gets written is the first operand (R1 in the example from the definition section). The mapping of the remaining two registers can be arbitrary with little to no effect on decoder complexity.

There are many unused opcodes that could be used for a compare and swap without adding any significant complexity to the decoder logic.

It should be noted that bits 15:6 can be used as a displacement. The Operate instruction format specifies that bit 12 must be zero if the instruction uses three registers, or one if it uses two registers and an 8-bit immediate.

Potential Microarchitectural Issues

Even though only three registers are specified, compare and swap essentially has four operands. The "additional" operand arises from the fact that the first operand is first read and then, at a later time, written.

This dictates that the Register Alias Table (or some other register renaming logic) must translate three architected register numbers to physical register numbers, as well as allocate a new physical register number for the first operand. Aside from the RAT, other structures may need additional read ports (e.g., the physical register file must be able to read 3 values concurrently).

Since compare and swap is a memory operation, special care must be taken to handle any interactions with a load/store queue. Given that compare and swap potentially stores to memory, a mechanism must be provided to either not speculate (e.g., not begin execution of the compare and swap instruction until there are no preceding instructions in the reorder buffer), or to speculatively launch the instruction and have an undo mechanism in case a conflicting store (from within the same core, or from another core) occurs. Obviously, it is possible to allow only a certain degree of speculation.

Another interesting question is which component does the actual "compare and swap". The comparison could be located within the core, within the L1 cache, or potentially somewhere else. Each option has its own set of benefits and drawbacks.

in-core

This allows the caches to remain (relatively) simple. The cache needs to support a regular load, a regular store, and a "conditional" store.

The conditional store functions as follows:

  1. core issues a load
  2. core compares the loaded value with expected value
  3. if mis-match, destination register is set to 1
  4. if match, tell the cache to do a conditional store
  5. the cache accepts the conditional store, and if the cache line was not invalidated since the load, stores the new value; at the same time, it signals the core whether or not the line was invalidated. This bit is then used to determine the value stored in the destination register.
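The steps above can be sketched as a toy C model; the structure and function names are invented for illustration, and the "invalidated" flag stands in for the cache’s line-tracking hardware:

```c
#include <assert.h>

struct cache_line {
	unsigned long data;
	int invalidated;	/* set if the line went away after the load */
};

/* Returns 0 on success, 1 on failure (the value the core would
 * place in the destination register). */
static int cas_in_core(struct cache_line *line, unsigned long expected,
		       unsigned long new_val)
{
	unsigned long v = line->data;	/* 1. core issues a load */

	if (v != expected)		/* 2.-3. compare in the core */
		return 1;

	/* 4.-5. conditional store: fails if the line was invalidated */
	if (line->invalidated)
		return 1;
	line->data = new_val;
	return 0;
}
```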

in-cache

This design makes the cache do the comparison. The advantage is that the cache does not need to send the data to the core, wait for the store/don’t-store signal, and then tell the core whether or not the store succeeded. Instead, it gets the expected value, does the comparison itself, and if the value matched, it proceeds with the store. Then, it signals the core, specifying whether or not the store occurred.

The cache can obtain the line in the exclusive state to avoid having to go to the bus for the store portion of the operation.

Compare and Swap vs. Load-locked/Store-conditional

Compare and swap has the interesting property of not caring whether or not the memory got changed. It simply does a comparison to determine if the in-memory value is what it expects.

Load-locked/Store-conditional lacks this. On the other hand, Store-conditional does not even have to attempt to access memory if the in-core lock register does not match the address.

References

[1] Alpha Architecture Handbook, version 4. EC–QD2KC–TE. Compaq
[2] z/Architecture Principles of Operation. page 7-66. SA22-7832-06. IBM

HVF & z/VM

After a couple of days of trying to figure out why HVF didn’t want to IPL under z/VM, I finally found out that z/VM tries to act like a real 3215 printer-keyboard console and refuses to do Sense-ID. It just rejects it. Grrr. So I set up a “device blacklist” where I say that device number 0009 is a 3215 console. It works well enough. This is what things look like on a real mainframe :)

ipl c clear
HVF version 0.13
 Memory:
    4096 kB/page
    131072 kB
    32768 pages
    PSA for each CPU     0..1024 kB
    nucleus              1024..4096 kB
    struct page array    4096..4608 kB
    generic pages        4608..131072 kB
 Devices:
    3215-00 @ 0009 (sch 10003)
    3390-12 @ 0190 (sch 10004)
    3390-12 @ 019d (sch 10005)
    3390-12 @ 019e (sch 10006)
    3390-12 @ 0191 (sch 10007)
    3390-12 @ 0192 (sch 10008)
 Scheduler:
    no task max

Shiny isn’t it? Subchannels 0-2 are missing because the devices (card reader, punch, and printer) don’t support Sense-ID, but I haven’t bothered to set up the “blacklist” entries because I won’t need these devices anytime soon.

Putting the V in HVF

After a number of days trying to avoid having to implement the Dynamic Address Translation (DAT) related code, I finally came to the conclusion that I had to…and so I did…in one afternoon. Woohoo! It actually wasn’t all that bad. Sure, it’s not complete; I still need a way to invalidate entries in the TLB. But right now, the only user that exists is the nucleus, so I just set up the address space, load up a pointer to the first level of page tables (called the Region-Third Table), and toggle the right bit in the Program Status Word (PSW). The DAT hardware turns on, and happily starts translating addresses from virtual to physical.

One thing that struck me as somewhat wasteful was the fact that if I have, say, 128MB to set up page tables for (each page being 4kB), I need to set up 32768 page table entries (PTEs)! Sure, if the pages are physically fragmented, then you have no choice, but what if they are in a couple of (if not one) contiguous chunks? You still need 32768 entries. Oh, and have I mentioned that each entry is 8 bytes? That means that 128MB needs 256kB of PTEs. Now, you can’t have just PTEs; that would make the translation far too inefficient. You have segment tables (with segment table entries — STEs) which point to sections of the page tables. As it turns out, each STE can point to a section of only 256 PTEs. In other words, the 32768 PTEs will require 128 STEs — at 8 bytes each, that’s just a single page…but…

Now, imagine we have 2 GB to set up page tables for, that’s…
…2147483648 bytes
…524288 pages
…524288 pages table entries
…4194304 bytes of page table entries (= 4 MB)
…2048 segment table entries
…16384 bytes of segment table entries (= 16 KB)
…~4MB overhead for 2GB

Still doesn’t sound that bad…but if the hardware allowed some kind of extent-based PTE, or if, for example, the STE had a flag saying “this is the physical address of the beginning of a contiguous 1MB chunk of memory” (256 * 4kB = 1MB), a whole lot of memory could be saved.
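The arithmetic above can be sketched and checked in C; the sizes are the ones used throughout this post (4kB pages, 8-byte PTEs and STEs, 256 PTEs per STE):

```c
/* Page table overhead for 2 GB of memory, as compile-time constants. */
#define MEM_BYTES	(2UL * 1024 * 1024 * 1024)	/* 2 GB */
#define PAGE_SIZE	4096UL				/* 4 kB pages */
#define PTE_SIZE	8UL				/* 8 B per PTE */
#define PTES_PER_STE	256UL				/* one STE covers 1 MB */
#define STE_SIZE	8UL				/* 8 B per STE */

#define NUM_PAGES	(MEM_BYTES / PAGE_SIZE)		/* = PTE count */
#define PTE_BYTES	(NUM_PAGES * PTE_SIZE)
#define NUM_STES	(NUM_PAGES / PTES_PER_STE)
#define STE_BYTES	(NUM_STES * STE_SIZE)
```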
