The fall semester, I was taking a computer architecture course, and during it, I came to the conclusion that compare-and-swap is better than the load-locked/store-conditional combination some architectures offer instead. As the class was very Alpha ISA-centric, the below text reflects that.
As a result I wrote this manifesto. I see no reason to keep it to myself! So, here it is. (Note: I haven’t really read though it in a couple of months, and overall, I didn’t take too long to write it, so there might be some nasty corner cases.)
Definition of Compare and Swap
Let us assume for the moment, that the compare and swap instruction is:
CS R1,R2,D3(R3)
Where, Rn specifies a register used as the n-th operand, and Dn specifies an immediate used as a displacement in the n-th operand.
The first and third operands are compared. If they are equal, the second operand is stored at the third-operand location and the first operand is set to the value zero. If they are unequal, the first operand is set to the value one.
When an equal comparison occurs, the second operand is stored at the third-operand location and the first operand is set to the value zero. The fetch of the third operand for purposes of comparison and the store into the third-operand location appear to be a block-concurrent interlocked-update reference as observed by other CPUs.
When the result of the comparison is unequal, the first operand is set to the value one, and the second-operand location remains unchanged.
Sample Code
This section illustrates how certain atomic primitives can be implemented with compare and swap. They assume Alpha-style instruction set and ignore Alpha’s weak memory ordering.
spinlock_lock:
lda r4,0x0
lda r5,0x1
cs r4,r5,lock_var
bne r4,spinlock_lock
spinlock_unlock:
stq r31,lock_var
atomic_<op>:
ldq r4,atomic_var
<some operations that yield r5, and do not modify r4>
cs r4,r5,atomic_var
bne r4,atomic_<op>
Instruction Encoding
So far, we have ignored all aspects of how a compare and swap instruction may be encoded.
Before looking at the 4 instruction formats present in Alpha, let us consider the requirements of compare and swap.
Compare and swap has three operands:
- the expected ("old") value
- the new value
- memory address
Additionally, it also returns a zero or one value in the first operand. Obviously, the instruction requires a minimum of three registers. While memory references are traditionally represented as <base-register, displacement> pairs (or as <base-reg, index-reg, displacement>, or even <base-reg, index-reg, scale-factor, displacement>) only a single register is necessary to specify an arbitrary memory location. Of the 4 instruction formats present on Alpha, only the Operate Instruction Format can encode 3 registers.
+--------+----+----+---------+----+
| Opcode | Ra | Rb | /////// | Rc |
+--------+----+----+---------+----+
31 26 21 16 5 0
Not having an immediate value to use as a displacement is unfortunate, however it does not make it impossible to implement compare and swap.
The actual mapping of the 3 registers to the Ra, Rb, and Rc instruction fields trivial, and can be essentially arbitrary. It is worth noting that the mapping should try to stay consistent with the other Operate format instructions. Namely, Rc should encode the destination register. For compare and swap, the only register that gets written to is the first operand (R1 in the example from the definition section). The order of the remaining two register mapping can be arbitrary with little to no effect on decoder complexity.
There are many unused opcodes that could be used for a compare and swap without adding any significant complexity to the decoder logic.
It should be noted that the bits 15:6 can be used as a displacement. The Op instruction format specifies that bit 12, must be zero if the instruction uses three register, or one if it uses two registers and an 8-bit immediate.
Potential Microarchitectural Issues
Even though only three registers are specified, compare and swap essentially uses 4 operands. The "additional" operand is the result of the fact that the first operand is first read, and then at a later time, written to.
This dictates that the Register Address Table (or some other register renaming logic) must handle translating three architected register numbers to physical register numbers, as well as allocating a new physical register number for the first operand. Aside from the RAT, other structures may need additional read ports (e.g., the physical register file must read 3 values concurrently).
Since compare and swap is a memory operation, special care must be taken to handle any interactions with a load/store queue. Given the fact that compare and swap potentially stores to memory, a mechanism must be provided to either not speculate (e.g., not begin execution of the compare and swap instruction until there are no proceeding instructions in the reorder buffer), or to speculatively launch the instruction and have an undo mechanism in case a conflicting store (from within the same core, or from another core) occurs. Obviously, it is possible to allow only a certain degree of speculation.
Another interesting question is, what component does the actual "compare and swap". The comparison could be located within the core, within the L1, or potentially somewhere else. Each has its own set of benefits and drawbacks.
in-core
This allows the caches to remain (relatively) simple. The cache needs to support a regular load, a regular store, and a "conditional" store.
The conditional store functions as follows:
- core issues a load
- core compares the loaded value with expected value
- if mis-match, destination register is set to 1
- if match, tell the cache to do a conditional store
- the cache takes the conditional store, and if the cache line was not invalidated since the load, store the new value; at the same time, signal the core informing it whether or not the line was invalidated. This bit is then used to determine the value stored in the destination register.
in-cache
This design makes the cache do the comparison. The advantage is that the cache does not need to send the data to the core, wait for the store/don’t-store signal, and then tell the core whether or not the store succeeded. Instead, it gets the expected value, does the comparison itself, and if the value matched, it proceeds with the store. Then, it signals the core, specifying whether or not the store occurred.
The cache can obtain the line in the exclusive state to avoid having to go to the bus for the store portion of the operation.
Compare and Swap vs. Load-locked/Store-conditional
Compare and swap has the interesting property of not caring whether or not the memory got changed. It simply does a comparison to determine if the in-memory value is what it expects.
Load-locked/Store-conditional lacks this. On the other hand, Store-conditional does not have to even attempt to access memory if the in-core lock register does not match the address.
References
[1] Alpha Architecture Handbook, version 4. EC–QD2KC–TE. Compaq
[2] z/Architecture Principles of Operation. page 7-66. SA22-7832-06. IBM