Sum addressed decoder

In CPU design, a sum-addressed decoder or sum-addressed memory (SAM) decoder is a method of reducing the latency of the CPU cache access. This is achieved by fusing the address generation sum operation with the decode operation in the cache SRAM.

Overview

The L1 data cache should usually be in the most critical recurrence on the CPU: few things improve instructions per cycle (IPC) as directly as a larger data cache, yet a larger data cache takes longer to access, and pipelining the data cache makes IPC worse. This article explains a way of reducing the latency of the L1 data cache access by fusing the address generation sum operation with the decode operation in the cache SRAM.

The address generation sum operation still must be performed, because other units in the memory pipe will use the resulting virtual address. That sum will be performed in parallel with the fused add/decode described here.

The most profitable recurrence to accelerate is a load, followed by a use of that load in a chain of integer operations leading to another load. Assuming that load results are bypassed with the same priority as integer results, then it's possible to summarize this recurrence as a load followed by another load—as if the program was following a linked list.

The rest of this page assumes an Instruction set architecture (ISA) with a single addressing mode (register+offset), a virtually indexed data cache, and sign-extending loads that may be variable-width. Most RISC ISAs fit this description. In ISAs such as the Intel x86, three or four inputs are summed to generate the virtual address. Multiple-input additions can be reduced to a two-input addition with carry save adders, and the remaining problem is as described below.
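
As a concrete illustration of that reduction step, here is a minimal behavioral sketch in Python (ours, not from the original article) of a single carry-save adder stage, which turns three addends into two whose ordinary sum is unchanged:

def carry_save_add(a, b, c):
    # Each bit position is an independent full adder: the sum bit is the
    # XOR of the three inputs, and the carry bit (their majority) lands
    # one position higher, hence the shift.
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

# A three-input add becomes a two-input add: a + b + c == s + carry.
s, carry = carry_save_add(0x123, 0x456, 0x789)
assert s + carry == 0x123 + 0x456 + 0x789
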
The critical recurrence, then, is an adder, a decoder, the SRAM word line, the SRAM bit line(s), the sense amp(s), the byte steering muxes, and the bypass muxes.

For this example, a direct-mapped 16KB data cache which returns doubleword (8-byte) aligned values is assumed. Each line of the SRAM is 8 bytes, and there are 2048 lines, addressed by Addr[13:3]. The sum addressed SRAM idea applies equally well to set associative caches.
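
In Python terms (a sketch with invented names, assuming addr holds a full byte address), the index for this geometry is just a shift and a mask:

def cache_index(addr):
    # Addr[13:3]: drop the 3 offset bits within the 8-byte line, keep
    # 11 index bits to select one of the 2048 lines (16KB / 8 bytes).
    return (addr >> 3) & 0x7FF

# Byte address 0x2010 falls in line 1026, the example line used below.
assert cache_index(0x2010) == 1026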

Sum-addressed cache: Collapse the adder and decoder

The SRAM decoder for this example has an 11 bit input, Addr[13:3],
and 2048 outputs, the decoded word lines. One word line is driven high
in response to each unique Addr[13:3] value.

In the simplest form of decoder, each of the 2048 lines is
logically an AND gate. The 11 bits (call them A[13:3]) and their
complements (call them B[13:3]) are driven up the decoder. For each
line, 11 bits or complements are fed into an 11-input AND gate. For
instance, 1026 decimal is equal to 10000000010 binary. The function
for line 1026 would be:

wordline[1026] = A[13] & B[12] & B[11] & B[10] & B[9] & B[8] & B[7] & B[6] & B[5] & A[4] & B[3]
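
Behaviorally, the whole simple decoder can be sketched in Python as follows (an illustration only; index stands for the 11-bit value on Addr[13:3]):

def wordline(line, index):
    # The line fires when every index bit matches the line number: a one
    # bit in line selects the true wire A[i], a zero selects the
    # complement wire B[i], and the selections feed an 11-input AND.
    return all(((index >> i) & 1) == ((line >> i) & 1) for i in range(11))

assert wordline(1026, 1026)       # 10000000010 matches its own line
assert not wordline(1026, 1027)   # any differing bit kills the AND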

Both the carry chain of the adder and the decoder combine
information from the entire width of the index portion of the address.
Combining information across the entire width twice is redundant. A
sum-addressed SRAM combines the information just once by implementing
the adder and decoder together in one structure.

Recall that the SRAM is indexed with the result of an add. Call
the summands R (for register) and O (for the offset to that register).
The sum-addressed decoder is going to decode R+O. For each decoder
line, call the line number L.

Suppose that our decoder drove both R and O over each decoder line,
and each decoder line implemented:

wordline[L] = (R+O) == L

(R+O) == L <=> R+O-L == 0
           <=> R+O+~L+1 == 0
           <=> R+O+~L == -1 == 11..1

A set of full adders can be used to reduce R+O+~L to S+C (this is
carry save addition). S+C == 11..1 <=> S == ~C. There will be no carries
in the final add. Note that since C is a row of carries, it's shifted
up one bit, so that

R[13:3] + O[13:3] + ~L[13:3] == {0,S[13:3]} + {C[14:4],0}

With this formulation, each row in the decoder is a set of full
adders which reduce the base register, the offset, and the row number
to a carry-save format, and a comparator. Most of
this hardware will be proven redundant below, but for now it's simpler to think of
it all existing in each row.
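
A behavioral Python sketch of one such row (our naming; r, o, and line hold the 11 index bits Addr[13:3] as integers) reduces R, O, and ~L with full adders and applies the S == ~C test:

MASK = 0x7FF  # 11 index bits, Addr[13:3]

def row_match(r, o, line):
    not_l = ~line & MASK
    # One level of full adders: R + O + ~L -> S + C, carries shifted up.
    s = (r ^ o ^ not_l) & MASK
    c = (((r & o) | (r & not_l) | (o & not_l)) << 1) & MASK
    # R + O + ~L == 11..1 exactly when S == ~C, i.e. S xor C is all ones.
    return (s ^ c) == MASK

# The row fires only when (R + O) mod 2048 equals its line number.
assert row_match(1000, 26, 1026)
assert not row_match(1000, 27, 1026)
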
Ignoring the LSBs: Late select on carry
The formulation above checks the entire result of an add.
However, in a CPU cache decoder, the entire result of the add is a
byte address, and the cache is usually indexed with a larger address,
in our example, that of an 8-byte block. It is preferable to ignore
a few of the LSBs of the address. However, the LSBs of the two
addends can't be ignored because they may produce a carry-out which would
change the doubleword addressed.

If R[13:3] and O[13:3] are added to get some index I[13:3],
then the actual address Addr[13:3] is equal to either I[13:3], or I[13:3] +
1, depending on whether R[2:0]+O[2:0] generates a carry-out. Both I and I+1 can
be fetched if there are two banks of SRAM, one with even
addresses and one with odd. The even bank holds addresses 000xxx,
010xxx, 100xxx, 110xxx, etc., and the odd bank holds addresses 001xxx,
011xxx, 101xxx, 111xxx, etc. The carry-out from
R[2:0]+O[2:0] can then be used to select the even or odd doubleword fetched later.

Note that fetching from two half-size banks of SRAM will dissipate
more power than fetching from one full-size bank, since we are
switching more sense amps and data steering logic.
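
The following behavioral Python sketch (invented names, banks modeled as lists of 1024 lines each) shows the two banks fetching in parallel while the low-order carry arrives late to steer the final mux:

def late_select_fetch(r, o, even_bank, odd_bank):
    i = ((r >> 3) + (o >> 3)) & 0x7FF   # index if the LSBs produce no carry
    carry = ((r & 7) + (o & 7)) >> 3    # carry-out of R[2:0] + O[2:0]
    # Each bank speculatively fetches its member of the pair {I, I+1}.
    even_line = i if i % 2 == 0 else (i + 1) & 0x7FF
    odd_line = i if i % 2 == 1 else i + 1
    even_data = even_bank[even_line >> 1]
    odd_data = odd_bank[odd_line >> 1]
    # The late carry picks the doubleword actually addressed.
    return odd_data if (i + carry) % 2 else even_data

even_bank = ["even line %d" % (2 * k) for k in range(1024)]
odd_bank = ["odd line %d" % (2 * k + 1) for k in range(1024)]
# R[2:0] + O[2:0] = 5 + 3 carries out, so line 120 + 1 is selected.
assert late_select_fetch(8 * 100 + 5, 8 * 20 + 3, even_bank, odd_bank) == "odd line 121"
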
Match generation


Consider, for example, lines 101 and 110: the even bank
will fetch line 110 when I[13:3] == 101 or I[13:3] == 110, and the odd
bank will fetch line 101 when I[13:3] == 100 or I[13:3] == 101.

In general, the odd SRAM bank should fetch line Lo == 2N+1 when
either I[13:3] == 2N or I[13:3] == 2N+1. We can write these two
conditions as:

I[13:3] = Lo-1 => R[13:3] + O[13:3] + ~Lo+1 = 11..11
                => R[13:3] + O[13:3] + ~Lo = 11..10
I[13:3] = Lo   => R[13:3] + O[13:3] + ~Lo = 11..11

We ignore the last digit of the compare: (S+C)[13:4] == 11..1

Similarly, the even SRAM bank fetches line Le == 2N when either
I[13:3] == 2N or I[13:3] == 2N-1. We can write these two conditions
as follows, and once again ignore the last digit of the compare.

I[13:3] = Le-1 => R[13:3] + O[13:3] + ~Le = 11..10
I[13:3] = Le => R[13:3] + O[13:3] + ~Le = 11..11
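
A behavioral Python sketch of the odd bank's row match (our model; r and o hold R[13:3] and O[13:3] as 11-bit values, with bit 0 of the compare standing for bit 3 of the address):

def odd_row_match(r, o, lo):
    # lo is this row's (odd) line number; fire when I == Lo or I == Lo-1.
    not_lo = ~lo & 0x7FF
    s = (r ^ o ^ not_lo) & 0x7FF
    c = (((r & o) | (r & not_lo) | (o & not_lo)) << 1) & 0x7FF
    # Compare S xor C only on bits [13:4]; ignoring the last digit is
    # what lets both 11..10 and 11..11 satisfy the row.
    return ((s ^ c) | 1) == 0x7FF

assert odd_row_match(100, 1, 101)       # I == Lo
assert odd_row_match(100, 0, 101)       # I == Lo - 1
assert not odd_row_match(100, 2, 101)   # any other index misses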

Gate-level implementation

Before we begin collapsing redundancy between rows, let's review:
Each row of each decoder for each of two banks implements a set of
full adders which reduce the three numbers to be added (R[13:3],
O[13:3], and ~L) to two numbers (S[13:3] and C[14:4]). The LSB
(== S[3]) is discarded. The carry out (== C[14]) is also discarded. The
row matches if S[13:4] == ~C[13:4], which is &(xor(S[13:4],
C[13:4])).

We can partially specialize the full adders to 2-input and, or,
xor, and xnor because the L input is constant. The resulting
expressions are common to all lines of the decoder and can be
collected at the bottom.

S0;i = S(Ri, Oi, 0) = Ri xor Oi
S1;i = S(Ri, Oi, 1) = Ri xnor Oi
C0;i+1 = C(Ri, Oi, 0) = Ri and Oi
C1;i+1 = C(Ri, Oi, 1) = Ri or Oi.

At each digit position, there are only two possible Si,
two possible Ci, and four possible xors between them:

Li=0 and Li-1=0: X0;0;i = S0;i xor C0;i = Ri xor Oi xor (Ri-1 and Oi-1)
Li=0 and Li-1=1: X0;1;i = S0;i xor C1;i = Ri xor Oi xor (Ri-1 or Oi-1)
Li=1 and Li-1=0: X1;0;i = S1;i xor C0;i = Ri xnor Oi xor (Ri-1 and Oi-1) = !X0;0;i
Li=1 and Li-1=1: X1;1;i = S1;i xor C1;i = Ri xnor Oi xor (Ri-1 or Oi-1) = !X0;1;i

One possible decoder for our example might calculate these four
expressions for each of the bits 4..13, and drive all 40 wires up the
decoder. Each line of the decoder would select one of the four wires
for each bit, and consist of a 10-input AND.
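
A behavioral Python sketch of this arrangement (ours; wires are modeled per bit rather than as a physical 40-wire bundle, and the four wires are keyed here by the constant carry-save inputs ~Li and ~Li-1 rather than by L itself):

def bit(x, i):
    return (x >> i) & 1

def x_wires(r, o, i):
    # The four candidate xors at bit i, shared by every row of the decoder.
    s0 = bit(r, i) ^ bit(o, i)          # sum with constant input 0
    s1 = 1 - s0                         # sum with constant input 1 (xnor)
    c0 = bit(r, i - 1) & bit(o, i - 1)  # carry with constant input 0
    c1 = bit(r, i - 1) | bit(o, i - 1)  # carry with constant input 1
    return {(0, 0): s0 ^ c0, (0, 1): s0 ^ c1,
            (1, 0): s1 ^ c0, (1, 1): s1 ^ c1}

def row_match(r, o, line):
    # Each row selects one wire per bit according to its own constant
    # inputs and ANDs the ten selections (bits [13:4] of the compare;
    # here bit 0 stands for bit 3 of the address).
    return all(
        x_wires(r, o, i)[(1 - bit(line, i), 1 - bit(line, i - 1))]
        for i in range(1, 11))

assert row_match(100, 1, 101)      # I == L
assert row_match(100, 0, 101)      # I == L - 1; late select picks the bank
assert not row_match(100, 2, 101)
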
What has been saved?
A simpler data cache path would have an adder followed by a
traditional decoder. For our example cache subsystem, the critical
path would be a 14 bit adder, producing true and complement values,
followed by an 11 bit AND gate for each row of the decoder.

In the sum-addressed design, the final AND gate in the decoder
remains, although 10 bits wide instead of 11. The adder has been
replaced by a four-input logical expression at each bit. The latency
saving comes from the speed difference between the adder and that
four-input expression, a saving of perhaps three simple CMOS
gates.

If the reader feels that this was an inordinate amount of
brain-twisting work for a three gate improvement in a multi-cycle
critical path, then the reader has a better appreciation for the level
to which modern CPUs are optimized.
Further optimizations: predecode
Many decoder designs avoid high-fan-in AND gates in the decode line
itself by employing a predecode stage. For instance, an 11 bit
decoder might be predecoded into three groups of 4, 4, and 3 bits
each. Each 3 bit group would drive 8 wires up the main decode array,
each 4 bit group would drive 16 wires. The decoder line then becomes
a 3 input AND gate. This reorganization can save significant
implementation area and some power.
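
A behavioral Python sketch of conventional predecoding (illustrative only, using the group sizes from the example above):

def predecode(index):
    # Split Addr[13:3] into groups of 4, 4, and 3 bits; one-hot encoding
    # each group gives 16 + 16 + 8 = 40 wires up the decode array.
    return index & 0xF, (index >> 4) & 0xF, (index >> 8) & 0x7

def wordline(line, g0, g1, g2):
    # Each decode line shrinks to a 3-input AND, one wire per group.
    return ((line & 0xF) == g0 and
            ((line >> 4) & 0xF) == g1 and
            ((line >> 8) & 0x7) == g2)

assert wordline(1026, *predecode(1026))
assert not wordline(1026, *predecode(1027))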

This same reorganization can be applied to the sum-addressed
decoder. Each bit in the non-predecoded formulation above can be
viewed as a local two-bit add. With predecoding, each predecode group
is a local three, four, or even five bit add, with the predecode
groups overlapping by one bit.

Predecoding generally increases the number of wires traversing the
decoder, and sum-addressed decoders generally have about twice as many
wires as the equivalent simple decoder. These wires can be the
limiting factor on the amount of feasible predecoding.