Branch delay slot
In computer architecture, a delay slot is an instruction slot that gets executed without the effects of a preceding instruction. The most common form is a single arbitrary instruction located immediately after a branch instruction on a RISC or DSP architecture; this instruction will execute even if the preceding branch is taken. Thus, by design, the instructions appear to execute in an illogical or incorrect order. It is typical for assemblers to automatically reorder instructions by default, hiding the awkwardness from assembly developers and compilers.
Branch delay slots
When a branch instruction is involved, the location of the following delay slot instruction in the pipeline may be called a branch delay slot. Branch delay slots are found mainly in DSP architectures and older RISC architectures. MIPS, PA-RISC, ETRAX CRIS, SuperH, and SPARC are RISC architectures that each have a single branch delay slot; PowerPC, ARM, and the more recently designed Alpha do not have any. DSP architectures that each have a single branch delay slot include the VS DSP, µPD77230 and TMS320C3x. The SHARC DSP and MIPS-X use a double branch delay slot; such a processor will execute a pair of instructions following a branch instruction before the branch takes effect.
The following example shows delayed branches in assembly language for the SHARC DSP. Registers R0 through R9 are cleared to zero in order by number (the register cleared after R6 is R7, not R9). No instruction executes more than once.
R0 = 0;
CALL fn (DB); /* call a function, below at label "fn" */
R1 = 0; /* first delay slot */
R2 = 0; /* second delay slot */
/***** discontinuity here (the CALL takes effect) *****/
R6 = 0; /* the CALL/RTS comes back here, not at "R1 = 0" */
JUMP end (DB);
R7 = 0; /* first delay slot */
R8 = 0; /* second delay slot */
/***** discontinuity here (the JUMP takes effect) *****/
/* next 4 instructions are called from above, as function "fn" */
fn: R3 = 0;
RTS (DB); /* return to caller, past the caller's delay slots */
R4 = 0; /* first delay slot */
R5 = 0; /* second delay slot */
/***** discontinuity here (the RTS takes effect) *****/
end: R9 = 0;
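The execution order described above can be checked with a small simulation. The following Python sketch is a toy model, not a SHARC emulator: the tuple encoding of instructions, the `run` function and the `DELAY_SLOTS` constant are assumptions introduced purely for illustration. It treats a branch as taking effect only after the two instructions that follow it have executed.

```python
DELAY_SLOTS = 2  # assumed: two delay slots, as in the (DB) branches above

# Hypothetical encoding of the listing: ("clr", reg) stands for "Rn = 0;".
PROGRAM = [
    ("clr", "R0"),    # 0
    ("call", "fn"),   # 1
    ("clr", "R1"),    # 2  first delay slot
    ("clr", "R2"),    # 3  second delay slot
    ("clr", "R6"),    # 4  the RTS comes back here
    ("jump", "end"),  # 5
    ("clr", "R7"),    # 6  first delay slot
    ("clr", "R8"),    # 7  second delay slot
    ("clr", "R3"),    # 8  fn:
    ("rts",),         # 9
    ("clr", "R4"),    # 10 first delay slot
    ("clr", "R5"),    # 11 second delay slot
    ("clr", "R9"),    # 12 end:
]
LABELS = {"fn": 8, "end": 12}


def run(program, labels):
    """Execute the toy program and return the registers in the order cleared."""
    cleared = []
    return_stack = []
    pending = None  # (index of the last delay slot, branch target)
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "clr":
            cleared.append(args[0])
        elif op == "call":
            # The return address lies past the caller's delay slots.
            return_stack.append(pc + 1 + DELAY_SLOTS)
            pending = (pc + DELAY_SLOTS, labels[args[0]])
        elif op == "jump":
            pending = (pc + DELAY_SLOTS, labels[args[0]])
        elif op == "rts":
            pending = (pc + DELAY_SLOTS, return_stack.pop())
        if pending is not None and pc == pending[0]:
            pc = pending[1]  # the delayed branch takes effect now
            pending = None
        else:
            pc += 1
    return cleared


print(run(PROGRAM, LABELS))
```

In this model the registers are cleared in the order R0 through R9, and no instruction executes more than once, which matches the comments in the assembly listing.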
The goal of a pipelined architecture is to complete an instruction every clock cycle. To maintain this rate, the pipeline must be full of instructions at all times. The branch delay slot is a side effect of pipelined architectures due to the branch hazard, i.e. the fact that the branch is not resolved until the instruction has worked its way through the pipeline. A simple design would insert stalls into the pipeline after a branch instruction until the new branch target address is computed and loaded into the program counter. Each cycle where a stall is inserted is considered one branch delay slot. A more sophisticated design would execute program instructions which are not dependent on the result of the branch instruction. This optimization can be performed in software at compile time by moving instructions into branch delay slots in the in-memory instruction stream, if the hardware supports this. Another side effect is that debuggers need special handling when setting breakpoints on, or single-stepping through, instructions in a branch delay slot.
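As an illustration of filling a delay slot at compile time, the Python sketch below handles the simplest case for a machine with a single branch delay slot. The `Instr` record and the `fill_delay_slot` function are hypothetical names, and the register read/write sets are supplied by hand. Only the instruction immediately before the branch is considered; a real compiler or assembler would typically search more widely and handle more cases.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str                         # assembly text, for display only
    reads: frozenset = frozenset()    # registers the instruction reads
    writes: frozenset = frozenset()   # registers the instruction writes


NOP = Instr("nop")


def fill_delay_slot(block):
    """Return a copy of `block` (which must end in a branch) with one delay slot filled."""
    *body, branch = block
    if body:
        cand = body[-1]
        independent = (
            not (cand.writes & branch.reads)       # branch must not need cand's result
            and not (branch.writes & cand.reads)   # cand must not need e.g. a link register
            and not (branch.writes & cand.writes)  # the two must not write the same register
        )
        if independent:
            # Move the candidate after the branch; it still always executes.
            return body[:-1] + [branch, cand]
    # Nothing safe to move: fall back to a nop in the delay slot.
    return body + [branch, NOP]


useful = fill_delay_slot([
    Instr("addu t0,t1,t2", reads=frozenset({"t1", "t2"}), writes=frozenset({"t0"})),
    Instr("beq a0,a1,target", reads=frozenset({"a0", "a1"})),
])
blocked = fill_delay_slot([
    Instr("subu a0,a0,a1", reads=frozenset({"a0", "a1"}), writes=frozenset({"a0"})),
    Instr("beq a0,zero,target", reads=frozenset({"a0"})),
])
print([i.text for i in useful])
print([i.text for i in blocked])
```

In the first block the `addu` does not affect the branch condition, so it can occupy the delay slot. In the second, the branch reads the register that `subu` writes, so this sketch falls back to a `nop`.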
The ideal number of branch delay slots in a particular pipeline implementation is dictated by the number of pipeline stages, the presence of register forwarding, the pipeline stage in which the branch conditions are computed, whether or not a branch target buffer (BTB) is used, and many other factors. Software compatibility requirements dictate that an architecture may not change the number of delay slots from one generation to the next. This inevitably requires that newer hardware implementations contain extra hardware to ensure that the architectural behavior is followed even though it is no longer relevant to the pipeline design.
Load delay slot
A load delay slot is an instruction which executes immediately after a load (of a register from memory) but does not see the result of the load. Load delay slots are very uncommon because load delays are highly unpredictable on modern hardware. A load may be satisfied from RAM or from a cache, and may be slowed by resource contention. Load delays were seen on very early RISC processor designs. The MIPS I ISA (implemented in the R2000 and R3000 microprocessors) suffers from this problem.
The following example is MIPS I assembly code, showing both a load delay slot and a branch delay slot.
lw v0,4(v1) # load word from address v1+4 into v0
nop # useless load delay slot
jr v0 # jump to the address specified by v0
nop # useless branch delay slot
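The effect of a load delay slot can be sketched with a toy register-machine model in Python. This is not a MIPS emulator: the `run_with_load_delay` function, the tuple encoding and the `move` operation are assumptions made for illustration, and the model simply assumes that the instruction in the delay slot observes the old register value.

```python
def run_with_load_delay(program, regs, memory):
    """Run a toy program in which a loaded value lands one instruction late."""
    regs = dict(regs)
    in_flight = None  # (dest, value) of a load issued by the previous instruction
    for op, *args in program:
        landing, in_flight = in_flight, None
        if op == "lw":
            dest, offset, base = args
            in_flight = (dest, memory[regs[base] + offset])
        elif op == "move":
            dest, src = args
            regs[dest] = regs[src]  # reads the register file before the load lands
        elif op != "nop":
            raise ValueError("unknown op %r" % op)
        if landing is not None:
            regs[landing[0]] = landing[1]  # load result becomes visible now
    if in_flight is not None:
        regs[in_flight[0]] = in_flight[1]
    return regs


result = run_with_load_delay(
    [
        ("lw", "v0", 4, "v1"),  # load word from address v1+4 into v0
        ("move", "a0", "v0"),   # load delay slot: still sees the old v0
        ("move", "a1", "v0"),   # sees the loaded value
    ],
    regs={"v0": 111, "v1": 0, "a0": 0, "a1": 0},
    memory={4: 999},
)
print(result["a0"], result["a1"])
```

Here `a0` receives the stale value 111 while `a1` receives the loaded value 999, which is why the assembly example above places a `nop` between the `lw` and the `jr` that uses its result.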