R8000
The R8000 is a microprocessor chipset developed by MIPS Technologies, Inc. (MTI), Toshiba, and Weitek. It was the first implementation of the MIPS IV instruction set architecture. The R8000 is also known as the TFP, for Tremendous Floating-Point, its name during development.
History
Development of the R8000 started in the early 1990s at Silicon Graphics, Inc. (SGI). The R8000 was specifically designed to provide the performance of circa-1990s supercomputers with a microprocessor instead of a central processing unit (CPU) built from many discrete components such as gate arrays. At the time, the performance of traditional supercomputers was not advancing as rapidly as reduced instruction set computer (RISC) microprocessors. It was predicted that RISC microprocessors would eventually match the performance of more expensive and larger supercomputers at a fraction of the cost and size, making computers with this level of performance more accessible and enabling deskside workstations and servers to replace supercomputers in many situations.
First details of the R8000 emerged in April 1992 in an announcement by MIPS Computer Systems detailing future MIPS microprocessors. In March 1992, SGI announced it was acquiring MIPS Computer Systems, which became a subsidiary of SGI called MIPS Technologies, Inc. (MTI) in mid-1992. Development of the R8000 was transferred to MTI, where it continued. The R8000 was expected to be introduced in 1993, but it was delayed until mid-1994. The first R8000, a 75 MHz part, was introduced on 7 June 1994, priced at US$2,500. In mid-1995, a 90 MHz part appeared in systems from SGI. The R8000's high cost and narrow market (technical and scientific computing) restricted its market share, and although it was popular in its intended market, it was largely replaced by the cheaper and generally better-performing R10000, introduced in January 1996.
The R8000 was used by SGI in its Power Indigo2 workstation, Power Challenge server, Power ChallengeArray cluster, and Power Onyx visualization system. In the November 1994 TOP500 list, 50 of the 500 systems used the R8000. The highest-ranked R8000-based systems were four Power Challenges at positions 154 to 157, each with 18 R8000s.
Description
The chip set consisted of the R8000 microprocessor, the R8010 floating-point unit, two Tag RAMs, and the streaming cache. The R8000 is superscalar, capable of issuing up to four instructions per cycle, and executes instructions in program order. It has a five-stage integer pipeline.
R8000
The R8000 controlled the chip set and executed integer instructions. It contained the integer execution units, the integer register file, the primary caches, and hardware for instruction fetch, branch prediction, and the translation lookaside buffers (TLBs).
In stage one, four instructions are fetched from the instruction cache. The instruction cache is 16 KB in size, direct-mapped, virtually tagged and virtually indexed, and has a 32-byte line size. Instruction decoding occurs during stage two. Load and store instructions begin execution in stage three, and integer instructions in stage four. Integer execution is delayed until stage four so that integer instructions that use the result of a load as an operand can be issued in the same group as that load. Results are written to the integer register file in stage five.
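As an illustration of the cache organisation just described, the following minimal C sketch shows how a virtual address would be split into offset, index, and tag fields for a direct-mapped, virtually indexed cache with 16 KB capacity and 32-byte lines. The field widths follow from the stated sizes; the type and function names are hypothetical and not taken from R8000 documentation.

```c
#include <stdint.h>

/* Direct-mapped, virtually indexed cache with 16 KB capacity and
 * 32-byte lines, as described above: 5 offset bits, 9 index bits,
 * and the remaining address bits form the (virtual) tag. */
#define LINE_SIZE  32u                          /* bytes per line      */
#define NUM_LINES  (16u * 1024u / LINE_SIZE)    /* 512 lines           */

typedef struct {
    uint64_t tag;      /* virtual tag (the cache is virtually tagged) */
    uint32_t index;    /* which of the 512 lines                      */
    uint32_t offset;   /* byte within the 32-byte line                */
} icache_addr_t;

static icache_addr_t split_vaddr(uint64_t vaddr)
{
    icache_addr_t a;
    a.offset = (uint32_t)(vaddr & (LINE_SIZE - 1u));         /* bits 4..0   */
    a.index  = (uint32_t)((vaddr / LINE_SIZE) % NUM_LINES);  /* bits 13..5  */
    a.tag    = vaddr / (LINE_SIZE * NUM_LINES);              /* bits 63..14 */
    return a;
}
```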
The integer register file has nine read ports and four write ports. Four read ports supply operands to two of the four execution units, and another four supply operands to the two address generators. A single read port delivers store data to the two banks of the data cache. Two write ports write results from two functional units to the register file, and the remaining two write ports accept load data returned from the data cache, one for each bank.
The integer functional units consist of two integer units, a shift unit, a multiply-divide unit, and two address generator units. Multiply and divide instructions are executed in the multiply-divide unit, which is not pipelined. As a result, the latency of a multiply instruction is four cycles for 32-bit operands and six cycles for 64-bit operands. The latency of a divide instruction depends on the number of significant digits in the result and thus varies from 21 to 73 cycles.
Loads and stores
Loads and stores begin execution in stage three. The R8000 has two address generation units that calculate virtual addresses for loads and stores. In stage four, the virtual addresses are translated to physical addresses by a dual-ported TLB that contains 384 entries and is three-way set-associative. The 16 KB data cache is accessed in the same cycle. It is dual-ported and is accessed via two 64-bit buses. It can service two loads or one load and one store per cycle. The cache is not protected by parity or by an error-correcting code (ECC). In the event of a cache miss, the data must be loaded from the streaming cache, with an eight-cycle penalty. The cache is virtually indexed, physically tagged, and direct-mapped, has a 32-byte line size, and uses a write-through with allocate protocol. If a load hits in the data cache, the result is written to the integer register file in stage five.
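To make the translation step concrete, here is a minimal sketch of a lookup in a three-way set-associative TLB with 384 entries (128 sets of three ways), matching the organisation stated above. The page size, entry layout, and miss handling are assumptions for illustration only; the actual R8000 TLB entry format is not described here.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_WAYS  3
#define TLB_SETS  (384 / TLB_WAYS)   /* 128 sets                     */
#define PAGE_SIZE 4096u              /* assumed 4 KB pages           */

typedef struct {
    bool     valid;
    uint64_t vpn;    /* virtual page number   */
    uint64_t pfn;    /* physical frame number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_SETS][TLB_WAYS];

/* Returns true and fills *paddr on a hit; on a real MIPS implementation
 * a miss would trap to a software refill handler. */
static bool tlb_translate(uint64_t vaddr, uint64_t *paddr)
{
    uint64_t vpn = vaddr / PAGE_SIZE;
    unsigned set = (unsigned)(vpn % TLB_SETS);

    for (int way = 0; way < TLB_WAYS; way++) {
        const tlb_entry_t *e = &tlb[set][way];
        if (e->valid && e->vpn == vpn) {
            *paddr = e->pfn * PAGE_SIZE + (vaddr % PAGE_SIZE);
            return true;   /* TLB hit */
        }
    }
    return false;          /* TLB miss */
}
```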
R8010
The R8010 executed floating-point instructions provided by an instruction queue on the R8000. The queue decoupled the floating-point pipeline from the integer pipeline, implementing a limited form of out-of-order execution by allowing floating-point instructions to execute before or after the integer instructions issued in the same group. The pipelines were decoupled to help mitigate some of the streaming cache latency.
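The decoupling can be pictured as a simple FIFO between the two chips: the integer side enqueues floating-point instructions as it issues groups, and the floating-point side drains the queue at its own pace. The C sketch below shows that idea only; the queue depth and instruction encoding are hypothetical, not the R8000's actual mechanism.

```c
#include <stdbool.h>
#include <stdint.h>

#define FP_QUEUE_DEPTH 16              /* assumed depth for illustration */

typedef struct {
    uint32_t insn[FP_QUEUE_DEPTH];     /* encoded FP instructions        */
    unsigned head, tail, count;
} fp_queue_t;

/* Integer (R8000) side: enqueue an FP instruction as it is issued. */
static bool fpq_push(fp_queue_t *q, uint32_t insn)
{
    if (q->count == FP_QUEUE_DEPTH)
        return false;                  /* queue full: issue must stall   */
    q->insn[q->tail] = insn;
    q->tail = (q->tail + 1) % FP_QUEUE_DEPTH;
    q->count++;
    return true;
}

/* Floating-point (R8010) side: dequeue whenever it is ready to execute. */
static bool fpq_pop(fp_queue_t *q, uint32_t *insn)
{
    if (q->count == 0)
        return false;                  /* nothing to execute this cycle  */
    *insn = q->insn[q->head];
    q->head = (q->head + 1) % FP_QUEUE_DEPTH;
    q->count--;
    return true;
}
```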
It contained the floating-point register file, a load queue, a store queue, and two identical floating-point units. All instructions except divide and square root are pipelined. The R8010 implements an iterative division and square-root algorithm that relies on the multiplier for a key step, requiring the unit's pipeline to be stalled for the duration of the operation.
Arithmetic instructions except for compares have a four-cycle latency. Single and double precision divides have latencies of 14 and 20 cycles, respectively; and single and double precision square-roots have latencies of 14 and 23 cycles, respectively.
Streaming cache and Tag RAMs
The streaming cache is an external 1 to 16 MB cache that serves as the R8000's L2 unified cache and the R8010's L1 data cache. It operates at the same clock rate as the R8000 and is built from commodity synchronous static RAMs (SSRAMs). This scheme was used to attain sustained floating-point performance, which requires frequent access to data: a small, low-latency primary cache would not hold enough data and would frequently miss, necessitating long-latency refills that reduce performance.

The streaming cache is two-way interleaved. It has two independent banks, each containing data from even or odd addresses. It can therefore perform two reads, two writes, or a read and a write every cycle, provided that the two accesses are to separate banks. Each bank is accessed via two 64-bit unidirectional buses, one for reads and the other for writes. This scheme was used to avoid bus turnaround, which bidirectional buses require. By avoiding bus turnaround, the cache can be read in one cycle and written in the next without an intervening turnaround cycle, improving performance.
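A brief sketch of the even/odd interleaving: two accesses can proceed in the same cycle only when they fall in different banks. The interleave granularity used below (one 64-bit double word per bank, matching the 64-bit per-bank buses) is an assumption; the article only states that one bank holds even addresses and the other odd.

```c
#include <stdint.h>

/* Select the streaming-cache bank for an address, assuming interleaving
 * on 64-bit double words: bit 3 of the byte address picks the bank. */
static unsigned streaming_cache_bank(uint64_t addr)
{
    return (unsigned)((addr >> 3) & 1u);   /* 0 = even bank, 1 = odd bank */
}

/* Two accesses can be serviced in the same cycle only if they map to
 * different banks. */
static int conflict_free(uint64_t a, uint64_t b)
{
    return streaming_cache_bank(a) != streaming_cache_bank(b);
}
```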
The streaming cache's tags are contained on two Tag RAM chips, one for each bank. Both chips contain identical data. Each chip contains 1.189 Mbit of cache tags implemented with four-transistor SRAM cells. The chips are fabricated in a 0.7 μm BiCMOS process with two levels of polysilicon and two levels of aluminium interconnect. BiCMOS circuitry was used in the decoders and in the combined sense-amplifier and comparator portions of the chip to reduce cycle time. Each Tag RAM measures 14.8 mm by 14.8 mm, is packaged in a 155-pin CPGA, and dissipates 3 W at 75 MHz. In addition to providing the cache tags, the Tag RAMs are responsible for the streaming cache being four-way set-associative. To avoid a high pin count, the tags are four-way set-associative and logic on the Tag RAM selects which of the four sets to access after the lookup, rather than the usual approach of accessing all ways of the data array in parallel.
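The set-selection scheme can be sketched as follows: associativity is resolved entirely in the tag lookup, and the data SSRAMs are then addressed as if direct-mapped, with the winning set appended to the index, so the data path needs only one bank's worth of pins. Sizes, field widths, and names here are illustrative assumptions rather than R8000 values (the article calls the four ways "sets").

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4   /* four-way set-associative tags */

typedef struct {
    bool     valid[WAYS];
    uint64_t tag[WAYS];
} tag_set_t;

/* Tag RAM side: compare the lookup tag against all four ways and return
 * the matching way (0..3), or -1 on a miss. */
static int tagram_lookup(const tag_set_t *set, uint64_t tag)
{
    for (int w = 0; w < WAYS; w++)
        if (set->valid[w] && set->tag[w] == tag)
            return w;
    return -1;
}

/* Data SSRAM side: the selected way is appended to the set index, so the
 * data array is addressed exactly like a direct-mapped array. */
static uint32_t ssram_index(uint32_t set_index, int way)
{
    return set_index * (uint32_t)WAYS + (uint32_t)way;
}
```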
Access to the streaming cache is pipelined to mitigate some of the latency. The pipeline has five stages: in stage one, addresses are sent to the Tag RAMs, which are accessed in stage two. Stage three is for the signals from the Tag RAMs to propagate to the SSRAMs. In stage four, the SSRAMs are accessed and data is returned to the R8000 or R8010 in stage five.
Physical
The R8000 contained 2.6 million transistors and measured 17.34 mm by 17.30 mm (299.98 mm²). The R8010 contained 830,000 transistors. In total, the two chips contained 3.43 million transistors. Both were fabricated by Toshiba in its VHMOSIII process, a 0.7 µm, triple-layer metal complementary metal–oxide–semiconductor (CMOS) process. Both are packaged in 591-pin ceramic pin grid array (CPGA) packages. Both chips used a 3.3 V power supply, and the R8000 dissipated 13 W at 75 MHz.
Further reading
- Ikumi, N. et al. (February 1994). "A 300 MIPS, 300 MFLOPS four-issue CMOS superscalar microprocessor". ISSCC Digest of Technical Papers.
- Unekawa, Y. et al. (April 1994). "A 110-MHz/1-Mb synchronous TagRAM". IEEE Journal of Solid-State Circuits 29 (4): pp. 403–410.