Tandem Computers
Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss. The company was founded in 1974 and remained independent until 1997. It is now a server division within Hewlett Packard.
Tandem's NonStop systems use a number of independent identical processors and redundant storage devices and controllers to provide automatic high-speed "failover" in the case of a hardware or software failure.
To contain the scope of failures and of corrupted data, these multi-computer systems have no shared central components, not even main memory. Conventional multi-computer systems all use shared memories and work directly on shared data objects. Instead, NonStop processors cooperate by exchanging messages across a reliable fabric, and software takes periodic snapshots for possible rollback of program memory state.
Besides handling failures well, this "shared-nothing" messaging system design also scales extremely well to the largest commercial workloads. Each doubling of the total number of processors would double system throughput, up to the maximum configuration of 4000 processors. In contrast, the performance of conventional multiprocessor systems is limited by the speed of some shared memory, bus, or switch. Adding more than 4–8 processors that way gives no further system speedup. NonStop systems have more often been bought to meet scaling requirements than for extreme fault tolerance. They compete well against IBM's largest mainframes, despite being built from simpler minicomputer technology.
Besides fault tolerance and scaling, NonStop machines also featured an industry-leading implementation of a SQL relational database, and industry-leading support for networking and for geographically dispersed systems.
Tandem was founded in 1974 by Jimmy Treybig, a loquacious Texan. Treybig first saw the market need for fault tolerance in OLTP (online transaction processing) systems while running a marketing team for Hewlett Packard's HP 3000 computer division, but HP was not interested in developing for this niche. He then joined the venture capital firm Kleiner & Perkins and developed the Tandem business plan there. Treybig pulled together a core engineering team hired away from the HP 3000 division: Mike Green, Jim Katzman, and Jack Loustaunou. Their business plan called for ultra-reliable systems that never had outages and never lost or corrupted data. These were modular in a new way that was safe from all "single-point failures", yet would be only marginally more expensive than conventional non-fault-tolerant systems. They would also be less expensive and support more throughput than some existing ad-hoc toughened systems that used redundant but usually wasted "hot spares".
Each engineer was confident they could quickly pull off their own part of this tricky new design, but doubted that others' areas could be worked out. Those parts of the hardware and software design that did not have to be different were largely based on incremental improvements to the familiar hardware and software designs of the HP 3000. Many subsequent engineers and programmers also came from HP. Tandem's headquarters in Cupertino, California, were a quarter mile away from the HP 3000 offices. Initial venture-capital (VC) investment in Tandem Computers came from Tom Perkins, who was formerly a general manager of the HP 3000 division.
The business plan included detailed ideas for building a unique corporate culture reflecting Treybig's values.
The design of the initial Tandem/16 hardware was completed in 1975 and the first system shipped to Citibank in May 1976.
The company enjoyed uninterrupted exponential growth up through 1983. Inc. magazine ranked Tandem as the fastest growing public company in America.
The NonStop product line has grown and evolved in an upward-compatible way from the initial T/16 fault-tolerant system, with three major changes to date to its top-level modular architecture or its programming-level instruction set architecture. Within each series, there have been several major re-implementations as chip technology progressed.
While conventional systems of the era, including large mainframes, had a mean time between failures (MTBF) on the order of a few days, the NonStop system was designed for failure intervals 100 times longer, with uptimes measured in years. Nevertheless, the NonStop was designed to be price-competitive with conventional systems, with a simple 2-CPU system priced at just over twice that of a competing single-processor mainframe, as opposed to four or more times for other fault-tolerant solutions.
The first system was the Tandem/16 or T/16, later re-branded NonStop I. The machine consisted of between two and 16 CPUs, organized as a fault-tolerant computer cluster packaged in a single rack. Each CPU had its own private, unshared memory, its own I/O processor, its own private I/O bus to connect to I/O controllers, and dual connections to all the other CPUs over a custom inter-CPU backplane bus called Dynabus. Each disk controller or network controller was duplicated and had dual connections to both CPUs and devices. Each disk was mirrored, with separate connections to two independent disk controllers. If a disk failed, its data was still available from its mirrored copy. If a CPU, controller, or bus failed, the disk was still reachable through an alternative CPU, controller, or bus. Each disk or network controller was connected to two independent CPUs. Power supplies were each wired to only one side of some pair of CPUs, controllers, or buses, so that the system would keep running well without loss of connections if one power supply failed. The careful, complex arrangement of parts and connections in customers' larger configurations was documented in a Mackie diagram, named after lead salesman David Mackie, who invented the notation.
None of these duplicated parts were wasted "hot spares"; everything added to system throughput during normal operations.
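The redundancy rule described above (every disk reachable through more than one CPU, controller, and mirror) can be checked mechanically. The following is a minimal sketch, assuming a hypothetical two-CPU configuration; the component names and the path model are illustrative and are not taken from any Tandem tool or from the Mackie notation.

```python
from itertools import product

# Hypothetical minimal configuration; every name here is illustrative.
CPUS = ["cpu0", "cpu1"]
CONTROLLERS = ["ctrl_a", "ctrl_b"]
DISKS = ["disk_primary", "disk_mirror"]

# A path is the set of parts that must all work for a CPU to reach the data.
# Each controller connects to both CPUs and to both halves of the mirror.
PATHS = [frozenset(p) for p in product(CPUS, CONTROLLERS, DISKS)]

def data_reachable(failed, paths=PATHS):
    """True if at least one path avoids every failed component."""
    return any(path.isdisjoint(failed) for path in paths)

def single_points_of_failure(paths):
    """Return the components that appear on every path."""
    components = set().union(*paths)
    return sorted(c for c in components if all(c in p for p in paths))

if __name__ == "__main__":
    print(single_points_of_failure(PATHS))            # no component is on every path
    print(data_reachable({"cpu0", "ctrl_a"}))         # still reachable via cpu1 and ctrl_b
    print(data_reachable({"disk_primary", "disk_mirror"}))  # both copies lost
```

Dropping the mirror from `DISKS` makes the remaining disk show up as a single point of failure, which is the kind of configuration mistake such a diagram is meant to expose.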
Besides recovering well from failed parts, the T/16 was also designed to detect as many kinds of intermittent failures as possible, as soon as possible. This prompt detection is called "fail fast". The point was to find and isolate corrupted data before it was permanently written into databases and other disk files. In the T/16, error detection was by some added custom circuits that added little cost to the total design; no major parts were duplicated just to get error detection.
The T/16 CPU was a proprietary design. It was greatly influenced by the HP 3000 minicomputer. They were both microprogrammed, 16-bit, stack-based machines with segmented, 16-bit virtual addressing. Both were intended to be programmed exclusively in high-level languages, with no use of assembler. Both were initially implemented via standard low-density TTL chips, each holding a 4-bit slice of the 16-bit ALU. Both had a small number of top-of-stack, 16-bit data registers plus some extra address registers for accessing the memory stack. Both used Huffman encoding of operand address offsets, to fit a large variety of address modes and offset sizes into the 16-bit instruction format with very good code density. Both relied heavily on pools of indirect addresses to overcome the short instruction format. Both supported larger 32- and 64-bit operands via multiple ALU cycles, and memory-to-memory string operations. Both used "big-endian" addressing of long versus short memory operands. These features had all been inspired by Burroughs B5500-B6800 mainframe stack machines.
The T/16 instruction set changed several features from the HP 3000 design. The T/16 supported paged virtual memory from the beginning. The HP 3000 series did not add paging until the PA-RISC generation, 10 years later. Tandem added support for 32-bit addressing in its second machine; the HP 3000 lacked this until its PA-RISC generation. Paging and long addresses were critical for supporting complex system software and large applications. The T/16 treated its top-of-stack registers in a novel way; the compiler, not the microcode, was responsible for deciding when full registers were spilled to the memory stack and when empty registers were re-filled from the memory stack. On the HP 3000, this decision took extra microcode cycles in every instruction. The HP 3000 supported COBOL with several instructions for calculating directly on arbitrary-length BCD (binary-coded decimal) strings of digits. The T/16 simplified this to single instructions for converting between BCD strings and 64-bit binary integers.
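To illustrate the kind of conversion those T/16 instructions performed, here is a minimal sketch. The text does not specify the digit layout, so this assumes unsigned packed BCD (two decimal digits per byte, most significant first); the function names are hypothetical and this is not the TNS instruction semantics.

```python
def bcd_to_int(bcd: bytes) -> int:
    """Convert packed BCD (two decimal digits per byte) to a binary integer."""
    value = 0
    for byte in bcd:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError(f"invalid BCD byte: {byte:#04x}")
        value = value * 100 + high * 10 + low
    return value

def int_to_bcd(value: int, num_bytes: int) -> bytes:
    """Convert a non-negative integer to packed BCD of a fixed byte length."""
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    out = bytearray(num_bytes)
    # Fill from the least significant byte, two decimal digits at a time.
    for i in range(num_bytes - 1, -1, -1):
        value, pair = divmod(value, 100)
        out[i] = ((pair // 10) << 4) | (pair % 10)
    if value:
        raise OverflowError("value does not fit in the requested length")
    return bytes(out)

if __name__ == "__main__":
    print(bcd_to_int(bytes([0x12, 0x34])))
    print(int_to_bcd(1234, 2).hex())
```

Once a decimal string is in binary form, ordinary integer arithmetic can be used on it, which is the simplification the paragraph describes.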
In the T/16, each CPU consisted of two boards of TTL logic and SRAMs, and ran at about 0.7 MIPS. At any instant, it could access only four virtual memory segments (System Data, System Code, User Data, User Code), each limited to 128 kB in size. The 16-bit address spaces were already too small for major applications when it shipped.
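The 128 kB segment limit follows from the 16-bit address size if each address names one 16-bit word. That word-addressing detail is an assumption made here to reconcile the two figures in the text. A quick check of the arithmetic:

```python
ADDRESS_BITS = 16     # from the text: 16-bit virtual addresses
BYTES_PER_WORD = 2    # assumption: each address names one 16-bit word
SEGMENTS = ["System Data", "System Code", "User Data", "User Code"]

segment_bytes = (2 ** ADDRESS_BITS) * BYTES_PER_WORD
total_bytes = segment_bytes * len(SEGMENTS)

print(segment_bytes // 1024, "kB per segment")
print(total_bytes // 1024, "kB addressable at any instant")
```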
The first release of T/16 had only a single programming language, Tandem Application Language (TAL). This was an efficient machine-dependent systems programming language (for operating systems, compilers, etc.) but could also be used for non-portable applications. It was derived from HP 3000's System Programming Language (SPL). Both had semantics similar to C but a syntax based on Burroughs' ALGOL. Subsequent releases added support for COBOL 74, Fortran, and MUMPS.
The Tandem NonStop series ran a custom operating system which was significantly different from Unix or HP 3000's MPE. It was initially called T/TOS (Tandem Transactional Operating System) but was soon renamed Guardian for its ability to protect all data from machine faults or software faults. In contrast to all other commercial operating systems, Guardian was based on message passing as the basic way for all processes to interact, without shared memory, regardless of where the processes were running. This approach easily scaled to multiple-computer clusters and helped isolate corrupted data before it propagated.
All file system processes and all transactional application processes were structured as master/slave pairs of processes running in separate CPUs. The slave process periodically took snapshots of the master's memory state, and took over the workload if and when the master process ran into trouble. This allowed the application to survive failures in any CPU or its associated devices, without data loss. It further allowed recovery from some intermittent-style software failures. Between failures, the monitoring by the slave process added some performance overhead, but this was far less than the 100% duplication in other system designs. Some major early applications were directly coded in this checkpoint style, but most instead used various Tandem software layers which hid the details of this in a semi-portable way.
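The checkpoint-and-takeover idea can be sketched in a few lines. This is a toy model, not Guardian's actual mechanism: the class names, the checkpoint interval, the dictionary state, and the assumption that requesters resend any work not yet covered by a checkpoint are all illustrative. The process the article calls the master is named `Primary` here, and the slave is named `Backup`.

```python
import copy
from collections import deque

class Backup:
    """Holds the last checkpointed state; does no application work itself."""
    def __init__(self):
        self.inbox = deque()  # stands in for the inter-CPU message bus
        self.state = {"balance": 0, "applied": 0}

    def drain(self):
        # Apply checkpoint messages in arrival order; the newest one wins.
        while self.inbox:
            self.state = self.inbox.popleft()

class Primary:
    def __init__(self, backup, checkpoint_every=2):
        self.backup = backup
        self.checkpoint_every = checkpoint_every
        self.state = {"balance": 0, "applied": 0}

    def apply(self, amount):
        self.state["balance"] += amount
        self.state["applied"] += 1
        if self.state["applied"] % self.checkpoint_every == 0:
            # Send a private copy: the two processes share no memory.
            self.backup.inbox.append(copy.deepcopy(self.state))

def run(requests, fail_after=None):
    """Process requests; optionally lose the primary before request `fail_after`."""
    backup = Backup()
    primary = Primary(backup)
    for count, amount in enumerate(requests):
        if count == fail_after:
            # The primary's CPU is gone, and its memory with it.
            backup.drain()
            survivor = Primary(Backup())
            survivor.state = backup.state
            # Assumption: requesters resend anything not yet checkpointed.
            for pending in requests[survivor.state["applied"]:]:
                survivor.apply(pending)
            return survivor.state
        primary.apply(amount)
    return primary.state

if __name__ == "__main__":
    deposits = [10, 20, 30, 40, 50]
    print(run(deposits))                 # no failure
    print(run(deposits, fail_after=3))   # failure after three requests
```

Both runs end in the same state. The backup spends its time receiving small messages rather than re-executing the workload, which is why the overhead is well below full duplication.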
In 1981, all T/16 CPUs were replaced by the NonStop II. Its main difference from the T/16 was support for occasional 32-bit addressing via a user-switchable "extended data segment". This supported the next ten years of growth in software and was a huge advantage over the T/16 or HP 3000. Unfortunately, visible registers remained 16-bit, and this unplanned addition to the instruction set required executing many instructions per memory reference compared to most 32-bit minicomputers. All subsequent TNS computers were hampered by this instruction set inefficiency. Also, the NonStop II lacked wider internal data paths and so required additional microcode steps for 32-bit addresses. A NonStop II CPU had 3 boards, using chips and design similar to the T/16. The NonStop II also replaced core memory with battery-backed DRAM memory.
In 1983, the NonStop TXP CPU was the first entirely new implementation of the TNS instruction set architecture.
It was built from standard TTL chips and Programmed Array Logic chips, with 4 boards per CPU module. It had Tandem's first use of cache memory. It had a more direct implementation of 32-bit addressing, but still sent addresses through 16-bit adders. A wider microcode store allowed a major reduction in the cycles executed per instruction; speed increased to 2.0 MIPS. It used the same rack packaging, controllers, backplane, and buses as before. The Dynabus and I/O buses had been overdesigned in the T/16 so they would work for several generations of upgrades.
Up to 14 TXP and NonStop II systems could now be combined via FOX, a long-distance fault-tolerant fibre optic bus for connecting TNS clusters across a business campus. The result was a cluster of clusters with a total of 224 CPUs (14 systems of up to 16 CPUs each). This allowed further scale-up for taking on the largest mainframe applications.
Just as work could fail over between the CPU modules within a computer, Guardian could fail over entire task sets to other machines in the network. World-wide clusters of 4000 CPUs could also be built via conventional long-haul network links.
In 1986, Tandem introduced a third generation CPU, the NonStop VLX.
It had 32-bit datapaths, wider microcode, 12 MHz cycle time, and a peak rate of one instruction per microcycle. It was built from 3 boards of ECL gate array chips (with TTL pinout). It had a revised Dynabus with speed raised to 20 Mbytes/sec per link, 40 Mbytes/sec total. FOX II increased the physical diameter of TNS clusters to 4 km.
Tandem's initial database support was only for hierarchical, non-relational databases via the ENSCRIBE file system. This was extended into a relational database called ENCOMPASS.
In 1986 Tandem introduced the first fault-tolerant SQL database, NonStop SQL.
Developed totally in-house, NonStop SQL includes a number of features based on Guardian to ensure data validity across nodes. NonStop SQL is famous for scaling linearly in performance with the number of nodes added to the system, whereas most databases had performance that plateaued quite quickly, often after just two CPUs. A later version released in 1989 added transactions that could be spread over nodes, a feature that remained unique for some time. Later, the SQL database group was first co-opted and then absorbed into Microsoft's SQL development effort. One outcome of this collaboration was Microsoft's clustered system technology.
In 1987 Tandem introduced the NonStop CLX, a low-cost less-expandable minicomputer system.
Its role was for growing the low end of the fault-tolerant market, and for deploying on the remote edges of large Tandem networks. Its initial performance was roughly similar to the TXP; later versions were about 20% slower than a VLX. Its small cabinet could be installed into any "copier room" office environment. A CLX CPU was one board, containing six "compiled silicon" ASIC CMOS chips. The CPU core chip was duplicated and lock stepped for maximal error detection. Pinout was a main limitation of this chip technology. Microcode, cache, and TLB were all external to the CPU core and shared a single bus and single SRAM memory bank. As a result, CLX required at least 2 machine cycles per instruction.
In 1989 Tandem introduced the NonStop Cyclone, a fast but expensive system for the mainframe end of the market. Each self-checking CPU took 3 boards full of hot-running ECL gate array chips, plus memory boards. Despite being microprogrammed, the CPU was superscalar, often completing two instructions per cache cycle. This was accomplished by having a separate microcode routine for every common pair of instructions. That fused pair of stack instructions generally accomplished the same work as a single instruction of normal 32-bit minicomputers. Cyclone processors were packaged as sections of 4 CPUs each, and the sections joined by a fiber optic version of Dynabus.
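The pairing idea can be shown with a toy stack machine whose dispatcher looks one instruction ahead and, when it finds a routine for the pair, executes both in a single step. This is only a sketch of the general technique: the three opcodes, the single fused pair, and the dispatch count are invented for illustration and do not reflect the TNS instruction set or Cyclone's microcode.

```python
def op_load(stack, mem, arg):
    stack.append(mem[arg])

def op_add(stack, mem, arg):
    right = stack.pop()
    stack.append(stack.pop() + right)

def op_store(stack, mem, arg):
    mem[arg] = stack.pop()

def fused_load_add(stack, mem, arg1, arg2):
    # Same effect as LOAD arg1 followed by ADD, without the push and pop.
    stack[-1] += mem[arg1]

SINGLE = {"LOAD": op_load, "ADD": op_add, "STORE": op_store}
FUSED = {("LOAD", "ADD"): fused_load_add}  # one routine per common pair

def run(program, mem, fuse=True):
    """Execute (opcode, operand) tuples; return the number of dispatches."""
    stack, pc, dispatches = [], 0, 0
    while pc < len(program):
        op, arg = program[pc]
        routine = None
        if fuse and pc + 1 < len(program):
            routine = FUSED.get((op, program[pc + 1][0]))
        if routine:
            routine(stack, mem, arg, program[pc + 1][1])
            pc += 2
        else:
            SINGLE[op](stack, mem, arg)
            pc += 1
        dispatches += 1
    return dispatches

if __name__ == "__main__":
    program = [("LOAD", "a"), ("LOAD", "b"), ("ADD", None), ("STORE", "c")]
    mem = {"a": 2, "b": 3}
    print(run(program, mem), mem["c"])          # fused dispatch
    mem = {"a": 2, "b": 3}
    print(run(program, mem, fuse=False), mem["c"])  # one dispatch per instruction
```

The fused `LOAD`/`ADD` step does what a register machine would express as a single "add memory operand" instruction, which matches the comparison made in the paragraph.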
Like Tandem's prior high end machines, Cyclone cabinets were styled with lots of angular black to suggest strength and power. Advertising videos directly compared Cyclone to the SR-71 Blackbird Mach-3 spy plane. Cyclone's name was supposed to represent its unstoppable speed in roaring through OLTP workloads. Announcement day was Oct 17 and the press came to town. That afternoon, the region was struck by the 6.9 Loma Prieta earthquake, causing freeway collapses in Oakland and major fires in San Francisco. Tandem offices were shaken but no one was badly hurt on site. This was the first and last time that Tandem named its products after a natural disaster.
Development of Rainbow's advanced client/server application development framework called "Crystal" continued awhile longer and was spun off as the "Ellipse" product of Cooperative Systems Inc.
In 1985, Tandem attempted to grab a piece of the rapidly growing personal computer market with its introduction of the MS-DOS based Dynamite PC/workstation. Sadly, numerous design compromises (including a unique 8086-based hardware platform incompatible with expansion cards of the day and extremely limited compatibility with IBM-based PCs) relegated the Dynamite to serving primarily as a smart terminal. It was quietly and quickly withdrawn from the market.
Tandem's message-based NonStop operating system had advantages for scaling, extreme reliability, and efficiently using expensive "spare" resources. But many potential customers wanted just good-enough reliability in a small system, using a familiar Unix operating system and industry-standard programs. Tandem's various fault-tolerant competitors all adopted a simpler hardware-only memory-centric design where all recovery was done by switching between hot spares. The most successful competitor was Stratus Technologies, whose machines were re-marketed by IBM as "IBM System/88".
In such systems, the spare processors do not contribute to system throughput between failures, but merely redundantly execute exactly the same data thread as the active processor at the same instant, in "lock step". Faults are detected by seeing when the cloned processors' outputs diverged. To detect failures, the system must have 2 physical processors for each logical, active processor. To also implement automatic failover recovery, the system must have 3 or 4 physical processors for each logical processor. The 3x-4x cost of this sparing is practical when the duplicated parts are commodity single-chip microprocessors.
Tandem's products for this market began with the Integrity line in 1989, using MIPS processors and a "NonStop UX" variant of Unix. It was developed in Austin, Texas. In 1991, the Integrity S2 used TMR, Triple Modular Redundancy, where each logical CPU used three MIPS R2000 microprocessors to execute the same data thread, with voting to find and lock out a failed part. Their fast clocks could not be synchronized as in strict lock stepping, so voting instead happened at each interrupt.
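A majority vote over three replicas is simple to express. The sketch below shows only the voting step. How the Integrity S2 captured and compared processor state at an interrupt is not described here, so the hashable "output" values and the function name are assumptions.

```python
from collections import Counter

def vote(outputs):
    """Majority-vote over three replica outputs.

    Returns (agreed_value, index_of_dissenting_replica_or_None).
    """
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three replicas")
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None
    if count == 2:
        dissenter = next(i for i, out in enumerate(outputs) if out != value)
        return value, dissenter  # the caller can lock this replica out
    raise RuntimeError("no majority: all three replicas disagree")

if __name__ == "__main__":
    print(vote([7, 7, 7]))
    print(vote([7, 9, 7]))
```

With three replicas a single fault is both detected and masked, because the two agreeing replicas identify which one is wrong. Two replicas alone can detect a disagreement but cannot tell which side failed.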
Some other version of Integrity used 4x "pair and spares" redundancy. Pairs of processors ran in lock-step to check each other. When they disagreed, both processors were marked untrusted and their workload was taken over by a hot-spare pair of processors whose state was already current. In 1995, the Integrity S4000 was the first to use ServerNet and moved toward sharing peripherals with the NonStop line.
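The pair-and-spare arrangement can be sketched as two self-checking pairs, both of which execute every step so that the spare's state is always current. Processors are modeled here as plain functions, and the class and function names are hypothetical; this is an illustration of the scheme, not of Integrity hardware.

```python
class LockstepPair:
    """Two processors run the same step; a mismatch marks the pair untrusted."""
    def __init__(self, cpu_a, cpu_b):
        self.cpus = (cpu_a, cpu_b)
        self.trusted = True

    def step(self, x):
        a, b = (cpu(x) for cpu in self.cpus)
        if a != b:
            # The pair cannot tell which half is wrong, so both are distrusted.
            self.trusted = False
            return None
        return a

def pair_and_spare_step(active, spare, x):
    """Run one step on both pairs; return (result, pair_that_supplied_it)."""
    result = active.step(x)
    spare_result = spare.step(x)  # the spare stays current by doing the same work
    if active.trusted:
        return result, active
    if spare.trusted:
        return spare_result, spare
    raise RuntimeError("both pairs have failed")

if __name__ == "__main__":
    good = lambda x: x * 2
    faulty = lambda x: x * 2 + 1  # simulated hardware fault
    active = LockstepPair(good, faulty)
    spare = LockstepPair(good, good)
    result, now_active = pair_and_spare_step(active, spare, 21)
    print(result, now_active is spare)
```

Four physical processors yield one logical processor, which is the 4x cost the surrounding text refers to.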
In 1995-1997, Tandem partnered with Microsoft to implement high-availability features and advanced SQL configurations in clusters of commodity Windows NT machines. This project was called "Wolfpack" and first shipped as Microsoft Cluster Server in 1997. Microsoft benefited greatly from this partnership; Tandem did not.
HP's HP 3000 MPE division had similar roadmap problems but found a clever way forward in 1986. HP Labs designed a RISC computer core which was stripped of all non-essentials so it could soon fit into one chip. And it was efficiently pipelined and ran even faster than the ECL mainframes of that time.
It was many times faster than the microprogrammed CMOS stack machines that the rest of HP was then designing.
But how to migrate all the vendor, customer, and third-party software for those existing product lines? Some software was portable and could be directly recompiled for the new instruction set. Other software was not easily recompiled as is. HP Labs invented efficient ways to run the old binaries of that software on the new machine, by emulation and by automatic translation of binary object code. And they told everyone how they did it. Similar object code translation techniques were subsequently used by Apple Computer, to move Macintosh software from 68000 machines to PowerPC machines, and by Digital Equipment Corporation, to move VMS users from VAX to Alpha machines.
One flaw in the HP 3000 migration plan was that HP also ambitiously tried to rewrite the entire MPE operating system in a new language at the same time. They didn't plan to use the same emulation techniques on their own primary code. But their rewrite to native mode took years longer to complete than expected. HP's first generation RISC hardware was already obsolete before its MPE software was ready to release. Tandem learned from this mistake.
Tandem could not use HP's PA-RISC or Sun's SPARC CPUs, for business reasons. Instead, Tandem partnered with MIPS and adopted its R3000 and successor chipsets and their advanced optimizing compiler. Subsequent NonStop Guardian machines using the MIPS instruction set were known to programmers as TNS/R machines, but had a variety of marketing names.
In 1991, Tandem released the Cyclone/R, also known as CLX/R. This was a low cost mid-range system based on CLX components, but used R3000 microprocessors instead of the much slower CLX stack machine board. To minimize time to market, this machine was initially shipped without any MIPS native-mode software. Everything, including its NSK operating system and SQL database, was compiled to TNS stack machine code. That object code was then translated to equivalent partially optimized MIPS instruction sequences at kernel install time by a tool called the Accelerator.
Less-important programs could also be executed directly without pre-translation, via a TNS code interpreter. These migration techniques were very successful and are still in use today. Everyone's software was brought over without extra work, and the performance was good enough for mid-range machines, and programmers could ignore the instruction differences, even when debugging at machine code level. These Cyclone/R machines were updated with a faster native-mode NSK in a follow-up release.
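The two migration paths described above, ahead-of-time translation and direct interpretation, can be contrasted on a toy stack machine. This is only a sketch of the general idea: the opcodes, the virtual register names, and the translation strategy (tracking stack depth statically and mapping each slot to a register) are assumptions for illustration and are not how the Accelerator or the TNS interpreter actually worked.

```python
def interpret(program, mem):
    """Execute stack code directly, one instruction at a time."""
    stack = []
    for op, arg in program:
        if op == "LOAD":
            stack.append(mem[arg])
        elif op == "ADD":
            right = stack.pop()
            stack.append(stack.pop() + right)
        elif op == "STORE":
            mem[arg] = stack.pop()
        else:
            raise ValueError(f"unknown opcode {op}")

def translate(program):
    """Translate stack code to register code by tracking stack depth statically."""
    out, depth = [], 0
    for op, arg in program:
        if op == "LOAD":
            out.append(("ld", f"r{depth}", arg))     # push becomes a register load
            depth += 1
        elif op == "ADD":
            depth -= 1
            out.append(("add", f"r{depth - 1}", f"r{depth}"))  # dst += src
        elif op == "STORE":
            depth -= 1
            out.append(("st", arg, f"r{depth}"))     # pop becomes a register store
        else:
            raise ValueError(f"unknown opcode {op}")
    return out

def execute(register_code, mem):
    """Run the translated register code."""
    regs = {}
    for op, dst, src in register_code:
        if op == "ld":
            regs[dst] = mem[src]
        elif op == "add":
            regs[dst] += regs[src]
        elif op == "st":
            mem[dst] = regs[src]

if __name__ == "__main__":
    program = [("LOAD", "a"), ("LOAD", "b"), ("ADD", None), ("STORE", "c")]
    for line in translate(program):
        print(line)
```

The translated code does the stack bookkeeping once, at translation time, instead of on every execution. Both paths must leave memory in the same state, which is what lets programmers ignore which one ran their code.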
The R3000 and later microprocessors had only a typical amount of internal error checking, insufficient for Tandem's needs. So the Cyclone/R ran pairs of R3000 processors in lock step, running the same data thread. It used a curious variation of lock stepping. The checker processor ran 1 cycle behind the primary processor. This allowed them to share a single copy of external code and data caches without putting excessive pinout load on the sysbus and lowering the system clock rate. To successfully run microprocessors in lock step, the chips must be designed to be fully deterministic. Any hidden internal state must be cleared by the chip's reset mechanism. Otherwise, the matched chips will sometimes get out of sync for no visible reason and without any faults, long after the chips are restarted. All chip designers agree that these are good principles because they help them test chips at manufacturing time. But all new microprocessor chips seemed to have bugs in this area, and required months of shared work between MIPS and Tandem to eliminate or work around the final subtle bugs.
In 1993, Tandem released the NonStop Himalaya K-series with the faster MIPS R4400, a native mode NSK, and fully expandable Cyclone system components. These were still connected by Dynabus, Dynabus+, and the original I/O bus, which by now were all running out of performance headroom.
In 1994, the NonStop Kernel was extended with a Unix-like POSIX environment called Open System Services. The original Guardian shell and ABI remained available.
In 1997 Tandem introduced the NonStop Himalaya S-Series with a new top-level system architecture based on ServerNet connections. ServerNet replaced the obsolete Dynabus, FOX, and I/O buses. It was much faster, more general, and could be extended to more than just two-way redundancy via an arbitrary fabric of point-to-point connections.
Tandem designed ServerNet for its own needs but then promoted its use by others; it evolved into the InfiniBand industry standard.
All S-Series machines used MIPS processors, including the R4400, R10000, R12000, and R14000.
The design of the later, faster MIPS cores was primarily funded by Silicon Graphics Inc. But Intel's Pentium Pro overtook the performance of RISC designs, and SGI's graphics business shrank. After the R10000, there was no investment in significant new MIPS core designs for high-end servers. So Tandem eventually needed to move its NonStop product line yet again, onto some other microprocessor architecture with competitive fast chips.
Compaq's x86-based server division was an early outside adopter of Tandem's ServerNet/InfiniBand interconnect technology. In 1997, Compaq acquired the Tandem Computers company and NonStop customer base to balance Compaq's heavy focus on low-end PCs. In 1998, Compaq also acquired the much larger Digital Equipment Corporation and inherited its DEC Alpha RISC servers with OpenVMS and Tru64 Unix customer bases. Tandem was then midway in porting its NonStop product line from MIPS R12000 microprocessors to Intel's new Itanium Merced microprocessors. This project was restarted with Alpha as the new target to align NonStop with Compaq's other large server lines. But in 2001, Compaq terminated all Alpha engineering investments in favor of the unproven Itanium microprocessors. The Alpha version of NonStop died before shipping. So the NonStop migration project was restarted yet again, targeting Itanium McKinley.
The combined companies' PC-centric sales force did not understand how to sell large complex systems to large enterprises. A single sale takes many months of proposals and education rather than a single negotiation.
product lines in favor of Intel's Itanium microprocessors that HP helped to design. Shortly afterwards, Compaq and HP announced their plan to merge, and consolidate their similar product lines. This contentious merger became official in May 2002. The consolidations were painful and destroyed the DEC and 'HP Way' engineer-oriented cultures. But the combined company did know how to sell complex systems to enterprises and profit, so it was an improvement for the surviving NonStop division and its customers.
In some ways, Tandem's journey from HP-inspired startup, to an HP-inspired competitor, then to an HP division was "bringing Tandem back to its original roots." But this was definitely not the same HP.
The re-port of the NSK-based NonStop product line from MIPS processors to Itanium-based processors was finally completed and is branded as 'HP Integrity NonStop Servers'. (This NSK Integrity NonStop is unrelated to Tandem's original 'Integrity' series for Unix.)
It was not possible to run Itanium McKinley chips with clock-level lock stepping. So the Integrity NonStop machines instead use comparisons between chip states at longer time scales, at interrupt points and at various software sync points in between interrupts. The intermediate sync points are automatically triggered at every N'th taken branch instruction, and are also explicitly inserted into long loop bodies by all NonStop compilers. The machine design supports both dual and triple redundancy, with either 2 or 3 physical microprocessors per logical Itanium processor. The triple version is sold to customers needing the utmost reliability. This new checking approach is called NSAA, NonStop Advanced Architecture.
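Comparing replicas at software sync points rather than on every clock can be sketched as follows. This is a toy model of the idea only: the workload (a running sum), the way a fault is injected, and the function names are assumptions, and real NSAA hardware compares far more than a single integer.

```python
def run_with_sync_points(values, n, fault_at=None):
    """Sum `values`, yielding (taken_branches, state) at every n'th taken branch.

    `fault_at` flips one bit of the running sum at that iteration to
    simulate a transient hardware fault in this replica.
    """
    total, taken = 0, 0
    for i, value in enumerate(values):
        total += value
        if i == fault_at:
            total ^= 1
        taken += 1  # the loop's back-edge is counted as one taken branch
        if taken % n == 0:
            yield taken, total

def first_divergence(replica_a, replica_b):
    """Return the branch count of the first sync point where states differ."""
    for (count, state_a), (_, state_b) in zip(replica_a, replica_b):
        if state_a != state_b:
            return count
    return None

if __name__ == "__main__":
    data = list(range(1, 11))
    healthy = run_with_sync_points(data, n=4)
    faulty = run_with_sync_points(data, n=4, fault_at=5)
    print(first_divergence(healthy, faulty))
```

In this run the fault occurs at the sixth branch but is only noticed at the sync point after the eighth. A smaller `n` shortens that detection delay at the cost of more frequent comparisons.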
As in the earlier migration from stack machines to MIPS microprocessors, all customer software was carried forward without source changes. 'Native mode' source code compiled directly to MIPS machine code was simply recompiled for Itanium. Some older 'non native' software was still in TNS stack machine form. These were automatically ported onto Itanium via object code translation techniques.
Integrity NonStop continues to be HP's answer for the extreme scaling needs of its very largest customers. The NSK operating system, now termed NonStop OS, continues as the base software environment for the NonStop Servers, and has been extended to include support for Java and integration with popular development tools like Visual Studio and Eclipse.
NSK Guardian also became the base for the HP Neoview OS, the operating system used in the HP Neoview systems that are tailored for Business Intelligence and Enterprise Data Warehouse use. NonStop SQL was also the starting point for Neoview SQL, which has likewise been tailored to Business Intelligence use.
Tandem's weekly beer busts were the most visible sign that this company was a bit different. These had keg beer, wine, soft drinks, popcorn, unshelled peanuts, and good conversation. Besides being fun, they encouraged employees of different groups and levels to mix and learn what others in the company needed. Jimmy was always present and accessible.
Many fast-growing tech companies with rising stock prices award stock options to top employees. But Tandem was unique in also granting 100 shares every year to absolutely every employee, no matter how lowly. Similarly, every US employee was given a paid six-week sabbatical every four years, beyond generous regular vacation accruals.
Tandem experimented with new ways to keep the entire company aligned and feeling like a smaller company. This included monthly "First Friday" telecasts that were broadcast live worldwide over private satellite links. These were produced by an award-winning in-house Tandem TV staff. While generally educational about some aspect of the company, the programs usually featured some member of the senior management team in a humorous way.
As a side effect of its worldwide networking of all corporate computers, Tandem was a very early implementor of a worldwide corporate email system. This helped a lot. But the sociology of this new medium required some debugging. In the first release, the reply button defaulted to sending to all Tandem employees worldwide. Besides being annoying and embarrassing, this soon led to flame wars between people who had never met and those flames were difficult to extinguish. Subsequent releases fixed that and added support for nonbusiness mail, classifieds, and special interest groups.
Another distinctive employee program was TOPS (Tandem Outstanding PerformerS). This award was given to the top 5% of employees annually; any employee could be nominated. Winners and their guest were treated to an all-expenses paid trip to resort locations such as Hawaii or Vail for several days of fun and team building with top management.
Satisfying customers with reliable systems and topnotch field support was just as important as these internal programs.
One quirk of Tandem was that its customers invariably delayed placing their orders for new or expanded systems until the last weeks of their fiscal quarter. So Tandem mastered the art of doing all its manufacturing and testing in those final two weeks. Exciting for awhile, but this made it difficult to manage parts inventory. And it just encouraged customers and salesmen to continue waiting to the last minutes.
Jimmy was clear that a satisfying workplace required continued strong growth. The Philosophy class included a very complicated 8-page flowchart that attempted to show how every part and aspect of the company was driven by and helped drive rising revenues and stock prices. This all worked well up to Tandem's billion-dollar year. But eventually, Tandem's spectacular growth stagnated due to saturated markets, economic slowdowns, and the costs and limitations of the main product line.
As a nearly anonymous division within Compaq and now HP, Tandem's culture is now just history.
Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss. The company was founded in 1974 and remained independent until 1997. It is now a server division within Hewlett Packard.
Tandem's NonStop systems use a number of independent identical processors and redundant storage devices and controllers to provide automatic high-speed "failover" in the case of a hardware or software failure.
To contain the scope of failures and of corrupted data, these multi-computer systems have no shared central components, not even main memory. Conventional multi-computer systems all use shared memories and work directly on shared data objects. Instead, NonStop processors cooperate by exchanging messages across a reliable fabric, and software takes periodic snapshots for possible rollback of program memory state.
Besides handling failures well, this "shared-nothing" messaging system design also scales extremely well to the largest commercial workloads. Each doubling of the total number of processors would double system throughput, up to the maximum configuration of 4000 processors. In contrast, the performance of conventional multiprocessor systems is limited by the speed of some shared memory, bus, or switch. Adding more than 4–8 processors that way gives no further system speedup. NonStop systems have more often been bought to meet scaling requirements than for extreme fault tolerance. They compete well against IBM's largest mainframes, despite being built from simpler minicomputer technology.
Besides fault tolerance and scaling, NonStop machines also featured an industry-leading implementation of a SQL relational database, and industry-leading support for networking and for geographically dispersed systems.
Founding
Tandem Computers was founded in 1974 by James Treybig, a loquacious Texan. Treybig first saw the market need for fault tolerance in OLTP (online transaction processing) systems while running a marketing team for Hewlett Packard's HP 3000 computer division, but HP was not interested in developing for this niche. He then joined the venture capital firm Kleiner & Perkins and developed the Tandem business plan there. Treybig pulled together a core engineering team hired away from the HP 3000 division: Mike Green, Jim Katzman, and Jack Loustaunou. Their business plan called for ultra-reliable systems that never had outages and never lost or corrupted data. These were modular in a new way that was safe from all "single-point failures", yet would be only marginally more expensive than conventional non-fault-tolerant systems. These would be less expensive and support more throughput than some existing ad-hoc toughened systems that used redundant but usually wasted "hot spares".
Each engineer was confident they could quickly pull off their own part of this tricky new design, but doubted that others' areas could be worked out. Those parts of the hardware and software design that did not have to be different were largely based on incremental improvements to the familiar hardware and software designs of the HP 3000. Many subsequent engineers and programmers also came from HP. Tandem headquarters in Cupertino, California, were a quarter mile away from the HP 3000 offices. Initial venture-capital (VC) investment in Tandem Computers came from Tom Perkins, who was formerly a general manager of the HP 3000 division.
The business plan included detailed ideas for building a unique corporate culture reflecting Treybig's values.
The design of the initial Tandem/16 hardware was completed in 1975 and the first system shipped to Citibank in May 1976.
The company enjoyed uninterrupted exponential growth up through 1983. Inc. magazine ranked Tandem as the fastest growing public company in America.
TNS stack machines
Over 35 years, Tandem's main NonStop product line has grown and evolved in an upward-compatible way from the initial T/16 fault-tolerant system, with three major changes to date to its top-level modular architecture or its programming-level instruction set architecture. Within each series, there have been several major re-implementations as chip technology progressed.
While conventional systems of the era, including large mainframes, had mean-time-between-failures (MTBF) on the order of a few days, the NonStop system was designed for failure intervals 100 times longer, with uptimes measured in years. Nevertheless, the NonStop was designed to be price-competitive with conventional systems, with a simple 2-CPU system priced at just over twice that of a competing single-processor mainframe, as opposed to four or more times for other fault-tolerant solutions.
The first system was the Tandem/16 or T/16, later re-branded NonStop I. The machine consisted of between two and 16 CPUs, organized as a fault-tolerant computer cluster packaged in a single rack. Each CPU had its own private, unshared memory, its own I/O processor, its own private I/O bus to connect to I/O controllers, and dual connections to all the other CPUs over a custom inter-CPU backplane bus called Dynabus. Each disk controller or network controller was duplicated and had dual connections to both CPUs and devices. Each disk was mirrored, with separate connections to two independent disk controllers. If a disk failed, its data was still available from its mirrored copy. If a CPU or controller or bus failed, the disk was still reachable through an alternative CPU, controller, and/or bus. Each disk or network controller was connected to two independent CPUs. Power supplies were each wired to only one side of some pair of CPUs, controllers, or buses, so that the system would keep running well without loss of connections if one power supply failed. The careful complex arrangement of parts and connections in customers' larger configurations was documented in a Mackie diagram, named after lead salesman David Mackie who invented the notation.
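The redundant wiring described above can be illustrated with a minimal sketch. The component names (cpu0, ctrlA, and so on) are hypothetical and chosen only for illustration; this is not Tandem's notation or a Mackie diagram. It models one mirrored volume reached through two controllers and two CPUs, and checks that the data stays reachable after any single component fails.

```python
from itertools import product

# Hypothetical component names for one mirrored volume; not Tandem's notation.
CPUS = ["cpu0", "cpu1"]
CONTROLLERS = ["ctrlA", "ctrlB"]
DISKS = ["disk_primary", "disk_mirror"]

# Assumption: every CPU is wired to both controllers, and every controller
# is wired to both copies of the mirrored disk.
PATHS = list(product(CPUS, CONTROLLERS, DISKS))


def data_reachable(failed):
    """Return True if some CPU-controller-disk path avoids every failed part."""
    failed = set(failed)
    return any(not (set(path) & failed) for path in PATHS)


def survives_any_single_failure():
    """Fail each component in turn and confirm the data is still reachable."""
    parts = CPUS + CONTROLLERS + DISKS
    return all(data_reachable([part]) for part in parts)
```

In this model, losing one CPU, one controller, and one disk copy at the same time still leaves a working path, while losing both controllers does not.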
None of these duplicated parts were wasted "hot spares"; everything added to system throughput during normal operations.
Besides recovering well from failed parts, the T/16 was also designed to detect as many kinds of intermittent failures as possible, as soon as possible. This prompt detection is called "fail fast". The point was to find and isolate corrupted data before it was permanently written into databases and other disk files. In the T/16, error detection was by some added custom circuits that added little cost to the total design; no major parts were duplicated just to get error detection.
The T/16 CPU was a proprietary design. It was greatly influenced by the HP 3000 minicomputer. They were both microprogrammed, 16-bit, stack-based machines with segmented, 16-bit virtual addressing. Both were intended to be programmed exclusively in high-level languages, with no use of assembler. Both were initially implemented via standard low-density TTL chips, each holding a 4-bit slice of the 16-bit ALU. Both had a small number of top-of-stack, 16-bit data registers plus some extra address registers for accessing the memory stack. Both used Huffman encoding of operand address offsets, to fit a large variety of address modes and offset sizes into the 16-bit instruction format with very good code density. Both relied heavily on pools of indirect addresses to overcome the short instruction format. Both supported larger 32- and 64-bit operands via multiple ALU cycles, and memory-to-memory string operations. Both used "big-endian" addressing of long versus short memory operands. These features had all been inspired by Burroughs B5500-B6800 mainframe stack machines.
The T/16 instruction set changed several features from the HP 3000 design. The T/16 supported paged virtual memory from the beginning. The HP 3000 series did not add paging until the PA-RISC generation, 10 years later. Tandem added support for 32-bit addressing in its second machine; HP 3000 lacked this until its PA-RISC generation. Paging and long addresses were critical for supporting complex system software and large applications. The T/16 treated its top-of-stack registers in a novel way; the compiler, not the microcode, was responsible for deciding when full registers were spilled to the memory stack and when empty registers were re-filled from the memory stack. On the HP 3000, this decision took extra microcode cycles in every instruction. The HP 3000 supported COBOL with several instructions for calculating directly on arbitrary-length BCD (binary-coded decimal) strings of digits. The T/16 simplified this to single instructions for converting between BCD strings and 64-bit binary integers.
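The BCD conversion idea can be sketched in a few lines. This is an illustration only, not the T/16 instruction semantics: it assumes unsigned packed BCD (two decimal digits per byte), and the function names are hypothetical.

```python
def bcd_to_int(bcd: bytes) -> int:
    """Convert unsigned packed BCD (two digits per byte) to a binary integer."""
    value = 0
    for byte in bcd:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError("invalid BCD digit")
        value = value * 100 + high * 10 + low
    return value


def int_to_bcd(value: int, num_bytes: int) -> bytes:
    """Convert a non-negative integer that fits in 64 signed bits to packed BCD."""
    if not 0 <= value < 2**63:
        raise OverflowError("value does not fit in a signed 64-bit integer")
    out = bytearray(num_bytes)
    # Fill from the least significant byte, two decimal digits at a time.
    for i in range(num_bytes - 1, -1, -1):
        value, pair = divmod(value, 100)
        out[i] = ((pair // 10) << 4) | (pair % 10)
    if value:
        raise OverflowError("value needs more BCD bytes than requested")
    return bytes(out)
```

Once the digits are in binary form, ordinary integer arithmetic can be used, and the result is converted back to BCD only when it has to be stored or displayed as decimal digits.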
In the T/16, each CPU consisted of two boards of TTL logic and SRAMs, and ran at about 0.7 MIPS.
At any instant, it could access only four virtual memory segments (System Data, System Code, User Data, User Code), each limited to 128 kB in size. The 16-bit address spaces were already too small for major applications when it shipped.
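The 128 kB segment limit follows from simple arithmetic if one assumes that a 16-bit address selects a 16-bit word rather than a byte. That word-addressing assumption is not stated in the text above, so the following is a hedged worked example, not a specification.

```python
ADDRESS_BITS = 16    # width of a segment-relative address
BYTES_PER_WORD = 2   # assumption: each address names one 16-bit word
SEGMENTS = 4         # System Data, System Code, User Data, User Code

# 2**16 words of 2 bytes each gives the per-segment limit.
segment_bytes = (1 << ADDRESS_BITS) * BYTES_PER_WORD

# Memory directly addressable at any instant across the four segments.
total_bytes = SEGMENTS * segment_bytes
```

Under that assumption each segment holds 131,072 bytes (128 kB), and the four segments together cover 512 kB.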
The first release of T/16 had only a single programming language, Tandem Application Language (TAL). This was an efficient machine-dependent systems programming language (for operating systems, compilers, etc.) but could also be used for non-portable applications. It was derived from HP 3000's System Programming Language (SPL). Both had semantics similar to C but a syntax based on Burroughs' ALGOL. Subsequent releases added support for Cobol74, Fortran, and MUMPS.
The Tandem NonStop series ran a custom operating system which was significantly different from Unix or HP 3000's MPE. It was initially called T/TOS (Tandem Transactional Operating System) but soon named Guardian for its ability to protect all data from machine faults or software faults. In contrast to all other commercial operating systems, Guardian was based on message passing as the basic way for all processes to interact, without shared memory, regardless of where the processes were running.
This approach easily scaled to multiple-computer clusters and helped isolate corrupted data before it propagated.
All file system processes and all transactional application processes were structured as master/slave pairs of processes running in separate CPUs. The slave process periodically took snapshots of the master's memory state, and took over the workload if and when the master process ran into trouble. This allowed the application to survive failures in any CPU or its associated devices, without data loss. It further allowed recovery from some intermittent-style software failures. Between failures, the monitoring by the slave process added some performance overhead but this was far less than the 100% duplication in other system designs. Some major early applications were directly coded in this checkpoint style, but most instead used various Tandem software layers which hid the details of this in a semi-portable way.
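The checkpoint-and-takeover pattern described above can be sketched as a small simulation. This is not Guardian's API; the function names, the state layout, and the checkpoint interval are all hypothetical, and the sketch assumes that requests received since the last snapshot can safely be replayed by the surviving process.

```python
import copy


def apply_request(state, amount):
    """Apply one request to a process's private state."""
    state["balance"] += amount
    state["applied"] += 1


def run_process_pair(requests, fail_after=None, checkpoint_every=2):
    """Process requests on a primary; a backup holds periodic snapshots.

    If fail_after is given, the primary's CPU is treated as failing just
    before it handles the request at that index, and the backup takes over.
    """
    primary = {"balance": 0, "applied": 0}
    backup_snapshot = copy.deepcopy(primary)
    for index, amount in enumerate(requests):
        if fail_after is not None and index == fail_after:
            # The primary's memory is lost; resume from the last snapshot
            # and replay every request that snapshot had not yet seen.
            survivor = copy.deepcopy(backup_snapshot)
            for replayed in requests[survivor["applied"]:]:
                apply_request(survivor, replayed)
            return survivor
        apply_request(primary, amount)
        if primary["applied"] % checkpoint_every == 0:
            backup_snapshot = copy.deepcopy(primary)
    return primary
```

With or without a simulated failure, the final state is the same, which is the property the pair is meant to provide. The snapshot copying is the overhead paid between failures.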
In 1981, all T/16 CPUs were replaced by the NonStop II. Its main difference from the T/16 was support for occasional 32-bit addressing via a user-switchable "extended data segment". This supported the next ten years of growth in software and was a huge advantage over the T/16 or HP 3000. Unfortunately, visible registers remained 16-bit, and this unplanned addition to the instruction set required executing many instructions per memory reference compared to most 32-bit minicomputers. All subsequent TNS computers were hampered by this instruction set inefficiency. Also, the NonStop II lacked wider internal data paths and so required additional microcode steps for 32-bit addresses. A NonStop II CPU had 3 boards, using chips and design similar to the T/16. The NonStop II also replaced core memory with battery-backed DRAM memory.
In 1983, the NonStop TXP CPU was the first entirely new implementation of the TNS instruction set architecture.
It was built from standard TTL chips and Programmable Array Logic chips, with 4 boards per CPU module. It had Tandem's first use of cache memory. It had a more direct implementation of 32-bit addressing, but still sent addresses through 16-bit adders. A wider microcode store allowed a major reduction in the cycles executed per instruction; speed increased to 2.0 MIPS. It used the same rack packaging, controllers, backplane, and buses as before. The Dynabus and I/O buses had been overdesigned in the T/16 so they would work for several generations of upgrades.
Up to 14 TXP and NonStop II systems could now be combined via FOX, a long-distance fault-tolerant fibre optic bus for connecting TNS clusters across a business campus; a cluster of clusters with a total of 224 CPUs (14 systems of up to 16 CPUs each). This allowed further scale-up for taking on the largest mainframe applications.
Like the CPU modules within the computers, Guardian could fail over entire task sets to other machines in the network. World-wide clusters of 4000 CPUs could also be built via conventional long-haul network links.
In 1986, Tandem introduced a third generation CPU, the NonStop VLX.
It had 32-bit datapaths, wider microcode, 12 MHz cycle time, and a peak rate of one instruction per microcycle. It was built from 3 boards of ECL gate array chips (with TTL pinout). It had a revised Dynabus with speed raised to 20 Mbytes/sec per link, 40 Mbytes/sec total. FOX II increased the physical diameter of TNS clusters to 4 km.
Tandem's initial database support was only for hierarchical, non-relational databases via the ENSCRIBE file system. This was extended into a relational database called ENCOMPASS.
In 1986 Tandem introduced the first fault-tolerant SQL database, NonStop SQL.
Developed totally in-house, NonStop SQL includes a number of features based on Guardian to ensure data validity across nodes. NonStop SQL is famous for scaling linearly in performance with the number of nodes added to the system, whereas most databases had performance that plateaued quite quickly, often after just two CPUs. A later version released in 1989 added transactions that could be spread over nodes, a feature that remained unique for some time. Later, the SQL database group was first co-opted and then absorbed into Microsoft's SQL development effort. One outcome of this collaboration was Microsoft's clustered system technology.
In 1987 Tandem introduced the NonStop CLX, a low-cost less-expandable minicomputer system.
Its role was for growing the low end of the fault-tolerant market, and for deploying on the remote edges of large Tandem networks. Its initial performance was roughly similar to the TXP; later versions were about 20% slower than a VLX. Its small cabinet could be installed into any "copier room" office environment. A CLX CPU was one board, containing six "compiled silicon" ASIC CMOS chips. The CPU core chip was duplicated and lock stepped for maximal error detection. Pinout was a main limitation of this chip technology. Microcode, cache, and TLB were all external to the CPU core and shared a single bus and single SRAM memory bank. As a result, CLX required at least 2 machine cycles per instruction.
In 1989 Tandem introduced the NonStop Cyclone, a fast but expensive system for the mainframe end of the market.
Each self-checking CPU took 3 boards full of hot-running ECL gate array chips, plus memory boards. Despite being microprogrammed, the CPU was superscalar, often completing two instructions per cache cycle. This was accomplished by having a separate microcode routine for every common pair of instructions.
That fused pair of stack instructions generally accomplished the same work as a single instruction of normal 32-bit minicomputers. Cyclone processors were packaged as sections of 4 CPUs each, and the sections joined by a fiber optic version of Dynabus.
Like Tandem's prior high end machines, Cyclone cabinets were styled with lots of angular black to suggest strength and power. Advertising videos directly compared Cyclone to the SR-71 Blackbird Mach-3 spy plane. Cyclone's name was supposed to represent its unstoppable speed in roaring through OLTP workloads. Announcement day was Oct 17 and the press came to town. That afternoon, the region was struck by the 6.9 Loma Prieta earthquake, causing freeway collapses in Oakland and major fires in San Francisco. Tandem offices were shaken but no one was badly hurt on site. This was the first and last time that Tandem named its products after a natural disaster.
Other product lines
In 1980-1983, Tandem attempted to re-design its entire hardware and software stack, in a project called Rainbow, to put its NonStop methods on a stronger foundation than its inherited HP 3000 traits. Rainbow's hardware was a 32-bit register-file machine that aimed to be better than a VAX. For reliable programming, the main programming language was "TPL", a subset of Ada. At that time, people barely understood how to compile Ada to unoptimized code. There was no migration path for existing NonStop system software coded in TAL. The OS and database and Cobol compilers were entirely redesigned. Customers would see it as a totally disjoint product line requiring all-new software from them. The software side of this ambitious project took much longer than planned. The hardware was already obsolete and out-performed by TXP before its software was ready, so the Rainbow project was abandoned. All subsequent efforts emphasized upward compatibility and easy migration paths.
Development of Rainbow's advanced client/server application development framework called "Crystal" continued a while longer and was spun off as the "Ellipse" product of Cooperative Systems Inc.
In 1985, Tandem attempted to grab a piece of the rapidly growing personal computer market with its introduction of the MS-DOS based Dynamite PC/workstation. Sadly, numerous design compromises (including a unique 8086-based hardware platform incompatible with expansion cards of the day and extremely limited compatibility with IBM-based PCs) relegated the Dynamite to serving primarily as a smart terminal. It was quietly and quickly withdrawn from the market.
Tandem's message-based NonStop operating system had advantages for scaling, extreme reliability, and efficiently using expensive "spare" resources. But many potential customers wanted just good-enough reliability in a small system, using a familiar Unix operating system and industry-standard programs. Tandem's various fault-tolerant competitors all adopted a simpler hardware-only memory-centric design where all recovery was done by switching between hot spares. The most successful competitor was Stratus Technologies, whose machines were re-marketed by IBM as "IBM System/88".
In such systems, the spare processors do not contribute to system throughput between failures, but merely redundantly execute exactly the same data thread as the active processor at the same instant, in "lock step". Faults are detected by seeing when the cloned processors' outputs diverged. To detect failures, the system must have 2 physical processors for each logical, active processor. To also implement automatic failover recovery, the system must have 3 or 4 physical processors for each logical processor. The 3x-4x cost of this sparing is practical when the duplicated parts are commodity single-chip microprocessors.
Tandem's products for this market began with the Integrity line in 1989, using MIPS processors and a "NonStop UX" variant of Unix. It was developed in Austin, Texas. In 1991, the Integrity S2 used TMR, Triple Modular Redundancy, where each logical CPU used three MIPS R2000 microprocessors to execute the same data thread, with voting to find and lock out a failed part. Their fast clocks could not be synchronized as in strict lock stepping, so voting instead happened at each interrupt.
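The voting step in triple modular redundancy can be illustrated with a minimal sketch. It compares one set of module outputs and reports which modules disagreed with the majority; it does not model the Integrity S2's interrupt-time voting, and the function name is hypothetical.

```python
from collections import Counter


def vote(outputs):
    """Return (majority_value, indices_of_dissenting_modules).

    Raises RuntimeError when no value is produced by a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among module outputs")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

With three modules, a single faulty module is outvoted and identified, so it can be locked out while the other two keep running.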
Another version of Integrity used 4x "pair and spare" redundancy. Pairs of processors ran in lock-step to check each other. When they disagreed, both processors were marked untrusted and their workload was taken over by a hot-spare pair of processors whose state was already current. In 1995, the Integrity S4000 was the first to use ServerNet and moved toward sharing peripherals with the NonStop line.
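The pair-and-spare scheme can be sketched as follows. This is a simplified illustration with hypothetical names: each "processor" is modelled as a plain function, and the spare pair is evaluated only when needed, whereas in the hardware described above the spare pair runs continuously so that its state is already current.

```python
def run_pair(cpu_a, cpu_b, value):
    """Run both halves of a lock-stepped pair; return (agreed, result)."""
    a, b = cpu_a(value), cpu_b(value)
    return a == b, a


def pair_and_spare(active_pair, spare_pair, value):
    """Return (result, which_pair_supplied_it).

    If the active pair disagrees, both of its processors are distrusted
    and the result is taken from the spare pair instead.
    """
    agreed, result = run_pair(active_pair[0], active_pair[1], value)
    if agreed:
        return result, "active"
    agreed, result = run_pair(spare_pair[0], spare_pair[1], value)
    if not agreed:
        raise RuntimeError("both pairs disagree internally")
    return result, "spare"
```

A disagreement inside a pair shows that something is wrong but not which half is at fault, which is why the whole pair is retired and four processors are needed per logical CPU.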
In 1995-1997, Tandem partnered with Microsoft to implement high-availability features and advanced SQL configurations in clusters of commodity Windows NT machines. This project was called "Wolfpack" and first shipped as Microsoft Cluster Server in 1997. Microsoft benefited greatly from this partnership; Tandem did not.
TNS/R NonStop migration to MIPS
When Tandem was formed in 1974, every computer company had to design and build its CPUs from basic circuits, using its own proprietary instruction set and own compilers etc. With each year of semiconductor progress with Moore's Law, more of a CPU's core circuits could fit into single chips, and run faster and much cheaper as a result. But it became increasingly expensive for a computer company to design those advanced custom chips, or build the plants to fabricate the chips. By 1991, only the very biggest companies could continue to design and build their own competitive CPUs. Tandem was not big enough for that, so it needed to move its NonStop product line and customer base onto some advanced microprocessor chip set designed and built by others.
HP's HP 3000 MPE division had similar roadmap problems but found a clever way forward in 1986. HP Labs designed a RISC computer core which was stripped of all non-essentials so it could soon fit into one chip. And it was efficiently pipelined and ran even faster than the ECL mainframes of that time.
It was many times faster than the microprogrammed CMOS stack machines that the rest of HP was then designing.
But how to migrate all the vendor, customer, and third-party software for those existing product lines? Some software was portable and could be directly recompiled for the new instruction set. Other software was not easily recompiled as is. HP Labs invented efficient ways to run the old binaries of that software on the new machine, by emulation and by automatic translation of binary object code. And they told everyone how they did it. Similar object code translation techniques were subsequently used by Apple Computer, to move Macintosh software from Motorola 68000 machines to PowerPC machines, and by Digital Equipment Corporation, to move VMS users from VAX to Alpha machines.
One flaw in the HP 3000 migration plan was that HP also ambitiously tried to rewrite the entire MPE operating system in a new language at the same time. They didn't plan to use the same emulation techniques on their own primary code. But their rewrite to native mode took years longer to complete than expected. HP's first-generation RISC hardware was already obsolete before its MPE software was ready to release. Tandem learned from this mistake.
Tandem could not use HP's PA-RISC or Sun's SPARC CPUs, for business reasons. Instead, Tandem partnered with
MIPS
and adopted its R3000
and successor chipsets and their advanced optimizing compiler. Subsequent NonStop Guardian machines using the MIPS instruction set were known to programmers as TNS/R machines, but had a variety of marketing names.
In 1991, Tandem released the Cyclone/R, also known as CLX/R. This was a low-cost mid-range system based on CLX components, but it used R3000 microprocessors instead of the much slower CLX stack machine board. To minimize time to market, this machine was initially shipped without any MIPS native-mode software. Everything, including its NSK operating system and SQL database, was compiled to TNS stack machine code. That object code was then translated to equivalent, partially optimized MIPS instruction sequences at kernel install time by a tool called the Accelerator.
Less-important programs could also be executed directly without pre-translation, via a TNS code interpreter. These migration techniques were very successful and are still in use today. Everyone's software was brought over without extra work, and the performance was good enough for mid-range machines. Programmers could ignore the instruction differences, even when debugging at machine code level. These Cyclone/R machines were updated with a faster native-mode NSK in a follow-up release.
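The two migration paths described above, ahead-of-time object code translation and direct interpretation, can be sketched on a toy scale. Everything in this sketch is an assumption made for illustration: the five-instruction stack code (`PUSH`, `LOAD`, `ADD`, `MUL`, `STORE`) is not the TNS instruction set, the register mnemonics are only loosely MIPS-flavored, and the real Accelerator was far more sophisticated. The idea shown is that a stack slot at a statically known depth can be mapped to a fixed register, so the translated code needs no run-time stack.

```python
def interpret(stack_code, mem):
    """Path 1: execute the stack code directly, one instruction at a time."""
    mem, stack = dict(mem), []
    for op, *arg in stack_code:
        if op == "PUSH":
            stack.append(arg[0])
        elif op == "LOAD":
            stack.append(mem[arg[0]])
        elif op in ("ADD", "MUL"):
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if op == "ADD" else a * b)
        elif op == "STORE":
            mem[arg[0]] = stack.pop()
        else:
            raise ValueError(f"unknown stack op {op}")
    return mem


def translate(stack_code):
    """Path 2: translate once, mapping stack slot i to register r<i>."""
    out, depth = [], 0
    for op, *arg in stack_code:
        if op == "PUSH":
            out.append(("li", f"r{depth}", arg[0]))
            depth += 1
        elif op == "LOAD":
            out.append(("lw", f"r{depth}", arg[0]))
            depth += 1
        elif op in ("ADD", "MUL"):
            depth -= 1  # two operands are consumed, one result is produced
            dst, src = f"r{depth - 1}", f"r{depth}"
            out.append((op.lower(), dst, dst, src))
        elif op == "STORE":
            depth -= 1
            out.append(("sw", f"r{depth}", arg[0]))
        else:
            raise ValueError(f"unknown stack op {op}")
    return out


def run_native(reg_code, mem):
    """Execute the translated register code on a toy register machine."""
    mem, regs = dict(mem), {}
    for ins in reg_code:
        op = ins[0]
        if op == "li":
            regs[ins[1]] = ins[2]
        elif op == "lw":
            regs[ins[1]] = mem[ins[2]]
        elif op == "add":
            regs[ins[1]] = regs[ins[2]] + regs[ins[3]]
        elif op == "mul":
            regs[ins[1]] = regs[ins[2]] * regs[ins[3]]
        elif op == "sw":
            mem[ins[2]] = regs[ins[1]]
    return mem


# y = (x + 3) * 2, expressed in the toy stack code
program = [("LOAD", "x"), ("PUSH", 3), ("ADD",), ("PUSH", 2), ("MUL",), ("STORE", "y")]
native = translate(program)
```

Both paths produce the same memory contents for the same program, which is the property that let existing binaries move to new hardware without source changes.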
The R3000 and later microprocessors had only a typical amount of internal error checking, insufficient for Tandem's needs. So the Cyclone/R ran pairs of R3000 processors in lock step, running the same data thread. It used a curious variation of lock stepping. The checker processor ran 1 cycle behind the primary processor. This allowed them to share a single copy of external code and data caches without putting excessive pinout load on the sysbus and lowering the system clock rate. To successfully run microprocessors in lock step, the chips must be designed to be fully deterministic. Any hidden internal state must be cleared by the chip's reset mechanism. Otherwise, the matched chips will sometimes get out of sync for no visible reason and without any faults, long after the chips are restarted. All chip designers agree that these are good principles because they help them test chips at manufacturing time. But all new microprocessor chips seemed to have bugs in this area, and it took months of shared work between MIPS and Tandem to eliminate or work around the final subtle bugs.
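The delayed form of lock stepping can be sketched as follows. This is a software analogy under stated assumptions, not a model of the Cyclone/R hardware: the function `delayed_lockstep` and the two toy processors are hypothetical, and "one cycle" here is simply one item of an input sequence. The primary's output for each cycle is held in a short queue until the checker, running `delay` cycles behind, produces its own output for that same cycle.

```python
from collections import deque


def delayed_lockstep(primary, checker, inputs, delay=1):
    """Return the first cycle whose outputs disagree, or None if all match."""
    pending = deque()  # (cycle, input, primary output) awaiting the checker
    for cycle, value in enumerate(inputs):
        pending.append((cycle, value, primary(value)))
        if len(pending) > delay:
            old_cycle, old_value, expected = pending.popleft()
            if checker(old_value) != expected:
                return old_cycle
    # The checker finishes the trailing cycles after the primary has stopped.
    while pending:
        old_cycle, old_value, expected = pending.popleft()
        if checker(old_value) != expected:
            return old_cycle
    return None


def healthy(x):
    """Hypothetical deterministic processor."""
    return x * 2


def faulty(x):
    """Hypothetical processor with a fault that shows up on one input."""
    return 11 if x == 5 else x * 2


no_fault = delayed_lockstep(healthy, healthy, range(8))
first_bad = delayed_lockstep(healthy, faulty, range(8))
```

The comparison only works if both processors are fully deterministic, which is why hidden internal state that survives reset caused so much trouble.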
In 1993, Tandem released the NonStop Himalaya K-series with the faster MIPS R4400, a native mode NSK, and fully expandable Cyclone system components. These were still connected by Dynabus, Dynabus+, and the original I/O bus, which by now were all running out of performance headroom.
In 1994, the NonStop Kernel was extended with a Unix-like POSIX
environment called Open System Services. The original Guardian shell and ABI remained available.
In 1997 Tandem introduced the NonStop Himalaya S-Series with a new top-level system architecture based on ServerNet
connections. ServerNet replaced the obsolete Dynabus, FOX, and I/O buses. It was much faster, more general, and could be extended to more than just two-way redundancy via an arbitrary fabric of point-to-point connections.
Tandem designed ServerNet for its own needs but then promoted its use by others; it evolved into the InfiniBand
industry standard.
All S-Series machines used MIPS processors, including the R4400, R10000
, R12000, and R14000.
The design of the later, faster MIPS cores was primarily funded by Silicon Graphics Inc. But Intel's Pentium Pro overtook the performance of RISC designs, and SGI's graphics business shrank. After the R10000, there was no investment in significant new MIPS core designs for high-end servers. So Tandem eventually needed to move its NonStop product line yet again onto some other microprocessor architecture with competitively fast chips.
Acquisition by Compaq, attempted migration to Alpha
Jimmy Treybig remained CEO and energetic center of the company he founded until a downturn in 1996.
Compaq
's x86-based server division was an early outside adopter of Tandem's ServerNet/InfiniBand interconnect technology. In 1997, Compaq acquired Tandem Computers and its NonStop customer base to balance Compaq's heavy focus on low-end PCs. In 1998, Compaq also acquired the much larger Digital Equipment Corporation
and inherited its DEC Alpha
RISC servers with OpenVMS
and Tru64 Unix
customer bases. Tandem was then midway in porting its NonStop product line from MIPS R12000 microprocessors to Intel's new Itanium
Merced microprocessors. This project was restarted with Alpha as the new target to align NonStop with Compaq's other large server lines. But in 2001, Compaq terminated all Alpha engineering investments in favor of the unproven Itanium microprocessors. The Alpha version of NonStop died before shipping. So the NonStop migration project was restarted yet again, targeting Itanium McKinley.
The combined companies' PC-centric sales force did not understand how to sell large complex systems to large enterprises. A single sale takes many months of proposals and education rather than a single negotiation.
Acquisition by Hewlett Packard, TNS/E migration to Itanium
In 2001, Hewlett Packard similarly made the choice to abandon its successful PA-RISC
product lines in favor of Intel's Itanium microprocessors that HP helped to design. Shortly afterwards, Compaq and HP announced their plan to merge, and consolidate their similar product lines. This contentious merger became official in May 2002. The consolidations were painful and destroyed the DEC and 'HP Way' engineer-oriented cultures. But the combined company did know how to sell complex systems to enterprises and profit, so it was an improvement for the surviving NonStop division and its customers.
In some ways, Tandem's journey from HP-inspired startup, to an HP-inspired competitor, then to an HP division was "bringing Tandem back to its original roots." But this was definitely not the same HP.
The re-port of the NSK-based NonStop product line from MIPS processors to Itanium-based processors was finally completed and is branded as 'HP Integrity NonStop Servers'. (This NSK Integrity NonStop is unrelated to Tandem's original 'Integrity' series for Unix.)
It was not possible to run Itanium McKinley chips with clock-level lock stepping. So the Integrity NonStop machines instead use comparisons between chip states at longer time scales, at interrupt points and at various software sync points in between interrupts. The intermediate sync points are automatically triggered at every N'th taken branch instruction, and are also explicitly inserted into long loop bodies by all NonStop compilers. The machine design supports both dual and triple redundancy, with either 2 or 3 physical microprocessors per logical Itanium processor. The triple version is sold to customers needing the utmost reliability. This new checking approach is called NSAA, NonStop Advanced Architecture.
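The looser checking described above can be sketched in two parts: deciding when a sync point is due, and comparing replica states once it is reached. This is an illustrative model only. The functions `sync_points` and `vote`, the boolean branch trace, and the integer "state" are assumptions for the sketch and do not reflect the actual NSAA hardware or software interfaces. In particular, when two replicas disagree this sketch only reports the mismatch and makes no attempt to decide which one is at fault.

```python
from collections import Counter


def sync_points(branch_trace, n):
    """Yield indices in the trace where a sync is due: every n-th taken branch."""
    taken = 0
    for i, was_taken in enumerate(branch_trace):
        if was_taken:
            taken += 1
            if taken % n == 0:
                yield i


def vote(states):
    """Compare replica states at a sync point.

    Returns (agreed_state, suspects). With three replicas a single outlier
    is outvoted and named. With two replicas a mismatch is detected, but
    comparison alone cannot identify the faulty one, so agreed_state is
    None and both are listed as suspects.
    """
    state, count = Counter(states).most_common(1)[0]
    if count == len(states):
        return state, []
    if count > len(states) // 2:
        return state, [i for i, s in enumerate(states) if s != state]
    return None, list(range(len(states)))


due = list(sync_points([True, False, True, True, False, True], n=2))
triple = vote((7, 9, 7))  # triple redundancy: replica 1 is outvoted
dual = vote((7, 9))       # dual redundancy: mismatch detected only
```

The difference between the two results is one way to see why a triple configuration suits customers who need the highest reliability: a single faulty replica can be outvoted while the other two carry on.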
As in the earlier migration from stack machines to MIPS microprocessors, all customer software was carried forward without source changes. 'Native mode' source code compiled directly to MIPS machine code was simply recompiled for Itanium. Some older 'non-native' software was still in TNS stack machine form. This code was automatically ported onto Itanium via object code translation techniques.
Integrity NonStop continues to be HP's answer for the extreme scaling needs of its very largest customers. The NSK operating system, now termed NonStop OS, continues as the base software environment for the NonStop Servers, and has been extended to include support for Java
and integration with popular development tools like Visual Studio
and Eclipse
.
NSK Guardian also became the base for the HP Neoview OS, the operating system used in the HP Neoview systems that are tailored for Business Intelligence and Enterprise Data Warehouse workloads. NonStop SQL was also the starting point for Neoview SQL, which has been tailored to Business Intelligence use.
Culture
Tandem was famous for its unique Silicon Valley company culture. Some of this was inherited from the 'HP Way' created by Bill Hewlett and Dave Packard. Some of it was formed in founder Treybig's business plan. And some just reflected Jimmy's egalitarian values and Texan personality. The goals of that culture were formalized in a 'Tandem Philosophy' class for new employees in 1980 and refreshed in 1983:
- An employee-oriented workplace.
- Sustained profitability.
- High customer satisfaction.
Tandem's weekly beer busts were the most visible sign that this company was a bit different. These had keg beer, wine, soft drinks, popcorn, unshelled peanuts, and good conversation. Besides being fun, they encouraged employees of different groups and levels to mix and learn what others in the company needed. Jimmy was always present and accessible.
Many fast-growing tech companies with rising stock prices award stock options to top employees. But Tandem was unique in also granting 100 shares every year to absolutely every employee, no matter how lowly. Similarly, every US employee was given a paid six-week sabbatical every four years, beyond generous regular vacation accruals.
Tandem experimented with new ways to keep the entire company aligned and feeling like a smaller company. This included monthly "First Friday" telecasts that were broadcast live worldwide over private satellite links. These were produced by an award-winning in-house Tandem TV staff. While generally educational about some aspect of the company, the programs usually featured some member of the senior management team in a humorous way.
As a side effect of its worldwide networking of all corporate computers, Tandem was a very early implementor of a worldwide corporate email system. This helped a lot. But the sociology of this new medium required some debugging. In the first release, the reply button defaulted to sending to all Tandem employees worldwide. Besides being annoying and embarrassing, this soon led to flame wars between people who had never met and those flames were difficult to extinguish. Subsequent releases fixed that and added support for nonbusiness mail, classifieds, and special interest groups.
Another distinctive employee program was TOPS (Tandem Outstanding PerformerS). This award was given to the top 5% of employees annually; any employee could be nominated. Each winner and a guest were treated to an all-expenses-paid trip to a resort location such as Hawaii or Vail for several days of fun and team building with top management.
Satisfying customers with reliable systems and topnotch field support was just as important as these internal programs.
One quirk of Tandem was that its customers invariably delayed placing their orders for new or expanded systems until the last weeks of their fiscal quarter. So Tandem mastered the art of doing all its manufacturing and testing in those final two weeks. This was exciting for a while, but it made parts inventory difficult to manage. And it just encouraged customers and salesmen to continue waiting until the last minute.
Jimmy was clear that a satisfying workplace required continued strong growth. The Philosophy class included a very complicated 8-page flowchart that attempted to show how every part and aspect of the company was driven by, and helped drive, rising revenues and stock prices. This all worked well up to Tandem's billion-dollar year. But eventually, Tandem's spectacular growth stagnated due to saturated markets, economic slowdowns, and the costs and limitations of the main product line.
As a nearly anonymous division within Compaq and now HP, Tandem's culture is now just history.
User groups
- ITUG (International Tandem User Group) now part of Connect (users group)
External links
- NonStop Computing Home - the main NonStop Computing page at HP
- Tandem Technical Reports - a page at HP with a number of Tandem white papers
- Tandem Systems Review PDFs 1983-1994