TCP Offload Engine
TCP offload engine or TOE is a technology used in network interface cards (NIC) to offload processing of the entire TCP/IP stack to the network controller. It is primarily used with high-speed network interfaces, such as gigabit Ethernet and 10 Gigabit Ethernet, where processing overhead of the network stack becomes significant.
The term TOE is often used to refer to the NIC itself, although circuit board engineers may use it to refer only to the integrated circuit included on the card which processes the TCP headers. TOEs are often suggested as a way to reduce the overhead associated with IP storage protocols such as iSCSI and NFS.
Purpose
TCP was originally designed for unreliable, low-speed networks (such as early dial-up modems). With the growth of the Internet in terms of backbone transmission speeds (Optical Carrier, gigabit Ethernet and 10 Gigabit Ethernet links) and faster, more reliable access mechanisms (such as digital subscriber line and cable modems), it is now frequently used in datacenters and desktop PC environments at speeds over 1 gigabit per second. TCP software implementations on host systems require extensive computing power. Full-duplex gigabit TCP communication using software processing alone is enough to consume more than 80% of a 2.4 GHz Pentium 4 processor (see Freed Up CPU Cycles), leaving little or no processing resources for the applications running on the system.
Because TCP is a connection-oriented protocol, it carries additional complexity and processing overhead. These aspects include:
- Connection establishment using the "3-way handshake" (SYNchronize; SYNchronize-ACKnowledge; ACKnowledge).
- Acknowledgment of packets as they are received by the far end, adding to the message flow between the endpoints and thus the protocol load.
- Checksum and sequence number calculations, again a burden on a general purpose CPU to perform.
- Sliding window calculations for packet acknowledgement and congestion control.
- Connection termination.
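To give a sense of the per-byte work behind the checksum item in the list above, here is a minimal Python sketch of the 16-bit one's-complement Internet checksum that TCP uses. It is a simplification: a real TCP checksum also covers a pseudo-header built from the IP addresses, and the function name is chosen here for illustration only.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum of data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Add each big-endian 16-bit word to the running sum
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    # The checksum is the one's complement of the folded sum
    return ~total & 0xFFFF


if __name__ == "__main__":
    sample = bytes.fromhex("0001f203f4f5f6f7")
    print(hex(internet_checksum(sample)))
```

Every payload byte passes through this loop on both send and receive when the checksum is computed in software, which is why checksum calculation is among the first functions moved to the NIC.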
Moving some or all of these functions to dedicated hardware, a TCP offload engine, frees the system's main CPU for other tasks. As of 2008, very few consumer network interface cards support TOE.
Instead of replacing the TCP stack with a TOE entirely, there are alternative techniques to offload some operations in co-operation with the operating system's TCP stack. TCP checksum offload and large segment offload are supported by the majority of today's Ethernet NICs. Newer techniques like large receive offload and TCP acknowledgment offload are already implemented in some high-end Ethernet hardware, but are effective even when implemented purely in software.
Freed Up CPU Cycles
A generally accepted rule of thumb is that 1 hertz of CPU processing is required to send or receive 1 bit/s of TCP/IP. For example, 5 Gbit/s (625 MB/s) of network traffic requires 5 GHz of CPU processing. This implies that 2 entire cores of a 2.5 GHz multi-core processor are required to handle the TCP/IP processing associated with 5 Gbit/s of TCP/IP traffic. Since Ethernet (10 GbE in this example) is bidirectional, it is possible to send and receive 10 Gbit/s simultaneously (for an aggregate throughput of 20 Gbit/s). Using the 1 Hz/(bit/s) rule, this equates to eight 2.5 GHz cores. (Few if any current-day servers have a requirement to move 10 Gbit/s in both directions, but not so long ago 1 Gbit/s full duplex was thought to be more than enough bandwidth.)
Many of the CPU cycles used for TCP/IP processing are "freed up" by TCP/IP offload and may be used by the CPU (usually a server CPU) to perform other tasks such as file system processing (in a file server) or indexing (in a backup media server). In other words, a server with TCP/IP offload can do more server work than a server without TCP/IP offload NICs.
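The rule-of-thumb arithmetic in this section can be checked with a few lines of Python. This is only a sketch of the 1 Hz per bit/s estimate; the function name and its parameters are chosen here for illustration, and the rule itself is an approximation rather than a measurement.

```python
import math


def cores_needed(gbit_per_s: float, core_ghz: float, hz_per_bit: float = 1.0) -> int:
    """Estimate whole CPU cores consumed by TCP/IP processing.

    Applies the rule of thumb that each bit/s of TCP/IP traffic
    costs hz_per_bit hertz of CPU processing.
    """
    ghz_required = gbit_per_s * hz_per_bit
    return math.ceil(ghz_required / core_ghz)


if __name__ == "__main__":
    print(cores_needed(5, 2.5))   # 5 Gbit/s one way: 2 cores
    print(cores_needed(20, 2.5))  # 10 Gbit/s in each direction: 8 cores
```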
Reduction of PCI traffic
In addition to the protocol overhead that TOE can address, it can also address some architectural issues that affect a large percentage of host-based (server and PC) endpoints.
Currently most endpoint hosts are PCI bus based, which provides a standard interface for adding certain peripherals, such as network interfaces, to servers and PCs.
PCI is inefficient for transferring small bursts of data from host memory across the PCI bus to the network interface ICs, but its efficiency improves as the data burst size increases. TCP generates a large number of small packets (e.g. acknowledgements), and because these are typically generated on the host CPU and transmitted across the PCI bus and out the network physical interface, they reduce the host computer's I/O throughput.
A TOE, located on the network interface, sits on the other side of the PCI bus from the host CPU, so it can address this I/O efficiency issue: the data to be sent across the TCP connection can be passed from the CPU to the TOE across the PCI bus in large bursts, and none of the smaller TCP packets have to traverse the PCI bus.
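The burst-size effect described above can be illustrated with a toy model in Python. The model assumes each bus transaction pays a fixed number of overhead cycles before data moves; the overhead value of 8 cycles and the 4-byte bus width are illustrative assumptions, not measured PCI figures, and real buses impose further limits this sketch ignores.

```python
import math

BUS_WIDTH_BYTES = 4   # assumed: 32-bit bus, one 4-byte word per data cycle
OVERHEAD_CYCLES = 8   # assumed fixed cost per transaction (illustrative only)


def burst_efficiency(payload_bytes: int, overhead_cycles: int = OVERHEAD_CYCLES) -> float:
    """Fraction of bus cycles in one transaction that carry payload data."""
    data_cycles = math.ceil(payload_bytes / BUS_WIDTH_BYTES)
    return data_cycles / (data_cycles + overhead_cycles)


if __name__ == "__main__":
    # A bare acknowledgement-sized transfer versus larger bursts
    for size in (40, 1460, 65536):
        print(f"{size:>6} bytes: {burst_efficiency(size):.1%} of cycles carry data")
```

Under these assumptions a small acknowledgement-sized transfer spends a much larger share of its bus time on overhead than a multi-kilobyte burst does, which is the inefficiency a TOE avoids by keeping small packets off the bus.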
History
One of the first patents in this technology, for UDP offload, was issued to Auspex Systems in early 1990. Auspex founder Larry Boucher and a number of Auspex engineers went on to found Alacritech in 1997 with the idea of extending the concept of network stack offload to TCP and implementing it in custom silicon. They introduced the first parallel-stack full offload network card in early 1999, and the company’s SLIC (Session Layer Interface Card) was the predecessor to its current TOE offerings. Alacritech holds a number of patents in the area of TCP/IP offload.
By 2002, as the emergence of TCP-based storage such as iSCSI spurred interest, it was said that "At least a dozen newcomers, most founded toward the end of the dot-com bubble, are chasing the opportunity for merchant semiconductor accelerators for storage protocols and applications, vying with half a dozen entrenched vendors and in-house ASIC designs."
In 2005 Microsoft licensed Alacritech's patent base and along with Alacritech created the partial TCP offload architecture that has become known as TCP chimney offload. TCP chimney offload centers on the Alacritech "Communication Block Passing Patent". At the same time, Broadcom also obtained a license to build TCP chimney offload chips.
Parallel-stack full offload
Parallel-stack full offload gets its name from the concept of two parallel TCP/IP stacks. The first is the main host stack, which is included with the host OS. The second, or "parallel stack", is connected between the Application Layer (using the Internet protocol suite naming conventions) and the Transport Layer (TCP) using a "vampire tap". The vampire tap intercepts TCP connection requests by applications and is responsible for TCP connection management as well as TCP data transfer. Many of the criticisms in the following section relate to this type of TCP offload.
HBA full offload
HBA full offload is found in iSCSI host bus adapters, which present themselves as disk controllers to the host system while connecting (via TCP/IP) to an iSCSI storage device. This type of TCP offload not only offloads TCP/IP processing but also offloads the iSCSI initiator function. Because the HBA appears to the host as a disk controller, it can only be used with iSCSI devices and is not appropriate for general TCP/IP offload.
TCP chimney partial offload
TCP chimney offload addresses the major security criticism of parallel-stack full offload. In partial offload, the main system stack controls all connections to the host. After a connection has been established between the local host (usually a server) and a foreign host (usually a client), the connection and its state are passed to the TCP offload engine. The heavy lifting of data transmit and receive is handled by the offload device. Almost all TCP offload engines use some type of TCP/IP hardware implementation to perform the data transfer without host CPU intervention. When the connection is closed, the connection state is returned from the offload engine to the main system stack. Maintaining control of TCP connections allows the main system stack to implement and control connection security.
Support in Linux
The standard Linux kernel does not include support for TOE hardware. While there are patches from hardware manufacturers such as Chelsio or QLogic that add support, the Linux kernel developers are opposed to this technology for several reasons, including:
- Security: because TOE is implemented in hardware, patches must be applied to the TOE firmware, instead of just software, to address any security vulnerabilities found in a particular TOE implementation. This is further compounded by the newness and vendor-specificity of this hardware, as compared to a well-tested TCP/IP stack as is found in an operating system that does not use TOE.
- Limitations of hardware: because connections are buffered and processed on the TOE chip, resource starvation can more easily occur as compared to the generous CPU and memory available to the operating system.
- Complexity: TOE breaks the assumption that kernels make about having access to all resources at all times; details such as memory used by open connections are not available with TOE. TOE also requires very large changes to a networking stack in order to be supported properly, and even when that is done, features like Quality of Service and packet filtering typically do not work.
- Proprietary: TOE is implemented differently by each hardware vendor. This means more code must be rewritten to deal with the various TOE implementations, at a cost of the aforementioned complexity and, possibly, security. Furthermore, TOE firmware cannot be easily modified since it is closed-source.
- Obsolescence: each TOE NIC has a limited lifetime of usefulness, because system hardware rapidly catches up to TOE performance levels and eventually exceeds them.
Suppliers
Much of the current work on TOE technology is by manufacturers of 10 Gigabit Ethernet interface cards, such as Alacritech, intilop corporation, Broadcom Corporation, Chelsio Communications, Emulex, LeWiz Communications, Mellanox, Neterion Technologies, QLogic and Tehuti Networks Ltd.
See also
- Large segment offload (LSO)
- Large receive offload (LRO)
- Scalable Networking Pack
External links
- Article: TCP Offload to the Rescue by Andy Currid at ACM Queue
- Patent Application 20040042487