Direct memory access
Direct memory access (DMA) is a feature of modern computers that allows certain hardware subsystems within the computer to access system memory independently of the central processing unit (CPU).
Without DMA, when the CPU uses programmed input/output it is typically fully occupied for the entire duration of the read or write operation, and is thus unavailable to perform other work. With DMA, the CPU first initiates the transfer, then does other operations while the transfer is in progress, and finally receives an interrupt from the DMA controller once the operation is done. This is useful any time the CPU cannot keep up with the rate of data transfer, or when the CPU can perform useful work while waiting for a relatively slow I/O data transfer. Many hardware systems use DMA, including disk drive controllers, graphics cards, network cards and sound cards. DMA is also used for intra-chip data transfer in multi-core processors. Computers that have DMA channels can transfer data to and from devices with much less CPU overhead than computers without DMA channels. Similarly, a processing element inside a multi-core processor can transfer data to and from its local memory without occupying its processor time, allowing computation and data transfer to proceed in parallel.
DMA can also be used for "memory to memory" copying or moving of data within memory. DMA can offload expensive memory operations, such as large copies or scatter-gather operations, from the CPU to a dedicated DMA engine. Intel includes such engines on high-end servers, called I/O Acceleration Technology (I/OAT).
Principle
A DMA controller can generate addresses and initiate memory read or write cycles. It contains several registers that can be written and read by the CPU. These include a memory address register, a byte count register, and one or more control registers. The control registers specify the I/O port to use, the direction of the transfer (reading from the I/O device or writing to the I/O device), the transfer unit (byte at a time or word at a time), and the number of bytes to transfer in one burst.
To carry out an input, output or memory-to-memory operation, the host processor initializes the DMA controller with a count of the number of words to transfer, and the memory address to use. The CPU then sends commands to a peripheral device to initiate transfer of data. The DMA controller then provides addresses and read/write control lines to the system memory. Each time a word of data is ready to be transferred between the peripheral device and memory, the DMA controller increments its internal address register until the full block of data is transferred.
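As a concrete illustration of this principle, the sketch below shows how software might program such a controller through memory-mapped registers. The register layout, base address and bit definitions here are hypothetical, standing in for whatever a real controller defines.

```c
/* Minimal sketch of programming a hypothetical memory-mapped DMA controller.
 * The base address, register layout and control bits are illustrative
 * assumptions, not those of any particular chip. */
#include <stdint.h>

#define DMA_BASE 0x40001000u                /* hypothetical base address */

typedef struct {
    volatile uint32_t mem_addr;             /* memory address register   */
    volatile uint32_t count;                /* byte/word count register  */
    volatile uint32_t control;              /* direction, unit, start... */
    volatile uint32_t status;               /* done / error flags        */
} dma_regs_t;

#define DMA_CTRL_DIR_READ  (1u << 0)        /* device -> memory          */
#define DMA_CTRL_START     (1u << 1)

static void dma_start_read(uint32_t dest, uint32_t nbytes)
{
    dma_regs_t *dma = (dma_regs_t *)DMA_BASE;

    dma->mem_addr = dest;                   /* where to put the data     */
    dma->count    = nbytes;                 /* how much to transfer      */
    dma->control  = DMA_CTRL_DIR_READ | DMA_CTRL_START;
    /* The CPU is now free to do other work; completion is normally
     * signalled by an interrupt rather than by polling dma->status. */
}
```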
DMA transfers can either occur one word at a time, allowing the CPU to access memory on alternate bus cycles (this is called cycle stealing, since the DMA controller and the CPU contend for memory access), or in burst mode, where the CPU can be put on hold while the DMA transfer occurs and a full block of possibly hundreds or thousands of words can be moved. Where memory cycles are much faster than processor cycles, an interleaved DMA cycle is possible, in which the DMA controller uses memory while the CPU cannot.
In a bus mastering system, both the CPU and peripherals can be granted control of the memory bus. Where a peripheral can become a bus master, it can write directly to system memory without involvement of the CPU, providing memory address and control signals as required. Some means must be provided to put the processor into a hold condition so that bus contention does not occur.
Burst Mode
An entire block of data is transferred in one contiguous sequence. Once the DMA controller is granted access to the system buses by the CPU, it transfers all bytes of data in the data block before releasing control of the system buses back to the CPU. This mode is useful for loading programs or data files into memory, but it does render the CPU inactive for relatively long periods of time. It is also called block transfer mode.
Cycle Stealing Mode
Cycle stealing mode is a viable alternative for systems in which the CPU should not be disabled for the length of time needed for a burst transfer. The DMA controller obtains access to the system buses as in burst mode, using the BR (bus request) and BG (bus grant) signals, the two control signals that govern the handshake between the CPU and the DMA controller. However, it transfers one byte of data and then deasserts BR, returning control of the system buses to the CPU. It continually issues requests via BR, transferring one byte of data per request, until it has transferred its entire block of data. By repeatedly obtaining and releasing control of the system buses, the DMA controller essentially interleaves instruction and data transfers: the CPU processes an instruction, then the DMA controller transfers a data value, and so on. The data block is not transferred as quickly as in burst mode, but the CPU is not idled for as long. This mode is useful for controllers that monitor data in real time.
Transparent Mode
Transparent mode requires the most time to transfer a block of data, yet it is also the most efficient mode in terms of overall system performance. The DMA controller transfers data only when the CPU is performing operations that do not use the system buses. The primary advantage is that the CPU never stops executing its programs and the DMA transfer is free in terms of time; the disadvantage is that the hardware needed to determine when the CPU is not using the system buses can be quite complex and relatively expensive.
Cache coherency
DMA can lead to cache coherency problems. Imagine a CPU equipped with a cache and an external memory that can be accessed directly by devices using DMA. When the CPU accesses location X in the memory, the current value will be stored in the cache. Subsequent operations on X will update the cached copy of X, but not the external memory version of X, assuming a write-back cache. If the cache is not flushed to the memory before the next time a device tries to access X, the device will receive a stale value of X.
Similarly, if the cached copy of X is not invalidated when a device writes a new value to the memory, then the CPU will operate on a stale value of X.
This issue can be addressed in one of two ways in system design. Cache-coherent systems implement a method in hardware whereby external writes are signaled to the cache controller, which then performs a cache invalidation for DMA writes or a cache flush for DMA reads. Non-coherent systems leave this to software, where the OS must ensure that the cache lines are flushed before an outgoing DMA transfer is started and invalidated before a memory range affected by an incoming DMA transfer is accessed. The OS must also make sure that the memory range is not accessed by any running threads in the meantime. The latter approach introduces some overhead to the DMA operation, as most hardware requires a loop to invalidate each cache line individually.
Hybrids also exist, where the secondary L2 cache is coherent while the L1 cache (typically on-CPU) is managed by software.
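On a non-coherent system, the software-managed discipline described above can be sketched as follows. The helpers cache_flush_range(), cache_invalidate_range() and the two blocking DMA calls are hypothetical placeholders for the platform's actual cache-maintenance and driver primitives.

```c
/* Sketch of the software-managed (non-coherent) approach. All externs below
 * are hypothetical stand-ins for platform-specific primitives. */
#include <stddef.h>
#include <stdint.h>

extern void cache_flush_range(void *addr, size_t len);       /* write back dirty lines      */
extern void cache_invalidate_range(void *addr, size_t len);  /* drop possibly stale lines   */
extern void dma_to_device(const void *src, size_t len);      /* start transfer and wait     */
extern void dma_from_device(void *dst, size_t len);          /* start transfer and wait     */

void send_buffer(const uint8_t *buf, size_t len)
{
    /* Outgoing transfer: make sure the device sees the CPU's latest writes. */
    cache_flush_range((void *)buf, len);
    dma_to_device(buf, len);
}

void receive_buffer(uint8_t *buf, size_t len)
{
    dma_from_device(buf, len);
    /* Incoming transfer: drop any cached copies so the CPU re-reads the
     * data the device just wrote to memory, before any thread touches it. */
    cache_invalidate_range(buf, len);
}
```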
ISA
In the original IBM PC, there was only one Intel 8237 DMA controller capable of providing four DMA channels (numbered 0-3). These DMA channels performed 8-bit transfers and could only address the first megabyte of RAM. With the IBM PC/AT, a second 8237 DMA controller was added (channels 5-7; channel 4 is unusable), and the page register was rewired to address the full 16 MB memory address space of the 80286 CPU. This second controller performed 16-bit transfers.
Due to their lagging performance (2.5 Mbit/s), these devices have been largely obsolete since the advent of the 80386 processor and its capacity for 32-bit transfers. They are still supported to the extent they are required to support built-in legacy PC hardware on modern machines. The only pieces of legacy hardware that use ISA DMA and are still fairly common are the built-in floppy disk controllers of many PC mainboards and those IEEE 1284 parallel ports that support the fast ECP mode.
Each DMA channel has a 16-bit address register and a 16-bit count register associated with it. To initiate a data transfer the device driver sets up the DMA channel's address and count registers together with the direction of the data transfer, read or write. It then instructs the DMA hardware to begin the transfer. When the transfer is complete, the device interrupts the CPU.
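On the PC, this sequence maps onto the 8237's I/O ports. The sketch below programs channel 2 (the floppy controller's channel) for a device-to-memory transfer; the port numbers follow the conventional PC/AT layout, but treat the details as illustrative, and outb() is declared here only as a hypothetical port-output helper of the kind an OS kernel provides.

```c
/* Sketch of setting up ISA DMA channel 2 on the 8237 for a device-to-memory
 * ("read from device") transfer. Port numbers are the conventional PC/AT
 * ones; outb() is a hypothetical port-output helper. */
#include <stdint.h>

extern void outb(uint8_t value, uint16_t port);   /* hypothetical declaration */

void isa_dma2_setup_read(uint32_t phys_addr, uint16_t length)
{
    uint16_t count = length - 1;                  /* the 8237 counts length - 1      */

    outb(0x06, 0x0A);                             /* mask channel 2                  */
    outb(0x00, 0x0C);                             /* reset the address flip-flop     */
    outb(0x46, 0x0B);                             /* mode: single, write to memory, ch 2 */
    outb(phys_addr & 0xFF, 0x04);                 /* address bits 0-7                */
    outb((phys_addr >> 8) & 0xFF, 0x04);          /* address bits 8-15               */
    outb((phys_addr >> 16) & 0xFF, 0x81);         /* page register: address bits 16-23 */
    outb(count & 0xFF, 0x05);                     /* count bits 0-7                  */
    outb((count >> 8) & 0xFF, 0x05);              /* count bits 8-15                 */
    outb(0x02, 0x0A);                             /* unmask channel 2: DMA is armed  */
    /* The peripheral (e.g. the floppy controller) is then commanded to start,
     * and it raises an interrupt when the transfer completes. */
}
```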
Scatter-gather DMA allows the transfer of data to and from multiple memory areas in a single DMA transaction. It is equivalent to the chaining together of multiple simple DMA requests. The motivation is to off-load multiple input/output interrupt and data copy tasks from the CPU.
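A scatter-gather transfer is typically described to the engine as a chain of descriptors in memory. The structure and flag below are an illustrative sketch; real controllers define their own descriptor layouts.

```c
#include <stdint.h>

/* One entry in a (hypothetical) scatter-gather descriptor chain. */
struct dma_descriptor {
    uint64_t buf_addr;    /* physical address of one memory area                */
    uint32_t buf_len;     /* length of that area in bytes                       */
    uint32_t flags;       /* e.g. DESC_LAST to mark the end of the chain        */
    uint64_t next_desc;   /* physical address of the next descriptor (0 = none) */
};

#define DESC_LAST 0x1u

/* Link an array of descriptors into a chain the engine can walk in a single
 * transaction, raising one completion interrupt at the end instead of one
 * interrupt per buffer. phys_of() is a hypothetical address-lookup helper. */
static void link_chain(struct dma_descriptor *d, unsigned n,
                       uint64_t (*phys_of)(const void *))
{
    for (unsigned i = 0; i < n; i++) {
        d[i].flags     = (i == n - 1) ? DESC_LAST : 0;
        d[i].next_desc = (i == n - 1) ? 0 : phys_of(&d[i + 1]);
    }
}
```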
DRQ stands for DMA request; DACK for DMA acknowledge. These symbols, seen on hardware schematics of computer systems with DMA functionality, represent electronic signaling lines between the CPU and the DMA controller. Each DMA channel has one Request and one Acknowledge line. A device that uses DMA must be configured to use both lines of the assigned DMA channel.
Standard ISA DMA assignments:
- Channel 0: DRAM refresh (obsolete),
- Channel 1: User hardware,
- Channel 2: Floppy disk controller,
- Channel 3: Hard disk (obsoleted by PIO modes, and replaced by UDMA modes),
- Channel 4: Cascade from XT DMA controller,
- Channel 5: Hard disk (PS/2 only), user hardware for all others,
- Channel 6: User hardware,
- Channel 7: User hardware.
PCI
The PCI architecture has no central DMA controller, unlike ISA. Instead, any PCI component can request control of the bus ("become the bus master") and request to read from and write to system memory. More precisely, a PCI component requests bus ownership from the PCI bus controller (usually the southbridge in a modern PC design), which arbitrates if several devices request bus ownership simultaneously, since there can only be one bus master at a time. When the component is granted ownership, it issues normal read and write commands on the PCI bus, which are claimed by the bus controller and forwarded to the memory controller using a scheme that is specific to each chipset.
As an example, on a modern AMD Socket AM2-based PC, the southbridge will forward the transactions to the northbridge (which is integrated on the CPU die) using HyperTransport, which will in turn convert them to DDR2 operations and send them out on the DDR2 memory bus. As can be seen, there are quite a number of steps involved in a PCI DMA transfer; however, that poses little problem, since the PCI device or the PCI bus itself is an order of magnitude slower than the rest of the components (see list of device bandwidths).
A modern x86 CPU may use more than 4 GB of memory, utilizing PAE, a 36-bit addressing mode, or the native 64-bit mode of x86-64 CPUs. In such a case, a device using DMA with a 32-bit address bus is unable to address memory above the 4 GB line. The Double Address Cycle (DAC) mechanism, if implemented on both the PCI bus and the device itself, enables 64-bit DMA addressing. Otherwise, the operating system needs to work around the problem by either using costly double buffers (Windows nomenclature), also known as bounce buffers (Linux), or it can use an IOMMU to provide address translation services if one is present.
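The bounce-buffer workaround can be sketched as follows; every helper here (the low-memory allocator, the physical-address lookup and the blocking DMA call) is a hypothetical placeholder for what the operating system actually provides.

```c
/* Sketch of the bounce (double) buffer workaround for a DMA device that can
 * only address the low 4 GB. All helper names are hypothetical. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

extern void    *alloc_low_mem(size_t len);           /* returns memory below 4 GB        */
extern void     free_low_mem(void *p, size_t len);
extern uint64_t phys_addr_of(const void *p);         /* virtual -> physical address      */
extern void     dma_write_to_device(uint64_t phys, size_t len);  /* start and wait       */

/* Send 'len' bytes at 'buf' to the device, bouncing through low memory
 * if the buffer's physical address lies above the device's 4 GB limit. */
void dma_send(const void *buf, size_t len)
{
    uint64_t phys = phys_addr_of(buf);

    if (phys + len <= (1ull << 32)) {
        dma_write_to_device(phys, len);               /* directly addressable             */
        return;
    }

    void *bounce = alloc_low_mem(len);                /* the costly extra copy begins here */
    memcpy(bounce, buf, len);
    dma_write_to_device(phys_addr_of(bounce), len);
    free_low_mem(bounce, len);
}
```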
I/OAT
As an example of a DMA engine incorporated in a general-purpose CPU platform, newer Intel Xeon chipsets include a DMA engine technology called I/O Acceleration Technology (I/OAT), meant to improve network performance on high-throughput network interfaces, in particular gigabit Ethernet and faster. However, various benchmarks of this approach by Intel's Linux kernel developer Andrew Grover indicate no more than a 10% improvement in CPU utilization with receiving workloads, and no improvement when transmitting data.
AHB
In systems-on-a-chip and embedded systems, typical system bus infrastructure is a complex on-chip bus such as the AMBA High-performance Bus (AHB). AMBA defines two kinds of AHB components: master and slave. A slave interface is similar to programmed I/O, through which the software (running on an embedded CPU, e.g. ARM) can write/read I/O registers or (less commonly) local memory blocks inside the device. A master interface can be used by the device to perform DMA transactions to/from system memory without heavily loading the CPU.
Therefore, high-bandwidth devices such as network controllers that need to transfer huge amounts of data to/from system memory will have two interface adapters to the AHB: a master and a slave interface. This is because on-chip buses like AHB do not support tri-stating the bus or alternating the direction of any line on the bus. Like PCI, no central DMA controller is required since the DMA is bus-mastering, but an arbiter is required when multiple masters are present on the system.
Internally, a multichannel DMA engine is usually present in the device to perform multiple concurrent scatter-gather operations as programmed by the software.
Cell
As an example of the use of DMA in a multiprocessor system-on-chip, IBM/Sony/Toshiba's Cell processor incorporates a DMA engine for each of its nine processing elements, including one Power processor element (PPE) and eight synergistic processor elements (SPEs). Since an SPE's load/store instructions can read/write only its own local memory, an SPE entirely depends on DMA to transfer data to and from the main memory and the local memories of other SPEs. Thus DMA acts as a primary means of data transfer among cores inside this CPU (in contrast to cache-coherent CMP architectures such as Intel's upcoming general-purpose GPU, Larrabee).
DMA in Cell is fully cache coherent (note, however, that the local stores of the SPEs operated upon by DMA do not act as a globally coherent cache in the standard sense). In both read ("get") and write ("put"), a DMA command can transfer either a single block area of up to 16 KB in size, or a list of 2 to 2048 such blocks. The DMA command is issued by specifying a pair of a local address and a remote address: for example, when an SPE program issues a put DMA command, it specifies an address of its own local memory as the source and a virtual memory address (pointing to either the main memory or the local memory of another SPE) as the target, together with a block size. According to a recent experiment, the effective peak performance of DMA in Cell (3 GHz, under uniform traffic) reaches 200 GB per second.
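From the SPE side, issuing such a put command typically looks like the sketch below, written against the MFC intrinsics of the Cell SDK's spu_mfcio.h (mfc_put and the tag-status calls); alignment requirements and tag management are simplified, so treat the details as illustrative rather than a definitive use of the SDK.

```c
/* Sketch of an SPE-side DMA "put": copy a 16 KB block from the SPE's local
 * store to an effective (main-memory) address. */
#include <spu_mfcio.h>
#include <stdint.h>

#define BLOCK_SIZE 16384                        /* 16 KB, the per-block maximum */

static volatile uint8_t local_buf[BLOCK_SIZE] __attribute__((aligned(128)));

void put_block(uint64_t remote_ea)              /* effective address in main memory */
{
    const uint32_t tag = 3;                     /* arbitrary DMA tag group (0-31)   */

    /* Queue the transfer: local store address, effective address, size, tag. */
    mfc_put(local_buf, remote_ea, BLOCK_SIZE, tag, 0, 0);

    /* Wait for all transfers in this tag group to complete. */
    mfc_write_tag_mask(1u << tag);
    mfc_read_tag_status_all();
}
```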
See also
- AT Attachment
- Blitter
- Channel I/O
- DMA attack
- Polling (computer science)
- Remote direct memory access