SSE4
SSE4 is a CPU instruction set used in the Intel Core microarchitecture and AMD K10 (K8L). It was announced on 27 September 2006 at the Fall 2006 Intel Developer Forum, with vague details in a white paper; more precise details of 47 instructions became available at the Spring 2007 Intel Developer Forum in Beijing, in a presentation. The SSE4 Programming Reference is available from Intel.
SSE4 subsets
Intel SSE4 consists of 54 instructions. A subset consisting of 47 instructions, referred to as SSE4.1 in some Intel documentation, is available in Penryn. Additionally, SSE4.2, a second subset consisting of the 7 remaining instructions, is first available in Nehalem-based Core i7. Intel credits feedback from developers as playing an important role in the development of the instruction set.

AMD supports four instructions from the SSE4 instruction set, but has also added four new SSE instructions, naming the group SSE4a. These instructions are not found in Intel's processors supporting SSE4.1, and AMD processors only started supporting Intel's SSE4.1 and SSE4.2 with the Bulldozer-based FX processors. Alongside SSE4a, support was added for unaligned SSE load-operation instructions (which formerly required 16-byte alignment).
Name confusion
What is now known as SSSE3 (Supplemental Streaming SIMD Extensions 3), introduced in the Intel Core 2 processor line, was mistakenly referred to as SSE4 by the media during its development.
Intel is using the marketing term HD Boost to refer to SSE4.
New instructions
Unlike all previous iterations of SSE, SSE4 contains instructions that execute operations which are not specific to multimedia applications. It features a number of instructions whose action is determined by a constant field and a set of instructions that take XMM0 as an implicit third operand.

Several of these instructions are enabled by the single-cycle shuffle engine in Penryn. (Shuffle operations are those that involve the repositioning of bits.)
SSE4.1
These instructions were introduced with the Penryn microarchitecture, the 45 nm shrink of Intel's Core microarchitecture. Support is indicated via the CPUID.01H:ECX.SSE41[Bit 19] flag.

Instruction | Description |
---|---|
MPSADBW | Compute eight offset sums of absolute differences, four at a time (i.e., abs(x0−y0)+abs(x1−y1)+abs(x2−y2)+abs(x3−y3), abs(x0−y1)+abs(x1−y2)+abs(x2−y3)+abs(x3−y4), …, abs(x0−y7)+abs(x1−y8)+abs(x2−y9)+abs(x3−y10)); this operation is important for some HD codecs, and allows an 8×8 block difference to be computed in fewer than seven cycles. One bit of a three-bit immediate operand indicates whether y0..y10 or y4..y14 should be used from the destination operand, the other two whether x0..x3, x4..x7, x8..x11 or x12..x15 should be used from the source. |
PHMINPOSUW | Sets the bottom unsigned 16-bit word of the destination to the smallest unsigned 16-bit word in the source, and the next-from-bottom to the index of that word in the source. |
PMULDQ | Packed signed multiplication on two sets of two out of four packed integers, the 1st and 3rd per packed 4, giving two packed 64-bit results. |
PMULLD | Packed signed multiplication, four packed sets of 32-bit integers multiplied to give 4 packed 32-bit results. |
DPPS, DPPD | Dot product for AOS (Array of Structs) data. This takes an immediate operand consisting of four (or two for DPPD) bits to select which of the entries in the input to multiply and accumulate, and another four (or two for DPPD) to select whether to put 0 or the dot-product in the appropriate field of the output. |
BLENDPS, BLENDPD, BLENDVPS, BLENDVPD, PBLENDVB, PBLENDW | Conditional copying of elements in one location with another, based (for non-V form) on the bits in an immediate operand, and (for V form) on the bits in register XMM0. |
PMINSB, PMAXSB, PMINUW, PMAXUW, PMINUD, PMAXUD, PMINSD, PMAXSD | Packed minimum/maximum for different integer operand types |
ROUNDPS, ROUNDSS, ROUNDPD, ROUNDSD | Round values in a floating-point register to integers, using one of four rounding modes specified by an immediate operand |
INSERTPS, PINSRB, PINSRD/PINSRQ, EXTRACTPS, PEXTRB, PEXTRW, PEXTRD/PEXTRQ | The INSERTPS and PINSR instructions read 8, 16 or 32 bits from an x86 register or memory location and insert it into a field in the destination register given by an immediate operand; EXTRACTPS and PEXTR read a field from the source register and insert it into an x86 register or memory location. For example, PEXTRD eax, xmm0, 1; EXTRACTPS [addr+4*eax], xmm1, 1 stores field 1 of xmm1 at the address given by field 1 of xmm0. |
PMOVSXBW, PMOVZXBW, PMOVSXBD, PMOVZXBD, PMOVSXBQ, PMOVZXBQ, PMOVSXWD, PMOVZXWD, PMOVSXWQ, PMOVZXWQ, PMOVSXDQ, PMOVZXDQ | Packed sign/zero extension to wider types |
PTEST | This is similar to the TEST instruction, in that it sets the Z flag to the result of an AND between its operands: ZF is set if DEST AND SRC is equal to 0. Additionally, it sets the C flag if (NOT DEST) AND SRC equals zero. This is equivalent to setting the Z flag if none of the bits masked by SRC are set, and the C flag if all of the bits masked by SRC are set. |
PCMPEQQ | Quadword (64 bits) compare for equality |
PACKUSDW | Convert signed DWORDs into unsigned WORDs with saturation. |
MOVNTDQA | Efficient read from write-combining memory area into SSE register; this is useful for retrieving results from peripherals attached to the memory bus. |
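The MPSADBW and PTEST rows above describe their behavior in words; a scalar model can make the indexing and flag rules easier to follow. The following is a minimal sketch in portable C, not Intel's reference pseudocode. The names `mpsadbw_model`, `ptest_model`, and `struct ptest_flags` are hypothetical, and registers are modeled as plain arrays with element 0 as the lowest field.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of MPSADBW (hypothetical helper, not an intrinsic).
 * dst holds y0..y15 and src holds x0..x15, lowest byte first.
 * Bit 2 of imm3 selects y0..y10 or y4..y14 from the destination;
 * bits 1:0 select x0..x3, x4..x7, x8..x11 or x12..x15 from the source. */
static void mpsadbw_model(uint16_t out[8], const uint8_t dst[16],
                          const uint8_t src[16], unsigned imm3)
{
    unsigned y_off = (imm3 & 4u) ? 4u : 0u;
    unsigned x_off = (imm3 & 3u) * 4u;

    for (unsigned i = 0; i < 8; i++) {
        unsigned sum = 0;
        /* Four absolute differences per result, window shifted by i. */
        for (unsigned j = 0; j < 4; j++)
            sum += (unsigned)abs((int)src[x_off + j] - (int)dst[y_off + i + j]);
        out[i] = (uint16_t)sum;
    }
}

/* Flags produced by the PTEST model below. */
struct ptest_flags {
    int zf; /* 1 if DEST AND SRC is all zero */
    int cf; /* 1 if (NOT DEST) AND SRC is all zero */
};

/* Scalar model of PTEST on a 128-bit value held as two 64-bit halves. */
static struct ptest_flags ptest_model(const uint64_t dst[2],
                                      const uint64_t src[2])
{
    struct ptest_flags f;
    f.zf = ((dst[0] & src[0]) | (dst[1] & src[1])) == 0;
    f.cf = ((~dst[0] & src[0]) | (~dst[1] & src[1])) == 0;
    return f;
}
```

With SRC used as a mask, `zf` reports that none of the masked bits of DEST are set and `cf` reports that all of them are, which matches the equivalence stated in the PTEST row.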
SSE4.2
These instructions were first implemented in the Nehalem-based Intel Core i7 product line and complete the SSE4 instruction set. Support is indicated via the CPUID.01H:ECX.SSE42[Bit 20] flag.

Instruction | Description |
---|---|
CRC32 | Accumulate CRC32C value using the polynomial 0x11EDC6F41 (or, without the high order bit, 0x1EDC6F41). |
PCMPESTRI | Packed Compare Explicit Length Strings, Return Index |
PCMPESTRM | Packed Compare Explicit Length Strings, Return Mask |
PCMPISTRI | Packed Compare Implicit Length Strings, Return Index |
PCMPISTRM | Packed Compare Implicit Length Strings, Return Mask |
PCMPGTQ | Compare Packed Signed 64-bit data For Greater Than |
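To illustrate what accumulating a CRC32C value means, here is a bit-at-a-time software sketch in portable C. It assumes the usual reflected (least-significant-bit-first) formulation, in which the polynomial 0x1EDC6F41 appears bit-reversed as 0x82F63B78. The function names are hypothetical; `crc32c_step_u8` is intended to mirror a single byte-wide accumulate step, while `crc32c` adds the conventional initial value and final inversion around it.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reversed form of the CRC32C polynomial 0x1EDC6F41. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Fold one byte into a running CRC, least significant bit first
 * (hypothetical helper modeling one byte-wide accumulate step). */
static uint32_t crc32c_step_u8(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc >> 1) ^ ((crc & 1u) ? CRC32C_POLY_REFLECTED : 0u);
    return crc;
}

/* CRC32C of a buffer, with the conventional all-ones initial value
 * and final bitwise inversion. */
static uint32_t crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
        crc = crc32c_step_u8(crc, p[i]);
    return crc ^ 0xFFFFFFFFu;
}
```

For the standard check string "123456789", this sketch yields 0xE3069283, the published CRC-32C check value. A hardware version would replace the inner bit loop with one CRC32 instruction per byte, word, or larger operand.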
POPCNT and LZCNT
These instructions operate on integer rather than SSE registers, and although introduced at the same time, are not considered part of the SSE4.2 instruction set; instead, they have their own dedicated CPUID bits to indicate support. Intel implements POPCNT beginning with the Nehalem microarchitecture, and AMD implements both beginning with the Barcelona microarchitecture. AMD calls this pair of instructions Advanced Bit Manipulation (ABM).
Instruction | Description |
---|---|
POPCNT | Population count (count number of bits set to 1). Support is indicated via the CPUID.01H:ECX.POPCNT[Bit 23] flag. |
LZCNT | Leading zero count. Support is indicated via the CPUID.80000001H:ECX.ABM[Bit 5] flag. This instruction is not available on Intel processors. |
Trailing zeros can be counted using POPCNT(NOT(x) AND (x − 1)).
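The identity above can be checked with a short portable sketch. Here `popcount32` is a hypothetical software stand-in for the POPCNT instruction, so the example does not depend on hardware support; `trailing_zeros32` is likewise an illustrative name.

```c
#include <stdint.h>

/* Software stand-in for POPCNT: counts the bits set to 1. */
static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;
    while (v) {
        v &= v - 1u; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Trailing-zero count via POPCNT(NOT(x) AND (x - 1)).
 * Subtracting 1 turns the trailing zeros of x into ones and clears the
 * lowest set bit; ANDing with NOT(x) keeps only those former zeros. */
static unsigned trailing_zeros32(uint32_t x)
{
    return popcount32(~x & (x - 1u));
}
```

For example, `trailing_zeros32(8)` evaluates to 3, and an input of 0 gives 32, because every bit of the mask is set.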
SSE4a
The SSE4a instruction group was introduced in AMD's Barcelona microarchitecture. These instructions are not available in Intel processors. Support is indicated via the CPUID.80000001H:ECX.SSE4A[Bit 6] flag.

Instruction | Description |
---|---|
EXTRQ/INSERTQ | Combined mask-shift instructions. |
MOVNTSD/MOVNTSS | Scalar streaming store instructions. |
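The CPUID flags quoted in the sections above can be gathered into one decoding routine. The sketch below assumes the caller has already executed CPUID and obtained ECX for leaf 01H and for leaf 80000001H; how that is done is compiler-specific and is not shown here. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature flags discussed in this article (hypothetical container). */
struct sse4_features {
    bool sse41;  /* CPUID.01H:ECX bit 19 */
    bool sse42;  /* CPUID.01H:ECX bit 20 */
    bool popcnt; /* CPUID.01H:ECX bit 23 */
    bool abm;    /* CPUID.80000001H:ECX bit 5 (LZCNT) */
    bool sse4a;  /* CPUID.80000001H:ECX bit 6 */
};

/* Decode the feature bits from ECX values supplied by the caller. */
static struct sse4_features decode_sse4_features(uint32_t leaf1_ecx,
                                                 uint32_t ext_leaf1_ecx)
{
    struct sse4_features f;
    f.sse41  = ((leaf1_ecx >> 19) & 1u) != 0;
    f.sse42  = ((leaf1_ecx >> 20) & 1u) != 0;
    f.popcnt = ((leaf1_ecx >> 23) & 1u) != 0;
    f.abm    = ((ext_leaf1_ecx >> 5) & 1u) != 0;
    f.sse4a  = ((ext_leaf1_ecx >> 6) & 1u) != 0;
    return f;
}
```

Keeping the decoding separate from the CPUID call makes the bit positions easy to test with synthetic register values.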