Prefetch input queue
Encyclopedia
Fetching the instruction opcodes
from program memory
well in advance is known as prefetching
and it is served by using prefetch input queue (PIQ).The pre-fetched instructions are stored in data structure Queue. The fetching of opcodes well in advance, prior to their need for execution increases the overall efficiency of the processor
boosting it's speed. The processor no longer has to wait for the memory access operations for the subsequent instruction opcode to complete. This architecture was prominently used in the Intel 8086 microprocessor
.
from memory into a prefetch input queue.
This behavior only applies to von Neumann computers (that is, not Harvard architecture
computers) that can run self-modifying code
and have some sort of instruction pipelining. Nearly all modern high-performance computers fulfill these three requirements.
Usually, the prefetching behavior of the PIQ is invisible to the programming model of the CPU. However, there are some circumstances where the behavior of PIQ is visible, and needs to be taken into account by the programmer.
When the x86-processor changes mode from realmode to protected mode
and vice versa, the PIQ has to be flushed, or else the CPU will continue to translate the machine code
as if it were written in its last mode. If the PIQ is not flushed, the processor might translate its codes wrong and generate an invalid instruction exception
.
When executing self-modifying code
, a change in the processor code immediately in front of the current location of execution might not change how the processor interprets the code, as it is already loaded into its PIQ. It simply executes its old copy already loaded in the PIQ instead of the new and altered version of the code in its RAM and/or cache
.
This behavior of the PIQ can be used to determine if code is being executed inside an emulator
or directly on the hardware of a real CPU. Most emulators will probably never simulate this behavior. If the PIQ-size is zero (changes in the code always affect the state of the processor immediately), it can be deduced that either the code is being executed in an emulator or the processor invalidates the PIQ upon writes to addresses loaded in the PIQ.
(1878-1929) who first conceived of a queue as a solution to congestion in telephone traffic. Different queueing models are proposed in order to approximately simulate the real time queuing systems so that those can be analysed mathematically for different performance specifications.
Queuing models can be represented using Kendall's notation
:
where:
Generally in applications like prefetch input queue, M/M/1 Model is popularly used because of limited use of queue features. In this model in accordance with microprocessors, the user takes the role of the execution unit and server is the bus interface unit.
With a four stage pipeline
, the rate at which instructions are executed is four times that of sequential execution.
The processor usually has two separate units for fetching the instructions and for executing the instructions.
The implementation of a pipeline architecture is possible only if the bus interface unit and the execution unit are independent. While the execution unit is decoding or executing an instruction which does not require the use of the data
and address buses
, the bus interface unit fetches instruction opcodes
from the memory.
This process is much faster than sending out an address, reading the opcode and then decoding and executing it. Fetching the next instruction while the current instruction is being decoded or executed is called pipelining.
The 8086
architecture has a six-byte prefetch instruction pipeline. As the Execution Unit is executing the current instruction, the bus interface unit reads up to six bytes of opcodes in advance from the memory. The six byte long queue was chosen because the maximum number of bytes required for any instruction in 8086 is this long.
An exception is encountered when the execution unit encounters a branch
instruction i.e. either a jump or a call instruction. In this case, the entire queue must be dumped and the contents pointed to by the instruction pointer must be fetched from memory.
As the complexity of these chips increases, the cost also increases. These processors are relatively costlier than their counterparts without the prefetch input queue.
However, these disadvantages are greatly offset by the improvement in processor execution time. After the introduction of prefetch instruction queue in the 8086 processor, all successive processors have incorporated this feature.
mov eax, ahead
mov [eax], 0x9090
ahead:
jmp near to_the_end
; Some other code
to_the_end:
This self-modifying
program will overwrite the jmp to_the_end with two NOP
s (which is encoded as 0x9090). The jump jmp near to_the_end is assembled into two bytes of machine code, so the two NOP
s will just overwrite this jump and nothing else. (That is, the jump is replaced with a do-nothing-code.)
Because the machine code of the jump is already read into the PIQ, and probably also already executed by the processor (superscalar
processors execute several instructions at once, but they "pretend" that they don't because of the need for backward compatibility
), the change of the code will not have any change of the execution flow.
-syntax
self-modifying
x86-assembly
algorithm that determines the size of the PIQ:
code_starts_here:
xor cx, cx ; zero register cx
xor ax, ax ; zero register ax
mov dx, cs
mov [code_segment], dx ; "calculate" codeseg in the far jump below (edx here too)
around:
cmp ax, 1 ; check if ax has been alterd
je found_size
mov [nop_field+cx], 0x90 ; 0x90 = opcode "nop" (NO oPeration)
inc cx
db 0xEA ; 0xEA = opcode "far jump"
dw flush_queue ; should be followed by offset (rm = "dw", pm = "dd")
code_segment:
dw 0 ; and then the code segment (calculated above)
flush_queue:
mov [nop_field+cx], 0x40 ; 0x40 = opcode "inc ax" (INCrease ax)
nop_field:
nop times 256
jmp around
found_size:
;
; register cx now contains the size of the PIQ
; this code is for real mode
and 16-bit protected mode, but it could easily be changed into
; running for 32-bit protected mode as well. just change the "dw" for
; the offset to "dd". you need also change dx to edx at the top as
; well. (dw and dx = 16 bit addressing, dd and edx = 32 bit addressing)
;
What this code does is basically that it changes the execution flow, and determines by brute force
how large the PIQ is. "How far away do I have to change the code in front of me for it to affect me?"
If it is too near (it is already in the PIQ) the update will not have any effect. If it is far enough, the change of the code will affect the program and the program has then found the size of the processor's PIQ.
If this code is being executed under multitasking OS, the context switch
may lead to the wrong value.
Opcode
In computer science engineering, an opcode is the portion of a machine language instruction that specifies the operation to be performed. Their specification and format are laid out in the instruction set architecture of the processor in question...
from program memory
Computer memory
In computing, memory refers to the physical devices used to store programs or data on a temporary or permanent basis for use in a computer or other digital electronic device. The term primary memory is used for the information in physical systems which are fast In computing, memory refers to the...
well in advance is known as prefetching
Instruction prefetch
In computer architecture, instruction prefetch is a technique used in microprocessors to speed up the execution of a program by reducing wait states....
and it is served by using prefetch input queue (PIQ).The pre-fetched instructions are stored in data structure Queue. The fetching of opcodes well in advance, prior to their need for execution increases the overall efficiency of the processor
Microprocessor
A microprocessor incorporates the functions of a computer's central processing unit on a single integrated circuit, or at most a few integrated circuits. It is a multipurpose, programmable device that accepts digital data as input, processes it according to instructions stored in its memory, and...
boosting it's speed. The processor no longer has to wait for the memory access operations for the subsequent instruction opcode to complete. This architecture was prominently used in the Intel 8086 microprocessor
Intel 8086
The 8086 is a 16-bit microprocessor chip designed by Intel between early 1976 and mid-1978, when it was released. The 8086 gave rise to the x86 architecture of Intel's future processors...
.
Introduction
Pipelining was brought to the forefront of computing architecture design during the 1960's due to the need for faster and more efficient computing. Pipelining is the broader concept and most modern processors load their instructions some clock cycles before they execute them. This is achieved by pre-loading machine codeMachine code
Machine code or machine language is a system of impartible instructions executed directly by a computer's central processing unit. Each instruction performs a very specific task, typically either an operation on a unit of data Machine code or machine language is a system of impartible instructions...
from memory into a prefetch input queue.
This behavior only applies to von Neumann computers (that is, not Harvard architecture
Harvard architecture
The Harvard architecture is a computer architecture with physically separate storage and signal pathways for instructions and data. The term originated from the Harvard Mark I relay-based computer, which stored instructions on punched tape and data in electro-mechanical counters...
computers) that can run self-modifying code
Self-modifying code
In computer science, self-modifying code is code that alters its own instructions while it is executing - usually to reduce the instruction path length and improve performance or simply to reduce otherwise repetitively similar code, thus simplifying maintenance...
and have some sort of instruction pipelining. Nearly all modern high-performance computers fulfill these three requirements.
Usually, the prefetching behavior of the PIQ is invisible to the programming model of the CPU. However, there are some circumstances where the behavior of PIQ is visible, and needs to be taken into account by the programmer.
When the x86-processor changes mode from realmode to protected mode
Protected mode
In computing, protected mode, also called protected virtual address mode, is an operational mode of x86-compatible central processing units...
and vice versa, the PIQ has to be flushed, or else the CPU will continue to translate the machine code
Machine code
Machine code or machine language is a system of impartible instructions executed directly by a computer's central processing unit. Each instruction performs a very specific task, typically either an operation on a unit of data Machine code or machine language is a system of impartible instructions...
as if it were written in its last mode. If the PIQ is not flushed, the processor might translate its codes wrong and generate an invalid instruction exception
Exception handling
Exception handling is a programming language construct or computer hardware mechanism designed to handle the occurrence of exceptions, special conditions that change the normal flow of program execution....
.
When executing self-modifying code
Self-modifying code
In computer science, self-modifying code is code that alters its own instructions while it is executing - usually to reduce the instruction path length and improve performance or simply to reduce otherwise repetitively similar code, thus simplifying maintenance...
, a change in the processor code immediately in front of the current location of execution might not change how the processor interprets the code, as it is already loaded into its PIQ. It simply executes its old copy already loaded in the PIQ instead of the new and altered version of the code in its RAM and/or cache
CPU cache
A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations...
.
This behavior of the PIQ can be used to determine if code is being executed inside an emulator
Emulator
In computing, an emulator is hardware or software or both that duplicates the functions of a first computer system in a different second computer system, so that the behavior of the second system closely resembles the behavior of the first system...
or directly on the hardware of a real CPU. Most emulators will probably never simulate this behavior. If the PIQ-size is zero (changes in the code always affect the state of the processor immediately), it can be deduced that either the code is being executed in an emulator or the processor invalidates the PIQ upon writes to addresses loaded in the PIQ.
Queuing theory and performance-based evaluation of different queuing models
It was A.K ErlangAgner Krarup Erlang
Agner Krarup Erlang was a Danish mathematician, statistician and engineer, who invented the fields of traffic engineering and queueing theory....
(1878-1929) who first conceived of a queue as a solution to congestion in telephone traffic. Different queueing models are proposed in order to approximately simulate the real time queuing systems so that those can be analysed mathematically for different performance specifications.
Queuing models can be represented using Kendall's notation
Kendall's notation
In queueing theory, Kendall's notation is the standard system used to describe and classify the queueing model that a queueing system corresponds to. First suggested by D. G...
:
- A1/A2/A3/A4
where:
- A1 is the distribution of time between two arrivals
- A2 is the service time distribution
- A3 is the total number of servers
- A4 is the capacity of system
- M/M/1 Model (Single Queue Single Server/ Markov process): In this model, elements of queue are served on first-come, first-served basis. Given the mean arrival and service rates, then actual rates vary around these average values randomly and hence have to be determined using a cumulative probability distribution functionCumulative distribution functionIn probability theory and statistics, the cumulative distribution function , or just distribution function, describes the probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x. Intuitively, it is the "area so far"...
. - M/M/r Model: This model is a generalization of the basic M/M/1 model where multiple servers operate in parallel. This kind of model can also model scenarios with impatient users who leave the queue immediately if they are not receiving service. This can also be modeled using a Bernoulli processBernoulli processIn probability and statistics, a Bernoulli process is a finite or infinite sequence of binary random variables, so it is a discrete-time stochastic process that takes only two values, canonically 0 and 1. The component Bernoulli variables Xi are identical and independent...
having only two states, success and failure. The best example of this model is our regular land-line telephone systems. - M/G/1 Model (Takacs' finite input Model) : This model is used to analyze advanced cases. Here the service time distribution is no longer a Markov process. This model considers the case of more than one failed machine being repaired by single repairman. Service time for any user is going to increase in this case.
Generally in applications like prefetch input queue, M/M/1 Model is popularly used because of limited use of queue features. In this model in accordance with microprocessors, the user takes the role of the execution unit and server is the bus interface unit.
Instruction queue
The processor executes a program by fetching the instructions from memory and executing them. Usually the processor execution speed is much faster than the memory access speed. Instruction queue is used to prefetch the next instructions in a separate buffer while the processor is executing the current instruction.With a four stage pipeline
Instruction pipeline
An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput ....
, the rate at which instructions are executed is four times that of sequential execution.
The processor usually has two separate units for fetching the instructions and for executing the instructions.
The implementation of a pipeline architecture is possible only if the bus interface unit and the execution unit are independent. While the execution unit is decoding or executing an instruction which does not require the use of the data
System bus
A system bus is a single computer bus that connects the major components of a computer system. The technique was developed to reduce costs and improve modularity....
and address buses
Address bus
An address bus is a computer bus that is used to specify a physical address. When a processor or DMA-enabled device needs to read or write to a memory location, it specifies that memory location on the address bus...
, the bus interface unit fetches instruction opcodes
Opcode
In computer science engineering, an opcode is the portion of a machine language instruction that specifies the operation to be performed. Their specification and format are laid out in the instruction set architecture of the processor in question...
from the memory.
This process is much faster than sending out an address, reading the opcode and then decoding and executing it. Fetching the next instruction while the current instruction is being decoded or executed is called pipelining.
The 8086
Intel 8086
The 8086 is a 16-bit microprocessor chip designed by Intel between early 1976 and mid-1978, when it was released. The 8086 gave rise to the x86 architecture of Intel's future processors...
architecture has a six-byte prefetch instruction pipeline. As the Execution Unit is executing the current instruction, the bus interface unit reads up to six bytes of opcodes in advance from the memory. The six byte long queue was chosen because the maximum number of bytes required for any instruction in 8086 is this long.
An exception is encountered when the execution unit encounters a branch
Branch (computer science)
A branch is sequence of code in a computer program which is conditionally executed depending on whether the flow of control is altered or not . The term can be used when referring to programs in high level languages as well as program written in machine code or assembly language...
instruction i.e. either a jump or a call instruction. In this case, the entire queue must be dumped and the contents pointed to by the instruction pointer must be fetched from memory.
Drawbacks
Processors implementing the instruction queue prefetch algorithm are rather technically advanced. The design level complexity of the such processors is much higher than for regular processors. This is primarily because of the need to implement two separate units, the BIU and EU, operating separately.As the complexity of these chips increases, the cost also increases. These processors are relatively costlier than their counterparts without the prefetch input queue.
However, these disadvantages are greatly offset by the improvement in processor execution time. After the introduction of prefetch instruction queue in the 8086 processor, all successive processors have incorporated this feature.
x86 example code
code_starts_here:mov eax, ahead
mov [eax], 0x9090
ahead:
jmp near to_the_end
; Some other code
to_the_end:
This self-modifying
Self-modifying code
In computer science, self-modifying code is code that alters its own instructions while it is executing - usually to reduce the instruction path length and improve performance or simply to reduce otherwise repetitively similar code, thus simplifying maintenance...
program will overwrite the jmp to_the_end with two NOP
NOP
In computer science, NOP or NOOP is an assembly language instruction, sequence of programming language statements, or computer protocol command that effectively does nothing at all....
s (which is encoded as 0x9090). The jump jmp near to_the_end is assembled into two bytes of machine code, so the two NOP
NOP
In computer science, NOP or NOOP is an assembly language instruction, sequence of programming language statements, or computer protocol command that effectively does nothing at all....
s will just overwrite this jump and nothing else. (That is, the jump is replaced with a do-nothing-code.)
Because the machine code of the jump is already read into the PIQ, and probably also already executed by the processor (superscalar
Superscalar
A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate...
processors execute several instructions at once, but they "pretend" that they don't because of the need for backward compatibility
Backward compatibility
In the context of telecommunications and computing, a device or technology is said to be backward or downward compatible if it can work with input generated by an older device...
), the change of the code will not have any change of the execution flow.
Example program to detect the size of the PIQ
This is an example NASMNASM (computer program)
The Netwide Assembler is an assembler and disassembler for the Intel x86 architecture. It can be used to write 16-bit, 32-bit and 64-bit programs. NASM is considered to be one of the most popular assemblers for Linux....
-syntax
Syntax
In linguistics, syntax is the study of the principles and rules for constructing phrases and sentences in natural languages....
self-modifying
Self-modifying code
In computer science, self-modifying code is code that alters its own instructions while it is executing - usually to reduce the instruction path length and improve performance or simply to reduce otherwise repetitively similar code, thus simplifying maintenance...
x86-assembly
Assembly language
An assembly language is a low-level programming language for computers, microprocessors, microcontrollers, and other programmable devices. It implements a symbolic representation of the machine codes and other constants needed to program a given CPU architecture...
algorithm that determines the size of the PIQ:
code_starts_here:
xor cx, cx ; zero register cx
xor ax, ax ; zero register ax
mov dx, cs
mov [code_segment], dx ; "calculate" codeseg in the far jump below (edx here too)
around:
cmp ax, 1 ; check if ax has been alterd
je found_size
mov [nop_field+cx], 0x90 ; 0x90 = opcode "nop" (NO oPeration)
inc cx
db 0xEA ; 0xEA = opcode "far jump"
dw flush_queue ; should be followed by offset (rm = "dw", pm = "dd")
code_segment:
dw 0 ; and then the code segment (calculated above)
flush_queue:
mov [nop_field+cx], 0x40 ; 0x40 = opcode "inc ax" (INCrease ax)
nop_field:
nop times 256
jmp around
found_size:
;
; register cx now contains the size of the PIQ
; this code is for real mode
Real mode
Real mode, also called real address mode, is an operating mode of 80286 and later x86-compatible CPUs. Real mode is characterized by a 20 bit segmented memory address space and unlimited direct software access to all memory, I/O addresses and peripheral hardware...
and 16-bit protected mode, but it could easily be changed into
; running for 32-bit protected mode as well. just change the "dw" for
; the offset to "dd". you need also change dx to edx at the top as
; well. (dw and dx = 16 bit addressing, dd and edx = 32 bit addressing)
;
What this code does is basically that it changes the execution flow, and determines by brute force
Brute force
Brute force may refer to any of several problem-solving methods involving the evaluation of multiple possible answer for fitness. The term has also been used as a stage name, book title, etc.In mathematics:...
how large the PIQ is. "How far away do I have to change the code in front of me for it to affect me?"
If it is too near (it is already in the PIQ) the update will not have any effect. If it is far enough, the change of the code will affect the program and the program has then found the size of the processor's PIQ.
If this code is being executed under multitasking OS, the context switch
Context switch
A context switch is the computing process of storing and restoring the state of a CPU so that execution can be resumed from the same point at a later time. This enables multiple processes to share a single CPU. The context switch is an essential feature of a multitasking operating system...
may lead to the wrong value.