Code generation
Encyclopedia
In computer science
, code generation is the process by which a compiler
's code generator converts some intermediate representation of source code
into a form (e.g., machine code
) that can be readily executed by a machine (often a computer
).
Sophisticated compilers typically perform multiple passes over various intermediate forms. This multi-stage process is used because many algorithms for code optimization are easier to apply one at a time, or because the input to one optimization relies on the processing performed by another optimization. This organization also facilitates the creation of a single compiler that can target multiple architectures, as only the last of the code generation stages (the backend) needs to change from target to target. (For more information on compiler design, see Compiler
.)
The input to the code generator typically consists of a parse tree
or an abstract syntax tree
. The tree is converted into a linear sequence of instructions, usually in an intermediate language
such as three address code
. Further stages of compilation may or may not be referred to as "code generation", depending on whether they involve a significant change in the representation of the program. (For example, a peephole optimization
pass would not likely be called "code generation", although a code generator might incorporate a peephole optimization pass.)
, and avoid redundant computations
.
Tasks which are typically part of a sophisticated compiler's "code generation" phase include:
Instruction selection is typically carried out by doing a recursive
postorder traversal on the abstract syntax tree, matching particular tree configurations against templates; for example, the tree
In a compiler that uses an intermediate language, there may be two instruction selection stages — one to convert the parse tree into intermediate code, and a second phase much later to convert the intermediate code into instructions from the instruction set
of the target machine. This second phase does not require a tree traversal; it can be done linearly, and typically involves a simple replacement of intermediate-language operations with their corresponding opcode
s. However, if the compiler is actually a language translator
(for example, one that converts Eiffel
to C
), then the second code-generation phase may involve building a tree from the linear intermediate code.
(JIT), it is important that the entire process be efficient
with respect to space and time. For example, when regular expression
s are interpreted and used to generate code at runtime, a non-determistic finite state machine
is often generated instead of a deterministic one, because usually the former can be created more quickly and occupies less memory space than the latter. Despite its generally generating less efficient code, JIT code generation can take advantage of profiling information that is available only at runtime.
operations of formal language theory. Consequently, some techniques that were originally developed for use in compilers have come to be employed in other ways as well. For example, YACC
(Yet Another Compiler Compiler) takes input in Backus-Naur form and converts it to a parser in C
. Though it was originally created for automatic generation of a parser for a compiler, yacc is also often used to automate writing code that needs to be modified each time specifications are changed. (For example, see http://www.artima.com/weblogs/viewpost.jsp?thread=152273.)
Many integrated development environment
s (IDEs) support some form of automatic source code generation, often using algorithms in common with compiler code generators, although commonly less complicated. (See also: Program transformation
, Data transformation
.)
s) to produce code. In other words, the former adds information while the latter loses some of the information. One consequence of this information loss is that reflection
becomes difficult or even impossible. To counter this problem, code generators often embed syntactic and semantic information in addition to the code necessary for execution.
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...
, code generation is the process by which a compiler
Compiler
A compiler is a computer program that transforms source code written in a programming language into another computer language...
's code generator converts some intermediate representation of source code
Source code
In computer science, source code is text written using the format and syntax of the programming language that it is being written in. Such a language is specially designed to facilitate the work of computer programmers, who specify the actions to be performed by a computer mostly by writing source...
into a form (e.g., machine code
Machine code
Machine code or machine language is a system of impartible instructions executed directly by a computer's central processing unit. Each instruction performs a very specific task, typically either an operation on a unit of data Machine code or machine language is a system of impartible instructions...
) that can be readily executed by a machine (often a computer
Computer
A computer is a programmable machine designed to sequentially and automatically carry out a sequence of arithmetic or logical operations. The particular sequence of operations can be changed readily, allowing the computer to solve more than one kind of problem...
).
Sophisticated compilers typically perform multiple passes over various intermediate forms. This multi-stage process is used because many algorithms for code optimization are easier to apply one at a time, or because the input to one optimization relies on the processing performed by another optimization. This organization also facilitates the creation of a single compiler that can target multiple architectures, as only the last of the code generation stages (the backend) needs to change from target to target. (For more information on compiler design, see Compiler
Compiler
A compiler is a computer program that transforms source code written in a programming language into another computer language...
.)
The input to the code generator typically consists of a parse tree
Parse tree
A concrete syntax tree or parse tree or parsing treeis an ordered, rooted tree that represents the syntactic structure of a string according to some formal grammar. In a parse tree, the interior nodes are labeled by non-terminals of the grammar, while the leaf nodes are labeled by terminals of the...
or an abstract syntax tree
Abstract syntax tree
In computer science, an abstract syntax tree , or just syntax tree, is a tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code. The syntax is 'abstract' in the sense that it...
. The tree is converted into a linear sequence of instructions, usually in an intermediate language
Intermediate language
In computer science, an intermediate language is the language of an abstract machine designed to aid in the analysis of computer programs. The term comes from their use in compilers, where a compiler first translates the source code of a program into a form more suitable for code-improving...
such as three address code
Three address code
In computer science, three-address code is a form of representing intermediate code used by compilers to aid in the implementation of code-improving transformations...
. Further stages of compilation may or may not be referred to as "code generation", depending on whether they involve a significant change in the representation of the program. (For example, a peephole optimization
Peephole optimization
In compiler theory, peephole optimization is a kind of optimization performed over a very small set of instructions in a segment of generated code. The set is called a "peephole" or a "window"...
pass would not likely be called "code generation", although a code generator might incorporate a peephole optimization pass.)
Major tasks in code generation
In addition to the basic conversion from an intermediate representation into a linear sequence of machine instructions, a typical code generator tries to optimize the generated code in some way. The generator may try to use faster instructions, use fewer instructions, exploit available registersProcessor register
In computer architecture, a processor register is a small amount of storage available as part of a CPU or other digital processor. Such registers are addressed by mechanisms other than main memory and can be accessed more quickly...
, and avoid redundant computations
Partial redundancy elimination
In compiler theory, partial redundancy elimination is a compiler optimization that eliminates expressions that are redundant on some but not necessarily all paths through a program...
.
Tasks which are typically part of a sophisticated compiler's "code generation" phase include:
- Instruction selectionInstruction selectionIn computer science, instruction selection is the stage of a compiler backend that transforms its tree-based middle-level intermediate representation into a low-level IR very close to its final target language...
: which instructions to use. - Instruction schedulingInstruction schedulingIn computer science, instruction scheduling is a compiler optimization used to improve instruction-level parallelism, which improves performance on machines with instruction pipelines...
: in which order to put those instructions. Scheduling is a speed optimization that can have a critical effect on pipelineInstruction pipelineAn instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput ....
d machines. - Register allocationRegister allocationIn compiler optimization, register allocation is the process of assigning a large number of target program variables onto a small number of CPU registers...
: the allocation of variablesVariable (programming)In computer programming, a variable is a symbolic name given to some known or unknown quantity or information, for the purpose of allowing the name to be used independently of the information it represents...
to processor registerProcessor registerIn computer architecture, a processor register is a small amount of storage available as part of a CPU or other digital processor. Such registers are addressed by mechanisms other than main memory and can be accessed more quickly...
s. - Debug dataDebugging data formatA debugging data format is a means of storing information about a compiled computer program for use by high-level debuggers. Modern debugging data formats store enough information to allow source-level debugging....
generation if required so the code can be debugged.
Instruction selection is typically carried out by doing a recursive
Recursion
Recursion is the process of repeating items in a self-similar way. For instance, when the surfaces of two mirrors are exactly parallel with each other the nested images that occur are a form of infinite recursion. The term has a variety of meanings specific to a variety of disciplines ranging from...
postorder traversal on the abstract syntax tree, matching particular tree configurations against templates; for example, the tree
W := ADD(X,MUL(Y,Z))
might be transformed into a linear sequence of instructions by recursively generating the sequences for t1 := X
and t2 := MUL(Y,Z)
, and then emitting the instruction ADD W, t1, t2
.In a compiler that uses an intermediate language, there may be two instruction selection stages — one to convert the parse tree into intermediate code, and a second phase much later to convert the intermediate code into instructions from the instruction set
Instruction set
An instruction set, or instruction set architecture , is the part of the computer architecture related to programming, including the native data types, instructions, registers, addressing modes, memory architecture, interrupt and exception handling, and external I/O...
of the target machine. This second phase does not require a tree traversal; it can be done linearly, and typically involves a simple replacement of intermediate-language operations with their corresponding opcode
Opcode
In computer science engineering, an opcode is the portion of a machine language instruction that specifies the operation to be performed. Their specification and format are laid out in the instruction set architecture of the processor in question...
s. However, if the compiler is actually a language translator
Transcompiler
A transcompiler is a special compiler that translates the source code of a programming language into the source code of another programming language, e.g...
(for example, one that converts Eiffel
Eiffel (programming language)
Eiffel is an ISO-standardized, object-oriented programming language designed by Bertrand Meyer and Eiffel Software. The design of the language is closely connected with the Eiffel programming method...
to C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
), then the second code-generation phase may involve building a tree from the linear intermediate code.
Runtime code generation
When code generation occurs at runtime, as in just-in-time compilationJust-in-time compilation
In computing, just-in-time compilation , also known as dynamic translation, is a method to improve the runtime performance of computer programs. Historically, computer programs had two modes of runtime operation, either interpreted or static compilation...
(JIT), it is important that the entire process be efficient
Algorithmic efficiency
In computer science, efficiency is used to describe properties of an algorithm relating to how much of various types of resources it consumes. Algorithmic efficiency can be thought of as analogous to engineering productivity for a repeating or continuous process, where the goal is to reduce...
with respect to space and time. For example, when regular expression
Regular expression
In computing, a regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters. Abbreviations for "regular expression" include "regex" and "regexp"...
s are interpreted and used to generate code at runtime, a non-determistic finite state machine
Finite state machine
A finite-state machine or finite-state automaton , or simply a state machine, is a mathematical model used to design computer programs and digital logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states...
is often generated instead of a deterministic one, because usually the former can be created more quickly and occupies less memory space than the latter. Despite its generally generating less efficient code, JIT code generation can take advantage of profiling information that is available only at runtime.
Related concepts
The fundamental task of taking input in one language and producing output in a non-trivially different language can be understood in terms of the core transformationalTransformational grammar
In linguistics, a transformational grammar or transformational-generative grammar is a generative grammar, especially of a natural language, that has been developed in the Chomskyan tradition of phrase structure grammars...
operations of formal language theory. Consequently, some techniques that were originally developed for use in compilers have come to be employed in other ways as well. For example, YACC
Yacc
The computer program yacc is a parser generator developed by Stephen C. Johnson at AT&T for the Unix operating system. The name is an acronym for "Yet Another Compiler Compiler." It generates a parser based on an analytic grammar written in a notation similar to BNF.Yacc used to be available as...
(Yet Another Compiler Compiler) takes input in Backus-Naur form and converts it to a parser in C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
. Though it was originally created for automatic generation of a parser for a compiler, yacc is also often used to automate writing code that needs to be modified each time specifications are changed. (For example, see http://www.artima.com/weblogs/viewpost.jsp?thread=152273.)
Many integrated development environment
Integrated development environment
An integrated development environment is a software application that provides comprehensive facilities to computer programmers for software development...
s (IDEs) support some form of automatic source code generation, often using algorithms in common with compiler code generators, although commonly less complicated. (See also: Program transformation
Program transformation
A program transformation is any operation that takes a computer program and generates another program. In many cases the transformed program is required to be semantically equivalent to the original, relative to a particular formal semantics and in fewer cases the transformations result in programs...
, Data transformation
Data transformation
In metadata and data warehouse, a data transformation converts data from a source data format into destination data.Data transformation can be divided into two steps:...
.)
Reflection
In general, a syntax and semantic analyzer tries to retrieve the structure of the program from the source code, while a code generator uses this structural information (e.g., data typeData type
In computer programming, a data type is a classification identifying one of various types of data, such as floating-point, integer, or Boolean, that determines the possible values for that type; the operations that can be done on values of that type; the meaning of the data; and the way values of...
s) to produce code. In other words, the former adds information while the latter loses some of the information. One consequence of this information loss is that reflection
Reflection (computer science)
In computer science, reflection is the process by which a computer program can observe and modify its own structure and behavior at runtime....
becomes difficult or even impossible. To counter this problem, code generators often embed syntactic and semantic information in addition to the code necessary for execution.