Lex programming tool
Encyclopedia
Lex is a computer program
that generates lexical analyzer
s ("scanners" or "lexers"). Lex is commonly used with the yacc
parser generator. Lex, originally written by Mike Lesk
and Eric Schmidt, is the standard lexical analyzer generator on many Unix
systems, and a tool exhibiting its behavior is specified as part of the POSIX
standard.
Lex reads an input stream specifying the lexical analyzer and outputs source code
implementing the lexer in the C programming language
.
Though traditionally proprietary software, versions of Lex based on the original AT&T code are available as open source
, as part of systems such as OpenSolaris
and Plan 9 from Bell Labs
. Another popular open source
version of Lex is flex
, the "fast lexical analyzer".
Definition section
%%
Rules section
%%
C code section
version of lex. It recognizes strings of numbers (integers) in the input, and simply prints them out.
If this input is given to flex, it will be converted into a C file, lex.yy.c. This can be compiled into an executable which matches and outputs strings of integers. For example, given the input:
abc123z.!&*2gj6
the program will print:
Saw an integer: 123
Saw an integer: 2
Saw an integer: 6
or Bison
, are commonly used together. Parser generators use a formal grammar
to parse an input stream, something which Lex cannot do using simple regular expression
s (Lex is limited to simple finite state automata
).
It is typically preferable to have a (Yacc-generated, say) parser be fed a token-stream as input, rather than having it consume the input character-stream directly. Lex is often used to produce such a token-stream.
Scannerless parsing refers to where a parser consumes the input character-stream directly, without a distinct lexer.
Computer program
A computer program is a sequence of instructions written to perform a specified task with a computer. A computer requires programs to function, typically executing the program's instructions in a central processor. The program has an executable form that the computer can use directly to execute...
that generates lexical analyzer
Lexical analysis
In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function which performs lexical analysis is called a lexical analyzer, lexer or scanner...
s ("scanners" or "lexers"). Lex is commonly used with the yacc
Yacc
The computer program yacc is a parser generator developed by Stephen C. Johnson at AT&T for the Unix operating system. The name is an acronym for "Yet Another Compiler Compiler." It generates a parser based on an analytic grammar written in a notation similar to BNF.Yacc used to be available as...
parser generator. Lex, originally written by Mike Lesk
Mike Lesk
Michael E. Lesk is a computer programmer.In the 1960s, Michael Lesk worked for the SMART Information Retrieval System project, wrote much of its retrieval code and did many of the retrieval experiments, as well as obtaining a PhD in Chemical Physics....
and Eric Schmidt, is the standard lexical analyzer generator on many Unix
Unix
Unix is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna...
systems, and a tool exhibiting its behavior is specified as part of the POSIX
POSIX
POSIX , an acronym for "Portable Operating System Interface", is a family of standards specified by the IEEE for maintaining compatibility between operating systems...
standard.
Lex reads an input stream specifying the lexical analyzer and outputs source code
Source code
In computer science, source code is text written using the format and syntax of the programming language that it is being written in. Such a language is specially designed to facilitate the work of computer programmers, who specify the actions to be performed by a computer mostly by writing source...
implementing the lexer in the C programming language
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
.
Though traditionally proprietary software, versions of Lex based on the original AT&T code are available as open source
Open source
The term open source describes practices in production and development that promote access to the end product's source materials. Some consider open source a philosophy, others consider it a pragmatic methodology...
, as part of systems such as OpenSolaris
OpenSolaris
OpenSolaris was an open source computer operating system based on Solaris created by Sun Microsystems. It was also the name of the project initiated by Sun to build a developer and user community around the software...
and Plan 9 from Bell Labs
Plan 9 from Bell Labs
Plan 9 from Bell Labs is a distributed operating system. It was developed primarily for research purposes as the successor to Unix by the Computing Sciences Research Center at Bell Labs between the mid-1980s and 2002...
. Another popular open source
Open source
The term open source describes practices in production and development that promote access to the end product's source materials. Some consider open source a philosophy, others consider it a pragmatic methodology...
version of Lex is flex
Flex lexical analyser
flex is a free software alternative to lex. It is frequently used with the free Bison parser generator. Unlike Bison, flex is not part of the GNU Project. Flex was written in C by Vern Paxson around 1987...
, the "fast lexical analyzer".
Structure of a lex file
The structure of a lex file is intentionally similar to that of a yacc file; files are divided up into three sections, separated by lines that contain only two percent signs, as follows:Definition section
%%
Rules section
%%
C code section
- The definition section is the place to define macros and to import header fileHeader fileSome programming languages use header files. These files allow programmers to separate certain elements of a program's source code into reusable files. Header files commonly contain forward declarations of classes, subroutines, variables, and other identifiers...
s written in CC (programming language)C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
. It is also possible to write any C code here, which will be copied verbatim into the generated source file. - The rules section is the most important section; it associates patterns with C statementStatement (programming)In computer programming a statement can be thought of as the smallest standalone element of an imperative programming language. A program written in such a language is formed by a sequence of one or more statements. A statement will have internal components .Many languages In computer programming...
s. Patterns are simply regular expressionRegular expressionIn computing, a regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters. Abbreviations for "regular expression" include "regex" and "regexp"...
s. When the lexer sees some text in the input matching a given pattern, it executes the associated C code. This is the basis of how lex operates. - The C code section contains C statements and functions that are copied verbatim to the generated source file. These statements presumably contain code called by the rules in the rules section. In large programs it is more convenient to place this code in a separate file and link it in at compileCompilerA compiler is a computer program that transforms source code written in a programming language into another computer language...
time.
Example of a lex file
The following is an example lex file for the flexFlex lexical analyser
flex is a free software alternative to lex. It is frequently used with the free Bison parser generator. Unlike Bison, flex is not part of the GNU Project. Flex was written in C by Vern Paxson around 1987...
version of lex. It recognizes strings of numbers (integers) in the input, and simply prints them out.
/*** Definition section ***/
%{
/* C code to be copied verbatim */
- include
%}
/* This tells flex to read only one input file */
%option noyywrap
%%
/*** Rules section ***/
/* [0-9]+ matches a string of one or more digits */
[0-9]+ {
/* yytext is a string containing the matched text. */
printf("Saw an integer: %s\n", yytext);
}
.|\n { /* Ignore all other characters. */ }
%%
/*** C Code section ***/
int main(void)
{
/* Call the lexer, then quit. */
yylex;
return 0;
}
If this input is given to flex, it will be converted into a C file, lex.yy.c. This can be compiled into an executable which matches and outputs strings of integers. For example, given the input:
abc123z.!&*2gj6
the program will print:
Saw an integer: 123
Saw an integer: 2
Saw an integer: 6
Using Lex with parser generators
Lex and parser generators, such as YaccYacc
The computer program yacc is a parser generator developed by Stephen C. Johnson at AT&T for the Unix operating system. The name is an acronym for "Yet Another Compiler Compiler." It generates a parser based on an analytic grammar written in a notation similar to BNF.Yacc used to be available as...
or Bison
GNU bison
GNU bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification of a context-free language, warns about any parsing ambiguities, and generates a parser which reads sequences of tokens and decides whether the sequence conforms to the syntax...
, are commonly used together. Parser generators use a formal grammar
Formal grammar
A formal grammar is a set of formation rules for strings in a formal language. The rules describe how to form strings from the language's alphabet that are valid according to the language's syntax...
to parse an input stream, something which Lex cannot do using simple regular expression
Regular expression
In computing, a regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters. Abbreviations for "regular expression" include "regex" and "regexp"...
s (Lex is limited to simple finite state automata
Finite state machine
A finite-state machine or finite-state automaton , or simply a state machine, is a mathematical model used to design computer programs and digital logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states...
).
It is typically preferable to have a (Yacc-generated, say) parser be fed a token-stream as input, rather than having it consume the input character-stream directly. Lex is often used to produce such a token-stream.
Scannerless parsing refers to where a parser consumes the input character-stream directly, without a distinct lexer.
Lex and make
make is a utility that can be used to maintain programs involving lex. Make assumes that a file that has an extension of.l
is a lex source file. The make internal macro LFLAGS
can be used to specify lex options to be invoked automatically by make.See also
- Flex lexical analyserFlex lexical analyserflex is a free software alternative to lex. It is frequently used with the free Bison parser generator. Unlike Bison, flex is not part of the GNU Project. Flex was written in C by Vern Paxson around 1987...
- YaccYaccThe computer program yacc is a parser generator developed by Stephen C. Johnson at AT&T for the Unix operating system. The name is an acronym for "Yet Another Compiler Compiler." It generates a parser based on an analytic grammar written in a notation similar to BNF.Yacc used to be available as...
- RagelRagelRagel is a finite-state machine compiler with output support for C, C++, C#, Objective-C, D, Java, Go and Ruby source code. It supports the generation of table or control flow driven state machines from regular expressions and/or state charts and can also build lexical analysers via the...
- QuexQuexQuex is a lexical analyzer generator that creates C and C++ lexical analyzers. Significant features include the ability to generate lexical analyzers that operate on Unicode input, the creation of direct coded lexical analyzers and the use of inheritance relationships in lexical analysis modes.-...
- Comparison of parser generatorsComparison of parser generatorsThis is a list of notable lexer generators and parser generators for various language classes.- Regular languages :- Deterministic context-free languages :-Parsing expression grammars, deterministic boolean grammars:...