Quex
Quex is a lexical analyzer generator that creates C and C++ lexical analyzers. Significant features include the ability to generate lexical analyzers that operate on Unicode input, the creation of direct coded (non-table based) lexical analyzers, and the use of inheritance relationships in lexical analysis modes.
Direct coded lexical analyzers
Quex uses the traditional steps of Thompson construction to create nondeterministic finite-state machines from regular expressions, conversion to a deterministic finite-state machine, and then Hopcroft optimization to reduce the number of states to a minimum. These mechanisms, though, have been adapted to deal with character sets rather than single characters, which reduces the calculation time significantly. Since the Unicode character set consists of many more code points than plain ASCII, these optimizations are necessary in order to produce lexical analyzers in a reasonable amount of time.
Instead of constructing a table-based lexical analyzer, in which the transition information is stored in a data structure, Quex generates C/C++ code to perform the transitions. Direct coding creates lexical analyzers that structurally resemble typical hand-written lexical analyzers more closely than table-based lexers do. Direct coded lexers also tend to perform better than analogous table-based lexical analyzers.
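The difference can be illustrated with the single pattern [a-z]+ (a minimal hand-made sketch of the two strategies, not Quex's actual output):

    #include <stdio.h>

    enum { ERR = -1 };

    /* Table-based: transitions live in a data structure that a generic
       interpreter loop consults for every input character. */
    static int table[2][256];

    static void init_table(void) {
        for (int s = 0; s < 2; ++s)
            for (int c = 0; c < 256; ++c)
                table[s][c] = ERR;
        for (int c = 'a'; c <= 'z'; ++c) {
            table[0][c] = 1;               /* start state -> accepting state */
            table[1][c] = 1;               /* accepting state loops on itself */
        }
    }

    static int match_table(const char* p) {
        int state = 0, n = 0;
        while (table[state][(unsigned char)p[n]] != ERR)
            state = table[state][(unsigned char)p[n++]];
        return state == 1 ? n : ERR;       /* length of the match, or ERR */
    }

    /* Direct coded: each state becomes a labeled block of code, and
       transitions are plain comparisons and gotos, much as in a
       hand-written lexer. */
    static int match_direct(const char* p) {
        int n = 0;
    STATE_0:
        if (p[n] >= 'a' && p[n] <= 'z') { ++n; goto STATE_1; }
        return ERR;
    STATE_1:
        if (p[n] >= 'a' && p[n] <= 'z') { ++n; goto STATE_1; }
        return n;                          /* accepting state reached */
    }

    int main(void) {
        init_table();
        printf("%d %d\n", match_table("hello!"), match_direct("hello!")); /* 5 5 */
        return 0;
    }

The direct coded variant replaces the memory lookups of the interpreter loop with plain branches that the compiler can optimize, which is one source of its performance advantage.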
Unicode input alphabets
Quex can handle input alphabets that contain the full Unicode code point range (0 to 10FFFFh). This is augmented by the ability to specify regular expressions that contain Unicode properties as expressions. For example, Unicode code points with the binary property XID_Start can be specified with the expression \P{XID_Start} or \P{XIDS}.
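Such property expressions combine with ordinary regular expression operators; for instance, a Unicode-aware identifier pattern might be defined as follows (a sketch: the pattern name is arbitrary, and XID_Continue is the companion Unicode continuation property):

    define {
        // One XID_Start code point followed by any number of
        // XID_Continue code points.
        IDENTIFIER   \P{XID_Start}\P{XID_Continue}*
    }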
Quex can also generate code to call iconv or ICU to perform character conversion. Quex relies directly on the databases as they are delivered by the Unicode Consortium; updating to a new release of the standard consists only of copying the corresponding database files into Quex's database directory.
Lexical analysis modes
Like traditional lexical analyzers (e.g. Lex and Flex), Quex supports multiple lexical analysis modes in a lexer. In addition to pattern actions, Quex modes can specify event actions: code to be executed on events such as entering or exiting a mode, or when any match is found. Quex modes can also be related by inheritance, which allows modes to share common pattern and event actions.
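In Quex source this might look as follows (a sketch: the mode and token names are invented, and the GOTO brief action and the on_entry/on_exit event handlers follow Quex's documented syntax, whose details may vary between versions):

    mode BASE {
        on_entry { /* executed whenever the mode is entered */ }
        on_exit  { /* executed whenever the mode is left    */ }

        <skip: [ \t\n]>                  // whitespace skipper, shared via inheritance
    }

    // PROGRAM and STRING inherit BASE's skipper and event actions.
    mode PROGRAM : BASE {
        "\""       => GOTO(STRING);      // switch mode on an opening quote
        [a-z]+     => QUEX_TKN_WORD(Lexeme);
    }

    mode STRING : BASE {
        "\""       => GOTO(PROGRAM);     // switch back on the closing quote
        [^"]+      => QUEX_TKN_TEXT(Lexeme);
    }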
Sophisticated buffer handling
Quex provides sophisticated mechanisms for buffer handling and reload that are at the same time efficient and flexible. It provides interfaces that allow users to plug in virtually any character set converter. The converters are activated only on demand, that is, when a new buffer filling is required. By default Quex can plug in the iconv library; by means of this backbone Quex is able to analyze a huge set of character encodings.
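From the application's point of view this might look as follows (a hypothetical driver: the class name tiny_lexer stands for whatever analyzer the user generated, and the constructor and receive/type_id calls are assumed from Quex's documentation and may differ between versions):

    // Hypothetical driver for a Quex-generated analyzer.
    #include "tiny_lexer"                 // header generated by Quex

    int main() {
        // The input file is ISO-8859-1 encoded; the encoding name is
        // handed to the converter (iconv), which is invoked only when
        // the buffer has to be refilled.
        quex::tiny_lexer qlex("example.txt", "ISO-8859-1");

        quex::Token* token = 0;
        do {
            qlex.receive(&token);         // fills the buffer on demand
            // ... process the token here ...
        } while (token->type_id() != QUEX_TKN_TERMINATION);
        return 0;
    }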
Example
Quex follows the syntax of the classical tools lex and flex for the description of regular expressions. The example in the article on Flex can be translated into Quex source code roughly as follows (a sketch, assuming the flex example that recognizes integers in its input; the mode name and token identifiers are illustrative):
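    define {
        digit   [0-9]
    }

    mode ONE_AND_ONLY
    {
        <skip: [ \t\n]>                   // skipper: pass over whitespace quickly

        {digit}+    => QUEX_TKN_INTEGER(Lexeme);
        .           => QUEX_TKN_UNKNOWN(Lexeme);
    }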
The brief token senders via the "=>" operator set the token ID of the token object to the token ID that follows the operator. The arguments inside the brackets are used to set the contents of the token object. Note that skipping whitespace can be achieved via skippers, which are optimized to pass specific character sets quickly (see the <skip> tag above). For more sophisticated token actions, C-code sections can be provided, such as the following sketch (the statistics counters and the self_send1 helper are illustrative),
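    header {
        #include <cstdlib>                // atoi()
        static int even_n = 0;            // illustrative statistics counters
        static int odd_n  = 0;
    }

    mode ONE_AND_ONLY
    {
        {digit}+   {
            // A full C-code section instead of a brief token sender:
            // classify the integer before sending the token.
            if( atoi((const char*)Lexeme) % 2 == 0 ) ++even_n;
            else                                     ++odd_n;
            self_send1(QUEX_TKN_INTEGER, Lexeme);
        }
    }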
which might be used to gather statistics about the numbers that occur in the analyzed code.
See also
- Lexical analysis
- Descriptions of Thompson construction and Hopcroft optimization can be found in most textbooks on compiler construction, such as Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman.