Parser Combinator
Encyclopedia
In functional programming
, a parser combinator is a higher-order function
which accepts several parsers as input and returns a new parser as its output. In this context, a parser is a function accepting strings as input and returning some structure as output, typically a parse tree
or a set of indices representing locations in the string where parsing stopped successfully. Parser combinators enable a recursive descent parsing strategy which facilitates modular piecewise construction and testing. This parsing technique is called combinatory-parsing.
Parsers built using combinators are straightforward to construct, ‘readable’, modular, well-structured and easily maintainable. They have been used extensively in the prototyping of compilers and processors for domain-specific languages such as natural language interfaces to databases, where complex and varied semantic actions are closely integrated with syntactic processing. In 1989, Richard Frost and John Launchbury demonstrated use of parser combinators to construct natural language
interpreters. Graham Hutton also used higher-order functions for basic parsing in 1992. S.D. Swierstra also exhibited the practical aspects of parser combinators in 2001. In 2008, Frost, Hafiz and Callaghan described a set of parser combinators in Haskell
that solve the long standing problem of accommodating left-recursion, and work as a complete top-down parsing
tool in polynomial time and space.
, parser combinators can be used to combine basic parsers to construct parsers for more complex rules. For example, a production rule of a context-free grammar
(CFG) may have one or more ‘alternatives’ and each alternative may consist of a sequence of non-terminal(s) and/or terminal(s), or the alternative may consist of a single non-terminal or terminal or the empty string. If a simple parser is available for each of these alternatives, a parser combinator can be used to combine each of these parsers, returning a new parser which can recognise any or all of the alternatives.
A parser combinator can take the form of an infix
operator, used to ‘glue’ different parsers to form a complete rule. Parser combinators thereby enable parsers to be defined in an embedded style, in code which is similar in structure to the rules of the grammar. As such, implementations can be thought of as executable specifications with all of the associated advantages.
Functional programming
In computer science, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast to the imperative programming style, which emphasizes changes in state...
, a parser combinator is a higher-order function
Higher-order function
In mathematics and computer science, higher-order functions, functional forms, or functionals are functions which do at least one of the following:*take one or more functions as an input*output a function...
which accepts several parsers as input and returns a new parser as its output. In this context, a parser is a function accepting strings as input and returning some structure as output, typically a parse tree
Parse tree
A concrete syntax tree or parse tree or parsing treeis an ordered, rooted tree that represents the syntactic structure of a string according to some formal grammar. In a parse tree, the interior nodes are labeled by non-terminals of the grammar, while the leaf nodes are labeled by terminals of the...
or a set of indices representing locations in the string where parsing stopped successfully. Parser combinators enable a recursive descent parsing strategy which facilitates modular piecewise construction and testing. This parsing technique is called combinatory-parsing.
Parsers built using combinators are straightforward to construct, ‘readable’, modular, well-structured and easily maintainable. They have been used extensively in the prototyping of compilers and processors for domain-specific languages such as natural language interfaces to databases, where complex and varied semantic actions are closely integrated with syntactic processing. In 1989, Richard Frost and John Launchbury demonstrated use of parser combinators to construct natural language
Natural language
In the philosophy of language, a natural language is any language which arises in an unpremeditated fashion as the result of the innate facility for language possessed by the human intellect. A natural language is typically used for communication, and may be spoken, signed, or written...
interpreters. Graham Hutton also used higher-order functions for basic parsing in 1992. S.D. Swierstra also exhibited the practical aspects of parser combinators in 2001. In 2008, Frost, Hafiz and Callaghan described a set of parser combinators in Haskell
Haskell (programming language)
Haskell is a standardized, general-purpose purely functional programming language, with non-strict semantics and strong static typing. It is named after logician Haskell Curry. In Haskell, "a function is a first-class citizen" of the programming language. As a functional programming language, the...
that solve the long standing problem of accommodating left-recursion, and work as a complete top-down parsing
Top-down parsing
Top-down parsing is a type of parsing strategy where in one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar...
tool in polynomial time and space.
Basic idea
In functional programmingFunctional programming
In computer science, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast to the imperative programming style, which emphasizes changes in state...
, parser combinators can be used to combine basic parsers to construct parsers for more complex rules. For example, a production rule of a context-free grammar
Context-free grammar
In formal language theory, a context-free grammar is a formal grammar in which every production rule is of the formwhere V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals ....
(CFG) may have one or more ‘alternatives’ and each alternative may consist of a sequence of non-terminal(s) and/or terminal(s), or the alternative may consist of a single non-terminal or terminal or the empty string. If a simple parser is available for each of these alternatives, a parser combinator can be used to combine each of these parsers, returning a new parser which can recognise any or all of the alternatives.
A parser combinator can take the form of an infix
Infix
An infix is an affix inserted inside a word stem . It contrasts with adfix, a rare term for an affix attached to the end of a stem, such as a prefix or suffix.-Indonesian:...
operator, used to ‘glue’ different parsers to form a complete rule. Parser combinators thereby enable parsers to be defined in an embedded style, in code which is similar in structure to the rules of the grammar. As such, implementations can be thought of as executable specifications with all of the associated advantages.
The combinators
To keep the discussion relatively straightforward, we discuss parser combinators in terms of recognizers only. If the input string is of length#input
and its members are accessed through an index j
, a recognizer is a parser which returns, as output, a set of set of indices representing positions at which the parser successfully finished recognizing a sequence of tokens that began at position j
. An empty result set indicates that the recognizer failed to recognize any sequence beginning at index j
. A non-empty result set indicates the recognizer ends at different positions successfully.- The
empty
recognizer recognizes the empty string. This parser always succeeds, returning a singleton set containing the current position:
- A recognizer
term ’x’
recognizes the terminalx
. If the token at positionj
in the input string isx
, this parser returns a singleton set containingj + 1
; otherwise, it returns the empty set.
Note that there may be multiple distinct ways to parse a string while finishing at the same index: this indicates an ambiguous grammarAmbiguous grammarIn computer science, a context-free grammar is said to be an ambiguous grammar if there exists a string that can be generated by the grammar in more than one way...
. Simple recognizers do not acknowledge these ambiguities; each possible finishing index is listed only once in the result set. For a more complete set of results, a more complicated object such as a parse treeParse treeA concrete syntax tree or parse tree or parsing treeis an ordered, rooted tree that represents the syntactic structure of a string according to some formal grammar. In a parse tree, the interior nodes are labeled by non-terminals of the grammar, while the leaf nodes are labeled by terminals of the...
must be returned.
Following the definitions of two basic recognizersp
andq
, we can define two major parser combinators for alternative and sequencing:
- The ‘alternative’ parser combinator,
<+>
, applies both of the recognizers on the same input positionj
and sums up the results returned by both of the recognizers, which is eventually returned as the final result. It is used as an infix operator betweenp
andq
as follows:
- The sequencing of recognizers is done with the
<*>
parser combinator. Like<+>
, it is used as an infix operator betweenp
andq
. But it applies the first recognizerp
to the input positionj
, and if there is any successful result of this application, then the second recognizerq
is applied to every element of the result set returned by the first recognizer.<*>
ultimately returns the union of these applications of q.
Examples
Consider a highly ambiguous context-free grammarContext-free grammarIn formal language theory, a context-free grammar is a formal grammar in which every production rule is of the formwhere V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals ....
,s ::= ‘x’ s s | ε
. Using the combinators defined earlier, we can modularly define executable notations of this grammar in a modern functional language (e.g. HaskellHaskellHaskell may refer to:*Haskell , a standardized pure functional programming language with non-strict semantics* Haskell Indian Nations University, a four year degree granting university in Lawrence, Kansas which offers free tuition to members of registered Native American tribes in the United...
) ass = term ‘x’ <*> s <*> s <+> empty
. When the recognizers
is applied on an input sequencexxxxx
at position1
, according to the above definitions it would return a result set{5,4,3,2}
.
Shortcomings and solutions
Parser combinators, like all recursive descent parsers, are not limited to the context-free grammarContext-free grammarIn formal language theory, a context-free grammar is a formal grammar in which every production rule is of the formwhere V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals ....
s and thus do no global search for ambiguities in the LL(k) parsing Firstk and Followk sets. Thus, ambiguities are not known until run-time if and until the input triggers them. In such cases, the recursive descent parser defaults (perhaps unknown to the grammar designer) to one of the possible ambiguous paths, resulting in semantic confusion (aliasing) in the use of the language. This leads to bugs by users of ambiguous programming languages, which are not reported at compile-time, and which are introduced not by human error, but by ambiguous grammar. The only solution which eliminates these bugs, is to remove the ambiguities and use a context-free grammar.
The simple implementations of parser combinators have some shortcomings, which are common in top-down parsing. Naïve combinatory parsing requires exponential time and space when parsing an ambiguous context free grammar. In 1996, Frost and Szydlowski demonstrated how memoizationMemoizationIn computing, memoization is an optimization technique used primarily to speed up computer programs by having function calls avoid repeating the calculation of results for previously processed inputs...
can be used with parser combinators to reduce the time complexity to polynomial. Later Frost used monads to construct the combinators for systematic and correct threading of memo-table throughout the computation.
Like any top-down recursive descent parsingRecursive descent parserA recursive descent parser is a top-down parser built from a set of mutually-recursive procedures where each such procedure usually implements one of the production rules of the grammar...
, the conventional parser combinators (like the combinators described above) won’t terminate while processing a left-recursive grammarLeft recursionIn computer science, left recursion is a special case of recursion.In terms of context-free grammar, a non-terminal r is left-recursive if the left-most symbol in any of r’s ‘alternatives’ either immediately or through some other non-terminal definitions rewrites to r again.- Definition :"A...
(e.g.s ::= s <*> term ‘x’|empty
). A recognition algorithm that accommodates ambiguous grammars with direct left-recursive rules is described by Frost and Hafiz in 2006. The algorithm curtails the otherwise ever-growing left-recursive parse by imposing depth restrictions. That algorithm was extended to a complete parsing algorithm to accommodate indirect as well as direct left-recursion in polynomialPolynomialIn mathematics, a polynomial is an expression of finite length constructed from variables and constants, using only the operations of addition, subtraction, multiplication, and non-negative integer exponents...
time, and to generate compact polynomial-size representations of the potentially-exponential number of parse trees for highly-ambiguous grammars by Frost, Hafiz and Callaghan in 2007. This extended algorithm accommodates indirect left-recursion by comparing its ‘computed-context’ with ‘current-context’. The same authors also described their implementation of a set of parser combinators written in the Haskell programming language based on the same algorithm. The X-SAIGA site has more about the algorithms and implementation details.
External links
- X-SAIGA - eXecutable SpecificAtIons of GrAmmars
- The sequencing of recognizers is done with the
- The ‘alternative’ parser combinator,