Perl 6 rules
Perl 6 rules are the regular expression, pattern matching and general-purpose parsing facility of Perl 6, and are a core part of the language. Since Perl's pattern-matching constructs have exceeded the capabilities of formal regular expressions for some time, Perl 6 documentation refers to them exclusively as regexes, distancing the term from the formal definition.
Perl 6 provides a superset of Perl 5 features with respect to regexes, folding them into a larger framework called rules, which provide the capabilities of a parsing expression grammar, as well as acting as a closure with respect to their lexical scope. Rules are introduced with the rule keyword, which has a usage quite similar to subroutine definition. Anonymous rules can be introduced with the regex (or rx) keyword, or simply be used inline as regexps were in Perl 5 via the m (matching) or s (substitution) operators.
History
In Apocalypse 5, a document outlining the preliminary design decisions for Perl 6 pattern matching, Larry Wall enumerated 20 problems with "current regex culture". Among these were that Perl's regexes were "too compact and 'cute'", had "too much reliance on too few metacharacters", "little support for named captures", "little support for grammars", and "poor integration with [the] 'real' language".
Between late 2004 and mid-2005, a compiler for Perl 6 style rules was developed for the Parrot virtual machine. Originally called the Parrot Grammar Engine (PGE), it was later renamed to the more generic Parser Grammar Engine. PGE is a combination of runtime and compiler for Perl 6 style grammars that allows any Parrot-based compiler to use these tools for parsing, and also to provide rules to their runtimes.
Among other Perl 6 features, support for named captures was added to Perl 5.10 in 2007.
Changes from Perl 5
There are only six unchanged features from Perl 5's regexes:
- Literals: word characters (letters, numbers and underscore) matched literally
- Capturing: (...)
- Alternatives: |
- Backslash escape: \
- Repetition quantifiers: *, +, and ?, but not {m,n}
- Minimal matching suffix: *?, +?, ??
A few of the most powerful additions include:
- The ability to reference rules using <rulename> to build up entire grammars.
- A handful of commit operators that allow the programmer to control backtracking during matching.
The following changes greatly improve the readability of regexes:
- Simplified non-capturing groups: [...], which are the same as Perl 5's (?:...)
- Simplified code assertions: <?{...}>
- Extended regex formatting (Perl 5's /x) is now the default.
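The bracket and whitespace changes listed above can be illustrated with a minimal sketch. It assumes a current Rakudo compiler (Raku, formerly Perl 6); the sample string and the variable name are illustrative only. In this dialect, literal punctuation such as the hyphen has to be quoted or escaped.

```raku
# Whitespace inside the regex is ignored by default, as with Perl 5's /x.
# [ ... ] groups without capturing; ( ... ) captures, numbered from $0.
my $m = 'abcabc-42' ~~ / [ abc ]+ '-' (\d+) /;

say ~$m;            # the whole match: abcabc-42
say ~$m[0];         # the only capture: 42
say $m.list.elems;  # number of positional captures: 1
```

Because the repeated group uses square brackets, the digits are the first and only positional capture.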
Implicit changes
Some of the features of Perl 5 regular expressions become more powerful in Perl 6 because of their ability to encapsulate the expanded features of Perl 6 rules. For example, in Perl 5, there were positive and negative lookahead operators (?=...) and (?!...). In Perl 6 these same features exist, but are called <before> and <!before>.
However, because before can encapsulate arbitrary rules, it can be used to express lookahead as a syntactic predicate for a grammar. For example, the following parsing expression grammar describes the classic non-context-free language { a^n b^n c^n : n ≥ 1 }:
S ← &(A !b) a+ B
A ← a A? b
B ← b B? c
In Perl 6 rules that would be:
rule S { <before <A> <!before b>> a+ <B> }
rule A { a <A>? b }
rule B { b <B>? c }
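The parsing expression grammar above can be exercised as a grammar in a current Rakudo compiler. The following is a sketch under stated assumptions rather than the article's original code: the grammar name AnBnCn is hypothetical, token is used instead of rule so that whitespace in the pattern is not significant, the lookahead is spelled <?before ...>, and a space is left before the closing angle bracket so that it cannot be read as the >> word-boundary marker.

```raku
grammar AnBnCn {
    # The lookahead checks for a^n b^n not followed by another b,
    # without consuming any input.
    token TOP { <?before <A> <!before b> > a+ <B> }
    token A   { a <A>? b }
    token B   { b <B>? c }
}

# Report, for each sample string, whether the whole string parses.
for <abc aabbcc aabbc aabbbccc> -> $s {
    say "$s: ", so AnBnCn.parse($s);
}
```

Only the first two sample strings have equal numbers of a, b and c, so only they are accepted.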
Of course, given the ability to mix rules and regular code, that can be simplified even further:
rule S { (a+) (b+) (c+) <{$0.elems == $1.elems == $2.elems}> }
However, this makes use of assertions, a concept that differs only subtly within Perl 6 rules but more substantially in parsing theory, making this a semantic rather than a syntactic predicate. The most important difference in practice is performance. There is no way for the rule engine to know what conditions the assertion may match, so no optimization of this process can be made.
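A semantic predicate of this kind can be sketched as follows, assuming a current Rakudo compiler, where the boolean code assertion is written <?{ ... }> and .chars gives the length of a captured substring. The helper name is-anbncn is hypothetical.

```raku
# Capture each run of letters, then compare the run lengths in a code assertion.
sub is-anbncn(Str $s --> Bool) {
    so $s ~~ / ^ (a+) (b+) (c+) $ <?{ $0.chars == $1.chars == $2.chars }> /;
}

say is-anbncn('aabbcc');   # True
say is-anbncn('aabbc');    # False
```

The regex engine cannot see inside the assertion, so it can only try candidate matches and run the code on each one, which is the performance cost described above.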
Integration with Perl
In many languages, regular expressions are entered as strings, which are then passed to library routines that parse and compile them into an internal state. In Perl 5, regular expressions shared some of the lexical analysis with Perl's scanner. This simplified many aspects of regular expression usage, though it added a great deal of complexity to the scanner. In Perl 6, rules are part of the grammar of the language. No separate parser exists for rules, as it did in Perl 5. This means that code embedded in rules is parsed at the same time as the rule itself and its surrounding code. For example, it is possible to nest rules and code without re-invoking the parser:
rule ab {
    (a.) # match "a" followed by any character
    # Then check to see if that character was "b"
    # If so, print a message.
    { $0 ~~ /b {say "found the b"}/ }
}
The above is a single block of Perl 6 code that contains an outer rule definition, an inner block of assertion code, and inside of that a regex that contains one more level of assertion.
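Because embedded code is ordinary code in the surrounding lexical scope, it can also read and modify variables declared outside the regex. A minimal sketch, assuming a current Rakudo compiler; the array name @seen and the sample strings are illustrative only.

```raku
my @seen;   # lexical variable, visible inside the regex's code block

for <xab xac> -> $s {
    # The block runs once the capture has matched; $0 is that capture.
    $s ~~ / (a.) { @seen.push(~$0) } /;
}

say @seen;   # [ab ac]
```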
Keywords
There are several keywords used in conjunction with Perl 6 rules:
- regex: A named or anonymous regex that ignores whitespace within the regex by default.
- token: A named or anonymous regex that implies the :ratchet modifier.
- rule: A named or anonymous regex that implies the :ratchet and :sigspace modifiers.
- rx: An anonymous regex that takes arbitrary delimiters such as // where regex only takes braces.
- m: An operator form of anonymous regex that performs matches with arbitrary delimiters.
- mm: Shorthand for m with the :sigspace modifier.
- s: An operator form of anonymous regex that performs substitution with arbitrary delimiters.
- ss: Shorthand for s with the :sigspace modifier.
- /.../: Simply placing a regex between slashes is shorthand for m/.../.
Here is an example of typical use:
token word { \w+ }
rule phrase { <word> [ \, <word> ]* \. }
if $string ~~ / <phrase> \n / {
    ...
}
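The same two definitions can be tried out inside a grammar. This is a sketch that assumes a current Rakudo compiler; the grammar name Sentence and the sample sentence are hypothetical, and the phrase rule is invoked directly through the :rule argument of parse.

```raku
grammar Sentence {
    token word   { \w+ }
    rule  phrase { <word> [ \, <word> ]* \. }
}

# Parse the whole string with the phrase rule as the starting point.
my $m = Sentence.parse('one, two, three.', :rule<phrase>);

# <word> appears more than once in the rule, so $m<word> holds a list of matches.
say $m<word>.map(*.Str).join('|');   # one|two|three
```

Since phrase is a rule, the whitespace written after each atom in its body also matches the optional whitespace after each comma in the input.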
Modifiers
Modifiers may be placed after any of the regex keywords, and before the delimiter. If a regex is named, the modifier comes after the name. Modifiers control the way regexes are parsed and how they behave. They are always introduced with a leading : character.
Some of the more important modifiers include:
- :i or :ignorecase – Perform matching without respect to case.
- :g or :global – Perform the match more than once on a given target string.
- :s or :sigspace – Replace whitespace in the regex with a whitespace-matching rule, rather than simply ignoring it.
- :Perl5 – Treat the regex as a Perl 5 regular expression.
- :ratchet – Never perform backtracking in the rule.
For example:
regex addition :ratchet :sigspace { <term> \+ <expr> }
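The effect of several of these modifiers can be observed directly on the m operator. The following is a sketch assuming a current Rakudo compiler; the sample strings are illustrative only.

```raku
# :i (ignorecase): match without respect to case.
say so 'HELLO' ~~ m:i/ hello /;                 # True

# :g (global): collect every match in the target string.
my @numbers = 'a1b22c333' ~~ m:g/ \d+ /;
say @numbers.map(*.Str).join(',');              # 1,22,333

# :ratchet: without it, a+ gives back one "a"; with it, a+ keeps all three.
say so 'aaa' ~~ m/ ^ a+ a $ /;                  # True
say so 'aaa' ~~ m:ratchet/ ^ a+ a $ /;          # False

# :sigspace: whitespace in the pattern must match whitespace in the target.
say so 'foo   bar' ~~ m:s/ foo bar /;           # True
say so 'foobar'    ~~ m:s/ foo bar /;           # False
```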
Grammars
A grammar may be defined using the grammar operator. A grammar is essentially just a namespace for rules:
grammar Str::SprintfFormat {
    regex format_token { \%: <index>? <precision>? <modifier>? <directive> }
    token index { \d+ \$ }
    token precision { <flags>? <vector>? <precision_count> }
    token flags { <[\ +0\#\-]>+ }
    token precision_count { [ <[1-9]>\d* | \* ]? [ \. [ \d* | \* ] ]? }
    token vector { \*? v }
    token modifier { ll | <[lhmVqL]> }
    token directive { <[\%csduoxefgXEGbpniDUOF]> }
}
This is the grammar used to define Perl's sprintf string formatting notation.
Outside of this namespace, you could use these rules like so:
if / <Str::SprintfFormat::format_token> / { ... }
A rule used in this way is actually identical to the invocation of a subroutine with the extra semantics and side-effects of pattern matching (e.g., rule invocations can be backtracked).
Examples
Here are some example rules in Perl 6:
rx { a [ b | c ] ( d | e ) f : g }
rx { ( ab* ) <{ $0.size % 2 == 0 }> }
That last is identical to:
rx { ( ab[bb]* ) }
External links
- Synopsis 05 - The standards document covering Perl 6 regexes and rules.
- Perl 6 Regex FAQ - Answers a range of questions about Perl 6 regexes.
- Perl 6 Regex Introduction - Gentle introduction to Perl 6 regexes.