Lyndon word
Encyclopedia
In mathematics
, in the areas of combinatorics
and computer science
, a Lyndon word is a string
that is strictly smaller in lexicographic order than all of its rotations
. Lyndon words are named after mathematician Roger Lyndon
, who introduced them in 1954, calling them standard lexicographic sequences.
A k-ary Lyndon word of length n is an n-character string over an alphabet of size k, and which is the minimum element in the lexicographical ordering of all its rotations. Being the singularly smallest rotation implies that a Lyndon word differs from any of its non-trivial rotations, and is therefore aperiodic.
Alternately, a Lyndon word has the property that, whenever it is split into two substrings, the left substring is always lexicographically less than the right substring. That is, if w is a Lyndon word, and w = uv is any factorization into two substrings, with u and v understood to be non-empty, then u < v. This definition implies that w is a Lyndon word if and only if there exist Lyndon words u and v such that u < v and w = uv. Although there may be more than one choice of u and v with this property, there is a particular choice, called the standard factorization, in which v is as long as possible.
Here, ε denotes the empty string
. The first string that does not belong to this sequence, "00", is omitted because it is periodic (it consists of two repetitions of the substring "0"); the second omitted string, "10", is aperiodic but is not minimal in its permutation class as it can be cyclically permuted to the smaller string "01".
The numbers of binary Lyndon words of each length, starting with length zero, form the integer sequence
Lyndon words correspond to aperiodic necklace
class representatives and can thus be counted with Moreau's necklace-counting function.
for listing the Lyndon words of length at most n with a given alphabet size s in lexicographic order. If w is one of the words in the sequence, then the next word after w can be found by the following steps:
The worst-case time to generate the successor of a word w by this procedure is O(n).
However, if the words being generated are stored in an array of length n, and the construction of x from w is performed by adding symbols onto the end of w rather than by making a new copy of w, then the average time to generate the successor of w (assuming each word is equally likely) is constant. Therefore, the sequence of all Lyndon words of length at most n can be generated in time proportional to the length of the sequence. A constant fraction of the words in this sequence have length exactly n, so the same procedure can be used to efficiently generate words of length exactly n or words whose length divides n, by filtering out the generated words that do not fit these criteria.
, and in algorithms for digital geometry
.
developed an algorithm for standard factorization that runs in linear time and constant space. It iterates over a string trying to find as long Lyndon word as possible. When it founds one, it adds it to the result list and proceeds to search in the remaining part of string. Resulting list of strings is standard factorization of a given string. More formal description of algorithm follows.
Given a string S of length N, one should proceed with the following steps:
, a circular sequence of symbols such that each possible length-n sequence appears exactly once as one of its contiguous subsequences. For example, the concatenation of the binary Lyndon words whose length divides four is
This construction, together with the efficient generation of Lyndon words, provides an efficient method for constructing a particular de Bruijn sequence in linear time and logarithmic space
.
s, in constructing a basis for the homogeneous part of a given degree; this was Lyndon's original motivation for introducing these words. Lyndon words may be understood as a special case of Hall sets.
A theorem of Radford states that the algebra
of polynomial
s of Lyndon words with rational coefficients
is a shuffle algebra
; that is, they form an algebra over a ring, with multiplication taken to be the shuffle operator.
Mathematics
Mathematics is the study of quantity, space, structure, and change. Mathematicians seek out patterns and formulate new conjectures. Mathematicians resolve the truth or falsity of conjectures by mathematical proofs, which are arguments sufficient to convince other mathematicians of their validity...
, in the areas of combinatorics
Combinatorics
Combinatorics is a branch of mathematics concerning the study of finite or countable discrete structures. Aspects of combinatorics include counting the structures of a given kind and size , deciding when certain criteria can be met, and constructing and analyzing objects meeting the criteria ,...
and computer science
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...
, a Lyndon word is a string
String (computer science)
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....
that is strictly smaller in lexicographic order than all of its rotations
Circular shift
In combinatorial mathematics, a circular shift is the operation of rearranging the entries in a tuple, either by moving the final entry to the first position, while shifting all other entries to the next position, or by performing the inverse operation...
. Lyndon words are named after mathematician Roger Lyndon
Roger Lyndon
Roger Conant Lyndon was an American mathematician, for many years a professor at the University of Michigan. He is known for Lyndon words, the Curtis–Hedlund–Lyndon theorem, Craig–Lyndon interpolation and the Lyndon–Hochschild–Serre spectral sequence.-Biography:Lyndon was born on December 18, 1917...
, who introduced them in 1954, calling them standard lexicographic sequences.
Definitions
Several equivalent definitions are possible.A k-ary Lyndon word of length n is an n-character string over an alphabet of size k, and which is the minimum element in the lexicographical ordering of all its rotations. Being the singularly smallest rotation implies that a Lyndon word differs from any of its non-trivial rotations, and is therefore aperiodic.
Alternately, a Lyndon word has the property that, whenever it is split into two substrings, the left substring is always lexicographically less than the right substring. That is, if w is a Lyndon word, and w = uv is any factorization into two substrings, with u and v understood to be non-empty, then u < v. This definition implies that w is a Lyndon word if and only if there exist Lyndon words u and v such that u < v and w = uv. Although there may be more than one choice of u and v with this property, there is a particular choice, called the standard factorization, in which v is as long as possible.
Enumeration
The Lyndon words over the two-symbol binary alphabet {0,1}, sorted by length and then lexicographically within each length class, form an infinite sequence that begins- ε, 0, 1, 01, 001, 011, 0001, 0011, 0111, 00001, 00011, 00101, 00111, 01011, 01111, ...
Here, ε denotes the empty string
Empty string
In computer science and formal language theory, the empty string is the unique string of length zero. It is denoted with λ or sometimes Λ or ε....
. The first string that does not belong to this sequence, "00", is omitted because it is periodic (it consists of two repetitions of the substring "0"); the second omitted string, "10", is aperiodic but is not minimal in its permutation class as it can be cyclically permuted to the smaller string "01".
The numbers of binary Lyndon words of each length, starting with length zero, form the integer sequence
Integer sequence
In mathematics, an integer sequence is a sequence of integers.An integer sequence may be specified explicitly by giving a formula for its nth term, or implicitly by giving a relationship between its terms...
- 1, 2, 1, 2, 3, 6, 9, 18, 30, 56, 99, 186, 335, ... .
Lyndon words correspond to aperiodic necklace
Necklace (combinatorics)
In combinatorics, a k-ary necklace of length n is an equivalence class of n-character strings over an alphabet of size k, taking all rotations as equivalent...
class representatives and can thus be counted with Moreau's necklace-counting function.
Generation
provides an efficient algorithmAlgorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...
for listing the Lyndon words of length at most n with a given alphabet size s in lexicographic order. If w is one of the words in the sequence, then the next word after w can be found by the following steps:
- Repeat the symbols from w to form a new word x of length exactly n, where the ith symbol of x is the same as the symbol at position (i mod length(w)) of w.
- As long as the final symbol of x is the last symbol in the sorted ordering of the alphabet, remove it, producing a shorter word.
- Replace the final remaining symbol of x by its successor in the sorted ordering of the alphabet.
The worst-case time to generate the successor of a word w by this procedure is O(n).
However, if the words being generated are stored in an array of length n, and the construction of x from w is performed by adding symbols onto the end of w rather than by making a new copy of w, then the average time to generate the successor of w (assuming each word is equally likely) is constant. Therefore, the sequence of all Lyndon words of length at most n can be generated in time proportional to the length of the sequence. A constant fraction of the words in this sequence have length exactly n, so the same procedure can be used to efficiently generate words of length exactly n or words whose length divides n, by filtering out the generated words that do not fit these criteria.
Standard factorization
According to the Chen–Fox–Lyndon theorem, every string may be formed in a unique way by concatenating a sequence of Lyndon words, in such a way that the words in the sequence are nonincreasing lexicographically. The final Lyndon word in this sequence is the lexicographically smallest suffix of the given string. A factorization into a nonincreasing sequence of Lyndon words (a so-called Lyndon factorization) may be constructed in linear time. Lyndon factorizations may be used as part of a bijective variant of the Burrows–Wheeler transform for data compressionData compression
In computer science and information theory, data compression, source coding or bit-rate reduction is the process of encoding information using fewer bits than the original representation would use....
, and in algorithms for digital geometry
Digital geometry
Digital geometry deals with discrete sets considered to be digitized models or images of objects of the 2D or 3D Euclidean space.Simply put, digitizing is replacing an object by a discrete set of its points...
.
developed an algorithm for standard factorization that runs in linear time and constant space. It iterates over a string trying to find as long Lyndon word as possible. When it founds one, it adds it to the result list and proceeds to search in the remaining part of string. Resulting list of strings is standard factorization of a given string. More formal description of algorithm follows.
Given a string S of length N, one should proceed with the following steps:
- Let m be the index of the symbol-candidate to be appended to already collected ones. Initially, m = 1 (counting of symbols in a string starts from zero).
- Let k be the index of the symbol we would compare others to. Initially, k = 0.
- While k and m is less than N, compare S[k] (kth symbol of a string S) to S[m]. There are three possible outcomes:
- S[k] is equal to S[m]: append S[m] to the current collected symbols. Increment k and m.
- S[k] is greater than S[m]: if we append S[m] to the current collected symbols we'll get Lyndon word. But we can't add it to the result list yet because it may be just a part of a larger Lyndon word. Thus, just increment m and set k to 0 so the next symbol would be compared to the first one in the string.
- S[k] is less than S[m]: if we append S[m] to the current collected symbols, it won't be neither Lyndon word nor the possible beginning of one. Thus, add first m-k collected symbols to the result list, remove them from the string, set m to 1 and k to 0 so they point to the second and first symbol of the string respectively.
- Add S to the result list.
Connection to de Bruijn sequences
If one concatenates together, in lexicographic order, all the Lyndon words that have length dividing a given number n, the result is a de Bruijn sequenceDe Bruijn sequence
In combinatorial mathematics, a k-ary De Bruijn sequence B of order n, named after the Dutch mathematician Nicolaas Govert de Bruijn, is a cyclic sequence of a given alphabet A with size k for which every possible subsequence of length n in A appears as a sequence of consecutive characters exactly...
, a circular sequence of symbols such that each possible length-n sequence appears exactly once as one of its contiguous subsequences. For example, the concatenation of the binary Lyndon words whose length divides four is
- 0 0001 0011 01 0111 1
This construction, together with the efficient generation of Lyndon words, provides an efficient method for constructing a particular de Bruijn sequence in linear time and logarithmic space
L (complexity)
In computational complexity theory, L is the complexity class containing decision problems which can be solved by a deterministic Turing machine using a logarithmic amount of memory space...
.
Additional properties and applications
Lyndon words have an application to the description of free Lie algebraFree Lie algebra
In mathematics, a free Lie algebra, over a given field K, is a Lie algebra generated by a set X, without any imposed relations.-Definition:Given a set X, one can show that there exists a unique free Lie algebra L generated by X....
s, in constructing a basis for the homogeneous part of a given degree; this was Lyndon's original motivation for introducing these words. Lyndon words may be understood as a special case of Hall sets.
A theorem of Radford states that the algebra
Algebra (ring theory)
In mathematics, specifically in ring theory, an algebra over a commutative ring is a generalization of the concept of an algebra over a field, where the base field K is replaced by a commutative ring R....
of polynomial
Polynomial
In mathematics, a polynomial is an expression of finite length constructed from variables and constants, using only the operations of addition, subtraction, multiplication, and non-negative integer exponents...
s of Lyndon words with rational coefficients
Rational number
In mathematics, a rational number is any number that can be expressed as the quotient or fraction a/b of two integers, with the denominator b not equal to zero. Since b may be equal to 1, every integer is a rational number...
is a shuffle algebra
Shuffle algebra
In mathematics, a shuffle algebra is a Hopf algebra with a basis corresponding to words on some set, whose product is given by the shuffle product XшY of two words X, Y: the sum of all ways of interlacing them. The shuffle product was introduced by...
; that is, they form an algebra over a ring, with multiplication taken to be the shuffle operator.