Pumping lemma for regular languages
Encyclopedia
In the theory of formal language
s, the pumping lemma
for regular languages describes an essential property of all regular language
s. Informally, it says that all sufficiently long words in a regular language may be pumped — that is, have a middle section of the word repeated an arbitrary number of times — to produce a new word which also lies within the same language.
Specifically, the pumping lemma says that for any regular language L there exists a constant p such that any word w in L with length at least p can be split into three substrings, w = xyz, where the middle portion y must not be empty, such that the words xz, xyz, xyyz, xyyyz, … constructed by repeating y an arbitrary number of times (including zero times) are still in L. This process of repetition is known as "pumping". Moreover, the pumping lemma guarantees that the length of xy will be at most p, imposing a limit on the ways in which w may be split. Finite languages trivially satisfy the pumping lemma by having p equal to the maximum string length in L plus one.
The pumping lemma was first articulated by Y. Bar-Hillel
, Micha A. Perles, Eli Shamir in 1961. It is useful for disproving the regularity of a specific language in question. It is one of a few pumping lemma
s, each with a similar purpose.
length at least p (p is called the "pumping length") can be written as w = xyz (i.e., w can be divided into three substrings), satisfying the following conditions:
y is the substring that can be pumped (removed or repeated any number of times, and the resulting string is always in L). (1) means the loop y to be pumped must be of length at least one; (2) means the loop must occur within the first p characters. There is no restriction on x and z.
Below is a formal expression of the Pumping Lemma.
For example the language L = {anbn : n ≥ 0} over the alphabet Σ = {a, b} can be shown to be non-regular as follows. Let w, x, y, z, p, and i be as used in the formal statement for the pumping lemma above. Let w in L be given by w = apbp. By the pumping lemma, there must be some decomposition w = xyz with |xy| ≤ p and |y| ≥ 1 such that
xyiz in L for every i ≥ 0. Using |xy| ≤ p, we know y only consists of instances of a. Moreover, because |y| ≥ 1, it contains at least one instance of the letter a. We now pump y up: xy2z has more instances of the letter a than the letter b, since we have added some instances of a without adding instances of b. Therefore xy2z is not in L. We have reached a contradiction. Therefore, the assumption that L is regular must be incorrect. Hence L is not regular.
The proof that the language of balanced (i.e., properly nested) parentheses is not regular follows the same idea. Given p, there is a string of balanced parentheses that begins with more than p left parentheses, so that y will consist entirely of left parentheses. By repeating y, we can produce a string that does not contain the same number of left and right parentheses, and so they cannot be balanced.
For example, the following image shows an FSA.
The FSA accepts the string: abcd. Since this string has a length which is at least as large as the number of states, which is four, the pigeonhole principle indicates that there must be at least one repeated state among the start state and the next four visited states. In this example, only q1 is a repeated state. Since the substring bc takes the machine through transitions that start at state q1 and end at state q1, that portion could be repeated and the FSA would still accept, giving the string abcbcd. Alternatively, the bc portion could be removed and the FSA would still accept giving the string ad. In terms of the pumping lemma, the string abcd is broken into an x portion a, a y portion bc and a z portion d.
with strings x, y and z such that |xy| ≤ p, |y| ≥ 1 and
This version can be used to prove many more languages are non-regular, since it imposes stricter requirements on the language.
For example, consider the language L = {uvwxy : u,y {0,1,2,3}*, v,w,x {0,1,2,3}∧(v=w∨v=x∨x=w)} {w : w {0,1,2,3}* and precisely 1/7 of the characters in w are 3's}. In other words, L contains all strings over the alphabet {0,1,2,3} with a substring of length 3 including a duplicate character, as well as all strings over this alphabet where precisely 1/7 of the string's characters are 3's. This language is not regular but can still be "pumped" with p = 5. Suppose some string s has length at least 5. Then, since the alphabet has only four characters, at least two of the five characters in the string must be duplicates. They are separated by at most three characters.
For a practical test that exactly characterizes regular languages, see the Myhill-Nerode theorem. The very typical method for proving that a language is regular is to construct either a finite state machine
or a regular expression
for the language.
Formal language
A formal language is a set of words—that is, finite strings of letters, symbols, or tokens that are defined in the language. The set from which these letters are taken is the alphabet over which the language is defined. A formal language is often defined by means of a formal grammar...
s, the pumping lemma
Lemma (mathematics)
In mathematics, a lemma is a proven proposition which is used as a stepping stone to a larger result rather than as a statement in-and-of itself...
for regular languages describes an essential property of all regular language
Regular language
In theoretical computer science and formal language theory, a regular language is a formal language that can be expressed using regular expression....
s. Informally, it says that all sufficiently long words in a regular language may be pumped — that is, have a middle section of the word repeated an arbitrary number of times — to produce a new word which also lies within the same language.
Specifically, the pumping lemma says that for any regular language L there exists a constant p such that any word w in L with length at least p can be split into three substrings, w = xyz, where the middle portion y must not be empty, such that the words xz, xyz, xyyz, xyyyz, … constructed by repeating y an arbitrary number of times (including zero times) are still in L. This process of repetition is known as "pumping". Moreover, the pumping lemma guarantees that the length of xy will be at most p, imposing a limit on the ways in which w may be split. Finite languages trivially satisfy the pumping lemma by having p equal to the maximum string length in L plus one.
The pumping lemma was first articulated by Y. Bar-Hillel
Yehoshua Bar-Hillel
Yehoshua Bar-Hillel was an Israeli philosopher, mathematician, and linguist at the Hebrew University of Jerusalem, best known for his pioneering work in machine translation and formal linguistics.- Biography :...
, Micha A. Perles, Eli Shamir in 1961. It is useful for disproving the regularity of a specific language in question. It is one of a few pumping lemma
Pumping lemma
In the theory of formal languages in computability theory, a pumping lemma or pumping argument states that, for a particular language to be a member of a language class, any sufficiently long string in the language contains a section, or sections, that can be removed, or repeated any number of...
s, each with a similar purpose.
Formal statement
Let L be a regular language. Then there exists an integer p ≥ 1 depending only on L such that every string w in L oflength at least p (p is called the "pumping length") can be written as w = xyz (i.e., w can be divided into three substrings), satisfying the following conditions:
- |y| ≥ 1
- |xy| ≤ p
- for all i ≥ 0, xyiz ∈ L
y is the substring that can be pumped (removed or repeated any number of times, and the resulting string is always in L). (1) means the loop y to be pumped must be of length at least one; (2) means the loop must occur within the first p characters. There is no restriction on x and z.
Below is a formal expression of the Pumping Lemma.
Use of lemma
The pumping lemma is often used to prove that a particular language is non-regular: a proof by contradiction (of the language's regularity) may consist of exhibiting a word (of the required length) in the language which lacks the property outlined in the pumping lemma.For example the language L = {anbn : n ≥ 0} over the alphabet Σ = {a, b} can be shown to be non-regular as follows. Let w, x, y, z, p, and i be as used in the formal statement for the pumping lemma above. Let w in L be given by w = apbp. By the pumping lemma, there must be some decomposition w = xyz with |xy| ≤ p and |y| ≥ 1 such that
xyiz in L for every i ≥ 0. Using |xy| ≤ p, we know y only consists of instances of a. Moreover, because |y| ≥ 1, it contains at least one instance of the letter a. We now pump y up: xy2z has more instances of the letter a than the letter b, since we have added some instances of a without adding instances of b. Therefore xy2z is not in L. We have reached a contradiction. Therefore, the assumption that L is regular must be incorrect. Hence L is not regular.
The proof that the language of balanced (i.e., properly nested) parentheses is not regular follows the same idea. Given p, there is a string of balanced parentheses that begins with more than p left parentheses, so that y will consist entirely of left parentheses. By repeating y, we can produce a string that does not contain the same number of left and right parentheses, and so they cannot be balanced.
Proof of the pumping lemma
For every regular language there is a finite state automaton (FSA) that accepts the language. The number of states in such an FSA are counted and that count is used as the pumping length p. For a string of length at least p, let s0 be the start state and let s1, ..., sp be the sequence of the next p states visited as the string is emitted. Because the FSA has only p states, within this sequence of p + 1 visited states there must be at least one state that is repeated. Write S for such a state. The transitions that take the machine from the first encounter of state S to the second encounter of state S match some string. This string is called y in the lemma, and since the machine will match a string without the y portion, or the string y can be repeated any number of times, the conditions of the lemma are satisfied.For example, the following image shows an FSA.
The FSA accepts the string: abcd. Since this string has a length which is at least as large as the number of states, which is four, the pigeonhole principle indicates that there must be at least one repeated state among the start state and the next four visited states. In this example, only q1 is a repeated state. Since the substring bc takes the machine through transitions that start at state q1 and end at state q1, that portion could be repeated and the FSA would still accept, giving the string abcbcd. Alternatively, the bc portion could be removed and the FSA would still accept giving the string ad. In terms of the pumping lemma, the string abcd is broken into an x portion a, a y portion bc and a z portion d.
General version of pumping lemma for regular languages
If a language L is regular, then there exists a number p ≥ 1 (the pumping length) such that every string uwv in L with |w| ≥ p can be written in the form- uwv = uxyzv
with strings x, y and z such that |xy| ≤ p, |y| ≥ 1 and
- uxyizv is in L for every integer i ≥ 0.
This version can be used to prove many more languages are non-regular, since it imposes stricter requirements on the language.
Lemma not sufficient
Note that neither the original nor the general version of the pumping lemma give a sufficient condition for a language to be regular. That is, the lemma holds for some non-regular languages. If the lemma does not hold for a language L, then L is not regular. If the lemma does hold for a language L, L might be regular.For example, consider the language L = {uvwxy : u,y {0,1,2,3}*, v,w,x {0,1,2,3}∧(v=w∨v=x∨x=w)} {w : w {0,1,2,3}* and precisely 1/7 of the characters in w are 3's}. In other words, L contains all strings over the alphabet {0,1,2,3} with a substring of length 3 including a duplicate character, as well as all strings over this alphabet where precisely 1/7 of the string's characters are 3's. This language is not regular but can still be "pumped" with p = 5. Suppose some string s has length at least 5. Then, since the alphabet has only four characters, at least two of the five characters in the string must be duplicates. They are separated by at most three characters.
- If the duplicate characters are separated by 0 characters, or 1, pump one of the other two characters in the string, which will not affect the substring containing the duplicates.
- If the duplicate characters are separated by 2 or 3 characters, pump 2 of the characters separating them. Pumping either down or up results in the creation of a substring of size 3 that contains 2 duplicate characters.
- The second condition of L ensures that L is not regular: i.e., there are an infinite number of strings that are in L but cannot be obtained by pumping some smaller string in L.
For a practical test that exactly characterizes regular languages, see the Myhill-Nerode theorem. The very typical method for proving that a language is regular is to construct either a finite state machine
Finite state machine
A finite-state machine or finite-state automaton , or simply a state machine, is a mathematical model used to design computer programs and digital logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states...
or a regular expression
Regular expression
In computing, a regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters. Abbreviations for "regular expression" include "regex" and "regexp"...
for the language.
See also
- Pumping lemma for context-free languagesPumping lemma for context-free languagesThe pumping lemma for context-free languages, also known as the Bar-Hillel lemma, is a lemma that gives a property shared by all context-free languages.- Formal statement :...
- Formal languages
- Ogden's lemmaOgden's lemmaIn the theory of formal languages, Ogden's lemma provides an extension of flexibility over the pumping lemma for context-free languages....