Fall 2000, CSE 468:
Lectures 25 and 26
Proofs by induction on context-free grammars
In these lectures we will show how to prove by induction
certain properties of context-free grammars, in particular the property
of generating a given language and the property of being unambiguous.
Our running example will be
a grammar for generating the language of the strings with the same number of a's
and b's,
namely the language
L = { x in {a,b}* | #a(x) = #b(x) }
We want to define a non-ambiguous grammar G for such a language, and prove
formally that indeed L(G) = L and that G is non-ambiguous.
Note that the grammar
S -> lambda | a S b | b S a | S S
generates (intuitively) the given language, but it
is ambiguous. For instance, the string abab has two different derivation
trees:
        S                     S
      / | \                 /   \
     a  S  b              S       S
      / | \             / | \   / | \
     b  S  a           a  S  b a  S  b
        |                 |       |
      lambda            lambda  lambda
Consider now the grammar
S -> lambda | a S b S | b S a S
This grammar also generates the given language, but it is still
ambiguous. The string abab still has two different derivation trees:
            S
         // | \
        a S b   S
          |  // | \
     lambda a S b   S
              |     |
           lambda lambda

            S
         // | \
        a S b   S
       // | \   |
      b S a   S lambda
        |     |
     lambda lambda
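The ambiguity of this second grammar can also be observed mechanically. The following Python fragment is only a sketch, and the function name count_trees is ours, not part of the lectures. It counts the derivation trees of a string by trying every later occurrence of the closing symbol as the one introduced together with the first symbol. It assumes that the input contains only the symbols a and b.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_trees(w: str) -> int:
    """Number of derivation trees for w in S -> lambda | a S b S | b S a S."""
    if w == "":
        return 1  # only S -> lambda derives the empty string
    first = w[0]
    closer = "b" if first == "a" else "a"
    total = 0
    # Try every later occurrence of the closing symbol as the one
    # introduced by the same production as the first symbol.
    for i in range(1, len(w)):
        if w[i] == closer:
            total += count_trees(w[1:i]) * count_trees(w[i + 1:])
    return total


print(count_trees("abab"))  # 2: the two trees drawn above
```

A result greater than 1 for some string is exactly what it means for the grammar to be ambiguous.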
Intuitively, the second grammar is ambiguous because
the production S -> a S b S does not force the b to be the "matching b" for
the a. The production S -> b S a S has the analogous problem.
Intuitively, we could eliminate the ambiguity by forcing the first
S in a S b S to generate "the shortest string" with the same number of a's
and b's, and analogously for b S a S. This can be done
by introducing new syntactic categories (non-terminal
symbols) T and U, and by stratifying the productions as follows:
S -> lambda | a T b S | b U a S
T -> lambda | a T b T
U -> lambda | b U a U
We will call G the above grammar consisting of all the productions for S,
T and U.
We will also call G1 the subgrammar consisting of the productions
T -> lambda | a T b T
and we will call G2 the subgrammar consisting of the productions
U -> lambda | b U a U
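As a sanity check (not a proof), we can count derivation trees in G, G1 and G2 by brute force. The sketch below is our own illustration: the dictionary G and the function count_trees are hypothetical names, and the empty string "" stands for lambda. It relies on the fact that every non-empty right-hand side in G has the shape "terminal, non-terminal, terminal, non-terminal".

```python
from functools import lru_cache
from itertools import product

# Productions of G, with "" standing for lambda.
# Upper-case letters are non-terminals, lower-case letters are terminals.
G = {
    "S": ["", "aTbS", "bUaS"],
    "T": ["", "aTbT"],
    "U": ["", "bUaU"],
}


@lru_cache(maxsize=None)
def count_trees(nt: str, w: str) -> int:
    """Number of derivation trees with root nt whose yield is w."""
    total = 0
    for rhs in G[nt]:
        if rhs == "":
            total += 1 if w == "" else 0
            continue
        # Every non-empty right-hand side has the shape: opener X closer Y.
        opener, inner, closer, tail = rhs
        if w == "" or w[0] != opener:
            continue
        for i in range(1, len(w)):
            if w[i] == closer:
                total += count_trees(inner, w[1:i]) * count_trees(tail, w[i + 1:])
    return total


print(count_trees("S", "abab"))  # 1

# Exhaustive check on all strings over {a, b} of length at most 8:
# exactly one tree when #a = #b, and no tree otherwise.
for n in range(9):
    for letters in product("ab", repeat=n):
        x = "".join(letters)
        expected = 1 if x.count("a") == x.count("b") else 0
        assert count_trees("S", x) == expected
```

The exhaustive loop only covers short strings, so it can support the intuition behind G but cannot replace an inductive argument.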
Intuitively, G1 generates the language of the "balanced" strings
on a and b, namely those strings with the same number of a's and b's, and
where each b comes after the corresponding a. For instance, strings like
lambda, ab, abab, aabb, aabbab, etc.
This language can be formally defined as the language
L1 = { x in {a,b}* | #a(x) = #b(x) and
for all x1, x2 s.t. x = x1 b x2,
#a(x1) >= 1 + #b(x1) }
Note that if we replaced "a"
by "(" and "b" by ")",
then G1 would generate the language of balanced parentheses,
namely all the strings of the form
( ) , ( )( ) , (( )) , (( ))( ) , ... etc.
The grammar G2 generates the symmetric language, with the roles of a and b
exchanged. Formally we can define the language (intuitively) generated by G2
as
L2 = { x in {a,b}* | #a(x) = #b(x) and
for all x1, x2 s.t. x = x1 a x2,
#b(x1) >= 1 + #a(x1) }
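The two definitions can be transcribed almost literally into executable predicates. The following is a minimal sketch, in which the names in_L1, in_L2 and SWAP are ours. For every occurrence of b it checks the condition on the prefix x1 that precedes it, and it obtains L2 from L1 by exchanging the two symbols.

```python
def in_L1(x: str) -> bool:
    """Membership in L1, transcribed directly from the definition."""
    if x.count("a") != x.count("b"):
        return False
    for i, c in enumerate(x):
        if c == "b":
            x1 = x[:i]  # the prefix before this occurrence of b
            if x1.count("a") < 1 + x1.count("b"):
                return False
    return True


# Translation table that exchanges the roles of a and b.
SWAP = str.maketrans("ab", "ba")


def in_L2(x: str) -> bool:
    """Membership in L2, obtained from L1 by exchanging a and b."""
    return in_L1(x.translate(SWAP))


for x in ["", "ab", "aabbab", "abba", "ba", "bbaa"]:
    print(repr(x), in_L1(x), in_L2(x))
```

Note that the empty string belongs to both languages, while a string such as abba belongs to neither, although it has the same number of a's and b's.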
We are now going to prove by mathematical induction and by structural induction that
G generates the required language, namely
Proposition 1
L(G) = L
and that G is unambiguous, namely:
Proposition 2
For all x in L(G), there exists only one derivation tree in G for x.
In order to prove Propositions 1 and 2,
we need to prove analogous properties for the subgrammars G1
and G2. Specifically, we need to prove the following lemmata:
Lemma 1
L(G1) = L1
Lemma 2
For all x in L(G1), there exists only one derivation tree in G1 for x.
Lemma 3
L(G2) = L2
Lemma 4
For all x in L(G2), there exists only one derivation tree in G2 for x.
We are going to prove only Lemmata 1 and 2: The proofs of Lemmata 3 and 4 are analogous.
Proof of Lemma 1 (L(G1) = L1 )
- Part 1: L(G1) is contained in L1
- We need to prove that, for every x in L(G1), x
enjoys the following properties:
- #a(x) = #b(x)
- for all x1, x2 s.t. x = x1 b x2,
#a(x1) >= 1 + #b(x1)
We are going to prove these properties by structural induction
(remember that L(G1) can be seen as defined inductively).
- base case: x = lambda. We have:
- #a(lambda) = 0 = #b(lambda)
- there exists no x1, x2 s.t. x =
x1 b x2, hence this point is trivially satisfied
- inductive step: x = a y b z where y and z are in L(G1) as
well. We have:
- #a(x) = #a(a y b z) = 1 + #a(y) +
#a(z)
#b(x) = #b(a y b z) = #b(y) + 1 +
#b(z).
By inductive hypothesis #a(y) = #b(y) and
#a(z) = #b(z), hence
#a(x) = #b(x).
- Let x = x1 b x2. We have three cases, depending on where the
chosen b occurs in a y b z:
- x1 = a y (the b is the one between y and z). In this case we have:
#a(x1) = 1 + #a(y) =
(by inductive hypothesis) 1 + #b(y) = 1 + #b(x1)
- x1 = a y1 where y1 is a
prefix of y, namely y = y1 b y2 (the b occurs inside y). In this case
we have:
#a(x1) = 1 + #a(y1) >=
(by inductive hypothesis) 1 + 1 + #b(y1) = 2 + #b(x1)
> 1 + #b(x1)
- x1 = a y b z1 where z1 is a
prefix of z, namely z = z1 b z2 (the b occurs inside z). In this case
we have:
#a(x1) = 1 + #a(y) + #a(z1) >=
(by inductive hypothesis) 1 + #b(y) + 1 +
#b(z1) = 1 + #b(x1)
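The structure of this argument mirrors the inductive definition of L(G1): lambda is in L(G1), and if y and z are in L(G1) then so is a y b z. As an illustration only, the sketch below builds L(G1) by length following exactly this definition and checks the two properties on every generated string. The names generate_G1 and has_L1_properties are ours, and the bound 10 is an arbitrary choice.

```python
def generate_G1(max_len: int) -> dict:
    """Strings of L(G1) grouped by length, built from the inductive definition:
    lambda is in L(G1); if y and z are in L(G1), then so is a y b z."""
    by_len = {0: {""}}
    for n in range(1, max_len + 1):
        by_len[n] = set()
        for len_y in range(n - 1):  # |y| + |z| = n - 2
            len_z = n - 2 - len_y
            for y in by_len[len_y]:
                for z in by_len[len_z]:
                    by_len[n].add("a" + y + "b" + z)
    return by_len


def has_L1_properties(x: str) -> bool:
    """The two properties proved in Part 1."""
    same_count = x.count("a") == x.count("b")
    prefixes_ok = all(
        x[:i].count("a") >= 1 + x[:i].count("b")
        for i, c in enumerate(x)
        if c == "b"
    )
    return same_count and prefixes_ok


strings = generate_G1(10)
assert all(has_L1_properties(x) for group in strings.values() for x in group)
print(sorted(strings[4]))  # ['aabb', 'abab']
```

The check succeeds for every generated string up to the chosen length, which is consistent with Part 1 but of course says nothing about longer strings.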
- Part 2: L1 is contained in L(G1)
- We are going to prove this part by strong mathematical
induction on the length of the strings. Remember that the principle
of strong mathematical induction has the following schema:
If, for all n, the assumption that P(k) holds for all k < n implies P(n),
then we can deduce that for all n, P(n) holds.
In practice, the principle of strong mathematical induction allows
us to use
the inductive hypothesis on all k < n, rather than just on n-1.
Let x be a string in L1.
- if |x| = 0, then x = lambda and therefore x can be generated by using the
production T -> lambda.
- if |x| > 0, then x must start with "a": if x started with "b", then
taking x1 = lambda in the definition of L1 would give 0 >= 1.
Moreover, since #a(x) = #b(x), x must contain a matching "b" somewhere.
Let us consider the shortest y such that x = a y b z for some z
and #a(y) = #b(y). By a case analysis similar
to the one in the previous part, we can show that y and z are also in L1
(to prove that for all y1, y2 s.t. y = y1 b y2,
#a(y1) >= 1 + #b(y1) holds, we
must use the fact that y is the shortest string satisfying the properties x = a y b z
and #a(y) = #b(y)).
By inductive hypothesis we have that y and z are in
L(G1), namely T ->* y and T ->* z.
Hence we can obtain a derivation
T -> a T b T ->* a y b T ->* a y b z
which shows that the string a y b z, namely x, is in L(G1).
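The proof of Part 2 is constructive: it tells us how to build a derivation tree for a string of L1 by splitting it at the shortest balanced y. The sketch below follows that recipe. The names derive_T and tree_yield, and the representation of trees as nested tuples, are our own assumptions and not part of the lectures.

```python
def derive_T(x: str) -> tuple:
    """Derivation tree in G1 for x, built by splitting x = a y b z at the
    shortest y with #a(y) = #b(y). Raises ValueError if x is not in L1."""
    if x == "":
        return ("T", "lambda")  # production T -> lambda
    if x[0] != "a":
        raise ValueError(f"{x!r} does not start with a")
    balance = 0  # #a minus #b in x[1:i], the candidate y
    for i in range(1, len(x)):
        if x[i] == "b" and balance == 0:
            y, z = x[1:i], x[i + 1:]
            # production T -> a T b T
            return ("T", "a", derive_T(y), "b", derive_T(z))
        balance += 1 if x[i] == "a" else -1
    raise ValueError(f"no matching b in {x!r}")


def tree_yield(tree: tuple) -> str:
    """String of terminals read off the leaves of a tree, left to right."""
    parts = []
    for child in tree[1:]:
        if isinstance(child, tuple):
            parts.append(tree_yield(child))
        elif child != "lambda":
            parts.append(child)
    return "".join(parts)


tree = derive_T("aabbab")
print(tree)
assert tree_yield(tree) == "aabbab"
```

The recursive calls are made on the strictly shorter strings y and z, which is where the proof uses the strong inductive hypothesis.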
Proof of Lemma 2 (G1 is not ambiguous)
We are going to prove this lemma by structural induction.
Let x be a string in L(G1).
- base case: x = lambda. In this case, there can be only one
derivation tree for x: the one obtained by applying the production T ->
lambda.
- inductive step: x = a y b z, with y, z in L(G1).
Observe that y and z are uniquely defined by these properties. Namely,
if x = a v b w, with v, w in L(G1), then y = v and z = w.
In fact, assume by contradiction that y and v are different. Since
a y b z = a v b w, one of y and v is a proper prefix of the other.
Consider the case that y is a proper prefix of v. Since y is followed by
b in x, we must have v = y b v' for some v'. By Lemma 1, v is in L1, so by
definition of L1, #a(y) >= 1 + #b(y) holds, which implies that y is not
in L1 = L(G1). Contradiction.
The case in which v is a proper prefix of y is analogous.
Since y and v are equal, also z and w must be equal.
Now, since x is not lambda, the root of any derivation tree for x must use
the production T -> a T b T, and by the uniqueness of y and z its two
subtrees must derive y and z. By inductive hypothesis, y and z each have
a unique derivation tree. Let us call these trees ty and tz
respectively. We have that x admits only one derivation tree, which is
the following one:
        T
     // | \
    a T b   T
      |     |
      ty    tz
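The key step of this proof is that a non-empty string of L(G1) can be split as a y b z, with y and z in L(G1), in exactly one way. The sketch below enumerates all candidate splits and checks this on short strings. The names in_L1 and splits are ours, and in_L1 uses a running balance (never negative, zero at the end), which we assume to be an equivalent reformulation of the definition of L1.

```python
from itertools import product


def in_L1(x: str) -> bool:
    """L1 membership via a running balance #a - #b that never goes negative
    and ends at zero (assumed equivalent to the definition of L1)."""
    balance = 0
    for c in x:
        balance += 1 if c == "a" else -1
        if balance < 0:
            return False
    return balance == 0


def splits(x: str) -> list:
    """All pairs (v, w) with x = a v b w and v, w in L1."""
    if x == "" or x[0] != "a":
        return []
    return [
        (x[1:i], x[i + 1:])
        for i in range(1, len(x))
        if x[i] == "b" and in_L1(x[1:i]) and in_L1(x[i + 1:])
    ]


print(splits("aabbab"))  # [('ab', 'ab')]

# Every non-empty string of L1 of length at most 10 has exactly one split.
for n in range(1, 11):
    for letters in product("ab", repeat=n):
        x = "".join(letters)
        if in_L1(x):
            assert len(splits(x)) == 1
```

For abab, for example, the split at the last b is rejected because the candidate v = ba is not in L1, which is the situation ruled out by the contradiction argument above.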
Proof of Proposition 1 (L(G) = L)
- Part 1: L(G) is contained in L
- We need to prove that, for every x in L(G), #a(x) =
#b(x).
We are going to prove this property by structural induction.
- base case: x = lambda. We have: #a(lambda) = 0 = #b(lambda)
- inductive step, case 1: x = a y b z where y is in L(G1) and z is
in L(G). We have:
#a(x) = #a(a y b z) = 1 + #a(y) +
#a(z)
#b(x) = #b(a y b z) = #b(y) + 1 +
#b(z).
By Lemma 1 #a(y) = #b(y) and by inductive
hypothesis
#a(z) = #b(z), hence
#a(x) = #b(x).
- inductive step, case 2: x = b y a z where y is in L(G2) and z is
in L(G). We have:
#a(x) = #a(b y a z) = #a(y) + 1 +
#a(z)
#b(x) = #b(b y a z) = 1 + #b(y) +
#b(z).
By Lemma 3 #a(y) = #b(y) and by inductive
hypothesis
#a(z) = #b(z), hence
#a(x) = #b(x).
- Part 2: L is contained in L(G)
- We are going to prove this part by strong mathematical
induction on the length of the strings. Let x be a string in L.
- if |x| = 0, then x = lambda and therefore x can be generated by using the
production S -> lambda.
- if |x| > 0, and x starts with "a": Since
#a(x) = #b(x), x must contain a matching "b" somewhere.
Let us consider the shortest y such that x = a y b z for some z
and #a(y) = #b(y). By a case analysis similar
to the one in the proof of Lemma 1, we can show that y is in L1
and that z is in L.
(to prove that for all y1, y2 s.t. y = y1 b y2,
#a(y1) >= 1 + #b(y1) holds, we
must use the fact that y is the shortest string satisfying the properties x = a y b z
and #a(y) = #b(y)).
By Lemma 1 we have that y is in L(G1), namely T
->* y.
By inductive hypothesis we have that z is in
L(G), namely S ->* z.
Hence we can obtain a derivation
S -> a T b S ->* a y b S ->* a y b z
which shows that the string a y b z, namely x, is in L(G).
- if |x| > 0, and x starts with "b": the proof is analogous to the
previous case, with U, L2 and Lemma 3 in place of T, L1 and Lemma 1.
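The derivation constructed in this proof can be produced step by step. The following sketch is our own illustration, and the names productions and leftmost_derivation are hypothetical. It selects the production from the first symbol of x, splits at the shortest balanced y, and then replays the chosen productions on the leftmost non-terminal. The empty string "" stands for lambda.

```python
def productions(nt: str, x: str) -> list:
    """Productions of G, in leftmost order, that derive x from nt.
    Raises ValueError if the shortest-split strategy finds no derivation."""
    if x == "":
        return [(nt, "")]  # nt -> lambda
    opener = x[0]
    if opener == "a" and nt in ("S", "T"):
        inner, closer = "T", "b"
    elif opener == "b" and nt in ("S", "U"):
        inner, closer = "U", "a"
    else:
        raise ValueError(f"{nt} cannot derive {x!r}")
    balance = 0  # openers minus closers in x[1:i], the candidate y
    for i in range(1, len(x)):
        if x[i] == closer and balance == 0:  # shortest balanced y found
            # In G the last non-terminal of each production equals its head.
            rhs = opener + inner + closer + nt
            return (
                [(nt, rhs)]
                + productions(inner, x[1:i])
                + productions(nt, x[i + 1:])
            )
        balance += 1 if x[i] == opener else -1
    raise ValueError(f"no matching {closer} in {x!r}")


def leftmost_derivation(x: str) -> list:
    """Sentential forms of the leftmost derivation of x from S."""
    form = "S"
    forms = [form]
    for nt, rhs in productions("S", x):
        i = next(k for k, c in enumerate(form) if c.isupper())
        assert form[i] == nt  # the leftmost non-terminal is the one expanded
        form = form[:i] + rhs + form[i + 1:]
        forms.append(form)
    return forms


print(" => ".join(leftmost_derivation("abba")))
# S => aTbS => abS => abbUaS => abbaS => abba
```

For a string with different numbers of a's and b's, such as aab, the search for a matching closing symbol fails and a ValueError is raised.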
Proof of Proposition 2 (G is not ambiguous)
We are going to prove this proposition by structural induction.
Let x be a string in L(G).
- base case: x = lambda. In this case, there can be only one
derivation tree for x: the one obtained by applying the production S ->
lambda.
- inductive step: x = a y b z, with y in L(G1) and z in
L(G).
Observe that y and z are uniquely defined by these properties. Namely,
if x = a v b w, with v in L(G1) and w in L(G), then y = v and z = w.
In fact, assume by contradiction that y and v are different. Since
a y b z = a v b w, one of y and v is a proper prefix of the other.
Consider the case that y is a proper prefix of v. Since y is followed by
b in x, we must have v = y b v' for some v'. By Lemma 1, v is in L1, so by
definition of L1, #a(y) >= 1 + #b(y) holds, which implies that y is not
in L1 = L(G1). Contradiction.
The case in which v is a proper prefix of y is analogous.
Since y and v are equal, also z and w must be equal.
Now, since x starts with a, the root of any derivation tree for x must use
the production S -> a T b S. By Lemma 2, y has a unique derivation tree
with root T, and by inductive hypothesis z has a unique derivation tree
with root S. Let us call these trees ty and tz
respectively. We have that x admits only one derivation tree, which is
the following one:
        S
     // | \
    a T b   S
      |     |
      ty    tz
- inductive step: x = b y a z, with y in L(G2) and z in
L(G). The proof is analogous to the previous case, using Lemmata 3 and 4
in place of Lemmata 1 and 2.