## Proofs by induction on context-free grammars

In these lectures we will show how to prove by induction certain properties of context-free grammars, in particular the property of generating a given language and the property of being unambiguous.

Our running example will be a grammar for generating the language of the strings with the same number of a's and b's, namely the language

L = { x in {a,b}* | #a(x) = #b(x) }
We want to define a non-ambiguous grammar G for such a language, and prove formally that indeed L(G) = L and that G is non-ambiguous.

Note that the grammar

S -> lambda | a S b | b S a | S S
generates (intuitively) the given language, but it is ambiguous. For instance, the string abab has two different derivation trees:
```
      S                   S
    / | \              /     \
   a  S  b            S       S
    / | \           / | \   / | \
   b  S  a         a  S  b a  S  b
      |               |       |
    lambda          lambda  lambda
```

Consider now the grammar

S -> lambda | a S b S | b S a S
This grammar also generates the given language, but it is still ambiguous. The string abab still has two different derivation trees:
```
        S                            S
      // | \                       // | \
     a S  b  S                    a S  b  S
       |   // | \                 // | \  |
   lambda a S  b  S              b S  a  S lambda
            |     |                |     |
         lambda lambda          lambda lambda
```
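The two trees above can also be found mechanically. The following Python sketch counts the derivation trees of a string in the grammar S -> lambda | a S b S | b S a S by trying every possible position for the closing symbol. The function name `count_trees_ambiguous` is our own choice for illustration, and the input is assumed to contain only the letters a and b.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_trees_ambiguous(w: str) -> int:
    """Number of derivation trees for w in S -> lambda | a S b S | b S a S."""
    if w == "":
        return 1  # only the production S -> lambda yields the empty string
    first = w[0]
    closer = "b" if first == "a" else "a"
    total = 0
    # Try every split w = first y closer z; each pair of trees for y and z
    # gives one distinct tree for w.
    for i in range(1, len(w)):
        if w[i] == closer:
            total += count_trees_ambiguous(w[1:i]) * count_trees_ambiguous(w[i + 1:])
    return total


print(count_trees_ambiguous("abab"))  # 2
```

The two trees counted for abab correspond to the two choices of the b that closes the initial a: the b in second position, or the b in last position.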

Intuitively, the reason why the second grammar is ambiguous is that in the production S -> a S b S we do not require the b to be the "matching b" for the a. An analogous problem arises with the production S -> b S a S. Intuitively, we could eliminate the ambiguity by forcing the first S in a S b S to generate "the shortest string" with the same number of a's and b's. Analogously for b S a S. This can be done by introducing new syntactic categories (non-terminal symbols) T and U, and by stratifying the productions as follows:

S -> lambda | a T b S | b U a S
T -> lambda | a T b T
U -> lambda | b U a U
We will call G the above grammar consisting of all the productions for S, T and U. We will also call G1 the subgrammar consisting of the productions
T -> lambda | a T b T
and we will call G2 the subgrammar consisting of the productions
U -> lambda | b U a U
Intuitively, G1 generates the language of the "balanced" strings on a and b, namely those strings with the same number of a's and b's, and where each b comes after the corresponding a. For instance, strings like
ab , abab , aabb , aabbab , ... etc.
This language can be formally defined as the language
L1 = { x in {a,b}* | #a(x) = #b(x) and for all x1, x2 s.t. x = x1 b x2, #a(x1) >= 1 + #b(x1) }
Note that if we replaced "a" by "(" and "b" by ")", then G1 would generate the language of balanced parentheses, namely all the strings of the form
( ) , ( )( ) , (( )) , (( ))( ) , ... etc.
The grammar G2 generates the symmetric language, with the roles of a and b exchanged. Formally we can define the language (intuitively) generated by G2 as
L2 = { x in {a,b}* | #a(x) = #b(x) and for all x1, x2 s.t. x = x1 a x2, #b(x1) >= 1 + #a(x1) }
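The definitions of L, L1 and L2 translate almost literally into membership tests. The following Python sketch is only an illustration: the names `in_L`, `in_L1` and `in_L2` are ours, and the input is assumed to be a string over the alphabet {a,b}.

```python
def in_L(x: str) -> bool:
    """x is in L iff #a(x) = #b(x)."""
    return x.count("a") == x.count("b")


def in_L1(x: str) -> bool:
    """x is in L1 iff #a(x) = #b(x) and, for every split x = x1 b x2,
    #a(x1) >= 1 + #b(x1)."""
    if x.count("a") != x.count("b"):
        return False
    return all(
        x[:i].count("a") >= 1 + x[:i].count("b")
        for i, c in enumerate(x)
        if c == "b"  # x[:i] plays the role of x1, x[i] is the b
    )


def in_L2(x: str) -> bool:
    """L2 is L1 with the roles of a and b exchanged."""
    return in_L1(x.translate(str.maketrans("ab", "ba")))


print(in_L1("aabbab"), in_L1("abba"), in_L("abba"), in_L2("ba"))
# True False True True
```

The string abba shows the difference between L and L1: it has the same number of a's and b's, but its second b is not preceded by a matching a.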

We are now going to prove by mathematical induction and by structural induction that G generates the required language, namely

Proposition 1 L(G) = L

and that G is unambiguous, namely:

Proposition 2 For all x in L(G), there exists only one derivation tree in G for x.

In order to prove Propositions 1 and 2, we need to prove analogous properties for the subgrammars G1 and G2. Specifically, we need to prove the following lemmata:

Lemma 1 L(G1) = L1

Lemma 2 For all x in L(G1), there exists only one derivation tree in G1 for x.

Lemma 3 L(G2) = L2

Lemma 4 For all x in L(G2), there exists only one derivation tree in G2 for x.

We are going to prove only Lemmata 1 and 2: The proofs of Lemmata 3 and 4 are analogous.

### Proof of Lemma 1 (L(G1) = L1 )

Part 1: L(G1) is contained in L1
We need to prove that, for every x in L(G1), x enjoys the following properties:
1. #a(x) = #b(x)
2. for all x1, x2 s.t. x = x1 b x2, #a(x1) >= 1 + #b(x1)
We are going to prove these properties by structural induction (remember that L(G1) can be seen as defined inductively).
• base case: x = lambda. We have:
1. #a(lambda) = 0 = #b(lambda)
2. there exist no x1, x2 s.t. x = x1 b x2, hence this point is trivially satisfied
• inductive step: x = a y b z where y and z are in L(G1) as well. We have:
1. #a(x) = #a(a y b z) = 1 + #a(y) + #a(z)
#b(x) = #b(a y b z) = #b(y) + 1 + #b(z).
By inductive hypothesis #a(y) = #b(y) and #a(z) = #b(z), hence #a(x) = #b(x).
2. Let x = x1 b x2. We have three cases:
• x1 = a y. In this case we have:
#a(x1) = 1 + #a(y) = (by inductive hypothesis) 1 + #b(y) = 1 + #b(x1)
• x1 = a y1 where y1 is a prefix of y, namely y = y1 b y2. In this case we have:
#a(x1) = 1 + #a(y1) >= (by inductive hypothesis) 1 + 1 + #b(y1) = 2 + #b(x1) > 1 + #b(x1)
• x1 = a y b z1 where z1 is a prefix of z, namely z = z1 b z2. In this case we have:
#a(x1) = 1 + #a(y) + #a(z1) >= (by inductive hypothesis) 1 + #b(y) + 1 + #b(z1) = 1 + #b(x1)
Part 2: L1 is contained in L(G1)
We are going to prove this part by strong mathematical induction on the length of the strings. Remember that the principle of strong mathematical induction has the following schema:
If, for every n, the validity of P(k) for all k < n implies P(n), then we can deduce that P(n) holds for all n.
In practice, the principle of strong mathematical induction allows us to use the inductive hypothesis on all k < n, rather than just on n-1.

Let x be a string in L1.

• if |x| = 0, then x = lambda and therefore x can be generated by using the production T -> lambda.
• if |x| > 0, then x must start with "a" (by definition of L1), and must contain a matching "b" somewhere. Let us consider the shortest y such that x = a y b z for some z and #a(y) = #b(y). By a case analysis similar to the one in the previous part, we can show that y and z are also in L1 (for proving that for all y1, y2 s.t. y = y1 b y2, #a(y1) >= 1 + #b(y1) holds, we must use the fact that y is the shortest string satisfying the properties x = a y b z and #a(y) = #b(y)).
Since y and z are both shorter than x, by inductive hypothesis we have that y and z are in L(G1), namely T ->* y and T ->* z.
Hence we can obtain a derivation
T -> a T b T ->* a y b T ->* a y b z
which shows that the string a y b z, namely x, is in L(G1).
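Lemma 1 can be sanity-checked on short strings. The Python sketch below builds the strings of L(G1) up to a given length directly from the inductive definition (lambda, and a y b z for y, z already generated) and compares the result with the strings that satisfy the definition of L1. The names `generate_G1`, `in_L1`, `all_strings` and the bound `MAX_LEN = 8` are our own assumptions. Such a bounded check is of course no substitute for the proof.

```python
from itertools import product


def in_L1(x: str) -> bool:
    """Membership in L1, transcribed from its definition."""
    if x.count("a") != x.count("b"):
        return False
    return all(
        x[:i].count("a") >= 1 + x[:i].count("b")
        for i, c in enumerate(x)
        if c == "b"
    )


def generate_G1(max_len: int) -> set:
    """Strings of L(G1) of length <= max_len, from T -> lambda | a T b T."""
    lang = {""}  # base case: T -> lambda
    while True:
        # inductive step: from y and z already generated, build a y b z
        new = {
            "a" + y + "b" + z
            for y in lang
            for z in lang
            if len(y) + len(z) + 2 <= max_len
        }
        if new <= lang:  # fixpoint reached: nothing new can be generated
            return lang
        lang |= new


def all_strings(max_len: int):
    """All strings over {a,b} of length <= max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            yield "".join(letters)


MAX_LEN = 8
generated = generate_G1(MAX_LEN)
defined = {x for x in all_strings(MAX_LEN) if in_L1(x)}
print(generated == defined)  # True
```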

### Proof of Lemma 2 (G1 is not ambiguous)

We are going to prove this lemma by structural induction. Let x be a string in L(G1).
1. base case: x = lambda. In this case, there can be only one derivation tree for x: the one obtained by applying the production T -> lambda.
2. inductive step: x = a y b z, with y, z in L(G1). Observe that y and z are uniquely defined by these properties. Namely, if x = a v b w, with v, w in L(G1), then y = v and z = w.
In fact, assume by contradiction that y and v are different, and consider the case in which y is a proper prefix of v. Since a y b z = a v b w, we must have v = y b v' for some v'. By Lemma 1, v is in L1, hence by definition of L1 #a(y) >= 1 + #b(y) holds. This gives #a(y) != #b(y), which implies that y is not in L1 = L(G1). Contradiction.
The case in which v is a prefix of y is analogous.
Since y and v are equal, also z and w must be equal.
Now, by inductive hypothesis, y and z each have a unique derivation tree. Let us call these trees ty and tz respectively. Since T -> a T b T is the only production of G1 that generates a non-empty string, x admits only one derivation tree, which is the following one:
```
           T
         // | \
        a T  b  T
          |     |
          ty    tz
```
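The key observation of this proof, that the decomposition x = a y b z with y, z in L(G1) is unique, can be checked on short strings. In the Python sketch below, the function `splits` (a name of our own) lists all such decompositions, using Lemma 1 to test membership in L(G1) through the definition of L1. The bound of 10 on the length is an arbitrary assumption.

```python
from itertools import product


def in_L1(x: str) -> bool:
    """Membership in L1 (equal to L(G1) by Lemma 1)."""
    if x.count("a") != x.count("b"):
        return False
    return all(
        x[:i].count("a") >= 1 + x[:i].count("b")
        for i, c in enumerate(x)
        if c == "b"
    )


def splits(x: str) -> list:
    """All pairs (y, z) such that x = a y b z with y and z in L1."""
    return [
        (x[1:i], x[i + 1:])
        for i in range(1, len(x))
        if x[0] == "a" and x[i] == "b" and in_L1(x[1:i]) and in_L1(x[i + 1:])
    ]


print(splits("aabbab"))  # [('ab', 'ab')]

# Every non-empty string of L1 up to length 10 has exactly one decomposition.
for n in range(1, 11):
    for letters in product("ab", repeat=n):
        x = "".join(letters)
        if in_L1(x):
            assert len(splits(x)) == 1
```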

### Proof of Proposition 1 (L(G) = L)

Part 1: L(G) is contained in L
We need to prove that, for every x in L(G), #a(x) = #b(x). We are going to prove this property by structural induction.
• base case: x = lambda. We have: #a(lambda) = 0 = #b(lambda)
• inductive step, case 1: x = a y b z where y is in L(G1) and z is in L(G). We have:
#a(x) = #a(a y b z) = 1 + #a(y) + #a(z)
#b(x) = #b(a y b z) = #b(y) + 1 + #b(z).
By Lemma 1 #a(y) = #b(y) and by inductive hypothesis #a(z) = #b(z), hence #a(x) = #b(x).
• inductive step, case 2: x = b y a z where y is in L(G2) and z is in L(G). We have:
#a(x) = #a(b y a z) = #a(y) + 1 + #a(z)
#b(x) = #b(b y a z) = 1 + #b(y) + #b(z).
By Lemma 3 #a(y) = #b(y) and by inductive hypothesis #a(z) = #b(z), hence #a(x) = #b(x).
Part 2: L is contained in L(G)
We are going to prove this part by strong mathematical induction on the length of the strings. Let x be a string in L.
• if |x| = 0, then x = lambda and therefore x can be generated by using the production S -> lambda.
• if |x| > 0, and x starts with "a": Since #a(x) = #b(x), x must contain a matching "b" somewhere. Let us consider the shortest y such that x = a y b z for some z and #a(y) = #b(y). By a case analysis similar to the one in the proof of Lemma 1, we can show that y is in L1 and that z is in L (for proving that for all y1, y2 s.t. y = y1 b y2, #a(y1) >= 1 + #b(y1) holds, we must use the fact that y is the shortest string satisfying the properties x = a y b z and #a(y) = #b(y)).
By Lemma 1 we have that y is in L(G1), namely T ->* y.
Since z is shorter than x, by inductive hypothesis we have that z is in L(G), namely S ->* z.
Hence we can obtain a derivation
S -> a T b S ->* a y b S ->* a y b z
which shows that the string a y b z, namely x, is in L(G).
• if |x| > 0, and x starts with "b": the proof is analogous to the previous case.
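The construction used in Part 2 is effectively a recursive parsing procedure: find the shortest y, then recurse on y and z. The Python sketch below follows it to build a derivation tree in G. The names `shortest_split`, `derive` and `tree_yield`, and the representation of trees as nested tuples, are our own assumptions for illustration.

```python
from itertools import product


def shortest_split(x: str):
    """Return (y, z) with x = c y d z, where c = x[0], d is the other letter,
    and y is the shortest string with #a(y) = #b(y)."""
    first = x[0]
    other = "b" if first == "a" else "a"
    balance = 0  # (#first - #other) in the candidate y = x[1:i]
    for i in range(1, len(x)):
        if balance == 0 and x[i] == other:
            return x[1:i], x[i + 1:]
        balance += 1 if x[i] == first else -1
    raise ValueError("no matching symbol found")


def derive(symbol: str, x: str):
    """Derivation tree for x rooted at symbol (S, T or U), as nested tuples."""
    if x == "":
        return (symbol, "lambda")  # production symbol -> lambda
    inner = "T" if x[0] == "a" else "U"
    if symbol != "S" and symbol != inner:
        raise ValueError(f"{symbol} cannot generate a string starting with {x[0]!r}")
    y, z = shortest_split(x)
    other = "b" if x[0] == "a" else "a"
    # productions S -> a T b S | b U a S, T -> a T b T, U -> b U a U
    return (symbol, x[0], derive(inner, y), other, derive(symbol, z))


def tree_yield(tree) -> str:
    """String obtained by reading the leaves of the tree from left to right."""
    if tree[1] == "lambda":
        return ""
    _, first, left, other, right = tree
    return first + tree_yield(left) + other + tree_yield(right)


print(derive("S", "ab"))
# ('S', 'a', ('T', 'lambda'), 'b', ('S', 'lambda'))

# Bounded check: every string of L up to length 8 gets a tree with the right yield.
for n in range(9):
    for letters in product("ab", repeat=n):
        x = "".join(letters)
        if x.count("a") == x.count("b"):
            assert tree_yield(derive("S", x)) == x
```

Every node of the returned tree is an application of a production of G by construction, so a successful call to `derive("S", x)` exhibits a derivation of x in G.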

### Proof of Proposition 2 (G is not ambiguous)

We are going to prove this proposition by structural induction. Let x be a string in L(G).
1. base case: x = lambda. In this case, there can be only one derivation tree for x: the one obtained by applying the production S -> lambda.
2. inductive step: x = a y b z, with y in L(G1) and z in L(G). Observe that y and z are uniquely defined by these properties. Namely, if x = a v b w, with v in L(G1) and w in L(G), then y = v and z = w.
In fact, assume by contradiction that y and v are different, and consider the case in which y is a proper prefix of v. Since a y b z = a v b w, we must have v = y b v' for some v'. By Lemma 1, v is in L1, hence by definition of L1 #a(y) >= 1 + #b(y) holds. This gives #a(y) != #b(y), which implies that y is not in L1 = L(G1). Contradiction.
The case in which v is a prefix of y is analogous.
Since y and v are equal, also z and w must be equal.
Now, by Lemma 2, y has a unique derivation tree with root T and, by inductive hypothesis, z has a unique derivation tree with root S. Let us call these trees ty and tz respectively. Since x starts with a, the only production that can be applied at the root is S -> a T b S. Hence x admits only one derivation tree, which is the following one:
```
           S
         // | \
        a T  b  S
          |     |
          ty    tz
```
3. inductive step: x = b y a z, with y in L(G2) and z in L(G). The proof is analogous to the previous case, with G2 and L2 in place of G1 and L1.
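Propositions 1 and 2 together say that a string over {a,b} has exactly one derivation tree in G if it has the same number of a's and b's, and none otherwise. The Python sketch below checks this on all strings up to a fixed length by counting derivation trees. The names `PRODUCTIONS` and `count_trees`, the encoding of the productions as triples, and the bound `MAX_LEN = 8` are our own assumptions. It is a bounded sanity check, not a proof.

```python
from functools import lru_cache
from itertools import product

# Non-lambda productions of G, encoded as (first letter, inner symbol, closing letter):
# S -> a T b S | b U a S,  T -> a T b T,  U -> b U a U.
PRODUCTIONS = {
    "S": [("a", "T", "b"), ("b", "U", "a")],
    "T": [("a", "T", "b")],
    "U": [("b", "U", "a")],
}


@lru_cache(maxsize=None)
def count_trees(symbol: str, x: str) -> int:
    """Number of derivation trees for x rooted at symbol in G."""
    total = 1 if x == "" else 0  # production symbol -> lambda
    for first, inner, closer in PRODUCTIONS[symbol]:
        if x.startswith(first):
            # try every position of the closing letter: x = first y closer z
            for i in range(1, len(x)):
                if x[i] == closer:
                    total += count_trees(inner, x[1:i]) * count_trees(symbol, x[i + 1:])
    return total


MAX_LEN = 8
for n in range(MAX_LEN + 1):
    for letters in product("ab", repeat=n):
        x = "".join(letters)
        expected = 1 if x.count("a") == x.count("b") else 0
        assert count_trees("S", x) == expected

print(count_trees("S", "abab"))  # 1
```

For abab, the split at the last b no longer contributes a tree, because it would require T to generate ba, which is impossible.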