Fall 2000, CSE 468: Lectures 25 and 26

Proofs by induction on context-free grammars

In these lectures we will show how to prove certain properties of context-free grammars by induction, in particular the property of generating a given language and the property of being unambiguous.

Our running example will be a grammar for generating the language of the strings with the same number of a's and b's, namely the language

L = { x in {a,b}^* | #_a(x) = #_b(x) }

We want to define a non-ambiguous grammar G for such a language, and prove formally that indeed L(G) = L and that G is non-ambiguous.

Note that the grammar

S -> lambda | a S b | b S a | S S

generates (intuitively) the given language, but it is ambiguous. For instance, the string abab has two different derivation trees:

      S                S
    / | \            /   \
   a  S  b         S       S
    / | \        / | \   / | \
   b  S  a      a  S  b a  S  b
      |            |       |
   lambda       lambda   lambda

Consider now the grammar

S -> lambda | a S b S | b S a S

This grammar also generates the given language, but it is still ambiguous. The string abab still has two different derivation trees:

         S                        S
      // | \                   // | \
    a S  b   S               a S  b  S
      |   // | \             //| \   |
  lambda a S b  S          b S a  S lambda
           |    |            |    |
       lambda lambda      lambda lambda
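
The ambiguity can also be observed by counting derivation trees mechanically. The following is a minimal sketch, not part of the original lecture: the function name count_trees is our own, and it assumes the input contains only the symbols a and b. It relies on the fact that any tree for a nonempty string must use S -> a S b S or S -> b S a S at the root, so it is enough to try every position of the closing symbol.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_trees(x: str) -> int:
    """Count derivation trees for x in  S -> lambda | a S b S | b S a S.

    Hypothetical helper for illustration; assumes x is over {a, b}.
    """
    if x == "":
        return 1  # only S -> lambda derives the empty string
    if x[0] not in "ab":
        return 0
    # The first symbol fixes the root production, and hence the closing symbol.
    closing = "b" if x[0] == "a" else "a"
    total = 0
    for i in range(1, len(x)):
        if x[i] == closing:
            # Split x as  x[0] . x[1:i] . closing . x[i+1:]
            total += count_trees(x[1:i]) * count_trees(x[i + 1:])
    return total


print(count_trees("abab"))  # 2
```

The two trees counted for abab correspond to the two choices of the closing b (the first b or the last b), which are exactly the two trees drawn above.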

Intuitively, the second grammar is ambiguous because in the production S -> a S b S we do not force the b to be the "matching b" for the a. The production S -> b S a S has the analogous problem. Intuitively, we could eliminate the ambiguity by forcing the first S in a S b S to generate "the shortest string" with the same number of a's and b's, and analogously for b S a S. This can be done by introducing new syntactic categories (non-terminal symbols) T and U, and by stratifying the productions as follows:

S -> lambda | a T b S | b U a S
T -> lambda | a T b T
U -> lambda | b U a U

We will call G the above grammar consisting of all the productions for S, T and U. We will also call G₁ the subgrammar consisting of the productions

T -> lambda | a T b T

and we will call G₂ the subgrammar consisting of the productions

U -> lambda | b U a U

Intuitively, G₁ generates the language of the "balanced" strings on a and b, namely those strings with the same number of a's and b's in which each b comes after the corresponding a. Examples of such strings are

ab , abab , aabb , aabbab , ...

This language can be formally defined as the language

L₁ = { x in {a,b}^* | #_a(x) = #_b(x) and for all x₁, x₂ s.t. x = x₁b x₂, #_a(x₁) >= 1 + #_b(x₁) }

Note that if we replaced "a" by "(" and "b" by ")", then G₁ would generate the language of balanced parentheses, namely all the strings of the form

( ) , ( )( ) , (( )) , (( ))( ) , ... etc.

The grammar G₂ generates the symmetric language, with the roles of a and b exchanged. Formally we can define the language (intuitively) generated by G₂ as

L₂ = { x in {a,b}^* | #_a(x) = #_b(x) and for all x₁, x₂ s.t. x = x₁a x₂, #_b(x₁) >= 1 + #_a(x₁) }
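
The definitions of L, L₁ and L₂ translate directly into membership tests. The sketch below is only an illustration under our own naming (in_L, in_L1, in_L2 and the helper leads are not from the lecture); it checks the prefix condition literally, once for every occurrence of the relevant symbol.

```python
def in_L(x: str) -> bool:
    """x is in L: same number of a's and b's."""
    return x.count("a") == x.count("b")


def leads(x: str, opener: str, closer: str) -> bool:
    """For every split x = x1 closer x2, check #opener(x1) >= 1 + #closer(x1)."""
    for i, symbol in enumerate(x):
        if symbol == closer:
            prefix = x[:i]
            if prefix.count(opener) < 1 + prefix.count(closer):
                return False
    return True


def in_L1(x: str) -> bool:
    """x is in L1: balanced, with every b preceded by its matching a."""
    return in_L(x) and leads(x, "a", "b")


def in_L2(x: str) -> bool:
    """x is in L2: the symmetric language, with a and b exchanged."""
    return in_L(x) and leads(x, "b", "a")


print(in_L1("aabbab"), in_L1("abba"), in_L2("bbaa"))  # True False True
```

Note that abba is in L but neither in L₁ nor in L₂, which is why the productions for S need both T and U.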

We are now going to prove by mathematical induction and by structural induction that G generates the required language, namely

Proposition 1 L(G) = L

and that G is unambiguous, namely:

Proposition 2 For all x in L(G), there exists only one derivation tree in G for x.

In order to prove Propositions 1 and 2, we need to prove analogous properties for the subgrammars G₁ and G₂. Specifically, we need to prove the following lemmata:

Lemma 1 L(G₁) = L₁

Lemma 2 For all x in L(G₁), there exists only one derivation tree in G₁ for x.

Lemma 3 L(G₂) = L₂

Lemma 4 For all x in L(G₂), there exists only one derivation tree in G₂ for x.

We are going to prove only Lemmata 1 and 2; the proofs of Lemmata 3 and 4 are analogous.

Proof of Lemma 1 (L(G₁) = L₁ )

Part 1: L(G₁) is contained in L₁

We need to prove that, for every x in L(G₁), x enjoys the following properties:

#_a(x) = #_b(x)
for all x₁, x₂ s.t. x = x₁b x₂, #_a(x₁) >= 1 + #_b(x₁)

We are going to prove these properties by structural induction (remember that L(G₁) can be seen as defined inductively).

base case: x = lambda. We have:
1. #_a(lambda) = 0 = #_b(lambda)
2. there exists no x₁, x₂ s.t. x = x₁b x₂, hence this point is trivially satisfied
inductive step: x = a y b z where y and z are in L(G₁) as well. We have:
1. #_a(x) = #_a(a y b z) = 1 + #_a(y) + #_a(z)
  #_b(x) = #_b(a y b z) = #_b(y) + 1 + #_b(z).
  By inductive hypothesis #_a(y) = #_b(y) and #_a(z) = #_b(z), hence #_a(x) = #_b(x).
2. Let x = x₁ b x₂. We have three cases, depending on which b of a y b z is the one singled out:
  - x₁ = a y, namely the b is the one introduced by the production. In this case we have:
    #_a(x₁) = 1 + #_a(y) = (by inductive hypothesis, point 1) 1 + #_b(y) = 1 + #_b(x₁)
  - x₁ = a y₁ where y₁ is a prefix of y, namely y = y₁b y₂. In this case we have:
    #_a(x₁) = 1 + #_a(y₁) >= (by inductive hypothesis, point 2) 1 + 1 + #_b(y₁) = 2 + #_b(x₁) > 1 + #_b(x₁)
  - x₁ = a y b z₁ where z₁ is a prefix of z, namely z = z₁b z₂. In this case we have:
    #_a(x₁) = 1 + #_a(y) + #_a(z₁) >= (by inductive hypothesis, point 1 for y and point 2 for z) 1 + #_b(y) + 1 + #_b(z₁) = 1 + #_b(x₁)

Part 2: L₁ is contained in L(G₁)

We are going to prove this part by strong mathematical induction on the length of the strings. Remember that the principle of strong mathematical induction has the following schema:

If, for all n, the fact that P(k) holds for all k < n implies P(n), then we can deduce that P(n) holds for all n.

In practice, the principle of strong mathematical induction allows us to use the inductive hypothesis on all k < n, rather than just on n-1.

Let x be a string in L₁.

if |x| = 0, then x = lambda and therefore x can be generated by using the production T -> lambda.
if |x| > 0, then x must start with "a" (by definition of L₁: if x started with "b", the condition would fail for x₁ = lambda), and must contain a matching "b" somewhere. Let us consider the shortest y such that x = a y b z for some z and #_a(y) = #_b(y). By a case analysis similar to the one in the previous part, we can show that y and z are also in L₁ (to prove that #_a(y₁) >= 1 + #_b(y₁) holds for all y₁, y₂ s.t. y = y₁b y₂, we must use the fact that y is the shortest string satisfying the properties x = a y b z and #_a(y) = #_b(y)).
Since y and z are shorter than x, by inductive hypothesis we have that y and z are in L(G₁), namely T ->^* y and T ->^* z.
Hence we can obtain a derivation
T -> a T b T ->^* a y b T ->^* a y b z
which shows that the string a y b z, namely x, is in L(G₁).
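
The argument of Part 2 is constructive, and can be read as a recursive procedure that builds a derivation tree in G₁. The following sketch is our own illustration (the names split_T, derive_T and leaves, and the tuple encoding of trees, are assumptions, not part of the lecture). It finds the shortest y by scanning for the first b at which the number of a's and b's seen after the initial a is the same.

```python
def split_T(x: str):
    """Return (y, z) with x = a y b z and y the shortest string having
    #_a(y) = #_b(y), or None if no such decomposition exists."""
    if not x or x[0] != "a":
        return None
    balance = 0  # #_a minus #_b of x[1:i]
    for i in range(1, len(x)):
        if x[i] == "b" and balance == 0:
            return x[1:i], x[i + 1:]
        balance += 1 if x[i] == "a" else -1
    return None


def derive_T(x: str):
    """Build a derivation tree for x rooted at T, or return None.

    Trees are encoded as ("T", "lambda") or ("T", "a", ty, "b", tz).
    """
    if x == "":
        return ("T", "lambda")  # production T -> lambda
    parts = split_T(x)
    if parts is None:
        return None
    y, z = parts
    ty, tz = derive_T(y), derive_T(z)  # y and z are shorter than x
    if ty is None or tz is None:
        return None
    return ("T", "a", ty, "b", tz)  # production T -> a T b T


def leaves(tree) -> str:
    """Read back the string generated by a tree built by derive_T."""
    if tree == ("T", "lambda"):
        return ""
    _, a, ty, b, tz = tree
    return a + leaves(ty) + b + leaves(tz)


tree = derive_T("aabbab")
print(leaves(tree))      # aabbab
print(derive_T("abba"))  # None
```

The recursive calls are made on y and z, which are strictly shorter than x; this is exactly where the proof needs strong induction rather than induction on n-1.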

Proof of Lemma 2 (G₁ is not ambiguous)

We are going to prove this lemma by structural induction. Let x be a string in L(G₁).

base case: x = lambda. In this case, there can be only one derivation tree for x: the one obtained by applying the production T -> lambda.
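inductive step: x = a y b z, with y, z in L(G₁). Observe that y and z are uniquely defined by these properties. Namely, if x = a v b w, with v, w in L(G₁), then y = v and z = w.
In fact, assume by contradiction that y and v are different. Since a y and a v are both prefixes of x, one of y and v must be a proper prefix of the other. Consider the case in which y is a proper prefix of v. Since x = a y b z, the symbol following y in x is b, so we must have v = y b v' for some v'. By Lemma 1, v is in L₁, hence by definition of L₁, #_a(y) >= 1 + #_b(y) holds. This implies that #_a(y) is different from #_b(y), namely that y is not in L₁ = L(G₁). Contradiction.
The case in which v is a proper prefix of y is analogous.
Since y and v are equal, also z and w must be equal.
Now, by inductive hypothesis, y and z each have a unique derivation tree. Let us call these trees ty and tz respectively. Since x is not empty, every derivation tree for x must apply the production T -> a T b T at the root, and we have just seen that its two subtrees must generate y and z. Hence x admits only one derivation tree, which is the following one: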
```
           T
        // | \
      a T  b  T
        |     |
        ty    tz
    
```
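
The key observation of this proof, namely that the decomposition x = a y b z with y, z in L(G₁) is unique, can be tested exhaustively on short strings. The sketch below is only a sanity check under our own assumptions: it uses Lemma 1 to replace membership in L(G₁) by the predicate in_L1, and the names in_L1, splits and first_counterexample are hypothetical.

```python
from itertools import product


def in_L1(x: str) -> bool:
    """Membership in L1, checked directly from the definition."""
    if x.count("a") != x.count("b"):
        return False
    for i, symbol in enumerate(x):
        prefix = x[:i]
        if symbol == "b" and prefix.count("a") < 1 + prefix.count("b"):
            return False
    return True


def splits(x: str):
    """All pairs (v, w) with x = a v b w and v, w in L1."""
    if not x or x[0] != "a":
        return []
    return [
        (x[1:i], x[i + 1:])
        for i in range(1, len(x))
        if x[i] == "b" and in_L1(x[1:i]) and in_L1(x[i + 1:])
    ]


def first_counterexample(max_len: int = 8):
    """Return a nonempty string whose number of splits is not the expected
    one (1 if the string is in L1, 0 otherwise), or None if there is none."""
    for n in range(1, max_len + 1):
        for letters in product("ab", repeat=n):
            x = "".join(letters)
            expected = 1 if in_L1(x) else 0
            if len(splits(x)) != expected:
                return x
    return None


print(splits("aabbab"))        # [('ab', 'ab')]
print(first_counterexample())  # None
```

Of course a check on strings of length at most 8 does not replace the proof; it only confirms that the statement holds on small instances.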

Proof of Proposition 1 (L(G) = L)

Part 1: L(G) is contained in L

We need to prove that, for every x in L(G), #_a(x) = #_b(x). We are going to prove this property by structural induction.

base case: x = lambda. We have: #_a(lambda) = 0 = #_b(lambda)
inductive step, case 1: x = a y b z where y is in L(G₁) and z is in L(G). We have:
#_a(x) = #_a(a y b z) = 1 + #_a(y) + #_a(z)
#_b(x) = #_b(a y b z) = #_b(y) + 1 + #_b(z).
By Lemma 1 #_a(y) = #_b(y) and by inductive hypothesis #_a(z) = #_b(z), hence #_a(x) = #_b(x).
inductive step, case 2: x = b y a z where y is in L(G₂) and z is in L(G). We have:
#_a(x) = #_a(b y a z) = #_a(y) + 1 + #_a(z)
#_b(x) = #_b(b y a z) = 1 + #_b(y) + #_b(z).
By Lemma 3 #_a(y) = #_b(y) and by inductive hypothesis #_a(z) = #_b(z), hence #_a(x) = #_b(x).

Part 2: L is contained in L(G)

We are going to prove this part by strong mathematical induction on the length of the strings. Let x be a string in L.

if |x| = 0, then x = lambda and therefore x can be generated by using the production S -> lambda.
if |x| > 0, and x starts with "a": Since #_a(x) = #_b(x), x must contain a matching "b" somewhere. Let us consider the shortest y such that x = a y b z for some z and #_a(y) = #_b(y). By a case analysis similar to the one in the proof of Lemma 1, we can show that y is in L₁ (to prove that #_a(y₁) >= 1 + #_b(y₁) holds for all y₁, y₂ s.t. y = y₁b y₂, we must use the fact that y is the shortest string satisfying the properties x = a y b z and #_a(y) = #_b(y)). Moreover z is in L, because #_a(z) = #_a(x) - 1 - #_a(y) = #_b(x) - 1 - #_b(y) = #_b(z).
By Lemma 1 we have that y is in L(G₁), namely T ->^* y.
By inductive hypothesis we have that z is in L(G), namely S ->^* z.
Hence we can obtain a derivation
S -> a T b S ->^* a y b S ->^* a y b z
which shows that the string a y b z, namely x, is in L(G).
if |x| > 0, and x starts with "b": the proof is analogous to the previous case, using the production S -> b U a S, the language L₂ and Lemma 3.

Proof of Proposition 2 (G is not ambiguous)

We are going to prove this proposition by structural induction. Let x be a string in L(G).

base case: x = lambda. In this case, there can be only one derivation tree for x: the one obtained by applying the production S -> lambda.
inductive step: x = a y b z, with y in L(G₁) and z in L(G). Observe that y and z are uniquely defined by these properties. Namely, if x = a v b w, with v in L(G₁) and w in L(G), then y = v and z = w.
In fact, assume by contradiction that y and v are different. Since a y and a v are both prefixes of x, one of y and v must be a proper prefix of the other. Consider the case in which y is a proper prefix of v. Since x = a y b z, the symbol following y in x is b, so we must have v = y b v' for some v'. By Lemma 1, v is in L₁, hence by definition of L₁, #_a(y) >= 1 + #_b(y) holds. This implies that #_a(y) is different from #_b(y), namely that y is not in L₁ = L(G₁). Contradiction.
The case in which v is a proper prefix of y is analogous.
Since y and v are equal, also z and w must be equal.
Now, by Lemma 2, y has a unique derivation tree with root T, and by inductive hypothesis, z has a unique derivation tree with root S. Let us call these trees ty and tz respectively. Since x starts with a, every derivation tree for x must apply the production S -> a T b S at the root. Hence x admits only one derivation tree, which is the following one:
```
           S
        // | \
      a T  b  S
        |     |
        ty    tz
    
```
inductive step: x = b y a z, with y in L(G₂) and z in L(G). The proof is analogous to the previous case, using Lemmata 3 and 4 in place of Lemmata 1 and 2.
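
Propositions 1 and 2 together say that a string over {a,b} has exactly one derivation tree in G if it is in L, and none otherwise. As a sanity check on small instances, we can count derivation trees in G with a table-driven procedure. This is a sketch under our own conventions (the table PRODUCTIONS and the function names are hypothetical); it exploits the fact that every production of G other than X -> lambda has the shape X -> c Y d Z.

```python
from functools import lru_cache
from itertools import product

# Non-lambda productions of G, each of the form  X -> c Y d Z.
# Every nonterminal also has the production  X -> lambda.
PRODUCTIONS = {
    "S": [("a", "T", "b", "S"), ("b", "U", "a", "S")],
    "T": [("a", "T", "b", "T")],
    "U": [("b", "U", "a", "U")],
}


@lru_cache(maxsize=None)
def count_trees(symbol: str, x: str) -> int:
    """Number of derivation trees for x with root symbol in G."""
    if x == "":
        return 1  # only X -> lambda derives the empty string
    total = 0
    for first, left, second, right in PRODUCTIONS[symbol]:
        if x[0] != first:
            continue
        for i in range(1, len(x)):
            if x[i] == second:
                # x = first . x[1:i] . second . x[i+1:]
                total += count_trees(left, x[1:i]) * count_trees(right, x[i + 1:])
    return total


def first_counterexample(max_len: int = 8):
    """Return a string whose number of trees is not 1 (if in L) or 0
    (if not in L), or None if there is no such string up to max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            x = "".join(letters)
            expected = 1 if x.count("a") == x.count("b") else 0
            if count_trees("S", x) != expected:
                return x
    return None


print(count_trees("S", "abab"))  # 1
print(first_counterexample())    # None
```

In particular the string abab, which had two derivation trees in the second grammar, has only one in G: the split with y = ba is rejected because ba cannot be derived from T.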