CSE 428: Lecture notes 2 

Derivation Tree

Derivation trees (also called "parse trees" in Sethi's book) are a way to represent the generation of strings in a grammar. They also give information about the structure of the strings, i.e. the way they are organized in syntactical categories.
Definition
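Given a grammar  G = < T , N , s , P > , a derivation tree t for G is a tree such that:
  1. Every node is labeled with a symbol in T or N.
  2. The root is labeled with the start symbol s.
  3. Every internal node is labeled with a non-terminal A, and its children, read from left to right, are labeled X1 ... Xn for some production A ::= X1 ... Xn in P.
The labels of the leaves (the fringe), read from left to right, represent the string generated by t. We will indicate it by string(t).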
It is easy to see that a derivation tree represents a set of derivations (usually more than one) for the same string, and that for each derivation there is a derivation tree for the same string. Hence L(G) coincides with the set of strings generated by all possible derivation trees for G. More formally, if we denote by DT(G) the set of all derivation trees for G, we have the following result:
Proposition
L(G) = { alpha in T* | alpha = string(t) for some t in DT(G) }
Example
Let us consider again the language of numerical expressions, with productions
Exp ::= Num | Exp + Exp | Exp * Exp
We have that a possible derivation tree for the string 2 + 3 * 5 is the following:


          Exp 
          /|\ 
         / | \ 
        /  |  \ 
      Exp  +  Exp 
       |      /|\ 
       |     / | \ 
       |    /  |  \ 
      Num Exp  *  Exp 
       |   |       | 
       2  Num     Num 
           |       | 
           3       5 

This tree corresponds to several derivations for the same string, which differ only in the choice of the non-terminal to expand at each derivation step.

Ambiguity

The structure of an expression is usually essential to interpret its meaning. The expression 2 + 3 * 5 for example has two different values depending on its intended structure: If we assume it to be 2 + ( 3 * 5 ) (i.e. 3 and 5 grouped together by *) then the result is 17. If, on the other hand, we assume it to be ( 2 + 3 ) * 5, then the result is 25. In order to avoid this kind of ambiguity, it is essential that the grammar generates only one possible structure for each string in the language. Since the structure is represented by the derivation tree, we have the following definition:
Definition
A grammar G is ambiguous if there exists a string in L(G) which can be derived by two (or more) different derivation trees.
Example
The grammar in the example above is ambiguous: the string 2 + 3 * 5 can also be generated by the following tree:

             Exp 
             /|\ 
            / | \ 
           /  |  \ 
         Exp  *  Exp 
         /|\      | 
        / | \    Num 
       /  |  \    | 
     Exp  +  Exp  5 
      |       | 
     Num     Num 
      |       | 
      2       3 
This tree corresponds to the grouping ( 2 + 3 ) * 5, while the tree in the example above corresponds to 2 + ( 3 * 5 ).
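The ambiguity can also be observed mechanically. The following sketch enumerates every derivation tree of a token list under Exp ::= Num | Exp + Exp | Exp * Exp and evaluates each one. The compact tuple encoding and the function names trees and value are assumptions chosen for illustration, and numbers are represented directly as Python integers.

```python
def trees(tokens):
    """All derivation trees of tokens under Exp ::= Num | Exp + Exp | Exp * Exp.

    A number n stands for the sub-tree Exp -> Num -> n; a tuple
    (left, op, right) stands for an application of Exp ::= Exp op Exp.
    """
    if len(tokens) == 1:
        return [tokens[0]] if isinstance(tokens[0], int) else []
    result = []
    # Try every operator occurrence as the one introduced at the root.
    for i in range(1, len(tokens) - 1):
        if tokens[i] in ("+", "*"):
            for left in trees(tokens[:i]):
                for right in trees(tokens[i + 1:]):
                    result.append((left, tokens[i], right))
    return result


def value(t):
    """Evaluate a tree according to its structure."""
    if isinstance(t, int):
        return t
    left, op, right = t
    if op == "+":
        return value(left) + value(right)
    return value(left) * value(right)


for t in trees([2, "+", 3, "*", 5]):
    print(t, "=", value(t))
# (2, '+', (3, '*', 5)) = 17
# ((2, '+', 3), '*', 5) = 25
```

Two trees are found for the same string, with two different values, which is exactly what the definition of ambiguity describes.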
There are languages which are intrinsically ambiguous, i.e. it is not possible to eliminate their ambiguities without changing the language.
Definition
A language L is intrinsically ambiguous if it can be generated only by ambiguous grammars, i.e. for every grammar G such that L=L(G), we have that G is ambiguous.
Luckily, languages which are interesting from the point of view of programming are usually not intrinsically ambiguous, and therefore we can find non-ambiguous grammars which generate them. When a (non-intrinsically ambiguous) language L is presented by an ambiguous grammar G, "to eliminate the ambiguities of G" means to find another grammar G', which is non-ambiguous, and which generates the same language L.

We will consider three common examples of ambiguities, and the way to eliminate them:
  1. Precedence
  2. Associativity
  3. Dangling-else

Precedence

In the examples above, the ambiguity in the interpretation of 2 + 3 * 5 can be eliminated by imposing the precedence of one operator over the other. We say that op has precedence over op' if an expression of the form
e1 op e2 op' e3 (respectively e1 op' e2 op e3 )
is interpreted only as
(e1 op e2) op' e3 (respectively e1 op' (e2 op e3) )
In other words, the grouping power of op is greater than the grouping power of op'.

From the point of view of derivation trees, the fact that e1 op e2 op' e3 is interpreted as (e1 op e2) op' e3 means that op must be introduced at a level strictly lower than op', i.e. in a sub-tree rooted at one of the operands of op'. In order to modify the grammar so that it generates only this kind of tree, a possible solution is to introduce a new syntactic category producing expressions of the form e1 op e2, and to force a hierarchical order w.r.t. the main category of expressions of the form e1 op' e2.

Example
We can eliminate the ambiguities from the grammar in the example above by introducing a new syntactic category Term producing expressions of the form
e1 * e2
where e1 and e2 may contain * again, but not +. This can be done by organizing hierarchically the productions as follows:
Exp ::= Exp + Term | Term
Term ::= Term * Num | Num
This modification corresponds to assigning * a higher priority w.r.t. + (following the mathematical convention). Consider again the string 2 + 3 * 5. It is easy to see that in the new grammar there is only one tree which can generate it:
         Exp 
         /|\ 
        / | \ 
       /  |  \ 
     Exp  +  Term 
      |      /|\ 
      |     / | \ 
      |    /  |  \ 
    Term Term *  Num 
      |    |      | 
     Num  Num     5 
      |    | 
      2    3 
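A minimal sketch of a parser for the stratified grammar shows how the hierarchy fixes the structure. The function names are hypothetical, and one technique is swapped in: since a left-recursive production such as Exp ::= Exp + Term cannot be followed literally by a recursive procedure, each one is read as a loop (Term followed by zero or more "+ Term"), which builds the same left-leaning trees. Trees use a compact tuple encoding in which an integer stands for a Num.

```python
def parse_exp(tokens):
    """Exp ::= Exp + Term | Term, read as Term (+ Term)*, grouped to the left."""
    tree, rest = parse_term(tokens)
    while rest and rest[0] == "+":
        right, rest = parse_term(rest[1:])
        tree = (tree, "+", right)
    return tree, rest


def parse_term(tokens):
    """Term ::= Term * Num | Num, read as Num (* Num)*, grouped to the left."""
    tree, rest = parse_num(tokens)
    while rest and rest[0] == "*":
        right, rest = parse_num(rest[1:])
        tree = (tree, "*", right)
    return tree, rest


def parse_num(tokens):
    """Num is assumed to be a single integer token."""
    if not tokens or not isinstance(tokens[0], int):
        raise SyntaxError("number expected")
    return tokens[0], tokens[1:]


def parse(tokens):
    """Return the unique tree for tokens, or raise SyntaxError."""
    tree, rest = parse_exp(tokens)
    if rest:
        raise SyntaxError("unexpected token: %r" % (rest[0],))
    return tree


print(parse([2, "+", 3, "*", 5]))   # (2, '+', (3, '*', 5))
print(parse([2, "*", 3, "+", 5]))   # ((2, '*', 3), '+', 5)
```

Because * can only be introduced inside a Term, and a Term can only appear below a +, the grouping (2 + 3) * 5 is no longer derivable.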

Associativity

Consider again the grammar for numerical expressions in the previous example, and consider the grammar obtained by modifying the productions for Exp in the following way:
Exp ::= Exp + Exp | Term
This new grammar is ambiguous. In fact, it allows two different derivation trees for the string 2 + 3 + 5: one corresponding to the structure (2 + 3) + 5 and one corresponding to the structure 2 + (3 + 5).

In the case of the + operator, this kind of ambiguity is not a problem, because of its algebraic properties: + is associative, i.e. (2 + 3) + 5 and 2 + (3 + 5) have the same value.

In general, however, an operator might not be associative. This is for instance the case for the - and ^ (exponentiation) operators: (5 - 3) - 2 and 5 - (3 - 2) have different values, as well as (5 ^ 3) ^ 2 and 5 ^ (3 ^ 2).

In order to eliminate this kind of ambiguity, we must establish whether the operator is left-associative or right-associative. Left-associative means that e1 op e2 op e3 is interpreted as (e1 op e2) op e3 (op associates to the left). Vice versa, right-associative means that it is interpreted as e1 op (e2 op e3) (op associates to the right).

We can impose left-associativity (resp. right-associativity) by using the following technique: in the production introducing op, we place the syntactic category producing op to the left (resp. to the right) of op. Note that in the example of the Precedence section this is done for both + and *: the productions Exp ::= Exp + Term and Term ::= Term * Num force them to be left-associative.

Example
Consider the following grammar (productions) for numerical expressions constructed with the - operation:
Exp ::= Num | Exp - Exp
This grammar is ambiguous since it allows both the interpretations (5 - 3) - 2 and 5 - (3 - 2). If we want to impose left-associativity (following the mathematical convention), it is sufficient to modify the productions in the following way:
Exp ::= Num | Exp - Num
Example
Consider the following grammar (productions) for numerical expressions constructed with the ^ operation:
Exp ::= Num | Exp ^ Exp
This grammar is ambiguous since it allows both the interpretations (5 ^ 3) ^ 2 and 5 ^ (3 ^ 2). If we want to impose right-associativity (following the mathematical convention), it is sufficient to modify the productions in the following way:
Exp ::= Num | Num ^ Exp
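The effect of the two modified grammars can be sketched as follows. The function names are hypothetical, the input is assumed to be a well-formed token list, and trees use a compact tuple encoding in which an integer stands for a Num. The left-recursive production for - is read as a loop that extends the tree on the left, while the right-recursive production for ^ is followed directly by a recursive call.

```python
def parse_minus(tokens):
    """Exp ::= Num | Exp - Num: the recursive category is on the left of '-'."""
    tree, rest = tokens[0], tokens[1:]
    while rest:
        op, operand, rest = rest[0], rest[1], rest[2:]
        if op != "-":
            raise SyntaxError("'-' expected")
        tree = (tree, "-", operand)   # everything parsed so far goes to the left
    return tree


def parse_power(tokens):
    """Exp ::= Num | Num ^ Exp: the recursive category is on the right of '^'."""
    if len(tokens) == 1:
        return tokens[0]
    if tokens[1] != "^":
        raise SyntaxError("'^' expected")
    return (tokens[0], "^", parse_power(tokens[2:]))


def value(t):
    """Evaluate a tree according to its structure."""
    if isinstance(t, int):
        return t
    left, op, right = t
    if op == "-":
        return value(left) - value(right)
    return value(left) ** value(right)


minus_tree = parse_minus([5, "-", 3, "-", 2])
power_tree = parse_power([5, "^", 3, "^", 2])
print(minus_tree, "=", value(minus_tree))   # ((5, '-', 3), '-', 2) = 0
print(power_tree, "=", value(power_tree))   # (5, '^', (3, '^', 2)) = 1953125
```

With the opposite groupings the values would be 5 - (3 - 2) = 4 and (5 ^ 3) ^ 2 = 15625, so the choice of associativity does change the meaning for these operators.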