CSE 428: Lecture notes 2

Derivation Tree

Derivation trees (also called "parse trees" in Sethi's book) are a way to represent the generation of strings in a grammar. They also give information about the structure of the strings, i.e. the way they are organized into syntactic categories.

Definition

Given a grammar G = < T , N , s , P >, a derivation tree t for G is a tree such that:

  - the root is labeled by s
  - the leaves are labeled by terminal symbols
  - each intermediate node is labeled by a non-terminal symbol, and, if its label is A, then its children are labeled by symbols s_1 , s_2 , ... , s_n such that there exists a production A ::= s_1 s_2 ... s_n in P

The labels of the leaves (the fringe), read from left to right, represent the string generated by t. We will denote it by string(t).

It is easy to see that a derivation tree represents a set of derivations (usually more than one) for the same string, and that for each derivation there is a derivation tree for the same string. Hence L(G) coincides with the set of strings generated by all possible derivation trees for G. More formally, if we denote by DT(G) the set of all derivation trees for G, we have the following result:

Proposition

  L(G) = { alpha in T* | alpha = string(t) for some t in DT(G) }

Example

Let us consider again the language of numerical expressions, with productions

  Exp ::= Num | Exp + Exp | Exp * Exp

A possible derivation tree for the string 2 + 3 * 5 is the following:

          Exp
          /|\
         / | \
        /  |  \
     Exp   +   Exp
      |        /|\
      |       / | \
      |      /  |  \
     Num  Exp   *   Exp
      |    |         |
      2   Num       Num
           |         |
           3         5

This tree corresponds to several derivations for the same string, which differ only in the choice of the non-terminal to expand at each derivation step.

Ambiguity

The structure of an expression is usually essential to interpret its meaning. The expression 2 + 3 * 5, for example, has two different values depending on its intended structure:
If we assume it to be 2 + ( 3 * 5 ) (i.e. 3 and 5 grouped together by *), then the result is 17.
If, on the other hand, we assume it to be ( 2 + 3 ) * 5, then the result is 25.

In order to avoid this kind of ambiguity, it is essential that the grammar generates only one possible structure for each string in the language. Since the structure is represented by the derivation tree, we have the following definition:

Definition

A grammar G is ambiguous if there exists a string in L(G) which can be derived by two (or more) different derivation trees.

Example

The grammar in the example above is ambiguous. In fact, the string 2 + 3 * 5 can also be generated by the following tree:

              Exp
              /|\
             / | \
            /  |  \
         Exp   *   Exp
         /|\        |
        / | \      Num
       /  |  \      |
    Exp   +   Exp   5
     |         |
    Num       Num
     |         |
     2         3

This tree corresponds to the grouping ( 2 + 3 ) * 5, while the tree in the example above corresponds to 2 + ( 3 * 5 ).

There are languages which are intrinsically ambiguous, i.e. it is not possible to eliminate their ambiguities without changing the language.

Definition

A language L is intrinsically ambiguous if it can be generated only by ambiguous grammars, i.e. for every grammar G such that L = L(G), we have that G is ambiguous.

Luckily, languages which are interesting from the point of view of programming usually are not intrinsically ambiguous, and therefore we can find non-ambiguous grammars which generate them.

When a (non-intrinsically ambiguous) language L is presented by an ambiguous grammar G, "to eliminate the ambiguities of G" means to find another grammar G', which is non-ambiguous, and which generates the same language L.

We will consider three common examples of ambiguities, and the way to eliminate them:

  1. Precedence
  2. Associativity
  3. Dangling-else

Precedence

In the examples above, the ambiguity in the interpretation of 2 + 3 * 5 can be eliminated by imposing the precedence of one operator over the other.
We say that op has precedence over op' if an expression of the form

  e_1 op e_2 op' e_3    (respectively e_1 op' e_2 op e_3)

is interpreted only as

  (e_1 op e_2) op' e_3    (respectively e_1 op' (e_2 op e_3))

In other words, the grouping power of op is greater than the grouping power of op'.

From the point of view of derivation trees, the fact that e_1 op e_2 op' e_3 is interpreted as (e_1 op e_2) op' e_3 means that op must be introduced at a level strictly lower than op', i.e. in a sub-tree whose root is a child of the node that introduces op'.

In order to modify the grammar so that it generates only this kind of tree, a possible solution is to introduce a new syntactic category producing expressions of the form e_1 op e_2, and to force a hierarchical order with respect to the main category of expressions of the form e_1 op' e_2.

Example

We can eliminate the ambiguities from the grammar in the example above by introducing a new syntactic category Term producing expressions of the form e_1 * e_2, where e_1 and e_2 may contain * again, but not +. This can be done by organizing the productions hierarchically as follows:

  Exp ::= Exp + Term | Term
  Term ::= Term * Num | Num

This modification corresponds to assigning * a higher priority than + (following the mathematical convention).

Consider again the string 2 + 3 * 5. It is easy to see that in the new grammar there is only one tree which can generate it:

          Exp
          /|\
         / | \
        /  |  \
     Exp   +   Term
      |        /|\
      |       / | \
      |      /  |  \
    Term  Term  *  Num
      |    |        |
     Num  Num       5
      |    |
      2    3

Associativity

Consider again the grammar for numerical expressions in the previous example, and consider the grammar obtained by modifying the productions for Exp in the following way:

  Exp ::= Exp + Exp | Term

This new grammar is ambiguous. In fact, it allows two different derivation trees for the string 2 + 3 + 5: one corresponding to the structure (2 + 3) + 5 and one corresponding to the structure 2 + (3 + 5).
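The claim that a string has one or two derivation trees can be checked mechanically. The following is only a minimal sketch, not part of the original notes: the function name count_trees and the dictionary encoding of grammars are hypothetical, Num is treated as a terminal token (its own productions are not given here), and the grammar is assumed to have no empty productions and no cyclic unit productions.

```python
from functools import lru_cache


def count_trees(grammar, start, tokens):
    """Count the derivation trees rooted at `start` whose fringe is `tokens`.

    `grammar` maps each non-terminal to a list of right-hand sides (tuples of
    symbols). Any symbol that is not a key of `grammar` is a terminal.
    Assumption: no empty productions and no cyclic unit productions.
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        # Number of trees rooted at `symbol` generating tokens[i:j].
        if symbol not in grammar:
            return 1 if j - i == 1 and tokens[i] == symbol else 0
        return sum(match(rhs, i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def match(rhs, i, j):
        # Number of ways the symbol sequence `rhs` generates tokens[i:j].
        if not rhs:
            return 1 if i == j else 0
        head, rest = rhs[0], rhs[1:]
        # Each symbol generates at least one token, so leave room for `rest`.
        return sum(
            derive(head, i, k) * match(rest, k, j)
            for k in range(i + 1, j - len(rest) + 1)
        )

    return derive(start, 0, len(tokens))


# Exp ::= Exp + Exp | Term      (the ambiguous variant)
AMBIGUOUS = {
    "Exp": [("Exp", "+", "Exp"), ("Term",)],
    "Term": [("Term", "*", "Num"), ("Num",)],
}

# Exp ::= Exp + Term | Term     (the grammar from the Precedence section)
STRATIFIED = {
    "Exp": [("Exp", "+", "Term"), ("Term",)],
    "Term": [("Term", "*", "Num"), ("Num",)],
}

sum_of_three = ["Num", "+", "Num", "+", "Num"]        # 2 + 3 + 5
print(count_trees(AMBIGUOUS, "Exp", sum_of_three))    # 2
print(count_trees(STRATIFIED, "Exp", sum_of_three))   # 1
```

A count greater than 1 for some string shows that the grammar is ambiguous. A count of 1 for one particular string shows nothing about the grammar as a whole.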
In the case of the + operator, this kind of ambiguity is not a problem, because of its algebraic properties: + is associative, i.e. (2 + 3) + 5 and 2 + (3 + 5) have the same value. In general, however, an operator might not be associative. This is for instance the case for the - and ^ (exponentiation) operators: (5 - 3) - 2 and 5 - (3 - 2) have different values, and so do (5 ^ 3) ^ 2 and 5 ^ (3 ^ 2).

In order to eliminate this kind of ambiguity, we must establish whether the operator is left-associative or right-associative. Left-associative means that e_1 op e_2 op e_3 is interpreted as (e_1 op e_2) op e_3 (op associates to the left). Conversely, right-associative means that it is interpreted as e_1 op (e_2 op e_3) (op associates to the right).

We can impose left-associativity (resp. right-associativity) by using the following technique: in the production introducing op, we place the syntactic category producing op to the left (resp. to the right) of op. Note that in the previous example this is done for both + and *: they are forced to be left-associative.

Example

Consider the following grammar (productions) for numerical expressions constructed with the - operation:

  Exp ::= Num | Exp - Exp

This grammar is ambiguous since it allows both the interpretations (5 - 3) - 2 and 5 - (3 - 2). If we want to impose left-associativity (following the mathematical convention), it is sufficient to modify the productions in the following way:

  Exp ::= Num | Exp - Num

Example

Consider the following grammar (productions) for numerical expressions constructed with the ^ operation:

  Exp ::= Num | Exp ^ Exp

This grammar is ambiguous since it allows both the interpretations (5 ^ 3) ^ 2 and 5 ^ (3 ^ 2). If we want to impose right-associativity (following the mathematical convention), it is sufficient to modify the productions in the following way:

  Exp ::= Num | Num ^ Exp
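The two modified grammars above can be turned into small parsers that make the resulting structure visible. This is a sketch under stated assumptions, not a definitive implementation: the names parse_left, parse_right and evaluate are hypothetical, the input is assumed to be a list of space-separated tokens, Num is assumed to be a non-negative integer literal, and trees are represented as nested tuples (left, op, right).

```python
def parse_left(tokens, op):
    """Parse with Exp ::= Num | Exp op Num (left-associative).

    The left recursion is unrolled into a loop: each new operator wraps the
    tree built so far as its left child.
    """
    tree = int(tokens[0])
    i = 1
    while i < len(tokens):
        if tokens[i] != op or i + 1 >= len(tokens):
            raise ValueError(f"unexpected input at position {i}")
        tree = (tree, op, int(tokens[i + 1]))
        i += 2
    return tree


def parse_right(tokens, op):
    """Parse with Exp ::= Num | Num op Exp (right-associative).

    The recursive call parses everything to the right of the first operator.
    """
    head = int(tokens[0])
    if len(tokens) == 1:
        return head
    if tokens[1] != op or len(tokens) < 3:
        raise ValueError("unexpected input after the first number")
    return (head, op, parse_right(tokens[2:], op))


def evaluate(tree):
    """Evaluate a tree built by the parsers above ('-' and '^' only)."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    a, b = evaluate(left), evaluate(right)
    return a - b if op == "-" else a ** b


minus_tree = parse_left("5 - 3 - 2".split(), "-")
power_tree = parse_right("5 ^ 3 ^ 2".split(), "^")
print(minus_tree, evaluate(minus_tree))   # ((5, '-', 3), '-', 2) 0
print(power_tree, evaluate(power_tree))   # (5, '^', (3, '^', 2)) 1953125
```

Swapping the two parsers gives the other groupings, 5 - (3 - 2) = 4 and (5 ^ 3) ^ 2 = 15625, so the value of the expression depends on which production shape is chosen.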