CSE 428: Lecture 2

We now give the formal definition of a context-free grammar. We will use the following notation: if X is a set of symbols, then X^* is the set of all strings on X, including the empty string lambda. A string on X is a finite sequence of symbols of X, possibly with repetitions.

Context-free grammars

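Definition: A context-free grammar is a tuple G = < T , N , S , P > where T is a set of terminal symbols (the tokens), N is a set of non-terminal symbols (the syntactical categories), S in N is the initial symbol, and P is a set of productions of the form A ::= alpha, with A in N and alpha in (T union N)^*.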

Note that the tokens do not necessarily coincide with the symbols of the keyboard. For instance, the Pascal keywords if, then and else are tokens (not sequences of tokens). Whenever it is not clear from the context, we will distinguish these tokens from the symbols of syntactical categories by underlining them, or by using boldface style for the former and normal style for the latter.

The term "context-free" refers to the fact that the productions are restricted to be of the form A ::= alpha, where A is a single non-terminal symbol. In general, a grammar production would be of the form beta ::= alpha, where beta is a non-empty string containing at least one non-terminal symbol. Productions of the form A ::= alpha are called context-free because, intuitively, they define a syntactic category in isolation, i.e. independently of the context in which it occurs.

A grammar defines a language, namely a particular subset of all the possible strings on the alphabet. Intuitively, the language generated by a grammar G = < T , N , S , P > is the set of all strings in T^* that can be obtained by starting from the initial symbol S and repeatedly applying the productions until we obtain a string of terminal symbols only. More formally:

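Definition: A derivation is a sequence alpha_0 -> alpha_1 -> ... -> alpha_n of strings in (T union N)^* such that, for each i < n, alpha_{i+1} is obtained from alpha_i by replacing one occurrence of a non-terminal A with a string alpha, for some production A ::= alpha in P.
Definition: The language generated by a grammar G, L(G), is the set of all strings in T^* which can be obtained with a derivation starting from the initial symbol S. Formally: L(G) = { alpha in T^* | there exists a derivation S -> ... -> alpha }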
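
The two definitions above can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the grammar for balanced parentheses, S ::= lambda | ( S ) | S S, is a common choice and not necessarily the one used in this lecture, and the names one_step, is_derivation and language_up_to are hypothetical. A grammar is stored as a dictionary from non-terminals to right-hand sides, a derivation step rewrites one occurrence of a non-terminal, and L(G) is explored by bounded search.

```python
from collections import deque

# Assumed grammar for balanced parentheses: S ::= lambda | ( S ) | S S
# (the empty tuple stands for the empty string lambda).
GRAMMAR = {"S": [(), ("(", "S", ")"), ("S", "S")]}
START = "S"


def one_step(form, grammar):
    """Yield every string obtained from `form` by one derivation step."""
    for i, symbol in enumerate(form):
        if symbol in grammar:  # only non-terminals can be rewritten
            for rhs in grammar[symbol]:
                yield form[:i] + rhs + form[i + 1:]


def is_derivation(forms, grammar):
    """Check that each string follows from the previous one in one step."""
    return all(b in set(one_step(a, grammar))
               for a, b in zip(forms, forms[1:]))


def language_up_to(grammar, start, max_len, max_form_len=None):
    """Return the strings of L(G) with at most `max_len` terminals.

    `max_form_len` bounds the intermediate strings so that the search
    terminates; the default is a heuristic that suffices for GRAMMAR.
    """
    if max_form_len is None:
        max_form_len = 2 * max_len + 2
    seen = {(start,)}
    queue = deque(seen)
    found = set()
    while queue:
        form = queue.popleft()
        if all(s not in grammar for s in form):  # terminals only
            found.add("".join(form))
            continue
        for nxt in one_step(form, grammar):
            terminals = sum(1 for s in nxt if s not in grammar)
            if (terminals <= max_len and len(nxt) <= max_form_len
                    and nxt not in seen):
                seen.add(nxt)
                queue.append(nxt)
    return found


print(is_derivation([("S",), ("(", "S", ")"), ("(", ")")], GRAMMAR))  # True
print(sorted(language_up_to(GRAMMAR, START, 4), key=lambda s: (len(s), s)))
# ['', '()', '(())', '()()']
```

The search is exhaustive only up to the stated bounds, so it illustrates the definition of L(G) rather than deciding membership in general.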

Let us see some examples of languages and grammars.

Example: The language of balanced parentheses: Examples of strings generated by this grammar are:
Example: The language of arithmetic expressions with plus and times operations: Exp ::= Num | Exp + Exp | Exp * Exp and Num ::= 0 | 1 | 2 | 3 | ... An example of a string generated by this grammar is 2 + 3 * 5.
Exercise: In the example above, we have assumed an infinite set of productions for Num. This is unrealistic. Give a grammar which generates the natural numbers in their usual representation, i.e. as sequences of digits 0, ..., 9 starting with a digit different from 0.

Derivation Tree

Derivation trees (also called "parse trees" in Sethi's book) are a way to represent the generation of strings in a grammar. They also give information about the structure of the strings, i.e. the way they are organized in syntactical categories.

Definition: Given a grammar G = < T , N , S , P > , a derivation tree t for G is a tree such that:

It is easy to see that a derivation tree represents a set of derivations (usually more than one) for the same string, and that for each derivation there is a derivation tree for the same string. Hence L(G) coincides with the set of strings generated by all possible derivation trees for G. More formally, if we denote by DT(G) the set of all derivation trees for G, we have the following result:

Proposition: L(G) = { alpha in T^* | alpha = string(t) for some t in DT(G) }

Example: Let us consider again the language of arithmetic expressions, with productions Exp ::= Num | Exp + Exp | Exp * Exp. A possible derivation tree for the string 2 + 3 * 5 is the following:

[Figure: a derivation tree for 2 + 3 * 5]

This tree corresponds to several derivations for the same string, which differ only in the choice of the non-terminal to expand at each derivation step.
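
As a rough illustration of this example, a derivation tree can be written down as a small data structure. The sketch below makes several assumptions: the class Node and the functions num, respects_productions and string_of are hypothetical names, string(t) is taken to be the sequence of leaf labels read from left to right, Num is restricted to the three digits that occur in the string, and the tree built here (the one grouping 3 * 5) is only one of the possible trees, not necessarily the one pictured in the lecture.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str                                   # terminal or non-terminal
    children: list = field(default_factory=list)  # empty for leaves


# Productions of the example; Num is cut down to the digits used here.
PRODUCTIONS = {
    "Exp": [["Num"], ["Exp", "+", "Exp"], ["Exp", "*", "Exp"]],
    "Num": [["2"], ["3"], ["5"]],
}


def num(digit):
    """Build the subtree Exp -> Num -> digit."""
    return Node("Exp", [Node("Num", [Node(digit)])])


def respects_productions(t):
    """Check that every internal node is expanded by some A ::= alpha."""
    if not t.children:
        return t.symbol not in PRODUCTIONS  # leaves must be terminals
    rhs = [c.symbol for c in t.children]
    return rhs in PRODUCTIONS.get(t.symbol, []) and all(
        respects_productions(c) for c in t.children)


def string_of(t):
    """Collect the leaf labels of `t` from left to right."""
    if not t.children:
        return [t.symbol]
    return [s for c in t.children for s in string_of(c)]


# One possible tree for 2 + 3 * 5: the root applies Exp ::= Exp + Exp.
tree = Node("Exp", [num("2"), Node("+"),
                    Node("Exp", [num("3"), Node("*"), num("5")])])

print(respects_productions(tree))    # True
print(" ".join(string_of(tree)))     # 2 + 3 * 5
```

The tree records which production was applied at each node but not the order in which the non-terminals were expanded, which is why one tree stands for several derivations.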

Ambiguity

The structure of an expression is usually essential to interpret its meaning. The expression 2 + 3 * 5 for example has two different values depending on its intended structure: If we assume it to be 2 + ( 3 * 5 ) (i.e. * is applied to 3 and 5) then the result is 17. If, on the other hand, we assume it to be ( 2 + 3 ) * 5, then the result is 25. In order to avoid this kind of ambiguity, it is essential that the grammar generates only one possible structure for each string in the language. Since the structure is represented by the derivation tree, we have the following definition:

Definition: A grammar G is ambiguous if there exists a string in L(G) which can be generated by two (or more) different derivation trees.
Example: The grammar in the example above is ambiguous: the string 2 + 3 * 5 can also be generated by the following tree:

[Figure: a second derivation tree for 2 + 3 * 5]
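
The ambiguity can be checked mechanically for short strings. The following sketch, under the assumption that numbers are single tokens made of digits, enumerates every derivation tree of a token sequence for the productions Exp ::= Num | Exp + Exp | Exp * Exp and evaluates each one. The functions trees and value are hypothetical helpers, and trees are abbreviated as nested tuples (operator, left, right) with integers at the leaves.

```python
def trees(tokens):
    """Yield every derivation tree of `tokens`, abbreviated as a nested tuple."""
    # Exp ::= Num
    if len(tokens) == 1 and tokens[0].isdigit():
        yield int(tokens[0])
    # Exp ::= Exp + Exp  and  Exp ::= Exp * Exp: try every operator as the root
    for i, tok in enumerate(tokens):
        if tok in ("+", "*"):
            for left in trees(tokens[:i]):
                for right in trees(tokens[i + 1:]):
                    yield (tok, left, right)


def value(t):
    """Evaluate a tree according to its structure."""
    if isinstance(t, int):
        return t
    op, left, right = t
    return value(left) + value(right) if op == "+" else value(left) * value(right)


for t in trees("2 + 3 * 5".split()):
    print(t, "=", value(t))
# ('+', 2, ('*', 3, 5)) = 17
# ('*', ('+', 2, 3), 5) = 25
```

Two trees are found for 2 + 3 * 5, with the values 17 and 25 discussed above, so the grammar is ambiguous.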

There are languages which are intrinsically ambiguous, i.e. it is not possible to eliminate their ambiguities without changing the language.

Definition: A language L is intrinsically ambiguous if it can be generated only by ambiguous grammars, i.e. for every grammar G such that L = L(G), we have that G is ambiguous.

Fortunately, languages which are interesting from the point of view of programming are usually not intrinsically ambiguous, and therefore we can find non-ambiguous grammars which generate them. When a (non-intrinsically ambiguous) language L is presented by an ambiguous grammar G, "to eliminate the ambiguities of G" means to find another grammar G', which is non-ambiguous and which generates the same language L.
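
As an illustration of what eliminating ambiguities can look like for the arithmetic expressions, one standard technique (an assumption here, not a construction given in this lecture) is to introduce one syntactical category per precedence level: Exp ::= Exp + Term | Term and Term ::= Term * Num | Num. The sketch below enumerates the derivation trees of a token sequence for this grammar G'. The names exp_trees and term_trees are hypothetical, numbers are assumed to be single tokens made of digits, and the unit steps Exp ::= Term and Term ::= Num are not shown in the abbreviated trees.

```python
def term_trees(tokens):
    """Yield the trees of `tokens` for Term ::= Term * Num | Num."""
    if len(tokens) == 1 and tokens[0].isdigit():
        yield int(tokens[0])                       # Term ::= Num
    for i, tok in enumerate(tokens):
        rest = tokens[i + 1:]
        # Term ::= Term * Num: the right operand must be a single number
        if tok == "*" and len(rest) == 1 and rest[0].isdigit():
            for left in term_trees(tokens[:i]):
                yield ("*", left, int(rest[0]))


def exp_trees(tokens):
    """Yield the trees of `tokens` for Exp ::= Exp + Term | Term."""
    yield from term_trees(tokens)                  # Exp ::= Term
    for i, tok in enumerate(tokens):
        if tok == "+":                             # Exp ::= Exp + Term
            for left in exp_trees(tokens[:i]):
                for right in term_trees(tokens[i + 1:]):
                    yield ("+", left, right)


print(list(exp_trees("2 + 3 * 5".split())))
# [('+', 2, ('*', 3, 5))]
```

With this grammar only the structure 2 + ( 3 * 5 ) survives for the string 2 + 3 * 5, while the set of strings generated is the same as before. The choice of which structure to keep (here, * binding tighter than +, and both operators grouping to the left) is a design decision of G', not something forced by the language.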