CSE 428: Lecture 2
We now give the formal definition of a context-free grammar. We will use the following notation: if X is a set of symbols, then X^{*} is the set of all strings on X, including the empty string lambda. A string on X
is a finite sequence of symbols of X, possibly with repetitions.
Context-free grammars

Definition

A context-free grammar is a tuple
G = < T , N , S , P >
where:

T is a set of terminal symbols (tokens, alphabet)

N is a set of nonterminal symbols (syntactic categories)

S is an element of N, called the starting (or initial) symbol

P is a set of productions, i.e. rules of the form A ::= alpha, where A is an element of N and alpha is a string of terminal and nonterminal symbols, i.e. an element of (N union T)^{*}
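As an illustration, a grammar in this sense can be written down directly as data. The following is a minimal Python sketch: the names Grammar and is_well_formed are hypothetical, introduced only for this example, and the requirement that T and N be disjoint is a standard assumption which the definition above leaves implicit. The sample grammar has the single nonterminal A and the productions A ::= ( ), A ::= (A) and A ::= AA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset      # T
    nonterminals: frozenset   # N
    start: str                # S
    productions: tuple        # P, as pairs (A, alpha); alpha is a tuple of symbols


def is_well_formed(g: Grammar) -> bool:
    """Check the side conditions of the definition of G = < T , N , S , P >."""
    symbols = g.terminals | g.nonterminals
    return (
        # Assumption: T and N are disjoint (standard, but implicit in the notes).
        g.terminals.isdisjoint(g.nonterminals)
        # S is an element of N.
        and g.start in g.nonterminals
        # Every production is A ::= alpha with A in N and alpha in (N union T)*.
        and all(
            a in g.nonterminals and all(s in symbols for s in alpha)
            for a, alpha in g.productions
        )
    )


parens = Grammar(
    terminals=frozenset({"(", ")"}),
    nonterminals=frozenset({"A"}),
    start="A",
    productions=(("A", ("(", ")")), ("A", ("(", "A", ")")), ("A", ("A", "A"))),
)
print(is_well_formed(parens))
```

Representing alpha as a tuple of symbols, rather than as a character string, keeps multi-character tokens such as if or then as single symbols.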
Note that the tokens do not necessarily coincide with the symbols of the
keyboard. For instance, the Pascal keywords if, then and
else are tokens (not sequences of tokens). Whenever it is not clear from
the context, we will distinguish these tokens from the symbols of syntactic
categories by underlining them, or by using the boldface style for the former
and normal style for the latter.
The term "context-free" refers to the fact that the productions are
restricted to be of the form A ::= alpha.
In general, a grammar production would be of the form beta ::= alpha, where beta is
a nonempty string containing at least one nonterminal symbol. Productions of the
restricted form A ::= alpha are called context-free because, intuitively, they can only
define a syntactic category in isolation, i.e. independently of the context in which it occurs.
A grammar defines a language, namely a particular subset of all the
possible strings on the alphabet. Intuitively, the language generated by
a grammar G = < T , N , S , P > is
the set of all possible strings in T^{*}
that can be obtained by starting from the initial symbol
S and repeatedly applying the productions, until we obtain
a string of terminal symbols only. More formally:

Definition

A derivation is a sequence
alpha_{0} > alpha_{1} > ... > alpha_{n}
such that:

each alpha_{i} is an element of (N union T)^{*}

for each i < n, alpha_{i} is of the form beta A gamma, where beta and gamma are elements of (N union T)^{*}, and there is a production of the form A ::= delta such that alpha_{i+1} is beta delta gamma.
We will represent such a derivation with the notation alpha_{0} >* alpha_{n}.
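The single-step relation in this definition is mechanical enough to be sketched in code. In the following Python fragment, strings on N union T are represented as tuples of symbols and productions as pairs (A, delta); the function names step and is_derivation are our own, not part of the definition.

```python
def step(alpha, productions):
    """Return every string obtainable from alpha in one derivation step."""
    results = []
    for i, symbol in enumerate(alpha):
        for lhs, delta in productions:
            if symbol == lhs:
                # alpha = beta A gamma  is rewritten to  beta delta gamma
                results.append(alpha[:i] + delta + alpha[i + 1:])
    return results


def is_derivation(seq, productions):
    """Check that each element of seq follows from the previous one in one step."""
    return all(seq[i + 1] in step(seq[i], productions) for i in range(len(seq) - 1))


# Productions A ::= ( ), A ::= (A), A ::= AA
P = [("A", ("(", ")")), ("A", ("(", "A", ")")), ("A", ("A", "A"))]

# The derivation  A > AA > ( )A > ( )( )
d = [("A",), ("A", "A"), ("(", ")", "A"), ("(", ")", "(", ")")]
print(is_derivation(d, P))
```

Note that step may expand any occurrence of a nonterminal, not only the leftmost one, exactly as the definition allows.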

Definition

The language generated by a grammar G, written L(G),
is the set of all strings in T^{*} which
can be obtained with a derivation starting from the initial symbol S.
Formally:
L(G) = { alpha in T^{*} | S >* alpha }
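For small grammars, L(G) can be explored by applying the definition directly. The sketch below performs a breadth-first search over the strings derivable from S and collects those made of terminal symbols only. It assumes that no production has an empty right-hand side, so that strings never shrink during a derivation and the search can be cut off at a length bound; the function name language_up_to and the parameter max_len are hypothetical.

```python
from collections import deque


def language_up_to(start, productions, nonterminals, max_len):
    """Strings of L(G) with at most max_len symbols.

    Assumption: no production has an empty right-hand side, so a string
    longer than max_len can never lead back to a shorter one.
    """
    seen = {(start,)}
    queue = deque([(start,)])
    found = set()
    while queue:
        alpha = queue.popleft()
        if not any(s in nonterminals for s in alpha):
            found.add("".join(alpha))   # only terminals left: alpha is in L(G)
            continue
        for i, symbol in enumerate(alpha):
            for lhs, delta in productions:
                if symbol == lhs:
                    nxt = alpha[:i] + delta + alpha[i + 1:]
                    if len(nxt) <= max_len and nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return sorted(found)


# Productions A ::= ( ), A ::= (A), A ::= AA
P = [("A", ("(", ")")), ("A", ("(", "A", ")")), ("A", ("A", "A"))]
print(language_up_to("A", P, {"A"}, 4))
```

With these productions and a bound of 4 symbols, the search finds exactly the strings ( ), (( )) and ( )( ).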
Let us see some examples of languages and grammars.

Example: The language of balanced parentheses

Terminals: the symbols ( and )

Nonterminals: the symbol A

Productions: A ::= ( ) | (A) | AA
(This is an abbreviated notation standing for the productions A ::= ( ) , A ::= (A) , and A ::= AA .)

Examples of strings generated by this grammar are ( ), ( )( ) and (( ))( ).
The corresponding derivations are:

A > ( )

A > AA > ( )A > ( )( )

A > AA > (A)A > (A)( ) > (( ))( )

Example: The language of arithmetic expressions
with plus and times operations

Terminals: the symbols + and *
and (the representation of) the natural numbers

Nonterminals: the symbols Exp and
Num

Initial symbol: Exp

Productions:
Exp ::= Num | Exp + Exp | Exp * Exp
Num ::= 0 | 1 | 2 | 3 | ...
(note that we need infinitely many productions for Num)

Examples of strings generated by this grammar are 2, 2 + 3 and 2 + 3 * 5.
The corresponding derivations are:

Exp > Num > 2

Exp > Exp + Exp > Num + Exp > Num + Num > 2 + Num > 2 + 3

Exp > Exp + Exp > Num + Exp > Num + Exp * Exp >* Num + Num * Num >* 2 + 3 * 5

Exercise

In the example above, we have assumed an infinite set of productions for
Num. This is unrealistic. Give a grammar
which generates the natural numbers in their usual representation, i.e.
as sequences of digits 0, ..., 9
starting with a digit different from 0.
Derivation Tree
Derivation trees (also called "parse
trees") are a way to represent the generation of
strings in a grammar. They also give information about the structure
of the strings, i.e. the way they are organized in syntactical categories.

Definition

Given a grammar G = < T , N , S , P > , a derivation tree t for G is a tree such that:

the root is labeled by S

the leaves are labeled by terminal symbols

each internal node is labeled by a nonterminal symbol, and, if its label is A, then its children are labeled, from left to right, by symbols s_{1} , s_{2} , ... , s_{n} such that there exists a production A ::= s_{1} s_{2} ... s_{n} in P
The labels of the leaves (the fringe), read from left to right, form the string generated
by t.
We will denote it by string(t).
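The three conditions of this definition, and the function string(t), can be sketched in a few lines of Python. The names Node, string_of and is_derivation_tree are hypothetical, and, since the grammar of arithmetic expressions has infinitely many productions for Num, the sketch keeps only the three numerals it needs. Productions with an empty right-hand side are not handled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def string_of(t):
    """string(t): the labels of the leaves, read from left to right."""
    if not t.children:
        return [t.label]
    return [s for child in t.children for s in string_of(child)]


def is_derivation_tree(t, start, terminals, productions):
    """Check root, leaves and internal nodes against the definition."""
    def ok(node):
        if not node.children:
            return node.label in terminals           # leaves carry terminals
        rhs = tuple(c.label for c in node.children)  # children spell out a right-hand side
        return (node.label, rhs) in productions and all(ok(c) for c in node.children)
    return t.label == start and ok(t)


def num(d):
    # The subtree Exp - Num - d for a numeral d
    return Node("Exp", [Node("Num", [Node(d)])])


# Assumption: only the numerals 2, 3 and 5 are included.
T = {"+", "*", "2", "3", "5"}
P = {("Exp", ("Num",)), ("Exp", ("Exp", "+", "Exp")), ("Exp", ("Exp", "*", "Exp")),
     ("Num", ("2",)), ("Num", ("3",)), ("Num", ("5",))}

t = Node("Exp", [num("2"), Node("+"),
                 Node("Exp", [num("3"), Node("*"), num("5")])])
print(" ".join(string_of(t)))
print(is_derivation_tree(t, "Exp", T, P))
```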
It is easy to see that a derivation tree represents a set of
derivations
(usually more than one) for the same string, and that for each derivation
there is a derivation tree for the same string. Hence L(G)
coincides with the set of strings generated by all possible derivation
trees for G. More formally, if
we denote by DT(G) the set of all derivation
trees for G, we have the following result:

Proposition

L(G) = { alpha in T^{*} | alpha = string(t) for some t in DT(G) }

Example

Let us consider again the language of arithmetic expressions, with productions

Exp ::= Num | Exp + Exp | Exp * Exp

We have that a possible derivation tree for the string
2 + 3 * 5 is the following:
            Exp
            /|\
           / | \
          /  |  \
       Exp   +   Exp
        |        /|\
        |       / | \
        |      /  |  \
       Num   Exp  *  Exp
        |     |       |
        2    Num     Num
              |       |
              3       5

This tree corresponds to several derivations for the same string, which
differ only in the choice of the nonterminal to expand at each derivation
step.
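To make this concrete, the following Python sketch reads two different derivations off the same tree for 2 + 3 * 5: one always expands the leftmost nonterminal, the other always the rightmost. Trees are written as pairs (label, list of children); this encoding and the function name derivation are assumptions of the example.

```python
def derivation(tree, leftmost=True):
    """List the strings of the leftmost (or rightmost) derivation encoded by tree."""
    form = [tree]              # current string, kept as a list of subtrees
    steps = [[tree[0]]]
    while True:
        # Positions that still hold an unexpanded nonterminal
        expandable = [i for i, node in enumerate(form) if node[1]]
        if not expandable:
            return steps
        i = expandable[0] if leftmost else expandable[-1]
        form[i:i + 1] = form[i][1]          # replace the node by its children
        steps.append([node[0] for node in form])


def leaf(s):
    return (s, [])


def num(d):
    # The subtree Exp - Num - d for a numeral d
    return ("Exp", [("Num", [leaf(d)])])


t = ("Exp", [num("2"), leaf("+"), ("Exp", [num("3"), leaf("*"), num("5")])])

for alpha in derivation(t):
    print(" ".join(alpha))
print()
for alpha in derivation(t, leftmost=False):
    print(" ".join(alpha))
```

Both runs start from Exp and end with 2 + 3 * 5, but they pass through different intermediate strings.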
Ambiguity
The structure of an expression is usually essential to interpret its meaning.
The expression 2 + 3 * 5, for example, has two different values depending on its intended structure. If
we assume it to be 2 + ( 3 * 5 ) (i.e. * is applied to 3 and 5), then the result
is 17. If, on the other hand, we assume it to be ( 2 + 3 ) * 5,
then the result is 25. In order to
avoid this kind of ambiguity, it is essential that the grammar
generates only one possible structure for each string in the language.
Since the structure is represented by the derivation tree, we have the
following definition:

Definition

A grammar G is ambiguous
if there exists a string in L(G) which
can be generated by two (or more) different derivation trees.
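For a specific grammar, the number of derivation trees of a given string can be counted mechanically. The sketch below does this for the productions Exp ::= Num | Exp + Exp | Exp * Exp only; the function name count_trees is hypothetical, and a token counts as a numeral when Python's str.isdigit accepts it. A result of 2 or more for some string shows that the grammar is ambiguous.

```python
from functools import lru_cache


def count_trees(tokens):
    """Number of derivation trees with root Exp whose fringe is tokens."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Trees for the segment tokens[i:j]
        if j - i == 1:
            return 1 if tokens[i].isdigit() else 0   # Exp - Num - numeral
        total = 0
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                # Root production Exp ::= Exp op Exp, split at the operator tokens[k]
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


print(count_trees("2 + 3 * 5".split()))
```

The count grows quickly with the number of operators, since every choice of the top-level operator gives a different tree.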

Example

The grammar in the example above is ambiguous:
in fact, the string 2 + 3 * 5
can also be generated by the following tree:
              Exp
              /|\
             / | \
            /  |  \
         Exp   *   Exp
         /|\        |
        / | \      Num
       /  |  \      |
    Exp   +   Exp   5
     |         |
    Num       Num
     |         |
     2         3
This tree corresponds to the grouping ( 2 + 3 ) * 5, while the tree in the example above corresponds to 2 + ( 3 * 5 ).
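The effect of the two groupings on the value of the expression can be checked directly. In this small Python sketch each tree is abbreviated to a nested triple (left, operator, right), leaving out the Exp and Num nodes; the abbreviation and the function name value are assumptions of the example.

```python
def value(tree):
    """Evaluate an abbreviated derivation tree."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    if op == "+":
        return value(left) + value(right)
    return value(left) * value(right)


first = (2, "+", (3, "*", 5))     # the grouping 2 + ( 3 * 5 )
second = ((2, "+", 3), "*", 5)    # the grouping ( 2 + 3 ) * 5
print(value(first), value(second))
```

The two trees have the same fringe, 2 + 3 * 5, but evaluate to 17 and 25 respectively.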
There are languages which are intrinsically ambiguous,
i.e. it is not possible to eliminate their ambiguities without changing
the language.

Definition

A language L is intrinsically ambiguous
if it can be generated only by ambiguous grammars, i.e. for every grammar
G such that L = L(G),
we have that G is ambiguous.
Fortunately, languages which are interesting from the point of view of programming
usually are not intrinsically ambiguous, and therefore we can find nonambiguous
grammars which generate them. When a (nonintrinsically ambiguous) language
L is presented by an ambiguous grammar
G, "to eliminate the ambiguities of
G" means to find another grammar
G',
which is nonambiguous, and which generates the same language
L.
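As an illustration of what such a G' may look like, consider the grammar of arithmetic expressions G with productions Exp ::= Num | Exp + Exp | Exp * Exp. One candidate, which is our own choice and is not given in these notes, introduces a new nonterminal Term: Exp ::= Term | Exp + Term and Term ::= Num | Term * Num. The Python sketch below counts the derivation trees of a token sequence under both grammars and compares them on all short sequences. This is a bounded check under the stated assumptions, not a proof; the function name counts is hypothetical.

```python
from functools import lru_cache
from itertools import product


def counts(tokens):
    """Return (number of trees under G, number of trees under G')."""
    tokens = tuple(tokens)

    def is_num(i, j):
        return j - i == 1 and tokens[i].isdigit()

    @lru_cache(maxsize=None)
    def exp_g(i, j):      # G:  Exp ::= Num | Exp + Exp | Exp * Exp
        total = 1 if is_num(i, j) else 0
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                total += exp_g(i, k) * exp_g(k + 1, j)
        return total

    @lru_cache(maxsize=None)
    def term(i, j):       # G': Term ::= Num | Term * Num
        total = 1 if is_num(i, j) else 0
        if j - i >= 3 and tokens[j - 2] == "*" and is_num(j - 1, j):
            total += term(i, j - 2)
        return total

    @lru_cache(maxsize=None)
    def exp_h(i, j):      # G': Exp ::= Term | Exp + Term
        total = term(i, j)
        for k in range(i + 1, j - 1):
            if tokens[k] == "+":
                total += exp_h(i, k) * term(k + 1, j)
        return total

    return exp_g(0, len(tokens)), exp_h(0, len(tokens))


print(counts("2 + 3 * 5".split()))

# Bounded comparison: same strings accepted, and never more than one tree under G'.
for n in range(1, 6):
    for toks in product(["1", "+", "*"], repeat=n):
        g, h = counts(toks)
        assert h <= 1 and (g > 0) == (h == 1)
```

For 2 + 3 * 5 the single tree under this G' is the one corresponding to the grouping 2 + ( 3 * 5 ).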