CSE 428: Lecture 2
We now give the formal definition of a context-free grammar. We will use the
following notation: if X is a set of symbols, then X* is the set of all strings
on X, including the empty string lambda. A string on X is a finite sequence of
symbols of X, possibly with repetitions.
Context-free grammars
-
Definition
-
A context-free grammar is a tuple
G = < T , N , S , P >
where:
-
T is a set of terminal symbols (tokens, alphabet)
-
N is a set of non-terminal symbols (syntactic categories)
-
S is an element of N, called the starting (or initial) symbol
-
P is a set of productions, i.e. rules of the form A ::= alpha, where
A is an element of N and alpha is a string of terminal and
non-terminal symbols, i.e. an element of (N union T)*
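The tuple above can be written down directly as a data structure. The following is a minimal sketch, not part of the lecture: the class name Grammar, its field names, and the choice to represent a production A ::= alpha as a pair (A, alpha) with alpha a tuple of symbols are all assumptions made for illustration. The check that T and N are disjoint is a common convention that the definition above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    """A context-free grammar G = <T, N, S, P> (illustrative sketch)."""
    terminals: frozenset      # T
    nonterminals: frozenset   # N
    start: str                # S
    productions: tuple        # P, as (A, alpha) pairs; alpha is a tuple of symbols

    def __post_init__(self):
        # Assumption (not stated in the definition): T and N are disjoint.
        if self.terminals & self.nonterminals:
            raise ValueError("T and N must be disjoint")
        if self.start not in self.nonterminals:
            raise ValueError("S must be an element of N")
        symbols = self.terminals | self.nonterminals
        for head, body in self.productions:
            if head not in self.nonterminals:
                raise ValueError(f"left-hand side {head!r} is not in N")
            if any(sym not in symbols for sym in body):
                raise ValueError(f"right-hand side of {head!r} uses an unknown symbol")


# Illustrative instance: a small grammar for balanced parentheses.
parens = Grammar(
    terminals=frozenset({"(", ")"}),
    nonterminals=frozenset({"A"}),
    start="A",
    productions=(
        ("A", ("(", ")")),
        ("A", ("(", "A", ")")),
        ("A", ("A", "A")),
    ),
)
```

Using tuples of symbols rather than plain character strings keeps multi-character tokens (such as keywords) as single symbols.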
Note that the tokens do not necessarily coincide with the symbols of the
keyboard. For instance, the Pascal keywords if, then and
else are tokens (not sequences of tokens). Whenever it is not clear from
the context, we will distinguish tokens from the symbols for syntactic
categories by underlining them, or by using boldface for the former
and normal style for the latter.
The term "context-free" refers to the fact that the productions are
restricted to the form A ::= alpha, with a single non-terminal on the
left-hand side. In a general grammar, a production has the form beta
::= alpha, where beta is a non-empty string containing at least one
non-terminal symbol. Productions of the restricted form are called
context-free because, intuitively, they can only define a syntactic category
in isolation, i.e. independently of the context in which it occurs.
A grammar defines a language, namely a particular subset of all the
possible strings on the alphabet. Intuitively, the language generated by
a grammar G = < T , N , S , P > is the set of all strings in T*
that can be obtained by starting from the initial symbol S and
repeatedly applying the productions until we obtain
a string of terminal symbols only. More formally:
-
Definition
-
A derivation is a sequence
alpha_0 -> alpha_1 -> ... -> alpha_n
such that:
-
each alpha_i is an element of (N union T)*
-
for each i < n, alpha_i is of the form beta A gamma, where
beta and gamma are elements of (N union T)*,
and there is a production of the form A ::= delta
such that alpha_(i+1) is beta delta gamma.
We will represent such a derivation with the notation alpha_0 ->* alpha_n.
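A single derivation step can be sketched in code as follows. This is an illustration under stated assumptions, not part of the lecture: strings are tuples of symbols, productions are (A, delta) pairs, and the function names one_step and is_derivation are hypothetical. The productions used are those of the balanced-parentheses example in this lecture.

```python
def one_step(alpha, productions):
    """Return every string obtainable from alpha in one derivation step."""
    results = []
    for i, symbol in enumerate(alpha):
        for head, delta in productions:
            if symbol == head:
                # alpha = beta A gamma  becomes  beta delta gamma
                results.append(alpha[:i] + delta + alpha[i + 1:])
    return results


def is_derivation(sequence, productions):
    """Check that each string follows from the previous one in one step."""
    return all(nxt in one_step(cur, productions)
               for cur, nxt in zip(sequence, sequence[1:]))


PRODUCTIONS = [
    ("A", ("(", ")")),
    ("A", ("(", "A", ")")),
    ("A", ("A", "A")),
]

derivation = [
    ("A",),
    ("A", "A"),
    ("(", ")", "A"),
    ("(", ")", "(", ")"),
]
print(is_derivation(derivation, PRODUCTIONS))  # True
```

Note that one_step tries every occurrence of every non-terminal, which mirrors the definition: any A in alpha_i may be the one that is rewritten.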
-
Definition
-
The language generated by a grammar G, written L(G),
is the set of all strings in T* which
can be obtained with a derivation starting from the initial symbol S.
Formally:
L(G) = { alpha in T* | S ->* alpha }
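Since L(G) is in general infinite, a program can only list a finite part of it. The sketch below enumerates the strings of L(G) up to a given length by exploring derivations breadth-first. It is an illustration, not part of the lecture: the function name language_up_to is hypothetical, and the code assumes that no production has an empty right-hand side, so that strings never get shorter during a derivation and anything longer than the bound can be discarded. It expands only the leftmost non-terminal, which loses no terminal strings because the order in which non-terminals are expanded does not affect the final string.

```python
from collections import deque


def language_up_to(start, productions, nonterminals, max_len):
    """Collect the strings of L(G) having at most max_len symbols.

    Assumption: every production body is non-empty.
    """
    seen = {(start,)}
    queue = deque(seen)
    language = set()
    while queue:
        alpha = queue.popleft()
        if all(sym not in nonterminals for sym in alpha):
            # Only terminals left: alpha belongs to L(G).
            language.add("".join(alpha))
            continue
        # Position of the leftmost non-terminal in alpha.
        i = next(k for k, sym in enumerate(alpha) if sym in nonterminals)
        for head, delta in productions:
            if head == alpha[i]:
                nxt = alpha[:i] + delta + alpha[i + 1:]
                if len(nxt) <= max_len and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return language


PRODUCTIONS = [
    ("A", ("(", ")")),
    ("A", ("(", "A", ")")),
    ("A", ("A", "A")),
]
print(sorted(language_up_to("A", PRODUCTIONS, {"A"}, 4)))
# ['(())', '()', '()()']
```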
Let us see some examples of languages and grammars.
-
Example: The language of balanced parentheses
-
Terminals: the symbols ( and )
-
Non-terminals: the symbol A
-
Initial symbol: A
-
Productions: A ::= ( ) | (A) | AA
(This is an abbreviated notation standing for the productions
A ::= ( ) , A ::= (A) , and A ::= AA .)
-
Examples of strings generated by this grammar are: ( ), ( )( ) and (( ))( ).
The corresponding derivations are:
-
A -> ( )
-
A -> AA -> ( )A -> ( )( )
-
A -> AA -> (A)A -> (A)( ) -> (( ))( )
-
Example: The language of arithmetic expressions with plus and times operations
-
Terminals: the symbols + and * and (the representation of) the natural numbers
-
Non-terminals: the symbols Exp and Num
-
Initial symbol: Exp
-
Productions:
Exp ::= Num | Exp + Exp | Exp * Exp
Num ::= 0 | 1 | 2 | 3 | ...
(note that we need infinitely many productions for Num)
-
Examples of strings generated by this grammar are: 2, 2 + 3 and 2 + 3 * 5.
The corresponding derivations are:
-
Exp -> Num -> 2
-
Exp -> Exp + Exp -> Num + Exp -> Num + Num -> 2 + Num -> 2 + 3
-
Exp -> Exp + Exp -> Num + Exp -> Num + Exp * Exp ->* Num + Num * Num ->* 2 + 3 * 5
-
Exercise
-
In the example above, we have assumed an infinite set of productions for
Num. This is unrealistic. Give a grammar with finitely many productions
which generates the natural numbers in their usual representation, i.e.
as sequences of the digits 0, ..., 9
starting with a digit different from 0.
Derivation Tree
Derivation trees (also called "parse
trees" in Sethi's book) are a way to represent the generation of
strings in a grammar. They also give information about the structure
of the strings, i.e. the way they are organized in syntactical categories.
-
Definition
-
Given a grammar G = < T , N , S , P > ,
a derivation tree t for G is a tree such that:
-
the root is labeled by S
-
the leaves are labeled by terminal symbols
-
each internal node (the root included) is labeled by a non-terminal symbol, and,
if its label is A, then its children are labeled, from left to right,
by symbols s_1 , s_2 , ... , s_n such that there exists a production
A ::= s_1 s_2 ... s_n in P
The labels of the leaves (the fringe), read from left to right, represent
the string generated by t.
We will indicate it by string(t).
It is easy to see that a derivation tree represents a set of
derivations
(usually more than one) for the same string, and that for each derivation
there is a derivation tree for the same string. Hence L(G)
coincides with the set of strings generated by all possible derivation
trees for G. More formally, if
we denote by DT(G) the set of all derivation
trees for G, we have the following result:
-
Proposition
-
L(G) = { alpha in T* | alpha = string(t) for some t in DT(G) }
-
Example
-
Let us consider again the language of arithmetic expressions, with productions
-
Exp ::= Num | Exp + Exp | Exp * Exp
-
We have that a possible derivation tree for the string
2 + 3 * 5 is the following:
          Exp
         /|\
        / | \
       /  |  \
    Exp   +   Exp
     |        /|\
     |       / | \
     |      /  |  \
    Num  Exp   *   Exp
     |    |         |
     2   Num       Num
          |         |
          3         5
-
This tree corresponds to several derivations for the same string, which
differ only in the choice of the non-terminal to expand at each derivation
step.
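The tree above and the function string(t) can be modeled as follows. This is a sketch under assumptions that are not part of the lecture: a node is a pair (label, children), a leaf has an empty tuple of children, and only the finite slice of the Num productions needed for this example (2, 3 and 5) is listed. The names leaf, num, fringe and is_derivation_tree are hypothetical.

```python
PRODUCTIONS = {
    ("Exp", ("Num",)),
    ("Exp", ("Exp", "+", "Exp")),
    ("Exp", ("Exp", "*", "Exp")),
    # Assumption: only the Num productions used in this example.
    ("Num", ("2",)), ("Num", ("3",)), ("Num", ("5",)),
}
TERMINALS = {"+", "*", "2", "3", "5"}


def leaf(symbol):
    return (symbol, ())


def num(digit):
    """The subtree Exp - Num - digit."""
    return ("Exp", (("Num", (leaf(digit),)),))


def fringe(tree):
    """string(t): the labels of the leaves, read from left to right."""
    label, children = tree
    if not children:
        return [label]
    return [sym for child in children for sym in fringe(child)]


def is_derivation_tree(tree, start="Exp"):
    """Check the three conditions of the definition of derivation tree."""
    def ok(node):
        label, children = node
        if not children:
            return label in TERMINALS
        body = tuple(child[0] for child in children)
        return (label, body) in PRODUCTIONS and all(ok(c) for c in children)
    return tree[0] == start and ok(tree)


# The tree for 2 + 3 * 5 drawn above.
tree = ("Exp", (num("2"), leaf("+"), ("Exp", (num("3"), leaf("*"), num("5")))))
print(" ".join(fringe(tree)))    # 2 + 3 * 5
print(is_derivation_tree(tree))  # True
```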
Ambiguity
The structure of an expression is usually essential to interpret its meaning.
The expression 2 + 3 * 5, for example, has two different values depending on
its intended structure. If we assume it to be 2 + ( 3 * 5 ) (i.e. * is applied
to 3 and 5), then the result is 17. If, on the other hand, we assume it to be
( 2 + 3 ) * 5, then the result is 25. In order to avoid this kind of
ambiguity, it is essential that the grammar generates only one possible
structure for each string in the language.
Since the structure is represented by the derivation tree, we have the
following definition:
-
Definition
-
A grammar G is ambiguous
if there exists a string in L(G) which
can be generated by two (or more) different derivation trees.
-
Example
-
The grammar in the example above is ambiguous:
in fact, the string 2 + 3 * 5
can also be generated by the following tree:
              Exp
              /|\
             / | \
            /  |  \
         Exp   *   Exp
         /|\        |
        / | \      Num
       /  |  \      |
    Exp   +   Exp   5
     |         |
    Num       Num
     |         |
     2         3
This tree corresponds to the grouping ( 2 + 3 ) * 5, while the tree in the
previous example corresponds to 2 + ( 3 * 5 ).
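The two groupings can also be found mechanically. The sketch below, which is an illustration and not part of the lecture, goes through every derivation tree of a token sequence under the productions Exp ::= Num | Exp + Exp | Exp * Exp and returns the value of each tree. The function name values_per_tree is hypothetical, and the code assumes the input is already split into tokens, with natural numbers written as digit strings.

```python
def values_per_tree(tokens):
    """Return one value per derivation tree of tokens from Exp.

    The length of the result is the number of distinct derivation trees.
    """
    tokens = tuple(tokens)

    def values(i, j):
        # Values of all trees deriving tokens[i:j] from Exp.
        if j - i == 1:
            # Exp -> Num -> number: exactly one tree.
            return [int(tokens[i])] if tokens[i].isdigit() else []
        out = []
        # Try every operator position k as the root production
        # Exp ::= Exp + Exp  or  Exp ::= Exp * Exp.
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                for left in values(i, k):
                    for right in values(k + 1, j):
                        out.append(left + right if tokens[k] == "+" else left * right)
        return out

    return values(0, len(tokens))


print(values_per_tree("2 + 3 * 5".split()))  # [17, 25]
```

A result with more than one entry witnesses the ambiguity, even when the values happen to agree: for 2 + 3 + 5 the function returns two entries, both equal to 10.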
There are languages which are intrinsically ambiguous,
i.e. it is not possible to eliminate their ambiguities without changing
the language.
-
Definition
-
A language L is intrinsically ambiguous
if it can be generated only by ambiguous grammars, i.e. for every grammar
G such that L = L(G),
we have that G is ambiguous.
Fortunately, languages which are interesting from the point of view of
programming usually are not intrinsically ambiguous, and therefore we can find
non-ambiguous grammars which generate them. When a (non-intrinsically
ambiguous) language L is presented by an ambiguous grammar G, "to eliminate
the ambiguities of G" means to find another grammar G', which is
non-ambiguous, and which generates the same language L.
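As an illustration, one common way to obtain such a G' for the arithmetic expressions is to layer the grammar with an extra non-terminal. The grammar below is a standard technique and an assumption of this sketch, not something given in the lecture; the non-terminal Term and the function name count_trees_layered are hypothetical:

Exp ::= Exp + Term | Term
Term ::= Term * Num | Num

This choice makes * bind tighter than + and groups repeated operators to the left. The code counts the derivation trees of a token sequence under this grammar, so that one can check on examples that there is exactly one.

```python
from functools import lru_cache


def count_trees_layered(tokens):
    """Count derivation trees of tokens under the layered grammar
    Exp ::= Exp + Term | Term,  Term ::= Term * Num | Num.
    """
    tokens = tuple(tokens)

    def num(i, j):
        # Num derives exactly one number token.
        return 1 if j - i == 1 and tokens[i].isdigit() else 0

    @lru_cache(maxsize=None)
    def term(i, j):
        total = num(i, j)                      # Term ::= Num
        for k in range(i + 1, j - 1):
            if tokens[k] == "*":               # Term ::= Term * Num
                total += term(i, k) * num(k + 1, j)
        return total

    @lru_cache(maxsize=None)
    def exp(i, j):
        total = term(i, j)                     # Exp ::= Term
        for k in range(i + 1, j - 1):
            if tokens[k] == "+":               # Exp ::= Exp + Term
                total += exp(i, k) * term(k + 1, j)
        return total

    return exp(0, len(tokens))


print(count_trees_layered("2 + 3 * 5".split()))  # 1
```

The single tree for 2 + 3 * 5 under this grammar corresponds to the grouping 2 + ( 3 * 5 ).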