CSE 428: Lecture 2  

Formal Definition of Programming Languages

Why formal definitions

Natural languages are full of ambiguities. An example is a phrase such as "Every boy likes a girl". This sentence can be interpreted in different ways: does each boy like a different girl, or do they all like the same girl?

Another example is:

Again, several meanings are possible, depending on whether the rabbit is a pet, a dish, or a fur.

In human-to-human communication, ambiguities in the language are usually resolved by using additional knowledge (context, culture, ...). When we deal with computers, however, we cannot expect them to have such knowledge. Yet we want them to understand and interpret a program according to the programmer's intentions. Therefore it is important that programming languages be defined rigorously (i.e. formally) and leave no room for ambiguity. It is essential that all the people who deal with the language (implementers, tool developers, users) attribute the same meaning to each sentence.

The various levels of definition of a language

There are various levels of definition:

Syntax

Static semantics

Semantics (or dynamic semantics)


Context-free grammars

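A context-free grammar is a tuple
G = < T , N , s , P >
where:
T is a set of terminal symbols (the tokens, which form the alphabet of the language),
N is a set of non-terminal symbols (the syntactic categories),
s is a distinguished element of N, called the initial symbol,
P is a set of productions of the form A ::= alpha, where A is in N and alpha is a string of symbols of T and N.
Note that the tokens do not necessarily coincide with the symbols of the keyboard. For instance, the Pascal keywords if, then and else are tokens (not sequences of tokens). Whenever it is not clear from the context, we will distinguish these tokens from the symbols of syntactic categories by underlining them, or by using boldface for the former and normal style for the latter.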

The term "context-free" refers to the fact that the productions are restricted to be of the form A ::= alpha, with a single non-terminal on the left-hand side. In general, a grammar production has the form beta ::= alpha, where beta is a non-empty string containing at least one non-terminal symbol. Productions of the form A ::= alpha are called context-free because, intuitively, they define a syntactic category in isolation, i.e. independently of the context in which it occurs.
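The restriction on the shape of productions can be made concrete with a small sketch. The class name Grammar, the function is_context_free, and the two sample grammars below are hypothetical, introduced only for illustration; symbols are assumed to be plain strings and each production a pair (left-hand side, right-hand side) of tuples of symbols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    """A grammar G = < T , N , s , P > (illustrative representation)."""
    terminals: frozenset      # T
    nonterminals: frozenset   # N
    start: str                # s
    productions: tuple        # P: pairs (lhs, rhs), each a tuple of symbols


def is_context_free(g: Grammar) -> bool:
    """True if s is in N and every production has the form A ::= alpha."""
    symbols = g.terminals | g.nonterminals
    return g.start in g.nonterminals and all(
        len(lhs) == 1
        and lhs[0] in g.nonterminals
        and all(x in symbols for x in rhs)
        for lhs, rhs in g.productions
    )


# Hypothetical context-free grammar: S ::= a S b | (empty string)
cf = Grammar(
    frozenset({"a", "b"}), frozenset({"S"}), "S",
    ((("S",), ("a", "S", "b")), (("S",), ())),
)

# Hypothetical general production a S ::= a b: the left-hand side is not
# a single non-terminal, so S is rewritten only in the context "a".
general = Grammar(
    frozenset({"a", "b"}), frozenset({"S"}), "S",
    ((("a", "S"), ("a", "b")),),
)

print(is_context_free(cf))       # True
print(is_context_free(general))  # False
```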

A grammar defines a language, namely a particular subset of all the possible strings over the alphabet. Intuitively, the language generated by a grammar G = < T , N , s , P > is the set of all strings over T that can be obtained by starting from the initial symbol s and repeatedly applying the productions until we obtain a string of terminal symbols only. More formally:

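A derivation is a sequence
alpha0 -> alpha1 -> ... -> alphan
such that each string in the sequence, except the first, is obtained from the previous one by replacing one occurrence of a non-terminal A with the right-hand side alpha of a production A ::= alpha in P. We will represent such a derivation with the notation alpha0 ->* alphan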
The language generated by a grammar G, L(G), is the set of all strings in T* which can be obtained with a derivation starting from the initial symbol s. Formally:
L(G) = { alpha in T* | s ->* alpha}
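The definitions of derivation step and of L(G) can be sketched in code. The grammar used here (S ::= a S b | empty string) is a hypothetical one chosen for illustration, as are the names PRODUCTIONS, step and language_up_to. Since L(G) is infinite, the sketch only enumerates the strings up to a given length; pruning on the number of terminals is enough to terminate for this particular grammar, but not for arbitrary ones.

```python
from collections import deque

# Hypothetical grammar: S ::= a S b | (empty string)
PRODUCTIONS = {"S": [("a", "S", "b"), ()]}
START = "S"


def step(form):
    """All strings reachable from `form` in one derivation step (->)."""
    results = []
    for i, sym in enumerate(form):
        for rhs in PRODUCTIONS.get(sym, []):
            # Replace the occurrence of the non-terminal at position i.
            results.append(form[:i] + rhs + form[i + 1:])
    return results


def language_up_to(max_len):
    """Strings alpha in T* with s ->* alpha and at most max_len symbols."""
    seen = {(START,)}
    queue = deque([(START,)])
    found = set()
    while queue:
        form = queue.popleft()
        if all(sym not in PRODUCTIONS for sym in form):
            found.add("".join(form))  # terminal symbols only: in L(G)
            continue
        for nxt in step(form):
            n_terminals = sum(1 for sym in nxt if sym not in PRODUCTIONS)
            if n_terminals <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(found, key=lambda w: (len(w), w))


print(language_up_to(4))  # ['', 'ab', 'aabb']
```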
Let us see some examples of languages and grammars.
Example: The language of balanced parentheses
Examples of strings generated by this grammar are:
The corresponding derivations are:
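As an illustration, the sketch below searches for a leftmost derivation of a given string of parentheses. It assumes the productions S ::= ( S ) S | empty string, which is one possible grammar for this language and not necessarily the one intended in the example; the names RULES, terminal_prefix and leftmost_derivation are hypothetical. Symbols are single characters, so a sentential form is an ordinary Python string.

```python
from collections import deque

# Assumed grammar for balanced parentheses: S ::= ( S ) S | (empty string)
RULES = {"S": ["(S)S", ""]}


def terminal_prefix(form):
    """The terminals of `form` that precede its leftmost non-terminal."""
    for i, c in enumerate(form):
        if c in RULES:
            return form[:i]
    return form


def leftmost_derivation(target, start="S"):
    """Return a derivation start ->* target as a list of strings, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        form = path[-1]
        if form == target:
            return path
        idx = next((i for i, c in enumerate(form) if c in RULES), None)
        if idx is None:
            continue  # a terminal string different from the target
        for rhs in RULES[form[idx]]:
            nxt = form[:idx] + rhs + form[idx + 1:]
            n_terminals = sum(1 for c in nxt if c not in RULES)
            # Prune forms that can no longer lead to the target.
            if n_terminals > len(target):
                continue
            if not target.startswith(terminal_prefix(nxt)):
                continue
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(" -> ".join(leftmost_derivation("()")))  # S -> (S)S -> ()S -> ()
print(leftmost_derivation("(()"))              # None
```

Under the assumed grammar every balanced string has exactly one leftmost derivation, so the search returns at most one answer.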
Example: The language of numerical expressions with plus and times operations
Examples of strings generated by this grammar are:
The corresponding derivations are:
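A derivation in such a grammar can be checked mechanically, step by step. The sketch below assumes the productions Exp ::= Nat | Exp + Exp | Exp * Exp together with one production Nat ::= n for every natural number n; this is a guess at the shape of the grammar, made for illustration, and the names EXP_RULES, rewrites, is_step and is_derivation are hypothetical. Strings are represented as tuples of symbols.

```python
# Assumed productions: Exp ::= Nat | Exp + Exp | Exp * Exp
EXP_RULES = [("Nat",), ("Exp", "+", "Exp"), ("Exp", "*", "Exp")]
NONTERMINALS = {"Exp", "Nat"}


def rewrites(symbol, rhs):
    """True if `symbol ::= rhs` is a production of the assumed grammar."""
    if symbol == "Exp":
        return rhs in EXP_RULES
    if symbol == "Nat":
        # Infinite family Nat ::= 0 | 1 | 2 | ..., tested by a predicate.
        return len(rhs) == 1 and rhs[0].isascii() and rhs[0].isdigit()
    return False


def is_step(a, b):
    """True if b is obtained from a by applying one production."""
    for i, sym in enumerate(a):
        end = len(b) - (len(a) - i - 1)  # where the replaced part ends in b
        if end < i:
            continue
        if b[:i] == a[:i] and b[end:] == a[i + 1:] and rewrites(sym, b[i:end]):
            return True
    return False


def is_derivation(seq):
    """True if seq derives a string of terminals from Exp."""
    return (
        seq[0] == ("Exp",)
        and all(is_step(a, b) for a, b in zip(seq, seq[1:]))
        and all(sym not in NONTERMINALS for sym in seq[-1])
    )


derivation = [
    ("Exp",),
    ("Exp", "+", "Exp"),
    ("Nat", "+", "Exp"),
    ("1", "+", "Exp"),
    ("1", "+", "Nat"),
    ("1", "+", "2"),
]
print(is_derivation(derivation))            # True
print(is_derivation([("Exp",), ("1",)]))    # False: Exp ::= 1 is not a production
```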
In the example above, we have assumed an infinite set of productions for Nat. This is unrealistic. Give a grammar which generates the natural numbers in their usual representation, i.e. as sequences of digits 0, ..., 9 starting with a digit different from 0.