CSE 428: Lecture 2
Formal Definition of Programming Languages
Why formal definitions
Natural languages are full of ambiguities. An example is the phrase:

    Every boy likes a girl.

This sentence can in fact be interpreted in different ways: does each boy
like a different girl, or do they all like the same girl?
Another example is:
Again, several meanings are possible, depending on whether the rabbit
is a pet, a dish, or a fur...
In human-to-human communication, ambiguities in the language are
usually resolved by using additional knowledge (context, culture, ...).
When we deal with computers, however, we cannot expect them to have such
knowledge. Yet we want them to understand and interpret a program
according to the programmer's intentions. It is therefore important that
programming languages be defined rigorously (i.e. formally) and leave no
room for ambiguity. It is essential that all the people who deal
with the language (implementers, tool developers, users) attribute the
same meaning to each sentence.
The various levels of definition of a language
There are various levels of definition:
Syntax

This term refers to the way a program is written, i.e. which sequences
of symbols are considered "legal" programs. Obviously, in order
to give a rigorous meaning to sentences, it is important first to establish
what these sentences can be and how they are constructed.

Usually the syntax is described by a context-free
grammar, which is a particularly simple kind of recursive definition.

The syntactic correctness of a program is
checked statically, i.e. before execution (commonly phrased as
"at compile time"). The tool which performs
this check is called a parser.
Static semantics

The static semantics specifies additional restrictions on the set of legal
sentences. The word "semantics" here is misleading: the concept of static
semantics is closer to syntax than to semantics.

The typical restrictions specified by the static semantics are those
which cannot be captured by a context-free grammar, like consistency
between the declaration of a variable and the way it is used in a program
(types). The rules of the static semantics
are usually defined in a formalism similar to the one used for specifying
the (dynamic) semantics. (This is the reason why the name "semantics" is
used.)

As for the syntax, correctness with respect to the static semantics (static
correctness) is checked at compile time. This check is called static
analysis.

The reason for distinguishing two levels of static rules is convenience:
it is usually more convenient to keep the syntax simple, so as to have more
efficient parsing, and to check the other static constraints in a separate
phase.
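To make the idea concrete, here is a minimal sketch in Python of the kind of check static analysis performs: verifying that every variable used in a program has been declared, with a consistent type. The mini-language, its program representation, and the function name are invented for this example; they are not part of any real language discussed here.

```python
# Sketch of a static-analysis check for a hypothetical mini-language:
# every variable must be declared before use, with a consistent type.
# A program is a list of ("decl", name, type) and ("use", name, type) entries.

def static_check(program):
    declared = {}              # variable name -> declared type
    errors = []
    for kind, name, typ in program:
        if kind == "decl":
            declared[name] = typ
        elif kind == "use":
            if name not in declared:
                errors.append(f"{name} used but not declared")
            elif declared[name] != typ:
                errors.append(f"{name} declared {declared[name]}, used as {typ}")
    return errors

prog = [("decl", "x", "int"), ("use", "x", "int"), ("use", "y", "bool")]
print(static_check(prog))      # -> ['y used but not declared']
```

Note that the check needs to remember the declarations seen so far; this is exactly the kind of "context" that a context-free grammar cannot express.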
Semantics (or dynamic semantics)

Defines the meaning of programs, i.e. what the result should be (depending
on the input data).

Various styles/formalisms have been proposed for describing the semantics:

Operational. The
meaning is defined by describing the way an abstract
machine interprets the program. There are two main kinds of description
in this style:

Big steps (or natural semantics)

Small steps
Nowadays this formalism is the most popular. It is simple enough to be
understood by programmers, and it is also the most appropriate as a
guide to implementation.

Denotational. The meaning is defined
by mapping each construct/operator to elements/functions of a certain
mathematical domain. This style is very elegant, but too complicated for
the majority of programmers.

Axiomatic. The meaning of each sentence
is a logical assertion relating the properties of the state before and
after the execution of the sentence. This formalism is used especially
in the development of systems for checking the dynamic correctness of programs,
but it is not so popular anymore, since those systems are usually too complicated
to be really useful.
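As an illustration of the operational (big-step) style, the following Python sketch evaluates simple arithmetic expressions: for each construct, a rule says how the abstract machine obtains its value from the values of its subexpressions. The representation of expressions as nested tuples is our own assumption for this example.

```python
# Big-step ("natural") operational semantics for arithmetic expressions,
# represented as nested tuples:
#   ("num", n), ("plus", e1, e2), ("times", e1, e2).

def evaluate(e):
    tag = e[0]
    if tag == "num":          # a number evaluates to itself
        return e[1]
    if tag == "plus":         # if e1 => v1 and e2 => v2, then e1 + e2 => v1 + v2
        return evaluate(e[1]) + evaluate(e[2])
    if tag == "times":        # if e1 => v1 and e2 => v2, then e1 * e2 => v1 * v2
        return evaluate(e[1]) * evaluate(e[2])
    raise ValueError(f"unknown construct: {tag}")

# The expression 2 + 3 * 5
expr = ("plus", ("num", 2), ("times", ("num", 3), ("num", 5)))
print(evaluate(expr))   # -> 17
```

A small-step description would instead rewrite the expression one elementary step at a time (2 + 3 * 5 > 2 + 15 > 17); the big-step style computes the final value in one judgment per construct.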
Context-free grammars

Definition

A context-free grammar is a tuple
G = < T , N , s , P >
where:

T is a set of terminal
symbols (tokens, alphabet)

N is a set of nonterminal
symbols (syntactic categories)

s is an element of N
and it is called starting (or initial)
symbol

P is a set of productions,
i.e. rules of the form A ::= alpha, where
A is an element of N
and alpha is a string of terminal and
nonterminal symbols, i.e. an element of (N
union T)^{*}.
Note that the tokens do not necessarily coincide with the symbols of the
keyboard. For instance, the Pascal keywords if, then and
else are tokens (not sequences of tokens). Whenever not clear from
the context, we will distinguish these tokens from the symbols of syntactic
categories by underlining them, or by using boldface for the former
and normal style for the latter.
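The following Python sketch of a tokenizer shows the point: a keyword like if is recognized as a single token, even though it is typed with several keyboard symbols. The token categories (KEYWORD, IDENT, NUM, SYMBOL) are our own choice for the example.

```python
import re

# Sketch of a tokenizer: keywords are single tokens, not sequences of
# keyboard symbols. Token categories are invented for this example.
KEYWORDS = {"if", "then", "else"}
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z]+)|(.))")

def tokenize(text):
    tokens = []
    for num, word, sym in TOKEN_RE.findall(text):
        if num:
            tokens.append(("NUM", num))
        elif word:
            tokens.append(("KEYWORD" if word in KEYWORDS else "IDENT", word))
        elif sym.strip():
            tokens.append(("SYMBOL", sym))
    return tokens

print(tokenize("if x then 1 else 2"))
# -> [('KEYWORD', 'if'), ('IDENT', 'x'), ('KEYWORD', 'then'),
#     ('NUM', '1'), ('KEYWORD', 'else'), ('NUM', '2')]
```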
The term "context-free" refers to the fact that the productions are
restricted to be of the form A ::= alpha.
In general, a grammar production would be of the form beta
::= alpha, where beta is
a nonempty string containing at least one nonterminal symbol. Productions
of the form A ::= alpha are called context-free because, intuitively, they
can only define a syntactic category in isolation, i.e. independently of the
context.
A grammar defines a language, namely a particular subset of all the
possible strings over the alphabet. Intuitively, the language generated by
a grammar G = < T , N , s , P > is
the set of all strings in T^{*}
that can be obtained by starting from the initial symbol
s and applying the productions repeatedly, until we obtain
a string of terminal symbols only. More formally:

Definition

A derivation is a sequence
alpha_{0} > alpha_{1} >
... > alpha_{n}
such that:

each alpha_{i} is an element
of (N union T)^{*}

for each i < n, alpha_{i}
is of the form beta A gamma, where
beta and gamma
are elements of (N union T)^{*},
and there is a production of the form A ::= delta
such that alpha_{i+1} is
beta delta gamma.
We will represent such a derivation with the notation alpha_{0}
>* alpha_{n}.

Definition

The language generated by a grammar
G, L(G),
is the set of all strings in T^{*} which
can be obtained with a derivation starting from the initial symbol s.
Formally:
L(G) = { alpha in T^{*} |
s >* alpha }
Let us see some examples of languages and grammars.

Example: The language of balanced parentheses

Terminals: the symbols ( and )

Nonterminals: the symbol A

Productions: A ::= () | (A) | AA
(This is an abbreviated notation standing for the productions A
::= () , A ::= (A) , and
A ::= AA .)

Examples of strings generated by this grammar are: () , ()() , and (())() .
The corresponding derivations are:

A > ()

A > AA > ()A > ()()

A > AA > (A)A > (A)() > (())()
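For this particular grammar, membership in the language can also be tested without building derivations: the strings generated by A ::= () | (A) | AA are exactly the nonempty balanced strings of parentheses, which a simple counter recognizes. The Python sketch below relies on that standard equivalence, which we state here as an assumption of the example rather than prove.

```python
# Membership test for the balanced-parentheses language: a nonempty
# string over {(, )} is in the language iff a counter of currently-open
# parentheses never goes negative and ends at zero.

def is_balanced(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a ')' with no matching '('
                return False
        else:
            return False           # symbol not in the alphabet {(, )}
    # the grammar generates nonempty strings only
    return depth == 0 and len(s) > 0

for s in ["()", "()()", "(())()", ")(", "(()"]:
    print(s, is_balanced(s))
```

This shows a general point: a grammar specifies a language, but a recognizer for the language may use a quite different (and more efficient) algorithm.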

Example: The language of numerical expressions
with plus and times operations

Terminals: the symbols + and *
and (the representation of) the natural numbers

Nonterminals: the symbols Exp and
Nat

Initial symbol: Exp

Productions:
Exp ::= Nat | Exp + Exp | Exp * Exp
Nat ::= 0 | 1 | 2 | 3 | ...
(note that we need infinitely many productions for Nat)

Examples of strings generated by this grammar are: 2 , 2 + 3 , and 2 + 3 * 5 .
The corresponding derivations are:

Exp > Nat > 2

Exp > Exp + Exp > Nat + Exp > Nat + Nat >
2 + Nat > 2 + 3

Exp > Exp + Exp > Nat + Exp > Nat + Exp * Exp
>* Nat + Nat * Nat >* 2 + 3
* 5
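Applying the productions exhaustively, one can enumerate (a finite part of) the language generated by this grammar. The Python sketch below does so by always rewriting the leftmost nonterminal, bounding the length of the strings considered; to keep the set of productions finite, Nat is restricted to single digits, which is an assumption of the example (the exercise below asks for a proper grammar for multi-digit numbers).

```python
# Enumerate strings of L(G) for the expression grammar by exhaustively
# rewriting the leftmost nonterminal, bounded by string length.
# Nat is restricted to single digits to keep the productions finite.

prods = [("Exp", ("Nat",)),
         ("Exp", ("Exp", "+", "Exp")),
         ("Exp", ("Exp", "*", "Exp"))]
prods += [("Nat", (str(d),)) for d in range(10)]
nonterminals = {"Exp", "Nat"}

def generate(max_len):
    terminal_strings = set()
    frontier = [("Exp",)]                   # start from the initial symbol
    while frontier:
        alpha = frontier.pop()
        if len(alpha) > max_len:            # prune strings that grew too long
            continue
        nts = [i for i, sym in enumerate(alpha) if sym in nonterminals]
        if not nts:                         # terminal symbols only: in L(G)
            terminal_strings.add(" ".join(alpha))
            continue
        i = nts[0]                          # rewrite the leftmost nonterminal
        for lhs, delta in prods:
            if lhs == alpha[i]:
                frontier.append(alpha[:i] + delta + alpha[i+1:])
    return terminal_strings

strings = generate(5)
print("2 + 3" in strings)   # -> True
```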

Exercise

In the example above, we have assumed an infinite set of productions for
Nat. This is unrealistic. Give a grammar
which generates the natural numbers in their usual representation, i.e.
as sequences of digits 0, ..., 9
starting with a digit different from 0.