CSE 428: Lecture 1

Formal Definition of Programming Languages

Why formal definitions

Natural languages are full of ambiguities. An example is the phrase:

Every boy likes a girl

This sentence can in fact be interpreted in different ways: does each boy like a different girl, or do they all like the same girl?

Another example is:

I like your rabbit

Again, several meanings are possible, depending on whether the rabbit is a pet, or a dish, or a fur...

In human-to-human communication, ambiguities in the language are usually resolved by using additional knowledge (context, culture,...). When we deal with computers, however, we cannot expect them to have such knowledge. Yet we want them to understand and interpret a program according to the programmer's intentions. Therefore it is important that programming languages be defined rigorously (i.e. formally) and leave no room for ambiguity. It is essential that all the people who deal with the language (implementers, tool developers, users) attribute the same meaning to each sentence.

The various levels of definition of a language

There are various levels of definition:

Syntax

This term refers to the way a program is written, i.e. which sequences of symbols are considered "legal" programs. In order to give a rigorous meaning to sentences, it is important first to establish what these sentences can be and how they are constructed.
Usually the syntax is described by a context-free grammar, which is a particularly simple kind of recursive definition.
The syntactical correctness of a program is checked statically, i.e. before the execution (commonly referred to as "at compile time"). The tool which performs this check is called a parser.

Static semantics

The static semantics specifies additional restrictions on the set of legal sentences. The word "semantics" here is misleading: the concept of static semantics is closer to syntax than to semantics.
The typical restrictions specified by the static semantics are those which cannot be captured by a context-free grammar, like consistency between the declaration of a variable and the way it is used in a program (types). The rules of the static semantics are usually defined in a formalism similar to the one used for specifying the (dynamic) semantics. (This is the reason why the name "semantics" is used.)
As with the syntax, correctness with respect to the static semantics (static correctness) is checked at compile time. This check is called static analysis.
The reason for distinguishing two levels of static rules is convenience: it is usually more convenient to keep the syntax simple, so as to have more efficient parsing, and to check the other static constraints in a separate phase.
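
To make the idea of a static check concrete, here is a minimal sketch in Python. The toy program format (a list of ("declare", name, type) and ("assign", name, value) tuples) and the function name check_static are hypothetical, chosen only for illustration. The point is that the check inspects the program text without executing it, and enforces a declaration/use constraint of the kind that a context-free grammar cannot express.

```python
# Toy static analysis: every assigned variable must have been declared,
# and the value assigned to it must match the declared type.
# The program format and all names here are illustrative assumptions;
# assigned values are assumed to be Python ints or bools.

def check_static(program):
    """Return a list of static errors; an empty list means statically correct."""
    declared = {}   # variable name -> declared type ("int" or "bool")
    errors = []
    for statement in program:
        kind = statement[0]
        if kind == "declare":
            _, name, type_name = statement
            if name in declared:
                errors.append(f"{name}: declared twice")
            declared[name] = type_name
        elif kind == "assign":
            _, name, value = statement
            if name not in declared:
                errors.append(f"{name}: used before declaration")
                continue
            # bool is tested first because in Python bool is a subclass of int
            actual = "bool" if isinstance(value, bool) else "int"
            if actual != declared[name]:
                errors.append(f"{name}: declared {declared[name]}, assigned {actual}")
    return errors

well_typed = [("declare", "x", "int"), ("assign", "x", 3)]
ill_typed = [("declare", "b", "bool"), ("assign", "b", 7), ("assign", "y", 1)]

print(check_static(well_typed))
print(check_static(ill_typed))
```

Both example programs would be accepted by a parser for this toy format; only the second one is rejected, and it is rejected before any execution takes place.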

Semantics (or dynamic semantics)

The semantics defines the meaning of programs, i.e. what the result should be, depending on the input data.
Various styles/formalisms have been proposed for describing the semantics:

Operational. The meaning is defined by describing the way an abstract machine interprets the program. There are two main kinds of description in this style:

Big steps (or natural semantics)
Small steps

Denotational. The meaning is defined by mapping each construct/operator to an element/function of a certain mathematical domain. This style is very elegant but too complicated for the majority of programmers.
Axiomatic. The meaning of each sentence is a logical assertion, relating the properties of the state before and after the execution of the sentence. This formalism is used especially in the development of systems for checking the dynamic correctness of programs, but it is not so popular anymore since those systems are usually too complicated to be really useful.
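
The difference between the two operational styles can be sketched in Python on a toy language of arithmetic expressions. The representation (a Python int for a numeral, a tuple (op, left, right) for a compound expression) and all function names are assumptions made for this illustration, not part of any standard definition. The big-step function relates an expression directly to its final value; the small-step function performs one elementary step at a time, so the whole computation appears as a sequence of intermediate configurations.

```python
# Expressions are either a Python int (a numeral) or a tuple (op, left, right)
# with op in {"+", "*"}. This representation is an illustrative assumption.

def apply_op(op, a, b):
    return a + b if op == "+" else a * b

def big_step(expr):
    """Big-step style: relate an expression directly to its final value."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return apply_op(op, big_step(left), big_step(right))

def small_step(expr):
    """Small-step style: perform exactly one computation step, leftmost first.
    Must only be called on a compound expression, never on a numeral."""
    op, left, right = expr
    if not isinstance(left, int):
        return (op, small_step(left), right)
    if not isinstance(right, int):
        return (op, left, small_step(right))
    return apply_op(op, left, right)

def run_small_steps(expr):
    """Iterate small_step until a value is reached; return all configurations."""
    trace = [expr]
    while not isinstance(expr, int):
        expr = small_step(expr)
        trace.append(expr)
    return trace

example = ("+", 1, ("*", 2, 3))   # the expression 1 + (2 * 3)
print(big_step(example))
for configuration in run_small_steps(example):
    print(configuration)
```

In this sketch both styles agree on the final result; the small-step version additionally exposes the intermediate configuration ("+", 1, 6).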

Context-free grammars

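Definition: A context-free grammar is a tuple G = < T , N , s , P > where

T is a set of terminal symbols (also called tokens),
N is a set of non-terminal symbols (also called syntactical categories), disjoint from T,
s is an element of N, called the initial symbol,
P is a set of productions of the form A ::= alpha, where A is an element of N and alpha is a string of symbols from T and N.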

Note that the tokens do not necessarily coincide with the symbols of the keyboard. For instance, the Pascal keywords if, then and else are tokens (not sequences of tokens). Whenever it is not clear from the context, we will distinguish these tokens from the symbols of syntactical categories by underlining them, or by using boldface for the former and normal style for the latter.

The term "context-free" refers to the fact that the productions are restricted to be of the form A ::= alpha, where A is a single non-terminal symbol. In general, a grammar production would be of the form beta ::= alpha, where beta is a non-empty string containing at least one non-terminal symbol. Productions of the restricted form are called context-free because, intuitively, they can only define a syntactic category in isolation, i.e. independently of the context in which it occurs.

A grammar defines a language, namely a particular subset of all the possible strings over the alphabet. Intuitively, the language generated by a grammar G = < T , N , s , P > is the set of all strings over T that can be obtained by starting from the initial symbol s and repeatedly applying the productions, until we obtain a string of terminal symbols only. More formally:

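Definition: A derivation is a sequence alpha_0 -> alpha_1 -> ... -> alpha_n of strings of symbols from T and N such that, for each i < n, alpha_(i+1) is obtained from alpha_i by replacing one occurrence of a non-terminal A with a string alpha, where A ::= alpha is a production in P.

Definition: The language generated by a grammar G, L(G), is the set of all strings over T which can be obtained with a derivation starting from the initial symbol s. Formally, writing T* for the set of all finite strings over T: L(G) = { alpha in T* | there is a derivation s -> ... -> alpha }.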
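
The two definitions can be tried out on a small case. The following Python sketch assumes a hypothetical grammar, S ::= a S b | a b, introduced here only for illustration; the function names are likewise assumptions. It computes all strings reachable in one derivation step, checks whether a given sequence of strings is a derivation, and enumerates the strings of L(G) up to a length bound.

```python
from collections import deque

# Hypothetical grammar G = <T, N, s, P>, used only for illustration:
#   S ::= a S b | a b
TERMINALS = {"a", "b"}
NONTERMINALS = {"S"}
START = "S"
PRODUCTIONS = {"S": [("a", "S", "b"), ("a", "b")]}

def one_step(form):
    """All forms obtained from `form` by rewriting one non-terminal occurrence."""
    results = []
    for i, symbol in enumerate(form):
        for rhs in PRODUCTIONS.get(symbol, []):
            results.append(form[:i] + rhs + form[i + 1:])
    return results

def is_derivation(forms):
    """Check that each form follows from the previous one in a single step."""
    return all(b in one_step(a) for a, b in zip(forms, forms[1:]))

def language_up_to(max_len):
    """Strings of L(G) with at most max_len symbols, by breadth-first search."""
    found, seen, queue = set(), {(START,)}, deque([(START,)])
    while queue:
        form = queue.popleft()
        if all(symbol in TERMINALS for symbol in form):
            found.add("".join(form))
            continue
        for successor in one_step(form):
            # No production of this grammar shrinks a form,
            # so forms longer than max_len can safely be discarded.
            if len(successor) <= max_len and successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return sorted(found, key=len)

derivation = [("S",), ("a", "S", "b"), ("a", "a", "b", "b")]
print(is_derivation(derivation))
print(language_up_to(6))
```

The pruning by length is specific to grammars without productions that shorten a string; for a grammar with empty productions the search bound would have to be chosen differently.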

Let us see some examples of languages and grammars.

Example: The language of balanced parentheses: Examples of strings generated by this grammar are:
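
As a hedged illustration, suppose the grammar is S ::= ( S ) S | (empty string). This particular form is an assumption, since several different grammars generate the balanced strings. Under it, strings such as (), (()) and (())() are generated, while )( and (() are not. A minimal recursive recognizer in Python, with hypothetical function names and one branch per production, could look like this:

```python
# Assumed grammar (one of several that generate the balanced strings):
#   S ::= ( S ) S | <empty>

def parse_s(text, pos):
    """Consume an S starting at pos; return the position after it,
    or None if some "(" has no matching ")"."""
    if pos < len(text) and text[pos] == "(":
        inner_end = parse_s(text, pos + 1)       # the S inside the parentheses
        if inner_end is None or inner_end >= len(text) or text[inner_end] != ")":
            return None
        return parse_s(text, inner_end + 1)      # the S after the ")"
    return pos                                   # the empty production

def is_balanced(text):
    """True if the whole string is generated by S."""
    return parse_s(text, 0) == len(text)

for candidate in ["", "()", "(())()", "(()", ")("]:
    print(repr(candidate), is_balanced(candidate))
```

The function parse_s mirrors the shape of the assumed production directly, which is possible here because the next symbol alone determines which alternative applies.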
Example: The language of numerical expressions with plus and times operations: Examples of strings generated by this grammar are:
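
As a hedged illustration, suppose the grammar is Exp ::= Nat | Exp + Exp | Exp * Exp, with one production Nat ::= n for every natural number n. This form is an assumption, consistent with the remark in the exercise that Nat has infinitely many productions. Under it, the generated strings are exactly the non-empty sequences in which numbers and operators alternate, beginning and ending with a number, for example 2 + 3 * 4. The Python sketch below tests membership on a list of tokens; each Nat token is represented by a non-negative Python int, and the function names are hypothetical.

```python
# Assumed grammar:
#   Exp ::= Nat | Exp + Exp | Exp * Exp
#   Nat ::= 0 | 1 | 2 | ...    (one production per natural number)
# A Nat token is represented by a non-negative Python int (an assumption).

def is_nat(token):
    # bool is excluded explicitly because in Python bool is a subclass of int
    return isinstance(token, int) and not isinstance(token, bool) and token >= 0

def is_exp(tokens):
    """True if the token list is generated by Exp under the assumed grammar."""
    if len(tokens) % 2 == 0:
        return False    # an alternating sequence Nat (op Nat)* has odd length
    operands_ok = all(is_nat(t) for t in tokens[0::2])
    operators_ok = all(t in ("+", "*") for t in tokens[1::2])
    return operands_ok and operators_ok

print(is_exp([2, "+", 3, "*", 4]))
print(is_exp([2, "+"]))
```

This check only decides membership in the language; it does not say how an expression such as 2 + 3 * 4 is to be grouped.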
Exercise: In the example above, we have assumed an infinite set of productions for Nat. This is unrealistic. Give a grammar which generates the natural numbers in their usual representation, i.e. as sequences of the digits 0, ..., 9 starting with a digit different from 0.