Formal Definition of Programming Languages

Why formal definitions

Natural languages are full of ambiguities. An example is the phrase:

    Every boy likes a girl

This sentence can in fact be interpreted in different ways: does each boy like a different girl, or do they all like the same girl? Another example is:

    I like your rabbit

Again, several meanings are possible, depending on whether the rabbit is a pet, or a dish, or a fur...

In human-to-human communication, ambiguities in the language are usually resolved by using additional knowledge (context, culture, ...). When we deal with computers, however, we cannot expect them to have such knowledge. Yet we want them to understand and interpret a program according to the programmer's intentions. Therefore it is important that programming languages be defined rigorously (i.e. formally) and leave no room for ambiguity. It is essential that all the people who deal with the language (implementers, tool developers, users) attribute the same meaning to each sentence.

The various levels of definition of a language

There are various levels of definition:

Syntax

This term refers to the way a program is written, i.e. which sequences of symbols are considered "legal" programs. Obviously, in order to give a rigorous meaning to sentences it is important first to establish what these sentences can be and how they are constructed. Usually the syntax is described by a context-free grammar, which is a particularly simple kind of recursive definition. The syntactical correctness of a program is checked statically, i.e. before the execution (as one commonly says, "at compile time"). The tool which performs this check is called a parser.

Static semantics

The static semantics specifies additional restrictions on the set of legal sentences. The word "semantics" here is misleading: the concept of static semantics is closer to syntax than to semantics. The typical restrictions specified by the static semantics are those which cannot be captured by a context-free grammar, like consistency between the declaration of a variable and the way it is used in a program (types). The rules of the static semantics are usually defined in a formalism similar to the one used for specifying the (dynamic) semantics. (This is the reason why the name "semantics" is used.) As with the syntax, correctness w.r.t. the static semantics (static correctness) is checked at compile time. This check is called static analysis. The reason for distinguishing two levels of static rules is convenience: it is usually more convenient to keep the syntax simple, so as to make parsing more efficient, and to check the other static constraints in a separate phase.

Semantics (or dynamic semantics)

Defines the meaning of programs, i.e. what the result should be, depending on the input data. Various styles/formalisms have been proposed for describing the semantics:

Operational. The meaning is defined by describing the way an abstract machine interprets the program. There are two main kinds of description in this style: big steps (also called natural semantics) and small steps. Nowadays this is the most popular formalism: it is simple enough to be understood by programmers, and it is also the most appropriate as a guide to implementation (a small sketch in this style is given after this list).

Denotational. The meaning is defined by mapping each construct/operator to elements/functions of a suitable mathematical domain. This style is very elegant, but too complicated for the majority of programmers.

Axiomatic. The meaning of each sentence is a logical assertion relating the properties of the state before and after the execution of the sentence. This formalism is used especially in the development of systems for checking the dynamic correctness of programs, but it is not so popular anymore, since those systems are usually too complicated to be really useful.
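To make the small-step style concrete, here is a minimal sketch in Python for a tiny expression language with numbers and addition. The encoding of syntax trees as nested tuples and the names step and evaluate are our own illustrative choices, not part of any standard formalism.

    # A minimal sketch of the small-step style for a tiny expression
    # language e ::= n | e + e. Syntax trees are encoded as nested
    # tuples ('+', left, right); plain integers are the values. The
    # names step and evaluate are illustrative choices.

    def step(e):
        """Perform one reduction step, or return None if e is a value."""
        if isinstance(e, int):
            return None                      # values do not reduce
        op, left, right = e                  # e = ('+', left, right)
        if isinstance(left, int) and isinstance(right, int):
            return left + right              # rule: n1 + n2 -> their sum
        if not isinstance(left, int):
            return (op, step(left), right)   # reduce inside the left operand
        return (op, left, step(right))       # otherwise reduce the right one

    def evaluate(e):
        """The big-step result, obtained here by iterating small steps."""
        while not isinstance(e, int):
            e = step(e)
        return e

    print(evaluate(('+', ('+', 1, 2), 4)))   # (1+2)+4 -> 3+4 -> 7

Note how the big-step behaviour (evaluate) is recovered by iterating the small-step relation (step) until a value is reached; a big-step description would instead define the result of each construct directly by recursion on its structure.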
Context-free grammars

Definition

A context-free grammar is a tuple G = < T , N , s , P > where:

- T is a set of terminal symbols (tokens, alphabet);
- N is a set of non-terminal symbols (syntactic categories);
- s is an element of N, called the starting (or initial) symbol;
- P is a set of productions, i.e. rules of the form A ::= alpha, where A is an element of N and alpha is a string of terminal and non-terminal symbols, i.e. an element of (N union T)*.

Note that the tokens do not necessarily coincide with the symbols of the keyboard. For instance, Pascal's keywords if, then and else are tokens (not sequences of tokens). Whenever it is not clear from the context, we will distinguish these tokens from the symbols of the syntactic categories by underlining them, or by using boldface style for the former and normal style for the latter.

The term "context-free" refers to the fact that the productions are restricted to be of the form A ::= alpha. In general a grammar production could be of the form beta ::= alpha, where beta is a non-empty string containing at least one non-terminal symbol. Productions of the form A ::= alpha are called context-free because, intuitively, they can only define a syntactic category in isolation, i.e. independently of the context.

A grammar defines a language, namely a particular subset of all the possible strings on the alphabet. Intuitively, the language generated by a grammar G = < T , N , s , P > is the set of all the strings of terminal symbols that can be obtained by starting from the initial symbol s and applying the productions repeatedly, until we obtain a string of terminal symbols only. More formally:

Definition

A derivation is a sequence alpha_0 -> alpha_1 -> ... -> alpha_n such that:

- each alpha_i is an element of (N union T)*;
- for each i < n, alpha_i is of the form beta A gamma, where beta and gamma are elements of (N union T)*, and there is a production of the form A ::= delta such that alpha_{i+1} is beta delta gamma.

We will represent such a derivation with the notation alpha_0 ->* alpha_n.

Definition

The language generated by a grammar G, written L(G), is the set of all the strings in T* which can be obtained with a derivation starting from the initial symbol s. Formally:

    L(G) = { alpha in T* | s ->* alpha }

Let us see some examples of languages and grammars.

Example: The language of balanced parentheses

Terminals: the symbols ( and )
Non-terminals: the symbol A
Initial symbol: A
Productions: A ::= () | (A) | AA

(This is an abbreviated notation standing for the three productions A ::= (), A ::= (A), and A ::= AA.)

Examples of strings generated by this grammar are:

    ()
    ()()
    (())()

The corresponding derivations are:

    A -> ()
    A -> AA -> ()A -> ()()
    A -> AA -> (A)A -> (A)() -> (())()
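Before moving to the next example, here is a small Python sketch that encodes this grammar as the tuple < T , N , s , P > and mechanically replays the last derivation above, rewriting one non-terminal at a time. The data encoding and the helper name apply_production are illustrative assumptions, not standard notation.

    # A sketch of the balanced-parentheses grammar G = <T, N, s, P>,
    # encoded directly as Python data. Symbols are one-character
    # strings; a sentential form is a list of symbols. The encoding
    # and the helper name are illustrative choices.

    T = {'(', ')'}                      # terminal symbols
    N = {'A'}                           # non-terminal symbols
    s = 'A'                             # initial symbol
    P = [                               # productions: A ::= () | (A) | AA
        ('A', ['(', ')']),              # P[0]:  A ::= ()
        ('A', ['(', 'A', ')']),         # P[1]:  A ::= (A)
        ('A', ['A', 'A']),              # P[2]:  A ::= AA
    ]

    def apply_production(form, pos, prod):
        """One derivation step: rewrite the non-terminal at index pos
        of the sentential form, so beta A gamma becomes beta delta gamma."""
        lhs, rhs = prod
        assert form[pos] == lhs, "production must match the symbol at pos"
        return form[:pos] + rhs + form[pos + 1:]

    # Replay the derivation A -> AA -> (A)A -> (A)() -> (())() from above.
    form = [s]
    for pos, prod in [(0, P[2]), (0, P[1]), (3, P[0]), (1, P[0])]:
        form = apply_production(form, pos, prod)
        print(''.join(form))            # prints AA, (A)A, (A)(), (())()

Each call to apply_production is exactly one step of the derivation relation defined above: it picks an occurrence of a non-terminal in the sentential form and replaces it with the right-hand side of a matching production.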
Example: The language of numerical expressions with plus and times operations

Terminals: the symbols + and * and (the representations of) the natural numbers
Non-terminals: the symbols Exp and Nat
Initial symbol: Exp
Productions:

    Exp ::= Nat | Exp + Exp | Exp * Exp
    Nat ::= 0 | 1 | 2 | 3 | ...

(Note that we need infinitely many productions for Nat.)

Examples of strings generated by this grammar are:

    2
    2 + 3
    2 + 3 * 5

The corresponding derivations are:

    Exp -> Nat -> 2
    Exp -> Exp + Exp -> Nat + Exp -> Nat + Nat -> 2 + Nat -> 2 + 3
    Exp -> Exp + Exp -> Nat + Exp -> Nat + Exp * Exp ->* Nat + Nat * Nat ->* 2 + 3 * 5

Exercise

In the example above we have assumed an infinite set of productions for Nat. This is unrealistic. Give a grammar which generates the natural numbers in their usual representation, i.e. as sequences of digits 0, ..., 9 starting with a digit different from 0.
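To close, the derivation machinery sketched for the parentheses grammar also works for the expression grammar, with whole tokens ('Exp', 'Nat', '+', '*', '2', ...) as symbols. The following Python sketch replays a leftmost variant of the third derivation of the expression example; as before, the encoding is an illustrative choice, and only the finitely many Nat productions actually used are listed.

    # The derivation sketch given earlier, adapted to the expression
    # grammar. Symbols are whole tokens rather than single characters;
    # only the Nat productions needed for this derivation are listed.

    P = {
        'E->N': ('Exp', ['Nat']),
        'E->+': ('Exp', ['Exp', '+', 'Exp']),
        'E->*': ('Exp', ['Exp', '*', 'Exp']),
        'N->2': ('Nat', ['2']),
        'N->3': ('Nat', ['3']),
        'N->5': ('Nat', ['5']),
    }

    def apply_production(form, pos, prod):
        """One derivation step, as in the earlier sketch."""
        lhs, rhs = prod
        assert form[pos] == lhs
        return form[:pos] + rhs + form[pos + 1:]

    # Leftmost derivation of 2 + 3 * 5 from the initial symbol Exp.
    form = ['Exp']
    for pos, key in [(0, 'E->+'), (0, 'E->N'), (0, 'N->2'),
                     (2, 'E->*'), (2, 'E->N'), (2, 'N->3'),
                     (4, 'E->N'), (4, 'N->5')]:
        form = apply_production(form, pos, P[key])
    print(' '.join(form))               # prints: 2 + 3 * 5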