Fall 98, CSE 468: Lectures

Fall 98, CSE 468: Lecture 5 (Sep 4)

Regular Expressions and Regular Languages

Regular expressions

Regular expressions are all the expressions built from the following:

lambda
symbols of the alphabet
+ (binary)
concatenation (binary)
* (Kleene's star, unary)

We will use parentheses to represent the structure of an expression, and we will assume that * has precedence over concatenation, concatenation has precedence over +, and that concatenation and + are associative.

Language represented by a regular expression

The empty string lambda stands for {lambda}, an alphabet symbol a stands for {a}, + stands for union, and concatenation and * stand for the operations of the same name on languages.
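The interpretation above can be sketched as a small recursive function. This is only an illustration, not part of the lecture: the tuple encoding of expressions and the name language are assumptions made here, and since a language can be infinite, the function only returns the strings up to a given length.

```python
# Hypothetical encoding of regular expressions as nested tuples:
#   ("lam",)        the empty string lambda
#   ("sym", "a")    an alphabet symbol
#   ("+", r, s)     union
#   ("cat", r, s)   concatenation
#   ("*", r)        Kleene star

def language(expr, max_len):
    """Return the strings of length <= max_len in the language of expr."""
    kind = expr[0]
    if kind == "lam":
        return {""}
    if kind == "sym":
        return {expr[1]} if max_len >= 1 else set()
    if kind == "+":
        return language(expr[1], max_len) | language(expr[2], max_len)
    if kind == "cat":
        left = language(expr[1], max_len)
        right = language(expr[2], max_len)
        return {u + v for u in left for v in right if len(u + v) <= max_len}
    if kind == "*":
        base = language(expr[1], max_len)
        result = {""}      # zero repetitions always give lambda
        frontier = {""}
        while frontier:    # append one more factor until nothing new appears
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown operator: {kind!r}")

a, b = ("sym", "a"), ("sym", "b")
print(sorted(language(("*", ("+", a, b)), 2)))
# ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```

The length bound is what makes the star case terminate: only finitely many strings fit under the bound, so the loop eventually stops finding new ones.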

Example. The set of strings on {a,b} with odd length can be represented by the regular expression (a + b)(aa + ab + ba + bb)^*.
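As a quick sanity check of this example, the expression can be transcribed into the syntax of Python's re module, where | plays the role of + and concatenation is written by juxtaposition. The names ODD_LENGTH and has_odd_length are chosen here for illustration.

```python
import re
from itertools import product

# (a + b)(aa + ab + ba + bb)^* in Python's re syntax.
ODD_LENGTH = re.compile(r"(a|b)(aa|ab|ba|bb)*")

def has_odd_length(s):
    """True if the whole string s matches the expression."""
    return ODD_LENGTH.fullmatch(s) is not None

# Compare with the direct definition on every string over {a, b} of length 0..6.
for n in range(7):
    for letters in product("ab", repeat=n):
        s = "".join(letters)
        assert has_odd_length(s) == (n % 2 == 1)
```

Note the use of fullmatch: the question is whether the entire string belongs to the language, not whether some part of it does.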

Regular Languages

The class of Regular Languages consists of the languages represented by regular expressions. These languages are important in computer science because they are particularly simple to recognize (as we will see later) and because interesting parts of a programming language can be expressed as regular languages. Typically, the tokens of a programming language form a regular language. The component of the system which recognizes them is called the "scanner".

Examples

The identifiers in a programming language (strings of letters and digits starting with a letter). The corresponding expression is
(a + b + ... + z)(a + b + ... + z + 0 + 1 + ... + 9)^*
The representation of natural numbers (either 0, or a string of digits starting with a digit different from 0). The corresponding expression is
0 + (1 + ... + 9)(0 + 1 + ... + 9)^*
The representation of (approximations of) non-negative real numbers (an integer part, followed by a dot, followed by a possibly empty sequence of digits). The corresponding expression, in which the dot is a symbol of the alphabet, is
(0 + (1 + ... + 9)(0 + 1 + ... + 9)^*).(0 + 1 + ... + 9)^*
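The three expressions above can be tried out with Python's re module. The transcription below is a sketch: character classes such as [a-z] abbreviate a + b + ... + z, the literal dot is escaped as \., and the names TOKENS and classify are hypothetical.

```python
import re

# One pattern per token class, transcribed from the expressions above.
TOKENS = {
    "identifier": re.compile(r"[a-z][a-z0-9]*"),
    "natural": re.compile(r"0|[1-9][0-9]*"),
    "real": re.compile(r"(0|[1-9][0-9]*)\.[0-9]*"),
}

def classify(s):
    """Return the names of the token classes whose expression matches all of s."""
    return [name for name, pattern in TOKENS.items() if pattern.fullmatch(s)]

print(classify("x1"))    # ['identifier']
print(classify("42"))    # ['natural']
print(classify("3.14"))  # ['real']
print(classify("007"))   # []
```

Since the fractional part is under a star, a string such as 12. is accepted as a real number by this expression.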

Language recognition

Given the formal specification of a language, one of the main questions that arises is how to tell whether or not a certain string belongs to the language. And, of course, we want to do it automatically.

This concept is fundamental in computer science: when we write a program, the first thing we want the computer to check is whether the program is syntactically correct. This is nothing other than the problem of recognizing whether our program (a string of symbols) belongs to the language specified by the syntax.

Abstract machine for language recognition

A device for recognizing strings in a given language will, in general, read the string one symbol at a time, and will make a decision (reject/accept/go on) depending on

the current symbol, and
the symbols previously examined.

Finite Automata

In the above description of an abstract machine for language recognition, it will turn out that, if the language is regular, the amount of memory necessary for point 2 (the symbols previously examined) is bounded, i.e. it does not depend on the string to be recognized. (It depends, of course, on the language. Note that each machine corresponds to a particular language.) Bounded memory is the characteristic of a class of machines called Finite Automata.

Example. An automaton for recognizing strings on {a,b} with odd length. Note that at each step, the only information we need to remember about the portion of the string already examined is whether its length is odd or even.
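A minimal sketch of such an automaton, written as a transition table, is given below. The state names, the table DELTA and the function accepts are illustrative choices, not notation from the lecture. The memory consists of a single variable with two possible values, whatever the length of the input.

```python
# States of the automaton: the parity of the number of symbols read so far.
EVEN, ODD = "even", "odd"

# Transition table: reading either symbol flips the parity.
DELTA = {
    (EVEN, "a"): ODD, (EVEN, "b"): ODD,
    (ODD, "a"): EVEN, (ODD, "b"): EVEN,
}

def accepts(s):
    """Read s one symbol at a time and accept if its length is odd."""
    state = EVEN                      # nothing read yet: length 0 is even
    for symbol in s:
        if (state, symbol) not in DELTA:
            return False              # symbol outside {a, b}: reject
        state = DELTA[(state, symbol)]
    return state == ODD               # accept only when ending in the odd state

print(accepts("aba"))  # True
print(accepts("ab"))   # False
```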