Fall 2000, CSE 468: Lecture 6 (Sep 8)

Regular Languages

In the previous lecture we defined the regular languages as all those languages that can be represented by regular expressions.

Regular languages are important in computer science because they are particularly simple to define and to reason about (by using algebraic properties of regular expressions), they are easy to recognize (as we will see later), and interesting parts of a programming language can be expressed as regular languages. Typically, the tokens of a programming language form a regular language. The component of the system which recognizes them is called the "scanner".

Regular expressions are also used in search engines, and many systems (Unix, for instance) accept them as parameters of commands.

Examples

The identifiers in a programming language (strings of letters or digits starting with a letter). The corresponding expression is
(a + b + ... + z)(a + b + ... + z + 0 + 1 + ... + 9)^*
The representation of natural numbers (0, or a string of digits starting with a symbol different from 0). The corresponding expression is
0 + (1 + ... + 9)(0 + 1 + ... + 9)^*
The representation of non-negative (approximations of) real numbers (integer part, followed by a dot, followed by a possibly empty sequence of digits). The corresponding expression is
(0 + (1 + ... + 9)(0 + 1 + ... + 9)^*).(0 + 1 + ... + 9)^*
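
The three expressions above can be tried out with Python's `re` module. This is only a sketch of one possible translation: in `re` syntax the union `+` is written `|`, a character class such as `[a-z]` abbreviates a + b + ... + z, and the dot must be escaped as `\.` because an unescaped `.` means "any character". The names `IDENTIFIER`, `NATURAL`, `REAL` and `matches` are chosen here for illustration, and the alphabet is assumed to be lowercase letters and digits, as in the expressions.

```python
import re

# (a + b + ... + z)(a + b + ... + z + 0 + 1 + ... + 9)^*
IDENTIFIER = re.compile(r"[a-z][a-z0-9]*")

# 0 + (1 + ... + 9)(0 + 1 + ... + 9)^*
NATURAL = re.compile(r"0|[1-9][0-9]*")

# (0 + (1 + ... + 9)(0 + 1 + ... + 9)^*).(0 + 1 + ... + 9)^*
REAL = re.compile(r"(0|[1-9][0-9]*)\.[0-9]*")


def matches(pattern, string):
    """Return True if the WHOLE string belongs to the language of pattern."""
    # fullmatch is used (rather than match or search) because membership
    # in a language is a property of the entire string, not of a prefix.
    return pattern.fullmatch(string) is not None


if __name__ == "__main__":
    for s in ["x1", "1x", "0", "007", "3.14", "3.", ".5"]:
        print(repr(s), matches(IDENTIFIER, s), matches(NATURAL, s), matches(REAL, s))
```

Note that "007" is rejected as a natural number (a representation may not start with 0 unless it is 0 itself), and "3." is accepted as a real number because the sequence of digits after the dot may be empty.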

Language recognition

Given the formal specification of a language, one of the main questions that arises is how to tell whether or not a certain string belongs to the language. And of course, we want to do it automatically.

This concept is fundamental in computer science: when we write a program, the first thing we want the computer to check is whether the program is syntactically correct. This is nothing other than the problem of recognizing whether our program (a string of symbols) belongs to the language specified by the syntax.

Abstract machine for language recognition

A device for recognizing strings in a given language will, in general, take as input one symbol of the string at a time, and will make a decision (reject/accept/go on) depending on

the current symbol, and
the symbols previously examined.
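
The description above can be sketched in Python as follows. This is an illustration under stated assumptions, not a definition from the lecture: the names `run` and `identifier_decide` are hypothetical, the machine remembers all previously examined symbols in a list, and an end marker `$` (assumed not to occur in the input) is appended so that the machine knows when the string is finished. The decision function shown recognizes the identifiers of the earlier example.

```python
import string

ACCEPT, REJECT, GO_ON = "accept", "reject", "go on"
END = "$"  # assumed end-of-input marker, not part of the alphabet


def run(decide, text):
    """Feed text to the machine one symbol at a time.

    decide(symbol, seen) returns ACCEPT, REJECT or GO_ON, based on
    the current symbol and the list of symbols previously examined.
    """
    seen = []
    for symbol in text + END:
        verdict = decide(symbol, seen)
        if verdict != GO_ON:
            return verdict == ACCEPT
        seen.append(symbol)
    return False  # not reached if decide always answers on END


def identifier_decide(symbol, seen):
    """Decision function for identifiers: a letter, then letters or digits."""
    if symbol == END:
        # The empty string is not an identifier.
        return ACCEPT if seen else REJECT
    if not seen:
        # First symbol: must be a letter.
        return GO_ON if symbol in string.ascii_lowercase else REJECT
    # Later symbols: letters or digits.
    allowed = string.ascii_lowercase + string.digits
    return GO_ON if symbol in allowed else REJECT


if __name__ == "__main__":
    for s in ["x1", "1x", ""]:
        print(repr(s), run(identifier_decide, s))
```

Although `run` stores every symbol it has examined, `identifier_decide` only ever asks whether `seen` is empty, so a single bit of memory would be enough for this language.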

Finite Automata

In the above description of an abstract machine for language recognition, it will turn out that, if the language is regular, the amount of memory necessary for point 2 (the symbols previously examined) is bounded, i.e. the size of the memory does not depend on the string to be recognized. (It does depend, of course, on the language. Note that each machine corresponds to a particular language.) Bounded memory is the characteristic of a class of machines called Finite Automata.

Example. Consider an automaton for recognizing the strings on {a,b} of even length. At each step, the only information we need to remember about the portion of the string already examined is whether its length is odd or even.
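
A minimal Python sketch of this automaton is given below. The function name `accepts_even_length` is chosen here for illustration. The whole memory of the machine is the single boolean `even`, whatever the length of the input.

```python
def accepts_even_length(text):
    """Recognize the strings on {a, b} that have even length."""
    even = True  # the empty prefix has length 0, which is even
    for symbol in text:
        if symbol not in "ab":
            return False  # symbol outside the alphabet: reject
        even = not even  # each symbol read flips the parity
    return even  # accept exactly when the total length is even


if __name__ == "__main__":
    for s in ["", "a", "ab", "aba", "abab"]:
        print(repr(s), accepts_even_length(s))
```

The two possible values of `even` play the role of the two states of the automaton, and the empty string is accepted because its length, 0, is even.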