Fall 98, CSE 468:
Lecture 5 (Sep 4)
Regular Expressions and Regular Languages
Regular expressions
All the expressions constructed from
- lambda
- symbols of the alphabet
- + (binary)
- concatenation (binary)
- * (Kleene's star, unary)
We will use parentheses to represent the structure
of an expression, and we will assume that * has precedence
over concatenation, concatenation has precedence over +, and
that concatenation and + are associative.
Language represented by a regular expression
lambda (the empty string) stands for {lambda}, an alphabet symbol a stands for {a},
+ stands for union, and concatenation and * stand for the operations
of the same name on languages.
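The three operations on languages can be sketched in Python, with a language modelled as a set of strings. This is only an illustration: the function names are made up for this example, and since L* is usually infinite, star takes an extra (assumed) parameter max_len and returns only the strings of L* up to that length.

```python
def union(l1, l2):
    """Union of two languages (the meaning of +)."""
    return l1 | l2


def concat(l1, l2):
    """Concatenation: every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}


def star(lang, max_len):
    """Strings of lang* of length at most max_len (a finite slice of lang*)."""
    result = {""}            # lambda always belongs to lang*
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more string of lang,
        # keeping only those that are short enough and not seen before.
        frontier = {
            x + y for x in frontier for y in lang if len(x + y) <= max_len
        } - result
        result |= frontier
    return result
```

For instance, star({"ab"}, 4) gives the set containing the empty string, "ab" and "abab".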
Example. The set of strings on {a,b}
with odd length can be represented by the regular expression
(a + b)(aa + ab + ba + bb)*.
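As a quick check of this example, the expression can be transcribed into the syntax of Python's re module, where the + of these notes is written |. The names ODD and has_odd_length are chosen for this sketch only.

```python
import re

# (a + b)(aa + ab + ba + bb)* in Python's regular expression syntax.
ODD = re.compile(r"(a|b)(aa|ab|ba|bb)*")


def has_odd_length(s):
    """True if the whole string s is matched by the expression."""
    return ODD.fullmatch(s) is not None
```

fullmatch is used rather than match or search because the whole string, not just a part of it, must belong to the language.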
Regular Languages
The class of Regular Languages consists of the
languages represented by regular expressions.
These are important in computer science, because they are particularly
simple to recognize (as we will see later) and there are interesting
parts of a programming language that can be expressed as regular languages.
Typically, the tokens of a programming language form
a regular language. The component of the system that
recognizes them is called the "scanner".
Examples
- The identifiers in a programming language
(strings of letters or digits starting with a letter).
The corresponding expression is
(a + b + ... + z)(a + b + ... + z + 0 + 1 + ... + 9)*
- The representation of natural numbers
(0, or a string of digits starting with a symbol different from 0).
The corresponding expression is
0 + (1 + ... + 9)(0 + 1 + ... + 9)*
- The representation of non-negative (approximations of) real numbers
(integer part followed by a dot followed by a sequence of digits).
The corresponding expression is
(0 + (1 + ... + 9)(0 + 1 + ... + 9)*).(0 + 1 + ... + 9)*
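The three expressions above can be tried out with Python's re module. This is a minimal sketch, not a real scanner: the names IDENTIFIER, NATURAL, REAL and classify are invented for the illustration, + is written |, the ranges a + ... + z and 0 + ... + 9 are written as character classes, and the dot is escaped because in Python's syntax an unescaped dot matches any character.

```python
import re

# (a + b + ... + z)(a + b + ... + z + 0 + 1 + ... + 9)*
IDENTIFIER = re.compile(r"[a-z][a-z0-9]*")
# 0 + (1 + ... + 9)(0 + 1 + ... + 9)*
NATURAL = re.compile(r"0|[1-9][0-9]*")
# (0 + (1 + ... + 9)(0 + 1 + ... + 9)*).(0 + 1 + ... + 9)*
REAL = re.compile(r"(0|[1-9][0-9]*)\.[0-9]*")


def classify(token):
    """Return the kind of token, or None if no expression matches it."""
    for kind, pattern in (
        ("identifier", IDENTIFIER),
        ("natural", NATURAL),
        ("real", REAL),
    ):
        if pattern.fullmatch(token):
            return kind
    return None
```

Note that, because of the final *, the expression for real numbers as written also accepts a string such as "3." with no digits after the dot.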
Language recognition
Given the formal specification of a language,
one of the main questions that
arises is how to tell whether or not a given string
belongs to the language. And, of course,
we want to do it automatically.
This concept is fundamental in computer science:
When we write a program, the first thing we want
the computer to check is whether the program is syntactically
correct. This is nothing other than the problem of recognizing
whether our program (a string of symbols) belongs to the
language specified by the syntax.
Abstract machine for language recognition
A device for recognizing strings in a given language
will, in general, take
as input one symbol of the string at a time, and will make a decision
(reject/accept/go on) depending on
- the current symbol, and
- the symbols previously examined.
Finite Automata
In the above description of an abstract machine for
language recognition,
it will turn out that, if the language is regular,
the amount of memory necessary for the second point (the symbols previously examined) is
bounded. I.e. it does not depend on the string
to be recognized.
(But it depends, of course, on the language. Note that
each machine corresponds to a particular language.)
Bounded memory is the defining characteristic of a
class of machines called
Finite Automata.
Example. An automaton for recognizing
strings on {a,b} with odd length.
Note that at each step, the only information we need to remember, about the
portion of the string already examined, is whether its length is odd or even.
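The odd-length example can be simulated with a two-state machine. The sketch below is one possible encoding, assuming state names "even" and "odd" and a transition table stored as a Python dictionary; these names are choices made for this illustration.

```python
# The state records the only thing we need to remember:
# whether the length of the portion read so far is even or odd.
TRANSITIONS = {
    ("even", "a"): "odd",
    ("even", "b"): "odd",
    ("odd", "a"): "even",
    ("odd", "b"): "even",
}
START = "even"       # nothing read yet: length 0 is even
ACCEPTING = {"odd"}  # accept exactly the strings of odd length


def accepts(string):
    """Read the string one symbol at a time and report accept/reject."""
    state = START
    for symbol in string:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False  # symbol outside the alphabet {a,b}: reject
        state = TRANSITIONS[key]
    return state in ACCEPTING
```

The memory used is a single variable ranging over two values, whatever the length of the input string.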