Next: Multi-tape S-attributed grammars Up: A multi-tape S-attributed grammar's Previous: A multi-tape S-attributed grammar's

S-attributed context-free grammars

We recall the basic definitions of context-free grammars and of their natural extension to S-attributed grammars. These definitions are adapted to the fact that we use this formalism to model biological properties of sequences.

Definition 1 (Context-free grammar)
A context-free grammar

consists of a finite set of terminals

, a finite set of nonterminals

such that $V_T \cap V_N = \emptyset$ , a finite set of productions (rewriting rules) P and a start symbol $S \in V_N$ . Let $V = V_T \cup V_N$ denote the vocabulary for the grammar. Each production in P has the form $A \rightarrow \alpha$ , where $A \in V_N$ and $\alpha \in V ^{\star}$ . A is the left-hand side of the production and $\alpha$ its right-hand side.

The transitive closure of the derivation relation $\rightarrow$ is denoted $\stackrel {\star}{\rightarrow}$ . A derivation tree is the planar representation of a sequence of derivations $A \rightarrow \alpha$ such that $A \in V_N$ and $\alpha \in V ^{\star}$ . The set of strings in $V_T^{\star}$ derived from S is called the language generated by G and it is denoted by

. The empty string is denoted by $\varepsilon$ .
In order to keep the number of derivation trees finite for a given word $\omega \in L(G)$ , we assume that the grammar is non-circular, which means that no nonterminal

may verify $A \stackrel {+}{\rightarrow} A$ . We also assume that the grammar is epsilon-free (i.e. it has no rules of the form $A \rightarrow \varepsilon$ ). An ambiguous grammar is a grammar for which there exists a string of symbols having at least two different derivation trees. For example, the grammar whose derivation trees describe t-RNA secondary structures has to be ambiguous because a given RNA has potentially several different secondary structures. Parsing is the process of finding a derivation tree for a string in

, which is called the parse tree of the sequence.
S-attributed context-free grammars, which are a proper subset of attributed-grammars introduced by Knuth in his seminal paper [2], are an extension of context-free grammar allowing the assignment of a value (called attribute) to every vertex of a derivation tree.

Definition 2 (S-attributed grammar)
An S-attributed grammar, denoted by

$,{\cal A},S_{\cal A},F_P)$ , is an extension of the context-free grammar

; an attribute $x \in {\cal A}$ is attached to each symbol $X \in V$ , and a string of attributes $\lambda \in {\cal A}^{\star}$ to each string $\alpha \in V ^{\star}$ . $S_{\cal A}$ is a function from

to ${\cal A}$ assigning attributes to terminals.

is a set of functions from ${\cal A}^{\star}$ to ${\cal A}$ . A function $f_{A \rightarrow \alpha}$ is in

for every production $A \rightarrow \alpha$ in P.

The attribute of a string $\alpha \in V ^{\star}$ , denoted by $\lambda _{\alpha}$ , is the concatenation of the attributes of the symbols in $\alpha$ . When a function $f_{A \rightarrow \alpha}$ is applied to the attribute $\lambda _{\alpha}$ derived from A it returns the attribute $x = f_{A \rightarrow \alpha}$ of ${\cal A}$ . Thus, the functions of

determine the bottom-up computation of the attributes of nonterminals A in derivations $\alpha A \beta \stackrel {\star}{\rightarrow} u$ , where u must belong to

so that the attribute of A be computable.

Example 1 We consider the following ambiguous context-free grammar for arithmetical expressions.

$\begin{displaymath} \begin{array}{c} S \rightarrow S+S \\ S \rightarrow S \time... ...vert \; 6 \; \vert \; 7 \; \vert\; 8 \; \vert \; 9. \end{array}\end{displaymath}$

We extend it into an S-attributed grammar such that each attribute stores the value of the arithmetical expression corresponding to the derivation subtree rooted at this node.

$\displaystyle {\cal A} = \mathbb{N}$

$\begin{displaymath} \begin{array}{llll} S_{\cal A}(0)=0 \\ \vdots \\ S_{\cal A}(9)=9 \\ S_{\cal A}(+,\times)=0 \\ \end{array}\end{displaymath}$

$\begin{displaymath} F_P = \left\{ \begin{array}{lll} f_{S \rightarrow S+S}(x0z) ... ...\\ f_{S \rightarrow a \in V_T}(x) = x \\ \end{array}\right\} \end{displaymath}$

Thus, the sequence $8 \times 9+5$ can be parsed in different ways corresponding to different derivation trees, hence producing different values

$\begin{displaymath} \begin{array}{ccc} \par\setlength{\GapDepth}{1.5em}\setlengt... ... \end{bundle} } \end{bundle} } \par\end{bundle}\par\end{array}\end{displaymath}$

As previously noted, grammars used for biological sequence analysis are ambiguous because they aim at describing the space of all possible configurations. Attributes are designed to give an evaluation of the quality for each of them which gives a choice criterion. For example, if the attribute of a vertex is an energy or a probability, the criterion may be the selection of the derivation tree with the lowest energy or the highest probability at the root. (But attributes are not restricted to simple real values in general even if this paper deals mainly with this kind). The selection of the best possible value for an attribute is done by means of a function called optimization constraint.

Definition 3 (Optimization constraint)
Let G be an S-attributed grammar and let $a_1 \cdots a_n$ be a string of L(G). Let x be the attribute of a derivation $A \rightarrow \alpha _1 \rightarrow \cdots \rightarrow \alpha _k \rightarrow a_{i+1} \cdots a_j$ and let $x^{\prime}$ be the attribute of another derivation $A \rightarrow \alpha _1^{\prime} \rightarrow \cdots \rightarrow \alpha _k^{\prime} \rightarrow a_{i+1} \cdots a_j$ . The optimisation constraint ${\cal C}_{\cal A}$ associated with the nonterminal A takes as an input the attributes x and $x^{\prime}$ and returns another attribute $x^{\prime \prime}$ . We note $x^{\prime \prime} = {\cal C}_{\cal A}(x,x^{\prime})$ .

In practice, ${\cal C}_{\cal A}$ will be a comparison function between

and $x^{\prime}$ which returns the optimal attribute (the lowest energy, if attributes are free energy). Finally, we denote by $\lambda _{\omega}$ the optimal attribute of a string $\omega \in L(G)$ .

Next: Multi-tape S-attributed grammars Up: A multi-tape S-attributed grammar's Previous: A multi-tape S-attributed grammar's

Jerome Waldispuhl 2002-12-18