Yann Ponty, CNRS researcher@LIX, Ecole Polytechnique

2009-2010-CASM-M2 BIBS

J'interviens au sein du Master Bioinformatique et Biostatistiques de l'Université Paris-Sud 11 (BIBS), dans le cadre de la matière Combinatoire, Algorithmes, Séquences et Modélisation (CASM). Les cours des 1er et 8 Décembre 2009 ont consisté en une présentation assez intense des approches par programmation dynamique développées autours du repliement de l'ARN, chaque séance étant suivie d'un mini-TP.

Vous trouverez sur cette page :

Les modalités de contrôle continu
Les articles/projets proposés pour l'examen de contrôle continu
Les slides et énoncés de TP

Modalités de contrôle continu

Le contrôle continu de CASM comporte :

Examen écrit le 19 Janvier à 14h, bat 470 s1, portant sur l'ensemble du programme, et comptant pour 2/3 de la note finale.
Voici le sujet [pdf] et la correction de l'exercice 3 [pdf]
Examen d'article ou projet comptant pour le 1/3 restant de la note finale.
Déroulement : Il est possible de choisir soit un projet, soit une lecture d'article. Dans les deux cas, un court rapport (+/- 5 pages) devra être remis une semaine avant une soutenance collective (15 min. chacun), prévue pour la semaine du 15 Février (Salle et date à préciser).

Projets

Les projets pourront être réalisés en monomes ou binomes. Chaque projet ne pourra être choisi qu'une seule fois, aussi veuillez vous manifester le plus tôt possible si vous souhaitez traiter l'un des projets.

[Dispo] - Le premier sujet de projet porte sur la recherche de petits ARN de structure connue dans un génome. L'originalité de ce projet réside dans l'incorporation à la stratégie (programmation dynamique simple) de recherche d'un critère d'isostérie. Lire le sujet [PDF]

[Pris-Chotard-Hieu] - Le deuxième sujet porte sur l'implémentation d'un algorithme de repliement avec pseudo-noeuds simples, l'algorithme d'Akutsu. Lire le sujet [PDF]

Liste des articles proposés

La lecture d'article sera effectuée et soutenue en monome. Chaque article ne pourra être choisi que par une personne au plus. Veuillez me contacter dès que vous avez choisi un article, que je puisse le retirer de la liste des articles disponibles.

[Pris-Weiman] D. Aldous
Probability Distributions on Cladograms
Random Discrete Structures, Vol. Math. Appl. 76 ,pp. 1--18, 1996 [More...] [Less...]

Article fondateur sur l'analyse des arbres phylogénétiques aléatoires.

Abstract By analogy with the theory surrounding the Ewens sampling formula in neutral population genetics, we ask whether there exists a natural one parameter family of probability distributions on cladograms ("evolutionary trees") which plays a central role in neutral evolutionary theory. Unfortunately the answer seems to be "no"-- see Conjecture 2. But we can embed the two most popular models into an interesting family which we call "beta-splitting" models. We briefly describe some mathematical results about this family, which exhibits qualitatively different behavior for different ranges of the parameter Beta. [Less...]
[Dispo] R. Backofen, D. Tsur, S. Zakov et M. Ziv-Ukelson
Sparse RNA Folding: Time and Space Efficient Algorithms
Proc. 20th Symp. Combinatorial Pattern Matching, LNCS, vol.5577, pp. 249-262, 2009 [More...] [Less...]

Optimisation récente et spectaculaire de la complexité empirique du repliement par minimisation de l'énergie libre.

Abstract The classical algorithm for RNA single strand folding requires O(nZ) time and O(n2 ) space, where n denotes the length of the input sequence and Z is a sparsity parameter that satisfies n <= Z <= n^2 . We show how to reduce the space complexity of this algorithm. The space reduction is based on the observation that some solutions for subproblems are not examined after a certain stage of the algorithm, and may be discarded from memory. This yields an O(nZ) time and O(Z) space algorithm, that outputs both the cardinality of the optimal folding as well as a corresponding secondary structure. The space-efficient approach also extends to the related RNA simultaneous alignment with folding problem, and can be applied to reduce the space complexity of the fastest algorithm for this problem from O(n^2 m^2 ) down to O(nm^2 + Z), where n and m denote the lengths of the input sequences to be aligned, and Z is a sparsity parameter that satisfies nm <= Z <= n^2 m^2 . In addition, we also show how to speed up the base-pairing maximization variant of RNA single strand folding. The speed up is achieved by combining two independent existing techniques, which restrict the number of expressions that need to be examined in bottleneck computations of these algorithms. This yields an O(LZ) time and O(Z) space algorithm, where L denotes the maximum cardinality of a folding of the input sequence. [Less...]
[Pris-Sedano] B. Behzadi et J.-M. Steyaert
The Minisatellite Transformation Problem Revisited: A Run Length Encoded Approach
Proceedings of WABI'04, pp 290-301, 2004 [More...] [Less...]

La lecture de cette article sera complétée par celle de sa version longue, parue récemment :
M. I. Abouelhoda, R. Giegerich, B. Behzadi et J.-M. Steyaert
Alignment of Minisatellite Maps Based on Run-Length Encoding Scheme
J. Bioinformatics and Computational Biology 7(2): 287-308 2009

Abstract In this paper we present a more efficient algorithm for comparison of minisatellites which has complexity O(n'^3+ m'^3 + mn'^2+ nm'^2 +mn) where n and m are the lengths of the maps and n' and m' are the sizes of run-length encoded maps. We show that this algorithm makes a significant improvement for the real biological data, dividing the computing time by a factor 30 on a significant set of data. [Less...]
[Pris-Soulat] A. Bergeron, J. Mixtacki et J. Stoye
On sorting by translocations
Journal of Computational Biology, vol. 13, num. 2, pp. 567--578, 2006 [More...] [Less...]

La lecture de cet article prolonge la presentation faite du tri par reversion.Une analyse des configurations faciles ou difficiles y est réalisée,qui simplifie considérablement les présentations des articles précédents,bases sur des notions hermétiques de forteresses, épingles, etc...Un algorithme intéressant en découle.

Abstract The study of genome rearrangements is an important tool in comparative genomics. This paper revisits the problem of sorting a multichromosomal genome by translocations, i.e., exchanges of chromosome ends. We give an elementary proof of the formula for computing the translocation distance in linear time, and we give a new algorithm for sorting by translocations, correcting an error in a previous algorithm by Hannenhalli. [Less...]
[Dispo] V. Boeva, J. Clément, M. Régnier, M. A Roytberg et V. J Makeev
Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules
Algorithms for Molecular Biology, vol. 2, num. 13, 2007 [More...] [Less...]

Approfondi et complète la présentation faite en cours par M. Regnier

Abstract
Background
cis-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. The phenomenon that binding sites form clusters in CRMs is exploited in many algorithms to locate CRMs in a genome. This gives rise to the problem of calculating the statistical significance of the event that multiple sites, recognized by different factors, would be found simultaneously in a text of a fixed length. The main difficulty comes from overlapping occurrences of motifs. So far, no tools have been developed allowing the computation of p-values for simultaneous occurrences of different motifs which can overlap.
Results
We developed and implemented an algorithm computing the p-value that s different motifs occur respectively k1, ..., ks or more times, possibly overlapping, in a random text. Motifs can be represented with a majority of popular motif models, but in all cases, without indels. Zero or first order Markov chains can be adopted as a model for the random text. The computational tool was tested on the set of cis-regulatory modules involved in D. melanogaster early development, for which there exists an annotation of binding sites for transcription factors. Our test allowed us to correctly identify transcription factors cooperatively/competitively binding to DNA.
Method
The algorithm that precisely computes the probability of simultaneous motif occurrences is inspired by the Aho-Corasick automaton and employs a prefix tree together with a transition function. The algorithm runs with the O(n|A|(m|H| + K|D|^K) \prod k_i) time complexity, where n is the length of the text, |A| is the alphabet size, m is the maximal motif length, |H| is the total number of words in motifs, K is the order of Markov model, and k_i is the number of occurrences of the i-th motif.
Conclusion
The primary objective of the program is to assess the likelihood that a given DNA segment is CRM regulated with a known set of regulatory factors. In addition, the program can also be used to select the appropriate threshold for PWM scanning. Another application is assessing similarity of different motifs.
Availability
Project web page, stand-alone version and documentation can be found at http://bioinform.genetika.ru/AhoPro/ [Less...]
[Pris-Bourlet] Y. E. Ding, C. Y. U. Chan et C. E. Lawrence
RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble
RNA, Vol. 11, No. 8., pp. 1157-1166, 2005 [More...] [Less...]

Mise en pratique des méthodes d'échantillonnage statistique dans l'ensemble de Boltzmann pour une amélioration de la précision du repliement in silico.

Abstract Prediction of RNA secondary structure by free energy minimization has been the standard for over two decades. Here we describe a novel method that forsakes this paradigm for predictions based on Boltzmann-weighted structure ensemble. We introduce the notion of a centroid structure as a representative for a set of structures and describe a procedure for its identification. In comparison with the minimum free energy (MFE) structure using diverse types of structural RNAs, the centroid of the ensemble makes 30.0% fewer prediction errors as measured by the positive predictive value (PPV) with marginally improved sensitivity. The Boltzmann ensemble can be separated into a small number (3.2 on average) of clusters. Among the centroids of these clusters, the “best cluster centroid” as determined by comparison to the known structure simultaneously improves PPV by 46.5% and sensitivity by 21.7%. For 58% of the studied sequences for which the MFE structure is outside the cluster containing the best centroid, the improvements by the best centroid are 62.5% for PPV and 31.4% for sensitivity. These results suggest that the energy well containing the MFE structure under the current incomplete energy model is often different from the one for the unavailable complete model that presumably contains the unique native structure. Centroids are available on the Sfold server at http://sfold.wadsworth.org. [Less...]
[Dispo] R. Giegerich et S. Kurtz
From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction
Algorithmica, vol 19, num 3, pp. 331--353, 1997 [More...] [Less...]

Article de synthèse sur les construction d'arbres suffixes et leur application au pattern-matching.

Abstract We review the linear time suffix tree constructions by Weiner, McCreight, and Ukkonen.We use the terminology of the most recent algorithm, Ukkonen's online construction, to explainits historic predecessors. This reveals relationships much closer than one would expect, since thethree algorithms are based on rather different intuitive ideas. Moreover, it completely explainsthe differences between these algorithms in terms of simplicity, efficiency, and implementationcomplexity. [Less...]
[Dispo] R. Lorenz, C. Flamm et I. L. Hofacker
2D Projections of RNA folding Landscapes
GCB'09, vol. 157 of Lecture Notes in Informatics, pp. 11-20, Bonn, 2009 [More...] [Less...]

Heuristique pour le calcul (NP-complet) des barrières énergétiques, permettant la mise en oeuvre d'une approche évitant la simulation pour l'étude des paysages cinétiques.

Abstract The analysis of RNA folding landscapes yields insights into the kinetic foldingbehavior not available from classical structure prediction methods. This is especiallyimportant for multi-stable RNAs whose function is related to structural changes,as in the case of riboswitches. However, exact methods such as barrier tree analysisscale exponentially with sequence length. Here we present an algorithm that computesa projection of the energy landscape into two dimensions, namely the distancesto two reference structures. This yields an abstraction of the high-dimensional energylandscape that can be conveniently visualized, and can serve as the basis for estimatingenergy barriers and refolding pathways. With an asymptotic time complexity of O(n^7)the algorithm is computationally demanding. However, by exploiting the sparsity ofthe dynamic programming matrices and parallelization for multi-core processors, ourimplementation is practical for sequences of up to 400 nt, which includes most RNAsof biological interest. [Less...]
[Pris-Luo] I. Miklós, I. M. Meyer et B. Nagy
Moments of the Boltzmann distribution for RNA secondary structures
Bulletin of mathematical biology, vol. 67, no 5, pp. 1031-1047, 2005 [More...] [Less...]

Extraction de moments d'ordres supérieurs pour des paramètres additifs (Énergie, nombre de paires de bases...) dans la distribution de Boltzmann, fondée sur une vision combinatoire de l'ensemble des conformations.

Abstract We here present a dynamic programming algorithm which is capable of calculating arbitrary moments of the Boltzmann distribution for RNA secondary structures. We have implemented the algorithm in a program called RNA-VARIANCE and investigate the difference between the Boltzmann distribution of biological and random RNA sequences. We find that the minimum free energy structure of biological sequences has a higher probability in the Boltzmann distribution than random sequences. Moreover, we show that the free energies of biological sequences have a smaller variance than random sequences and that the minimum free energy of biological sequences is closer to the expected free energy of the rest of the structures than that of random sequences. These results suggest that biologically functional RNA sequences not only require a thermodynamically stable minimum free energy structure, but also an ensemble of structures whose free energies are close to the minimum free energy. [Less...]
[Dispo] A. O. Mooers et S. B. Heard
Inferring Evolutionary Process from Phylogenetic Tree Shape
The Quarterly Review of Biology, Vol. 72, No. 1, pp. 31-54, 1997 [More...] [Less...]

Article de revue sur les interprétations biologiques d'observations statistiques sur les arbres phylogénétique, ou Comment les mathématiques des arbres aléatoire aident à reconstituer l'histoire du vivant ?

Abstract Inference about macroevolutionary processes traditionally depended solely on the fossil record, but such inferences can be strengthened by also considering the shapes of the phylogenetic trees that link extant taxa. The realization that phylogenies reflect macroevolutionary processes has led to growing literature of theoretical and comparative studies of tree shape. Two aspects of tree shape are particularly important: tree balance and the distribution of branch lengths. We examine and evaluate recent developments in and connections between these two aspects, and suggest directions for future research. Studies of tree shape promise useful and powerful tests of macroevolutionary hypotheses. With appropriate further research, tree shapes may help us detect mass extinctions and adaptive radiations, measure continuous variation in speciation and extinction rates, and associate changes in these rates with ecological or biogeographical causes. The usefulness of tree shape extends well beyond the study of macroevolution. We discuss applications to other areas of biology, including coevolution, phylogenetic inference, population biology, and developmental biology. [Less...]
[Dispo] A. Xayaphoummine, T. Bucher, F. Thalmann et H. Isambert
Prediction and statistics of pseudoknots in RNA structures using exactly clustered stochastic simulations
PNAS, Vol. 100, No. 26., pp. 15310-15315, 2003 [More...] [Less...]

Approche alternative - par simulation - du repliement de l'ARN, accélérée par une jolie astuce algorithmique. Permet la modélisation "hors équilibre", i.e. les phénomènes cinétiques (Repliement co-transcriptionnel, ), ainsi que celle des pseudo-noeuds, au prix d'une perte d'exhaustivité.

Abstract Ab initio RNA secondary structure predictions have long dismissed helices interior to loops, so-called pseudoknots, despite their structuralimportance. Here we report that many pseudoknots can be predicted through long-time-scale RNA-folding simulations, which follow the stochastic closing and opening of individual RNA helices.The numerical efficacy of these stochastic simulations relies on an O(n^2) clustering algorithm that computes time averages over a continuously updated set of n reference structures. Applying thisexact stochastic clustering approach, we typically obtain a 5- to 100-fold simulation speed-up for RNA sequences up to 400 bases, while the effective acceleration can be as high as 105-fold for short,multistable molecules (<150 bases). We performed extensive folding statistics on random and natural RNA sequences and found that pseudoknots are distributed unevenly among RNA structures andaccount for up to 30% of base pairs in G+C-rich RNA sequences (online RNA-folding kinetics server including pseudoknots: http://kinefold.u-strasbg.fr). [Less...]

Slides et énoncés de TP

Séance 1 : Structure(s) de l'ARN, programmation dynamique, repliement dans le modèle d'énergie de Nussinov/Jacobson.
[Slides cours] et [Énoncé TP+Rappels Python]
Séance 2 : Repliement dans le modèle de Turner, structures sous-optimales, ensemble de Boltzmann et échantillonnage statistique.
[Slides cours] et [Énoncé TP]