Yann Ponty, Chercheur CNRS@LIX, Ecole Polytechnique

2010-2011-CASM-M2 BIBS

J'interviens au sein du Master Bioinformatique et Biostatistiques de l'Université Paris-Sud 11 (BIBS), dans le cadre de la matière Combinatoire, Algorithmes, Séquences et Modélisation (CASM). Les cours des 6, 13 Décembre et 3 Janvier 2010 auront lieu dans la salle 003 du bâtiment 308, et consisteront en présentation de différentes approches par programmation dynamique développées autour du repliement de l'ARN. Chaque séance sera suivie d'un mini-TP, où on s'attachera à mettre en pratique les principes évoqués.

Vous trouverez sur cette page :

Les modalités de contrôle continu
L'examen de fin d'année partiellement corrigé
Les articles proposés pour l'examen de contrôle continu
Les slides et énoncés de TP

Modalités de contrôle continu

Le contrôle continu de CASM comporte :

Examen écrit, portant sur l'ensemble du programme, et comptant pour 2/3 de la note finale.
Voici les annales 2009-2010 [pdf] (+ correction partielle [pdf])
Examen d'article ou projet comptant pour le 1/3 restant de la note finale.
Déroulement : Il sera possible de choisir un projet ou une lecture d'article. Dans les deux cas, un court rapport (+/- 5 pages) devra être remis au moins une semaine avant une soutenance collective (15 min. chacun).

Examen écrit

Vous pouvez télécharger ici une version partiellement corrigée de l'examen écrit.

Liste des articles proposés

La lecture d'article sera effectuée et soutenue en monome. Chaque article ne pourra être choisi que par une personne au plus. Veuillez me contacter dès que vous avez choisi un article, que je puisse le retirer de la liste des articles disponibles.

[Pirs-Chaumier] Nagarajan N, Cook C, Di Bonaventura M, Ge H, Richards A, Bishop-Lilly KA, DeSalle R, Read TD and Pop M
Finishing genomes with limited resources: lessons from an ensemble of microbial genomes
BMC Genomics, 11:242, 2010 [Article] [More...] [Less...]

Abstract While new sequencing technologies have ushered in an era where microbial genomes can be easily sequenced, thegoal of routinely producing high-quality draft and finished genomes in a cost-effective fashion has still remainedelusive. Due to shorter read lengths and limitations in library construction protocols, shotgun sequencing andassembly based on these technologies often results in fragmented assemblies. Correspondingly, while draft assembliescan be obtained in days, finishing can take many months and hence the time and effort can only be justified for highprioritygenomes and in large sequencing centers. In this work, we revisit this issue in light of our own experience inproducing finished and nearly-finished genomes for a range of microbial species in a small-lab setting. These genomeswere finished with surprisingly little investments in terms of time, computational effort and lab work, suggesting thatthe increased access to sequencing might also eventually lead to a greater proportion of finished genomes from smalllabs and genomics cores. [Less...]
[Pris-Bouaziz] Otto TD, Sanders M, Berriman M and Newbold C
Iterative Correction of Reference Nucleotides (iCORN) using second generation sequencing technology
Bioinformatics, vol. 26 no. 14 , pp. 1704–1707, 2010 [Article] [More...] [Less...]

Abstract Motivation: The accuracy of reference genomes is important fordownstream analysis but a low error rate requires expensive manualinterrogation of the sequence. Here, we describe a novel algorithm(Iterative Correction of Reference Nucleotides) that iteratively alignsdeep coverage of short sequencing reads to correct errors inreference genome sequences and evaluate their accuracy.Results: Using Plasmodium falciparum (81% A + T content) as anextreme example, we show that the algorithm is highly accurateand corrects over 2000 errors in the reference sequence. Wegive examples of its application to numerous other eukaryotic andprokaryotic genomes and suggest additional applications.Availability: The software is available at http://icorn.sourceforge.net [Less...]
[Pris-Maurel] Chevreux B, Pfisterer T, Drescher B, Driesel AJ, Müller WEG, Wetter T and Suhai S
Using the miraEST Assembler for Reliable and Automated mRNA Transcript Assembly and SNP Detection in Sequenced ESTs
Genome Res. 14: 1147-1159 2004 [Article] [More...] [Less...]

Abstract We present an EST sequence assembler that specializes in reconstruction of pristine mRNA transcripts, while at thesame time detecting and classifying single nucleotide polymorphisms (SNPs) occuring in different variations thereof.The assembler uses iterative multipass strategies centered on high-confidence regions within sequences and has afallback strategy for using low-confidence regions when needed. It features special functions to assemble highnumbers of highly similar sequences without prior masking, an automatic editor that edits and analyzes alignmentsby inspecting the underlying traces, and detection and classification of sequence properties like SNPs with a highspecificity and a sensitivity down to one mutation per sequence. In addition, it includes possibilities to useincorrectly preprocessed sequences, routines to make use of additional sequencing information such as base-errorprobabilities, template insert sizes, strain information, etc., and functions to detect and resolve possible misassemblies.The assembler is routinely used for such various tasks as mutation detection in different cell types, similarity analysisof transcripts between organisms, and pristine assembly of sequences from various sources for oligo design in clinicalmicroarray experiments. [Less...]
[Pris-Blouin] Richter DC, Ott F, Auch AF, Schmid R, Huson DH
MetaSim—A Sequencing Simulator for Genomics and Metagenomics
PLos One, 3(10): e3373, 2008 [Article] [More...] [Less...]

Abstract Background: The new research field of metagenomics is providing exciting insights into various, previously unclassifiedecological systems. Next-generation sequencing technologies are producing a rapid increase of environmental data inpublic databases. There is great need for specialized software solutions and statistical methods for dealing with complexmetagenome data sets.Methodology/Principal Findings: To facilitate the development and improvement of metagenomic tools and the planningof metagenomic projects, we introduce a sequencing simulator called MetaSim. Our software can be used to generatecollections of synthetic reads that reflect the diverse taxonomical composition of typical metagenome data sets. Based on adatabase of given genomes, the program allows the user to design a metagenome by specifying the number of genomespresent at different levels of the NCBI taxonomy, and then to collect reads from the metagenome using a simulation of anumber of different sequencing technologies. A population sampler optionally produces evolved sequences based onsource genomes and a given evolutionary tree.Conclusions/Significance: MetaSim allows the user to simulate individual read datasets that can be used as standardizedtest scenarios for planning sequencing projects or for benchmarking metagenomic software. [Less...]
[Pris-Kowal] Kulakovskiy IV, Boeva VA, Favorov AV and Makeev VJ
Deep and wide digging for binding motifs in ChIP-Seq data
Bioinformatics, vol. 26 no. 20, pp 2622–2623, 2010 [Article] [More...] [Less...]

Abstract Summary: ChIP-Seq data are a new challenge for motif discovery.Such a data typically consists of thousands of DNA segmentswith base-specific coverage values. We present a new version ofour DNA motif discovery software ChIPMunk adapted for ChIPSeqdata. ChIPMunk is an iterative algorithm that combines greedyoptimization with bootstrapping and uses coverage profiles asmotif positional preferences. ChIPMunk does not require truncationof long DNA segments and it is practical for processing up totens of thousands of data sequences. Comparison with traditional(MEME) or ChIP-Seq-oriented (HMS) motif discovery tools showsthat ChIPMunk identifies the correct motifs with the same or betterquality but works dramatically faster.Availability and implementation: ChIPMunk is freely availablewithin the ru_genetika Java package: http://line.imb.ac.ru/ChIPMunk. Web-based version is also available. [Less...]
[Pris-Cattan] Touzet H and Varré JS
Efficient and accurate P-value computation for Position Weight Matrices
Algorithms for Molecular Biology, 2:15 2007 [Article] [More...] [Less...]

Abstract Background: Position Weight Matrices (PWMs) are probabilistic representations of signals insequences. They are widely used to model approximate patterns in DNA or in protein sequences.The usage of PWMs needs as a prerequisite to knowing the statistical significance of a wordaccording to its score. This is done by defining the P-value of a score, which is the probability thatthe background model can achieve a score larger than or equal to the observed value. This givesrise to the following problem: Given a P-value, find the corresponding score threshold. Existingmethods rely on dynamic programming or probability generating functions. For many examples ofPWMs, they fail to give accurate results in a reasonable amount of time.Results: The contribution of this paper is two fold. First, we study the theoretical complexity ofthe problem, and we prove that it is NP-hard. Then, we describe a novel algorithm that solves theP-value problem efficiently. The main idea is to use a series of discretized score distributions thatimproves the final result step by step until some convergence criterion is met. Moreover, thealgorithm is capable of calculating the exact P-value without any error, even for matrices with nonintegercoefficient values. The same approach is also used to devise an accurate algorithm for thereverse problem: finding the P-value for a given score. Both methods are implemented in a softwarecalled TFM-PVALUE, that is freely available.Conclusion: We have tested TFM-PVALUE on a large set of PWMs representing transcriptionfactor binding sites. Experimental results show that it achieves better performance in terms ofcomputational time and precision than existing tools. [Less...]
[Pris-Galopin] Möhl M, Salari R, Will S, Backofen R, Sahinalp SC
Sparsification of RNA structure prediction including pseudoknots
Algorithms Mol Biol.;5(1):39, 2010. [Article] [More...] [Less...]

Transposition à l'algorithmique du repliement avec pseudo-noeud d'une optimisation générique spectaculaire, la sparsification.

Abstract
BACKGROUND: Although many RNA molecules contain pseudoknots, computational prediction of pseudoknotted RNA structure is still in its infancy due to high running time and space consumption implied by the dynamic programming formulations of the problem.
RESULTS: In this paper, we introduce sparsification to significantly speedup the dynamic programming approaches for pseudoknotted RNA structure prediction, which also lower the space requirements. Although sparsification has been applied to a number of RNA-related structure prediction problems in the past few years, we provide the first application of sparsification to pseudoknotted RNA structure prediction specifically and to handling gapped fragments more generally - which has a much more complex recursive structure than other problems to which sparsification has been applied. We analyse how to sparsify four pseudoknot structure prediction algorithms, among those the most general method available (the Rivas-Eddy algorithm) and the fastest one (Reeder-Giegerich algorithm). In all algorithms the number of "candidate" substructures to be considered is reduced.
CONCLUSIONS: Our experimental results on the sparsified Reeder-Giegerich algorithm suggest a linear speedup over the unsparsified implementation. [Less...]
[Pris-Desdouits] Mathews, D.M., Sabina, J., Zuker, M., and Turner, D.H.
Expanded Sequence Dependence of Thermodynamic Parameters Improves Prediction of RNA Secondary Structure
Journal of Molecular Biology, 288(5), pp. 911-940, 1999 [Article] [More...] [Less...]

État des lieux de la performance de MFold et incorporation de contraintes molles préfigurant l'incorporation de données expérimentales pour une détermination structurale haut-débit.

Abstract An improved dynamic programming algorithm is reported for RNAsecondary structure prediction by free energy minimization. Thermodynamicparameters for the stabilities of secondary structure motifs arerevised to include expanded sequence dependence as revealed by recentexperiments. Additional algorithmic improvements include reducedsearch time and storage for multibranch loop free energies and improvedimposition of folding constraints. An extended database of 151,503 nt in955 structures? determined by comparative sequence analysis wasassembled to allow optimization of parameters not based on experimentsand to test the accuracy of the algorithm. On average, the predicted lowestfree energy structure contains 73% of known base-pairs whendomains of fewer than 700 nt are folded; this compares with 64% accuracyfor previous versions of the algorithm and parameters. For a givensequence, a set of 750 generated structures contains one structure that, onaverage, has 86% of known base-pairs. Experimental constraints, derivedfrom enzymatic and ¯avin mononucleotide cleavage, improve the accuracyof structure predictions. [Less...]
[Pris-Traore] Lyngsø, R.B. and Pedersen, C.N.S.
RNA Pseudoknot Prediction in Energy Based Models
Journal of Computational Biology, 7(3/4), pp. 409-428, 2000. [Article] [More...] [Less...]

Entre autres choses, preuve de la difficulté algorithmique du repliement des ARN avec pseudonoeuds généraux (sous un modèle d'énergie minimal)

Abstract RNA molecules are sequences of nucleotides that serve as more than mere intermediariesbetween DNA and proteins, e.g., as catalytic molecules. Computational prediction of RNAsecondary structure is among the few structure prediction problems that can be solved satisfactorilyin polynomial time. Most work has been done to predict structures that do notcontain pseudoknots. Allowing pseudoknots introduces modeling and computational problems.In this paper we consider the problem of predicting RNA secondary structures withpseudoknots based on free energy minimization. We give a brief comparison of energybasedmethods for predicting RNA secondary structures with pseudoknots. We then provethat the general problem of predicting RNA secondary structures containing pseudoknots isNP complete for a large class of reasonable models of pseudoknots. [Less...]
[Pris-Payen] Sankoff, D.
Simultaneous Solution of the RNA Folding, Alignment and Protosequence Problems
SIAM Journal on Applied Mathematics, 45(5), pp. 810-825, 1985 [Article] [More...] [Less...]

L'algorithme historique (toujours LA référence 25 ans plus tard !) permettant l'alignement et le repliement simultané de deux ARN.

Abstract The alignment of finite sequences, the inference of ribonucleic acid secondary structures (folding), and the reconstruction of ancestral sequences on a phylogenetic tree, are three problems which have dynamic programming solutions, which we formulate in a common mathematical framework. Combining the objective functions for alignment (parsimony, or minimal mutations) and folding (free energy), we present an algorithm which solves all three problems simultaneously for a set of $N$ sequences of length $n$ in time proportional to $n^{3N} $ and storage $n^{2N} $. Incorporating a "cutting corners" constraint against biologically unlikely alignments reduces these requirements so that they are proportional to $n^{3} $ and $n^{2} $, respectively, for fixed $N$. [Less...]
[Pris-Larsonneur] Underwood JG, Uzilov AV, Katzman S, Onodera CS, Mainzer JE, Mathews DH, Lowe TM, Salama SR and Haussler D
FragSeq: transcriptome-wide RNA structure probing using high-throughput sequencing
Nature Methods, 7, 995–1001, 2010 [Article] [More...] [Less...]

Abstract Classical approaches to determine structures of noncoding RNA (ncRNA) probed only one RNA at a time with enzymes and chemicals, using gel electrophoresis to identify reactive positions. To accelerate RNA structure inference, we developed fragmentation sequencing (FragSeq), a high-throughput RNA structure probing method that uses high-throughput RNA sequencing of fragments generated by digestion with nuclease P1, which specifically cleaves single-stranded nucleic acids. In experiments probing the entire mouse nuclear transcriptome, we accurately and simultaneously mapped single-stranded RNA regions in multiple ncRNAs with known structure. We probed in two cell types to verify reproducibility. We also identified and experimentally validated structured regions in ncRNAs with, to our knowledge, no previously reported probing data. [Less...]
[Pris-Sauteraud] Wilkinson KA, Gorelick RJ, Vasa SM, Guex N, Rein A, Mathews DH, Giddings MC, Weeks KM
High-Throughput SHAPE Analysis Reveals Structures in HIV-1 Genomic RNA Strongly Conserved across Distinct Biological States
PLoS Biol 6(4): e96, 2008 [Article] [More...] [Less...]

Incorporation de données expérimentales dans les algorithmes de repliement pour une prédiction grande échelle de la structures d'un génome ARN

Abstract Replication and pathogenesis of the human immunodeficiency virus (HIV) is tightly linked to the structure of its RNA genome, but genome structure in infectious virions is poorly understood. We invent high-throughput SHAPE (selective 2?-hydroxyl acylation analyzed by primer extension) technology, which uses many of the same tools as DNA sequencing, to quantify RNA backbone flexibility at single-nucleotide resolution and from which robust structural information can be immediately derived. We analyze the structure of HIV-1 genomic RNA in four biologically instructive states, including the authentic viral genome inside native particles. Remarkably, given the large number of plausible local structures, the first 10% of the HIV-1 genome exists in a single, predominant conformation in all four states. We also discover that noncoding regions functioning in a regulatory role have significantly lower (p-value < 0.0001) SHAPE reactivities, and hence more structure, than do viral coding regions that function as the template for protein synthesis. By directly monitoring protein binding inside virions, we identify the RNA recognition motif for the viral nucleocapsid protein. Seven structurally homologous binding sites occur in a well-defined domain in the genome, consistent with a role in directing specific packaging of genomic RNA into nascent virions. In addition, we identify two distinct motifs that are targets for the duplex destabilizing activity of this same protein. The nucleocapsid protein destabilizes local HIV-1 RNA structure in ways likely to facilitate initial movement both of the retroviral reverse transcriptase from its tRNA primer and of the ribosome in coding regions. Each of the three nucleocapsid interaction motifs falls in a specific genome domain, indicating that local protein interactions can be organized by the long-range architecture of an RNA. High-throughput SHAPE reveals a comprehensive view of HIV-1 RNA genome structure, and further application of this technology will make possible newly informative analysis of any RNA in a cellular transcriptome. [Less...]
[Pris-Pereira] Busch A and Backofen R
INFO-RNA—a fast approach to inverse RNA folding
Bioinformatics, vol.22, issue 15, pp. 1823-1831, 2006. [Article] [More...] [Less...]

Un des algorithmes les plus efficaces pour le problème du design d'ARN

Abstract Motivation: The structure of RNA molecules is often crucial for their function. Therefore, secondary structure prediction has gained much interest. Here, we consider the inverse RNA folding problem, which means designing RNA sequences that fold into a given structure.Results: We introduce a new algorithm for the inverse folding problem (INFO-RNA) that consists of two parts; a dynamic programming method for good initial sequences and a following improved stochastic local search that uses an effective neighbor selection method. During the initialization, we design a sequence that among all sequences adopts the given structure with the lowest possible energy. For the selection of neighbors during the search, we use a kind of look-ahead of one selection step applying an additional energy-based criterion. Afterwards, the pre-ordered neighbors are tested using the actual optimization criterion of minimizing the structure distance between the target structure and the mfe structure of the considered neighbor.We compared our algorithm to RNAinverse and RNA-SSD for artificial and biological test sets. Using INFO-RNA, we performed better than RNAinverse and in most cases, we gained better results than RNA-SSD, the probably best inverse RNA folding tool on the market. [Less...]
[Pris-Guilhot-Gaudeffroy] Zadeh JN, Wolfe BR, Pierce NA
Nucleic acid sequence design via efficient ensemble defect optimization
J Comput Chem, vol 32, issue 3, pp. 439–452, February 2011 [Article] [More...] [Less...]

Une contribution récente généralisant le problème du design d'ARN à n-molécules, ouvrant la voie vers une biologie synthétique de l'ARN.

Abstract We describe an algorithm for designing the sequence of one or more interacting nucleic acid strands intended to adopt a target secondary structure at equilibrium. Sequence design is formulated as an optimization problem with the goal of reducing the ensemble defect below a user-specified stop condition. For a candidate sequence and a given target secondary structure, the ensemble defect is the average number of incorrectly paired nucleotides at equilibrium evaluated over the ensemble of unpseudoknotted secondary structures. To reduce the computational cost of accepting or rejecting mutations to a random initial sequence, candidate mutations are evaluated on the leaf nodes of a tree-decomposition of the target structure. During leaf optimization, defect-weighted mutation sampling is used to select each candidate mutation position with probability proportional to its contribution to the ensemble defect of the leaf. As subsequences are merged moving up the tree, emergent structural defects resulting from crosstalk between sibling sequences are eliminated via reoptimization within the defective subtree starting from new random subsequences. Using a ?(N3) dynamic program to evaluate the ensemble defect of a target structure with N nucleotides, this hierarchical approach implies an asymptotic optimality bound on design time: for sufficiently large N, the cost of sequence design is bounded below by 4/3 the cost of a single evaluation of the ensemble defect for the full sequence. Hence, the design algorithm has time complexity O(N3). For target structures containing N in {100,200,400,800,1600,3200} nucleotides and duplex stems ranging from 1 to 30 base pairs, RNA sequence designs at 37°C typically succeed in satisfying a stop condition with ensemble defect less than N/100. Empirically, the sequence design algorithm exhibits asymptotic optimality and the exponent in the time complexity bound is sharp. [Less...]

Slides et énoncés de TP

Séance 1 : Structure(s) de l'ARN, programmation dynamique, repliement dans le modèle d'énergie de Nussinov/Jacobson.
[Slides cours], [Version compacte-4pp]
[Énoncé TP 1 + Rappels Python]
[Wrappers Python Vienna Package]
Séance 2 : Structures sous-optimales, ensemble de Boltzmann, échantillonnage statistique, pseudo-noeuds simples et quelques algorithmes de comparaison.
[Slides cours], [Version compacte-4pp]
[Énoncé TP 2]
Séance 3 : Approches génériques et modulaires pour la programmation dynamique, modèles syntaxiques et hypergraphes.
[Slides cours], [Version compacte-4pp]
[Énoncé TP 3]