2013-2014-CASM-M2 BIBS

J'interviens au sein du Master Bioinformatique et Biostatistiques (BIBS) de l'Université Paris-Sud/11 et co-habilité par l'École Polytechnique, dans le cadre de la matière Combinatoire, Algorithmes, Séquences et Modélisation (CASM). Le cours du 18 Novembre 2013 aura lieu au PUIO (Salle E205), et consistera en un présentation de différentes approches issues de la programmation dynamique développées dans le cadre du repliement de l'ARN.

Slides

Modalités de contrôle continu

Le contrôle continu de CASM comporte :

  • Examen écrit (date/lieu à venir) : Il portera sur l'ensemble du programme, et comptera pour 2/3 de la note finale. On pourra s'entraîner sur les examens des années précédentes :
  • Examen d'article : Soutenance organisée collectivement (date/lieu à venir) et comptant pour 1/3 de la note finale.
    Déroulement : Choisir individuellement un article, dont vous devrez effectuer une étude synthétique. Un court rapport (+/- 5 pages) devra nous être remis au moins une semaine avant la soutenance individuelle (15 min. chacun).

Liste des articles proposés

Une lecture d'article sera effectuée et la soutenue en monome. Chaque article ne pourra être choisi que par une personne au plus. Veuillez me contacter dès que vous avez choisi un article, que je puisse le retirer de la liste des articles disponibles.

  1. Assemblage - [Pris - Rincon]
    Sequence assembly demystified [PDF]
    Niranjan Nagarajan and Mihai Pop
    [More...]
    Abstract Advances in sequencing technologies and increased access to sequencing services have led to renewed interest in sequence and genome assembly. Concurrently, new applications for sequencing have emerged, including gene expression analysis, discovery of genomic variants and metagenomics, and each of these has different needs and challenges in terms of assembly. We survey the theoretical foundations that underlie modern assembly and highlight the options and practical trade-offs that need to be considered, focusing on how individual features address the needs of specific applications. We also review key software and the interplay between experimental design and efficacy of assembly.
  2. Assemblage - [Pris-HeliouAmélie]
    Pathset Graphs: A Novel Approach for Comprehensive Utilization of Paired Reads in Genome Assembly [PDF]
    Son k. Pham, Dmitry Antipov, Alexander Sirotkin, Glenn Tesler, Pavel A. Pevzner, and Max A. Alekseyev
    [More...]
    Abstract One of the key advances in genome assembly that has led to a significant improvement incontig lengths has been improved algorithms for utilization of paired reads (mate-pairs).While in most assemblers, mate-pair information is used in a post-processing step, therecently proposed Paired de Bruijn Graph (PDBG) approach incorporates the mate-pairinformation directly in the assembly graph structure. However, the PDBG approach facesdifficulties when the variation in the insert sizes is high. To address this problem, we firsttransform mate-pairs into edge-pair histograms that allow one to better estimate the distancebetween edges in the assembly graph that represent regions linked by multiple mate-pairs.Further, we combine the ideas of mate-pair transformation and PDBGs to construct newdata structures for genome assembly: pathsets and pathset graphs.
  3. Assemblage - [Pris-Mollion]
    GAGE : A critical evaluation of genome assemblies and assembly algorithms [PDF]
    Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, Daniela Puiu, Tanja Magoc,Sergey Koren, Todd J. Treangen, Michael C. Schatz, Arthur L. Delcher,Michael Roberts, Guillaume Marcxais, Mihai Pop, and James A. Yorke
    [More...]
    Abstract New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists toinitiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology cangenerate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one ofthese projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. Thesesequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assemblyremains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. Inthis study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, allgenerated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well asother significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarchingconclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on thequality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among differentassemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not wellcorrelated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freelyavailable, as are all assemblers used in this study.
  4. Assemblage des génomes - [Pris-Baudoin]
    Succinct data structures for assembling large genomes [PDF]
    Thomas C. Conway and Andrew J. Bromage
    [More...]
    Abstract Motivation: Second-generation sequencing technology makes it feasible for many researches to obtain enough sequence reads to attempt the de novo assembly of higher eukaryotes (including mammals). De novo assembly not only provides a tool for understanding wide scale biological variation, but within human biomedicine, it offers a direct way of observing both large-scale structural variation and fine-scale sequence variation. Unfortunately, improvements in the computational feasibility for de novo assembly have not matched the improvements in the gathering of sequence data. This is for two reasons: the inherent computational complexity of the problem and the in-practice memory requirements of tools. Results: In this article, we use entropy compressed or succinct data structures to create a practical representation of the de Bruijn assembly graph, which requires at least a factor of 10 less storage than the kinds of structures used by deployed methods. Moreover, because our representation is entropy compressed, in the presence of sequencing errors it has better scaling behaviour asymptotically than conventional approaches. We present results of a proof-of concept assembly of a human genome performed on a modest commodity server.Availability: Binaries of programs for constructing and traversing the de Bruijn assembly graph are available from http://www.genomics.csse.unimelb.edu.au/succinctAssembly.
  5. Assemblage - [Pris-Tran]
    Fast gapped-read alignment with Bowtie 2 [PDF]
    Ben Langmead and Steven L Salzberg
    [More...]
    Abstract As the rate of sequencing increases, greater throughput is demanded from read aligners. The full-text minute index is often used to make alignment very fast and memory-efficient, but the approach is ill-suited to finding longer, gapped alignments. Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.
  6. Combinatoire des séquences - [Pris-HeliouAlice]
    Informed and Automated k-Mer Size Selection for Genome Assembly [PDF]
    Rayan Chikhi and Paul Medvedev
    [More...]
    Abstract Motivation: Genome assembly tools based on the de Bruijn graphframework rely on a parameter k, which represents a trade-offbetween several competing effects that are difficult to quantify. Thereis currently a lack of tools that would automatically estimate the best kto use and/or quickly generate histograms of k-mer abundances thatwould allow the user to make an informed decision.Results: We develop a fast and accurate sampling method thatconstructs approximate abundance histograms with a several ordersof magnitude performance improvement over traditional methods. Wethen present a fast heuristic that uses the generated abundancehistograms for putative k values to estimate the best possible valueof k. We test the effectiveness of our tool using diverse sequencingdatasets and find that its choice of k leads to some of the bestassemblies.Availability: Our tool KMERGENIE is freely available at:http://kmergenie.bx.psu.edu/
  7. Combinatoire des séquences - [Pris-Pitel]
    Separating Significant Matches from Spurious Matches in DNA Sequences [PDF]
    Hugo Devillers and Sophie Schbath
    [More...]
    Abstract Word matches are widely used to compare genomic sequences. Complete genome alignmentmethods often rely on the use of matches as anchors for building their alignments, andvarious alignment-free approaches that characterize similarities between large sequencesare based on word matches. Among matches that are retrieved from the comparison of twogenomic sequences, a part of them may correspond to spurious matches (SMs), which arematches obtained by chance rather than by homologous relationships. The number of SMsdepends on the minimal match length (‘) that has to be set in the algorithm used to retrievethem. Indeed, if ‘ is too small, a lot of matches are recovered but most of them are SMs.Conversely, if ‘ is too large, fewer matches are retrieved but many smaller significantmatches are certainly ignored. To date, the choice of ‘ mostly depends on empiricalthreshold values rather than robust statistical methods. To overcome this problem, wepropose a statistical approach based on the use of a mixture model of geometric distributionsto characterize the distribution of the length of matches obtained from the comparisonof two genomic sequences.
  8. Variantes structurales - [Pris-Yousra]
    Computational methods for discovering structural variation with next-generation sequencing [PDF]
    Paul Medvedev, Monica Stanciu an Michael Brudno
    [More...]
    Abstract In the last several years, a number of studies have described large-scale structuralvariation in several genomes. Traditionally, such methods have used whole-genome arraycomparative genome hybridization or single-nucleotide polymorphism arrays to detectlarge regions subject to copy-number variation. Later techniques have been based onpaired-end mapping of Sanger sequencing data, providing better resolution and accuracy.With the advent of next-generation sequencing, a new generation of methods is beingdeveloped to tackle the challenges of short reads, while taking advantage of the highcoverage the new sequencing technologies provide. In this survey, we describe thesemethods, including their strengths and their limitations, and future research directions.
  9. Variantes Structurales - [Pris-Abboud]
    CREST maps somatic structural variation in cancer genomeswith base-pair resolution [PDF]
    Jianmin Wang, Charles G. Mullighan, John Easton, Stefan Roberts, Jing Ma, Michael C. Rusch, Ken Chen, Christopher C. Harris, Li Ding, Sue L. Heatley, Linda Holmfeldt, Debbie Payne-Turner, Xian Fan, Lei Wei, David Zhao, John C. Obenauer, Clayton Naeve, Elaine R. Mardis, Richard K. Wilson, James R. Downing, and Jinghui Zhang
    [More...]
    Abstract We developed CREST (Clipping REveals STructure), an algorithm that uses next-generation sequencing reads with partial alignments to a reference genome to directly map structural variations at the nucleotide level of resolution. Application of CREST to whole-genome sequencing data from five pediatric T-lineage acute lymphoblastic leukemias (T-ALLs) and a human melanoma cell line, COLO-829, identified 160 somatic structural variations. Experimental validation exceeded 80% demonstrating that CREST had a high predictive accuracy.
  10. Variantes structurales - [Pris-Chiche]
    PeSV-Fisher: Identification of Somatic and Non-Somatic Structural Variants Using Next Generation Sequencing Data [PDF]
    Georgia Escaramis, Cristian Tornador, Laia Bassaganyas, Raquel Rabionet,Jose M. C. Tubio, Alexander Martinez-Fundichely, Mario Caceres, Marta Gut,Stephan Ossowski, Xavier Estivill
    [More...]
    Abstract Next-generation sequencing technologies expedited research to develop efficient computational tools for the identification of structural variants (SVs) and their use to study human diseases. As deeper data is obtained, the existence of highercomplexity SVs in some genomes becomes more evident, but the detection and definition of most of these complexrearrangements is still in its infancy. The full characterization of SVs is a key aspect for discovering their biologicalimplications. Here we present a pipeline (PeSV-Fisher) for the detection of deletions, gains, intra- and inter-chromosomaltranslocations, and inversions, at very reasonable computational costs. We further provide comprehensive information onco-localization of SVs in the genome, a crucial aspect for studying their biological consequences. The algorithm uses acombination of methods based on paired-reads and read-depth strategies. PeSV-Fisher has been designed with the aim tofacilitate identification of somatic variation, and, as such, it is capable of analysing two or more samples simultaneously,producing a list of non-shared variants between samples. We tested PeSV-Fisher on available sequencing data, andcompared its behaviour to that of frequently deployed tools (BreakDancer and VariationHunter). We have also tested thisalgorithm on our own sequencing data, obtained from a tumour and a normal blood sample of a patient with chroniclymphocytic leukaemia, on which we have also validated the results by targeted re-sequencing of different kinds ofpredictions. This allowed us to determine confidence parameters that influence the reliability of breakpoint predictions.
  11. Algorithmique de l'ARN - [Pris-Dumestier]
    Sparse RNA Folding: Time and Space Efficient Algorithms [PDF]
    Rolf Backofen, Dekel Tsur, Shay Zakov, and Michal Ziv-Ukelson
    Nouvelle technique algorithmique permettant d'éviter le calcul de toute la matrice de programmation dynamique pour le repliement de l'ARN. [More...]
    Abstract The classical algorithm for RNA single strand folding requiresO(nZ) time and O(n2) space, where n denotes the length of theinput sequence and Z is a sparsity parameter that satisfies n < Z < n2.We show how to reduce the space complexity of this algorithm. Thespace reduction is based on the observation that some solutions for subproblemsare not examined after a certain stage of the algorithm, andmay be discarded from memory. This yields an O(nZ) time and O(Z)space algorithm, that outputs both the cardinality of the optimal foldingas well as a corresponding secondary structure. The space-effcientapproach also extends to the related RNA simultaneous alignment withfolding problem, and can be applied to reduce the space complexity of thefastest algorithm for this problem from O(n2m2) down to O(nm2 + Z~),where n and m denote the lengths of the input sequences to be aligned,and ~ Z is a sparsity parameter that satisfies nm < Z~ < n2m2.In addition, we also show how to speed up the base-pairing maximizationvariant of RNA single strand folding. The speed up is achieved by combiningtwo independent existing techniques, which restrict the numberof expressions that need to be examined in bottleneck computations ofthese algorithms. This yields an O(LZ) time and O(Z) space algorithm,where L denotes the maximum cardinality of a folding of the input sequence.
  12. Algorithmique de l'ARN - [Pris-Monet]
    The RNA Newton polytope and learnability of energy parameters [PDF]
    Elmirasadat Forouzmand and Hamidreza Chitsaz
    Est il possible d'apprendre des paramètres du modèle d'énergie de façon à replier correctement un ensemble de structures ? [More...]
    Abstract Motivation: Computational RNA structure prediction is a mature importantproblem that has received a new wave of attention with thediscovery of regulatory non-coding RNAs and the advent of highthroughputtranscriptome sequencing. Despite nearly two scoreyears of research on RNA secondary structure and RNA–RNA interactionprediction, the accuracy of the state-of-the-art algorithms arestill far from satisfactory. So far, researchers have proposed increasinglycomplex energy models and improved parameter estimationmethods, experimental and/or computational, in anticipation ofendowing their methods with enough power to solve the problem.The output has disappointingly been only modest improvements, notmatching the expectations. Even recent massively featured machinelearning approaches were not able to break the barrier. Why is that?Approach: The first step toward high-accuracy structure prediction isto pick an energy model that is inherently capable of predicting eachand every one of known structures to date. In this article, we introducethe notion of learnability of the parameters of an energy model as ameasure of such an inherent capability. We say that the parameters ofan energy model are learnable iff there exists at least one set of suchparameters that renders every known RNA structure to date the minimumfree energy structure. We derive a necessary condition for thelearnability and give a dynamic programming algorithm to assess it.Our algorithm computes the convex hull of the feature vectors of allfeasible structures in the ensemble of a given input sequence.Interestingly, that convex hull coincides with the Newton polytope ofthe partition function as a polynomial in energy parameters. To thebest of our knowledge, this is the first approach toward computing theRNA Newton polytope and a systematic assessment of the inherentcapabilities of an energy model. The worst case complexity of ouralgorithm is exponential in the number of features. However, dimensionalityreduction techniques can provide approximate solutions toavoid the curse of dimensionality.Results: We demonstrated the application of our theory to a simpleenergy model consisting of a weighted count of A-U, C-G and G-Ubase pairs. Our results show that this simple energy model satisfiesthe necessary condition for more than half of the input unpseudoknottedsequence–structure pairs (55%) chosen from the RNASTRAND v2.0 database and severely violates the condition for~13%, which provide a set of hard cases that require further investigation.From 1350 RNA strands, the observed 3D feature vector for749 strands is on the surface of the computed polytope. For 289 RNAstrands, the observed feature vector is not on the boundary of thepolytope but its distance from the boundary is not more than one.A distance of one essentially means one base pair difference betweenthe observed structure and the closest point on the boundary of thepolytope, which need not be the feature vector of a structure. For171 sequences, this distance is larger than two, and for only 11sequences, this distance is larger than five.
  13. Algorithmique de l'ARN - [Pris-Pichené]
    Computational Design of RNAs with Complex Energy Landscapes [PDF]
    Christian Höner zu Siederdissen, Stefan Hammer, Ingrid Abfalter, Ivo L. Hofacker, Christoph Flamm, and Peter F. Stadler
    Principes de conception pour des ARN programmables, ayant des chemins de repliement prédéfinis. [More...]
    Abstract RNA has become an integral building material in synthetic biology. Dominated by their secondary structures, which can be computed efficiently, RNA molecules are amenable not only to in vitro and in vivo selection, but also to rational, computation-based design. While the inverse folding problem of constructing an RNA sequence with a prescribed ground-state structure has received considerable attention for nearly two decades, there have been few efforts to design RNAs that can switch between distinct prescribed conformations. We introduce a user-friendly tool for designing RNA sequences that fold into multiple target structures. The underlying algorithm makes use of a combination of graph coloring and heuristic local optimization to find sequences whose energy landscapes are dominated by the prescribed conformations. A flexible interface allows the specification of a wide range of design goals. We demonstrate that bi- and tri-stable “switches” can be designed easily with moderate computational effort for the vast majority of compatible combinations of desired target structures. RNAdesign is freely available under the GPL-v3 license.
  14. Algorithmique de l'ARN/Biologie synthétique - [Pris-Alfred]
    De novo automated design of small RNA circuits for engineering synthetic riboregulation in living cells [PDF]
    Guillermo Rodrigo, Thomas E. Landrain, and Alfonso Jaramillo
    The full monty : conception quasi non-supervisé de circuits à base d'ARN structurés. [More...]
    Abstract A grand challenge in synthetic biology is to use our current knowledge of RNA science to perform the automatic engineering of completely synthetic sequences encoding functional RNAs in living cells. We report here a fully automated design methodology and experimental validation of synthetic RNA interaction circuits working in a cellular environment. The computational algorithm, based on a physicochemical model, produces novel RNA sequences by exploring the space of possible sequences compatible with predefined structures. We tested our methodology in Escherichia coli by designing several positive riboregulators with diverse structures and interaction models, suggesting that only the energy of formation and the activation energy (free energy barrier to overcome for initiating the hybridization reaction) are sufficient criteria to engineer RNA interaction and regulation in bacteria. The designed sequences exhibit nonsignificant similarity to any known noncoding RNA sequence. Our riboregulatory devices work independently and in combination with transcription regulation to create complex logic circuits. Our results demonstrate that a computational methodology based on first-principles can be used to engineer interacting RNAs with allosteric behavior in living cells.
  15. Algorithmique de l'ARN - [Pris-Bouri]
    RNA folding with soft constraints: reconciliation of probing data and thermodynamic secondary structure prediction [PDF]
    Stefan Washietl, Ivo L. Hofacker, Peter F. Stadler, and Manolis Kellis
    Incorporation de données légères de sondage enzymatique dans la prédiction de structure secondaire des ARN. [More...]
    Abstract Thermodynamic folding algorithms and structure probing experiments are commonly used to determine the secondary structure of RNAs. Here we propose a formal framework to reconcile information from both prediction algorithms and probing experiments. The thermodynamic energy parameters are adjusted using ‘pseudo-energies’ to minimize the discrepancy between prediction and experiment. Our framework differs from related approaches that used pseudo-energies in several key aspects. (i) The energy model is only changed when necessary and no adjustments are made if prediction and experiment are consistent. (ii) Pseudo-energies remain biophysically interpretable and hold positional information where experiment and model disagree. (iii) The whole thermodynamic ensemble of structures is considered thus allowing to reconstruct mixtures of suboptimal structures from seemingly contradicting data. (iv) The noise of the energy model and the experimental data is explicitly modeled leading to an intuitive weighting factor through which the problem can be seen as folding with ‘soft’ constraints of different strength. We present an efficient algorithm to iteratively calculate pseudo-energies within this framework and demonstrate how this approach can be used in combination with SHAPE chemical probing data to improve secondary structure prediction. We further demonstrate that the pseudo-energies correlate with biophysical effects that are known to affect RNA folding such as chemical nucleotide modifications and protein binding.
  16. Algorithmique de l'ARN - [Pris-Nkili]
    A Combinatorial Framework for Designing (Pseudoknotted) RNA Algorithms [PDF]
    Y. Ponty and C. Saule
    Programmation dynamique générique. [More...]
    Abstract We extend an hypergraph representation, introduced by Finkelstein and Roytberg, to unify dynamic programming algorithms in the context of RNA folding with pseudoknots. Classic applications of RNA dynamic programming energy minimization, partition function, base-pair probabilities\ldots) are reformulated within this framework, giving rise to very simple algorithms. This reformulation allows one to conceptually detach the conformation space/energy model -- captured by the hypergraph model -- from the specific application, assuming unambiguity of the decomposition. To ensure the latter property, we propose a new combinatorial methodology based on generating functions. We extend the set of generic applications by proposing an exact algorithm for extracting generalized moments in weighted distribution, generalizing a prior contribution by Miklos and al. Finally, we illustrate our full-fledged programme on three exemplary conformation spaces (secondary structures, Akutsu's simple type pseudoknots and kissing hairpins). This readily gives sets of algorithms that are either novel or have complexity comparable to classic implementations for minimization and Boltzmann ensemble applications of dynamic programming.