Yann Ponty, CNRS researcher@LIX, Ecole Polytechnique

Publications

Sorted by date Sorted by type

2024
- Peer-reviewed journal articles
  - H.-T. Yao, B. Marchand, S. Berkemer, Y. Ponty, and S. Will
    
    Infrared: a declarative tree decomposition-powered framework for bioinformatics
    Algorithms for Molecular Biology, BioMed Central, To appear 2024 [doi] [hal]
    
    Abstract (click to read). Motivation: Many bioinformatics problems can be approached as optimization or controlled sampling tasks, and solved exactly and efficiently using Dynamic Programming (DP). However, such exact methods are typically tailored towards specific settings, complex to develop, and hard to implement and adapt to problem variations.Methods: We introduce the Infrared framework to overcome such hindrances for a large class of problems. Its under‐lying paradigm is tailored toward problems that can be declaratively formalized as sparse feature networks, a generalization of constraint networks. Classic Boolean constraints specify a search space, consisting of putative solutions whose evaluation is performed through a combination of features. Problems are then solved using generic cluster tree elimination algorithms over a tree decomposition of the feature network. Their overall complexities are linear on the number of variables, and only exponential in the treewidth of the feature network. For sparse feature networks, associated with low to moderate treewidths, these algorithms allow to find optimal solutions, or generate controlled samples, with practical empirical efficiency.Results: Implementing these methods, the Infrared software allows Python programmers to rapidly develop exactoptimization and sampling applications based on a tree decomposition-based efficient processing. Instead of directly coding specialized algorithms, problems are declaratively modeled as sets of variables over finite domains, whose dependencies are captured by constraints and functions. Such models are then automatically solved by generic DP algorithms. To illustrate the applicability of Infrared in bioinformatics and guide new users, we model and discuss variants of bioinformatics applications. We provide reimplementations and extensions of methods for RNA design, RNA sequence-structure alignment, parsimony-driven inference of ancestral traits in phylogenetic trees/networks, and design of coding sequences. Moreover, we demonstrate multidimensional Boltzmann sampling. These applications of the framework—together with our novel results—underline the practical relevance of Infrared. Remarkably, the achieved complexities are typically equivalent to the ones of specialized algorithms and implementations.Availability Infrared is available at https://amibio.gitlabpages.inria.fr/Infrared with extensive documentation, including various usage examples and API reference; it can be installed using Conda or from source.
- Conferences with proceedings
  - T. Yacoub, R. González-Alemán, F. Leclerc, I. Chauvot de Beauchêne, and Y. Ponty
    
    Color Coding for the Fragment-Based Docking, Design and Equilibrium Statistics of Protein-Binding ssRNAs
    Proceedings of RECOMB 2024 - 28th International Conference of Research in Computational Molecular Biology, United States, 2024 [hal]
    
    Abstract (click to read). We revisit the fragment-based docking and design of single-stranded RNA aptamers (ssRNAs), consisting of $k$ nucleotides, onto a rigid protein. Fragments, representing either one or multiple pieced nucleotides, are individually docked onto the protein surface using a force field, and some among the resulting $n$ poses are pieced together to form a conformation compatible with the input ssRNA sequence. Relaxing the sequence compatibility constraint, a similar methodology can be used to design ssRNAs that preferentially bind a protein of interest, possibly targeting a pocket. However, a brute-force enumeration of clash-free conformations quickly becomes prohibitive due to their superexponential ($\Theta(n^k)$ worst-case) combinatorial explosion, hindering the potential of fragment-based methods towards docking and design.We adopt the color-coding technique, introduced by Alon, Yuster and Zwick, to optimize over self-avoiding fragment assemblies in time/space linear on $n$ the number of poses, and in time only exponential on $k$ the number of fragments. The dynamic programming algorithm at the core of our method is surprisingly simple, and can be extended to produce suboptimal candidates, or modified to perform Boltzmann sampling of candidates assemblies. Using a rejection principle, and further optimized by a clique decomposition of clashing poses, these algorithms can be leveraged into efficient algorithms optimizing over clash-free complexes. The resulting sampling procedure can further be adapted into statistically-consistent estimators for any computable feature of interest.We showcase some of the capabilities of this new framework by reanalyzing a set of 7 documented ssRNA-protein complexes, demonstrating its practical relevance and versatility.
2023
- Peer-reviewed journal articles
  - B. Marchand, S. Will, S. Berkemer, Y. Ponty, and L. Bulteau
    
    Automated design of dynamic programming schemes for RNA folding with pseudoknots
    Algorithms for Molecular Biology, BioMed Central, 18(1):18 pp, 2023 [doi] [hal]
    
    Abstract (click to read). Although RNA secondary structure prediction is a textbook application of dynamic programming (DP) and routine task in RNA structure analysis, it remains challenging whenever pseudoknots come into play. Since the prediction of pseudoknotted structures by minimizing (realistically modelled) energy is NP-hard, specialized algorithms have been proposed for restricted conformation classes that capture the most frequently observed configurations. To achieve good performance, these methods rely on specific and carefully hand-crafted DP schemes. In contrast, we generalize and fully automatize the design of DP pseudoknot prediction algorithms. For this purpose, we formalize the problem of designing DP algorithms for an (infinite) class of conformations, modeled by (a finite number of) fatgraphs, and automatically build DP schemes minimizing their algorithmic complexity. We propose an algorithm for the problem, based on the tree-decomposition of a well-chosen representative structure, which we simplify and reinterpret as a DP scheme. The algorithm is fixed-parameter tractable for the tree-width tw of the fatgraph, and its output represents a $O(n^{tw+1})$ algorithm (and even possibly $O(n^{tw})$ in simple energy models) for predicting the MFE folding of an RNA of length n. We demonstrate, for the most common pseudoknot classes, that our automatically generated algorithms achieve the same complexities as reported in the literature for hand-crafted schemes. Our framework supports general energy models, partition function computations, recursive substructures and partial folding, and could pave the way for algebraic dynamic programming beyond the context-free case.
- Conferences with proceedings
  - T. Boury, Y. Ponty, and V. Reinharz
    
    Automatic exploration of the natural variability of RNA non-canonical geometric patterns with a parameterized sampling technique
    Proceedings of WABI 2023 - 23rd Workshop on Algorithms in Bioinformatics, United States, 2023 [doi] [hal]
    
    Abstract (click to read). Motivation: Recurrent substructures in RNA, known as 3D motifs, consist of networks of base pair interactions and are critical to understanding the relationship between structure and function. Their structure is naturally expressed as a graph which has led to many graph-based algorithms to automatically catalog identical motifs found in 3D structures. Yet, due to the complexity of the problem, state-of-the-art methods are often optimized to find exact matches, limiting the search to a subset of potential solutions, or do not allow explicit control over the desired variability. Results: We developed FuzzTree, a method able to efficiently sample subgraphs in an RNA structure that lie in a close neighborhood of a requested motif. It is the first method that allows explicit control over (1) the admissible geometric variability in the interactions, (2) the number of missing edges, and (3) introduction of discontinuities in the backbone given close distances in the 3D structure. Our tool relies on a multidimensional Boltzmann sampling procedure with complexity parameterized by the treewidth of the requested motif. We applied our method to the well-known internal loop Kink-Turn motif, which can be divided into 12 subgroups. Given only the graph representing the main Kink-Turn subgroup, FuzzTree retrieved over 3/4 of all kink-turns. We also highlight two occurrences of new sampled patterns. Our tool is available as free software and can be customized for different parameters and types of graphs.
- Miscellaneous & Submitted
  - Y. Ponty, and S. Roy
    
    ISMB/ECCB 2023 proceedings
    [doi] [hal]
    
    Abstract (click to read).
  - S. Findeiß, C. Flamm, and Y. Ponty
    
    Preface/Executive summary
    [doi] [hal]
    
    Abstract (click to read). This report documents the program and outcomes of Dagstuhl Seminar 22381 "Rational Design of RiboNucleic Acids" (RNAs). The seminar covered a wide array of models, algorithmic strategies, molecular scales and modalities, all targeting in silico design of RNAs performing predefined biological functions. It consisted in a series of talks, each being allocated a generous time budget enabling frequent (welcomed!) interruptions and fruitful discussions. Applications of rational RNA design include mRNA vaccines; RNAs acting as sensors; self-replicating RNAs, relevant to RNA world/origin of life studies; populations of RNAs performing computations, e.g. through strand-displacement systems; RNA origamis forming nano-architectures through self-assembly; weakly interacting RNAs inducing the formation of droplets within cells through liquid-liquid phase separation. Those diverse applications are typically tackled by Bioinformatics-inclined scientists, contributing to distinct areas of life science and, as a result, somewhat isolated and sometimes unaware of similar pursuits in neighboring fields. The overarching goals of this meeting were to gather computational scientists from multiple fields, increase awareness of relevant efforts in distant communities, and ultimately contribute to a transversal perspective where RNA design becomes an object of study in itself.
  - T. Boury, L. Bulteau, B. Marchand, and Y. Ponty
    
    Independent set reconfiguration: general and RNA-focused parameterized algorithms
    [hal]
    
    Abstract (click to read). In this paper, we study the Independent Set (IS) reconfiguration problemin graphs, and its applications to RNA kinetics.An IS reconfiguration is a scenario transforming an IS $L$ into another IS $R$,inserting/removing one vertex at a time while keeping thecardinalities of intermediate sets as large as possible. Wefocus on the \emph{bipartite} variant where only start and end vertices areallowed in intermediate ISs.Our motivation is an applicationto the \emph{RNA energy barrier}, a classic hard problemfrom bioinformatics, which asks, given twoRNA structures given as input, whether there exists a reconfiguration pathwayconnecting them and staying below an energy threshold. A natural parameter for this problem would be the differencebetween the initial IS size and the threshold (\emph{barrier}).We first show the para-NP hardness of the problem with respect to thisparameter. We then investigate two new parameters, the \emph{cardinalityrange} $\range$ and a measure of \emph{arboricity} $\Phi$.$\rho$ denotes the maximum allowed size differencebetween an IS along the reconfiguration and a maximum IS, while $\Phi$ is a measure of the amount of ``branching'' in the two input RNA structures. We show that bipartite IS reconfiguration is XP for $\range$ in the general case,and XP for $\Phi$ in the sub-case of bipartite graphs stemming from RNA instances.%, and FPT when combining both $\Phi$ and $\range$ in the RNA case.We give two different routes yielding XP algorithms for $\rho$: The first is a direct$O(n^{2})$-space, $O(n^{2\rho+2.5})$-time algorithm based on a separation lemma; The second builds on a parameterized equivalence with thedirected pathwidth problem, leading to a $O(n^{\rho+1})$-space,$O(n^{\rho+2})$-time algorithm for the reconfiguration problem through anadaptation of a prior result by Tamaki \cite{Tamaki2011}. This equivalenceis an interesting result in its own right, connecting a reconfigurationproblem (which is essentially a \emph{connectivity} problem within a\emph{reconfiguration network}) with a \emph{structural} parameter for anauxiliary graph. For $\Phi$, our $O(n^{\Phi+1})$-algorithm stems from seeingthe problem as an instance of \emph{minimum cumulative-cost scheduling},and relies on a \emph{merging} procedure that might be of independent interest.These results improve upon a partial $O(n^{2\rho+2.5})$-algorithm that onlyapplied to the RNA case.We also demonstrate their practicality of these algorithms througha benchmark on small random RNA instances.
  - J. Fernandez-De-Cossio-Diaz, P. Hardouin, F.-X. Du Moutier, A. Di Gioacchino, B. Marchand, Y. Ponty, B. Sargueil, R. Monasson, and S. Cocco
    
    Designing molecular RNA switches with Restricted Boltzmann machines
    [doi] [hal]
    
    Abstract (click to read). Riboswitches are structured allosteric RNA molecules capable of switching between competing conformations in response to a metabolite binding event, eventually triggering a regulatory response. Computational modelling of these molecules is complicated by complex tertiary contacts, conditioned to the presence of their cognate metabolite. In this work, we show that Restricted Boltzmann machines (RBM), a simple two-layer machine learning model, capture intricate sequence dependencies induced by secondary and tertiary structure, as well as the switching mechanism, resulting in a model that can be successfully used for the design of allosteric RNA. As a case study we consider the aptamer domain of SAM-I riboswitches. To validate the functionality of designed sequences experimentally by SHAPE-MaP, we develop a tailored analysis pipeline adequate for high-throughput probing of diverse homologous sequences. We find that among the probed 84 RBM designed sequences, showing up to 20% divergence from any natural sequence, about 28% (and 47% of the 45 among them having low RBM effective energies), are correctly structured and undergo a structural allosteric in response to SAM. Finally, we show how the flexibility of the molecule to switch conformations is connected to fine energetic features of its structural components.
2022
- Peer-reviewed journal articles
  - A. Churkin, Y. Ponty, and D. Barash
    
    IndelsRNAmute: predicting deleterious multiple point substitutions and indels mutations
    BMC Bioinformatics, BioMed Central, 23(S8):424 pp, 2022 [doi] [hal]
    
    Abstract (click to read). Background: RNA deleterious point mutation prediction was previously addressed with programs such as RNAmute and MultiRNAmute. The purpose of these programs is to predict a global conformational rearrangement of the secondary structure of a functional RNA molecule, thereby disrupting its function. RNAmute was designed to deal with only single point mutations in a brute force manner, while in MultiRNAmute an efficient approach to deal with multiple point mutations was developed. The approach used in MultiRNAmute is based on the stabilization of the suboptimal RNA folding prediction solutions and/or destabilization of the optimal folding prediction solution of the wild type RNA molecule. The MultiRNAmute algorithm is significantly more efficient than the brute force approach in RNAmute, but in the case of long sequences and large m-point mutation sets the MultiRNAmute becomes exponential in examining all possible stabilizing and destabilizing mutations. Results: An inherent limitation in the RNAmute and MultiRNAmute programs is their ability to predict only substitution mutations, as these programs were not designed to work with deletion or insertion mutations. To address this limitation we herein develop a very fast algorithm, based on suboptimal folding solutions, to predict a predefined number of multiple point deleterious mutations as specified by the user. Depending on the user’s choice, each such set of mutations may contain combinations of deletions, insertions and substitution mutations. Additionally, we prove the hardness of predicting the most deleterious set of point mutations in structural RNAs. Conclusions: We developed a method that extends our previous MultiRNAmute method to predict insertion and deletion mutations in addition to substitutions. The additional advantage of the new method is its efficiency to find a predefined number of deleterious mutations. Our new method may be exploited by biologists and virologists prior to site-directed mutagenesis experiments, which involve indel mutations along with substitutions. For example, our method may help to investigate the change of function in an RNA virus via mutations that disrupt important motifs in its secondary structure.
  - B. Marchand, Y. Ponty, and L. Bulteau
    
    Tree Diet: Reducing the Treewidth to Unlock FPT Algorithms in RNA Bioinformatics
    Algorithms for Molecular Biology, BioMed Central, 17(1):8 pp, 2022 [doi] [hal]
    
    Abstract (click to read). Hard graph problems are ubiquitous in Bioinformatics, inspiring the design of specialized Fixed-Parameter Tractable algorithms, many of which rely on a combination of tree-decomposition and dynamic programming. The time/space complexities of such approaches hinge critically on low values for the treewidth tw of the input graph. In order to extend their scope of applicability, we introduce the Tree-Diet problem, i.e. the removal of a minimal set of edges such that a given tree-decomposition can be slimmed down to a prescribed treewidth tw ′. Our rationale is that the time gained thanks to a smaller treewidth in a parameterized algorithm compensates the extra post-processing needed to take deleted edges into account. Our core result is an FPT dynamic programming algorithm for Tree-Diet, using 2 O(tw) n time and space. We complement this result with parameterized complexity lower-bounds for stronger variants (e.g., NP-hardness when tw ′ or tw−tw ′ is constant). We propose a prototype implementation for our approach which we apply on difficult instances of selected RNA-based problems: RNA design, sequence-structure alignment, and search of pseudoknotted RNAs in genomes, revealing very encouraging results. This work paves the way for a wider adoption of tree-decomposition-based algorithms in Bioinformatics.
- Conferences with proceedings
  - B. Marchand, S. Will, S. Berkemer, L. Bulteau, and Y. Ponty
    
    Automated design of dynamic programming schemes for RNA folding with pseudoknots
    Proceedings of WABI 2022 - 22nd Workshop on Algorithms in Bioinformatics, Germany, 2022 [hal]
    
    Abstract (click to read). Despite being a textbook application of dynamic programming (DP) and routine task in RNA structure analysis, RNA secondary structure prediction remains challenging whenever pseudoknots come into play. To circumvent the NP-hardness of energy minimization in realistic energy models, specialized algorithms have been proposed for restricted conformation classes that capture the most frequently observed configurations. While these methods rely on hand-crafted DP schemes, we generalize and fully automatize the design of DP pseudoknot prediction algorithms. We formalize the problem of designing DP algorithms for an (infinite) class of conformations, modeled by (a finite number of) fatgraphs, and automatically build DP schemes minimizing their algorithmic complexity. We propose an algorithm for the problem, based on the tree-decomposition of a well-chosen representative structure, which we simplify and reinterpret as a DP scheme. The algorithm is fixed-parameter tractable for the tree-width tw of the fatgraph, and its output represents a O(n tw+1) algorithm for predicting the MFE folding of an RNA of length n. Our general framework supports general energy models, partition function computations, recursive substructures and partial folding, and could pave the way for algebraic dynamic programming beyond the context-free case.
- Book chapters & Editorials
  - Y. Ponty, and V. Reinharz
    
    RNA Folding
    From text to graphs: Discrete structures and methods for Bioinformatics, ISTE, 2022 [hal]
    
    Abstract (click to read). Introduction Ribonucleic Acids (RNA) are molecules that are essential to any living organism. They are composed of nucleotides named Adenine, Cytosine, Guanine and Uracile, and usually represented as sequences over an {A, C, G, U} alphabet. In this book chapter, we provide a gentle, algorithmically self-contained, introduction into combinatorial algorithms introduced over the years to predict the secondary structure of RNA.
  - H.-T. Yao, Y. Ponty, and S. Will
    
    Developing complex RNA design applications in the Infrared framework
    RNA Folding - Methods and Protocols, 2022 [hal]
    
    Abstract (click to read). Applications in biotechnology and bio-medical research call for effective strategies to design novel RNAs with very specific properties. Such advanced design tasks require support by computational tools but at the same time put high demands on their flexibility and expressivity to model the application-specific requirements. To address such demands, we present the computational framework Infrared. It supports developing advanced customized design tools, which generate RNA sequences with specific properties, often in a few lines of Python code. This text guides the reader in tutorial format through the development of complex design applications. Thanks to the declarative, compositional approach of Infrared, we can describe this development as step-by-step extension of an elementary design task. Thus, we start with generating sequences that are compatible with a single RNA structure and go all the way to RNA design targeting complex positive and negative design objectives with respect to single or even multiple target structures. Finally, we present a ’real-world’ application of computational design to create an RNA device for biotechnology: we use Infrared to generate design candidates of an artificial ‘AND’ riboswitch, which activates gene expression in the simultaneous presence of two different small metabolites.Keywords: RNA design, multi-target design, declarative modeling, Boltzmannsampling, fixed-parameter tractable sampling
- Miscellaneous & Submitted
  - S. O’donoghue, J. Procter, L. Collinson, A. Yates, L. Gregg, B. Sommer, S. Velankar, R. Beiko, and Y. Ponty
    
    VIZBI 2021 special issue - 11th international meeting on visualizing biological data
    [doi] [hal]
    
    Abstract (click to read). This Research Topic collects work presented during the 11th International Meeting on Visualizing Biological Data (VIZBI 2021), held as an EMBL virtual conference from March 24–26. The conference features talks from 21 world-leading researchers showcasing visualizations transforming how life scientists view data, and driving key advances in molecular biology, systems biology, biomedical science, and ecology. In addition, the conference includes work presented as posters and lightning talks, contributed from the other conference attendees.For VIZBI 2021 - for the first time - all conference speakers and participants had the opportunity to disseminate work presented during the meeting as peer-reviewed publications, through this Research Topic. Manuscripts were be handled by an editorial board formed from VIZBI session chairs, along with eminent speakers from past meetings.
2021
- Peer-reviewed journal articles
  - G. Entzian, I. Hofacker, Y. Ponty, R. Lorenz, and A. Tanzer
    
    RNAxplorer: Harnessing the Power of Guiding Potentials to Sample RNA Landscapes
    Bioinformatics, Oxford University Press (OUP), To appear 2021 [hal]
    
    Abstract (click to read). Motivation: Predicting the folding dynamics of RNAs is a computationally difficult problem, first and foremost due to the combinatorial explosion of alternative structures in the structure space. Abstractions are therefore needed to simplify downstream analyses, and make them computationally tractable. This can be achieved by various structure sampling algorithms. However, current sampling methods are still time consuming and frequently fail to represent key elements of the folding space. Method: We introduce RNAxplorer, a novel adaptive sampling method which uses dynamic programming to perform an efficient Boltzmann sampling in the presence of guiding potentials reflecting the similarity to already well-sampled structures. These potentials are accumulated into pseudo-energy terms that effectively steer sampling towards underrepresented or unexplored regions of the structure space. Results: RNAxplorer allows us to efficiently explore RNA state space. It yields rare conformations that may be inaccessible to other sampling methods. We developed and applied different measures to benchmark our sampling methods against its competitors. Most of the measures show that RNAxplorer produces more diverse structure samples and is better at finding the most relevant kinetic traps in the landscape. Thus, it produces a more representative coarse graining of the landscape that is well suited to compute better approximations of RNA folding kinetics.Availability: https://github.com/ViennaRNA/RNAxplorer/
  - H. Nguyen, H. Xue, V. Firlej, Y. Ponty, M. Gallopin, and D. Gautheret
    
    Reference-free transcriptome signatures for prostate cancer prognosis
    BMC Cancer, BioMed Central, 21(394):To appear 2021 [doi] [hal]
    
    Abstract (click to read). Background. RNA-seq data are increasingly used to derive prognostic signatures for cancer outcome prediction. A limitation of current predictors is their reliance on reference gene annotations, which amounts to ignoring large numbers of non-canonical RNAs produced in disease tissues. A recently introduced kind of transcriptome classifier operates entirely in a reference-free manner, relying on k-mers extracted from patient RNA-seq data.Methods. In this paper, we set out to compare conventional and reference-free signatures in risk and relapse prediction of prostate cancer. To compare the two approaches as fairly as possible, we set up a common procedure that takes as input either a k-mer count matrix or a gene expression matrix, extracts a signature and evaluates this signature in an independent dataset.Results. We find that both gene-based and k-mer based classifiers had similarly high performances for risk prediction and a markedly lower performance for relapse prediction. Interestingly, the reference-free signatures included a set of sequences mapping to novel lncRNAs or variable regions of cancer driver genes that were not part of gene-based signatures.Conclusions. Reference-free classifiers are thus a promising strategy for the identification of novel prognostic RNA biomarkers.
  - H. Nguyen, H. Xue, V. Firlej, Y. Ponty, M. Gallopin, and D. Gautheret
    
    Reference-free transcriptome signatures for prostate cancer prognosis
    BMC Cancer, BioMed Central, 21(1):394 pp, 2021 [doi] [hal]
    
    Abstract (click to read). Abstract Background RNA-seq data are increasingly used to derive prognostic signatures for cancer outcome prediction. A limitation of current predictors is their reliance on reference gene annotations, which amounts to ignoring large numbers of non-canonical RNAs produced in disease tissues. A recently introduced kind of transcriptome classifier operates entirely in a reference-free manner, relying on k-mers extracted from patient RNA-seq data. Methods In this paper, we set out to compare conventional and reference-free signatures in risk and relapse prediction of prostate cancer. To compare the two approaches as fairly as possible, we set up a common procedure that takes as input either a k-mer count matrix or a gene expression matrix, extracts a signature and evaluates this signature in an independent dataset. Results We find that both gene-based and k-mer based classifiers had similarly high performances for risk prediction and a markedly lower performance for relapse prediction. Interestingly, the reference-free signatures included a set of sequences mapping to novel lncRNAs or variable regions of cancer driver genes that were not part of gene-based signatures. Conclusions Reference-free classifiers are thus a promising strategy for the identification of novel prognostic RNA biomarkers.
  - G. de Bisschop, D. Allouche, E. Frezza, B. Masquida, Y. Ponty, S. Will, and B. Sargueil
    
    Progress Toward SHAPE Constrained Computational Prediction of Tertiary Interactions in RNA Structure
    Non-Coding RNA, MDPI, 7(4):71 pp, 2021 [doi] [hal]
    
    Abstract (click to read). As more sequencing data accumulate and novel puzzling genetic regulations are discovered, the need for accurate automated modeling of RNA structure increases. RNA structure modeling from chemical probing experiments has made tremendous progress, however accurately predicting large RNA structures is still challenging for several reasons: RNA are inherently flexible and often adopt many energetically similar structures, which are not reliably distinguished by the available, incomplete thermodynamic model. Moreover, computationally, the problem is aggravated by the relevance of pseudoknots and non-canonical base pairs, which are hardly predicted efficiently. To identify nucleotides involved in pseudoknots and non-canonical interactions, we scrutinized the SHAPE reactivity of each nucleotide of the 188 nt long lariat-capping ribozyme under multiple conditions. Reactivities analyzed in the light of the X-Ray structure were shown to report accurately the nucleotide status. Those that seemed paradoxical were rationalized by the nucleotide behavior along molecular dynamic simulations. We show that valuable information on intricate interactions can be deduced from probing with different reagents, and in the presence or absence of Mg 2+. Furthermore, probing at increasing temperature was remarkably efficient at pointing to non-canonical interactions and pseudoknot pairings. The possibilities of following such strategies to inform structure modeling software are discussed.
  - D. Wang, A. Jiang, J. Feng, G. Li, D. Guo, M. Sajid, K. Wu, Q. Zhang, Y. Ponty, S. Will, F. Liu, X. Yu, S. Li, Q. Liu, X.-L. Yang, M. Guo, X. Li, M. Chen, Z.-L. Shi, K. Lan, Y. Chen, and Y. Zhou
    
    The SARS-CoV-2 Subgenome Landscape and its Novel Regulatory Features
    Molecular Cell, Cell Press, To appear 2021 [doi] [hal]
    
    Abstract (click to read). COVID-19, caused by Coronavirus SARS-CoV-2, is now in global pandemic. Coronaviruses are known to generate negative subgenomes through Transcription-Regulating Sequence (TRS)-dependent template switch, but the global dynamic landscapes of coronaviral subgenomes and regulatory rules remain unclear. Here, using NGS short-read and Nanopore long-read sequencing to profile poly(A) RNAs in two cell types at multiple time points post-infection of SARS-CoV-2, we identified hundreds of template switches and constructed the dynamic landscapes of SARS-CoV-2 subgenomes. Interestingly, template switch could occur in bidirectional manner, with diverse SARS-CoV-2 subgenomes generated from successive template switching events. Majority of template switches result from RNA-RNA interactions, including seed and compensatory modes, with terminal pairing status as a key determinant. Moreover, two TRS-independent template switch modes are also responsible for subgenome biogenesis. Collectively, our findings reveal the subgenome landscape of SARS-CoV-2 and its regulatory features, providing a molecular basis for the organization and regulation of coronaviral subgenomes.
- Conferences with proceedings
  - B. Marchand, Y. Ponty, and L. Bulteau
    
    Tree Diet: Reducing the Treewidth to Unlock FPT Algorithms in RNA Bioinformatics
    Proceedings of WABI 2021 - 21st Workshop on Algorithms in Bioinformatics, France, 2021 [hal]
    
    Abstract (click to read). Hard graph problems are ubiquitous in Bioinformatics, inspiring the design of specialized Fixed-Parameter Tractable algorithms, many of which rely on a combination of tree-decomposition and dynamic programming. The time/space complexities of such approaches hinge critically on low values for the treewidth $tw$ of the input graph. In order to extend their scope of applicability, we introduce the TREE-DIET problem, i.e. the removal of a minimal set of edges such that a given tree-decomposition can be slimmed down to a prescribed treewidth $tw'$. Our rationale is that the time gained thanks to a smaller treewidth in a parameterized algorithm compensates the extra post-processing needed to take deleted edges into account. Our core result is an FPT dynamic programming algorithm for TREE-DIET, using $2^{O(tw)} n$ time and space. We complement this result with parameterized complexity lower-bounds for stronger variants (e.g., NP-hardness when tw or $tw−tw'$ is constant). We propose a prototype implementation for our approach which we apply on difficult instances of selected RNA-based problems: RNA design, sequence-structure alignment, and search of pseudoknotted RNAs in genomes, revealing very encouraging results. This work paves the way for a wider adoption of tree-decomposition-based algorithms in Bioinformatics.
  - L. Bulteau, B. Marchand, and Y. Ponty
    
    A new parametrization for independent set reconfiguration and applications to RNA kinetics
    Proceedings of IPEC 2021 - 16th International Symposium on Parameterized and Exact Computation, Portugal, 2021 [hal]
    
    Abstract (click to read). In this paper, we study the Independent Set (IS) reconfiguration problem in graphs. An IS reconfiguration is a scenario transforming an IS L into another IS R, inserting/removing vertices one step at a time while keeping the cardinalities of intermediate sets greater than a specified threshold. We focus on the bipartite variant where only start and end vertices are allowed in intermediate ISs. Our motivation is an application to the RNA energy barrier problem from bioinformatics, for which a natural parameter would be the difference between the initial IS size and the threshold. We first show the para-NP hardness of the problem with respect to this parameter. We then investigate a new parameter, the cardinality range, denoted by ρ which captures the maximum deviation of the reconfiguration scenario from optimal sets (formally, ρ is the maximum difference between the cardinalities of an intermediate IS and an optimal IS). We give two different routes to show that this problem is in XP for ρ: The first is a direct O(n 2)-space, O(n 2ρ+2.5)-time algorithm based on a separation lemma; The second builds on a parameterized equivalence with the directed pathwidth problem, leading to a O(n ρ+1)-space, O(n ρ+2)-time algorithm for the reconfiguration problem through an adaptation of a prior result by Tamaki [20]. This equivalence is an interesting result in its own right, connecting a reconfiguration problem (which is essentially a connectivity problem within a reconfiguration network) with a structural parameter for an auxiliary graph. We demonstrate the practicality of these algorithms, and the relevance of our introduced parameter, by considering the application of our algorithms on random small-degree instances for our problem. Moreover, we reformulate the computation of the energy barrier between two RNA secondary structures, a classic hard problem in computational biology, as an instance of bipartite reconfiguration. Our results on IS reconfiguration thus yield an XP algorithm in O(n ρ+2) for the energy barrier problem, improving upon a partial O(n 2ρ+2.5) algorithm for the problem.
  - S. Khalife, Y. Ponty, and L. Bulteau
    
    Sequence graphs realizations and ambiguity in language models
    Proceedings of COCOON 2021 - 27th International Computing and Combinatorics Conference, Taiwan, 13025:2021 [doi] [hal]
    
    Abstract (click to read). Several natural language models rely on an assumption modeling each word context as a bag of words. We study the combinatorial implications of such assumption for the corresponding word or sentences representations. In particular , we present theoretical results concerning the family of sequence graphs, for which realizations yield equivalent representations given this assumption. Several combinatorial problems are presented, depending on three levels of generalisation (window size, graph orientation, and weights), and whether some of these are NP-complete is left opened. Based on these results, we also establish different algorithms, including a dynamic programming formulation, to count and explicit the different realizations of a sequence graph. This allows us to show that the bag of words assumption can induce an important number of sentences to have the same representations, even for relatively short context window sizes.
  - H.-T. Yao, J. Waldispühl, Y. Ponty, and S. Will
    
    Taming Disruptive Base Pairs to Reconcile Positive and Negative Structural Design of RNA
    Proceedings of RECOMB 2021 - 25th international conference on research in computational molecular biology, France, 2021 [hal]
    
    Abstract (click to read). The negative structural design of RNAs, also called Inverse folding, consists in building a synthetic nucleotides sequence adopting a targeted secondary structure as its Minimum Free Energy (MFE) structure. Computationally an NP hard problem, it is mostly addressed as an optimization task and solved using (meta-)heuristics. Existing methods are frequently challenged by demanding instances, and typically produce a single design, hindering practical applications of design, where multiple candidates are desirable to circumvent the idealized nature of design models. In this work, we introduce RNA POsitive and Negative Design (RNAPOND), a sampling approach which generates design candidates exactly from a well-defined distribution influenced by positive design objectives, including affinity towards the target and GC-content. Negative design principles are captured by an original iterative approach, where a subset of Disruptive Base Pairs (DPBs) are identified at each step, and subsequently forbidden from pairing by the introduction of suitable constraints. Despite the NP-hardness of the associated decision problem, we propose a combinatorial sampling algorithm which is Fixed Parameter Tractable (FPT) for the tree-width of the constraint network. Our algorithm, coupled with a suitable rejection step and an automated inference of DPBs, achieves a similar or better level of success in comparison to the state of the art, while allowing for the generation of diverse designs. Interestingly, it also automatically recovers some of the strategies used by practitioners of RNA design. RNAPOND is an open source project, available at: https://gitlab.inria.fr/amibio/RNAPOND
- Book chapters & Editorials
  - Y. Ponty, S. Hammer, H.-T. Yao, and S. Will
    
    Advanced design of structural RNAs using RNARedPrint
    RNA Bioinformatics, Editor(s) Ernesto Picardi, Methods in Molecular Biology, 1--15 pp, 2021 [doi] [hal]
    
    Abstract (click to read). RNA design addresses the need to build novel RNAs, e.g. for biotechnological applications in synthetic biology, equipped with desired functional properties. This chapter describes how to use the software RNARedPrint for the de novo rational design of RNA sequences adopting one or several desired secondary structures. Depending on the application , these structures could represent alternate configurations or kinetic pathways. The software makes such design convenient and sufficiently fast for practical routine, where it even overcomes notorious problems in the application of RNA design, e.g. it maintains realistic GC content.
2020
- Peer-reviewed journal articles
  - A. Saaidi, D. Allouche, M. Regnier, B. Sargueil, and Y. Ponty
    
    IPANEMAP: Integrative Probing Analysis of Nucleic Acids Empowered by Multiple Accessibility Profiles
    Nucleic Acids Research, Oxford University Press, 48(15):8276--8289 pp, 2020 [doi] [hal] [pubmed]
    
    Abstract (click to read). The manual production of reliable RNA structure models from chemical probing experiments benefits from the integration of information derived from multiple protocols and reagents. However, the interpretation of multiple probing profiles remains a complex task, hindering the quality and reproducibility of modeling efforts. We introduce IPANEMAP, the first automated method for the modeling of RNA structure from multiple probing reactivity profiles. Input profiles can result from experiments based on diverse protocols, reagents, or collection of variants, and are jointly analyzed to predict the dominant conformations of an RNA. IPANEMAP combines sampling, clustering, and multi-optimization, to produce secondary structure models that are both stable and well-supported by experimental evidences. The analysis of multiple reactivity profiles, both publicly available and produced in our study, demonstrates the good performances of IPANEMAP, even in a mono probing setting. It confirms the potential of integrating multiple sources of probing data, informing the design of informative probing assays. Availability: IPANEMAP is freely downloadable at https://github.com/afafbioinfo/IPANEMAP Contact: yann.ponty@lix.polytechnique.fr Supplementary information available at NAR online.
  - M. Retwitzer, V. Reinharz, A. Churkin, Y. Ponty, J. Waldispühl, and D. Barash
    
    IncaRNAfbinv 2.0 - A webserver and software with motif control for fragment-based design of RNAs
    Bioinformatics, Oxford University Press (OUP), 36(9):2910--2922 pp, 2020 [doi] [hal] [pubmed]
    
    Abstract (click to read). RNA design has conceptually evolved from the inverse RNA folding problem. In the classical inverse RNA problem (Hofacker et al, 1994), the user inputs an RNA secondary structure and receives an output RNA sequence that folds into it. While modern RNA design methods are based on the same principle, a finer control over the resulting sequences is sought. As an important example, a substantial number of ncRNA families show high preservation in specific regions while being more flexible in others and this information should be utilized in the design. By using the additional information, RNA design tools can help solve problems of practical interest in the growing fields of synthetic biology and nanotechnology.IncaRNAfbinv2.0 utilizes a fragment-based approach, enabling a control of specific RNA secondary structure motifs. The new version allows significantly more control over the general RNA shape, and also allows to express specific restrictions over each motif separately, in addition to other advanced features.Availability:incaRNAfbinv2.0 is available through a standalone package and a web-server at https://www.cs.bgu.ac.il/incaRNAfbinv Source code, command-line and GUI wrappers can be found at https://github.com/matandro/RNAsfbinv
- Conferences with proceedings
  - A. Churkin, Y. Ponty, and D. Barash
    
    IndelsRNAmute: Predicting deleterious multiple point substitutions and indels mutations
    Proceedings of ISBRA 2020 - 16th International Symposium on Bioinformatics Research and Applications, Russia, 2020 [hal]
    
    Abstract (click to read). RNA deleterious point mutation prediction was previously addressed with programs such as RNAmute and MultiRNAmute. The purpose of these programs is to predict a global conformational rearrangement of the secondary structure of a functional RNA molecule, thereby disrupting its function. RNAmute was designed to deal with only single point mutations in a brute force manner, while in MultiRNAmute an efficient approach to deal with multiple point mutations was developed. The approach used in MultiRNAmute is based on the stabilization of the suboptimal RNA folding prediction solutions and/or destabilization of the optimal folding prediction solution of the wild type RNA molecule. The MultiRNAmute algorithm is significantly more efficient than the brute force approach in RNAmute, but in the case of long sequences and large m-point mutation sets the MultiRNAmute becomes exponential in examining all possible stabilizing and destabilizing mutations. Moreover, an inherent limitation in both programs is their ability to predict only substitution mutations, as these programs were not designed to work with deletion or insertion mutations. To address this limitation we herein develop a very fast algorithm, based on suboptimal folding solutions, to predict a predefined number of multiple point deleterious mutations as specified by the user. Depending on the user's choice, each such set of mutations may contain combinations of deletions, insertions and substitution mutations. Additionally, we prove the hardness of predicting the most deleterious set of point mutations in structural RNAs.
  - R. Sarrazin-Gendron, H.-T. Yao, V. Reinharz, C. Oliver, Y. Ponty, and J. Waldispühl
    
    Stochastic Sampling of Structural Contexts Improves the Scalability and Accuracy of RNA 3D Modules Identification
    Proceedings of RECOMB 2020 - 24th Annual International Conference on Research in Computational Molecular Biology, Italy, Proceedings of RECOMB - 24th Annual International Conference on Research in Computational Molecular Biology - 2020, 2020 [hal]
    
    Abstract (click to read). RNA structures possess multiple levels of structural organization. Secondary structures are made of canonical (i.e. Watson-Crick and Wobble) helices, connected by loops whose local conformations are critical determinants of global 3D architectures. Such local 3D structures consist of conserved sets of non-canonical base pairs, called RNA modules. Their prediction from sequence data is thus a milestone toward 3D structure modelling. Unfortunately, the computational efficiency and scope of the current 3D module identification methods are too limited yet to benefit from all the knowledge accumulated in modules databases. Here, we introduce BayesPairing 2, a new sequence search algorithm leveraging secondary structure tree decomposition which allows to reduce the computational complexity and improve predictions on new sequences. We benchmarked our methods on 75 modules and 6360 RNA sequences, and report accuracies that are comparable to the state of the art, with considerable running time improvements. When identifying 200 modules on a single sequence, BayesPairing 2 is over 100 times faster than its previous version, opening new doors for genome-wide applications.
- Book chapters & Editorials
  - Y. Ponty, and V. Reinharz
    
    Repliement de l'ARN
    Du texte aux graphes : méthodes et structures discrètes pour la bioinformatique, Editor(s) Annie Chateau & Mikaël Salson, 2020 [hal]
    
    Abstract (click to read).
- Miscellaneous & Submitted
  - Y. Ponty
    
    Ensemble Algorithms and Analytic Combinatorics in RNA Bioinformatics and Beyond Algorithmes ensemblistes et combinatoire analytique en Bioinformatique des ARN
    [hal]
    
    Abstract (click to read). La bioinformatique prédictive représente un domaine majeur d'application pour les techniques d'optimisation combinatoire. Très souvent, une perspective ensembliste, ne se limitant pas à une solution optimale mais considèrant l'espace des solutions (sous)optimales, s'y avère fructueuse. Dans cette habilitation à diriger des recherches, je présente une série de contributions combinatoire, à la fois de nature algorithmique et analytique, inspirées par des problèmes et questions soulevées dans l'étude des Acides RiboNucléiques (ARN), particulièrement centrées sur leur propriétés structurelles à l'équilibre thermodynamique.
2019
- Peer-reviewed journal articles
  - C. Chauve, Y. Ponty, and M. Wallner
    
    Counting and sampling gene family evolutionary histories in the duplication-loss and duplication-loss-transfer models
    Journal of Mathematical Biology, Springer, 80(5):1353--1388 pp, 2019 [doi] [hal] [pubmed]
    
    Abstract (click to read). Given a set of species whose evolution is represented by a species tree, a gene family is a group of genes having evolved from a single ancestral gene. A gene family evolves along the branches of a species tree through various mechanisms, including-but not limited to-speciation (S), gene duplication (D), gene loss (L), horizontal gene transfer (T). The reconstruction of a gene tree representing the evolution of a gene family constrained by a species tree is an important problem in phylogenomics. However, unlike in the multispecies coalescent evolutionary model that considers only speciation and incomplete lineage sorting events, very little is known about the search space for gene family histories accounting for gene duplication, gene loss and horizontal gene transfer (the DLT-model). In this work, we introduce the notion of evolutionary histories defined as a binary ordered rooted tree describing the evolution of a gene family, constrained by a species tree in the DLT-model. We provide formal grammars describing the set of all evolutionary histories that are compatible with a given species tree, whether it is ranked or unranked. These grammars allow us, using either analytic combinatorics or dynamic programming, to efficiently compute the number of histories of a given size, and also to generate random histories of a given size under the uniform distribution. We apply these tools to obtain exact asymptotics for the number of gene family histories for two species trees, the rooted caterpillar and the complete binary tree, as well as estimates of the range of the exponential growth factor of the number of histories for random species trees of size up to 25. Our results show that including horizontal gene transfer induce a dramatic increase of the number of evolutionary histories. We also show that, within ranked species trees, the number of evolutionary histories in the DLT-model is almost independent of the species tree topology. These results establish firm foundations for the development of ensemble methods for the prediction of reconciliations.
  - D. Surujon, Y. Ponty, and P. Clote
    
    Small-world networks and RNA secondary structures
    Journal of Computational Biology, Mary Ann Liebert, 26(1):16--26 pp, 2019 [doi] [hal] [pubmed]
    
    Abstract (click to read). Let $S_n$ denote the network of all RNA secondary structures of length $n$, in which undirected edges exist between structures $s$, $t$ such that $t$ is obtained from $s$ by the addition, removal or shift of a single base pair. Using context-free grammars, generating functions and complex analysis, we show that the asymptotic average degree is $O(n)$ and that the asymptotic clustering coefficient is $O(1/n)$, from which it follows that the family $S_n$, $n = 1, 2, 3,\ldots$ of secondary structure networks is not small-world.
  - S. Hammer, W. Wang, S. Will, and Y. Ponty
    
    Fixed-parameter tractable sampling for RNA design with multiple target structures
    BMC Bioinformatics, BioMed Central, 20(1):209 pp, 2019 [doi] [hal] [pubmed]
    
    Abstract (click to read). BACKGROUND: The design of multi-stable RNA molecules has important applications in biology, medicine, and biotechnology. Synthetic design approaches profit strongly from effective in-silico methods, which substantially reduce the need for costly wet-lab experiments.RESULTS: We devise a novel approach to a central ingredient of most in-silico design methods: the generation of sequences that fold well into multiple target structures. Based on constraint networks, our approach supports generic Boltzmann-weighted sampling, which enables the positive design of RNA sequences with specific free energies (for each of multiple, possibly pseudoknotted, target structures) and GC-content. Moreover, we study general properties of our approach empirically and generate biologically relevant multi-target Boltzmann-weighted designs for an established design benchmark. Our results demonstrate the efficacy and feasibility of the method in practice as well as the benefits of Boltzmann sampling over the previously best multi-target sampling strategy-even for the case of negative design of multi-stable RNAs. Besides empirically studies, we finally justify the algorithmic details due to a fundamental theoretic result about multi-stable RNA design, namely the #P-hardness of the counting of designs.CONCLUSION: RNARedPrint introduces a novel, flexible, and effective approach to multi-target RNA design, which promises broad applicability and extensibility. Our free software is available at: https://github.com/yannponty/RNARedPrint Supplementary data are available online.
- Conferences with proceedings
  - A. Poulenard, M.-J. Rakotosaona, Y. Ponty, and M. Ovsjanikov
    
    Effective Rotation-invariant Point CNN with Spherical Harmonics kernels
    Proceedings of 3DV - International Conference on 3D Vision - 2019, Canada, 2019 [hal]
    
    Abstract (click to read). We present a novel rotation invariant architecture operating directly on point cloud data. We demonstrate how rotation invariance can be injected into a recently proposed point-based PCNN architecture, on all layers of the network. This leads to invariance to both global shape transformations, and to local rotations on the level of patches or parts, useful when dealing with non-rigid objects. We achieve this by employing a spherical harmonics-based kernel at different layers of the network, which is guaranteed to be invariant to rigid motions. We also introduce a more efficient pooling operation for PCNN using space-partitioning data-structures. This results in a flexible, simple and efficient architecture that achieves accurate results on challenging shape analysis tasks, including classification and segmentation, without requiring data-augmentation typically employed by non-invariant approaches.
  - H.-T. Yao, C. Chauve, M. Regnier, and Y. Ponty
    
    Exponentially few RNA structures are designable
    Proceedings of ACM-BCB 2019 - 10th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, United States, ACM Press, 289-298 pp, 2019 [doi] [hal]
    
    Abstract (click to read). The problem of RNA design attempts to construct RNA sequences that performs a predefined biological function, identified by several additional constraints. One of the foremost objective of RNA design is that the designed RNA sequence should adopt a predefined target secondary structure preferentially to any alternative structure, according to a given metrics and folding model. It was observed in several works that some secondary structures are undesignable, i.e. no RNA sequence can fold into the target structure while satisfying some criterion measuring how preferential this folding is compared to alternative conformations. In this paper, we show that the proportion of designable secondary structures decreases exponentially with the size of the target secondary structure, for various popular combinations of energy models and design objectives. This exponential decay is, at least in part, due to the existence of undesignable motifs, which can be generically constructed, and jointly analyzed to yield asymptotic upper-bounds on the number of designable structures.
- Book chapters & Editorials
  - D. Allouche, G. de Bisschop, A. Saaidi, Y. Ponty, and S. Bruno
    
    RNA secondary structure modelling following the IPANEMAP workflow
    RNA Folding - Methods and Protocols, Methods in Molecular Biology, Springer Nature, 2019 [hal]
    
    Abstract (click to read). RNA secondary structure modelling has been a challenge since the early days of molecular biology. Although algorithms for RNA structure modelling are more and more efficient and accurate, they significantly benefit from the integration of experimental structure probing data. RNA structure probing consists in submitting an RNA to enzymes or small molecules that specifically react with individual nucleotides according to their pairing status. Most enzymes used are single strand specific RNAses (RNAses T1, U2, nuclease S1 …) with the notable exception of the double strand specific RNAse V1. Although they are low molecular weight proteins, they are too bulky to access some nucleotides of a folded RNA. Small molecules can essentially reach any nucleotide and most of them are also single-strand specific although psoralen has recently been successfully used a double strand probe (Lu et al., 2016). For the longest time, RNA probing experiments remained tedious and rather qualitative than quantitative. RNA structure probing recently reached the medium, and then high, throughput. Pioneered and mostly developed within the Weeks lab, the SHAPE technology uses small molecules that react with flexible ribose, thus essentially reporting single-stranded nucleotides with some subtleties (Frezza et al., 2019; Steen et al., 2012). A medium throughput version of the SHAPE protocol was first developed based on capillary electrophoresis, later to be transformed into a high throughput method using next generation sequencing. The same workflows can be applied to more traditional probes such as DiMethyl Sulfate (DMS) and N-Cyclohexyl-N′-(2-morpholinoethyl)carbodiimide metho-p-toluenesulfonate (CMCT) that reveal unpaired A,C and G,U respectively. It appeared that different probes provide complementary information that further improves RNA structure prediction. We therefore developed IPANEMAP, an experimental and computational workflow that models RNA secondary structure from different sets of RNA structure probing performed with different probes, and/or in different conditions and/or on mutants (Saaidi et al. Submitted). This workflow relies on medium or high throughput structure probing, and combines statistical sampling, clustering (Ding and Lawrence, 2003) and pseudo-potentials (Deigan et al, 2009). The method was shown to produce more accurate and stable predictions than other workflows developed to date, even when a single reactivity profile is available, while the availability of multiple reactivities was shown to increase robustness and, to a lesser extent, accuracy of the modeling (Saaidi et al. Submitted). Below, we detail a whole IPANEMAP workflow, starting with experimental probing with DMS and/or CMCT and/or SHAPE reagent. Such probing can be carried out in various relevant conditions (varying température, Mg2+ concentration, introducing point mutations in the RNA to be modeled etc) (Saaidi et al. Submitted). Two versions of the experimental procedure (medium throughput and high throughput) are proposed, DMS and CMCT probing were adapted from Ehresmann et al. and Brunel et al. while the SHAPE probing is described in K. Weeks team publications (Karabiber et al., 2013; Low and Weeks, 2010a; Mortimer and Weeks, 2007; Smola et al., 2015a; Wilkinson et al., 2006, 2008). We then detail instructions for executing the IPANEMAP algorithm to obtain the RNA secondary structure model.
- Conferences without proceedings and/or unrefereed
  - M. Mallem, A. Denise, and Y. Ponty
    
    Forbidden substrings and the connectivity of the Hamming graph of RNA sequences: Partial disconnectivity tests
    Proceedings of SEQBIM 2019 - Séquences en Bioinformatique, Informatique et Mathématiques, France, 2019 [hal]
    
    Abstract (click to read). RNA structure design methods have grown in complexity to cover an increasing scope of application. Recent approaches combine an initial random generation with a local optimization step, and consider both a user-specified secondary structure and sets of mandatory and forbidden substrings. Although these additional constraints lead to better design results, they may interfere with the local optimization phase. Indeed, forbidden substrings may disrupt the connectivity of their underlying search space, a key property for the success of the local search. A naive connectivity test would explore the whole graph of candidate sequences, leading to an exponential time connectivity test. In this work, we propose two partial algorithms based on compact graph structures-the De Bruijn graphs and the Aho-Corasick automaton-allowing the detection of disconnectivity in time independent from the length of RNA sequence. Tested on random instances, our tests were able to detect the disconnectivity with sensitivity ranging between 35% and 55%, motivating further research.
- Miscellaneous & Submitted
  - A. Saaidi, Y. Ponty, and M. Regnier
    
    Clustering strings with mutations using an expectation-maximization algorithm In the context of RNA structure prediction
    [hal]
    
    Abstract (click to read). In comparative analysis, an RNA structure (a set of base pairs and unpaired nucleotides) is predicted from a set of RNA variants (similar sequences) under the assumption of the conservation of the structure during evolution. The combination of RNA variants with Experimental data informing about the local (nucleotide) structure may lead to more accurate structure prediction. The experimental protocol consists of mutating nucleotides likely to be 'unpaired'. A simultaneous reading of RNA variants sequences that underwent the experimental mutation protocol lead to the following issue: How to cluster 'mutated' substrings of similar parent strings such that each substring is correctly assigned to its parent string? We developed an Expectation Maximization algorithm that uses Mutational profiles (mutation distributions) to assign the substrings to their strings of origin.
  - C. Rovetta, J. Michálik, R. Lorenz, A. Tanzer, and Y. Ponty
    
    Non-Redundant Sampling and Statistical Estimators for RNA Structural Properties at the Thermodynamic Equilibrium
    [hal]
    
    Abstract (click to read). The computation of statistical properties of RNA structure at the thermodynamic equilibrium, or Boltzmann ensemble of low free-energy, represents an essential step to understand and harness the selective pressure weighing on RNA evolution. However, classic methods for sampling representative conformations are frequently crippled by large levels of redundancy, which are uninformative and detrimental to downstream analyses. In this work, we adapt and implement, within the Vienna RNA package, an efficient non-redundant backtracking procedure to produce collections of unique secondary structures generated within a well-defined distribution. This procedure is coupled with a novel statistical estimator, which we prove is unbiased, consistent and has lower variance (better convergence) than the classic estimator. We demonstrate the efficiency of our coupled non-redundant sampler/estimator by revisiting several applications of sampling in RNA bioinformatics, and demonstrate its practical superiority over previous estimators. We conclude by discussing the choice of the number of samples required to produce reliable estimates.
2018
- Peer-reviewed journal articles
  - C. Chauve, J. Courtiel, and Y. Ponty
    
    Counting, generating, analyzing and sampling tree alignments
    International Journal of Foundations of Computer Science, World Scientific Publishing, 29(5):741--767 pp, 2018 [hal]
    
    Abstract (click to read). Pairwise ordered tree alignment are combinatorial objects that appear in important applications , such as RNA secondary structure comparison. However, the usual representation of tree alignments as supertrees is ambiguous, i.e. two distinct supertrees may induce identical sets of matches between identical pairs of trees. This ambiguity is uninformative, and detrimental to any probabilistic analysis. In this work, we consider tree alignments up to equivalence. Our first result is a precise asymptotic enumeration of tree alignments, obtained from a context-free grammar by mean of basic analytic combinatorics. Our second result focuses on alignments between two given ordered trees S and T. By refining our grammar to align specific trees, we obtain a decomposition scheme for the space of alignments, and use it to design an efficient dynamic programming algorithm for sampling alignments under the Gibbs-Boltzmann probability distribution. This generalizes existing tree alignment algorithms, and opens the door for a probabilistic analysis of the space of suboptimal alignments.
  - A. Churkin, M. Retwitzer, V. Reinharz, Y. Ponty, J. Waldispühl, and D. Barash
    
    Design of RNAs: comparing programs for inverse RNA folding.
    Briefings in Bioinformatics, Oxford University Press (OUP), 19(2):350--358 pp, 2018 [doi] [hal] [pubmed]
    
    Abstract (click to read). Computational programs for predicting RNA sequences with desired folding properties have been extensively developed and expanded in the past several years. Given a secondary structure, these programs aim to predict sequences that fold into a target minimum free energy secondary structure, while considering various constraints. This procedure is called inverse RNA folding. Inverse RNA folding has been traditionally used to design optimized RNAs with favorable properties, an application that is expected to grow considerably in the future in light of advances in the expanding new fields of synthetic biology and RNA nanostructures. Moreover, it was recently demonstrated that inverse RNA folding can successfully be used as a valuable preprocessing step in computational detection of novel noncoding RNAs. This review describes the most popular freeware programs that have been developed for such purposes, starting from RNAinverse that was devised when formulating the inverse RNA folding problem. The most recently published ones that consider RNA secondary structure as input are antaRNA, RNAiFold and incaRNAfbinv, each having different features that could be beneficial to specific biological problems in practice. The various programs also use distinct approaches, ranging from ant colony optimization to constraint programming, in addition to adaptive walk, simulated annealing and Boltzmann sampling. This review compares between the various programs and provides a simple description of the various possibilities that would benefit practitioners in selecting the most suitable program. It is geared for specific tasks requiring RNA design based on input secondary structure, with an outlook toward the future of RNA design programs.
  - N. Abdollahi, A. Albani, E. Anthony, A. Baud, M. Cardon, R. Clerc, D. Czernecki, R. Conte, L. David, A. Delaune, S. Djerroud, P. Fourgoux, N. Guiglielmoni, J. Laurentie, N. Lehmann, C. Lochard, R. Montagne, V. Myrodia, V. Opuu, E. Parey, S. Privé, C. Quignot, M. Ruiz-Cuevas, M. Sissoko, N. Sompairac, A. Vallerix, V. Verrecchia, M. Delarue, R. Guérois, Y. Ponty, S. Sacquin-Mora, A. Carbone, C. Froidevaux, S. Le Crom, O. Lespinet, M. Weigt, S. Abboud, J. Bernardes, G. Bouvier, C. Dequeker, A. Ferré, P. Fuchs, G. Lelandais, P. Poulain, H. Richard, H. Schweke, E. Laine, and A. Lopes
    
    Meet-U: Educating through research immersion
    PLoS Computational Biology, Public Library of Science, 14(3):e1005992 pp, 2018 [doi] [hal]
    
    Abstract (click to read).
  - N. Abdollahi, A. Albani, E. Anthony, A. Baud, M. Cardon, R. Clerc, D. Czernecki, R. Conte, L. David, A. Delaune, S. Djerroud, P. Fourgoux, N. Guiglielmoni, J. Laurentie, N. Lehmann, C. Lochard, R. Montagne, V. Myrodia, V. Opuu, E. Parey, L. Polit, S. Privé, C. Quignot, M. Ruiz-Cuevas, M. Sissoko, N. Sompairac, A. Vallerix, V. Verrecchia, M. Delarue, R. Guérois, Y. Ponty, S. Sacquin-Mora, A. Carbone, C. Froidevaux, S. Le Crom, O. Lespinet, M. Weigt, S. Abboud, J. Bernardes, G. Bouvier, C. Dequeker, A. Ferré, P. Fuchs, G. Lelandais, P. Poulain, H. Richard, H. Schweke, E. Laine, and A. Lopes
    
    Meet-U: educating through research immersion
    PLoS Computational Biology, Public Library of Science, 14(3):1-10 pp, 2018 [doi] [hal] [pubmed]
    
    Abstract (click to read). We present a new educational initiative, called Meet-U, that aims at training students to collaborativework in computational biology and at bridging the gap between education and research.Meet-U mimics the set up of collaborative research projects and takes advantage of the most populartools for collaborative work and of cloud computing. Students are grouped in teams of 4-5 peopleand have to realize a project from A to Z that answers a challenging question in Biology.Meet-U promotes "coopetition", as the students collaborate within and across the teams and are also incompetition with each other to develop the best final product.Meet-U fosters interactions between different actors of education and research through the organization of a meeting day, open to everyone, where the students present their work to a jury of researchers and jury members give researchseminars. This very unique combination of education and research is strongly motivating for thestudents and provides a formidable opportunity for a scientific community to unite and increase itsvisibility. We report on our experience with Meet-U in two French universities with master studentsin bioinformatics and modeling, with protein-protein docking as the subject of the course.Meet-U is easy to implement, virtually costs nothing and can be straightforwardly transferred to other fieldsand/or universities. All the information and data are available at http://www.meet-u.org.
- Conferences with proceedings
  - S. Hammer, Y. Ponty, W. Wang, and S. Will
    
    Fixed-Parameter Tractable Sampling for RNA Design with Multiple Target Structures
    Proceedings of RECOMB - 22nd Annual International Conference on Research in Computational Molecular Biology - 2018, France, 2018 [hal]
    
    Abstract (click to read). The design of multi-stable RNA molecules has important applications in biology, medicine, and biotechnology. Synthetic design approaches profit strongly from effective in-silico methods, which can tremendously impact their cost and feasibility. We revisit a central ingredient of most in-silico design methods: the sampling of sequences for the design of multi-target structures, possibly including pseudoknots. For this task, we present the efficient, tree decomposition-based algorithm \ourprog{}. Our fixed parameter tractable approach is underpinned by establishing the $\#${\sf P}-hardness of uniform sampling. Modeling the problem as a constraint network, \ourprog{} supports generic Boltzmann-weighted sampling for arbitrary additive RNA energy models; this enables the generation of RNA sequences meeting specific goals like expected free energies or \GCb-content. Finally, we empirically study general properties of the approach and generate biologically relevant multi-target Boltzmann-weighted designs for a common design benchmark. Generating seed sequences with \ourprog{}, we demonstrate significant improvements over the previously best multi-target sampling strategy (uniform sampling).Our software is freely available at: https://github.com/yannponty/RNARedPrint .
- Conferences without proceedings and/or unrefereed
  - C. Singhal, Y. Ponty, and I. Chauvot de Beauchêne
    
    A hybrid combinatorial method for docking single stranded RNAs in a protein pockect at the thermodynamic equilibrium
    Proceedings of JOBIM (Journées Ouvertes en Biologie, Informatique et Mathématiques), France, 2018 [hal]
    
    Abstract (click to read).
  - Y. Ponty, S. Will, and S. Hammer
    
    Nucleic Acids design targeting integer-valued features: FPT counting and uniform sampling
    Proceedings of WEPA 2018 - 2nd International Workshop on Enumeration Problems and Applications, Italy, 2018 [hal]
    
    Abstract (click to read). In this extended abstracts, we describe an approach to account for integer-valued constraints in RNA design. In particular, we describe various FPT/XPT approaches, heavily relying on dynamic programming, for counting and sampling sequences fulfilling a set of declarative constraints. We close this short note with several open questions, drawing connections with graph theory and enumerative combinatorics.
- Miscellaneous & Submitted
  - C. Singhal, Y. Ponty, and I. Chauvot de Beauchêne
    
    A hybrid combinatorial method for docking single stranded RNA on proteins at the thermodynamic equilibrium
    [hal]
    
    Abstract (click to read).
2017
- Peer-reviewed journal articles
  - B. Löwes, C. Chauve, Y. Ponty, and R. Giegerich
    
    The BRaliBase dent - a tale of benchmark design and interpretation.
    Briefings in Bioinformatics, Oxford University Press (OUP), 18(2):306--311 pp, 2017 [doi] [hal] [pubmed]
    
    Abstract (click to read). BRaliBase is a widely used benchmark for assessing the accuracy of RNA secondary structure alignment methods. In most case studies based on the BRaliBase benchmark, one can observe a puzzling drop in accuracy in the 40-60% sequence identity range, the so-called 'BRaliBase Dent'. In this article, we show this dent is owing to a bias in the composition of the BRaliBase benchmark, namely the inclusion of a disproportionate number of transfer RNAs, which exhibit a conserved secondary structure. Our analysis, aside of its interest regarding the specific case of the BRaliBase benchmark, also raises important questions regarding the design and use of benchmarks in computational biology.
  - J. Haleš, A. Héliou, J. Maňuch, Y. Ponty, and L. Stacho
    
    Combinatorial RNA Design: Designability and Structure-Approximating Algorithm in Watson-Crick and Nussinov-Jacobson Energy Models
    Algorithmica, Springer Verlag, 79(3):835--856 pp, 2017 [doi] [hal]
    
    Abstract (click to read). We consider the Combinatorial RNA Design problem, a minimal instance of RNA design where one must produce an RNA sequence that adopts a given secondary structure as its minimal free-energy structure. We consider two free-energy models where the contributions of base pairs are additive and independent: the purely combinatorial Watson-Crick model, which only allows equally-contributing A − U and C − G base pairs, and the real-valued Nussinov-Jacobson model, which associates arbitrary energies to A − U, C − G and G − U base pairs. We first provide a complete characterization of designable structures using restricted alphabets and, in the four-letter alphabet, provide a complete characterization for designable structures without unpaired bases. When unpaired bases are allowed, we characterize extensive classes of (non-)designable structures, and prove the closure of the set of designable structures under the stutter operation. Membership of a given structure to any of the classes can be tested in Θ(n) time, including the generation of a solution sequence for positive instances. Finally, we consider a structure-approximating relaxation of the design, and provide a Θ(n) algorithm which, given a structure S that avoids two trivially non-designable motifs, transforms S into a designable structure constructively by adding at most one base-pair to each of its stems.
  - J. Deforges, S. de Breyne, M. Ameur, N. Ulryck, N. Chamond, A. Saaidi, Y. Ponty, T. Ohlmann, and B. Sargueil
    
    Two ribosome recruitment sites direct multiple translation events within HIV1 Gag open reading frame
    Nucleic Acids Research, Oxford University Press, 45(12):7382--7400 pp, 2017 [doi] [hal] [pubmed]
    
    Abstract (click to read). In the late phase of the HIV virus cycle, the full length unspliced genomic RNA is exported to the cytoplasm and serves as mRNA to translate the Gag and Gag-pol polyproteins. Three different translation initiation mechanisms responsible for Gag production have been described. However a rationale for the involvement of as many translation pathways in gRNA translation is yet to be defined. The Gag-IRES has the singularity to be located within the Gag open reading frame and to directly recruit the 40S ribosomal subunit. To further characterize this interaction, we first probed the Gag-IRES RNA structure. We then developed an innovative integrative modelling approach and propose a novel secondary structure model for the Gag-IRES. The minimal 40S ribosomal subunit binding site was further mapped using different assays. To our surprise, we found that at least two regions within Gag-IRES can independently recruit the ribosome. Next, we validated that these two regions influence Gag translation both in vitro and in cellulo. These binding sites are mostly unstructured and highly A-rich, such sequences have previously been shown to be sufficient to recruit the ribosome and to support an IRES function. A combination of biochemical and functional data give insight into the Gag-IRES molecular mechanism and provide compelling evidences for its importance. Hypothesis about its physiological role reflecting its conservation amongst primate lentiviruses are proposed.
  - W. Duchemin, Y. Anselmetti, M. Patterson, Y. Ponty, S. Bérard, C. Chauve, C. Scornavacca, V. Daubin, and E. Tannier
    
    DeCoSTAR: Reconstructing the ancestral organization of genes or genomes using reconciled phylogenies
    Genome Biology and Evolution, Society for Molecular Biology and Evolution, 9(5):1312-1319 pp, 2017 [doi] [hal]
    
    Abstract (click to read). DeCoSTAR is a software that aims at reconstructing the organization of ancestral genes or genomes in the form of sets of neighborhood relations (adjacencies) between pairs of ancestral genes or gene domains. It can also improve the assembly of fragmented genomes by proposing evolutionary-induced adjacencies between scaffolding fragments. Ancestral genes or domains are deduced from reconciled phylogenetic trees under an evolutionary model that considers gains, losses, speciations, duplications, and transfers as possible events for gene evolution. Reconciliations are either given as input or computed with the ecceTERA package, into which DeCoSTAR is integrated. DeCoSTAR computes adjacency evolutionary scenarios using a scoring scheme based on a weighted sum of adjacency gains and breakages. Solutions, both optimal and near-optimal, are sampled according to the Boltzmann-Gibbs distribution centered around parsimonious solutions, and statistical supports on ancestral and extant adjacencies are provided. DeCoSTAR supports the features of previously-contributed tools that reconstruct ancestral adjacencies, namely DeCo, DeCoLT, ART-DeCo and DeClone. In a few minutes, DeCoSTAR can reconstruct the evolutionary history of domains inside genes, of gene fusion and fission events, or of gene order along chromosomes, for large data sets including dozens of whole genomes from all kingdoms of life. We illustrate the potential of DeCoSTAR with several applications: ancestral reconstruction of gene orders for Anopheles mosquito genomes, multidomain proteins in Drosophila, and gene fusion and fission detection in Actinobacteria. Availability: http://pbil.univ-lyon1.fr/software/DeCoSTAR
- Conferences with proceedings
  - J. Michálik, H. Touzet, and Y. Ponty
    
    Efficient approximations of RNA kinetics landscape using non-redundant sampling
    Proceedings of ISMB/ECCB - 25th Annual international conference on Intelligent Systems for Molecular Biology/16th European Conference on Computational Biology - 2017, Czech Republic, 33(14):i283 - i292 pp, 2017 [doi] [hal]
    
    Abstract (click to read). Motivation: Kinetics is key to understand many phenomena involving RNAs, such as co-transcriptional folding and riboswitches. Exact out-of-equilibrium studies induce extreme computational demands, leading state-of-the-art methods to rely on approximated kinetics landscapes, obtained using sampling strategies that strive to generate the key landmarks of the landscape topology. However, such methods are impeded by a large level of redundancy within sampled sets. Such a redundancy is uninformative, and obfuscates important intermediate states, leading to an incomplete vision of RNA dynamics. Results: We introduce RNANR, a new set of algorithms for the exploration of RNA kinetics landscapes at the secondary structure level. RNANR considers locally optimal structures, a reduced set of RNA con-formations, in order to focus its sampling on basins in the kinetic landscape. Along with an exhaustive enumeration, RNANR implements a novel non-redundant stochastic sampling, and offers a rich array of structural parameters. Our tests on both real and random RNAs reveal that RNANR allows to generate more unique structures in a given time than its competitors, and allows a deeper exploration of kinetics landscapes. Availability: RNANR is freely available at https://project.inria.fr/rnalands/rnanr
  - A. Saaidi, Y. Ponty, M. Blanchette, M. Regnier, and B. Sargueil
    
    An EM algorithm for mapping short reads in multiple RNA structure probing experiments
    Proceedings of Matbio2017, United Kingdom, 2017 [hal]
    
    Abstract (click to read). An accurate mapping of reads against the sequence of reference is the first step to grant a good NGS data analysis.However, when mapping is about assigning reads to a set of RNA variants, in the case of simultaneous sequencing,the task become hard to handle. Many algorithms have been developed to overcome the issue of mapping readsagainst a set of homologous sequences at one time but the problem is not fully resolved, particularly when dealingwith short reads. The issue addressed in our study is much more challenging; In addition to the parallel assignmentissue in the presence of short reads, RNA variants molecules, used for the library sequencing preparation step,undergo a specific experimental treatment SHAPE causing the formation of mutations at the level of structurallyunpaired nucleotides. Mutations due to SHAPE might lead to a miss-mapping i.e. a read could be derived from agiven RNA variant i and because of SHAPE mutations it becomes more appropriate to assign it to the variant jfrom which the read has the shortest base distance. In an ongoing work, we are trying to resolve the unprecedentedmapping question trough an Expectation Maximization (EM) algorithm where each RNA variant from the setof references would be characterized by a SHAPE mutational profile instead of being merely characterized by asequence of nucleotides. The EM algorithm aims to maximize the likelihood of a read to be derived from a specificRNA variant and to assess its contribution to build the RNA associated mutational profile.
- Conferences without proceedings and/or unrefereed
  - A. Saaidi, Y. Ponty, and B. Sargueil
    
    An integrative approach for predicting the RNA secondary structure for the HIV–1 Gag UTR using probing data
    Proceedings of JOBIM - Journées Ouvertes en Biologie, Informatique et Mathématiques - 2017, France, 2017 [hal]
    
    Abstract (click to read). Structure modeling is key to understand the mechanisms of RNA retroviruses such as HIV. Many in silico prediction approaches suggesting structural models of moderate to good accuracies are available. However, the prediction methods could be further improved by taking advantage of both next generation sequencing technologies and different experimental techniques such as enzymatic and SHAPE probing data [1]. In a published article [2], we introduce and use a structural modeling method based on the integration of many experimental probing data to direct predictions with the aim to find the most accurate structure lying in the intersection of disjoint sources of experiments.
2016
- Peer-reviewed journal articles
  - V. Reinharz, Y. Ponty, and J. Waldispühl
    
    Combining structure probing data on RNA mutants with evolutionary information reveals RNA-binding interfaces
    Nucleic Acids Research, Oxford University Press, 44(11):e104 - e104 pp, 2016 [doi] [hal] [pubmed]
    
    Abstract (click to read). Systematic structure probing experiments (e.g. SHAPE) of RNA mutants such as the mutate-and-map protocol give us a direct access into the genetic robustness of ncRNA structures. Comparative studies of homologous sequences provide a distinct, yet complementary, approach to analyze structural and functional properties of non-coding RNAs. In this paper, we introduce a formal framework to combine the biochemical signal collected from mutate-and-map experiments, with the evolutionary information available in multiple sequence alignments. We apply neutral theory principles to detect complex long-range dependencies between nucleotides of a single stranded RNA, and implement these ideas into a software called aRNhAck. We illustrate the biological significance of this signal and show that the nucleotides networks calculated with aRNhAck are correlated with nucleotides located in RNA-RNA, RNA-protein, RNA-DNA and RNA-ligand interfaces. aRNhAck is freely available at http://csb.cs.mcgill.ca/arnhack.
  - E. Jacox, C. Chauve, G. Szöllősi, Y. Ponty, and C. Scornavacca
    
    ecceTERA: Comprehensive gene tree-species tree reconciliation using parsimony
    Bioinformatics, Oxford University Press (OUP), 32(13):2056-2058 pp, 2016 [doi] [hal] [pubmed]
    
    Abstract (click to read). A gene tree-species tree reconciliation explains the evolution of a gene tree within the species tree given a model of gene-family evolution. We describe ecceTERA, a program that implements a generic parsimony reconciliation algorithm, which accounts for gene duplication, loss, and transfer (DTL) as well as speciation, involving sampled and unsampled lineages, within undated, fully dated or partially dated species trees. The ecceTERA reconciliation model and algorithm generalize or improve upon most published DTL parsimony algorithms for binary species trees and binary gene trees. Moreover, ecceTERA can estimate accurate species-tree aware gene trees using amalgamation.
  - M. Drory retwitzer, V. Reinharz, Y. Ponty, J. Waldispühl, and D. Barash
    
    incaRNAfbinv : a web server for the fragment-based design of RNA sequences
    Nucleic Acids Research, Oxford University Press, 44(W1):W308 - W314 pp, 2016 [doi] [hal]
    
    Abstract (click to read). In recent years, new methods for computational RNA design have been developed and applied to various problems in synthetic biology and nanotechnology. Lately, there is considerable interest in incorporating essential biological information when solving the inverse RNA folding problem. Correspondingly, RNAfbinv aims at including biologically meaningful constraints and is the only program to-date that performs a fragment-based design of RNA sequences. In doing so it allows the design of sequences that do not necessarily exactly fold into the target, as long as the overall coarse-grained tree graph shape is preserved. Augmented by the weighted sampling algorithm of incaRNAtion, our web server called incaRNAfbinv implements the method devised in RNAfbinv and offers an interactive environment for the inverse folding of RNA using a fragment-based design approach. It takes as input: a target RNA secondary structure; optional sequence and motif constraints; optional target minimum free energy, neutrality, and GC content. In addition to the design of synthetic regulatory sequences, it can be used as a pre-processing step for the detection of novel natural occurring RNAs. The two complementary methodologies RNAfbinv and incaRNAtion are merged together and fully implemented in our web server incaRNAfbinv, available at http://www.cs.bgu. ac.il/incaRNAfbinv.
- Conferences with proceedings
  - C. Chauve, J. Courtiel, and Y. Ponty
    
    Counting, generating and sampling tree alignments
    Proceedings of ALCOB - 3rd International Conference on Algorithms for Computational Biology - 2016, Spain, Springer, 9702:53--64 pp, 2016 [doi] [hal]
    
    Abstract (click to read). Pairwise ordered tree alignment are combinatorial objects that appear in RNA secondary structure comparison. However, the usual representation of tree alignments as supertrees is ambiguous, i.e. two distinct supertrees may induce identical sets of matches between identical pairs of trees. This ambiguity is uninformative, and detrimental to any probabilistic analysis.In this work, we consider tree alignments up to equivalence. Our first result is a precise asymptotic enumeration of tree alignments, obtained from a context-free grammar by mean of basic analytic combinatorics. Our second result focuses on alignments between two given ordered trees $S$ and $T$. By refining our grammar to align specific trees, we obtain a decomposition scheme for the space of alignments, and use it to design an efficient dynamic programming algorithm for sampling alignments under the Gibbs-Boltzmann probability distribution. This generalizes existing tree alignment algorithms, and opens the door for a probabilistic analysis of the space of suboptimal RNA secondary structures alignments.
  - J. Lumbroso, M. Mishna, and Y. Ponty
    
    Taming reluctant random walks in the positive quadrant
    Proceedings of GASCOM - 10th conference on random generation of combinatorial structures - 2016, France, Electronic Notes in Discrete Mathematics, 59(Supp. C):99 -- 114 pp, 2016 [doi] [hal]
    
    Abstract (click to read). A lattice walk model is said to be reluctant if the defining step set has a strong drift towards the boundaries. We describe efficient random generation strategies for these walks.
- Conferences without proceedings and/or unrefereed
  - A. Saaidi, D. Allouche, B. Sargueil, and Y. Ponty
    
    Towards structural models for the Ebola UTR regions using experimental SHAPE probing data
    Proceedings of JOBIM - Journées Ouvertes en Biologie, Informatique et Mathématiques - 2016, France, 2016 [hal]
    
    Abstract (click to read). Next-Generation Sequencing (NGS) technologies have opened new perspectives to refine the process of predicting the secondary structure(s) of structured non-coding RNAs. Herein, we describe an integrated modeling strategy, based on the SHAPE chemistry, to infer structural insight from deep se-quencing data. Our approach is based on a pseudo-energy minimization, incorporating additional information from evolutionary data (compensatory mutations) and SHAPE experiments (reactivity scores) within an iterative procedure. Preliminary results reveal conserved and stable structures within UTRs of the Ebola Genome, that are both thermodynamically-stable and highly supported by SHAPE accessibility analysis.
  - W. Wang, M. Barba, P. Rinaudo, A. Denise, and Y. Ponty
    
    Homology modeling of complex structural RNAs
    Proceedings of JOBIM - Journées Ouvertes en Biologie, Informatique et Mathématiques - 2016, France, 2016 [hal]
    
    Abstract (click to read). Aligning macromolecules such as proteins, DNAs and RNAs in order to reveal, or conversely exploit, their functional homology is a classic challenge in bioinformatics, with farreaching applications in structure modelling and genome annotations. In the specific context of complex RNAs, featuring pseudoknots, multiple interactions and noncanonical base pairs, multiple algorithmic solutions and tools have been proposed for the structure/sequence alignment problem. However, such tools are seldom used in practice, due in part to their extreme computational demands, and because of their inability to support general types of structures. Recently, a general parameterized algorithm based on tree decomposition of the query structure has been designed by Rinaudo et al. We present an implementation of the algorithm within a tool named LiCoRNA. We compare it against stateoftheart algorithms. We show that it both gracefully specializes into a practical algorithm for simple classes pseudoknot, and offers a general solution for complex pseudoknots, which are explicitly outofreach of competing softwares.
- Miscellaneous & Submitted
  - A. Saaidi, D. Allouche, Y. Ponty, B. Sargueil, and M. Regnier
    
    Computational methods for comparing and integrating multiple probing assays to predict RNA secondary structure
    [hal]
    
    Abstract (click to read). 1- Introduction:RNA structure is a key to understand retroviruses’s mechanisms e.g. HIV. Many prediction approaches suggesting accurate structures are available but they could be further improved by both taking advantage of Next generation sequencing technology and new experimental techniques (Enzymatic and SHAPE).2 - Experimental probing dataIn this poster, we present an integrative approach based on using many experimental data, resulting from sequencing, to direct predictions with the aim to find an accurate structure lying in the intersection of different sources of experiments. From one side, to reveal single nucleotide, reactivity profiles resulting from a SHAPE technology were used as “soft constraints”, meaning that the reactivity values were translated into pseudo-energies as described (Lorenz et al, 2016). From the other side, RNAses cleavage was used with two enzymes V and T targeting respectively paired and unpaired nucleotides. Reactivity scores resulting from those two experiments are used as hard constraints, forcing positions that exceed a specific threshold to be paired(case of Enzymatic-V) or unpaired(case of Enzymatic-T).3-stochastic sampling:At the thermodynamic equilibrium, a given RNA can have many alternative structures, where each structure could be characterized by a probability within the space of all the possible conformations (Boltzmann ensemble). This probability is related to the energy of the structure, the highest the energy needed to break pairs present in the structure the highest is its probability in the ensemble.We admit that the optimal structure(s) should be energetically stable and supported by several experimental data. For this reason, we coupled a stochastic sampling from the Boltzmann ensembles associated with the experimentally derived constraints, with a clustering across experimental conditions, to generate a structural models that are well-supported by available data.4-The work-flow description: 1. Experimental data from different conditions(SHAPE,Enzymatic-T, Enzymatic-V) were analysed to extract reactivity profiles that will serve as constraints. 2. We sampled 2000 structures per condition: We perform a Boltzmann sampling (Ding et al, 2005) to generate a predefined number of stable structures, compatible with the constraints derived for each condition. We used the stochastic sampling mode of RNAsubopt (-p option) to generate energetically stable structures that are either fully compliant with constraints derived from enzymatic data (hard constraints(Mathews et al,2004)), or constitute reasonable trade-offs between thermodynamic stability and compatibility with SHAPE data (soft constraints, using thepseudo-potentials of Deigan et al. (Deigan et al, 2009), see (Lorenz et al,2016) for details). 3. We merge the structures while keeping labels to retain the origin of each structure. 4. In order to detect structures with affinity to each other, The merged sets of models were clustered, using the base-pair distance as a measure of dissimilarity, the distance between two structures corresponds to the number of base pairs needed to break and to build in order to go from a structure to an other. A clustering algorithms (affinity propagation (Wang et al, 2007) implemented in the scikit-learn Python package (Pedregosa et al, 2011) is used to agglomerate and identify recurrent structures. One of the advantages of affinity propagation resides in its low computational requirements. 5. The next step consists on identifying clusters that are homogeneous, stable and well supported by experimental evidences, leading to the identification of the following objective criteria:-Present conditions that informs as about the diversity of the cluster: Our primary target are clusters compatible with multiple experimental conditions. However, the larger sampled sets required for reproducibility tend to populate each cluster with structures fromall conditions. We thus associated with each cluster the number of represented conditions, defined as the number of conditions for whichthe accumulated Boltzmann probability in the cluster exceeds a predefined threshold.-Boltzmann weight that is a measure of stability: Structures that are found in a given cluster may be unstable, and should be treated asoutliers. For this reason, we computed the cumulated normalized Boltzmann probabilities within the cluster, to favor stable clustersconsisting of stable structures;- Average Cluster Distance to count for coherence: We observed a general tendency of clustering algorithms to create heterogeneousclusters when faced with noisy data. We thus associated with each cluster the mean distance between pairs of structures, estimated asthe average distance to the MEA (Lu et al, 2009) for the sake of efficiency, in order to neglect clusters that were too diverse.6. The next steps consist on choosing cluster(s) with high coherence, diversity and stability. for this purpose we restricted our analysis to clusters that were found on the 3D Pareto Frontier (Mattson Messac, 2005) with respect to the three mentioned above criteria .7. After detecting the optimal Pareto cluster(s), we need to identify representative structure for each cluster. We chose the maximum expectedaccuracy (MEA) structure (Lu et al, 2009) as the representative structure for each cluster, which is defined as the secondary structure whose structural elements have highest accumulated Boltzmann probability within the cluster.5 Results:This resulted in 2 structures which we narrowed down to a single candidate using compatibility with the 1M7 SHAPE data as a final discriminatory criterion.
2015
- Peer-reviewed journal articles
  - C. Chauve, Y. Ponty, and J. Zanetti
    
    Evolution of genes neighborhood within reconciled phylogenies: an ensemble approach
    BMC Bioinformatics, BioMed Central, 16(Suppl 19):S6 pp, 2015 [doi] [hal]
    
    Abstract (click to read). CONTEXT:The reconstruction of evolutionary scenarios for whole genomesin terms of genome rearrangements is a fundamental problem in evolutionaryand comparative genomics. The DeCo algorithm, recently introducedby Berard et al., computes parsimonious evolutionary scenarios forgene adjacencies, from pairs of reconciled gene trees. However, asfor many combinatorial optimization algorithms, there can exist manyco-optimal, or slightly sub-optimal, evolutionary scenarios thatdeserve to be considered.CONTRIBUTION:We extend the DeCo algorithmto sample evolutionary scenarios from the whole solution space underthe Boltzmann distribution, and also to compute Boltzmann probabilitiesfor specific ancestral adjacencies.RESULTS:We apply our algorithmsto a dataset of mammalian gene trees and adjacencies, and observea significant reduction of the number of syntenic conflicts observedin the resulting ancestral gene adjacencies.
- Conferences with proceedings
  - A. Rajaraman, C. Chauve, and Y. Ponty
    
    Assessing the robustness of parsimonious predictions for gene neighborhoods from reconciled phylogenies
    Proceedings of ISBRA - 11th International Symposium on Bioinformatics Research and Applications - 2015, United States, 9096:260--271 pp, 2015 [doi] [hal]
    
    Abstract (click to read). The availability of a large number of assembled genomes opens the way to study the evolution of syntenic character within a phylogenetic context. The DeCo algorithm, recently introduced by Bérard et al. allows the computation of parsimonious evolutionary scenarios for gene adjacencies, from pairs of reconciled gene trees. Following the approach pioneered by Sturmfels and Pachter, we describe how to modify the DeCo dynamic programming algorithm to identify classes of cost schemes that generates similar parsimonious evolutionary scenarios for gene adjacencies, as well as the robustness to changes to the cost scheme of evolutionary events of the presence or absence of specific ancestral gene adjacencies. We apply our method to six thousands mammalian gene families, and show that computing the robustness to changes to cost schemes provides new and interesting insights on the evolution of gene adjacencies and the DeCo model.
  - J. Haleš, J. Maňuch, Y. Ponty, and L. Stacho
    
    Combinatorial RNA Design: Designability and Structure-Approximating Algorithm
    Proceedings of CPM - 26th Annual Symposium on Combinatorial Pattern Matching, Italy, 2015 [hal]
    
    Abstract (click to read). In this work, we consider the Combinatorial RNA Design problem, a minimal instance of the RNA design problem which aims at finding a sequence that admits a given target as its unique base pair maximizing structure. We provide complete characterizations for the structures that can be designed using restricted alphabets. Under a classic four-letter alphabet, we provide a complete characterization of designable structures without unpaired bases. When unpaired bases are allowed, we provide partial characterizations for classes of designable/undesignable structures, and show that the class of designable structures is closed under the stutter operation. Membership of a given structure to any of the classes can be tested in linear time and, for positive instances, a solution can be found in linear time. Finally, we consider a structure-approximating version of the problem that allows to extend bands (helices) and, assuming that the input structure avoids two motifs, we provide a linear-time algorithm that produces a designable structure with at most twice more base pairs than the input structure.
- Book chapters & Editorials
  - Y. Ponty, and F. Leclerc
    
    Drawing and Editing the Secondary Structure(s) of RNA.
    RNA Bioinformatics, Editor(s) Ernesto Picardi, Methods in Molecular Biology, Springer New York, 63-100 pp, 2015 [doi] [hal] [pubmed]
    
    Abstract (click to read). Secondary structure diagrams are essential, in RNA biology, to communicate functional hypotheses and summarize structural data, and communicate them visually as drafts or finalized publication-ready figures. While many tools are currently available to automate the production of such diagrams, their capacities are usually partial, making it hard for a user to decide which to use in a given context. In this chapter, we guide the reader through the steps involved in the production of expressive publication-quality illustrations featuring the RNA secondary structure. We present major existing representations and layouts, and give precise instructions to produce them using available free software, including jViz.RNA, the PseudoViewer, RILogo, R-chie, RNAplot, R2R, and VARNA. We describe the file formats and structural descriptions accepted by popular RNA visualization tools. We also provide command lines and Python scripts to ease the user's access to advanced features. Finally, we discuss and illustrate alternative approaches to visualize the secondary structure in the presence of probing data, pseudoknots, RNA-RNA interactions, and comparative data.
  - F. Jossinet, Y. Ponty, and J. Waldispühl
    
    Preface. Extended versions of selected papers presented at Computational methods for Structural RNAs (CMSR'14)
    Computational methods for Structural RNAs (CMSR'14), Journal of Computational Biology, Mary Ann Liebert, 189 pp, 2015 [hal] [pubmed]
    
    Abstract (click to read). This preface introduces to the extended versions of selected papers presented at Computational methods for Structural RNAs (CMSR'14)
- Conferences without proceedings and/or unrefereed
  - V. Le Gallic, A. Denise, and Y. Ponty
    
    Résultats algorithmiques pour le design d’ARN avec contraintes de séquence
    Proceedings of SeqBio 2015, France, 2015 [hal]
    
    Abstract (click to read). Le problème du design consiste à concevoir une ou des séquences d’ARN qui vont, dans un modèleénergétique, se replier en une structure cible. Les algorithmes déjà existants dans le domaine negèrent pour la plupart pas l’ajout au problème des contraintes de séquence, c’est-à-dire l’interdictionou l’obligation d’utiliser certains motifs.Zhou et al. (2013) ont proposé un algorithme qui utilise de la génération aléatoire dans un langageformel conditionné par un automate fini qui gère les contraintes de motifs interdits/imposés et unegrammaire non contextuelle qui gère les contraintes imposées par la structure recherchée.On propose ici de se baser sur cet algorithme en y ajoutant des optimisations permettant de réduiresa complexité. Ce travail soulève également des questions ouvertes sur la construction efficace d’unautomate (quasi) minimal, ainsi que l’incorporation de contraintes supplémentaires garantissant lerepliement des séquences engendrées vers une structure cible à l’exclusion de tout autre repliement.
- Invited talks
  - Y. Ponty
    
    RNA Bioinformatics and ensemble dynamic programming
    Proceedings of 5th biennial Canadian Discrete and Algorithmic Mathematics Conference - CanaDAM 2015, Canada, 2015 [hal]
    
    Abstract (click to read). RiboNucleic Acids (RNAs) are fascinating biomolecules which, similarly to DNA, can be encoded as sequences over a four-letter alphabet, and perform a wide array of biological functions. However, unlike DNA, the precise function of a given RNA depends critically on its structure, adopted as the outcome of a folding process. Luckily, this intricate three- dimensional conformation can be adequately abstracted as a (non-crossing) list of contacts, i.e. a discrete combinatorial object.In this talk, I will emphasize how, over the past three decades, RNA biology has benefited from a continuous and fruitful cross-talk between discrete mathematicians, computer scientists and biochemists. At the center of this conversation lies the concept of dynamic-programming, an algorithmic design technique which solves a combinatorial optimization problem efficiently by taking advantage of a well-chosen decomposition of its search space. Extensions and optimized instances of this technique now allow to address, at a genomic scale, multiple questions related to the analysis of the Boltzmann ensemble and the sequence-structure(-function) relationship. These developments also raise well-defined open questions, motivating further studies of the underlying discrete structures.
  - Y. Ponty
    
    Towards firm foundations for the rational design of RNA molecules
    Proceedings of Colloque de Bioinformatique Moléculaire (GdR BIM), France, 2015 [hal]
    
    Abstract (click to read). The overall architecture of an RNA can be predicted with acceptable accuracy using established computational methods, such as those implemented in the MFold or the RNAFold software. This opens the door for a rational design of RNA molecules, to serve as building blocks in synthetic biology, or as novel therapeutic agents. To that purpose, one must solve the inverse folding problem, i.e. construct an RNA sequence which adopts a specific secondary structure according to established structure prediction methods. However, contrasting with the prediction task, the underlying computational structure of inverse folding is poorly understood, and existing tools either fail to return admissible sequences for some structures, or require extraordinary computational resources. Consequently, the field of RNA bioinformatics is currently craving for both practical and theoretically-sound approaches for the problem. In this talk, I will introduce the many flavors of the RNA design problem and illustrate its relevance to the objectives of modern (synthetic) biology. I will outline the strengths and shortcomings of selected approaches, and will briefly present recently contributed methods, developed with collaborators at Univ McGill (Canada), Wuhan Univ. (China) and Univ. Paris-Sud. If time allows, I will also mention results recently obtained with collaborators at Simon Fraser University, which characterize large subclasses of (un)designable secondary structures in simple energy models. These results are currently being extended to more realistic energy models, and could serve as a foundation for exact, computationally-efficient, solutions for RNA design.
2014
- Conferences with proceedings
  - C. Chauve, Y. Ponty, and J. Zanetti
    
    Evolution of genes neighborhood within reconciled phylogenies: an ensemble approach
    Proceedings of BSB - Brazilian Symposium on Bioinformatics - 2014, Brazil, Advances in Bioinformatics and Computational Biology, Springer, 8826:49--56 pp, 2014 [doi] [hal]
    
    Abstract (click to read). We consider a recently introduced dynamic programming scheme to compute parsimonious evolutionary scenarios for gene adjacencies. We extend this scheme to sample evolutionary scenarios from the whole solution space under the Boltzmann distribution. We apply our algorithms to a dataset of mammalian gene trees and adjacencies, and observe a significant reduction of the number of syntenic inconsistencies observed in the resulting ancestral gene adjacencies.
- Book chapters & Editorials
  - Y. Ponty
    
    Bio-algorithmique des ARN : petite promenade aux interfaces
    1024 - Bulletin de la société informatique de France, Editor(s) Eric Sopena, SIF, 23--53 pp, 2014 [hal]
    
    Abstract (click to read). Nous présentons l'histoire et les développements récents de la recherche en bioinformatique des ARN, en insistant particulièrement sur sa dimension algorithmique, ainsi que sur les nombreuses et fructueuses interactions avec d'autres sciences qui jalonnent son histoire.
- Miscellaneous & Submitted
  - C. Chauve, Y. Ponty, and J. Zanetti
    
    Evolution of genes neighborhood within reconciled phylogenies: an ensemble approach
    [hal]
    
    Abstract (click to read). Reconstruction of the evolutionary history of genomic characters along a given species tree is a long-standing problem in computational biology, with efficient algorithms to compute parsimonious scenarios for many types of characters, as in the cases of genes and genomes sequences, gene content, and gene family evolution. Recently, Bérard et al. extended the corpus of such results to syntenic characters, as they introduced the notion of adjacency forest, that models the evolution of gene adjacencies within a phylogeny, and described an efficient dynamic programming (DP) algorithm, called DeCo (Berard et al, ECCB 2012), to compute parsimonious adjacency evolutionary histories. Applying the classical parsimony-based approach of DeCo to a dataset of over 6,000 pairs of mammalian gene trees yielded a significant number of ancestral genes involved in more than two adjacencies, which correspond to syntenic inconsistencies.Recently, more general approaches for parsimony problems have been analyzed, either exploring a wider range of parameters, or considering several alternate histories for a given instance. Our work seeks to address the syntenic inconsistencies of DeCo by extending their DP scheme toward an exploration of the whole solution space of adjacency histories, under the Boltzmann probability distribution, that assigns a probability to each solution defined in terms of its parsimony score. This principle has been applied in several contexts and is sometimes known as the “Boltzmann ensemble approach”.While this Boltzmann ensemble approach has been used for a long time in RNA structure analysis, to the best of our knowledge it is not the case in comparative genomics, where exact probabilistic models have been favored as increasing computational capacities allow them to handle realistic datasets. However, such a probabilistic model does not exist so far for gene adjacencies, which motivates our work.We first show that by sampling adjacencies histories under a Boltzmann distribution that favors co-optimal histories and conserving only frequent ancestral adjacencies, we can reduce significantly the number of syntenic inconsistencies. We also implement a inside/outside variant for DeCo, that computes the actual probabilities of each individual adjacency considered under the Boltzmann distribution.
  - F. Jossinet, Y. Ponty, and J. Waldispühl
    
    Proceedings of the 1st workshop on Computational Methods for Structural RNAs
    [doi] [hal]
    
    Abstract (click to read). Ribonucleic acids (RNAs) play key roles in various aspects of the gene transcription and regulation processes, and are the focus of an ever-increasing interest in all areas of molecular biology. Deciphering the function of a non-protein coding RNA requires an intimate knowledge of its structure, motivating the development of structure-centric methods. This volume contains original peer-reviewed contributions and invited communications presented at "CMSR'14: Computational Methods for Structural RNAs" held on September 7, 2014 in Strasbourg. This event was hosted as an official workshop of ECCB'14: 13th European Conference on Computational Biology.Proceedings are freely downloadable at http://cmsr14.cs.mcgill.ca/
2013
- Peer-reviewed journal articles
  - A. Lorenz, and Y. Ponty
    
    Non-redundant random generation algorithms for weighted context-free languages ^*.
    Theoretical Computer Science, Elsevier, 502:177-194 pp, 2013 [doi] [hal]
    
    Abstract (click to read). We address the non-redundant random generation of $k$ words of length $n$ in a context-free language. Additionally, we want to avoid a predefined set of words. We study a rejection-based approach, whose worst-case time complexity is shown to grow exponentially with $k$ for some specifications and in the limit case of a coupon collector. We propose two algorithms respectively based on the recursive method and on an unranking approach. We show how careful implementations of these algorithms allow for a non-redundant generation of $k$ words of length $n$ in $\mathcal{O}(k\cdot n\cdot \log{n})$ arithmetic operations, after a precomputation of $\Theta(n)$ numbers. The overall complexity is therefore dominated by the generation of $k$ words, and the non-redundancy comes at a negligible cost.
  - V. Reinharz, Y. Ponty, and J. Waldispühl
    
    A weighted sampling algorithm for the design of RNA sequences with targeted secondary structure and nucleotide distribution.
    Bioinformatics, Oxford University Press (OUP), 29(13):i308-15 pp, 2013 [doi] [hal] [pubmed]
    
    Abstract (click to read). MOTIVATIONS: The design of RNA sequences folding into predefined secondary structures is a milestone for many synthetic biology and gene therapy studies. Most of the current software uses similar local search strategies (i.e. a random seed is progressively adapted to acquire the desired folding properties) and more importantly do not allow the user to control explicitly the nucleotide distribution such as the GC-content in their sequences. However, the latter is an important criterion for large-scale applications as it could presumably be used to design sequences with better transcription rates and/or structural plasticity. RESULTS: In this article, we introduce IncaRNAtion, a novel algorithm to design RNA sequences folding into target secondary structures with a predefined nucleotide distribution. IncaRNAtion uses a global sampling approach and weighted sampling techniques. We show that our approach is fast (i.e. running time comparable or better than local search methods), seedless (we remove the bias of the seed in local search heuristics) and successfully generates high-quality sequences (i.e. thermodynamically stable) for any GC-content. To complete this study, we develop a hybrid method combining our global sampling approach with local search strategies. Remarkably, our glocal methodology overcomes both local and global approaches for sampling sequences with a specific GC-content and target structure. AVAILABILITY: IncaRNAtion is available at csb.cs.mcgill.ca/incarnation/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
  - V. Reinharz, Y. Ponty, and J. Waldispühl
    
    Using Structural and Evolutionary Information to Detect and Correct Pyrosequencing Errors in Noncoding RNAs. ^*.
    Journal of Computational Biology, Mary Ann Liebert, 20(11):905-19 pp, 2013 [doi] [hal] [pubmed]
    
    Abstract (click to read). The analysis of the sequence-structure relationship in RNA molecules is not only essential for evolutionary studies but also for concrete applications such as error-correction in next generation sequencing (NGS) technologies. The prohibitive sizes of the mutational and conformational landscapes, combined with the volume of data to process, require efficient algorithms to compute sequence-structure properties. In this article, we address the correction of NGS errors by calculating which mutations most increase the likelihood of a sequence to a given structure and RNA family. We introduce RNApyro, an efficient, linear time and space inside-outside algorithm that computes exact mutational probabilities under secondary structure and evolutionary constraints given as a multiple sequence alignment with a consensus structure. We develop a scoring scheme combining classical stacking base-pair energies to novel isostericity scores and apply our techniques to correct pointwise errors in 5s and 16s rRNA sequences. Our results suggest that RNApyro is a promising algorithm to complement existing tools in the NGS error-correction pipeline.
  - Y. Zhang, Y. Ponty, M. Blanchette, E. Lecuyer, and J. Waldispühl
    
    SPARCS: a web server to analyze (un)structured regions in coding RNA sequences.
    Nucleic Acids Research, Oxford University Press, 41(Web Server issue):W480-5 pp, 2013 [doi] [hal] [pubmed]
    
    Abstract (click to read). More than a simple carrier of the genetic information, messenger RNA (mRNA) coding regions can also harbor functional elements that evolved to control different post-transcriptional processes, such as mRNA splicing, localization and translation. Functional elements in RNA molecules are often encoded by secondary structure elements. In this aticle, we introduce Structural Profile Assignment of RNA Coding Sequences (SPARCS), an efficient method to analyze the (secondary) structure profile of protein-coding regions in mRNAs. First, we develop a novel algorithm that enables us to sample uniformly the sequence landscape preserving the dinucleotide frequency and the encoded amino acid sequence of the input mRNA. Then, we use this algorithm to generate a set of artificial sequences that is used to estimate the Z-score of classical structural metrics such as the sum of base pairing probabilities and the base pairing entropy. Finally, we use these metrics to predict structured and unstructured regions in the input mRNA sequence. We applied our methods to study the structural profile of the ASH1 genes and recovered key structural elements. A web server implementing this discovery pipeline is available at http://csb.cs.mcgill.ca/sparcs together with the source code of the sampling algorithm.
  - A. Lopes, S. Sacquin-Mora, V. Dimitrova, E. Laine, Y. Ponty, and A. Carbone
    
    Protein-protein interactions in a crowded environment: an analysis via cross-docking simulations and evolutionary information
    PLoS Computational Biology, Public Library of Science, 9(12):e1003369 pp, 2013 [doi] [hal]
    
    Abstract (click to read). Protein-protein interactions (PPI) are at the heart of the molecular processes governing life and constitute an increasingly important target for drug design. Given their importance, it is vital to determine which protein interactions have functional relevance and to characterize the protein competition inherent to crowded environments, as the cytoplasm or the cellular organelles. We show that combining coarse-grain molecular cross-docking simulations and binding site predictions based on evolutionary sequence analysis is a viable route to identify true interacting partners for hundreds of proteins with a variate set of protein structures and interfaces. Also, we realize a large-scale analysis of protein binding promiscuity and provide a numerical characterization of partner competition and level of interaction strength for about 28000 false-partner interactions. Finally, we demonstrate that binding site prediction is useful to discriminate native partners, but also to scale up the approach to thousands of protein interactions. This study is based on the large computational effort made by thousands of internautes helping World Community Grid over a period of 7 months. The complete dataset issued by the computation and the analysis is released to the scientific community.
- Conferences with proceedings
  - V. Reinharz, Y. Ponty, and J. Waldispühl
    
    A linear inside-outside algorithm for correcting sequencing errors in structured RNA sequences ^*
    Proceedings of RECOMB - 17th Annual International Conference on Research in Computational Molecular Biology - 2013, China, 2013 [hal]
    
    Abstract (click to read). Analysis of the sequence-structure relationship in RNA molecules are essential to evolutionary studies but also to concrete applications such as error-correction methodologies in sequencing technologies. The prohibitive sizes of the mutational and conformational landscapes combined with the volume of data to proceed require e cient algorithms to compute sequence-structure properties. More speci cally, here we aim to calculate which mutations increase the most the likelihood of a sequence to a given structure and RNA family. In this paper, we introduce RNApyro, an e cient linear-time and space inside-outside algorithm that computes exact mutational probabilities under secondary structure and evolutionary constraints given as a multiple sequence alignment with a consensus structure. We develop a scoring scheme combining classical stacking base pair energies to novel isostericity scales, and apply our techniques to correct point-wise errors in 5s rRNA sequences. Our results suggest that RNApyro is a promising algorithm to complement existing tools in the NGS error-correction pipeline.
  - V. Reinharz, Y. Ponty, and J. Waldispühl
    
    A weighted sampling algorithm for the design of RNA sequences with targeted secondary structure and nucleotides distribution
    Proceedings of ISMB/ECCB - 21st Annual international conference on Intelligent Systems for Molecular Biology/12th European Conference on Computational Biology - 2013, Germany, 2013 [hal]
    
    Abstract (click to read). Motivations: The design of RNA sequences folding into predefined secondary structures is a milestone for many synthetic biology and gene therapy studies. Most of the current software uses similar local search strategies (i.e. a random seed is progressively adapted to acquire the desired folding properties) and more importantly do not allow the user to control explicitly the nucleotide distribution such as the GC-content in their sequences. However, the latter is an important criterion for large-scale applications as it could presumably be used to design sequences with better transcription rates and/or structural plasticity. Results: In this paper, we introduce IncaRNAtion, a novel algorithm to design RNA sequences folding into target secondary structures with a predefined nucleotide distribution. IncaRNAtion uses a global sampling approach and weighted sampling techniques. We show that our approach is fast (i.e. running time comparable or better than local search methods), seed-less (we remove the bias of the seed in local search heuristics), and successfully generates highquality sequences (i.e. thermodynamically stable) for any GC-content. To complete this study, we develop an hybrid method combining our global sampling approach with local search strategies. Remarkably, our glocal methodology overcomes both local and global approaches for sampling sequences with a specific GC content and target structure. Availability: IncaRNAtion is available at csb.cs.mcgill.ca/incarnation/
  - E. Senter, S. Sheikh, I. Dotu, Y. Ponty, and P. Clote
    
    Using the Fast Fourier Transform to accelerate the computational search for RNA conformational switches (extended abstract) ^*
    Proceedings of RECOMB - 17th Annual International Conference on Research in Computational Molecular Biology - 2013, China, 2013 [hal]
    
    Abstract (click to read). We describe the broad outline of a new thermodynamics-based algorithm, FFTbor, that uses the fast Fourier transform to perform polynomial interpolation to compute the Boltzmann probability that secondary structures di er by k base pairs from an arbitrary reference structure of a given RNA sequence. The algorithm, which runs in quartic time O(n4) and quadratic space O(n2), is used to determine the correlation between kinetic folding speed and the ruggedness of the energy landscape,and to predict the location of riboswitch expression platform candidates. The full paper appears in PLoS ONE (2012) 19 Dec 2012. A web server is available at http://bioinformatics.bc.edu/clotelab/FFTbor/.
  - Y. Zhou, Y. Ponty, S. Vialette, J. Waldispühl, Y. Zhang, and A. Denise
    
    Flexible RNA design under structure and sequence constraints using formal languages
    Proceedings of ACM-BCB - ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics - 2013, United States, 2013 [hal]
    
    Abstract (click to read). The problem of RNA secondary structure design (also called inverse folding) is the following: given a target secondary structure, one aims to create a sequence that folds into, or is compatible with, a given structure. In several practical applications in biology, additional constraints must be taken into account, such as the presence/absence of regulatory motifs, either at a specific location or anywhere in the sequence. In this study, we investigate the design of RNA sequences from their targeted secondary structure, given these additional sequence constraints. To this purpose, we develop a general framework based on concepts of language theory, namely context-free grammars and finite automata. We efficiently combine a comprehensive set of constraints into a unifying context-free grammar of moderate size. From there, we use generic generic algorithms to perform a (weighted) random generation, or an exhaustive enumeration, of candidate sequences. The resulting method, whose complexity scales linearly with the length of the RNA, was implemented as a standalone program. The resulting software was embedded into a publicly available dedicated web server. The applicability demonstrated of the method on a concrete case study dedicated to Exon Splicing Enhancers, in which our approach was successfully used in the design of \emph{in vitro} experiments.
- Book chapters & Editorials
  - S. Schirmer, Y. Ponty, and R. Giegerich
    
    Introduction to RNA secondary structure comparison.
    RNA Sequence, Structure, and Function: Computational and Bioinformatic Methods, Methods in Molecular Biology Editor(s) Gorodkin & Jan and Ruzzo & Walter L., Methods in molecular biology, Humana Press/Springer Imprint, 247-73 pp, 2013 [doi] [hal] [pubmed]
    
    Abstract (click to read). Many methods have been proposed for RNA secondary structure comparison, and new ones are still being developed. In this chapter, we first consider structure representations and discuss their suitability for structure comparison. Then, we take a look at the more commonly used methods, restricting ourselves to structures without pseudo-knots. For comparing structures of the same sequence, we study base pair distances. For structures of different sequences (and of different length), we study variants of the tree edit model. We name some of the available tools and give pointers to the literature. We end with a short review on comparing structures with pseudo-knots as an unsolved problem and topic of active research.
2012
- Peer-reviewed journal articles
  - P. Clote, Y. Ponty, and J.-M. Steyaert
    
    Expected distance between terminal nucleotides of RNA secondary structures.
    Journal of Mathematical Biology, Springer, 65(3):581-99 pp, 2012 [doi] [hal] [pubmed]
    
    Abstract (click to read). In "The ends of a large RNA molecule are necessarily close", Yoffe et al. (Nucleic Acids Res 39(1):292-299, 2011) used the programs RNAfold [resp. RNAsubopt] from Vienna RNA Package to calculate the distance between 5' and 3' ends of the minimum free energy secondary structure [resp. thermal equilibrium structures] of viral and random RNA sequences. Here, the 5'-3' distance is defined to be the length of the shortest path from 5' node to 3' node in the undirected graph, whose edge set consists of edges {i, i + 1} corresponding to covalent backbone bonds and of edges {i, j} corresponding to canonical base pairs. From repeated simulations and using a heuristic theoretical argument, Yoffe et al. conclude that the 5'-3' distance is less than a fixed constant, independent of RNA sequence length. In this paper, we provide a rigorous, mathematical framework to study the expected distance from 5' to 3' ends of an RNA sequence. We present recurrence relations that precisely define the expected distance from 5' to 3' ends of an RNA sequence, both for the Turner nearest neighbor energy model, as well as for a simple homopolymer model first defined by Stein and Waterman. We implement dynamic programming algorithms to compute (rather than approximate by repeated application of Vienna RNA Package) the expected distance between 5' and 3' ends of a given RNA sequence, with respect to the Turner energy model. Using methods of analytical combinatorics, that depend on complex analysis, we prove that the asymptotic expected 5'-3' distance of length n homopolymers is approximately equal to the constant 5.47211, while the asymptotic distance is 6.771096 if hairpins have a minimum of 3 unpaired bases and the probability that any two positions can form a base pair is 1/4. Finally, we analyze the 5'-3' distance for secondary structures from the STRAND database, and conclude that the 5'-3' distance is correlated with RNA sequence length.
  - E. Senter, S. Sheikh, I. Dotu, Y. Ponty, and P. Clote
    
    Using the Fast Fourier Transform to Accelerate the Computational Search for RNA Conformational Switches ^*.
    PLoS ONE, Public Library of Science, 7(12):e50506 pp, 2012 [doi] [hal]
    
    Abstract (click to read). Using complex roots of unity and the Fast Fourier Transform, we design a new thermodynamics-based algorithm, FFTbor, that computes the Boltzmann probability that secondary structures differ by k base pairs from an arbitrary initial structure of a given RNA sequence. The algorithm, which runs in quartic time O(n^4) and quadratic space O(n^2), is used to determine the correlation between kinetic folding speed and the ruggedness of the energy landscape, and to predict the location of riboswitch expression platform candidates. A web server is available at http://bioinformatics.bc.edu/clotelab/FFTbor/
  - A. Levin, M. Lis, Y. Ponty, C. O'Donnell, S. Devadas, B. Berger, and J. Waldispühl
    
    A global sampling approach to designing and reengineering RNA secondary structures.
    Nucleic Acids Research, Oxford University Press, 40(20):10041-10052 pp, 2012 [doi] [hal] [pubmed]
    
    Abstract (click to read). The development of algorithms for designing artificial RNA sequences that fold into specific secondary structures has many potential biomedical and synthetic biology applications. To date, this problem remains computationally difficult, and current strategies to address it resort to heuristics and stochastic search techniques. The most popular methods consist of two steps: First a random seed sequence is generated; next, this seed is progressively modified (i.e. mutated) to adopt the desired folding properties. Although computationally inexpensive, this approach raises several questions such as (i) the influence of the seed; and (ii) the efficiency of single-path directed searches that may be affected by energy barriers in the mutational landscape. In this article, we present RNA-ensign, a novel paradigm for RNA design. Instead of taking a progressive adaptive walk driven by local search criteria, we use an efficient global sampling algorithm to examine large regions of the mutational landscape under structural and thermodynamical constraints until a solution is found. When considering the influence of the seeds and the target secondary structures, our results show that, compared to single-path directed searches, our approach is more robust, succeeds more often and generates more thermodynamically stable sequences. An ensemble approach to RNA design is thus well worth pursuing as a complement to existing approaches. RNA-ensign is available at http://csb.cs.mcgill.ca/RNAensign.
- Conferences with proceedings
  - J. Du Boisberranger, D. Gardy, and Y. Ponty
    
    The weighted words collector
    Proceedings of AOFA - 23rd International Meeting on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms - 2012, Canada, DMTCS, DMTCS Proceedings vol. AQ, 23rd Intern. Meeting on Probabilistic, Combinatorial, and Asymptotic Methods for the Analysis of Algorithms (AofA'12):243--264 pp, 2012 [doi] [hal]
    
    Abstract (click to read). We consider the word collector problem, i.e. the expected number of calls to a random weighted generator before all the words of a given length in a language are generated. The originality of this instance of the non-uniform coupon collector lies in the, potentially large, multiplicity of the words/coupons of a given probability/composition. We obtain a general theorem that gives an asymptotic equivalent for the expected waiting time of a general version of the Coupon Collector. This theorem is especially well-suited for classes of coupons featuring high multiplicities. Its application to a given language essentially necessitates knowledge on the number of words of a given composition/probability. We illustrate the application of our theorem, in a step-by-step fashion, on four exemplary languages, whose analyses reveal a large diversity of asymptotic waiting times, generally expressible as $\kappa \cdot m^p \cdot (\log{m})^q \cdot (\log \log{m})^r$, for $m$ the number of words, and $p, q, r$ some positive real numbers.
  - S. Sheikh, R. Backofen, and Y. Ponty
    
    Impact Of The Energy Model On The Complexity Of RNA Folding With Pseudoknots
    Proceedings of CPM - 23rd Annual Symposium on Combinatorial Pattern Matching - 2012, Finland, Combinatorial Pattern Matching, Springer, 7354:321--333 pp, 2012 [doi] [hal]
    
    Abstract (click to read). La prédiction du repliement, avec pseudonoeuds généraux, d'une séquence d'ARN de taille $n$ est équivalent à la recherche d'un couplage d'énergie libre minimale. Dans un modèle d'énergie simple, où chaque paire de base contribue indépendamment à l'énergie, ce problème peut être résolu en temps $\Theta(n^3)$ grâce à une variante d'un algorithme de couplage pondéré maximal. Cependant, le même problème a été démontré NP-difficile dans le modèle d'énergie dit des plus proches voisins. Dans ce travail, nous étudions les propriétés du problème sous un modèle d'empilements, constituant un modèle intermédiaire entre ceux d'appariement et des plus proches voisins. Nous démontrons tout d'abord que le repliement avec pseudo-noeuds de l'ARN reste NP-difficile dans de nombreuses valuations du modèle d'énergie. . Par ailleurs, nous montrons que ce problème est approximable, en proposant un algorithme polynomial garantissant une $1/5$-approximation. Ce résultat illustre une différence essentielle entre ce modèle et celui des plus proches voisins, pour lequel nous montrons qu'il ne peut être approché à aucun ratio positif par un algorithme en temps polynomial sauf si $N=NP$.
  - C. Banderier, O. Bodini, Y. Ponty, and H. Tafat
    
    On the diversity of pattern distributions in rational language
    Proceedings of ANALCO - 12th Meeting on Analytic Algorithmics and Combinatorics - 2012, Japan, Omnipress, 107--116 pp, 2012 [hal]
    
    Abstract (click to read). It is well known that, under some aperiodicity and irreducibility conditions, the number of occurrences of local patterns within a Markov chain (and, more generally, within the languages generated by weighted regular expressions/automata) follows a Gaussian distribu- tion with both variance and mean in (n). By contrast, when these conditions no longer hold, it has been denoted that the limiting distribution may follow a whole diversity of distributions, including the uniform, power-law or even multimodal distribution, arising as tradeo s between structural properties of the regular expression and the weight/probabilities associated with its transitions/letters. However these cases only partially cover the full diversity of behaviors induced within regular expressions, and a characterization of attainable distributions remained to be provided. In this article, we constructively show that the limiting distribution of the simplest foresee- able motif (a single letter!) may already follow an arbitrarily complex continuous distribution (or cadlag process). We also give applications in random generation (Boltzmann sampling) and bioinformatics (parsimonious segmentation of DNA).
  - P. Rinaudo, Y. Ponty, D. Barth, and A. Denise
    
    Tree decomposition and parameterized algorithms for RNA structure-sequence alignment including tertiary interactions and pseudoknots
    Proceedings of WABI - 12th Workshop on Algorithms in Bioinformatics - 2012, Slovenia, tba, 2012 [hal]
    
    Abstract (click to read). Nous présentons un cadre général pour la comparaison structure/séquence d'une très large classe de structures d'ARN, qui généralise de nombreux travaux récents consacrés à des familles spécifiques. Notre approche, basée sur une décomposition arborescente des structures, permet d'obtenir un algorithme de complexité paramétrée, dont la complexité asymptotique, exponentielle dans le cas général, dépend de la famille de structures considérée. Pour chacune des familles considérées par les travaux antérieurs, notre algorithme se spécialise automatiquement, et permet d'obtenir une complexité égale à celle obtenue jusqu'alors.
- Book chapters & Editorials
  - Y. Ponty, and J. Jongwane
    
    Mieux comprendre certaines molécules biologiques grâce à l’informatique
    Interstices INRIA, 2012 [hal]
    
    Abstract (click to read). Yann Ponty s’intéresse à l’algorithmique du repliement d’ARN. Il nous en parle dans cet épisode du podcast audio.
- Conferences without proceedings and/or unrefereed
  - S. Janssen, L. Paulevé, Y. Ponty, B. Raman, and M. Zytnicki
    
    Can Probabilistic Model Checking Explore Ribo-Nucleic Acid Folding Space?
    Proceedings of IWBDA - 4th International Workshop on Bio-Design Automation - 2012, United States, IWBDA - 4th International Workshop on Bio-Design Automation - 2012, 2012 [hal]
    
    Abstract (click to read). The folding path of a ribo-nucleic acid from its free form to its complete structure is of significant interest to structural biologists. There are algorithms for simulating kinetics of the ribo-nucleic acid from its unfolded state to its stable structure (minimum free-energy structure). These computational techniques are expensive as the total number of structures in the folding space is exponentially large. We propose to use time-efficient algorithms from the field of probabilistic model checking for verifying certain hypothesis concerning ribo-nucleic acid folding landscape. First, we explain how thermodynamic models of ribo-nucleic acid can be used to generate a Markov chain of structures, whose transitions within the chain are based on the energy difference w.r.t to their neighbours. Then we present a process algebra model to compactly represent the dynamics of ribo-nucleic acid folding. Both these approaches essentially generate input models for probabilistic model checking. Finally, we discuss if statistical model checking techniques can be applied for the problem of aligning ribo-nucleic acid sequences and structures.
- Invited talks
  - Y. Ponty
    
    Visualizing 2D & 3D Structures of RNA
    Proceedings of VIZBI - 3rd international meeting on Visualizing Biological Data - 2012, Germany, 2012 [hal]
    
    Abstract (click to read). Broad overview of the existing visualization techniques/methods/tools and many challenges faced by the RNA community.
- Miscellaneous & Submitted
  - Y. Ponty
    
    Rule-weighted and terminal-weighted context-free grammars have identical expressivity
    [hal]
    
    Abstract (click to read). Two formalisms, both based on context-free grammars, have recently been proposed as a basis for a non-uniform random generation of combinatorial objects. The former, introduced by Denise et al, associates weights with letters, while the latter, recently explored by Weinberg et al in the context of random generation, associates weights to transitions. In this short note, we use a simple modification of the Greibach Normal Form transformation algorithm, due to Blum and Koch, to show the equivalent expressivities, in term of their induced distributions, of these two formalisms.
2011
- Peer-reviewed journal articles
  - J. Waldispühl, and Y. Ponty
    
    An unbiased adaptive sampling algorithm for the exploration of RNA mutational landscapes under evolutionary pressure. ^*.
    Journal of Computational Biology, Mary Ann Liebert, 18(11):1465-79 pp, 2011 [doi] [hal] [pubmed]
    
    Abstract (click to read). The analysis of the relationship between sequences and structures (i.e., how mutations affect structures and reciprocally how structures influence mutations) is essential to decipher the principles driving molecular evolution, to infer the origins of genetic diseases, and to develop bioengineering applications such as the design of artificial molecules. Because their structures can be predicted from the sequence data only, RNA molecules provide a good framework to study this sequence-structure relationship. We recently introduced a suite of algorithms called RNAmutants which allows a complete exploration of RNA sequence-structure maps in polynomial time and space. Formally, RNAmutants takes an input sequence (or seed) to compute the Boltzmann-weighted ensembles of mutants with exactly k mutations, and sample mutations from these ensembles. However, this approach suffers from major limitations. Indeed, since the Boltzmann probabilities of the mutations depend of the free energy of the structures, RNAmutants has difficulties to sample mutant sequences with low G+C-contents. In this article, we introduce an unbiased adaptive sampling algorithm that enables RNAmutants to sample regions of the mutational landscape poorly covered by classical algorithms. We applied these methods to sample mutations with low G+C-contents. These adaptive sampling techniques can be easily adapted to explore other regions of the sequence and structural landscapes which are difficult to sample. Importantly, these algorithms come at a minimal computational cost. We demonstrate the insights offered by these techniques on studies of complete RNA sequence structures maps of sizes up to 40 nucleotides. Our results indicate that the G+C-content has a strong influence on the size and shape of the evolutionary accessible sequence and structural spaces. In particular, we show that low G+C-contents favor the apparition of internal loops and thus possibly the synthesis of tertiary structure motifs. On the other hand, high G+C-contents significantly reduce the size of the evolutionary accessible mutational landscapes.
- Conferences with proceedings
  - J. Waldispühl, and Y. Ponty
    
    An unbiased adaptive sampling algorithm for the exploration of RNA mutational landscapes under evolutionary pressure ^*
    Proceedings of RECOMB - 15th Annual International Conference on Research in Computational Molecular Biology - 2011, Canada, Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 6577:501-515 pp, 2011 [doi] [hal]
    
    Abstract (click to read). The analysis of the relationship between sequences and structures (i.e. how mutations aect structures and reciprocally how structures in uence mutations) is essential to decipher the principles driving molecular evolution, to infer the origins of genetic diseases or to develop bioengineering applications such as the design of articial molecules. Because their structures can be predicted from the sequence data only, RNA molecules provide a good framework to study this sequence-structure relationship. We recently introduced a suite of algorithms called RNAmutants which allows, for the rst time, a complete exploration of RNA sequence-structure maps in polynomial time and space. Formally, RNAmutants takes an input sequence (or seed) to compute the Boltzmann weighted ensembles of mutants with exactly k mutations, and sample mutations from these ensembles. However, this approach suers from major limitations. Indeed, since the Boltzmann probabilities of the mutations depend of the free energy of the structures, RNAmutants has diculties to sample mutant sequences with low G+C-contents. In this paper we introduce a novel unbiased adaptive sampling algorithm that enables RNAmutants to sample regions of the mutational landscape poorly covered by classical algorithms. We applied these methods to sample mutations with low G+C-contents. These adaptive sampling techniques can be easily adapted to explore other regions of the sequence and structural landscapes which are dif- cult to sample. Importantly, these algorithms come at a minimal computational cost. We demonstrate the insights oered by these techniques on studies of complete RNA sequence structures maps of sizes up to 40 nucleotides. Our results indicate that the G+C-content has a strong in uence on the size and shape of the evolutionary accessible sequence and structural spaces. In particular, we show that low G+C-contents favor the apparition of internal loops and thus possibly the synthesis of tertiary structure motifs. On the other hand, high G+C-contents signicantly reduce the size of the evolutionary accessible mutational landscapes.
  - Y. Ponty, and C. Saule
    
    A Combinatorial Framework for Designing (Pseudoknotted) RNA Algorithms
    Proceedings of WABI - 11th Workshop on Algorithms in Bioinformatics - 2011, Germany, 2011 [hal]
    
    Abstract (click to read). We extend an hypergraph representation, introduced by Finkelstein and Roytberg, to unify dynamic programming algorithms in the context of RNA folding with pseudoknots. Classic applications of RNA dynamic programming energy minimization, partition function, base-pair probabilities\ldots) are reformulated within this framework, giving rise to very simple algorithms. This reformulation allows one to conceptually detach the conformation space/energy model -- captured by the hypergraph model -- from the specific application, assuming unambiguity of the decomposition. To ensure the latter property, we propose a new combinatorial methodology based on generating functions. We extend the set of generic applications by proposing an exact algorithm for extracting generalized moments in weighted distribution, generalizing a prior contribution by Miklos and al. Finally, we illustrate our full-fledged programme on three exemplary conformation spaces (secondary structures, Akutsu's simple type pseudoknots and kissing hairpins). This readily gives sets of algorithms that are either novel or have complexity comparable to classic implementations for minimization and Boltzmann ensemble applications of dynamic programming.
- Conferences without proceedings and/or unrefereed
  - S. Janssen, Y. Ponty, B. Raman, S. Sheikh, J.-M. Steyaert, and P. Clote
    
    Investigating the RFAM paradox: The pseudoknot explanation
    Proceedings of Fifth Indo-French Bioinformatics Meeting, India, 2011 [hal]
    
    Abstract (click to read). In this short note, we investigate the responsibility of pseudoknotted (PK) foldings in the observed incapacity of an MFE folding method (RNAfold) to predict the structure of certain families of RNA within RFam. We considered the difference Delta in predicted free-energy obtained by respectively considering and omitting PK conformations, which we showed could be reliably used as an indicator for the presence of pseudoknots (84.5% AUC, while 88.3% of PK families have positive values). We furthered our analysis, investigating the top 15 families associated with positive values of Delta associated with non-PK consensus structures in RFAM, and found evidence of the presence of PK in the literature for at least 11 of them. However a large proportion of poorly predicted families remain associated with low Delta values, and additional explanations need be explored for their poor predictions.
2010
- Peer-reviewed journal articles
  - A. Denise, Y. Ponty, and M. Termier
    
    Controlled non uniform random generation of decomposable structures
    Theoretical Computer Science, Elsevier, 411(40-42):3527-3552 pp, 2010 [doi] [hal]
    
    Abstract (click to read). Consider a class of decomposable combinatorial structures, using different types of atoms ${\cal Z} = \{z_1,\ldots ,z_{|{\cal Z}|}\}$. We address the random generation of such structures with respect to a size $n$ and a targeted distribution in $k$ of its \emph{distinguished} atoms. We consider two variations on this problem. In the first alternative, the targeted distribution is given by $k$ real numbers $\mu_1, \ldots, \mu_k$ such that $0 < \mu_i < 1$ for all $i$ and $\mu_1+\cdots+\mu_k \leq 1$. We aim to generate random structures among the whole set of structures of a given size $n$, in such a way that the {\em expected} frequency of any distinguished atom $z_i$ equals $\mu_i$. We address this problem by weighting the atoms with a $k$-tuple $\pi$ of real-valued weights, inducing a weighted distribution over the set of structures of size $n$. We first adapt the classical recursive random generation scheme into an algorithm taking ${\cal O}(n^{1+o(1)}+mn\log{n})$ arithmetic operations to draw $m$ structures from the $\pi$-weighted distribution. Secondly, we address the analytical computation of weights such that the targeted frequencies are achieved asymptotically, i. e. for large values of $n$. We derive systems of functional equations whose resolution gives an explicit relationship between $\pi$ and $\mu_1, \ldots, \mu_k$. Lastly, we give an algorithm in ${\cal O}(k n^4)$ for the inverse problem, {\it i.e.} computing the frequencies associated with a given $k$-tuple $\pi$ of weights, and an optimized version in ${\cal O}(k n^2)$ in the case of context-free languages. This allows for a heuristic resolution of the weights/frequencies relationship suitable for complex specifications. In the second alternative, the targeted distribution is given by a $k$ natural numbers $n_1, \ldots, n_k$ such that $n_1+\cdots+n_k+r=n$ where $r \geq 0$ is the number of undistinguished atoms. The structures must be generated uniformly among the set of structures of size $n$ that contain {\em exactly} $n_i$ atoms $z_i$ ($1 \leq i \leq k$). We give a ${\cal O}(r^2\prod_{i=1}^k n_i^2 +m n k \log n)$ algorithm for generating $m$ structures, which simplifies into a ${\cal O}(r\prod_{i=1}^k n_i +m n)$ for regular specifications.
- Conferences with proceedings
  - D. Gardy, and Y. Ponty
    
    Weighted random generation of context-free languages: Analysis of collisions in random urn occupancy models
    Proceedings of GASCOM - 8th conference on random generation of combinatorial structures - 2010, Canada, 14pp pp, 2010 [hal]
    
    Abstract (click to read). The present work analyzes the redundancy of sets of combinatorial objects produced by a weighted random generation algorithm proposed by Denise et al. This scheme associates weights to the terminals symbols of a weighted context-free grammar, extends this weight definition multiplicatively on words, and draws words of length $n$ with probability proportional their weight. We investigate the level of redundancy within a sample of $k$ word, the proportion of the total probability covered by $k$ words (coverage), the time (number of generations) of the first collision, and the time of the full collection. For these four questions, we use an analytic urn analogy to derive asymptotic estimates and/or polynomially computable exact forms. We illustrate these tools by an analysis of an RNA secondary structure statistical sampling algorithm introduced by Ding et al.
  - O. Bodini, and Y. Ponty
    
    Multi-dimensional Boltzmann Sampling of Languages
    Proceedings of 21st International Meeting on Probabilistic, Combinatorial, and Asymptotic Methods in the Analysis of Algorithms (AofA'10), Austria, DMTCS Proceedings, DMTCS, DMTCS Proceedings vol. AM, 21st International Meeting on Probabilistic, Combinatorial, and Asymptotic Methods in the Analysis of Algorithms (AofA'10):49-64 pp, 2010 [doi] [hal]
    
    Abstract (click to read). We address the uniform random generation of words from a context-free language (over an alphabet of size $k$), while constraining every letter to a targeted frequency of occurrence. Our approach consists in a multidimensional extension of Boltzmann samplers. We show that, under mostly $\textit{strong-connectivity}$ hypotheses, our samplers return a word of size in $[(1- \epsilon)n, (1+ \epsilon)n]$ and exact frequency in $\mathcal{O}(n^{1+k/2})$ expected time. Moreover, if we accept tolerance intervals of width in $\Omega (\sqrt{n})$ for the number of occurrences of each letters, our samplers perform an approximate-size generation of words in expected $\mathcal{O}(n)$ time. We illustrate our approach on the generation of Tetris tessellations with uniform statistics in the different types of tetraminoes.
2009
- Peer-reviewed journal articles
  - K. Darty, A. Denise, and Y. Ponty
    
    VARNA: Interactive drawing and editing of the RNA secondary structure.
    Bioinformatics, Oxford University Press (OUP), 25(15):1974-5 pp, 2009 [doi] [hal] [pubmed]
    
    Abstract (click to read). DESCRIPTION: VARNA is a tool for the automated drawing, visualization and annotation of the secondary structure of RNA, designed as a companion software for web servers and databases. FEATURES: VARNA implements four drawing algorithms, supports input/output using the classic formats dbn, ct, bpseq and RNAML and exports the drawing as five picture formats, either pixel-based (JPEG, PNG) or vector-based (SVG, EPS and XFIG). It also allows manual modification and structural annotation of the resulting drawing using either an interactive point and click approach, within a web server or through command-line arguments. AVAILABILITY: VARNA is a free software, released under the terms of the GPLv3.0 license and available at http://varna.lri.fr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
2008
- Peer-reviewed journal articles
  - Y. Ponty
    
    Efficient sampling of RNA secondary structures from the Boltzmann ensemble of low-energy: The boustrophedon method
    Journal of Mathematical Biology, Springer, 56(1-2):107--127 pp, 2008 [doi] [hal]
    
    Abstract (click to read). We adapt here a surprising technique, the boustrophedon method, to speed up the sampling of RNA secondary structures from the Boltzmann low-energy ensemble. This technique is simple and its implementation straight-forward, as it only requires a permutation in the order of some operations already performed in the stochastic traceback stage of these algorithms. It nevertheless greatly improves their worst-case complexity from O(n^2) to O(n log(n)), for n the size of the original sequence. Moreover the average-case complexity of the generation is shown to be improved from O(n√n) to O(n log(n)) in an Boltzmann-weighted homopolymer model based on the Nussinov–Jacobson free-energy model. These results are extended to the more realistic Turner free-energy model through experiments performed on both structured (Drosophilia melanogaster mRNA 5S) and hybrid (Staphylococcus aureus RNAIII) RNA sequences, using a boustrophedon modified version of the popular software UnaFold. This improvement allows for the sampling of greater and more significant sets of structures in a given time.
  - M. Bousquet-Mélou, and Y. Ponty
    
    Culminating paths
    Discrete Mathematics and Theoretical Computer Science, DMTCS, 10(2):125--152 pp, 2008 [hal]
    
    Abstract (click to read). Let a and b be two positive integers. A culminating path is a path of Z^2 that starts from (0,0), consists of steps (1,a) and (1,-b), stays above the x-axis and ends at the highest ordinate it ever reaches. These paths were first encountered in bioinformatics, in the analysis of similarity search algorithms. They are also related to certain models of Lorentzian gravity in theoretical physics. We first show that the language on a two letter alphabet that naturally encodes culminating paths is not context-free. Then, we focus on the enumeration of culminating paths. A step by step approach, combined with the kernel method, provides a closed form expression for the generating fucntion of culminating paths ending at a (generic) height k. In the case a=b, we derive from this expression the asymptotic behaviour of the number of culminating paths of length n. When a>b, we obtain the asymptotic behaviour by a simpler argument. When a= b, with no precomputation stage nor non-linear storage required. The choice of the best algorithm is not as clear when a
  - W. Lorenz, P. Clote, and Y. Ponty
    
    Asymptotics of RNA shapes
    Journal of Computational Biology, Mary Ann Liebert, 15(1):31--63 pp, 2008 [doi] [hal]
    
    Abstract (click to read). RNA shapes, introduced by Giegerich et al., provide a useful classification of the branching complexity for RNA secondary structures. In this paper, we derive an exact value for the asymptotic number of RNA shapes, by relying on an elegant relation between non-ambiguous, context-free grammars and generating functions. Our results provide a theoretical upper bound on the length of RNA sequences amenable to probabilistic shape analysis, under the assumption that any base can basepair with any other base. Since the relation between context-free grammars and asymptotic enumeration is simple yet not well-known in bioinformatics, we give a self-contained presentation with illustrative examples. Additionally, we prove a surprising 1-to-1 correspondence between Pi-shapes and Motzkin numbers
  - Y. Ponty, R. Istrate, E. Porcelli, and P. Clote
    
    LocalMove: computing on-lattice fits for biopolymers
    Nucleic Acids Research, Oxford University Press, 36(2):W216-W222 pp, 2008 [doi] [hal]
    
    Abstract (click to read). Given an input Protein Data Bank file (PDB) for a protein or RNA molecule, LocalMove is a web server that determines an on-lattice representation for the input biomolecule. The web server implements a Markov Chain Monte-Carlo algorithm with simulated annealing to compute an approximate fit for either the coarse-grain model or backbone model on either the cubic or face-centered cubic lattice. LocalMove returns a PDB file as output, as well as dynamic movie of 3D images of intermediate conformations during the computation. The LocalMove server is publicly available at http://bioinformatics.bc.edu/clotelab/localmove/.
- Conferences with proceedings
  - Y. Ponty
    
    Non-redundant random generation from weighted context-free languages ^*
    Proceedings of GASCOM - 7th conference on random generation of combinatorial structures - 2008, Italy, 12pp pp, 2008 [hal]
    
    Abstract (click to read). We address the non-redundant random generation of k words of length n from a context-free language. Additionally, we want to avoid a prede¯ned set of words. We study the limits of a rejection-based approach, whose time complexity is shown to grow exponentially in k in some cases. We propose an alternative recursive algorithm, whose careful implementation allows for a non-redundant generation of k words of size n in O(kn log n) arithmetic operations after the precomputation of O(n) numbers. The overall complexity is therefore dominated by the generation of k words, and the non-redundancy comes at a negligible cost.
2007
- Peer-reviewed journal articles
  - F. Fabrizzio, Y. Ponty, W. Lorenz, and P. Clote
    
    DIAL: A web server for the pairwise alignment of two RNA 3-dimensional structures using nucleotide, dihedral angle and base pairing similarities
    Nucleic Acids Research, Oxford University Press, 35(Web server issue):W659--668 pp, 2007 [doi] [hal]
    
    Abstract (click to read). DIAL (dihedral alignment) is a web server that provides public access to a new dynamic programming algorithm for pairwise 3D structural alignment of RNA. DIAL achieves quadratic time by performing an alignment that accounts for (i) pseudo-dihedral and/or dihedral angle similarity, (ii) nucleotide sequence similarity and (iii) nucleotide base-pairing similarity. DIAL provides access to three alignment algorithms: global (Needleman–Wunsch), local (Smith–Waterman) and semiglobal modified to yield motif search). Suboptimal alignments are optionally returned, and also the Boltzmann pair probabilities for aligned positions within the optimal alignment. If a non-zero suboptimal alignment score ratio is entered, then the semiglobal alignment algorithm may be used to detect structurally similar occurrences of a user-specified 3D motif. The query motif may be contiguous in the linear chain or fragmented in a number of noncontiguous regions. The DIAL web server provides graphical output which allows the user to view, rotate and enlarge the 3D superposition for the optimal (and suboptimal) alignment of query to target. Although graphical output is available for all three algorithms, the semiglobal motif search may be of most interest in attempts to identify RNA motifs. DIAL is available at http://bioinformatics.bc.edu/clotelab/DIAL.
2006
- Peer-reviewed journal articles
  - Y. Ponty, M. Termier, and A. Denise
    
    GenRGenS: Software for Generating Random Genomic Sequences and Structures
    Bioinformatics, Oxford University Press (OUP), 22(12):1534--1535 pp, 2006 [doi] [hal]
    
    Abstract (click to read). GenRGenS is a software tool dedicated to randomly generating genomic sequences and structures. It handles several classes of models useful for sequence analysis, such as Markov chains, hidden Markov models, weighted context-free grammars, regular expressions and PROSITE expressions. GenRGenS is the only program that can handle weighted context-free grammars, thus allowing the user to model and to generate structured objects (such as RNA secondary structures) of any given desired size. GenRGenS also allows the user to combine several of these different models at the same time. Availability: Source and executable files of GenRGenS (in Java) and the complete user's manual are freely available at http://www.lri.fr/bio/GenRGenS
- Theses
  - Y. Ponty
    
    Models for structured genomic sequences, random generation and applications Modélisation de séquences génomiques structurées, génération aléatoire et applications
    2006 [hal]
    
    Abstract (click to read). La mise en évidence des mécanismes de sélection agissant sur les données génomiques structurées (ARN, Protéines, ADN...) nécessite l'élaboration de modèles de séquences. Une fois un tel modèle élaboré, il est possible, au prix d'une analyse mathématique parfois complexe ou par le biais de la
    génération aléatoire, d'évaluer la significativité d'un phénomène observé. Tout d'abord, nous nous intéressons aux propriétés des grammaires pondérées, un formalisme particulièrement adapté à la modélisation de la structure des ARN, dérivant des algorithmes de génération aléatoire efficaces implémentés au sein du prototype GenRGenS. Nous abordons le calcul automatique des pondérations réalisant des valeurs observées pour les paramètres du modèle, ainsi qu'une implémentation basée sur une approche optimisation. Dans un second temps, nous abordons la modélisation de la structure secondaire d'ARN. Après quelques rappels de biologie moléculaire, nous proposons plusieurs modèles basés sur des grammaires pondérées permettant la génération de structures d'ARN réalistes. L'utilisation d'un algorithme d'optimisation permet le calculer des pondérations correspondant à certaines familles d'ARN. Nous proposons enfin un algorithme d'extraction de structure secondaire maximale dans une structure générale, qui permet de profiter des données récentes issues de la cristallographie. Le dernier chapitre de cette thèse s'intéresse à l'analyse d'un algorithme de recherche de similarité heuristique, dont la sensibilité s'avère étroitement liée à la probabilité de présence d'un motif au sein de marches aléatoires particulières, les chemins culminants. Ces marches restent positives, et atteignent une altitude maximale en leur dernier pas. Nous proposons un algorithme récursif de génération aléatoire pour ces chemins. En combinant des techniques issues de la combinatoire énumérative, l'analyse asymptotique et la théorie des langages, nous dérivons des algorithmes de génération aléatoire par rejet linéaires dans de nombreux cas.
2004
- Conferences with proceedings
  - G. Kucherov, L. Noé, and Y. Ponty
    
    Estimating seed sensitivity on homogeneous alignments
    Proceedings of 4th Symposium on Bioinformatics and bioengineering - BIBE'2004, Taiwan, Proceedings of the Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE'04), IEEE Computer Society, 387-394 pp, 2004 [doi] [hal]
    
    Abstract (click to read). We address the problem of estimating the sensitivity of seed-based similarity search algorithms. In contrast to approaches based on Markov models [18, 6, 3, 4, 10], we study the estimation based on homogeneous alignments. We describe an algorithm for counting and random generation of those alignments and an algorithm for exact computation of the sensitivity for a broad class of seed strategies. We provide experimental results demonstrating a bias introduced by ignoring the homogeneousness condition.
2003
- Miscellaneous & Submitted
  - Y. Ponty
    
    Enumerative combinatorcs and random generation of RNA secondary structures Etudes combinatoire et génération aléatoire des structures secondaires d'ARN Mémoire de stage effectué dans le cadre du DEA Algorithmique
    [hal]
    
    Abstract (click to read). Ce mémoire résume un travail opéré sous la direction d'Alain Denise de Mars à Juin 2003. Il consisteen une étude des propriétés combinatoires des structures secondaires d'ARN, en rapport avec le problème de la génération aléatoire de structures secondaires d'ARN réalistes.Dans une première partie, nous tenterons de familiariser le lecteur avec le contexte biologique sous-jacent aux problèmes étudiés.Nous présenterons ensuite un état de l'art de l'étude combinatoire des structures secondaire d'ARN. Pour cela, nous introduirons brièvement une série d'outils théoriques, comme les séries génératrices, et leurs comportements asymptotiques ou les grammaires non contextuelles. Enfin, nous évoquerons divers mécanismes de génération aléatoire et présenterons une contribution à l'étude des paramètres des structures secondaires d'ARN.Nous proposerons en annexe une description d'un logiciel dédié à la génération aléatoire de séquences génomiques, GenRGenS, dont l'auteur assure le développement depuis la version 1.0. Nous présenterons aussi un algorithme polynomial de génération aléatoire de chemins culminants, structures combinatoires non algébriques qui apparaissent lors de l'étude d'algorithmes d'alignements de séquence.
  - G. Kucherov, L. Noé, and Y. Ponty
    
    Estimating seed sensitivity on homogeneous alignments
    [hal]
    
    Abstract (click to read). We address the problem of measuring the sensitivity of seed-based similarity search algorithms. In contrast to approaches based on Markov models, we study the measurement based on homogeneous alignments. We describe an algorithm for counting and random generation of those alignments and an algorithm for exact computation of the sensitivity for a broad class of seed strategies. We provide experimental results demonstrating a bias introduced by ignoring the homogeneousness condition.