Publications
Browse by date Browse by category
 Peerreviewed journal articles (30)

International Journal of Foundations of Computer Science, World Scientific Publishing, To appear 2018 [hal]Abstract (click to expand). Pairwise ordered tree alignment are combinatorial objects that appear in important applications , such as RNA secondary structure comparison. However, the usual representation of tree alignments as supertrees is ambiguous, i.e. two distinct supertrees may induce identical sets of matches between identical pairs of trees. This ambiguity is uninformative, and detrimental to any probabilistic analysis. In this work, we consider tree alignments up to equivalence. Our first result is a precise asymptotic enumeration of tree alignments, obtained from a contextfree grammar by mean of basic analytic combinatorics. Our second result focuses on alignments between two given ordered trees S and T. By refining our grammar to align specific trees, we obtain a decomposition scheme for the space of alignments, and use it to design an efficient dynamic programming algorithm for sampling alignments under the GibbsBoltzmann probability distribution. This generalizes existing tree alignment algorithms, and opens the door for a probabilistic analysis of the space of suboptimal alignments. 
Journal of computational biology : a journal of computational molecular cell biology, Mary Ann Liebert , To appear 2018 [hal]Abstract (click to expand). Let $S_n$ denote the network of all RNA secondary structures of length $n$, in which undirected edges exist between structures $s$, $t$ such that $t$ is obtained from $s$ by the addition, removal or shift of a single base pair. Using contextfree grammars, generating functions and complex analysis, we show that the asymptotic average degree is $O(n)$ and that the asymptotic clustering coefficient is $O(1/n)$, from which it follows that the family $S_n$, $n = 1, 2, 3,\ldots$ of secondary structure networks is not smallworld. 
PLoS Computational Biology, Public Library of Science, 14(3):110 pp, 2018 [doi] [hal] [pubmed]Abstract (click to expand). We present a new educational initiative, called MeetU, that aims at training students to collaborativework in computational biology and at bridging the gap between education and research.MeetU mimics the set up of collaborative research projects and takes advantage of the most populartools for collaborative work and of cloud computing. Students are grouped in teams of 45 peopleand have to realize a project from A to Z that answers a challenging question in Biology.MeetU promotes "coopetition", as the students collaborate within and across the teams and are also incompetition with each other to develop the best final product.MeetU fosters interactions between different actors of education and research through the organization of a meeting day, open to everyone, where the students present their work to a jury of researchers and jury members give researchseminars. This very unique combination of education and research is strongly motivating for thestudents and provides a formidable opportunity for a scientific community to unite and increase itsvisibility. We report on our experience with MeetU in two French universities with master studentsin bioinformatics and modeling, with proteinprotein docking as the subject of the course.MeetU is easy to implement, virtually costs nothing and can be straightforwardly transferred to other fieldsand/or universities. All the information and data are available at http://www.meetu.org.


The BRaliBase dent  a tale of benchmark design and interpretation.Briefings in Bioinformatics, Oxford University Press (OUP), 18(2):306311 pp, 2017 [doi] [hal] [pubmed]Abstract (click to expand). BRaliBase is a widely used benchmark for assessing the accuracy of RNA secondary structure alignment methods. In most case studies based on the BRaliBase benchmark, one can observe a puzzling drop in accuracy in the 4060% sequence identity range, the socalled 'BRaliBase Dent'. In this article, we show this dent is owing to a bias in the composition of the BRaliBase benchmark, namely the inclusion of a disproportionate number of transfer RNAs, which exhibit a conserved secondary structure. Our analysis, aside of its interest regarding the specific case of the BRaliBase benchmark, also raises important questions regarding the design and use of benchmarks in computational biology. 
Algorithmica, Springer Verlag, 79(3):835856 pp, 2017 [doi] [hal]Abstract (click to expand). We consider the Combinatorial RNA Design problem, a minimal instance of RNA design where one must produce an RNA sequence that adopts a given secondary structure as its minimal freeenergy structure. We consider two freeenergy models where the contributions of base pairs are additive and independent: the purely combinatorial WatsonCrick model, which only allows equallycontributing A − U and C − G base pairs, and the realvalued NussinovJacobson model, which associates arbitrary energies to A − U, C − G and G − U base pairs. We first provide a complete characterization of designable structures using restricted alphabets and, in the fourletter alphabet, provide a complete characterization for designable structures without unpaired bases. When unpaired bases are allowed, we characterize extensive classes of (non)designable structures, and prove the closure of the set of designable structures under the stutter operation. Membership of a given structure to any of the classes can be tested in Θ(n) time, including the generation of a solution sequence for positive instances. Finally, we consider a structureapproximating relaxation of the design, and provide a Θ(n) algorithm which, given a structure S that avoids two trivially nondesignable motifs, transforms S into a designable structure constructively by adding at most one basepair to each of its stems. 
Briefings in Bioinformatics, Oxford University Press (OUP), To appear 2017 [doi] [hal] [pubmed]Abstract (click to expand). Computational programs for predicting RNA sequences with desired folding properties have been extensively developed and expanded in the past several years. Given a secondary structure, these programs aim to predict sequences that fold into a target minimum free energy secondary structure, while considering various constraints. This procedure is called inverse RNA folding. Inverse RNA folding has been traditionally used to design optimized RNAs with favorable properties, an application that is expected to grow considerably in the future in light of advances in the expanding new fields of synthetic biology and RNA nanostructures. Moreover, it was recently demonstrated that inverse RNA folding can successfully be used as a valuable preprocessing step in computational detection of novel noncoding RNAs. This review describes the most popular freeware programs that have been developed for such purposes, starting from RNAinverse that was devised when formulating the inverse RNA folding problem. The most recently published ones that consider RNA secondary structure as input are antaRNA, RNAiFold and incaRNAfbinv, each having different features that could be beneficial to specific biological problems in practice. The various programs also use distinct approaches, ranging from ant colony optimization to constraint programming, in addition to adaptive walk, simulated annealing and Boltzmann sampling. This review compares between the various programs and provides a simple description of the various possibilities that would benefit practitioners in selecting the most suitable program. It is geared for specific tasks requiring RNA design based on input secondary structure, with an outlook toward the future of RNA design programs. 
Two ribosome recruitment sites direct multiple translation events within HIV1 Gag open reading frameNucleic Acids Research, 45(12):7538 pp, 2017 [doi] [hal] [pubmed]Abstract (click to expand). 
Two ribosome recruitment sites direct multiple translation events within HIV1 Gag open reading frameNucleic Acids Research, Oxford University Press, 45(12):73827400 pp, 2017 [doi] [hal] [pubmed]Abstract (click to expand). In the late phase of the HIV virus cycle, the full length unspliced genomic RNA is exported to the cytoplasm and serves as mRNA to translate the Gag and Gagpol polyproteins. Three different translation initiation mechanisms responsible for Gag production have been described. However a rationale for the involvement of as many translation pathways in gRNA translation is yet to be defined. The GagIRES has the singularity to be located within the Gag open reading frame and to directly recruit the 40S ribosomal subunit. To further characterize this interaction, we first probed the GagIRES RNA structure. We then developed an innovative integrative modelling approach and propose a novel secondary structure model for the GagIRES. The minimal 40S ribosomal subunit binding site was further mapped using different assays. To our surprise, we found that at least two regions within GagIRES can independently recruit the ribosome. Next, we validated that these two regions influence Gag translation both in vitro and in cellulo. These binding sites are mostly unstructured and highly Arich, such sequences have previously been shown to be sufficient to recruit the ribosome and to support an IRES function. A combination of biochemical and functional data give insight into the GagIRES molecular mechanism and provide compelling evidences for its importance. Hypothesis about its physiological role reflecting its conservation amongst primate lentiviruses are proposed. 
DeCoSTAR: Reconstructing the ancestral organization of genes or genomes using reconciled phylogeniesGenome Biology and Evolution, Society for Molecular Biology and Evolution, 9(5):13121319 pp, 2017 [doi] [hal]Abstract (click to expand). DeCoSTAR is a software that aims at reconstructing the organization of ancestral genes or genomes in the form of sets of neighborhood relations (adjacencies) between pairs of ancestral genes or gene domains. It can also improve the assembly of fragmented genomes by proposing evolutionaryinduced adjacencies between scaffolding fragments. Ancestral genes or domains are deduced from reconciled phylogenetic trees under an evolutionary model that considers gains, losses, speciations, duplications, and transfers as possible events for gene evolution. Reconciliations are either given as input or computed with the ecceTERA package, into which DeCoSTAR is integrated. DeCoSTAR computes adjacency evolutionary scenarios using a scoring scheme based on a weighted sum of adjacency gains and breakages. Solutions, both optimal and nearoptimal, are sampled according to the BoltzmannGibbs distribution centered around parsimonious solutions, and statistical supports on ancestral and extant adjacencies are provided. DeCoSTAR supports the features of previouslycontributed tools that reconstruct ancestral adjacencies, namely DeCo, DeCoLT, ARTDeCo and DeClone. In a few minutes, DeCoSTAR can reconstruct the evolutionary history of domains inside genes, of gene fusion and fission events, or of gene order along chromosomes, for large data sets including dozens of whole genomes from all kingdoms of life. We illustrate the potential of DeCoSTAR with several applications: ancestral reconstruction of gene orders for Anopheles mosquito genomes, multidomain proteins in Drosophila, and gene fusion and fission detection in Actinobacteria. Availability: http://pbil.univlyon1.fr/software/DeCoSTAR


Nucleic Acids Research, Oxford University Press, 44(11):e104  e104 pp, 2016 [doi] [hal] [pubmed]Abstract (click to expand). Systematic structure probing experiments (e.g. SHAPE) of RNA mutants such as the mutateandmap protocol give us a direct access into the genetic robustness of ncRNA structures. Comparative studies of homologous sequences provide a distinct, yet complementary, approach to analyze structural and functional properties of noncoding RNAs. In this paper, we introduce a formal framework to combine the biochemical signal collected from mutateandmap experiments, with the evolutionary information available in multiple sequence alignments. We apply neutral theory principles to detect complex longrange dependencies between nucleotides of a single stranded RNA, and implement these ideas into a software called aRNhAck. We illustrate the biological significance of this signal and show that the nucleotides networks calculated with aRNhAck are correlated with nucleotides located in RNARNA, RNAprotein, RNADNA and RNAligand interfaces. aRNhAck is freely available at http://csb.cs.mcgill.ca/arnhack. 
ecceTERA: comprehensive gene treespecies tree reconciliation using parsimony.Bioinformatics (Oxford, England), 20568 pp, 2016 [hal] [pubmed]Abstract (click to expand). A gene treespecies tree reconciliation explains the evolution of a gene tree within the species tree given a model of genefamily evolution. We describe ecceTERA, a program that implements a generic parsimony reconciliation algorithm, which accounts for gene duplication, loss and transfer (DTL) as well as speciation, involving sampled and unsampled lineages, within undated, fully dated or partially dated species trees. The ecceTERA reconciliation model and algorithm generalize or improve upon most published DTL parsimony algorithms for binary species trees and binary gene trees. Moreover, ecceTERA can estimate accurate speciestree aware gene trees using amalgamation.ecceTERA is freely available under http://mbb.univmontp2.fr/MBB/download_sources/16__ecceTERA and can be run online at http://mbb.univmontp2.fr/MBB/subsection/softExec.php?soft=eccetera 
Nucleic Acids Research, Oxford University Press, 44(W1):W308  W314 pp, 2016 [doi] [hal]Abstract (click to expand). In recent years, new methods for computational RNA design have been developed and applied to various problems in synthetic biology and nanotechnology. Lately, there is considerable interest in incorporating essential biological information when solving the inverse RNA folding problem. Correspondingly, RNAfbinv aims at including biologically meaningful constraints and is the only program todate that performs a fragmentbased design of RNA sequences. In doing so it allows the design of sequences that do not necessarily exactly fold into the target, as long as the overall coarsegrained tree graph shape is preserved. Augmented by the weighted sampling algorithm of incaRNAtion, our web server called incaRNAfbinv implements the method devised in RNAfbinv and offers an interactive environment for the inverse folding of RNA using a fragmentbased design approach. It takes as input: a target RNA secondary structure; optional sequence and motif constraints; optional target minimum free energy, neutrality, and GC content. In addition to the design of synthetic regulatory sequences, it can be used as a preprocessing step for the detection of novel natural occurring RNAs. The two complementary methodologies RNAfbinv and incaRNAtion are merged together and fully implemented in our web server incaRNAfbinv, available at http://www.cs.bgu. ac.il/incaRNAfbinv.


Evolution of genes neighborhood within reconciled phylogenies: an ensemble approachBMC Bioinformatics, BioMed Central, 16(Suppl 19):S6 pp, 2015 [doi] [hal]Abstract (click to expand). CONTEXT:The reconstruction of evolutionary scenarios for whole genomesin terms of genome rearrangements is a fundamental problem in evolutionaryand comparative genomics. The DeCo algorithm, recently introducedby Berard et al., computes parsimonious evolutionary scenarios forgene adjacencies, from pairs of reconciled gene trees. However, asfor many combinatorial optimization algorithms, there can exist manycooptimal, or slightly suboptimal, evolutionary scenarios thatdeserve to be considered.CONTRIBUTION:We extend the DeCo algorithmto sample evolutionary scenarios from the whole solution space underthe Boltzmann distribution, and also to compute Boltzmann probabilitiesfor specific ancestral adjacencies.RESULTS:We apply our algorithmsto a dataset of mammalian gene trees and adjacencies, and observea significant reduction of the number of syntenic conflicts observedin the resulting ancestral gene adjacencies.


Theoretical Computer Science, Elsevier, 502:177194 pp, 2013 [doi] [hal]Abstract (click to expand). We address the nonredundant random generation of $k$ words of length $n$ in a contextfree language. Additionally, we want to avoid a predefined set of words. We study a rejectionbased approach, whose worstcase time complexity is shown to grow exponentially with $k$ for some specifications and in the limit case of a coupon collector. We propose two algorithms respectively based on the recursive method and on an unranking approach. We show how careful implementations of these algorithms allow for a nonredundant generation of $k$ words of length $n$ in $\mathcal{O}(k\cdot n\cdot \log{n})$ arithmetic operations, after a precomputation of $\Theta(n)$ numbers. The overall complexity is therefore dominated by the generation of $k$ words, and the nonredundancy comes at a negligible cost. 
A weighted sampling algorithm for the design of RNA sequences with targeted secondary structure and nucleotide distribution.Bioinformatics, Oxford University Press (OUP), 29(13):i30815 pp, 2013 [doi] [hal] [pubmed]Abstract (click to expand). MOTIVATIONS: The design of RNA sequences folding into predefined secondary structures is a milestone for many synthetic biology and gene therapy studies. Most of the current software uses similar local search strategies (i.e. a random seed is progressively adapted to acquire the desired folding properties) and more importantly do not allow the user to control explicitly the nucleotide distribution such as the GCcontent in their sequences. However, the latter is an important criterion for largescale applications as it could presumably be used to design sequences with better transcription rates and/or structural plasticity. RESULTS: In this article, we introduce IncaRNAtion, a novel algorithm to design RNA sequences folding into target secondary structures with a predefined nucleotide distribution. IncaRNAtion uses a global sampling approach and weighted sampling techniques. We show that our approach is fast (i.e. running time comparable or better than local search methods), seedless (we remove the bias of the seed in local search heuristics) and successfully generates highquality sequences (i.e. thermodynamically stable) for any GCcontent. To complete this study, we develop a hybrid method combining our global sampling approach with local search strategies. Remarkably, our glocal methodology overcomes both local and global approaches for sampling sequences with a specific GCcontent and target structure. AVAILABILITY: IncaRNAtion is available at csb.cs.mcgill.ca/incarnation/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. 
Journal of Computational Biology, Mary Ann Liebert, 20(11):90519 pp, 2013 [doi] [hal] [pubmed]Abstract (click to expand). The analysis of the sequencestructure relationship in RNA molecules is not only essential for evolutionary studies but also for concrete applications such as errorcorrection in next generation sequencing (NGS) technologies. The prohibitive sizes of the mutational and conformational landscapes, combined with the volume of data to process, require efficient algorithms to compute sequencestructure properties. In this article, we address the correction of NGS errors by calculating which mutations most increase the likelihood of a sequence to a given structure and RNA family. We introduce RNApyro, an efficient, linear time and space insideoutside algorithm that computes exact mutational probabilities under secondary structure and evolutionary constraints given as a multiple sequence alignment with a consensus structure. We develop a scoring scheme combining classical stacking basepair energies to novel isostericity scores and apply our techniques to correct pointwise errors in 5s and 16s rRNA sequences. Our results suggest that RNApyro is a promising algorithm to complement existing tools in the NGS errorcorrection pipeline. 
Nucleic Acids Research, Oxford University Press, 41(Web Server issue):W4805 pp, 2013 [doi] [hal] [pubmed]Abstract (click to expand). More than a simple carrier of the genetic information, messenger RNA (mRNA) coding regions can also harbor functional elements that evolved to control different posttranscriptional processes, such as mRNA splicing, localization and translation. Functional elements in RNA molecules are often encoded by secondary structure elements. In this aticle, we introduce Structural Profile Assignment of RNA Coding Sequences (SPARCS), an efficient method to analyze the (secondary) structure profile of proteincoding regions in mRNAs. First, we develop a novel algorithm that enables us to sample uniformly the sequence landscape preserving the dinucleotide frequency and the encoded amino acid sequence of the input mRNA. Then, we use this algorithm to generate a set of artificial sequences that is used to estimate the Zscore of classical structural metrics such as the sum of base pairing probabilities and the base pairing entropy. Finally, we use these metrics to predict structured and unstructured regions in the input mRNA sequence. We applied our methods to study the structural profile of the ASH1 genes and recovered key structural elements. A web server implementing this discovery pipeline is available at http://csb.cs.mcgill.ca/sparcs together with the source code of the sampling algorithm. 
Proteinprotein interactions in a crowded environment: an analysis via crossdocking simulations and evolutionary informationPLoS Computational Biology, Public Library of Science, 9(12):e1003369 pp, 2013 [doi] [hal]Abstract (click to expand). Proteinprotein interactions (PPI) are at the heart of the molecular processes governing life and constitute an increasingly important target for drug design. Given their importance, it is vital to determine which protein interactions have functional relevance and to characterize the protein competition inherent to crowded environments, as the cytoplasm or the cellular organelles. We show that combining coarsegrain molecular crossdocking simulations and binding site predictions based on evolutionary sequence analysis is a viable route to identify true interacting partners for hundreds of proteins with a variate set of protein structures and interfaces. Also, we realize a largescale analysis of protein binding promiscuity and provide a numerical characterization of partner competition and level of interaction strength for about 28000 falsepartner interactions. Finally, we demonstrate that binding site prediction is useful to discriminate native partners, but also to scale up the approach to thousands of protein interactions. This study is based on the large computational effort made by thousands of internautes helping World Community Grid over a period of 7 months. The complete dataset issued by the computation and the analysis is released to the scientific community.


Journal of Mathematical Biology, Springer Verlag (Germany), 65(3):58199 pp, 2012 [doi] [hal] [pubmed]Abstract (click to expand). In "The ends of a large RNA molecule are necessarily close", Yoffe et al. (Nucleic Acids Res 39(1):292299, 2011) used the programs RNAfold [resp. RNAsubopt] from Vienna RNA Package to calculate the distance between 5' and 3' ends of the minimum free energy secondary structure [resp. thermal equilibrium structures] of viral and random RNA sequences. Here, the 5'3' distance is defined to be the length of the shortest path from 5' node to 3' node in the undirected graph, whose edge set consists of edges {i, i + 1} corresponding to covalent backbone bonds and of edges {i, j} corresponding to canonical base pairs. From repeated simulations and using a heuristic theoretical argument, Yoffe et al. conclude that the 5'3' distance is less than a fixed constant, independent of RNA sequence length. In this paper, we provide a rigorous, mathematical framework to study the expected distance from 5' to 3' ends of an RNA sequence. We present recurrence relations that precisely define the expected distance from 5' to 3' ends of an RNA sequence, both for the Turner nearest neighbor energy model, as well as for a simple homopolymer model first defined by Stein and Waterman. We implement dynamic programming algorithms to compute (rather than approximate by repeated application of Vienna RNA Package) the expected distance between 5' and 3' ends of a given RNA sequence, with respect to the Turner energy model. Using methods of analytical combinatorics, that depend on complex analysis, we prove that the asymptotic expected 5'3' distance of length n homopolymers is approximately equal to the constant 5.47211, while the asymptotic distance is 6.771096 if hairpins have a minimum of 3 unpaired bases and the probability that any two positions can form a base pair is 1/4. Finally, we analyze the 5'3' distance for secondary structures from the STRAND database, and conclude that the 5'3' distance is correlated with RNA sequence length. 
Using the Fast Fourier Transform to Accelerate the Computational Search for RNA Conformational Switches ^{*}PLoS ONE, Public Library of Science, 7(12):e50506 pp, 2012 [doi] [hal]Abstract (click to expand). Using complex roots of unity and the Fast Fourier Transform, we design a new thermodynamicsbased algorithm, FFTbor, that computes the Boltzmann probability that secondary structures differ by k base pairs from an arbitrary initial structure of a given RNA sequence. The algorithm, which runs in quartic time O(n^4) and quadratic space O(n^2), is used to determine the correlation between kinetic folding speed and the ruggedness of the energy landscape, and to predict the location of riboswitch expression platform candidates. A web server is available at http://bioinformatics.bc.edu/clotelab/FFTbor/ 
A global sampling approach to designing and reengineering RNA secondary structures.Nucleic Acids Research, Oxford University Press, 40(20):1004110052 pp, 2012 [doi] [hal] [pubmed]Abstract (click to expand). The development of algorithms for designing artificial RNA sequences that fold into specific secondary structures has many potential biomedical and synthetic biology applications. To date, this problem remains computationally difficult, and current strategies to address it resort to heuristics and stochastic search techniques. The most popular methods consist of two steps: First a random seed sequence is generated; next, this seed is progressively modified (i.e. mutated) to adopt the desired folding properties. Although computationally inexpensive, this approach raises several questions such as (i) the influence of the seed; and (ii) the efficiency of singlepath directed searches that may be affected by energy barriers in the mutational landscape. In this article, we present RNAensign, a novel paradigm for RNA design. Instead of taking a progressive adaptive walk driven by local search criteria, we use an efficient global sampling algorithm to examine large regions of the mutational landscape under structural and thermodynamical constraints until a solution is found. When considering the influence of the seeds and the target secondary structures, our results show that, compared to singlepath directed searches, our approach is more robust, succeeds more often and generates more thermodynamically stable sequences. An ensemble approach to RNA design is thus well worth pursuing as a complement to existing approaches. RNAensign is available at http://csb.cs.mcgill.ca/RNAensign.


An unbiased adaptive sampling algorithm for the exploration of RNA mutational landscapes under evolutionary pressure. ^{*}Journal of Computational Biology, Mary Ann Liebert, 18(11):146579 pp, 2011 [doi] [hal] [pubmed]Abstract (click to expand). The analysis of the relationship between sequences and structures (i.e., how mutations affect structures and reciprocally how structures influence mutations) is essential to decipher the principles driving molecular evolution, to infer the origins of genetic diseases, and to develop bioengineering applications such as the design of artificial molecules. Because their structures can be predicted from the sequence data only, RNA molecules provide a good framework to study this sequencestructure relationship. We recently introduced a suite of algorithms called RNAmutants which allows a complete exploration of RNA sequencestructure maps in polynomial time and space. Formally, RNAmutants takes an input sequence (or seed) to compute the Boltzmannweighted ensembles of mutants with exactly k mutations, and sample mutations from these ensembles. However, this approach suffers from major limitations. Indeed, since the Boltzmann probabilities of the mutations depend of the free energy of the structures, RNAmutants has difficulties to sample mutant sequences with low G+Ccontents. In this article, we introduce an unbiased adaptive sampling algorithm that enables RNAmutants to sample regions of the mutational landscape poorly covered by classical algorithms. We applied these methods to sample mutations with low G+Ccontents. These adaptive sampling techniques can be easily adapted to explore other regions of the sequence and structural landscapes which are difficult to sample. Importantly, these algorithms come at a minimal computational cost. We demonstrate the insights offered by these techniques on studies of complete RNA sequence structures maps of sizes up to 40 nucleotides. Our results indicate that the G+Ccontent has a strong influence on the size and shape of the evolutionary accessible sequence and structural spaces. In particular, we show that low G+Ccontents favor the apparition of internal loops and thus possibly the synthesis of tertiary structure motifs. On the other hand, high G+Ccontents significantly reduce the size of the evolutionary accessible mutational landscapes.


Theoretical Computer Science, Elsevier, 411(4042):35273552 pp, 2010 [doi] [hal]Abstract (click to expand). Consider a class of decomposable combinatorial structures, using different types of atoms ${\cal Z} = \{z_1,\ldots ,z_{{\cal Z}}\}$. We address the random generation of such structures with respect to a size $n$ and a targeted distribution in $k$ of its \emph{distinguished} atoms. We consider two variations on this problem. In the first alternative, the targeted distribution is given by $k$ real numbers $\mu_1, \ldots, \mu_k$ such that $0 < \mu_i < 1$ for all $i$ and $\mu_1+\cdots+\mu_k \leq 1$. We aim to generate random structures among the whole set of structures of a given size $n$, in such a way that the {\em expected} frequency of any distinguished atom $z_i$ equals $\mu_i$. We address this problem by weighting the atoms with a $k$tuple $\pi$ of realvalued weights, inducing a weighted distribution over the set of structures of size $n$. We first adapt the classical recursive random generation scheme into an algorithm taking ${\cal O}(n^{1+o(1)}+mn\log{n})$ arithmetic operations to draw $m$ structures from the $\pi$weighted distribution. Secondly, we address the analytical computation of weights such that the targeted frequencies are achieved asymptotically, i. e. for large values of $n$. We derive systems of functional equations whose resolution gives an explicit relationship between $\pi$ and $\mu_1, \ldots, \mu_k$. Lastly, we give an algorithm in ${\cal O}(k n^4)$ for the inverse problem, {\it i.e.} computing the frequencies associated with a given $k$tuple $\pi$ of weights, and an optimized version in ${\cal O}(k n^2)$ in the case of contextfree languages. This allows for a heuristic resolution of the weights/frequencies relationship suitable for complex specifications. In the second alternative, the targeted distribution is given by a $k$ natural numbers $n_1, \ldots, n_k$ such that $n_1+\cdots+n_k+r=n$ where $r \geq 0$ is the number of undistinguished atoms. The structures must be generated uniformly among the set of structures of size $n$ that contain {\em exactly} $n_i$ atoms $z_i$ ($1 \leq i \leq k$). We give a ${\cal O}(r^2\prod_{i=1}^k n_i^2 +m n k \log n)$ algorithm for generating $m$ structures, which simplifies into a ${\cal O}(r\prod_{i=1}^k n_i +m n)$ for regular specifications.


Bioinformatics, Oxford University Press (OUP), 25(15):19745 pp, 2009 [doi] [hal] [pubmed]Abstract (click to expand). DESCRIPTION: VARNA is a tool for the automated drawing, visualization and annotation of the secondary structure of RNA, designed as a companion software for web servers and databases. FEATURES: VARNA implements four drawing algorithms, supports input/output using the classic formats dbn, ct, bpseq and RNAML and exports the drawing as five picture formats, either pixelbased (JPEG, PNG) or vectorbased (SVG, EPS and XFIG). It also allows manual modification and structural annotation of the resulting drawing using either an interactive point and click approach, within a web server or through commandline arguments. AVAILABILITY: VARNA is a free software, released under the terms of the GPLv3.0 license and available at http://varna.lri.fr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Efficient sampling of RNA secondary structures from the Boltzmann ensemble of lowenergy: The boustrophedon methodJournal of Mathematical Biology, Springer Verlag (Germany), 56(12):107127 pp, 2008 [doi] [hal]Abstract (click to expand). We adapt here a surprising technique, the boustrophedon method, to speed up the sampling of RNA secondary structures from the Boltzmann lowenergy ensemble. This technique is simple and its implementation straightforward, as it only requires a permutation in the order of some operations already performed in the stochastic traceback stage of these algorithms. It nevertheless greatly improves their worstcase complexity from O(n^2) to O(n log(n)), for n the size of the original sequence. Moreover the averagecase complexity of the generation is shown to be improved from O(n√n) to O(n log(n)) in an Boltzmannweighted homopolymer model based on the Nussinov–Jacobson freeenergy model. These results are extended to the more realistic Turner freeenergy model through experiments performed on both structured (Drosophilia melanogaster mRNA 5S) and hybrid (Staphylococcus aureus RNAIII) RNA sequences, using a boustrophedon modified version of the popular software UnaFold. This improvement allows for the sampling of greater and more significant sets of structures in a given time. 
Discrete Mathematics and Theoretical Computer Science, DMTCS, 10(2):125152 pp, 2008 [hal]Abstract (click to expand). Let a and b be two positive integers. A culminating path is a path of Z^2 that starts from (0,0), consists of steps (1,a) and (1,b), stays above the xaxis and ends at the highest ordinate it ever reaches. These paths were first encountered in bioinformatics, in the analysis of similarity search algorithms. They are also related to certain models of Lorentzian gravity in theoretical physics. We first show that the language on a two letter alphabet that naturally encodes culminating paths is not contextfree. Then, we focus on the enumeration of culminating paths. A step by step approach, combined with the kernel method, provides a closed form expression for the generating fucntion of culminating paths ending at a (generic) height k. In the case a=b, we derive from this expression the asymptotic behaviour of the number of culminating paths of length n. When a>b, we obtain the asymptotic behaviour by a simpler argument. When a= b, with no precomputation stage nor nonlinear storage required. The choice of the best algorithm is not as clear when a 
Asymptotics of RNA shapesJournal of Computational Biology, Mary Ann Liebert, 15(1):3163 pp, 2008 [doi] [hal]Abstract (click to expand). RNA shapes, introduced by Giegerich et al., provide a useful classification of the branching complexity for RNA secondary structures. In this paper, we derive an exact value for the asymptotic number of RNA shapes, by relying on an elegant relation between nonambiguous, contextfree grammars and generating functions. Our results provide a theoretical upper bound on the length of RNA sequences amenable to probabilistic shape analysis, under the assumption that any base can basepair with any other base. Since the relation between contextfree grammars and asymptotic enumeration is simple yet not wellknown in bioinformatics, we give a selfcontained presentation with illustrative examples. Additionally, we prove a surprising 1to1 correspondence between Pishapes and Motzkin numbers 
LocalMove: computing onlattice fits for biopolymersNucleic Acids Research, Oxford University Press, 36(2):W216W222 pp, 2008 [doi] [hal]Abstract (click to expand). Given an input Protein Data Bank file (PDB) for a protein or RNA molecule, LocalMove is a web server that determines an onlattice representation for the input biomolecule. The web server implements a Markov Chain MonteCarlo algorithm with simulated annealing to compute an approximate fit for either the coarsegrain model or backbone model on either the cubic or facecentered cubic lattice. LocalMove returns a PDB file as output, as well as dynamic movie of 3D images of intermediate conformations during the computation. The LocalMove server is publicly available at http://bioinformatics.bc.edu/clotelab/localmove/.


DIAL: A web server for the pairwise alignment of two RNA 3dimensional structures using nucleotide, dihedral angle and base pairing similaritiesNucleic Acids Research, Oxford University Press, 35(Web server issue):W659668 pp, 2007 [doi] [hal]Abstract (click to expand). DIAL (dihedral alignment) is a web server that provides public access to a new dynamic programming algorithm for pairwise 3D structural alignment of RNA. DIAL achieves quadratic time by performing an alignment that accounts for (i) pseudodihedral and/or dihedral angle similarity, (ii) nucleotide sequence similarity and (iii) nucleotide basepairing similarity. DIAL provides access to three alignment algorithms: global (Needleman–Wunsch), local (Smith–Waterman) and semiglobal modified to yield motif search). Suboptimal alignments are optionally returned, and also the Boltzmann pair probabilities for aligned positions within the optimal alignment. If a nonzero suboptimal alignment score ratio is entered, then the semiglobal alignment algorithm may be used to detect structurally similar occurrences of a userspecified 3D motif. The query motif may be contiguous in the linear chain or fragmented in a number of noncontiguous regions. The DIAL web server provides graphical output which allows the user to view, rotate and enlarge the 3D superposition for the optimal (and suboptimal) alignment of query to target. Although graphical output is available for all three algorithms, the semiglobal motif search may be of most interest in attempts to identify RNA motifs. DIAL is available at http://bioinformatics.bc.edu/clotelab/DIAL.


Bioinformatics, Oxford University Press (OUP), 22(12):15341535 pp, 2006 [doi] [hal]Abstract (click to expand). GenRGenS is a software tool dedicated to randomly generating genomic sequences and structures. It handles several classes of models useful for sequence analysis, such as Markov chains, hidden Markov models, weighted contextfree grammars, regular expressions and PROSITE expressions. GenRGenS is the only program that can handle weighted contextfree grammars, thus allowing the user to model and to generate structured objects (such as RNA secondary structures) of any given desired size. GenRGenS also allows the user to combine several of these different models at the same time. Availability: Source and executable files of GenRGenS (in Java) and the complete user's manual are freely available at http://www.lri.fr/bio/GenRGenS

 Conferences with proceedings (22)

Proceedings of RECOMB 2018 – 22nd Annual International Conference on Research in Computational Molecular Biology, France, 2018 [hal]Abstract (click to expand). The design of multistable RNA molecules has important applications in biology, medicine, and biotechnology. Synthetic design approaches profit strongly from effective insilico methods, which can tremendously impact their cost and feasibility. We revisit a central ingredient of most insilico design methods: the sampling of sequences for the design of multitarget structures, possibly including pseudoknots. For this task, we present the efficient, tree decompositionbased algorithm \ourprog{}. Our fixed parameter tractable approach is underpinned by establishing the $\#${\sf P}hardness of uniform sampling. Modeling the problem as a constraint network, \ourprog{} supports generic Boltzmannweighted sampling for arbitrary additive RNA energy models; this enables the generation of RNA sequences meeting specific goals like expected free energies or \GCbcontent. Finally, we empirically study general properties of the approach and generate biologically relevant multitarget Boltzmannweighted designs for a common design benchmark. Generating seed sequences with \ourprog{}, we demonstrate significant improvements over the previously best multitarget sampling strategy (uniform sampling).Our software is freely available at: https://github.com/yannponty/RNARedPrint .


Proceedings of ISMB/ECCB  25th Annual international conference on Intelligent Systems for Molecular Biology/16th European Conference on Computational Biology  2017, Czech Republic, 33(14):i283  i292 pp, 2017 [doi] [hal]Abstract (click to expand). Motivation: Kinetics is key to understand many phenomena involving RNAs, such as cotranscriptional folding and riboswitches. Exact outofequilibrium studies induce extreme computational demands, leading stateoftheart methods to rely on approximated kinetics landscapes, obtained using sampling strategies that strive to generate the key landmarks of the landscape topology. However, such methods are impeded by a large level of redundancy within sampled sets. Such a redundancy is uninformative, and obfuscates important intermediate states, leading to an incomplete vision of RNA dynamics. Results: We introduce RNANR, a new set of algorithms for the exploration of RNA kinetics landscapes at the secondary structure level. RNANR considers locally optimal structures, a reduced set of RNA conformations, in order to focus its sampling on basins in the kinetic landscape. Along with an exhaustive enumeration, RNANR implements a novel nonredundant stochastic sampling, and offers a rich array of structural parameters. Our tests on both real and random RNAs reveal that RNANR allows to generate more unique structures in a given time than its competitors, and allows a deeper exploration of kinetics landscapes. Availability: RNANR is freely available at https://project.inria.fr/rnalands/rnanr 
An EM algorithm for mapping short reads in multiple RNA structure probing experimentsProceedings of Matbio2017, United Kingdom, 2017 [hal]Abstract (click to expand). An accurate mapping of reads against the sequence of reference is the first step to grant a good NGS data analysis.However, when mapping is about assigning reads to a set of RNA variants, in the case of simultaneous sequencing,the task become hard to handle. Many algorithms have been developed to overcome the issue of mapping readsagainst a set of homologous sequences at one time but the problem is not fully resolved, particularly when dealingwith short reads. The issue addressed in our study is much more challenging; In addition to the parallel assignmentissue in the presence of short reads, RNA variants molecules, used for the library sequencing preparation step,undergo a specific experimental treatment SHAPE causing the formation of mutations at the level of structurallyunpaired nucleotides. Mutations due to SHAPE might lead to a missmapping i.e. a read could be derived from agiven RNA variant i and because of SHAPE mutations it becomes more appropriate to assign it to the variant jfrom which the read has the shortest base distance. In an ongoing work, we are trying to resolve the unprecedentedmapping question trough an Expectation Maximization (EM) algorithm where each RNA variant from the setof references would be characterized by a SHAPE mutational profile instead of being merely characterized by asequence of nucleotides. The EM algorithm aims to maximize the likelihood of a read to be derived from a specificRNA variant and to assess its contribution to build the RNA associated mutational profile.


Proceedings of ALCOB  3rd International Conference on Algorithms for Computational Biology  2016, Spain, Springer, 9702:5364 pp, 2016 [doi] [hal]Abstract (click to expand). Pairwise ordered tree alignment are combinatorial objects that appear in RNA secondary structure comparison. However, the usual representation of tree alignments as supertrees is ambiguous, i.e. two distinct supertrees may induce identical sets of matches between identical pairs of trees. This ambiguity is uninformative, and detrimental to any probabilistic analysis.In this work, we consider tree alignments up to equivalence. Our first result is a precise asymptotic enumeration of tree alignments, obtained from a contextfree grammar by mean of basic analytic combinatorics. Our second result focuses on alignments between two given ordered trees $S$ and $T$. By refining our grammar to align specific trees, we obtain a decomposition scheme for the space of alignments, and use it to design an efficient dynamic programming algorithm for sampling alignments under the GibbsBoltzmann probability distribution. This generalizes existing tree alignment algorithms, and opens the door for a probabilistic analysis of the space of suboptimal RNA secondary structures alignments. 
Proceedings of GASCOM  10th conference on random generation of combinatorial structures  2016, France, Electronic Notes in Discrete Mathematics, 59(Supp. C):99  114 pp, 2016 [doi] [hal]Abstract (click to expand). A lattice walk model is said to be reluctant if the defining step set has a strong drift towards the boundaries. We describe efficient random generation strategies for these walks.


Proceedings of ISBRA  11th International Symposium on Bioinformatics Research and Applications  2015, United States, 9096:260271 pp, 2015 [doi] [hal]Abstract (click to expand). The availability of a large number of assembled genomes opens the way to study the evolution of syntenic character within a phylogenetic context. The DeCo algorithm, recently introduced by Bérard et al. allows the computation of parsimonious evolutionary scenarios for gene adjacencies, from pairs of reconciled gene trees. Following the approach pioneered by Sturmfels and Pachter, we describe how to modify the DeCo dynamic programming algorithm to identify classes of cost schemes that generates similar parsimonious evolutionary scenarios for gene adjacencies, as well as the robustness to changes to the cost scheme of evolutionary events of the presence or absence of specific ancestral gene adjacencies. We apply our method to six thousands mammalian gene families, and show that computing the robustness to changes to cost schemes provides new and interesting insights on the evolution of gene adjacencies and the DeCo model. 
Proceedings of CPM  26th Annual Symposium on Combinatorial Pattern Matching, Italy, 2015 [hal]Abstract (click to expand). In this work, we consider the Combinatorial RNA Design problem, a minimal instance of the RNA design problem which aims at finding a sequence that admits a given target as its unique base pair maximizing structure. We provide complete characterizations for the structures that can be designed using restricted alphabets. Under a classic fourletter alphabet, we provide a complete characterization of designable structures without unpaired bases. When unpaired bases are allowed, we provide partial characterizations for classes of designable/undesignable structures, and show that the class of designable structures is closed under the stutter operation. Membership of a given structure to any of the classes can be tested in linear time and, for positive instances, a solution can be found in linear time. Finally, we consider a structureapproximating version of the problem that allows to extend bands (helices) and, assuming that the input structure avoids two motifs, we provide a lineartime algorithm that produces a designable structure with at most twice more base pairs than the input structure.


Proceedings of BSB  Brazilian Symposium on Bioinformatics  2014, Brazil, Advances in Bioinformatics and Computational Biology, Springer, 8826:4956 pp, 2014 [doi] [hal]Abstract (click to expand). We consider a recently introduced dynamic programming scheme to compute parsimonious evolutionary scenarios for gene adjacencies. We extend this scheme to sample evolutionary scenarios from the whole solution space under the Boltzmann distribution. We apply our algorithms to a dataset of mammalian gene trees and adjacencies, and observe a significant reduction of the number of syntenic inconsistencies observed in the resulting ancestral gene adjacencies.


Proceedings of RECOMB  17th Annual International Conference on Research in Computational Molecular Biology  2013, China, 2013 [hal]Abstract (click to expand). Analysis of the sequencestructure relationship in RNA molecules are essential to evolutionary studies but also to concrete applications such as errorcorrection methodologies in sequencing technologies. The prohibitive sizes of the mutational and conformational landscapes combined with the volume of data to proceed require e cient algorithms to compute sequencestructure properties. More speci cally, here we aim to calculate which mutations increase the most the likelihood of a sequence to a given structure and RNA family. In this paper, we introduce RNApyro, an e cient lineartime and space insideoutside algorithm that computes exact mutational probabilities under secondary structure and evolutionary constraints given as a multiple sequence alignment with a consensus structure. We develop a scoring scheme combining classical stacking base pair energies to novel isostericity scales, and apply our techniques to correct pointwise errors in 5s rRNA sequences. Our results suggest that RNApyro is a promising algorithm to complement existing tools in the NGS errorcorrection pipeline. 
Proceedings of ISMB/ECCB  21st Annual international conference on Intelligent Systems for Molecular Biology/12th European Conference on Computational Biology  2013, Germany, 2013 [hal]Abstract (click to expand). Motivations: The design of RNA sequences folding into predefined secondary structures is a milestone for many synthetic biology and gene therapy studies. Most of the current software uses similar local search strategies (i.e. a random seed is progressively adapted to acquire the desired folding properties) and more importantly do not allow the user to control explicitly the nucleotide distribution such as the GCcontent in their sequences. However, the latter is an important criterion for largescale applications as it could presumably be used to design sequences with better transcription rates and/or structural plasticity. Results: In this paper, we introduce IncaRNAtion, a novel algorithm to design RNA sequences folding into target secondary structures with a predefined nucleotide distribution. IncaRNAtion uses a global sampling approach and weighted sampling techniques. We show that our approach is fast (i.e. running time comparable or better than local search methods), seedless (we remove the bias of the seed in local search heuristics), and successfully generates highquality sequences (i.e. thermodynamically stable) for any GCcontent. To complete this study, we develop an hybrid method combining our global sampling approach with local search strategies. Remarkably, our glocal methodology overcomes both local and global approaches for sampling sequences with a specific GC content and target structure. Availability: IncaRNAtion is available at csb.cs.mcgill.ca/incarnation/ 
Proceedings of RECOMB  17th Annual International Conference on Research in Computational Molecular Biology  2013, China, 2013 [hal]Abstract (click to expand). We describe the broad outline of a new thermodynamicsbased algorithm, FFTbor, that uses the fast Fourier transform to perform polynomial interpolation to compute the Boltzmann probability that secondary structures di er by k base pairs from an arbitrary reference structure of a given RNA sequence. The algorithm, which runs in quartic time O(n4) and quadratic space O(n2), is used to determine the correlation between kinetic folding speed and the ruggedness of the energy landscape,and to predict the location of riboswitch expression platform candidates. The full paper appears in PLoS ONE (2012) 19 Dec 2012. A web server is available at http://bioinformatics.bc.edu/clotelab/FFTbor/. 
Proceedings of ACMBCB  ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics  2013, United States, 2013 [hal]Abstract (click to expand). The problem of RNA secondary structure design (also called inverse folding) is the following: given a target secondary structure, one aims to create a sequence that folds into, or is compatible with, a given structure. In several practical applications in biology, additional constraints must be taken into account, such as the presence/absence of regulatory motifs, either at a specific location or anywhere in the sequence. In this study, we investigate the design of RNA sequences from their targeted secondary structure, given these additional sequence constraints. To this purpose, we develop a general framework based on concepts of language theory, namely contextfree grammars and finite automata. We efficiently combine a comprehensive set of constraints into a unifying contextfree grammar of moderate size. From there, we use generic generic algorithms to perform a (weighted) random generation, or an exhaustive enumeration, of candidate sequences. The resulting method, whose complexity scales linearly with the length of the RNA, was implemented as a standalone program. The resulting software was embedded into a publicly available dedicated web server. The applicability demonstrated of the method on a concrete case study dedicated to Exon Splicing Enhancers, in which our approach was successfully used in the design of \emph{in vitro} experiments.


Proceedings of AOFA  23rd International Meeting on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms  2012, Canada, DMTCS, DMTCS Proceedings vol. AQ, 23rd Intern. Meeting on Probabilistic, Combinatorial, and Asymptotic Methods for the Analysis of Algorithms (AofA'12):243264 pp, 2012 [hal]Abstract (click to expand). We consider the word collector problem, i.e. the expected number of calls to a random weighted generator before all the words of a given length in a language are generated. The originality of this instance of the nonuniform coupon collector lies in the, potentially large, multiplicity of the words/coupons of a given probability/composition. We obtain a general theorem that gives an asymptotic equivalent for the expected waiting time of a general version of the Coupon Collector. This theorem is especially wellsuited for classes of coupons featuring high multiplicities. Its application to a given language essentially necessitates knowledge on the number of words of a given composition/probability. We illustrate the application of our theorem, in a stepbystep fashion, on four exemplary languages, whose analyses reveal a large diversity of asymptotic waiting times, generally expressible as $\kappa \cdot m^p \cdot (\log{m})^q \cdot (\log \log{m})^r$, for $m$ the number of words, and $p, q, r$ some positive real numbers. 
Proceedings of CPM  23rd Annual Symposium on Combinatorial Pattern Matching  2012, Finland, Combinatorial Pattern Matching, Springer, 7354:321333 pp, 2012 [doi] [hal]Abstract (click to expand). La prédiction du repliement, avec pseudonoeuds généraux, d'une séquence d'ARN de taille $n$ est équivalent à la recherche d'un couplage d'énergie libre minimale. Dans un modèle d'énergie simple, où chaque paire de base contribue indépendamment à l'énergie, ce problème peut être résolu en temps $\Theta(n^3)$ grâce à une variante d'un algorithme de couplage pondéré maximal. Cependant, le même problème a été démontré NPdifficile dans le modèle d'énergie dit des plus proches voisins. Dans ce travail, nous étudions les propriétés du problème sous un modèle d'empilements, constituant un modèle intermédiaire entre ceux d'appariement et des plus proches voisins. Nous démontrons tout d'abord que le repliement avec pseudonoeuds de l'ARN reste NPdifficile dans de nombreuses valuations du modèle d'énergie. . Par ailleurs, nous montrons que ce problème est approximable, en proposant un algorithme polynomial garantissant une $1/5$approximation. Ce résultat illustre une différence essentielle entre ce modèle et celui des plus proches voisins, pour lequel nous montrons qu'il ne peut être approché à aucun ratio positif par un algorithme en temps polynomial sauf si $N=NP$. 
Proceedings of ANALCO  12th Meeting on Analytic Algorithmics and Combinatorics  2012, Japan, Omnipress, 107116 pp, 2012 [hal]Abstract (click to expand). It is well known that, under some aperiodicity and irreducibility conditions, the number of occurrences of local patterns within a Markov chain (and, more generally, within the languages generated by weighted regular expressions/automata) follows a Gaussian distribu tion with both variance and mean in (n). By contrast, when these conditions no longer hold, it has been denoted that the limiting distribution may follow a whole diversity of distributions, including the uniform, powerlaw or even multimodal distribution, arising as tradeo s between structural properties of the regular expression and the weight/probabilities associated with its transitions/letters. However these cases only partially cover the full diversity of behaviors induced within regular expressions, and a characterization of attainable distributions remained to be provided. In this article, we constructively show that the limiting distribution of the simplest foresee able motif (a single letter!) may already follow an arbitrarily complex continuous distribution (or cadlag process). We also give applications in random generation (Boltzmann sampling) and bioinformatics (parsimonious segmentation of DNA). 
Proceedings of WABI  12th Workshop on Algorithms in Bioinformatics  2012, Slovenia, tba, 2012 [hal]Abstract (click to expand). Nous présentons un cadre général pour la comparaison structure/séquence d'une très large classe de structures d'ARN, qui généralise de nombreux travaux récents consacrés à des familles spécifiques. Notre approche, basée sur une décomposition arborescente des structures, permet d'obtenir un algorithme de complexité paramétrée, dont la complexité asymptotique, exponentielle dans le cas général, dépend de la famille de structures considérée. Pour chacune des familles considérées par les travaux antérieurs, notre algorithme se spécialise automatiquement, et permet d'obtenir une complexité égale à celle obtenue jusqu'alors.


Proceedings of RECOMB  15th Annual International Conference on Research in Computational Molecular Biology  2011, Canada, Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 6577:501515 pp, 2011 [doi] [hal]Abstract (click to expand). The analysis of the relationship between sequences and structures (i.e. how mutations aect structures and reciprocally how structures in uence mutations) is essential to decipher the principles driving molecular evolution, to infer the origins of genetic diseases or to develop bioengineering applications such as the design of articial molecules. Because their structures can be predicted from the sequence data only, RNA molecules provide a good framework to study this sequencestructure relationship. We recently introduced a suite of algorithms called RNAmutants which allows, for the rst time, a complete exploration of RNA sequencestructure maps in polynomial time and space. Formally, RNAmutants takes an input sequence (or seed) to compute the Boltzmann weighted ensembles of mutants with exactly k mutations, and sample mutations from these ensembles. However, this approach suers from major limitations. Indeed, since the Boltzmann probabilities of the mutations depend of the free energy of the structures, RNAmutants has diculties to sample mutant sequences with low G+Ccontents. In this paper we introduce a novel unbiased adaptive sampling algorithm that enables RNAmutants to sample regions of the mutational landscape poorly covered by classical algorithms. We applied these methods to sample mutations with low G+Ccontents. These adaptive sampling techniques can be easily adapted to explore other regions of the sequence and structural landscapes which are dif cult to sample. Importantly, these algorithms come at a minimal computational cost. We demonstrate the insights oered by these techniques on studies of complete RNA sequence structures maps of sizes up to 40 nucleotides. Our results indicate that the G+Ccontent has a strong in uence on the size and shape of the evolutionary accessible sequence and structural spaces. In particular, we show that low G+Ccontents favor the apparition of internal loops and thus possibly the synthesis of tertiary structure motifs. On the other hand, high G+Ccontents signicantly reduce the size of the evolutionary accessible mutational landscapes. 
Proceedings of WABI  11th Workshop on Algorithms in Bioinformatics  2011, Germany, 2011 [hal]Abstract (click to expand). We extend an hypergraph representation, introduced by Finkelstein and Roytberg, to unify dynamic programming algorithms in the context of RNA folding with pseudoknots. Classic applications of RNA dynamic programming energy minimization, partition function, basepair probabilities\ldots) are reformulated within this framework, giving rise to very simple algorithms. This reformulation allows one to conceptually detach the conformation space/energy model  captured by the hypergraph model  from the specific application, assuming unambiguity of the decomposition. To ensure the latter property, we propose a new combinatorial methodology based on generating functions. We extend the set of generic applications by proposing an exact algorithm for extracting generalized moments in weighted distribution, generalizing a prior contribution by Miklos and al. Finally, we illustrate our fullfledged programme on three exemplary conformation spaces (secondary structures, Akutsu's simple type pseudoknots and kissing hairpins). This readily gives sets of algorithms that are either novel or have complexity comparable to classic implementations for minimization and Boltzmann ensemble applications of dynamic programming.


Proceedings of GASCOM  8th conference on random generation of combinatorial structures  2010, Canada, 14pp pp, 2010 [hal]Abstract (click to expand). The present work analyzes the redundancy of sets of combinatorial objects produced by a weighted random generation algorithm proposed by Denise et al. This scheme associates weights to the terminals symbols of a weighted contextfree grammar, extends this weight definition multiplicatively on words, and draws words of length $n$ with probability proportional their weight. We investigate the level of redundancy within a sample of $k$ word, the proportion of the total probability covered by $k$ words (coverage), the time (number of generations) of the first collision, and the time of the full collection. For these four questions, we use an analytic urn analogy to derive asymptotic estimates and/or polynomially computable exact forms. We illustrate these tools by an analysis of an RNA secondary structure statistical sampling algorithm introduced by Ding et al. 
Proceedings of 21st International Meeting on Probabilistic, Combinatorial, and Asymptotic Methods in the Analysis of Algorithms (AofA'10), Austria, DMTCS Proceedings, Discrete Mathematics and Theoretical Computer Science, DMTCS Proceedings vol. AM, 21st International Meeting on Probabilistic, Combinatorial, and Asymptotic Methods in the Analysis of Algorithms (AofA'10):4964 pp, 2010 [hal]Abstract (click to expand). We address the uniform random generation of words from a contextfree language (over an alphabet of size $k$), while constraining every letter to a targeted frequency of occurrence. Our approach consists in a multidimensional extension of Boltzmann samplers. We show that, under mostly $\textit{strongconnectivity}$ hypotheses, our samplers return a word of size in $[(1 \epsilon)n, (1+ \epsilon)n]$ and exact frequency in $\mathcal{O}(n^{1+k/2})$ expected time. Moreover, if we accept tolerance intervals of width in $\Omega (\sqrt{n})$ for the number of occurrences of each letters, our samplers perform an approximatesize generation of words in expected $\mathcal{O}(n)$ time. We illustrate our approach on the generation of Tetris tessellations with uniform statistics in the different types of tetraminoes.


Proceedings of GASCOM  7th conference on random generation of combinatorial structures  2008, Italy, 12pp pp, 2008 [hal]Abstract (click to expand). We address the nonredundant random generation of k words of length n from a contextfree language. Additionally, we want to avoid a prede¯ned set of words. We study the limits of a rejectionbased approach, whose time complexity is shown to grow exponentially in k in some cases. We propose an alternative recursive algorithm, whose careful implementation allows for a nonredundant generation of k words of size n in O(kn log n) arithmetic operations after the precomputation of O(n) numbers. The overall complexity is therefore dominated by the generation of k words, and the nonredundancy comes at a negligible cost.


Proceedings of 4th Symposium on Bioinformatics and bioengineering  BIBE'2004, Taiwan, Proceedings of the Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE'04), IEEE Computer Society, 387394 pp, 2004 [doi] [hal]Abstract (click to expand). We address the problem of estimating the sensitivity of seedbased similarity search algorithms. In contrast to approaches based on Markov models [18, 6, 3, 4, 10], we study the estimation based on homogeneous alignments. We describe an algorithm for counting and random generation of those alignments and an algorithm for exact computation of the sensitivity for a broad class of seed strategies. We provide experimental results demonstrating a bias introduced by ignoring the homogeneousness condition.

 Book chapters & Editorials (5)

RNA Bioinformatics, Editor(s) Ernesto Picardi, Methods in Molecular Biology, Springer New York, 63100 pp, 2015 [doi] [hal] [pubmed]Abstract (click to expand). Secondary structure diagrams are essential, in RNA biology, to communicate functional hypotheses and summarize structural data, and communicate them visually as drafts or finalized publicationready figures. While many tools are currently available to automate the production of such diagrams, their capacities are usually partial, making it hard for a user to decide which to use in a given context. In this chapter, we guide the reader through the steps involved in the production of expressive publicationquality illustrations featuring the RNA secondary structure. We present major existing representations and layouts, and give precise instructions to produce them using available free software, including jViz.RNA, the PseudoViewer, RILogo, Rchie, RNAplot, R2R, and VARNA. We describe the file formats and structural descriptions accepted by popular RNA visualization tools. We also provide command lines and Python scripts to ease the user's access to advanced features. Finally, we discuss and illustrate alternative approaches to visualize the secondary structure in the presence of probing data, pseudoknots, RNARNA interactions, and comparative data. 
Preface. Extended versions of selected papers presented at Computational methods for Structural RNAs (CMSR'14)Computational methods for Structural RNAs (CMSR'14), Journal of Computational Biology, Mary Ann Liebert, 189 pp, 2015 [hal] [pubmed]Abstract (click to expand). This preface introduces to the extended versions of selected papers presented at Computational methods for Structural RNAs (CMSR'14)


1024  Bulletin de la société informatique de France, Editor(s) Eric Sopena, SIF, 2353 pp, 2014 [hal]Abstract (click to expand). Nous présentons l'histoire et les développements récents de la recherche en bioinformatique des ARN, en insistant particulièrement sur sa dimension algorithmique, ainsi que sur les nombreuses et fructueuses interactions avec d'autres sciences qui jalonnent son histoire.


Introduction to RNA secondary structure comparison.RNA Sequence, Structure, and Function: Computational and Bioinformatic Methods, Methods in Molecular Biology Editor(s) Gorodkin, Jan and Ruzzo, Walter L., Methods in molecular biology, Humana Press/Springer Imprint, 24773 pp, 2013 [doi] [hal] [pubmed]Abstract (click to expand). Many methods have been proposed for RNA secondary structure comparison, and new ones are still being developed. In this chapter, we first consider structure representations and discuss their suitability for structure comparison. Then, we take a look at the more commonly used methods, restricting ourselves to structures without pseudoknots. For comparing structures of the same sequence, we study base pair distances. For structures of different sequences (and of different length), we study variants of the tree edit model. We name some of the available tools and give pointers to the literature. We end with a short review on comparing structures with pseudoknots as an unsolved problem and topic of active research.


Mieux comprendre certaines molécules biologiques grâce à l’informatiqueInterstices INRIA, 2012 [hal]Abstract (click to expand). Yann Ponty s’intéresse à l’algorithmique du repliement d’ARN. Il nous en parle dans cet épisode du podcast audio.

 Edited books (1)

Proceedings of the 1st workshop on Computational Methods for Structural RNAsEditor(s) Jossinet, Fabrice & Ponty, Yann & Waldispühl, Jérôme, McGill University, ISBN 9782955018705, 2014 [doi] [hal]Abstract (click to expand). Ribonucleic acids (RNAs) play key roles in various aspects of the gene transcription and regulation processes, and are the focus of an everincreasing interest in all areas of molecular biology. Deciphering the function of a nonprotein coding RNA requires an intimate knowledge of its structure, motivating the development of structurecentric methods. This volume contains original peerreviewed contributions and invited communications presented at "CMSR'14: Computational Methods for Structural RNAs" held on September 7, 2014 in Strasbourg. This event was hosted as an official workshop of ECCB'14: 13th European Conference on Computational Biology.Proceedings are freely downloadable at http://cmsr14.cs.mcgill.ca/

 Conferences without proceedings and/or unrefereed (7)

Proceedings of WEPA 2018  2nd International Workshop on Enumeration Problems and Applications, Italy, 2018 [hal]Abstract (click to expand). In this extended abstracts, we describe an approach to account for integervalued constraints in RNA design. In particular, we describe various FPT/XPT approaches, heavily relying on dynamic programming, for counting and sampling sequences fulfilling a set of declarative constraints. We close this short note with several open questions, drawing connections with graph theory and enumerative combinatorics.


An integrative approach for predicting the RNA secondary structure for the HIV–1 Gag UTR using probing dataProceedings of JOBIM  Journées Ouvertes en Biologie, Informatique et Mathématiques  2017, France, 2017 [hal]Abstract (click to expand). Structure modeling is key to understand the mechanisms of RNA retroviruses such as HIV. Many in silico prediction approaches suggesting structural models of moderate to good accuracies are available. However, the prediction methods could be further improved by taking advantage of both next generation sequencing technologies and different experimental techniques such as enzymatic and SHAPE probing data [1]. In a published article [2], we introduce and use a structural modeling method based on the integration of many experimental probing data to direct predictions with the aim to find the most accurate structure lying in the intersection of disjoint sources of experiments.


Proceedings of JOBIM  Journées Ouvertes en Biologie, Informatique et Mathématiques  2016, France, 2016 [hal]Abstract (click to expand). NextGeneration Sequencing (NGS) technologies have opened new perspectives to refine the process of predicting the secondary structure(s) of structured noncoding RNAs. Herein, we describe an integrated modeling strategy, based on the SHAPE chemistry, to infer structural insight from deep sequencing data. Our approach is based on a pseudoenergy minimization, incorporating additional information from evolutionary data (compensatory mutations) and SHAPE experiments (reactivity scores) within an iterative procedure. Preliminary results reveal conserved and stable structures within UTRs of the Ebola Genome, that are both thermodynamicallystable and highly supported by SHAPE accessibility analysis. 
Proceedings of JOBIM  Journées Ouvertes en Biologie, Informatique et Mathématiques  2016, France, 2016 [hal]Abstract (click to expand). Aligning macromolecules such as proteins, DNAs and RNAs in order to reveal, or conversely exploit, their functional homology is a classic challenge in bioinformatics, with farreaching applications in structure modelling and genome annotations. In the specific context of complex RNAs, featuring pseudoknots, multiple interactions and noncanonical base pairs, multiple algorithmic solutions and tools have been proposed for the structure/sequence alignment problem. However, such tools are seldom used in practice, due in part to their extreme computational demands, and because of their inability to support general types of structures. Recently, a general parameterized algorithm based on tree decomposition of the query structure has been designed by Rinaudo et al. We present an implementation of the algorithm within a tool named LiCoRNA. We compare it against stateoftheart algorithms. We show that it both gracefully specializes into a practical algorithm for simple classes pseudoknot, and offers a general solution for complex pseudoknots, which are explicitly outofreach of competing softwares.


Résultats algorithmiques pour le design d’ARN avec contraintes de séquenceProceedings of SeqBio 2015, France, 2015 [hal]Abstract (click to expand). Le problème du design consiste à concevoir une ou des séquences d’ARN qui vont, dans un modèleénergétique, se replier en une structure cible. Les algorithmes déjà existants dans le domaine negèrent pour la plupart pas l’ajout au problème des contraintes de séquence, c’estàdire l’interdictionou l’obligation d’utiliser certains motifs.Zhou et al. (2013) ont proposé un algorithme qui utilise de la génération aléatoire dans un langageformel conditionné par un automate fini qui gère les contraintes de motifs interdits/imposés et unegrammaire non contextuelle qui gère les contraintes imposées par la structure recherchée.On propose ici de se baser sur cet algorithme en y ajoutant des optimisations permettant de réduiresa complexité. Ce travail soulève également des questions ouvertes sur la construction efficace d’unautomate (quasi) minimal, ainsi que l’incorporation de contraintes supplémentaires garantissant lerepliement des séquences engendrées vers une structure cible à l’exclusion de tout autre repliement.


Can Probabilistic Model Checking Explore RiboNucleic Acid Folding Space?Proceedings of IWBDA  4th International Workshop on BioDesign Automation  2012, United States, 2012 [hal]Abstract (click to expand). The folding path of a ribonucleic acid from its free form to its complete structure is of significant interest to structural biologists. There are algorithms for simulating kinetics of the ribonucleic acid from its unfolded state to its stable structure (minimum freeenergy structure). These computational techniques are expensive as the total number of structures in the folding space is exponentially large. We propose to use timeefficient algorithms from the field of probabilistic model checking for verifying certain hypothesis concerning ribonucleic acid folding landscape. First, we explain how thermodynamic models of ribonucleic acid can be used to generate a Markov chain of structures, whose transitions within the chain are based on the energy difference w.r.t to their neighbours. Then we present a process algebra model to compactly represent the dynamics of ribonucleic acid folding. Both these approaches essentially generate input models for probabilistic model checking. Finally, we discuss if statistical model checking techniques can be applied for the problem of aligning ribonucleic acid sequences and structures.


Proceedings of Fifth IndoFrench Bioinformatics Meeting, India, 2011 [hal]Abstract (click to expand). In this short note, we investigate the responsibility of pseudoknotted (PK) foldings in the observed incapacity of an MFE folding method (RNAfold) to predict the structure of certain families of RNA within RFam. We considered the difference Delta in predicted freeenergy obtained by respectively considering and omitting PK conformations, which we showed could be reliably used as an indicator for the presence of pseudoknots (84.5% AUC, while 88.3% of PK families have positive values). We furthered our analysis, investigating the top 15 families associated with positive values of Delta associated with nonPK consensus structures in RFAM, and found evidence of the presence of PK in the literature for at least 11 of them. However a large proportion of poorly predicted families remain associated with low Delta values, and additional explanations need be explored for their poor predictions.

 Theses (1)

2006 [hal]Abstract (click to expand). La mise en évidence des mécanismes de sélection agissant sur les données génomiques structurées (ARN, Protéines, ADN...) nécessite l'élaboration de modèles de séquences. Une fois un tel modèle élaboré, il est possible, au prix d'une analyse mathématique parfois complexe ou par le biais de la
génération aléatoire, d'évaluer la significativité d'un phénomène observé. Tout d'abord, nous nous intéressons aux propriétés des grammaires pondérées, un formalisme particulièrement adapté à la modélisation de la structure des ARN, dérivant des algorithmes de génération aléatoire efficaces implémentés au sein du prototype GenRGenS. Nous abordons le calcul automatique des pondérations réalisant des valeurs observées pour les paramètres du modèle, ainsi qu'une implémentation basée sur une approche optimisation. Dans un second temps, nous abordons la modélisation de la structure secondaire d'ARN. Après quelques rappels de biologie moléculaire, nous proposons plusieurs modèles basés sur des grammaires pondérées permettant la génération de structures d'ARN réalistes. L'utilisation d'un algorithme d'optimisation permet le calculer des pondérations correspondant à certaines familles d'ARN. Nous proposons enfin un algorithme d'extraction de structure secondaire maximale dans une structure générale, qui permet de profiter des données récentes issues de la cristallographie. Le dernier chapitre de cette thèse s'intéresse à l'analyse d'un algorithme de recherche de similarité heuristique, dont la sensibilité s'avère étroitement liée à la probabilité de présence d'un motif au sein de marches aléatoires particulières, les chemins culminants. Ces marches restent positives, et atteignent une altitude maximale en leur dernier pas. Nous proposons un algorithme récursif de génération aléatoire pour ces chemins. En combinant des techniques issues de la combinatoire énumérative, l'analyse asymptotique et la théorie des langages, nous dérivons des algorithmes de génération aléatoire par rejet linéaires dans de nombreux cas.

 Invited talks (3)

RNA Bioinformatics and ensemble dynamic programmingProceedings of 5th biennial Canadian Discrete and Algorithmic Mathematics Conference  CanaDAM 2015, Canada, 2015 [hal]Abstract (click to expand). RiboNucleic Acids (RNAs) are fascinating biomolecules which, similarly to DNA, can be encoded as sequences over a fourletter alphabet, and perform a wide array of biological functions. However, unlike DNA, the precise function of a given RNA depends critically on its structure, adopted as the outcome of a folding process. Luckily, this intricate three dimensional conformation can be adequately abstracted as a (noncrossing) list of contacts, i.e. a discrete combinatorial object.In this talk, I will emphasize how, over the past three decades, RNA biology has benefited from a continuous and fruitful crosstalk between discrete mathematicians, computer scientists and biochemists. At the center of this conversation lies the concept of dynamicprogramming, an algorithmic design technique which solves a combinatorial optimization problem efficiently by taking advantage of a wellchosen decomposition of its search space. Extensions and optimized instances of this technique now allow to address, at a genomic scale, multiple questions related to the analysis of the Boltzmann ensemble and the sequencestructure(function) relationship. These developments also raise welldefined open questions, motivating further studies of the underlying discrete structures. 
Towards firm foundations for the rational design of RNA moleculesProceedings of Colloque de Bioinformatique Moléculaire (GdR BIM), France, 2015 [hal]Abstract (click to expand). The overall architecture of an RNA can be predicted with acceptable accuracy using established computational methods, such as those implemented in the MFold or the RNAFold software. This opens the door for a rational design of RNA molecules, to serve as building blocks in synthetic biology, or as novel therapeutic agents. To that purpose, one must solve the inverse folding problem, i.e. construct an RNA sequence which adopts a specific secondary structure according to established structure prediction methods. However, contrasting with the prediction task, the underlying computational structure of inverse folding is poorly understood, and existing tools either fail to return admissible sequences for some structures, or require extraordinary computational resources. Consequently, the field of RNA bioinformatics is currently craving for both practical and theoreticallysound approaches for the problem. In this talk, I will introduce the many flavors of the RNA design problem and illustrate its relevance to the objectives of modern (synthetic) biology. I will outline the strengths and shortcomings of selected approaches, and will briefly present recently contributed methods, developed with collaborators at Univ McGill (Canada), Wuhan Univ. (China) and Univ. ParisSud. If time allows, I will also mention results recently obtained with collaborators at Simon Fraser University, which characterize large subclasses of (un)designable secondary structures in simple energy models. These results are currently being extended to more realistic energy models, and could serve as a foundation for exact, computationallyefficient, solutions for RNA design.


Visualizing 2D & 3D Structures of RNAProceedings of VIZBI  3rd international meeting on Visualizing Biological Data  2012, Germany, 2012 [hal]Abstract (click to expand). Broad overview of the existing visualization techniques/methods/tools and many challenges faced by the RNA community.

 Miscellaneous & Submitted (5)

[hal]Abstract (click to expand). 1 Introduction:RNA structure is a key to understand retroviruses’s mechanisms e.g. HIV. Many prediction approaches suggesting accurate structures are available but they could be further improved by both taking advantage of Next generation sequencing technology and new experimental techniques (Enzymatic and SHAPE).2  Experimental probing dataIn this poster, we present an integrative approach based on using many experimental data, resulting from sequencing, to direct predictions with the aim to find an accurate structure lying in the intersection of different sources of experiments. From one side, to reveal single nucleotide, reactivity profiles resulting from a SHAPE technology were used as “soft constraints”, meaning that the reactivity values were translated into pseudoenergies as described (Lorenz et al, 2016). From the other side, RNAses cleavage was used with two enzymes V and T targeting respectively paired and unpaired nucleotides. Reactivity scores resulting from those two experiments are used as hard constraints, forcing positions that exceed a specific threshold to be paired(case of EnzymaticV) or unpaired(case of EnzymaticT).3stochastic sampling:At the thermodynamic equilibrium, a given RNA can have many alternative structures, where each structure could be characterized by a probability within the space of all the possible conformations (Boltzmann ensemble). This probability is related to the energy of the structure, the highest the energy needed to break pairs present in the structure the highest is its probability in the ensemble.We admit that the optimal structure(s) should be energetically stable and supported by several experimental data. For this reason, we coupled a stochastic sampling from the Boltzmann ensembles associated with the experimentally derived constraints, with a clustering across experimental conditions, to generate a structural models that are wellsupported by available data.4The workflow description: 1. Experimental data from different conditions(SHAPE,EnzymaticT, EnzymaticV) were analysed to extract reactivity profiles that will serve as constraints. 2. We sampled 2000 structures per condition: We perform a Boltzmann sampling (Ding et al, 2005) to generate a predefined number of stable structures, compatible with the constraints derived for each condition. We used the stochastic sampling mode of RNAsubopt (p option) to generate energetically stable structures that are either fully compliant with constraints derived from enzymatic data (hard constraints(Mathews et al,2004)), or constitute reasonable tradeoffs between thermodynamic stability and compatibility with SHAPE data (soft constraints, using thepseudopotentials of Deigan et al. (Deigan et al, 2009), see (Lorenz et al,2016) for details). 3. We merge the structures while keeping labels to retain the origin of each structure. 4. In order to detect structures with affinity to each other, The merged sets of models were clustered, using the basepair distance as a measure of dissimilarity, the distance between two structures corresponds to the number of base pairs needed to break and to build in order to go from a structure to an other. A clustering algorithms (affinity propagation (Wang et al, 2007) implemented in the scikitlearn Python package (Pedregosa et al, 2011) is used to agglomerate and identify recurrent structures. One of the advantages of affinity propagation resides in its low computational requirements. 5. The next step consists on identifying clusters that are homogeneous, stable and well supported by experimental evidences, leading to the identification of the following objective criteria:Present conditions that informs as about the diversity of the cluster: Our primary target are clusters compatible with multiple experimental conditions. However, the larger sampled sets required for reproducibility tend to populate each cluster with structures fromall conditions. We thus associated with each cluster the number of represented conditions, defined as the number of conditions for whichthe accumulated Boltzmann probability in the cluster exceeds a predefined threshold.Boltzmann weight that is a measure of stability: Structures that are found in a given cluster may be unstable, and should be treated asoutliers. For this reason, we computed the cumulated normalized Boltzmann probabilities within the cluster, to favor stable clustersconsisting of stable structures; Average Cluster Distance to count for coherence: We observed a general tendency of clustering algorithms to create heterogeneousclusters when faced with noisy data. We thus associated with each cluster the mean distance between pairs of structures, estimated asthe average distance to the MEA (Lu et al, 2009) for the sake of efficiency, in order to neglect clusters that were too diverse.6. The next steps consist on choosing cluster(s) with high coherence, diversity and stability. for this purpose we restricted our analysis to clusters that were found on the 3D Pareto Frontier (Mattson Messac, 2005) with respect to the three mentioned above criteria .7. After detecting the optimal Pareto cluster(s), we need to identify representative structure for each cluster. We chose the maximum expectedaccuracy (MEA) structure (Lu et al, 2009) as the representative structure for each cluster, which is defined as the secondary structure whose structural elements have highest accumulated Boltzmann probability within the cluster.5 Results:This resulted in 2 structures which we narrowed down to a single candidate using compatibility with the 1M7 SHAPE data as a final discriminatory criterion.


[hal]Abstract (click to expand). Reconstruction of the evolutionary history of genomic characters along a given species tree is a longstanding problem in computational biology, with efficient algorithms to compute parsimonious scenarios for many types of characters, as in the cases of genes and genomes sequences, gene content, and gene family evolution. Recently, Bérard et al. extended the corpus of such results to syntenic characters, as they introduced the notion of adjacency forest, that models the evolution of gene adjacencies within a phylogeny, and described an efficient dynamic programming (DP) algorithm, called DeCo (Berard et al, ECCB 2012), to compute parsimonious adjacency evolutionary histories. Applying the classical parsimonybased approach of DeCo to a dataset of over 6,000 pairs of mammalian gene trees yielded a significant number of ancestral genes involved in more than two adjacencies, which correspond to syntenic inconsistencies.Recently, more general approaches for parsimony problems have been analyzed, either exploring a wider range of parameters, or considering several alternate histories for a given instance. Our work seeks to address the syntenic inconsistencies of DeCo by extending their DP scheme toward an exploration of the whole solution space of adjacency histories, under the Boltzmann probability distribution, that assigns a probability to each solution defined in terms of its parsimony score. This principle has been applied in several contexts and is sometimes known as the “Boltzmann ensemble approach”.While this Boltzmann ensemble approach has been used for a long time in RNA structure analysis, to the best of our knowledge it is not the case in comparative genomics, where exact probabilistic models have been favored as increasing computational capacities allow them to handle realistic datasets. However, such a probabilistic model does not exist so far for gene adjacencies, which motivates our work.We first show that by sampling adjacencies histories under a Boltzmann distribution that favors cooptimal histories and conserving only frequent ancestral adjacencies, we can reduce significantly the number of syntenic inconsistencies. We also implement a inside/outside variant for DeCo, that computes the actual probabilities of each individual adjacency considered under the Boltzmann distribution.


[hal]Abstract (click to expand). Two formalisms, both based on contextfree grammars, have recently been proposed as a basis for a nonuniform random generation of combinatorial objects. The former, introduced by Denise et al, associates weights with letters, while the latter, recently explored by Weinberg et al in the context of random generation, associates weights to transitions. In this short note, we use a simple modification of the Greibach Normal Form transformation algorithm, due to Blum and Koch, to show the equivalent expressivities, in term of their induced distributions, of these two formalisms.


[hal]Abstract (click to expand). Ce mémoire résume un travail opéré sous la direction d'Alain Denise de Mars à Juin 2003. Il consisteen une étude des propriétés combinatoires des structures secondaires d'ARN, en rapport avec le problème de la génération aléatoire de structures secondaires d'ARN réalistes.Dans une première partie, nous tenterons de familiariser le lecteur avec le contexte biologique sousjacent aux problèmes étudiés.Nous présenterons ensuite un état de l'art de l'étude combinatoire des structures secondaire d'ARN. Pour cela, nous introduirons brièvement une série d'outils théoriques, comme les séries génératrices, et leurs comportements asymptotiques ou les grammaires non contextuelles. Enfin, nous évoquerons divers mécanismes de génération aléatoire et présenterons une contribution à l'étude des paramètres des structures secondaires d'ARN.Nous proposerons en annexe une description d'un logiciel dédié à la génération aléatoire de séquences génomiques, GenRGenS, dont l'auteur assure le développement depuis la version 1.0. Nous présenterons aussi un algorithme polynomial de génération aléatoire de chemins culminants, structures combinatoires non algébriques qui apparaissent lors de l'étude d'algorithmes d'alignements de séquence. 
[hal]Abstract (click to expand). We address the problem of measuring the sensitivity of seedbased similarity search algorithms. In contrast to approaches based on Markov models, we study the measurement based on homogeneous alignments. We describe an algorithm for counting and random generation of those alignments and an algorithm for exact computation of the sensitivity for a broad class of seed strategies. We provide experimental results demonstrating a bias introduced by ignoring the homogeneousness condition.
