Yann Ponty, CNRS researcher@LIX, Ecole Polytechnique

2012-2013-CASM-M2 BIBS

J'interviens au sein du Master Bioinformatique et Biostatistiques (BIBS) de l'Université Paris-Sud/11 et co-habilité par l'École Polytechnique, dans le cadre de la matière Combinatoire, Algorithmes, Séquences et Modélisation (CASM). Les cours des 17 Décembre 2012, 14 Janvier et 21 Janvier 2013 auront lieu au PUIO, et consisteront en un présentation de différentes approches issues de la programmation dynamique développées dans le cadre du repliement de l'ARN. Chaque séance sera suivie d'un mini-TP, où on s'attachera à mettre en pratique les principes présentés en cours.

Vous trouverez bientôt sur cette page :

Les slides et énoncés de TP
Les modalités de contrôle continu
Les articles proposés pour l'examen de contrôle continu

Slides et énoncés de TP

Séance 1 : Structure(s) de l'ARN, programmation dynamique, repliement dans le modèle d'énergie de Nussinov/Jacobson.
[Slides cours], [Version compacte-4pp]
[Énoncé TP 1 + Rappels Python]
[Wrappers Python Vienna Package]
Séance 2 : Structures sous-optimales, ensemble de Boltzmann, échantillonnage statistique et pseudo-noeuds simples.
[Slides cours], [Version compacte-4pp]
Séance 3 : Comparaison/alignement (2D/3D), introduction aux méthodes génériques pour la programmation dynamique.
[Slides cours], [Version compacte-4pp]
[Énoncé TP 2]

Modalités de contrôle continu

Le contrôle continu de CASM comporte :

Examen écrit (date/lieu à venir) : Il portera sur l'ensemble du programme, et comptera pour 2/3 de la note finale. On pourra s'entraîner sur les examens des années précédentes :
- Examen 2011-2012 : Corrigé partiel.
- Annales 2010-2011 : Corrigé partiel.
- Annales 2009-2010 : Sujet [pdf] + Corrigé partiel [pdf])
Examen d'article : Soutenance organisée collectivement (date/lieu à venir) et comptant pour 1/3 de la note finale.
Déroulement : Choisir individuellement un article, dont vous devrez effectuer une étude synthétique. Un court rapport (+/- 5 pages) devra nous être remis au moins une semaine avant la soutenance individuelle (15 min. chacun).

Liste des articles proposés

La lecture d'article sera effectuée et soutenue en monome. Chaque article ne pourra être choisi que par une personne au plus. Veuillez me contacter dès que vous avez choisi un article, que je puisse le retirer de la liste des articles disponibles.

[Disponible]
Summarizing and correcting the GC content bias in high-throughput sequencing [PDF]
Benjamini, Y. and T.P. Speed
Biais et correction dans le séquençage haut-débit. [More...] [Less...]

Abstract GC content bias describes the dependence between fragment count (read coverage) and GC content found in Illumina sequencing data. This bias can dominate the signal of interest for analyses that focus on measuring fragment abundance within a genome, such as copy number estimation (DNA-seq). The bias is not consistent between samples; and there is no consensus as to the best methods to remove it in a single sample. We analyze regularities in the GC bias patterns, and find a compact description for this unimodal curve family. It is the GC content of the full DNA fragment, not only the sequenced read, that most influences fragment count. This GC effect is unimodal: both GC-rich fragments and AT-rich fragments are underrepresented in the sequencing results. This empirical evidence strengthens the hypothesis that PCR is the most important cause of the GC bias. We propose a model that produces predictions at the base pair level, allowing strand-specific GC-effect correction regardless of the downstream smoothing or binning. These GC modeling considerations can inform other high-throughput sequencing analyses such as ChIP-seq and RNA-seq. [Less...]
[Pris par Raguideau]
ON PATTERN FREQUENCY OCCURRENCES IN A MARKOVIAN SEQUENCE [PDF]
Mireille Régnier and Wojciech Szpankowski
Combinatoire analytique pour l'analyse de motifs.
Important : Seul est à traiter le cas Bernoulli (voir uniforme) pour un seul motif, et on pourra passer sur le cas d'un modèle nul Markovien, ainsi que sur le cas de plusieurs motifs. La partie loi limite pourra être seulement esquissée sans détails des théorèmes. [More...] [Less...]

Abstract Consider a given pattern H and a random text T generated by a Markovian source. We study the frequency of pattern occurrences in a random text when overlapping copies of the pattern are counted separately. We present exact and asymptotic formulæ for moments (including the variance), and probability of r pattern occurrences for three different regions of r, namely: (i) r = O(1), (ii) central limit regime, and (iii) large deviations regime. In order to derive these results, we first construct certain language expressions that characterize pattern occurrences which are later translated into generating functions. We then use analytical methods to extract asymptotic behaviors of the pattern frequency from the generating functions. These findings are of particular interest to molecular biology problems (e.g., finding patterns with unexpectedly high or low frequencies, and gene recognition), information theory (e.g., second-order properties of the relative frequency), and pattern matching algorithms (e.g., q-gram algorithms). [Less...]
[Pris par Launay]
Assemblathon 1: A competitive assessment of de novo short read assembly methods [PDF]
Dent A. Earl, Keith Bradnam, John St. John, et al.
Un comparatif des méthodes disponibles pour l'assemblage de données de séquençage haut-débit. [More...] [Less...]

Abstract Low cost short read sequencing technology has revolutionised genomics, though it is only justbecoming practical for the high quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort teams were asked to assemble a simulated Illumina HiSeq dataset of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novelhaplotype aware assessments of coverage, contiguity, structure, base calling 1 and copy number were made. We establish that within this benchmark (1) it is possible to assemble the genome to a high level of coverage and accuracy, and that (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies and the code that was used to evaluate the assemblies is now public and freely available from http://www.assemblathon.org/. [Less...]
[Disponible]
Sparse RNA Folding: Time and Space Efficient Algorithms [PDF]
Rolf Backofen, Dekel Tsur, Shay Zakov, and Michal Ziv-Ukelson
Nouvelle technique algorithmique permettant d'éviter le calcul de toute la matrice de programmation dynamique pour le repliement de l'ARN. [More...] [Less...]

Abstract The classical algorithm for RNA single strand folding requiresO(nZ) time and O(n2) space, where n denotes the length of theinput sequence and Z is a sparsity parameter that satisfies n < Z < n2.We show how to reduce the space complexity of this algorithm. Thespace reduction is based on the observation that some solutions for subproblemsare not examined after a certain stage of the algorithm, andmay be discarded from memory. This yields an O(nZ) time and O(Z)space algorithm, that outputs both the cardinality of the optimal foldingas well as a corresponding secondary structure. The space-effcientapproach also extends to the related RNA simultaneous alignment withfolding problem, and can be applied to reduce the space complexity of thefastest algorithm for this problem from O(n2m2) down to O(nm2 + Z~),where n and m denote the lengths of the input sequences to be aligned,and ~ Z is a sparsity parameter that satisfies nm < Z~ < n2m2.In addition, we also show how to speed up the base-pairing maximizationvariant of RNA single strand folding. The speed up is achieved by combiningtwo independent existing techniques, which restrict the numberof expressions that need to be examined in bottleneck computations ofthese algorithms. This yields an O(LZ) time and O(Z) space algorithm,where L denotes the maximum cardinality of a folding of the input sequence. [Less...]