Laboratoire d'informatique de l'École polytechnique


Optimizing Big Data Computations: Queries and Algebra in a Single Framework

Speaker: Ioana Manolescu (CEDAR team)
Date: Thu, 21 Jan 2021, 13:00-14:00

The database management industry is worth $55 bn according to a recent Gartner report, and its importance keeps growing, in particular as more platforms specialize in very large scale data management in the cloud. At the core of the success of database management systems (DBMSs) lie declarativeness and its twin, automatic performance-oriented optimization. These have enabled users to specify what computation they want executed over the data, not how to execute it. Instead, a dedicated DBMS module called the query optimizer enumerates alternative evaluation methods and, with the help of a computational cost model, selects the one estimated to be the most efficient.
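The enumerate-then-cost loop described above can be sketched in a few lines of Python. This is a deliberately naive sketch: the table names, cardinalities, and uniform join selectivity are invented for illustration, and real optimizers use far richer cost models and search strategies.

```python
from itertools import permutations

# Hypothetical table cardinalities and a single assumed join selectivity.
CARD = {"orders": 1_000_000, "customers": 50_000, "nations": 25}
SEL = 0.0001

def plan_cost(order):
    """Estimate a join order's cost as the sum of intermediate result sizes
    (a common textbook simplification of a cost model)."""
    size = CARD[order[0]]
    cost = 0
    for table in order[1:]:
        size = size * CARD[table] * SEL  # estimated join output size
        cost += size
    return cost

# The "optimizer": enumerate the alternatives, keep the cheapest estimate.
best = min(permutations(CARD), key=plan_cost)
```

Even this toy model prefers plans that join the small tables first, postponing the large `orders` table, which is the kind of decision a user writing declarative queries never has to make explicitly.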

For data stored in relational tables, optimizers have, since 1970, helped launch a booming industry, whose products are at work many times in the average day of anyone in a modern society. More recently, novel systems, in particular the so-called NoSQL stores, were developed, each with specific performance advantages, and going beyond tabular data to XML or JSON documents, key-value stores etc. In parallel, the growing interest in machine learning on Big Data has led to hybrid workloads, which mix database-style queries (fundamentally, logical formulas) and ML-specific operations (complex matrix computations). These developments have complicated the landscape of modern Big Data management and the life of developers, since computations split across systems are often executed “as such” and do not benefit from automatic optimization.

I will describe Estocada, a project started in 2016 with the purpose of providing a unified optimization framework (a) for queries specified across a large variety of data models, and (b) for workloads mixing queries with ML computations. At the heart of Estocada lies an old and powerful tool from database research, namely the chase. I will explain Estocada’s approach for uniformly modeling heterogeneous data models, including numerical matrices, in a relational model with constraints, and how it leverages a modern variant of the chase to automatically rewrite computations spanning heterogeneous data models and systems such as Spark, SystemML, TensorFlow and SparkSQL, as well as relational databases and document stores such as MongoDB.
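As a rough intuition for the chase: given a tuple-generating dependency, the chase repeatedly adds the tuples the constraint demands, inventing fresh “labeled nulls” for existential variables, until the instance satisfies the constraint. The relation names R and S and the single constraint below are invented for illustration; Estocada's chase variant is considerably more sophisticated.

```python
# A toy relational instance and a minimal chase loop for one
# tuple-generating dependency (TGD): for every R(x, y), some S(y, z) exists.
instance = {"R": {("a", "b"), ("b", "c")}, "S": set()}
fresh = iter(f"_n{i}" for i in range(100))  # supply of labeled nulls

def chase_tgd(inst):
    """Add S(y, z) with a fresh null z whenever R(x, y) has no witness in S."""
    changed = True
    while changed:
        changed = False
        for (x, y) in sorted(inst["R"]):
            if not any(s[0] == y for s in inst["S"]):
                inst["S"].add((y, next(fresh)))  # invent a labeled null for z
                changed = True
    return inst

chase_tgd(instance)
```

After the chase terminates, every R-tuple has the witness in S that the dependency requires, which is what makes chase-based reasoning usable for rewriting queries under constraints.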

This is joint work with: Rana Al-Otaibi (UCSD), Damian Bursztyn (Inria), Bogdan Cautis (U. Paris-Saclay), Alin Deutsch (UCSD), Stamatis Zampetakis (Inria).

Computing with Ordinary Differential Equations

Speaker: Olivier Bournez (Alco team)
Date: Thu, 17 Dec 2020, 13:00-14:00

Computability, complexity, programming, analog models of computation, and the rest.

Olivier and his co-authors François Fages, Guillaume Le Guludec and Amaury Pouly received the 2019 prize of the journal La Recherche for their work on the analog model of computation, and in particular for the article Strong Turing Completeness of Continuous Chemical Reaction Networks and Compilation of Mixed Analog-Digital Programs.

Consensus Problems

Speaker: Bernadette Charron-Bost
Date: Thu, 19 Nov 2020, 13:00-14:00

Consensus problems arise in a large number of multi-agent systems, whether natural or artificial: the agents must, for example, coordinate in order to synchronize, or to adopt the same memory state.

In every consensus problem, each agent holds an output variable, and the agents must eventually agree on a common value for all these variables. Beyond this common agreement clause, consensus problems differ essentially in the stability property required of the output variables. One may require each agent to make an irrevocable decision (corresponding to the case where the output variables are write-once). This condition can be weakened by allowing the agents to change their decision a finite number of times, or even to change their minds infinitely often, in which case the output variables must simply converge to a common limit value. These variants are called irrevocable consensus, stabilizing consensus, and asymptotic consensus, respectively.
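Asymptotic consensus, the weakest of these variants, can be illustrated with a small simulation (the three-agent line topology and initial values below are arbitrary choices): each agent repeatedly averages its value with those of its neighbours, and all values converge to a common limit even though no agent ever makes a final, irrevocable decision.

```python
# Asymptotic consensus by repeated local averaging on a 3-agent line graph.
neighbours = {0: [1], 1: [0, 2], 2: [1]}
x = {0: 0.0, 1: 3.0, 2: 9.0}  # arbitrary initial opinions

for _ in range(200):
    # Each agent replaces its value by the average of itself and its neighbours.
    x = {i: (x[i] + sum(x[j] for j in neighbours[i])) / (1 + len(neighbours[i]))
         for i in x}
```

After enough rounds the three values are indistinguishable: the agents agree in the limit, which is exactly the asymptotic-consensus stability clause.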

I will examine the forms these different consensus problems take depending on the type of application under consideration: while irrevocable consensus is required in replication techniques for fault tolerance, asymptotic consensus suffices for many problems in distributed control and distributed optimization, as well as for modeling natural synchronization phenomena (e.g., bird flocking, firefly flashing). As for the Nakamoto consensus at the heart of blockchain technology, its stability clause sits between that of irrevocable consensus and that of stabilizing consensus in the hierarchy described above.

In a second part, I will present the major computability and complexity results for these different consensus problems: the theorem of Fischer, Lynch and Paterson (FLP), which prompted the question “was e-mail a mistake?” raised in an issue of the New Yorker last year; the so-called “Byzantine generals” theorem of Shostak, Pease and Lamport; and the more recent but equally spectacular result of Moreau, proving the superiority of symmetric interactions. I will sketch the wide variety of techniques used to establish these results and explain how my own work relates to and builds on them.

In 2019, Bernadette received the Grand prix Huy Duong Bui, awarded by the Académie des Sciences.

Talk by Paolo Papotti: «Explainable Fact Checking for Statistical and Property Claims»

Speaker: Paolo Papotti
Location: Zoom
Date: Thu, 15 Oct 2020, 14:00-15:00

Paolo Papotti will present his work on fact checking to the Cedar team, on October 15th at 2pm. The seminar will take place online, using Zoom.

Abstract: Misinformation is an important problem, but fact checkers are overwhelmed by the amount of false content produced online every day. To support fact checkers in their efforts, we are creating data-driven verification methods that use structured datasets to assess claims and explain their decisions. For statistical claims, we translate text claims into SQL queries over relational databases. We exploit text classifiers to propose validation queries to the users and rely on tentative execution of query candidates to narrow down the set of alternatives. The verification process is controlled by a cost-based optimizer that considers the expected verification overheads and the expected utility of claims as training samples. For property claims, we use the rich semantics in knowledge graphs (KGs) to verify claims and produce explanations. As the information in a KG is inevitably incomplete, we rely on rule discovery and on text mining to gather the evidence needed to assess claims. Uncertain rules and facts are turned into logical programs, and the checking task is modeled as a probabilistic inference problem. Experiments show that both methods enable the efficient and effective labeling of claims with interpretable explanations, both in simulations and in real-world user studies, with a 50% decrease in verification time. Our algorithms are demonstrated in a fact-checking website, which has been used by more than twelve thousand users to verify claims related to the spread and effects of the coronavirus disease (COVID-19).
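The idea of checking a statistical claim by tentatively executing a candidate SQL query can be illustrated with a toy example. The table, the numbers, and the `check_claim` helper below are all hypothetical; the actual system proposes and ranks many candidate queries rather than assuming a single fixed one.

```python
import sqlite3

# Toy database standing in for the structured dataset behind a claim such as
# "country X had more than T million inhabitants in year Y" (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (country TEXT, year INT, millions REAL)")
conn.executemany("INSERT INTO population VALUES (?, ?, ?)",
                 [("France", 2020, 67.4), ("Italy", 2020, 59.6)])

def check_claim(country, year, threshold):
    """Tentatively execute the candidate verification query for the claim."""
    row = conn.execute(
        "SELECT millions > ? FROM population WHERE country = ? AND year = ?",
        (threshold, country, year)).fetchone()
    return bool(row[0]) if row else None  # None: no evidence in the table

verdict = check_claim("France", 2020, 60)
```

The three possible outcomes (supported, refuted, or no evidence found) mirror the labels a fact checker needs, together with the executed query serving as the explanation.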

Bio: Paolo Papotti has been a professor at EURECOM (France) since 2017. He got his PhD from Roma Tre University (Italy) in 2007, and held research positions at the Qatar Computing Research Institute (Qatar) and Arizona State University (USA). His research is focused on data integration and information quality. He has authored more than 100 publications, and his work has been recognized with two “Best of the Conference” citations (SIGMOD 2009, VLDB 2016), a best demo award at SIGMOD 2015, and two Google Faculty Research Awards (2016, 2020). He is an associate editor for PVLDB and the ACM Journal of Data and Information Quality (JDIQ).

Creative Visual Simulation: Combining User Control, Knowledge & Learning

Speaker: Marie-Paule Cani (GeoViC team)
Location: Online (password: w8nXYimWg68)
Date: Thu, 15 Oct 2020, 13:00-14:00

While the use of digital models and simulation has already spread in many disciplines, recent advances in Computer Graphics open the way to much lighter ways of creating 3D content. In this talk, I’ll show how the expressive modeling paradigm, namely the combination of graphical models embedding knowledge with gestural interfaces such as sketching, sculpting or copy-pasting, can allow seamless 3D modeling, progressive refinement, and animation of a variety of models, from individual shapes to full, animated 3D worlds. We discuss how prior knowledge can be enhanced by learning from examples, the latter possibly injected on the fly during a modeling session. As we show through examples, this methodology opens the way for scientists and engineers to seamlessly create and interact with the mental models of their object of study, in areas as different as soft-product manufacturing, medicine, plant and ecosystem simulation, or geology.

Marie-Paule was elected to the Académie des Sciences in December 2019.

PhD defense of Maxime Buron: «Raisonnements efficaces sur de larges graphes hétérogènes»

Speaker: Maxime Buron
Location: Salle Gilles Kahn, Bât. Alan Turing
Date: Wed, 7 Oct 2020, 14:00-16:00

Maxime Buron (Cedar team) will defend his PhD thesis, entitled «Raisonnements efficaces sur de larges graphes hétérogènes» (Efficient Reasoning on Large Heterogeneous Graphs), on Wednesday 7 October 2020, in the Gilles Kahn amphitheatre of the Turing building.

Due to the health situation, the number of seats is limited to 40. The defense will also be streamed online; see the details on this page:

Abstract: In this thesis, we study centralized techniques for querying knowledge bases built from several heterogeneous data sources. This work is set in the context of the Semantic Web, where knowledge is represented as graphs, which makes it easy to accommodate the heterogeneity of the sources, and where querying is enriched by the ability to reason over the knowledge. We have developed efficient querying methods that aim to maximize the expressiveness offered by Semantic Web standards.
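A very rough intuition for query answering with reasoning over a knowledge graph: saturate the graph with inference rules, then match the query against the saturated graph. The triples and the `saturate` helper below are invented for illustration (two RDFS-style subClassOf rules); the thesis targets far more expressive settings and more efficient techniques than naive saturation.

```python
# A toy RDF-like graph: an instance assertion plus a class hierarchy.
triples = {
    ("maxime", "type", "PhDStudent"),
    ("PhDStudent", "subClassOf", "Student"),
    ("Student", "subClassOf", "Person"),
}

def saturate(g):
    """Apply two RDFS rules to a fixpoint: type propagation along
    subClassOf, and transitivity of subClassOf."""
    g = set(g)
    while True:
        new = {(s, "type", c2) for (s, p, c1) in g if p == "type"
               for (c1b, q, c2) in g if q == "subClassOf" and c1b == c1}
        new |= {(a, "subClassOf", c) for (a, p, b) in g if p == "subClassOf"
                for (b2, q, c) in g if q == "subClassOf" and b2 == b}
        if new <= g:
            return g
        g |= new

g = saturate(triples)
```

After saturation, a query asking for all instances of Person finds the implicit answer as well, which a plain graph match over the original triples would miss.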

Talk by Alexis Cvetkov-Iliev: «Analytics across data sources: more learning, rather than more cleaning»

Speaker: Alexis Cvetkov-Iliev
Location: Zoom
Date: Thu, 1 Oct 2020, 14:00-15:00

Alexis Cvetkov-Iliev will present his work with Gael Varoquaux, on October 1st at 2pm.

Abstract: Aggregating data across multiple sources faces the challenge of varying knowledge-representation conventions across the sources. To answer an analytical question, a typical workflow requires entity matching, merging variants across sources into clean categorical variables for the analysis. This task still requires labor-intensive manual supervision from the analyst, despite great progress in record linkage and deduplication techniques. Here we argue that advanced statistical tools can directly address many analytic tasks across data sources with less manual data cleaning. Reframing analytical questions as machine learning tasks enables the use of vector representations of the data, leveraging similarities between entries instead of relying on an exact matching of entities. We benchmark such an approach against manually-supervised entity matching, answering analytic questions typical of socio-economic studies across 14 real-world employee databases. Approaches based on continuous embeddings and machine learning models are competitive with classical techniques, while requiring considerably less human labor. In this light, we believe that more research is needed on blending machine learning into analytic data stores for the purpose of analysis, rather than cleaning.
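The contrast between exact entity matching and similarity-based representations can be illustrated with a toy string-similarity sketch. Character n-grams here stand in for the learned continuous embeddings used in the work, and the job-title strings are invented examples.

```python
# Similarity-based matching: two spellings of the same category score high
# even though exact entity matching would treat them as distinct entries.

def ngrams(s, n=3):
    """Character n-grams of a lightly padded, lower-cased string."""
    s = f"  {s.lower()} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity(a, b):
    """Jaccard overlap of character 3-grams; 1.0 means identical strings."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb)

# Variants of the same job title, as found in two employee databases:
score = similarity("Senior Software Engineer", "Sr. Software Engineer")
```

A downstream model consuming such vectorized similarities can aggregate the two variants without anyone ever writing an explicit matching rule, which is the point made in the abstract.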