Laboratoire d'informatique de l'École polytechnique

Optimizing Big Data Computations: Queries and Algebra in a Single Framework

Speaker: Ioana Manolescu (CEDAR team)
Date: Thu, 21 Jan 2021, 13:00-14:00

The database management industry is worth $55 bn according to a recent Gartner report, and its importance is growing, in particular as more platforms specialize in very-large-scale data management in the cloud. At the core of the success of database management systems (DBMSs) lie declarativeness and its twin, automatic performance-oriented optimization. Together, these have enabled users to specify what computation they want executed over the data, not how to execute it. Instead, a dedicated DBMS module called the query optimizer enumerates alternative evaluation methods and, with the help of a computational cost model, selects the one estimated to be the most efficient.

For data stored in relational tables, optimizers have, since 1970, helped launch a booming industry, whose products are at work several times in the average day of everyone in a modern society. More recently, novel systems, in particular the so-called NoSQL stores, were developed, each with specific performance advantages, and going beyond tabular data to XML or JSON documents, key-value stores, etc. In parallel, the growing interest in machine learning on Big Data has led to hybrid workloads, which mix database-style queries (fundamentally, logical formulas) and ML-specific operations (complex matrix computations). These developments have complicated the landscape of modern Big Data management and the life of developers, since computations split across systems are often executed “as is” and do not benefit from automatic optimization.

I will describe Estocada, a project started in 2016 with the purpose of providing a unified optimization framework (a) for queries specified across a large variety of data models, and (b) for workloads mixing queries with ML computations. At the heart of Estocada lies an old but powerful tool in database research, namely the chase. I will explain Estocada’s approach for uniformly modeling heterogeneous data models, including numerical matrices, in a relational model with constraints, and how it leverages a modern variant of the chase to automatically rewrite computations spanning heterogeneous data models and systems such as Spark, SystemML, TensorFlow and SparkSQL, as well as relational databases and document stores such as MongoDB.

This is joint work with: Rana Al-Otaibi (UCSD), Damian Bursztyn (Inria), Bogdan Cautis (U. Paris-Saclay), Alin Deutsch (UCSD), Stamatis Zampetakis (Inria).