Yanlei Diao

Laboratoire d'informatique / Department of Computer Science
Ecole Polytechnique, France

Email: {first-name} dot {last-name}@polytechnique.edu
Phone: +33 1 77 57 80 13
Address: Batiment Alan Turing
1 rue Honore d'Estienne d'Orves
Campus de l'Ecole Polytechnique
91120 Palaiseau
Ile-de-France, France


Current Projects

Data Exploration

AIDE: Interactive Data Exploration at Scale. Traditional data management systems are suited for applications in which the structure, meaning and contents of the database, as well as the questions (queries) to be asked, are all well-understood. However, this is no longer true when the volume and diversity of data grow at an unprecedented rate, as we are witnessing in scientific computing, social network analysis, and business data analysis. To bridge the gap, this project explores a new approach of system-aided exploration of big data space and automatic learning of the user interest in order to retrieve all objects that match the user interest -- we call this new service "interactive data exploration". Grounded in an active-learning framework, our project tackles fast convergence of the user interest model, interactive performance for the user experience, and scalability to a large distributed database.
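
To make the active-learning idea concrete, here is a minimal sketch, not the AIDE implementation. It assumes a deliberately simplified setting: the user's interest is a single hypothetical threshold on one numeric attribute, and `ask_user` stands in for the user labeling one object as relevant or not. The names `explore` and `ask_user` are illustrative only. The loop always asks about the object the current model is least sure about, which is what lets the interest model converge with few labels.

```python
from typing import Callable, List, Tuple


def explore(values: List[float],
            ask_user: Callable[[float], bool]) -> Tuple[List[float], int]:
    """Learn a hypothetical 1-D interest of the form 'value >= t'.

    Assumption: relevance is monotone in the value, so every label
    rules out half of the remaining candidate thresholds.
    Returns the objects predicted relevant and the number of labels used.
    """
    pool = sorted(values)
    lo = -1            # index of the largest object known to be irrelevant
    hi = len(pool)     # index of the smallest object known to be relevant
    questions = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2      # the most uncertain object right now
        questions += 1
        if ask_user(pool[mid]):
            hi = mid              # relevant: the threshold is at or below it
        else:
            lo = mid              # irrelevant: the threshold is above it
    return pool[hi:], questions


# Simulated user whose (hidden) interest is "value >= 37".
relevant, labels_used = explore(list(range(100)), lambda v: v >= 37)
```

In this toy setting the loop needs only a logarithmic number of labels to recover all matching objects. Real interests span many attributes and need a richer model than a single threshold.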

Data Streams

XStream: Predicting and Explaining Patterns in Real-time Stream Analytics. Recent applications such as the Internet of Things, data center monitoring, and market trend analysis present a pressing demand for perpetual, low-latency analytics to support a wide range of time-critical tasks and decisions. Today's data stream systems support passive monitoring by requiring the monitoring application (or user) to explicitly define patterns of interest. However, a growing number of applications demand a new service beyond passive monitoring, that is, the ability of the monitoring system to automatically identify interesting patterns (including anomalous behaviors), produce a concrete explanation for the anomalies from the raw data, and, based on the explanation, enable a user action to prevent or remedy the effect of the anomaly, or to develop better strategies in the future. Our project aims to provide the new functionality of anomaly detection and explanation discovery over high-volume, diverse data streams from enterprise businesses.

SASE: Complex Event Processing over Streams. We study stream processing in the context of large-scale event-based systems that are gaining adoption in applications such as supply chain management, financial services, and network and application monitoring. These systems create high volumes of events. End applications require these events to be filtered and correlated for complex pattern detection, aggregated on different temporal and geographic scales, and transformed into new events that reach a semantic level appropriate for the applications. We address issues involved in stream-based event processing, ranging from the query language to computational complexity to fast implementation. We further consider complex pattern evaluation with imprecise event timestamps, which commonly arise in event processing in distributed systems.
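
As an illustration of what filtering and correlating events for pattern detection can look like, the sketch below matches a two-step sequence pattern (an event of one type followed by an event of another type with the same key, within a time window). It is a simplified sketch and not the SASE language or engine: the `Event` fields, the function `match_seq`, and the example event types are all hypothetical, and events are assumed to arrive in timestamp order with precise timestamps.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, NamedTuple, Tuple


class Event(NamedTuple):
    type: str    # hypothetical event type, e.g. "SHELF" or "EXIT"
    key: str     # correlation attribute, e.g. an item id
    ts: float    # timestamp; assumed precise and non-decreasing


def match_seq(events: Iterable[Event], first: str, second: str,
              window: float) -> Iterator[Tuple[Event, Event]]:
    """Yield (a, b) pairs where a.type == first, b.type == second,
    a.key == b.key, a precedes b, and b.ts - a.ts <= window."""
    pending: Dict[str, Deque[Event]] = defaultdict(deque)
    for e in events:
        open_firsts = pending[e.key]
        # Drop candidates for this key that have fallen out of the window.
        while open_firsts and e.ts - open_firsts[0].ts > window:
            open_firsts.popleft()
        if e.type == first:
            open_firsts.append(e)
        elif e.type == second:
            for a in open_firsts:
                yield (a, e)


stream = [
    Event("SHELF", "x", 0.0),
    Event("SHELF", "y", 1.0),
    Event("EXIT", "x", 5.0),
    Event("EXIT", "y", 20.0),   # too late: outside the 10-unit window
]
matches = list(match_seq(stream, "SHELF", "EXIT", window=10.0))
```

Partitioning pending events by key keeps each incoming event from being compared against unrelated candidates. Handling imprecise timestamps, as described above, would require a different matching strategy.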

Big and Fast Data Analysis

GESALL: Large-Scale Genomic Data Analysis. Next-generation sequencing has transformed genomics into a new paradigm of data-intensive computing, raising several salient challenges. First, the deluge of genomic data needs to undergo deep analysis to mine biological information, which requires a full pipeline that integrates many data processing and analysis tools. Second, deep analysis pipelines often take a long time to run, which entails a long cycle for algorithm and method development. This project aims to bring the latest big data and database technology to the genomics domain to revolutionize its data crunching power. The proposed research includes: development of a deep pipeline for genomic data analysis by assembling state-of-the-art methods; automatic parallelization of the workflow using big data technology; a principled approach to optimizing the genomic pipeline; and integration of streaming technology to reduce the latency of important results. The prototype system will be deployed in both private and public cloud environments, and fully evaluated using existing long-running pipelines and in a variety of real use cases.

SCALLA: Scalable Low-Latency Analytics. An integral part of many data-intensive applications is the need to collect and analyze enormous data sets, such as social network data, server log data, scientific data, and biological data. Concurrently, new programming models and architectures have been developed for large-scale cluster computing, exemplified by recent MapReduce systems. In these big data systems, however, data needs to be loaded to the cluster before any queries can be run, resulting in a high delay to start query processing. Moreover, answers to a long-running query are returned only when the entire job completes, causing a long delay in returning query answers. In this project, we design, develop, and evaluate a scalable, low-latency analytics platform, called Scalla, that fundamentally transforms the existing cluster computing paradigm into an incremental parallel processing paradigm, and further extends to near real-time analytics. We further develop a few applications in the domains of social network data analysis and big bio data analysis on the Scalla platform.
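
The contrast between batch and incremental processing can be shown with a toy aggregation. The sketch below is only an illustration of the general idea, not Scalla's design: the function name `incremental_counts` and the batch format are assumptions. Instead of waiting for all input before reporting, it folds each arriving batch into running totals and emits an updated answer immediately.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def incremental_counts(batches: Iterable[List[str]]) -> Iterator[Dict[str, int]]:
    """Yield the running count per key after each batch arrives.

    A batch job would return only the final dictionary; here every
    intermediate answer is available as soon as its batch is processed.
    """
    totals: Counter = Counter()
    for batch in batches:
        totals.update(batch)      # fold the new data into existing state
        yield dict(totals)        # early, progressively refined answer


snapshots = list(incremental_counts([["a", "b"], ["a"]]))
```

Each snapshot reuses the state built from earlier batches, so the cost of producing an updated answer depends on the size of the new batch rather than on all data seen so far.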

Data Quality

CLARO: Uncertain Data Management. The goal of this project is to design and develop a data management system that captures data uncertainty from data collection to query processing to final result generation. Such uncertain data stream processing is crucial to many real-world applications such as hazardous weather monitoring and traffic monitoring. To achieve this goal, our project takes a principled approach grounded in probability and statistical theory to support uncertainty as a first-class citizen, and to efficiently integrate this approach into high-volume stream processing. In particular, we aim to capture the uncertainty of raw data streams as they are produced, as well as changes in uncertainty as data propagates through various query processing operators.
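
A small example may help show how uncertainty can change as data passes through an operator. The sketch below is an illustration under strong assumptions, not CLARO's data model: each reading is modeled as an independent Gaussian, and the names `Gaussian`, `g_sum`, and `g_avg` are hypothetical. Under independence, a sum adds means and variances, and an average over n readings divides the summed variance by n squared.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Gaussian:
    mean: float
    var: float   # variance of the reading


def g_sum(xs: Sequence[Gaussian]) -> Gaussian:
    """Sum of independent Gaussians: means add, variances add."""
    return Gaussian(sum(x.mean for x in xs), sum(x.var for x in xs))


def g_avg(xs: Sequence[Gaussian]) -> Gaussian:
    """Average of n independent Gaussians: variance shrinks by n^2."""
    n = len(xs)
    s = g_sum(xs)
    return Gaussian(s.mean / n, s.var / (n * n))


# Two hypothetical sensor readings, each with variance 4.0.
readings = [Gaussian(20.0, 4.0), Gaussian(22.0, 4.0)]
total = g_sum(readings)
average = g_avg(readings)
```

The sum is more uncertain than either input while the average is less uncertain, which is why the result distribution has to be tracked per operator rather than attached once at the source. Correlated or non-Gaussian inputs would need different propagation rules.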


Past Projects

STONES: Flash-based Data Management Systems. Recent advances in flash technology have enabled embedded devices, personal computers, and high-end servers to be equipped with high-capacity flash memory and its packaged devices such as solid state drives (SSDs). Flash memory and SSDs provide faster random access and more energy-efficient operations than traditional hard disks. In this project, we are designing new storage systems and query processing algorithms for large-scale data analysis and high-performance databases that employ hybrid storage of flash memory and hard disks.

SPIRE: RFID Data Stream Processing. Radio Frequency Identification (RFID) technology is gaining acceptance in an increasing number of applications for tracking and monitoring purposes. Despite its promise to provide unprecedented visibility in various domains, RFID technology presents numerous challenges, including incomplete and noisy data, lack of information about inter-object relationships, and high volumes. In this project, we develop an RFID stream processing system that employs probabilistic inference to derive locations of unobserved objects and inter-object relationships such as containments and further supports probabilistic query processing to derive high-level information.

Fast and Memory-Efficient Packet Content Scanning. Packet content scanning compares the packet payload against a set of patterns specified as regular expressions. Memory requirements using traditional methods for fast packet scanning are prohibitively high. We develop regular expression rewrite techniques to reduce memory usage, and grouping schemes to increase the regular expression matching speed without increasing memory usage. Our implementation can achieve orders-of-magnitude performance improvements over the implementations used in the Linux L7-filter and the Snort system. Such efficient packet content scanning enables new technologies such as real-time worm detection, content lookup in overlay networks, and fine-grained load balancing.

ONYX: Internet-Scale XML Data Dissemination. We study Internet-scale data dissemination that delivers XML-encoded documents from multiple publishing sites to millions of subscribers based on the subscribers' data interests. We explore the idea of content-based routing of documents in distributed dissemination systems. We seek to enhance such data dissemination with advanced services such as stateful publish/subscribe and QoS. We investigate implementations that are able to meet demanding efficiency and scalability requirements.

YFilter: High-Volume XML Message Brokering. We design a message brokering system that provides fast, on-the-fly filtering of incoming XML messages for large numbers of simultaneous queries, and transforms the matching messages according to recipient-specific requirements. We explore the key issues including shared processing of queries for efficient and scalable filtering and leveraging the filtering solutions for customized result generation. We released YFilter 1.0, a freely available software system containing the filtering engine and the query workload generator of YFilter.
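
The idea of shared processing of queries can be sketched with a prefix tree over path queries. This is a simplified illustration and not the YFilter engine or its API: it supports only child-axis steps such as `/order/item`, and the class `PathIndex` and its methods are hypothetical names. Queries with a common prefix share the same nodes, so each step of an element path is examined once regardless of how many queries contain it.

```python
from typing import Dict, List


class PathIndex:
    """Prefix tree over simple path queries (child axis only)."""

    def __init__(self) -> None:
        self.root: Dict = {"children": {}, "queries": []}

    def add(self, query_id: str, path: str) -> None:
        """Register a query such as '/order/item' under its id."""
        node = self.root
        for step in path.strip("/").split("/"):
            node = node["children"].setdefault(
                step, {"children": {}, "queries": []})
        node["queries"].append(query_id)

    def match(self, element_path: List[str]) -> List[str]:
        """Return ids of queries satisfied along one root-to-element path."""
        matched: List[str] = []
        node = self.root
        for step in element_path:
            node = node["children"].get(step)
            if node is None:
                break                        # no query continues this way
            matched.extend(node["queries"])  # queries ending at this step
        return matched


index = PathIndex()
index.add("q1", "/order/item")
index.add("q2", "/order/item/price")
index.add("q3", "/order/customer")
hits = index.match(["order", "item", "price"])
```

Here `q1` and `q2` share the `/order/item` prefix, so that prefix is traversed a single time for both. Supporting wildcards, descendant steps, or predicates would require a more general automaton than this tree.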

Stream-based XQuery Processing. We develop a memoization-based approach to shared processing for the full XQuery language in a stream-based environment. We implement the approach by extending the streaming XQuery processor that BEA Systems incorporates as part of their BEA WebLogic Integration 8.1 product. We demonstrate the effectiveness of the approach in typical use cases of XQuery.