Christos Faloutsos
Carnegie Mellon University, USA

Keynote Speaker (via video conference)

Title: Large Graph Mining - Patterns, Explanations, and Cascade Analysis

Abstract: What do graphs look like? How do they evolve over time? How do influence, news, and viruses propagate over time? We present a long list of static and temporal laws, along with some recent observations on real graphs. We show that fractals and self-similarity can explain several of the observed patterns, and we conclude with cascade analysis and a surprising result on virus propagation and immunization.
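As background for the virus-propagation theme, the classic eigenvalue result from this line of work states that, in an SIS epidemic with infection rate beta and cure rate delta, the infection dies out when beta/delta is below 1/lambda_1, where lambda_1 is the largest eigenvalue of the graph's adjacency matrix. A minimal sketch (the graph and the rates below are made-up toy values, not data from the talk):

```python
import numpy as np

# Adjacency matrix of a small undirected graph: a star with 3 leaves.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

# Largest eigenvalue of the (symmetric) adjacency matrix.
lambda_1 = max(np.linalg.eigvalsh(A))

beta, delta = 0.2, 0.5              # infection / cure rates (hypothetical)
strength = beta * lambda_1 / delta  # effective epidemic strength

print(f"lambda_1 = {lambda_1:.3f}")
print("epidemic dies out" if strength < 1.0 else "epidemic may persist")
```

For the star graph lambda_1 = sqrt(3), so with these toy rates the effective strength is below 1 and the infection dies out.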

Aris Anagnostopoulos
Sapienza University of Rome, Italy

Title: Models and Algorithms for Online Collaborative Systems

Abstract: In recent years we have observed the emergence of a variety of systems where users communicate and collaborate online to produce knowledge or provide new services. Success stories include Wikipedia, Linux and the open-source community, games with a purpose, crowdsourcing services, online labor marketplaces, and car-sharing services, to name a few. This new paradigm, where users collaborate using the Internet for coordination, requires the development of new models and algorithms that capture both the machine and the human elements, and the interactions between them. In this talk I will start with a high-level presentation of some of these systems and of the issues that need to be captured when trying to model them formally. I will then present algorithms for the problem of creating online teams of experts and for the problem of assigning tasks to experts in crowdsourcing systems, as well as some experimental results on the "wisdom of crowds". I will conclude with a discussion of challenges for future work.
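To give a flavor of the team-formation problem mentioned above: in its simplest form, one must pick experts whose combined skills cover a task's requirements. The greedy set-cover heuristic below is a standard baseline for this kind of problem, not the speaker's actual algorithm; all names and skill sets are illustrative:

```python
def greedy_team(task_skills, experts):
    """Greedily pick experts until every required skill is covered.

    experts: dict mapping expert name -> set of skills.
    Returns the chosen team as a list of names.
    """
    uncovered = set(task_skills)
    team = []
    while uncovered:
        # Pick the expert covering the most still-uncovered skills.
        best = max(experts, key=lambda e: len(experts[e] & uncovered))
        if not experts[best] & uncovered:
            raise ValueError("task cannot be covered by available experts")
        team.append(best)
        uncovered -= experts[best]
    return team

experts = {
    "ana":   {"python", "statistics"},
    "bob":   {"design"},
    "carla": {"python", "design", "writing"},
}
print(greedy_team({"python", "design", "writing"}, experts))  # ['carla']
```

Realistic formulations also account for coordination cost between team members (e.g. their distance in a social graph), which is what makes the problem algorithmically interesting.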

Thierry Artières
Université Pierre et Marie Curie, France

Title: Classification in a Large Number of Categories

Abstract: Statistical learning has emerged in recent years as a key technology for processing and analyzing the large amounts of data available today. One particular need that has emerged recently is the ability to perform classification with a large number of classes. This problem occurs in many application areas, such as filtering and classification of semantic data, annotation of multimedia objects, and online recommendation. For instance, one may want to annotate images with keywords, or to automatically classify text documents into a hierarchy of classes or concepts such as Wikipedia (~20,000 classes) or DMOZ (~600,000 classes). Research in this area is still at a preliminary stage, since the quantitative change in the scale of the problems, the number of classes, translates into a qualitative change in the methods. We will provide an overview of the problem of classification with a large number of classes, with a focus on text documents, and give a panorama of the methods proposed so far, which are designed to optimize various trade-offs between training complexity, inference complexity, and accuracy. We will detail benchmark datasets, including those that we have released for the Large Scale Hierarchical Text Classification (LSHTC) series of challenges that we have organized over the last three years. Finally, we will present the work conducted at LIP6 aimed at designing accurate classifiers with inference complexity sublinear in the number of classes, either by designing efficient binary reductions of the one-vs-all classification scheme or by using hierarchical classifiers on a hierarchy learned from data.
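A toy sketch of why hierarchical classifiers give sublinear inference: flat one-vs-all scores all K classes, while routing an example down a balanced binary tree over the labels costs only O(log K) classifier calls. The "classifiers" below are stand-ins (each internal node just thresholds a scalar feature), so this illustrates the complexity argument only, not the LSHTC systems themselves:

```python
import math

def tree_predict(x, labels):
    """Route x down a balanced binary tree over `labels`, counting decisions."""
    calls = 0
    lo, hi = 0, len(labels)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        calls += 1                   # one binary classifier evaluation
        if x < labels[mid]:          # toy routing rule: threshold on x itself
            hi = mid
        else:
            lo = mid
    return labels[lo], calls

K = 1024
labels = list(range(K))
pred, calls = tree_predict(600, labels)
print(pred, calls, math.ceil(math.log2(K)))  # 600 10 10
```

With K = 1024 labels, inference takes 10 binary decisions instead of 1024 one-vs-all scores; the hard part in practice, as the abstract notes, is learning a hierarchy that preserves accuracy.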

Marc Barthelemy
Institut de Physique Théorique, CEA-Saclay, France

Title: Evolution of Spatial Networks

Abstract: Complex systems are very often organized as networks whose nodes and edges are embedded in space. Transportation and mobility networks, the Internet, mobile phone networks, power grids, social and contact networks, and neural networks are all examples where space is relevant and where topology alone does not contain all the information. Characterizing and understanding the structure and evolution of spatial networks is thus crucial for many different fields, ranging from urbanism to epidemiology. These networks are both embedded in space and evolve in time, and we thus face the difficulty of measuring and characterizing their evolution and of extracting useful information from it. In this talk I will illustrate these various problems and present some recent results from empirical case studies on the evolution of road and subway networks.
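One simple quantity that captures why topology alone is not enough for spatial networks is the detour factor: the ratio between the shortest-path distance along the network and the straight-line distance between two nodes. The tiny graph and coordinates below are made-up illustration data, not from the talk:

```python
import heapq
import math

def dijkstra(adj, s):
    """Shortest network distances from s over weighted adjacency lists."""
    dist = {s: 0.0}
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Nodes with planar coordinates; edge weights are euclidean lengths.
pos = {"a": (0, 0), "b": (1, 0), "c": (1, 1)}
edges = [("a", "b"), ("b", "c")]      # no direct a-c link
adj = {u: [] for u in pos}
for u, v in edges:
    w = math.dist(pos[u], pos[v])
    adj[u].append((v, w))
    adj[v].append((u, w))

network = dijkstra(adj, "a")["c"]     # 2.0, going via b
crow = math.dist(pos["a"], pos["c"])  # sqrt(2), as the crow flies
print(round(network / crow, 3))       # detour factor ~1.414
```

Measures of this kind, which mix network structure with geometry, are typical of the tools needed to study road and subway networks.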

Paolo Boldi
University of Milano, Italy

Title: Injecting Uncertainty in Graphs for Identity Obfuscation

Abstract: The data collected nowadays by social-networking applications create fascinating opportunities for building novel services, as well as for expanding our understanding of social structures and their dynamics. Unfortunately, publishing social-network graphs is considered an ill-advised practice due to privacy concerns. To alleviate this problem, several anonymization methods have been proposed, aiming at reducing the risk of a privacy breach on the published data while still allowing it to be analyzed and relevant conclusions to be drawn. In this talk I will present a novel anonymization approach based on injecting uncertainty into social graphs and publishing the resulting uncertain graphs. While existing approaches obfuscate graph data by adding or removing edges entirely, we propose a finer-grained perturbation that adds or removes edges partially: this way we can achieve the same desired level of obfuscation with smaller changes in the data, thus maintaining higher utility. Our experiments on real-world networks confirm that, at the same level of identity obfuscation, our method provides higher utility than existing randomized methods that publish standard graphs.
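To make the "partial" perturbation idea concrete: instead of flipping edges outright, one publishes an uncertain graph where every node pair carries an edge probability, with existing edges kept close to 1 and non-edges close to 0. The uniform noise model below is purely illustrative, not the scheme analyzed in the talk:

```python
import random

def uncertainify(nodes, edges, sigma, rng):
    """Assign each node pair a perturbed edge probability in [0, 1]."""
    edge_set = {frozenset(e) for e in edges}
    probs = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            base = 1.0 if frozenset((u, v)) in edge_set else 0.0
            noise = rng.uniform(0.0, sigma)
            # Existing edges move down by the noise, non-edges move up.
            probs[(u, v)] = base - noise if base == 1.0 else base + noise
    return probs

rng = random.Random(0)
probs = uncertainify(["a", "b", "c"], [("a", "b")], sigma=0.2, rng=rng)
for pair, p in sorted(probs.items()):
    print(pair, round(p, 3))
```

Because each pair's probability moves by at most sigma, a small sigma already hides which edges are real while keeping the expected graph close to the original.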

Stéphan Clémençon
Télécom ParisTech and CNRS, France

Title: Maximal Deviations of Incomplete U-statistics with Applications to Empirical Risk Sampling

Abstract: In this talk, we will show how to extend the Empirical Risk Minimization (ERM) paradigm, from a practical perspective, to situations where a natural estimate of the risk takes the form of a K-sample U-statistic, as is the case in the K-partite ranking problem for instance. Indeed, the exact computation of the empirical risk is computationally demanding, if not infeasible, even for moderate sample sizes: it involves averaging O(n^(d1 + ... + dK)) terms for a U-statistic of degrees (d1, ..., dK) based on samples of sizes proportional to n. We propose to consider instead a drastically simpler Monte Carlo version of the empirical risk based on O(n) terms only, which can be viewed as an incomplete generalized U-statistic, and prove that, remarkably, this approximation stage does not damage the ERM procedure and still yields a learning rate of order O_P(1/sqrt(n)). Such results are of crucial importance in the "Big Data" era. Beyond a theoretical analysis guaranteeing the validity of this approach, numerical experiments are presented for illustrative purposes.
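A small numerical sketch of the incomplete U-statistic idea in the simplest one-sample, degree-2 case: the complete statistic averages a kernel h over all O(n^2) pairs, while the incomplete version averages h over only B = O(n) randomly drawn pairs. The kernel (Gini mean difference) and the data are illustrative choices, not the talk's ranking setting:

```python
import itertools
import random

def h(x, y):
    return abs(x - y)  # Gini mean difference kernel

def complete_U(xs):
    """Average of h over all n*(n-1)/2 pairs."""
    pairs = list(itertools.combinations(xs, 2))
    return sum(h(x, y) for x, y in pairs) / len(pairs)

def incomplete_U(xs, B, rng):
    """Average of h over only B randomly sampled pairs."""
    n = len(xs)
    total = 0.0
    for _ in range(B):
        i, j = rng.sample(range(n), 2)
        total += h(xs[i], xs[j])
    return total / B

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]
full = complete_U(xs)                          # ~500,000 terms
cheap = incomplete_U(xs, B=len(xs), rng=rng)   # only 1,000 terms
print(round(full, 3), round(cheap, 3))         # close to each other
```

With B = n terms the Monte Carlo estimate is already close to the complete statistic, which is the phenomenon the talk's theory makes precise for the ERM setting.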

Clémence Magnien
Université Pierre et Marie Curie, France

Title: The Importance of Nodes in Dynamic Networks

Abstract: The large body of work studying complex networks has reached a consensus on which statistics are relevant to describe a given network: degree distribution, clustering coefficient, community structure, etc. Different notions have been introduced to compute a node's centrality, i.e. its importance in the network. One of the most widely used is betweenness centrality, which quantifies how many shortest paths a given node lies on. However, in dynamic networks, i.e. networks whose nodes and links evolve with time, no consensus exists on how to evaluate a node's importance. We will present some possible ways to evaluate it. We will study the case of an inter-person contact network, in which people carry sensors that can communicate with each other wirelessly, e.g. via Bluetooth or Wi-Fi. These sensors emit packets periodically, and the reception of such a packet by another sensor indicates that the two corresponding persons are within communication range. Such networks are widely studied for the design of communication protocols.
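As a reference point for the static notion the talk starts from, here is a brute-force illustration of betweenness centrality: a node's score is, over all other node pairs, the fraction of their shortest paths that pass through it. This path-enumerating version is only suitable for tiny toy graphs (efficient implementations use Brandes' algorithm):

```python
from collections import deque
from itertools import combinations

def shortest_paths(adj, s, t):
    """All shortest paths from s to t (BFS plus predecessor expansion)."""
    dist = {s: 0}
    preds = {s: []}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                preds[v] = [u]
                q.append(v)
            elif dist[v] == dist[u] + 1:
                preds[v].append(u)
    def expand(v):
        if v == s:
            return [[s]]
        return [p + [v] for u in preds[v] for p in expand(u)]
    return expand(t) if t in dist else []

def betweenness(adj, v):
    """Sum over pairs (s, t) of the fraction of s-t shortest paths via v."""
    score = 0.0
    for s, t in combinations(adj, 2):
        if v in (s, t):
            continue
        paths = shortest_paths(adj, s, t)
        if paths:
            score += sum(v in p for p in paths) / len(paths)
    return score

# Path graph a - b - c: b sits on the only shortest path from a to c.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness(adj, "b"))  # 1.0
```

In a dynamic network the very notion of "shortest path" must be replaced by time-respecting paths, which is precisely why the static definition does not carry over directly.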

Nikos Paragios
Ecole Centrale de Paris, Ecole des Ponts-ParisTech and INRIA Saclay, France

Title: Graphical Models and Discrete Optimization on Big (Visual) Data

Abstract: Inverse modeling aims at recovering the parameters of a parametric model such that the resulting instance optimally explains the observations/measurements. Such a formulation is often cast as the optimization of an appropriately defined objective function, which is in most cases an ill-posed, highly non-linear, and highly non-convex problem. In this seminar, we will investigate the use of graphical models (of low and higher order) to address such inference, and present efficient optimization algorithms that can produce either computationally efficient near-optimal solutions or optimal ones through tight relaxations and dual decomposition, when applied to colossal data involving models of massive complexity. The domain of computer vision / medical imaging (segmentation, registration, matching, and beyond) will be used to demonstrate the potential of such modeling and inference methods on big data.
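For readers new to discrete graphical-model inference, the simplest tractable case is MAP labeling on a chain MRF, solved exactly by min-sum dynamic programming. The tiny instance below (unary costs plus a Potts-style pairwise smoothness cost, all values made up) is a sketch of the setting only; the vision-scale models in the talk require the relaxation and dual-decomposition machinery mentioned in the abstract:

```python
def chain_map(unary, pairwise):
    """Exact MAP on a chain MRF: minimize sum of unary + pairwise costs.

    unary[i][x]: cost of assigning label x to node i.
    pairwise[x][y]: cost of adjacent nodes taking labels (x, y).
    """
    n, K = len(unary), len(unary[0])
    cost = list(unary[0])   # best cost of a prefix ending in each label
    back = []
    for i in range(1, n):
        new_cost, ptr = [], []
        for y in range(K):
            best_x = min(range(K), key=lambda x: cost[x] + pairwise[x][y])
            new_cost.append(cost[best_x] + pairwise[best_x][y] + unary[i][y])
            ptr.append(best_x)
        cost, back = new_cost, back + [ptr]
    # Backtrack from the cheapest final label.
    y = min(range(K), key=lambda k: cost[k])
    labels = [y]
    for ptr in reversed(back):
        y = ptr[y]
        labels.append(y)
    return labels[::-1]

# Three nodes, two labels; smoothness penalizes label changes.
unary = [[0.0, 2.0], [1.5, 0.1], [2.0, 0.0]]
pairwise = [[0.0, 1.0], [1.0, 0.0]]
print(chain_map(unary, pairwise))  # [0, 1, 1]
```

On chains this dynamic program is exact in O(n K^2) time; on the loopy, higher-order graphs of real vision problems the same objective becomes NP-hard, which motivates the tight relaxations discussed in the talk.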