Data challenges

  1. Predicting links on the French Web-Graph (January 2021)

    In this challenge, the task is to predict links between pages in an extracted sub-graph of the French web-graph. The web-graph is a directed graph G(V, E) whose vertices correspond to the pages of the French web, and a directed edge connects page U to page V if there exists a hyperlink on page U pointing to page V. From the original sub-graph, edges have been deleted at random. Given a set of candidate edges, participants should predict which ones appeared in the original sub-graph. Each node is associated with a text file extracted from the HTML of the corresponding webpage.

    Challenge link: https://www.kaggle.com/c/aai-challenge-2020
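
    As an illustration of the task described above, the following minimal sketch scores candidate edges with the Jaccard coefficient of the two endpoints' neighbor sets. It is only one possible baseline, not the method used in the challenge: edge direction is ignored, the node text is not used, and the toy edge list and the function names are assumptions made for this example.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Map each node to the set of nodes it is linked with, ignoring direction."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return neighbors


def jaccard_score(neighbors, u, v):
    """Share of common neighbors among all neighbors of u and v (0.0 if none)."""
    nu = neighbors.get(u, set())
    nv = neighbors.get(v, set())
    union = nu | nv
    if not union:
        return 0.0
    return len(nu & nv) / len(union)


# Hypothetical toy data: edges that remain after the random deletion,
# and candidate edges to be scored.
observed_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("e", "d")]
candidates = [("a", "d"), ("a", "e"), ("b", "d")]

neighbors = build_neighbors(observed_edges)
scores = {pair: jaccard_score(neighbors, *pair) for pair in candidates}

# Higher scores suggest a candidate is more likely to be an original edge.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

    In practice such a structural score would typically be one feature among several, combined with text-based features in a supervised classifier.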


  2. H-index Prediction (January 2021)

    In this regression challenge, each sample corresponds to a researcher (i.e., an author of research papers), and the goal is to build a model that accurately predicts the h-index of each author. The h-index measures an author's productivity and the citation impact of their publications. It is defined as the maximum value of h such that the author has published h papers that have each been cited at least h times. To build the model, two types of data are provided: (1) a graph that models the collaboration intensity between pairs of authors (i.e., whether they have co-authored any papers), and (2) the abstracts of the top-cited papers of each author.

    The challenge was organized for the “Advanced learning for text and graph data ALTEGRAD” 2020-2021 course.

    Challenge link: https://www.kaggle.com/c/altegrad-2020
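
    The h-index definition given above can be made concrete with a short sketch. It computes the target quantity directly from a list of per-paper citation counts; the citation counts are made-up values for illustration, and this is not the prediction model that participants had to build (in the challenge, the h-index had to be predicted from the graph and the abstracts).

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for two authors.
print(h_index([10, 8, 5, 4, 3]))  # 4
print(h_index([25, 8, 5, 3, 3]))  # 3
```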


  3. COVID19 Retweet Prediction Challenge 2020 (November 2020)

    In this challenge, the task is to accurately predict the number of retweets a tweet will get. The provided dataset is a small subset extracted from the COVID19 Twitter dataset that was collected by the DaSciM team during the first wave of lockdowns (March 2020). It contains tweet-related information, such as the text and the number of hashtags, mentions and URLs contained in the tweet, and user-related information, such as the number of followers and the number of tweets the user has published.

    Challenge link: https://www.kaggle.com/c/covid19-retweet-prediction-challenge-2020


  4. Predicting category label for French Domains (January 2020)

    In this challenge, the task is to predict the categories to which the domains of the test set belong. The participants get a sub-graph of the French web graph where nodes correspond to domains. A directed edge between two nodes indicates that there is a hyperlink from at least one page of the source domain to at least one page of the target domain. Furthermore, participants are provided with the text extracted from all the pages of each domain. A subset of these domains was manually classified into 8 categories and split between a training set and a test set. The challenge was organized for the “Advanced learning for text and graph data ALTEGRAD” 2019-2020 course.

    Challenge link: https://www.kaggle.com/c/fr-domain-classification


  5. Predicting links on the French Web-Graph (November 2019)

    In this challenge, the task is to predict links between pages in an extracted sub-graph of the French web-graph. The web-graph is a directed graph G(V, E) whose vertices correspond to the pages of the French web, and a directed edge connects page U to page V if there exists a hyperlink on page U pointing to page V. From the original sub-graph, edges have been deleted at random. Given a set of candidate edges, participants should predict which ones appeared in the original sub-graph. Each node is associated with a text file extracted from the HTML of the corresponding webpage. The challenge was organized for the “Machine and Deep learning” 2019-2020 course.

    Challenge link: https://www.kaggle.com/c/link-prediction-data-challenge-2019


  6. Predicting continuous values associated with graphs (January 2019)

    In this competition, students had to solve a multi-target graph regression problem with a Deep Learning architecture for NLP, the Hierarchical Attention Network (HAN). The provided dataset consisted of undirected, unweighted graphs. Each node of each graph is associated with a unique ID which corresponds to the row index of its 13-dimensional attribute vector. The students were asked to perform tasks such as sampling, enriching node attributes and tuning the HAN architecture. The challenge was organized for the “Advanced learning for text and graph data ALTEGRAD” 2018-2019 course.

    Challenge link: https://www.kaggle.com/c/altegrad-19


  7. Predicting meaning similarity of short text (December 2017)

    The goal of this competition is to predict which of the provided pairs of questions contain two questions with the same meaning. The ground truth is a set of labels supplied by human experts. This is inherently subjective, as the true meaning of sentences cannot be known with certainty. Human labeling is a ‘noisy’ process, and different people would probably disagree. As a result, ground truth labels on this dataset should be taken as indicative rather than 100% accurate, and may include incorrect labeling. The challenge was organized for the “Advanced learning for text and graph data ALTEGRAD” 2017-2018 course.

    Challenge link: https://www.kaggle.com/c/altegrad-challenge-fall-17


  8. Predicting missing links in a citation network (January 2017 / February 2018)

    A citation network is represented as a graph G(V, E) where V is the set of nodes and E is the set of edges (links). Each node corresponds to a paper, and the existence of an edge between two nodes u and v means that one of the papers cites the other. Each node is associated with information such as the title of the paper, publication year, author names and a short abstract. A number of edges have been randomly deleted from the original citation network. The mission is to accurately reconstruct the initial network using graph-theoretical and textual features, and possibly other information. The challenge was organized for the “Text Mining and NLP” 2016-2017 course offered to the M1 Polytechnique students.

    Challenge link:


  9. Email recipient recommendation (January 2017)

    It has been shown that, at work, employees frequently forget to include one or more recipients before sending a message. Conversely, it is common for some recipients of a given message not to have been intended to receive it. To increase productivity and prevent information leakage, the need for effective email recipient recommendation systems is thus pressing. In this challenge, we asked the MVA 2016-2017 students to develop such a system, which, given the content and the date of a message, recommends a list of 10 recipients ranked by decreasing order of relevance.

    Challenge link: https://inclass.kaggle.com/c/master-data-science-mva-data-competition-2017
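
    The ranking task described above can be sketched with a simple content-based baseline: each recipient gets a word-count profile built from past messages, and recipients are ranked by how often the words of the new message appear in their profile. This is only an assumed baseline for illustration, not the system developed in the challenge; the message date is ignored here, and the toy history, names, and function names are hypothetical.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenization (deliberately simplistic)."""
    return text.lower().split()


def build_profiles(history):
    """history: iterable of (text, recipients). Returns recipient -> token counts."""
    profiles = defaultdict(Counter)
    for text, recipients in history:
        tokens = tokenize(text)
        for recipient in recipients:
            profiles[recipient].update(tokens)
    return profiles


def recommend(profiles, text, k=10):
    """Return up to k recipients ranked by decreasing overlap with the message."""
    tokens = set(tokenize(text))
    scores = {r: sum(counts[t] for t in tokens) for r, counts in profiles.items()}
    # Ties are broken alphabetically so that the ranking is deterministic.
    ranked = sorted(scores, key=lambda r: (-scores[r], r))
    return ranked[:k]


# Hypothetical past messages with their recipients.
history = [
    ("quarterly budget review", ["alice", "bob"]),
    ("budget forecast update", ["alice"]),
    ("team lunch friday", ["carol"]),
]
profiles = build_profiles(history)
print(recommend(profiles, "budget review meeting"))
```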


  10. Link Prediction Data Challenge (March 2016)

    In this competition, we define a citation network as a graph where nodes are research papers and there is an edge between two nodes if one of the two papers cites the other. Edges have been deleted at random from a citation network. The mission is to accurately reconstruct the initial network using graph, textual, and other information. The challenge was organized for the “Graph and Text Mining” course offered to the MVA/Data Science M2 master programs.

    Challenge link: https://inclass.kaggle.com/c/mds-mva-kaggle/


  11. AXA 2016 Data Challenge (February 2016)

    The DASCIS chair successfully launched the first data challenge based on data and requirements provided by AXA. It was announced to the students of the INF582 course.

    The challenge aimed at developing models for an inbound call forecasting system. The forecasting system should be able to predict the number of incoming calls for the AXA Assistance call center in France, on a per “half-hour” time slot basis. The prediction is for three (3) days ahead. The dataset includes telephony data retrieved from AXA call centers and covers the calendar years 2011 and 2012. The full description of the data challenge can be found here: http://moodle.lix.polytechnique.fr/data_chalenge/.
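
    To illustrate the forecasting setup described above, here is a minimal sketch of a seasonal baseline: the forecast for a half-hour slot three days ahead is the historical mean of calls observed in the same weekday and half-hour slot. This is an assumed baseline for illustration only, not the model required by the challenge; the call counts and function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def slot_key(ts):
    """Identify a half-hour slot by (weekday, hour, half-hour index)."""
    return (ts.weekday(), ts.hour, ts.minute // 30)


def fit_slot_means(history):
    """history: iterable of (timestamp, calls). Returns slot -> mean call count."""
    totals = defaultdict(lambda: [0.0, 0])
    for ts, calls in history:
        acc = totals[slot_key(ts)]
        acc[0] += calls
        acc[1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


def forecast(slot_means, last_ts, days_ahead=3, default=0.0):
    """Forecast the slot `days_ahead` days after `last_ts` from the slot means."""
    target = last_ts + timedelta(days=days_ahead)
    return target, slot_means.get(slot_key(target), default)


# Hypothetical observations: two Mondays and one Friday at 09:00.
history = [
    (datetime(2011, 1, 3, 9, 0), 10),
    (datetime(2011, 1, 10, 9, 0), 14),
    (datetime(2011, 1, 14, 9, 0), 7),
]
slot_means = fit_slot_means(history)

# Three days after Friday 2011-01-14 09:00 is Monday 2011-01-17 09:00.
target, predicted = forecast(slot_means, datetime(2011, 1, 14, 9, 0))
print(target, predicted)
```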


  12. Opinion Mining Data Challenge (May 2015)

    As part of the professional training program “DSSP”, this challenge functions as an educational tool through the application of text mining and data mining techniques to the task of Opinion Mining. The goal is to produce a classifier that can identify comments in reviews as either positive or negative. The students are also guided to use Big Data technologies (Spark, Hadoop) to complete this challenge.