Data Challenges

Generating Graphs with Specified Properties (2024)

The goal of this project is to explore and apply machine learning and artificial intelligence techniques to generate graphs with specified properties. Participants will train a latent diffusion model that, conditioned on a textual description, produces the corresponding graph structure. The challenge is hosted on Kaggle, which supports both collaboration and competition.

Sub-event Detection in Twitter Streams (2024)

This project focuses on predicting the presence of specific sub-events in tweets during football games from the 2010 and 2014 World Cups. Participants will build a binary classification model to detect sub-events based on labeled tweet data, using features such as keywords, frequency of phrases, and timing information.

News Articles Title Generation (2024)

This challenge aims to generate informative and engaging titles for news articles using advanced NLP models. Participants will develop algorithms for extractive and abstractive summarization to create concise headlines. The models will be assessed using the ROUGE-L F-Score metric.
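As a rough illustration of the metric, ROUGE-L scores a candidate title by the longest common subsequence (LCS) it shares with the reference title. The sketch below assumes lowercase whitespace tokenization and an equally weighted F-score (F1); the challenge's official scorer may tokenize or weight precision and recall differently, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(reference, candidate):
    """ROUGE-L F1 between a reference title and a candidate title.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A candidate that drops one word from a six-word reference keeps perfect precision but loses recall, so its score falls below 1.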

Retweet Prediction Challenge (2022)

In this challenge, participants will predict the number of retweets a tweet will receive, using a dataset related to the 2022 French presidential election. Participants may combine supervised and unsupervised techniques, and submissions are scored by Mean Absolute Error (MAE), which should be as low as possible.
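For reference, MAE is the average absolute difference between predicted and actual retweet counts. A minimal sketch in plain Python follows; the function name and the example values in the usage note are illustrative and are not taken from the challenge data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all tweets."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one value is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

With actual counts [0, 10, 3] and predictions [1, 7, 3], the absolute errors are 1, 3 and 0, giving an MAE of 4/3.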

Predicting Link on the French Web-Graph (2021)

In this challenge, participants were tasked with predicting links between pages in a sub-graph of the French web. The web-graph is a directed graph where vertices correspond to French web pages, and directed edges represent hyperlinks. After randomly deleting edges from the original graph, participants were asked to predict which candidate edges existed in the original structure. Each node comes with a text file extracted from the HTML of its corresponding webpage.
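One common graph-only baseline for this kind of task, not necessarily the approach used by participants, scores a candidate edge by the Jaccard coefficient of the two pages' neighbor sets. The sketch below makes simplifying assumptions: edge direction is ignored, the node text is not used, and all names are hypothetical.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Map each node to its set of neighbors, ignoring edge direction."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return neighbors


def jaccard_score(neighbors, u, v):
    """Shared neighbors divided by total distinct neighbors of u and v."""
    a = neighbors.get(u, set())
    b = neighbors.get(v, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Candidate edges with higher scores would be ranked as more likely to have existed in the original graph; in practice such a score is usually one feature among several, alongside features derived from the page text.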

H-index Prediction (2021)

In this regression challenge, participants developed models to predict the h-index of researchers based on their publication history. The h-index measures an author’s productivity and citation impact. Data included a graph modeling collaboration intensity (i.e., co-authorship) and abstracts of top-cited papers.

This challenge was part of the “Advanced Learning for Text and Graph Data (ALTEGRAD)” 2020-2021 course.
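For readers unfamiliar with the target variable: an author's h-index is the largest h such that h of their papers have at least h citations each. The sketch below computes it from a list of citation counts; it illustrates the definition only, not the predictive models built in the challenge, and the function name is illustrative.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h
```

For citation counts [10, 8, 5, 4, 3], four papers have at least four citations each but there are not five with at least five, so the h-index is 4.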

COVID-19 Retweet Prediction Challenge (2020)

In this challenge, participants predicted the number of retweets a tweet would receive, based on tweet content and user information (e.g., number of followers). The dataset was a subset of the COVID-19 Twitter dataset, collected during the first wave of lockdowns in March 2020.

Predicting Category Label for French Domains (2020)

This challenge focused on predicting categories for French domains. Participants worked with a sub-graph of the French web in which nodes represent domains and edges indicate hyperlinks between them. A subset of the domains was manually categorized into eight groups, which were used to train and test prediction models.

This challenge was part of the “Advanced Learning for Text and Graph Data (ALTEGRAD)” 2019-2020 course.

Predicting Link on the French Web-Graph (2019)

Similar to the January 2021 challenge, this task involved predicting links between pages in a sub-graph of the French web-graph, with edges randomly deleted. Each node was associated with a text file extracted from the HTML of the corresponding webpage.

This challenge was part of the “Machine and Deep Learning” 2019-2020 course.

Predicting Continuous Values Associated with Graphs (2019)

In this competition, participants solved a multi-target graph regression problem using a deep learning architecture called Hierarchical Attention Network (HAN). The dataset consisted of an undirected, weighted graph with nodes representing entities and their attributes. The task involved sampling, enriching node attributes, and tuning the HAN architecture for predictions.

This challenge was part of the “Advanced Learning for Text and Graph Data (ALTEGRAD)” 2018-2019 course.

Predicting Meaning Similarity of Short Texts (2017)

Participants in this challenge predicted which pairs of questions had the same meaning. The task was inherently subjective, as the true meaning of sentences can vary. Human labeling was considered ‘noisy,’ and ground truth labels were used as guidelines.

This challenge was part of the “Advanced Learning for Text and Graph Data (ALTEGRAD)” 2017-2018 course.

Email Recipient Recommendation (2017)

This challenge aimed to develop a system that recommends email recipients based on the content and date of a message. The goal was to increase productivity and prevent information leakage by suggesting the most relevant recipients.

Link Prediction Data Challenge (2016)

This competition defined a citation network as a graph where research papers are connected by citation relationships. Participants were tasked with reconstructing deleted edges using graph, textual, and other data features.

This challenge was part of the “Graph and Text Mining” course offered to the MVA/Data Science M2 master's programs.

AXA 2016 Data Challenge (2016)

Launched by the DASCIS chair, this challenge focused on developing a forecasting model to predict the number of incoming calls at AXA Assistance call centers in France. The dataset included telephony data from 2011-2012.

Opinion Mining Data Challenge (2015)

This challenge involved applying text mining and data mining techniques to classify comments in reviews as either positive or negative. Participants were also encouraged to use Big Data technologies like Spark and Hadoop.
