Data challenges

  1. Generating Graphs with Specified Properties

    The goal of this project is to apply machine learning and artificial intelligence techniques to generate graphs with specified properties. Participants will train a latent diffusion model that produces graph structures conditioned on a given textual description. The challenge is hosted on Kaggle, which supports both collaboration and competition.


  2. Sub-event Detection in Twitter Streams

    This project focuses on detecting specific sub-events in tweets posted during football games from the 2010 and 2014 World Cups. Participants will build a binary classification model that detects sub-events from labeled tweet data, using features such as keywords, phrase frequencies, and timing information.


  3. News Articles Title Generation

    This challenge aims to generate informative and engaging titles for news articles using advanced NLP models. Participants will develop algorithms for extractive and abstractive summarization to create concise headlines. The models will be assessed using the ROUGE-L F-Score metric.
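
    The ROUGE-L F-Score mentioned above compares a generated title with a reference title through their longest common subsequence (LCS). The following is a minimal sketch, assuming lowercased whitespace tokenization and an equal weighting of precision and recall (beta = 1); the challenge's exact tokenization and weighting are not stated here, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(candidate, reference, beta=1.0):
    """ROUGE-L F-score between a candidate title and a reference title.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


score = rouge_l_f("the cat", "the cat sat on the mat")
```

    In this example the candidate is fully contained in the reference (precision 1) but covers only a third of it (recall 1/3), so the score is about 0.5.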


  4. Retweet Prediction Challenge 2022

    In this challenge, participants will predict the number of retweets a tweet will receive, using a dataset related to the 2022 French presidential election. Participants may use supervised techniques, unsupervised techniques, or a combination of both to minimize the Mean Absolute Error (MAE).
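
    The Mean Absolute Error is the average of the absolute differences between predicted and true retweet counts. The sketch below computes it in plain Python and compares two constant baselines; the retweet counts are made up for illustration and are not taken from the challenge data.

```python
from statistics import mean, median


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical retweet counts: mostly small, with one viral tweet.
retweets = [0, 0, 1, 3, 250]

# Constant baselines: predict the median or the mean for every tweet.
mae_median = mean_absolute_error(retweets, [median(retweets)] * len(retweets))
mae_mean = mean_absolute_error(retweets, [mean(retweets)] * len(retweets))
```

    Among constant predictions, the median minimizes the MAE, so on a skewed sample like this one `mae_median` is lower than `mae_mean`. This makes the median a natural baseline to beat.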


  5. Predicting Link on the French Web-Graph (January 2021)

    In this challenge, participants were tasked with predicting links between pages in a sub-graph of the French web. The web-graph is a directed graph in which vertices correspond to French web pages and directed edges represent hyperlinks. After edges had been randomly deleted from the original graph, participants were asked to predict which candidate edges existed in the original structure. Each node came with a text file extracted from the HTML of its corresponding webpage.


  6. H-index Prediction (January 2021)

    In this regression challenge, participants developed models to predict the h-index of researchers based on their publication history. The h-index measures an author’s productivity and citation impact. Data included a graph modeling collaboration intensity (i.e., co-authorship) and abstracts of top-cited papers.
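
    For reference, the h-index that serves as the regression target is the largest number h such that the author has at least h papers with at least h citations each. A minimal sketch of that definition follows; the function name and the citation counts are illustrative and not part of the challenge data.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have >= rank citations
        else:
            break
    return h


# Hypothetical citation counts for one author's papers.
example = h_index([10, 8, 5, 4, 3])
```

    Here `example` is 4: four papers have at least four citations each, but there are not five papers with at least five.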

    This challenge was part of the “Advanced Learning for Text and Graph Data (ALTEGRAD)” 2020-2021 course.


  7. COVID-19 Retweet Prediction Challenge 2020 (November 2020)

    In this challenge, participants predicted the number of retweets a tweet would receive, based on tweet content and user information (e.g., number of followers). The dataset was a subset of the COVID-19 Twitter dataset, collected during the first wave of lockdowns in March 2020.


  8. Predicting Category Label for French Domains (January 2020)

    This challenge focused on predicting categories for French domains. Participants worked with a sub-graph of the French web in which nodes represent domains and edges indicate hyperlinks between them. A subset of the domains was manually assigned to one of eight categories, and these labeled domains were used to train and test prediction models.

    This challenge was part of the “Advanced Learning for Text and Graph Data (ALTEGRAD)” 2019-2020 course.


  9. Predicting Link on the French Web-Graph (November 2019)

    Similar to the January 2021 challenge, this task involved predicting links between pages in a sub-graph of the French web-graph, with edges randomly deleted. Each node was associated with a text file extracted from the HTML of the corresponding webpage.

    This challenge was part of the “Machine and Deep Learning” 2019-2020 course.


  10. Predicting Continuous Values Associated with Graphs (January 2019)

    In this competition, participants solved a multi-target graph regression problem using a deep learning architecture called Hierarchical Attention Network (HAN). The dataset consisted of an undirected, weighted graph with nodes representing entities and their attributes. The task involved sampling, enriching node attributes, and tuning the HAN architecture for predictions.

    This challenge was part of the “Advanced Learning for Text and Graph Data (ALTEGRAD)” 2018-2019 course.


  11. Predicting Meaning Similarity of Short Texts (December 2017)

    Participants in this challenge predicted which pairs of questions had the same meaning. The task was inherently subjective, because the intended meaning of a sentence is open to interpretation. The human labeling was therefore considered ‘noisy,’ and the ground-truth labels were treated as guidelines rather than as definitive answers.

    This challenge was part of the “Advanced Learning for Text and Graph Data (ALTEGRAD)” 2017-2018 course.


  12. Predicting Missing Links in a Citation Network (January 2017 / February 2018)

    This challenge involved a citation network where nodes represented research papers, and edges indicated citations between them. Randomly deleted edges needed to be reconstructed using graph-theoretical and textual features, alongside other data sources.
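
    To illustrate what graph-theoretical and textual features for a candidate edge can look like, here is a minimal sketch in plain Python. It assumes the citation graph is treated as undirected for neighborhood features and that each paper has a short text such as its title; the feature set, function names, and toy data are illustrative and do not describe what participants actually used.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets built from (source, target) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def pair_features(adj, texts, u, v):
    """Features for a candidate edge (u, v): two graph-based, one text-based."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common, union = nu & nv, nu | nv
    wu = set(texts.get(u, "").lower().split())
    wv = set(texts.get(v, "").lower().split())
    word_union = wu | wv
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "text_overlap": len(wu & wv) / len(word_union) if word_union else 0.0,
    }


# Toy data (hypothetical paper ids and titles).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
texts = {
    "a": "graph kernels for text",
    "d": "text classification with graph kernels",
}
features = pair_features(build_adjacency(edges), texts, "a", "d")
```

    Feature dictionaries like this one, computed for pairs known to be linked and for sampled unlinked pairs, could then be fed to any binary classifier to score the candidate edges.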


  13. Email Recipient Recommendation (January 2017)

    This challenge aimed to develop a system that recommends email recipients based on the content and date of a message. The goal was to increase productivity and prevent information leakage by suggesting the most relevant recipients.


  14. Link Prediction Data Challenge (March 2016)

    This competition defined a citation network as a graph where research papers are connected by citation relationships. Participants were tasked with reconstructing deleted edges using graph, textual, and other data features.

    This challenge was part of the “Graph and Text Mining” course offered to the MVA/Data Science M2 master's programs.


  15. AXA 2016 Data Challenge (February 2016)

    Launched by the DASCIS chair, this challenge focused on developing a forecasting model to predict the number of incoming calls at AXA Assistance call centers in France. The dataset included telephony data from 2011-2012.


  16. Opinion Mining Data Challenge (May 2015)

    This challenge involved applying text mining and data mining techniques to classify comments in reviews as either positive or negative. Participants were also encouraged to use Big Data technologies like Spark and Hadoop.