Software & datasets

COVID19 – ANR XTCOVIF

Our team is working intensively to better understand the spread of the COVID-19 outbreak and its impact on social media.
We study the social and economic impacts of COVID-19 using data obtained from social media. We focus on 6 different societal issues related to the outbreak of COVID-19, namely (1) people’s sentiments and emotions, (2) the decline of tourism, (3) the trust that citizens show in governments, (4) the evolution of language, (5) the increase in racism and xenophobia, and (6) the impact of COVID-19 on population mobility.

You can see our results here.

Large-Scale Linguistic Resources from the French Web

From July to September 2019, we crawled a very large part of the French web (over 1.3M domains corresponding to more than 30M URIs), which resulted in over 330 GB of uncompressed text. The obtained corpus is, to our knowledge, the largest collection of French text to date. Another recent effort is the French slice of CommonCrawl, available through OSCAR (138 GB, 32.7B SentencePiece tokens).

We have also reconstructed the domain graph, which has 1.3M nodes and 3.2M edges. These resources, listed below, are valuable for NLP and graph mining research focusing on the French language and web:
- lists of frequent words and phrases (UNI-grams)
- word embeddings (word2vec, fastText, GloVe)
- a state-of-the-art French language model, obtained by training BERT on all the available text
- polysemous words for disambiguation tasks
- manually annotated datasets for tasks such as link prediction and domain categorization
- the French web graph
Based on the data above, we have already developed two In-Class Kaggle challenges:
- Challenge 1: Link Prediction
  
  In this challenge, the task is to predict links between pages in a sub-graph extracted from the French web graph. The web graph is a directed graph G(V, E) whose vertices correspond to the pages of the French web; a directed edge connects page u to page v if there exists a hyperlink on page u pointing to page v. Edges have been deleted at random from the original sub-graph. Given a set of candidate edges, participants should predict which ones appeared in the original sub-graph. Each node is associated with a text file extracted from the HTML of the corresponding webpage.
  Available on: https://www.kaggle.com/c/link-prediction-data-challenge-2019
- Challenge 2: Domain classification
  
  In this challenge, the task is to predict the categories to which the domains of the test set belong. Participants get a sub-graph of the French web graph where nodes correspond to domains. A directed edge between two nodes indicates that there is a hyperlink from at least one page of the source domain to at least one page of the target domain. Furthermore, participants are provided with the text extracted from all the pages of each domain. A subset of these domains was manually classified into 8 categories and split between a training set and a test set.
  Available on: https://www.kaggle.com/c/fr-domain-classification
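
As an illustration of the link prediction task in Challenge 1, a simple baseline could score each candidate edge by the Jaccard overlap between the neighborhoods of its two endpoints. The sketch below is not the challenge's reference solution: it ignores edge direction and the page text, and the toy graph, the function names, and the 0.2 threshold are assumptions made for illustration.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Map each node to the set of nodes it is linked to, ignoring direction."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    return neighbors


def jaccard_score(neighbors, u, v):
    """Jaccard similarity of the neighborhoods of u and v (0.0 if both are empty)."""
    nu = neighbors.get(u, set()) - {v}
    nv = neighbors.get(v, set()) - {u}
    union = nu | nv
    if not union:
        return 0.0
    return len(nu & nv) / len(union)


def predict_links(edges, candidates, threshold=0.2):
    """Predict True for each candidate edge whose Jaccard score reaches the threshold."""
    neighbors = build_neighbors(edges)
    return {
        (u, v): jaccard_score(neighbors, u, v) >= threshold
        for u, v in candidates
    }


# Toy sub-graph with hypothetical page identifiers.
observed_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("e", "f")]
candidate_edges = [("a", "d"), ("a", "e")]
predictions = predict_links(observed_edges, candidate_edges)
```

Here pages "a" and "d" share all their neighbors, so the edge between them is predicted as present, while "a" and "e" share none. A competitive submission would typically combine such graph features with features derived from the page text.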

GraKeL

GraKeL is a Python package for the study and use of graph kernels, an emerging area of data mining and machine learning.

The project is currently in the alpha development stage and is available on pypi-test.

Code: https://github.com/ysig/GraKeL
Documentation: https://ysig.github.io/GraKeL/0.1a8/
Paper: https://arxiv.org/abs/1806.02193
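
To give an idea of what a graph kernel computes, here is a minimal pure-Python sketch of a Weisfeiler-Lehman subtree kernel between two labelled graphs. It does not use GraKeL and does not reproduce its API. The graph representation (an adjacency dict plus a label dict), the function names, and the assumption that initial labels are strings without parentheses or commas are all chosen for illustration.

```python
from collections import Counter


def wl_features(adjacency, labels, n_iter=2):
    """Count the node labels seen over n_iter rounds of Weisfeiler-Lehman relabelling."""
    current = dict(labels)
    features = Counter(current.values())
    for _ in range(n_iter):
        refined = {}
        for node, neighbors in adjacency.items():
            # New label = own label plus the sorted multiset of neighbor labels.
            neighbor_labels = sorted(current[n] for n in neighbors)
            refined[node] = f"{current[node]}({','.join(neighbor_labels)})"
        current = refined
        features.update(current.values())
    return features


def wl_kernel(graph_a, graph_b, n_iter=2):
    """Dot product of the label-count vectors of the two graphs."""
    fa = wl_features(*graph_a, n_iter=n_iter)
    fb = wl_features(*graph_b, n_iter=n_iter)
    return sum(count * fb[label] for label, count in fa.items())


# Two toy graphs: a triangle and a 3-node path, every node labelled "C".
triangle = ({0: [1, 2], 1: [0, 2], 2: [0, 1]}, {0: "C", 1: "C", 2: "C"})
path = ({0: [1], 1: [0, 2], 2: [1]}, {0: "C", 1: "C", 2: "C"})

similarity = wl_kernel(triangle, path)  # 12
```

The two graphs agree on the initial labels and on the one node with two neighbors after the first round, then diverge, which is why the cross-similarity (12) is lower than the self-similarity of either graph (27 for the triangle, 19 for the path).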

Graph-of-Words and graph-based keyword extraction

A fully unsupervised, extractive text summarization system that leverages a submodularity framework. It allows summaries to be generated in a greedy way while preserving near-optimal performance guarantees. This tool builds on the graph-of-words representation of text and the k-core decomposition algorithm to assign meaningful scores to words.
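
The two building blocks mentioned above, the graph-of-words representation and the k-core decomposition, can be sketched as follows. This is an illustrative sketch and not the code of the tool itself: the window size, the absence of preprocessing and edge weights, and the function names are assumptions.

```python
from collections import defaultdict


def graph_of_words(tokens, window=3):
    """Undirected graph linking each token to the tokens that follow it inside the window."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        neighbors = graph[word]  # also registers words that end up with no edge
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors.add(other)
                graph[other].add(word)
    return graph


def core_numbers(graph):
    """Core number of every node, obtained by repeatedly peeling a node of minimum degree."""
    degrees = {node: len(neighbors) for node, neighbors in graph.items()}
    remaining = set(graph)
    cores = {}
    k = 0
    while remaining:
        node = min(remaining, key=degrees.get)
        k = max(k, degrees[node])
        cores[node] = k
        remaining.remove(node)
        for neighbor in graph[node]:
            if neighbor in remaining:
                degrees[neighbor] -= 1
    return cores


tokens = ["graph", "mining", "text", "graph", "summary"]
cores = core_numbers(graph_of_words(tokens, window=2))

# Words in the main core (highest core number) are the keyword candidates.
top = max(cores.values())
main_core = {word for word, k in cores.items() if k == top}
```

With this toy input, "graph", "mining" and "text" form a triangle and get core number 2, while "summary" is attached to a single word and gets core number 1.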

Graph of Words code (convert documents into graphs) here.

Prototype link here.

Code available here.

Relevant Papers:
- A. J.-P. Tixier, P. Meladianos, M. Vazirgiannis, “Combining Graph Degeneracy and Submodularity for Unsupervised Extractive Summarization”, EMNLP 2017, Copenhagen, Denmark.
- A. J.-P. Tixier, K. Skianis, M. Vazirgiannis, “GoWvis: A web application for Graph-of-Words-based text visualization and summarization”, ACL 2016, Berlin, Germany.
- F. Rousseau, M. Vazirgiannis, “Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction”, ECIR 2015, Vienna, Austria.

Graph-of-Words visualization tool

GoWvis is an interactive web application that represents any piece of text inputted by the user as a Graph-of-Words and leverages graph degeneracy and community detection to generate an extractive summary (keyphrases and paragraph) of the inputted text in an unsupervised fashion. The entire analysis can be fully customized via the tuning of many text preprocessing, graph building, and graph mining parameters. Our system is thus well suited to educational purposes, exploration and early research experiments.

Prototype link here.

Relevant Papers:
- A. J.-P. Tixier, K. Skianis, M. Vazirgiannis, “GoWvis: A web application for Graph-of-Words-based text visualization and summarization”, ACL 2016, Berlin, Germany.

Degeneracy based graph mining

Prototype link here.

Two interesting explorations in the citations space:
- Cross-discipline citations in time: how inter- and intra-discipline citations evolved over time: link
- Institutional-level citations (search by university name to get its citation exchange with other entities): link

Relevant Papers:
- C. Giatsidis, D. Thilikos, M. Vazirgiannis, “Evaluating cooperation in communities with the k-core structure”, in the proceedings of the 2011 IEEE International Conference on Data Mining series (ICDM), Canada.
- C. Giatsidis, D. Thilikos, M. Vazirgiannis, “D-cores: Measuring Collaboration of Directed Graphs Based on Degeneracy”, in the proceedings of the 2011 International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Taiwan.

Match-the-News: Personalized news recommendation

Match-the-News is our Google Chrome plugin, designed for academic exploration. It demonstrates how keyword freshness can be integrated into personalized content delivery.

Relevant papers:
- M. Karkali, V. Plachouras, C. Stefanatos, M. Vazirgiannis, “Keeping Keywords Fresh: A BM25 Variation for Personalized Keyword Extraction”, in the proceedings of the 2nd Temporal Web Analytics Workshop, WWW 2012.

Google News Dataset

Dataset link here.

Our Google News Dataset consists of manually classified Google News articles and was used in the paper “Efficient Online Novelty Detection in News Streams”.

Relevant papers:
- M. Karkali, F. Rousseau, A. Ntoulas, M. Vazirgiannis, “Efficient Online Novelty Detection in News Streams”, Web Information Systems Engineering – WISE 2013.

Scientometrics

Prototype Link here.

A set of visualizations derived from the Microsoft Academic Graph to rank scientists based on their impact and influence in academia.
1. Plot of the h-index distribution and the percentile a scientist belongs to.
2. D-core matrix that depicts the directed cores a scientist belongs to.
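
The two quantities behind the first visualization, the h-index and the percentile a scientist falls into, can be computed as sketched below. This is a minimal illustration rather than the prototype's actual code: the function names, the sample numbers, and the "less than or equal" percentile convention are assumptions.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def percentile_rank(value, population):
    """Percentage of the population whose value is less than or equal to `value`."""
    return 100.0 * sum(1 for other in population if other <= value) / len(population)


# Hypothetical citation counts for one scientist's papers.
scientist_h = h_index([10, 8, 5, 4, 3])  # 4

# Hypothetical h-indices of a reference population.
population_h = [1, 3, 4, 4, 7, 9, 12, 20]
scientist_percentile = percentile_rank(scientist_h, population_h)  # 50.0
```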

DaSciM

Data Science and Mining Team @ École Polytechnique
