AML-HELAS – Advanced Machine/Deep Learning for Heterogeneous Large scale Data
This Chair, supported by the ANR and the industrial partners MAIF and LINAGORA, will be devoted to research and training in advanced deep learning methods for graphs and NLP, with an emphasis on French linguistic resources.
The chair lies at the heart and intersection of important and challenging research topics in AI. Graphs have emerged as a universal structure for information representation and learning across applications including social networks, NLP and biomedical/neuro-computing. Industrial interest in graph-based AI is significant, with immense scientific production in deep learning for graphs (Graph Neural Networks) and in NLP/text mining. Both topics are addressed in this chair, and the industrial partners have an explicit interest in its results. The chair will pursue research topics that are original and risky, but also connected to real-life applications and industrial needs.
Topics of interest in the chair include:
- Learning Graph Representations
- Advanced deep learning methods for NLP applications for French
In the past years, the amount of graph-structured data has grown steadily in a wide range of domains, such as social networks and bioinformatics. Learning useful graph representations is at the core of many real-world applications. Graph Neural Networks (GNNs) have recently emerged as a general framework for addressing graph-related machine learning tasks. These networks have demonstrated strong empirical performance, and some recent studies have made attempts to formally characterize their expressive power.
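As an illustrative sketch (not one of the chair's actual models), the message-passing operation at the core of most GNNs, in which each node aggregates its neighbours' features before a learned linear map and a nonlinearity are applied, can be written with toy matrices as:

```python
import numpy as np

# Toy 3-node graph: adjacency matrix, 2-d node features, layer weights.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W = np.array([[0.5, -0.2],
              [0.3,  0.4]])

A_hat = A + np.eye(3)                       # add self-loops
D_inv = np.diag(1.0 / A_hat.sum(axis=1))    # row-normalise the aggregation
H = np.maximum(D_inv @ A_hat @ X @ W, 0.0)  # aggregate, transform, ReLU

print(H.shape)  # one new 2-d representation per node: (3, 2)
```

Stacking several such layers lets each node's representation incorporate information from increasingly distant neighbourhoods.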
In this research activity we will build on our previous research results and extend them in the context of deep learning for spoken language understanding and summarization, producing relevant linguistic resources for the French language. An initial effort on French word embeddings is presented below.
The chair will capitalise on existing personnel: Prof. M. Vazirgiannis, Prof. J. Read, postdoctoral researchers J. Lutzeyer and M. Seddik, and collaborating researchers such as Dr. G. Nikolentzos.
There will also be a significant effort to hire research personnel at the level of doctoral students, postdoctoral researchers/research engineers and assistant professors. Interested candidates may contact the chair holder.
- February 2021: Interview with the Facebook Center for Data Innovation on novel GNN models to predict Covid-19 cases in Europe based on Facebook mobility data.
- September 2020: Presentation of the Chair at the DATAIA event.
- BARThez: a French sequence-to-sequence pre-trained model
- French linguistic resources v1
BARThez is the first French sequence-to-sequence pre-trained model.
BARThez was pre-trained on 66GB of raw French text for roughly 60 hours on 128 Nvidia V100 GPUs of the CNRS Jean Zay supercomputer.
Our model is based on BART. Unlike already existing BERT-based French language models such as CamemBERT and FlauBERT, BARThez is particularly well-suited for generative tasks, since not only its encoder but also its decoder is pre-trained.
In addition to BARThez, which is pre-trained from scratch, we continued the pre-training of a multilingual BART, mBART25, which boosted its performance on both discriminative and generative tasks. We call the French-adapted version mBARThez.
Our models are competitive with CamemBERT and FlauBERT on discriminative tasks and outperform them on generative tasks such as abstractive summarization.
Demo powered by HuggingFace: https://huggingface.co/moussaKam/BARThez
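As a hypothetical usage sketch, a BARThez checkpoint can be loaded for abstractive summarization with the `transformers` library. The checkpoint name below is an assumption based on the demo link and should be verified on the Hugging Face Hub:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Checkpoint name is an assumption; check the Hugging Face Hub.
name = "moussaKam/barthez-orangesum-abstract"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

text = "Le Concours Lépine récompense chaque année des inventions originales."
inputs = tokenizer(text, return_tensors="pt")
ids = model.generate(inputs.input_ids, max_length=40, num_beams=4)
summary = tokenizer.decode(ids[0], skip_special_tokens=True)
print(summary)
```

Because both the encoder and the decoder are pre-trained, the same checkpoint family can be fine-tuned for other generative tasks (e.g. title generation) with the standard `transformers` sequence-to-sequence training utilities.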
Word embeddings are a type of word representation that has become very important and widely used across applications in natural language processing. For instance, pre-trained word embeddings have played an important role in achieving impressive performance with deep learning models on challenging natural language understanding problems.
Here we introduce French word vectors of dimension 300, trained using Word2Vec CBOW with a window size of 20 on a large 33GB corpus of raw French text that we crawled from the French web, after applying multiple pre-processing steps such as French language detection, deduplication and tokenization.
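A typical query over such vectors ranks words by cosine similarity. The following sketch uses toy 4-dimensional vectors in place of the real 300-dimensional French embeddings; the words and values are illustrative only:

```python
import numpy as np

# Toy stand-ins for the real 300-d French word vectors.
vectors = {
    "roi":   np.array([0.9, 0.1, 0.3, 0.0]),
    "reine": np.array([0.8, 0.2, 0.4, 0.1]),
    "pomme": np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine(u, v):
    # Cosine similarity: the standard metric for querying word vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(word):
    # Return the vocabulary word most similar to `word` (excluding itself).
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))

print(nearest("roi"))  # "reine" is the closest toy vector
```

With the released embeddings, the same nearest-neighbour query is what the demo below computes over the full vocabulary.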
Below are a few examples for your initial tests. To see the results, type your input and press submit. You can also click on one of the ready-made examples and it will be filled into the form. If you supply your own words, they must be in the vocabulary of the word vectors.
Demo: French Word Embeddings
PUBLICATIONS – TECHNICAL REPORTS
- Dasoulas, George, Giannis Nikolentzos, Kevin Scaman, Aladin Virmaux, and Michalis Vazirgiannis. “Ego-based Entropy Measures for Structural Representations.” arXiv preprint arXiv:2003.00553 (2020).
- Xypolopoulos, Christos, Antoine J-P. Tixier, and Michalis Vazirgiannis. “Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings.” In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, Long Papers (2020).
- Eddine, Moussa Kamal, Antoine J-P. Tixier, and Michalis Vazirgiannis. “BARThez: a Skilled Pretrained French Sequence-to-Sequence Model.” arXiv preprint arXiv:2010.12321 (2020).
- Boniol, Paul, George Panagopoulos, Christos Xypolopoulos, Rajaa El Hamdani, David Restrepo Amariles, and Michalis Vazirgiannis. “Performance in the Courtroom: Automated Processing and Visualization of Appeal Court Decisions in France.” arXiv preprint arXiv:2006.06251 (2020).