AML-HELAS – Advanced Machine/Deep Learning for Heterogeneous Large scale Data



Michalis Vazirgiannis

Chair holder
DaSciM team, LIX
Ecole Polytechnique, France

This Chair, supported by the ANR and the industrial partners MAIF and LINAGORA, is devoted to research and training in advanced deep learning methods for graphs and NLP, with an emphasis on French linguistic resources.

  • Research: Contribute to developing theories and best practices in the field of advanced AI algorithms for alternative data, including text, graphs and time series, with applications in domains such as insurance, digital marketing, biomedicine and social networks.
  • Students: Prepare students, through teaching, to initiate or support innovation in these areas.
  • Academia & Industry: Disseminate results to industry and the scientific community and, more widely, to the general public through publications.


The chair sits at the heart and intersection of important and challenging research topics in AI. Graphs are emerging as a universal structure for information representation and learning in different applications, including social networks, NLP and biomedical/neuro-computing. Industrial interest in graph-based AI is very significant, and scientific output in deep learning for Graph Neural Networks and NLP/text mining is immense. Both topics are addressed in this chair, and the industrial partners have an explicit interest in its results. The chair will aspire towards research topics that are original and risky but also connected to real-life applications and industrial needs.

Topics of interest in the chair include:

    1. Learning Graph Representations

In the past years, the amount of graph-structured data has grown steadily in a wide range of domains, such as in social networks and in bioinformatics. Learning useful graph representations is at the core of many real-world applications. Graph Neural Networks (GNNs) have recently emerged as a general framework for addressing graph-related machine learning tasks. These networks have demonstrated strong empirical performance, and some recent studies have made attempts to formally characterize their expressive power.
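As a rough illustration of the message-passing idea behind most GNNs, the sketch below performs one round of neighbourhood aggregation in plain Python. It is not the chair's method: the helper name `message_passing_step`, the choice of mean aggregation, and the toy graph are all assumptions made for the example, and the learned weights and non-linearity of a real GNN layer are left out.

```python
from typing import Dict, List


def message_passing_step(
    adjacency: Dict[int, List[int]],
    features: Dict[int, List[float]],
) -> Dict[int, List[float]]:
    """One round of mean-aggregation message passing (hypothetical helper).

    Each node's new representation is the average of its own feature
    vector and those of its neighbours. Real GNN layers additionally apply
    learned weights and a non-linearity, which are omitted here.
    """
    updated = {}
    for node, own in features.items():
        # Gather the node's own vector together with its neighbours' vectors.
        messages = [own] + [features[n] for n in adjacency.get(node, [])]
        dim = len(own)
        # Average the gathered vectors coordinate by coordinate.
        updated[node] = [
            sum(vec[i] for vec in messages) / len(messages) for i in range(dim)
        ]
    return updated


# Toy path graph 0 - 1 - 2 with one-dimensional node features (assumed data).
toy_adjacency = {0: [1], 1: [0, 2], 2: [1]}
toy_features = {0: [1.0], 1: [2.0], 2: [6.0]}
smoothed = message_passing_step(toy_adjacency, toy_features)
```

Stacking several such steps lets information travel further across the graph, which is one reason the expressive power of these networks is studied in terms of how many neighbourhood hops a node can "see".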

    2. Advanced deep learning methods for NLP applications for French
      linguistic resources

In this research activity we will build on our previous research results and extend them in the context of deep learning for spoken language understanding and summarization, producing relevant linguistic resources for the French language. An initial effort, a set of French word embeddings, is presented further down this page.


The chair will capitalise on existing personnel: Prof. M. Vazirgiannis, Prof. J. Read, postdoctoral researchers J. Lutzeyer and M. Seddik, and collaborating researcher Dr. G. Nikolentzos.
There will also be a significant effort to hire research personnel at the level of doctoral students, postdoctoral researchers/research engineers and assistant professor. Interested candidates may contact the chair holder.


May 2023: “Graph Neural Networks and applications to heterogeneous data”, Keynote speaker at WEBCONF2023, Texas, USA
April 2023: “Graph Machine Learning with GNNs and Applications” – Invited course, DeepLearn 2023 Spring, 9th International School on Deep Learning, Bari Italy
April 2022: “Machine Learning with Graphs and Applications” – Invited course, 5th INTERNATIONAL SCHOOL ON DEEP LEARNING, Guimarães, Portugal
December 2021: “JuriBERT: A Masked-Language Model Adaptation for French Legal Text”, Algorithmic Law and Society Symposium, HEC Paris, Paris, France
October 2021: “Towards GNN explainability: Random Walk Graph Neural Networks” – Keynote talk, in the RecSys 21, GReS Workshop on Graph Neural Networks for Recommendation and Search
February 2021: Interview with the Facebook Center for Data Innovation on novel GNN models to predict Covid-19 cases in Europe based on Facebook mobility data.
September 2020: Presentation of the Chair at the DATAIA event (video presentation, presentation slides)


    • BARThez: A French sequence-to-sequence pretrained model

BARThez is the first French sequence-to-sequence pre-trained model.
BARThez is pre-trained on 66GB of French raw text for roughly 60 hours on 128 Nvidia V100 GPUs using the CNRS Jean Zay supercomputer.

Our model is based on BART. Unlike already existing BERT-based French language models such as CamemBERT and FlauBERT, BARThez is particularly well-suited for generative tasks, since not only its encoder but also its decoder is pre-trained.
In addition to BARThez, which is pre-trained from scratch, we continued the pre-training of a multilingual BART, mBART25, which boosted its performance in both discriminative and generative tasks. We call the French-adapted version mBARThez.

Our models are competitive with CamemBERT and FlauBERT in discriminative tasks and outperform them in generative tasks such as abstractive summarization.

Paper: https://arxiv.org/abs/2010.12321

Github: https://github.com/moussaKam/BARThez

Demo powered by HuggingFace: https://huggingface.co/moussaKam/BARThez

    • BERTweetFR: Domain Adaptation of Pre-Trained Language Models for French Tweets

BERTweetFR is the first pre-trained large-scale language model adapted to French tweets.
It is initialized with CamemBERT, the state-of-the-art general-domain language model for French based on the RoBERTa architecture. We perform domain-adaptive pre-training on 182M deduplicated tweets.
The training runs for roughly 20 hours on 8 Nvidia V100 GPUs (32GB each) using the CNRS Jean Zay supercomputer.
We evaluated BERTweetFR on three downstream Twitter analytics tasks: offensiveness classification, named entity recognition and unsupervised semantic shift detection. All three showed significant improvement compared to general-domain models.

Paper: https://aclanthology.org/2021.wnut-1.49/
Demo powered by HuggingFace: https://huggingface.co/Yanzhu/bertweetfr-base

    • JuriBERT: A Masked-Language Model Adaptation for French Legal Text

JuriBERT is a set of BERT models (tiny, mini, small and base) pre-trained from scratch on French legal-domain corpora.
JuriBERT models are pretrained on 6.3GB of French legal raw text from two different sources: the first dataset is crawled from Légifrance, and the second consists of anonymized court decisions and the Claimant’s pleadings from the Court of Cassation. The latter contains more than 100k long documents from different court cases.
JuriBERT models are pretrained using Nvidia GTX 1080Ti and evaluated on a legal-specific downstream task, which consists of assigning the court Claimant’s pleadings to a chamber and a section of the court. While JuriBERT_SMALL outperforms the general-domain BERT models (CamemBERT_BASE and CamemBERT_LARGE), the other models have a similar performance.

Paper: https://aclanthology.org/2021.nllp-1.9/

Word embeddings are a type of word representation that has become very important and widely used in natural language processing applications. For instance, pre-trained word embeddings have played an important role in achieving impressive performance with deep learning models on challenging natural language understanding problems.
We introduce here French word vectors of dimension 300, trained using Word2Vec CBOW with a window size of 20 on a 33GB corpus of French raw text that we crawled from the French web and pre-processed with techniques such as French language detection, deduplication and tokenization.
The demo offers a few ready-made examples for your initial tests. To see the results, enter your input and press submit, or click on one of the ready-made examples to fill in the form. If you want to use your own words, they must be in the vocabulary of the word vectors.
Demo: French Word Embeddings
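A typical query against such word vectors is a nearest-neighbour lookup by cosine similarity. The sketch below shows the idea in plain Python; it is only an illustration, not the demo's actual implementation. The helper names `cosine_similarity` and `most_similar` and the three-dimensional toy vectors are assumptions (the real vectors have 300 dimensions), and the `KeyError` mirrors the requirement that query words belong to the vocabulary.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(
    word: str, vectors: Dict[str, List[float]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Return the top_k vocabulary words closest to `word` (hypothetical helper)."""
    if word not in vectors:
        # Out-of-vocabulary words have no vector to compare against.
        raise KeyError(f"'{word}' is not in the vocabulary")
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "roi": [0.9, 0.1, 0.0],
    "reine": [0.8, 0.2, 0.1],
    "pomme": [0.0, 0.1, 0.9],
}
neighbours = most_similar("roi", toy_vectors, top_k=1)
```

With these toy values, "reine" ranks as the closest word to "roi", while "pomme" points in a nearly orthogonal direction.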

French is one of the most important languages worldwide, with a long history and contributions to world culture and civilization.
In the last decade, unprecedented amounts of digital text online, together with deep learning methods, have changed NLP and text mining. New representations and methods enable tasks beyond the traditional supervised ones (e.g. document classification, opinion mining), such as summarization, language generation and author/style identification.
By training on vast amounts of data, new representations are learned: static and contextual embeddings that represent linguistic elements in a vector space. These embeddings are very important and widely used in different natural language processing applications.

In this portal we present, and make available to the research and industrial community, large-scale, high-quality French linguistic resources for different tasks. They result from training on very large quantities of online text collected from the Web (by our group as well).

All the previously described models are available on the portal.

Portal: French Linguistics Portal


  • Chatzianastasis, Michail, et al. “Neural Architecture Search with Multimodal Fusion Methods for Diagnosing Dementia.” arXiv preprint arXiv:2302.05894 (2023).
  • Chatzianastasis, Michail, Michalis Vazirgiannis, and Zijun Zhang. “Explainable Multilayer Graph Neural Network for Cancer Gene Prediction.” arXiv preprint arXiv:2301.08831 (2023).
  • Salha-Galvan, Guillaume, et al. “New Frontiers in Graph Autoencoders: Joint Community Detection and Link Prediction.” arXiv preprint arXiv:2211.08972 (2022).
  • Lutzeyer, Johannes F., Changmin Wu, and Michalis Vazirgiannis. “Sparsifying the update step in graph neural networks.” Topological, Algebraic and Geometric Learning Workshops 2022. PMLR, 2022.
  • Vela, Ariel R. Ramos, et al. “Improving Graph Neural Networks at Scale: Combining Approximate PageRank and CoreRank.” arXiv preprint arXiv:2211.04248 (2022).
  • Nikolentzos, Giannis, Michail Chatzianastasis, and Michalis Vazirgiannis. “Weisfeiler and Leman go Hyperbolic: Learning Distance Preserving Node Representations.” arXiv preprint arXiv:2211.02501 (2022).
  • Guo, Yanzhu, et al. “Questioning the Validity of Summarization Datasets and Improving Their Factual Consistency.” arXiv preprint arXiv:2210.17378 (2022).
  • Abdine, Hadi, et al. “Word Sense Induction with Hierarchical Clustering and Mutual Information Maximization.” arXiv preprint arXiv:2210.05422 (2022).
  • Kosma, Chrysoula, et al. “Time Series Forecasting Models Copy the Past: How to Mitigate.” Artificial Neural Networks and Machine Learning–ICANN 2022: 31st International Conference on Artificial Neural Networks, Bristol, UK, September 6–9, 2022, Proceedings, Part I. Cham: Springer International Publishing, 2022.
  • Chatzianastasis, Michail, Giannis Nikolentzos, and Michalis Vazirgiannis. “Mass Enhanced Node Embeddings for Drug Repurposing.” Proceedings of the 12th Hellenic Conference on Artificial Intelligence. 2022.
  • Salha-Galvan, Guillaume, et al. “Modularity-aware graph autoencoders for joint community detection and link prediction.” Neural Networks 153 (2022): 474-495.
  • Rennard, Virgile, et al. “Abstractive meeting summarization: A survey.” arXiv preprint arXiv:2208.04163 (2022).
  • Nikolentzos, Giannis, George Dasoulas, and Michalis Vazirgiannis. “Permute me softly: learning soft permutations for graph representations.” IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
  • Seddik, Mohamed El Amine, et al. “Node feature kernels increase graph convolutional network robustness.” International Conference on Artificial Intelligence and Statistics. PMLR, 2022.
  • Eddine, Moussa Kamal, et al. “FrugalScore: Learning cheaper, lighter and faster evaluation metrics for automatic text generation.” Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
  • Abdine, Hadi, et al. “Political Communities on Twitter: Case Study of the 2022 French Presidential Election.” Proceedings of the LREC 2022 workshop on Natural Language Processing for Political Sciences. 2022.
  • Chatzianastasis, Michail, et al. “Graph Ordering Attention Networks.” arXiv preprint arXiv:2204.05351 (2022).
  • Panagopoulos, George, and Michalis Vazirgiannis. “Exploratory Analysis of Academic Collaborations between French and US.” arXiv preprint arXiv:2201.01346 (2022).
  • Nikolentzos, Giannis, et al. “Synthetic electronic health records generated with variational graph autoencoders.” medRxiv (2022): 2022-10.
  • Qabel, Aymen, et al. “Structure-Aware Antibiotic Resistance Classification Using Graph Neural Networks.” bioRxiv (2022): 2022-10.
  • Xu, Nancy, et al. “Image Keypoint Matching Using Graph Neural Networks.” Complex Networks & Their Applications X: Volume 2, Proceedings of the Tenth International Conference on Complex Networks and Their Applications COMPLEX NETWORKS 2021 10. Springer International Publishing, 2022.
  • Nikolentzos, Giannis, Giannis Siglidis, and Michalis Vazirgiannis. “Graph kernels: A survey.” Journal of Artificial Intelligence Research 72 (2021): 943-1027.
  • Douka, Stella, et al. “JuriBERT: A Masked-Language Model Adaptation for French Legal Text.” Proceedings of the Natural Legal Language Processing Workshop 2021. 2021.
  • Salha, Guillaume, et al. “Fastgae: Scalable graph autoencoders with stochastic subgraph decoding.” Neural Networks 142 (2021): 1-19.
  • Guo, Yanzhu, et al. “BERTweetFR: Domain Adaptation of Pre-Trained Language Models for French Tweets.” Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021). 2021.
  • Salha-Galvan, Guillaume, et al. “Cold start similar artists ranking with gravity-inspired graph autoencoders.” Proceedings of the 15th ACM Conference on Recommender Systems. 2021.
  • Panagopoulos, George, et al. “Learning Graph Representations for Influence Maximization.” arXiv preprint arXiv:2108.04623 (2021).
  • Panagopoulos, George, Giannis Nikolentzos, and Michalis Vazirgiannis. “Transfer graph neural networks for pandemic forecasting.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 6. 2021.
  • Abdine, Hadi, et al. “Evaluation Of Word Embeddings From Large-Scale French Web Content.” arXiv preprint arXiv:2105.01990 (2021).
  • Guo, Yanzhu, Christos Xypolopoulos, and Michalis Vazirgiannis. “How COVID-19 is Changing Our Language: Detecting Semantic Shift in Twitter Word Embeddings.” Conférence Nationale en Intelligence Artificielle 2022 (CNIA 2022). 2022.
  • Dasoulas, George, Johannes F. Lutzeyer, and Michalis Vazirgiannis. “Learning Parametrised Graph Shift Operators.” International Conference on Learning Representations.
  • Amariles, David Restrepo, et al. “Computational Indicators in the Legal Profession: Can Artificial Intelligence Measure Lawyers’ Performance?.” U. Ill. JL Tech. & Pol’y (2021): 313.
  • Chatzianastasis, Michail, et al. “Graph-based Neural Architecture Search with Operation Embeddings.” 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). IEEE Computer Society, 2021.
  • Nikolentzos, Giannis, George Panagopoulos, and Michalis Vazirgiannis. “An Empirical Study of the Expressiveness of Graph Kernels and Graph Neural Networks.” Artificial Neural Networks and Machine Learning–ICANN 2021: 30th International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 14–17, 2021, Proceedings, Part III 30. Springer International Publishing, 2021.
  • Nikolentzos, Giannis, et al. “Can Author Collaboration Reveal Impact? The Case of h-index.” Predicting the Dynamics of Research Impact (2021): 177-194.
  • Nikolentzos, Giannis, et al. “Image Classification Using Graph-Based Representations and Graph Neural Networks.” Complex Networks & Their Applications IX: Volume 2, Proceedings of the Ninth International Conference on Complex Networks and Their Applications COMPLEX NETWORKS 2020. Springer International Publishing, 2021.
  • Salha, Guillaume, Romain Hennequin, and Michalis Vazirgiannis. “Simple and effective graph autoencoders with one-hop linear models.” Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14–18, 2020, Proceedings, Part I. Springer International Publishing, 2021.
  • Panagopoulos, George, Fragkiskos D. Malliaros, and Michalis Vazirgiannis. “Multi-task learning for influence estimation and maximization.” IEEE Transactions on Knowledge and Data Engineering 34.9 (2020): 4398-4409.
  • Eddine, Moussa Kamal, Antoine Tixier, and Michalis Vazirgiannis. “BARThez: a Skilled Pretrained French Sequence-to-Sequence Model.” Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.
  • Nikolentzos, Giannis, George Dasoulas, and Michalis Vazirgiannis. “k-hop graph neural networks.” Neural Networks 130 (2020): 195-205.
  • Qiu, Yang, et al. “Predicting conversions in display advertising based on URL embeddings.” arXiv preprint arXiv:2008.12003 (2020).
  • Skianis, Konstantinos, et al. “Rep the set: Neural networks for learning set representations.” International conference on artificial intelligence and statistics. PMLR, 2020.
  • Limnios, Stratis, et al. “Hcore-init: Neural network initialization based on graph degeneracy.” 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021.
  • Shang, Guokan, et al. “Speaker-change Aware CRF for Dialogue Act Classification.” Proceedings of the 28th International Conference on Computational Linguistics. 2020.
  • Nikolentzos, Giannis, Antoine Tixier, and Michalis Vazirgiannis. “Message passing attention networks for document understanding.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 05. 2020.
  • Dasoulas, George, Giannis Nikolentzos, Kevin Scaman, Aladin Virmaux, and Michalis Vazirgiannis. “Ego-based Entropy Measures for Structural Representations.” arXiv preprint arXiv:2003.00553 (2020).
  • Xypolopoulos, Christos, Antoine J-P. Tixier, and Michalis Vazirgiannis. “Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings.” Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, Long Papers (2020).
  • Boniol, Paul, George Panagopoulos, Christos Xypolopoulos, Rajaa El Hamdani, David Restrepo Amariles, and Michalis Vazirgiannis. “Performance in the Courtroom: Automated Processing and Visualization of Appeal Court Decisions in France.” arXiv preprint arXiv:2006.06251 (2020).


mvazirg ~ lix.polytechnique.fr