
VISTA

Visual Worlds: Temporal Analysis, Animation and Authoring
VISTA - Research team in Computer Graphics & Vision at LIX, Ecole Polytechnique/CNRS, Institut Polytechnique de Paris
Objectives: Analyse & Generate Animated Visual Scenes and Interactive 3D Virtual Worlds
Scientific Approach: Generative AI, Reinforcement Learning, Expressive Modeling & Authoring, Real-Time Simulation, Geometric Constraints, Field-Based Representations.
Applications: Entertainment, Design, Natural Sciences.

Research Axes

Our team develops new methods for the creation of visual and virtual worlds, with a specific focus on storytelling for animated content. Our methods span from the fully automatic understanding of videos to the interactive creation of populated 3D virtual worlds. To this end, we propose methods improving (i) the Analysis of visual content, (ii) Shape and Motion representation, and (iii) the Creation of Visual Worlds.

First, we propose fully automatic AI-based analysis of 2D videos and 3D animated content, leveraging deep-learning techniques with a specific focus on time and multimodal input data. We specifically develop methods for automatic human recognition, pose estimation, and behavior understanding. We also propose lightweight learning based on statistical approaches to extract spatial relations between shapes from a single input, as illustrated by the sketch below.
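The following toy example illustrates the idea of a lightweight, statistical spatial relation. It is a minimal sketch written for exposition only, not code from the team: it expresses the placement of one 2D shape in the principal frame of another, a descriptor that could be reused to position new instances from a single observed example.

```python
import numpy as np

def shape_frame(points):
    """Local frame of a 2D point cloud: centroid, principal axes (PCA), extent."""
    center = points.mean(axis=0)
    _, s, vt = np.linalg.svd(points - center, full_matrices=False)
    scale = s[0] / np.sqrt(len(points))  # dominant extent of the shape
    return center, vt, scale

def relative_placement(shape_a, shape_b):
    """Offset of shape B's centroid, expressed in A's normalized principal frame."""
    center_a, axes_a, scale_a = shape_frame(shape_a)
    center_b, _, _ = shape_frame(shape_b)
    return axes_a @ (center_b - center_a) / scale_a

# A single example is enough to extract the relation and re-apply it elsewhere.
a = np.random.rand(100, 2)           # reference shape A
b = np.random.rand(80, 2) + [2, 0]   # shape B observed to the right of A
rel = relative_placement(a, b)       # normalized offset of B in A's frame
```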

Second, we develop Interactive Models to efficiently represent Shape and Motion. We specialize in integrating spatio-temporal constraints into real-time, reactive virtual models for game-like applications, using either explicit procedural models or models discovered via Reinforcement Learning. We also propose alternative, volume-based representations for shape modeling, relying on implicit surfaces; the sketch below illustrates the principle. These models are suited to complex shape synthesis and advanced interactive behaviors (precise collision, deformation). Finally, we develop layered and coupled models of different spatial and temporal natures, adapted to efficiently simulate large, multi-scale natural scenes.
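As a concrete illustration of the implicit, field-based representations mentioned above, here is a minimal sketch (ours, for exposition only, not VISTA code): shapes are signed distance fields, two fields are blended with the classical polynomial smooth-minimum, and inside/outside collision queries come for free from the sign of the field.

```python
import numpy as np

def sphere_sdf(p, center, radius):
    """Signed distance from query points p (N, 3) to a sphere."""
    return np.linalg.norm(p - center, axis=-1) - radius

def smooth_union(d1, d2, k=0.25):
    """Polynomial smooth-min: blends two distance fields over a width k."""
    h = np.clip(0.5 + 0.5 * (d2 - d1) / k, 0.0, 1.0)
    return d2 + (d1 - d2) * h - k * h * (1.0 - h)

p = np.random.rand(1000, 3)  # arbitrary query points in the unit cube
d = smooth_union(sphere_sdf(p, np.array([0.3, 0.5, 0.5]), 0.3),
                 sphere_sdf(p, np.array([0.7, 0.5, 0.5]), 0.3))
inside = d < 0.0  # sign of the field gives a precise collision test
```

The surface itself is the zero level set of `d`, which a marching-cubes pass can triangulate for display.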

Third, our models and analyses are aimed at the Creation and Authoring of Visual and Virtual Worlds. To this end, we propose Expressive Creation methodologies relying on sketching or sculpting gestures, as well as sound and multimodal systems. These steps are supported by scene analysis, enabling suggestion systems and even assistance with the narrative design of the scene. We further propose transfer methodologies between geometry, animation, and style, in complement to generative models, in order to create lively, populated worlds with sufficient variety, or to explore the impact of parameters on a simulated world; a toy example of such a transfer is sketched below.
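To make the idea of transfer between animation and style concrete, here is a deliberately simple sketch (a generic statistics-matching transfer, not one of the team's published methods): the "style" of a reference motion curve, reduced here to its amplitude and offset, is imposed on the content of another curve.

```python
import numpy as np

def transfer_style(content, style):
    """Re-normalize the content signal to match the style signal's statistics."""
    normalized = (content - content.mean()) / (content.std() + 1e-8)
    return normalized * style.std() + style.mean()

t = np.linspace(0.0, 2.0 * np.pi, 200)
walk = np.sin(t)                  # content: a neutral joint-angle curve
energetic = 2.5 * np.sin(3 * t)   # style reference with larger amplitude
stylized = transfer_style(walk, energetic)  # the walk, amplified to match
```

Applying the same recipe per joint, with richer statistics than mean and variance, is one simple way to introduce variety into the motions of a crowd of characters.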

  • 1. Analysis and Understanding of Visual Content
    Deep CNNs, Human-centric video learning
    Automatic & multimodal understanding
    Lightweight learning, spatial representations
  • 2. Interactive Models for Shape and Motion
    Alternative representations (Field-based, Implicit surfaces, ...)
    Spatio-temporal constraints
    Visual simulation, Layered models
    Behavioral simulation, Reinforcement learning
  • 3. Creating and Authoring Visual Worlds
    Expressive creation: Sketching or Sculpting gestures, Sound, Multimodal systems
    A-priori/learned knowledge constraints
    Narrative design, suggestion systems
    Generation and style transfer, Vision Transformers
Keywords - Computer Graphics, Computer Vision with Deep Learning, Generative AI, Animated Content, Shape and Motion, Interactive Creation, Visual Simulation, AI for Visual Computing.
Applications - Movies, Video Games, Animation Cinema, Natural Science, Medical Imaging, Archeology, Art & Sciences, Design, Fashion, CAD.
Specialized Keywords
- Graphics: Sketch-based Modeling, Virtual Sculpting, Character Animation, Natural Phenomena, Real-Time, Implicit Surfaces, Hybrid and Procedural Models.
- Vision: Human-centric Video Understanding, Cinematography Analysis, Vision Transformers, NeRF, Interior Scenes.
- Learning: Multi-Modal Learning, Generative Models, GANs, Diffusion Models, Reinforcement Learning, Lightweight Learning.


Team Expertise

The specificity of our team's methodology is to propose a global Visual Computing approach coupling automatic Vision and interactive Graphics methodologies. This allows us to tackle complex open scientific problems mixing the analysis of 2D content with the synthesis of 3D content. For instance, we develop generative approaches ranging from automatic learning from data (GANs, diffusion models, etc.) and reinforcement learning to alternative lightweight and efficient models relying on a-priori knowledge and user-centric design.

We are researchers with mixed expertise and backgrounds in Computer Graphics and Computer Vision. We jointly develop AI-based approaches and efficient representations to improve 2D video analysis and the generation of 3D animated virtual worlds.

At the LIX level, our specialties are:
- Video Analysis and Understanding
- Human Representation and Virtual Character Animation
- Interactive Creation
- Interactive Simulation of Multi-Scale Natural Scenes

VISTA Recent Events

2025/02/06
Event: AI Summit: AI, Science and Society
Vicky Kalogeiton will be part of the Symposium on Frontiers in Generative AI at the AI Summit (AI, Science and Society) taking place at Ecole Polytechnique.
2024/12/02
Event: Vicky Kalogeiton's HDR
Congratulations to Vicky Kalogeiton, who successfully defended her HDR on "Story-level multimodal generative AI: from understanding to generating visual data using multiple modalities" before a jury composed of Raoul de Charette, Dimitris Samaras, Josef Sivic, Matthieu Cord, Juergen Gall, Vincent Lepetit, Elisa Ricci, and Jakob Verbeek.
2024/11/23
Award: 3x Awards at the Motion, Interaction, and Games (MIG) conference
We received three awards at the ACM MIG conference: a Best Paper Honorable Mention for the article "TwisterForge: controllable and efficient animation of virtual tornadoes", first-authored by Jiong Chen; a Best Short Paper Award for the article "Expressive Animation Retiming from Impulse-Based Gestures", first-authored by Marie Bienvenu; and a Best Poster Award for "Reactive Gaze during Locomotion in Natural Environments", first-authored by Julia Melgare.
2024/10/30
Award: 3x Awards at the Journées Françaises d'Informatique Graphique
We received three awards at jFIG, the annual French Computer Graphics conference: an Honorable Mention for the article "Optimizing Multi-Agent Herd Model from a Single Video", first-authored by Xianjin Gong; first prize at the ShaderToy contest, awarded to Tara Butler; and the audience prize for the Blender Node contest, awarded to Théo Cheynel.
2024/08/23
Award: 3x Awards at the Symposium on Computer Animation
We are honoured to have received three awards at the SCA conference: a Best Paper Honourable Mention, a Best Paper Presentation Honourable Mention, and a Best Poster Award, with recipients including Julia Melgare, Marie-Paule Cani, and Damien Rohmer, together with our collaborators at Inria Sophia Antipolis, PUCRS, Purdue University, Clemson University, and Roblox.

VISTA Seminars

2025/03/31 (5pm)
Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment
Minh-Quan Le (PhD, Stony Brook University)
While diffusion models are powerful in generating high-quality, diverse synthetic data for object-centric tasks, existing methods struggle with scene-aware tasks such as Visual Question Answering (VQA) and Human-Object Interaction (HOI) Reasoning, where it is critical to preserve scene attributes in generated images consistent with a multimodal context, i.e. a reference image with accompanying text guidance query. To address this, we introduce Hummingbird, the first diffusion-based image generator which, given a multimodal context, generates highly diverse images w.r.t. the reference image while ensuring high fidelity by accurately preserving scene attributes, such as object interactions and spatial relationships from the text guidance. Hummingbird employs a novel Multimodal Context Evaluator that simultaneously optimizes our formulated Global Semantic and Fine-grained Consistency Rewards to ensure generated images preserve the scene attributes of reference images in relation to the text guidance while maintaining diversity. As the first model to address the task of maintaining both diversity and fidelity given a multimodal context, we introduce a new benchmark formulation incorporating MME Perception and Bongard HOI datasets. Benchmark experiments show Hummingbird outperforms all existing methods by achieving superior fidelity while maintaining diversity, validating Hummingbird's potential as a robust multimodal context-aligned image generator in complex visual tasks.
2025/03/31 (11am)
Discussion on Generative AI
Alexei/Alyosha Efros (UC Berkeley)
2024/12/19 (2pm)
A General Framework for Text Line Recognition
Raphael Baena (Ecole des Ponts)
In this seminar, I will present a quick overview of my Ph.D. research on transfer learning and generalization, followed by a detailed discussion of our recent NeurIPS paper on General Detection-based Text Line Recognition (DTLR). DTLR is a novel approach for recognizing text lines, whether printed or handwritten, across diverse scripts, including Latin, Chinese, and ciphered characters. Most handwritten text recognition (HTR) methods have focused on autoregressive decoding, predicting characters one after the other. Our method shows strong results across various scripts, even those typically addressed by specialized techniques. In particular, we achieve state-of-the-art performance for Chinese script recognition on the CASIA v2 dataset and for cipher recognition on the Borg and Copiale datasets. Finally, I will highlight several collaborative applications and extensions of this work with historians.
2024/12/13 (2pm)
Enhancing Human-Centred Visual Learning: Innovations in Vision Algorithms for Improved Human Understanding
Hyung Jin Chang (University of Birmingham)
The progress of artificial intelligence fundamentally depends on humans, both as inventors and beneficiaries. My research on human-centred visual learning focuses on developing vision-based algorithms that prioritise the usability and usefulness of AI systems by addressing human needs and requirements based on visual cues. A crucial aspect of this work involves understanding human body pose, hand pose, eye gaze, and object interaction, as they provide valuable insights into human actions and behaviours. During this talk, I will discuss recent studies conducted by my group, particularly our latest advancements in combining vision-based methodologies with language models. These include integrating human body pose, motion, hand movements, and gaze estimation with language models to enhance context-aware understanding and interaction. This multimodal approach significantly improves the interpretability and adaptability of AI systems. I will also highlight other advancements in multimodal data integration, such as audio and text-based hand-object pose and shape estimation, as well as face+eye gaze image synthesis. These innovative approaches not only enhance the accuracy and robustness of our algorithms but also open new avenues for intuitive human-AI interaction. Furthermore, I will explore the latest research trends in computer vision and demonstrate how these methodologies have been, and can be, applied to algorithm development. Additionally, we will reflect on potential future directions for this field. The applications leveraging these human-centred vision methodologies are poised to revolutionise the way AI understands and interacts with humans.
2024/12/04 (2pm)
Visit ScienceXGames
David Louapre (Ubisoft)
Visit of the VISTA team

Application domain

We have been developing our recent contributions in the following representative domains:
- Human Recognition in Videos and 3D Virtual Character Animation
- Cloth and Garment Analysis and Synthesis
- Natural Environment Simulation (terrain, volcanoes, flora and fauna)
- Medical Imaging Analysis and Biological Shape Design
Our research is highly application-driven: we aim to provide scientific support that enhances creativity, with applications in entertainment (movies and games), design, and art in general. Our interactive visual representations can also serve general-public experiences that help in understanding time-related phenomena (e.g., terrain evolution, the impact of climate change), or expert audiences via serious games. Finally, we provide dedicated analysis, interactive models, and visualization for other scientific disciplines such as medical imaging, biology, and archeology, where our models can support analysis or serve as a virtual test bench.
- Improving Creative and Entertainment Industries
Video games/animation, movies, VFX, creative arts, design
- Interactive Representations and Experiences for the General Public or Experts
Museography, archeology, serious games
- Efficient Virtual Test Benches for Natural Sciences
Medicine, biology, climatology, natural environments


We have ongoing (or recent) research collaborations with the following companies:
(partner company logos)

Research environment

We are located on the campus of Institut Polytechnique de Paris in the Alan Turing building [Contact].
We are co-located with, and work in close collaboration with, the GeomeriX team at LIX on geometry analysis and processing.
At the LIX level, we are part of the Modeling, Simulation and Learning Pole.
At the IP Paris level, we are part of GeoVISTA, which groups the Graphics and Vision teams on the Plateau de Saclay.

Links

Publications
Software & Code
Job Offers
Funded Projects