VISTA

Research Axes

Our team develops new methods for the creation of Visual and Virtual Worlds, with a specific focus on Storytelling for Animated Content. Our methods span from the fully automatic understanding of videos to the interactive creation of populated 3D virtual worlds. To this end, we propose methods that improve (i) the Analysis of Visual Content, (ii) Shape and Motion Representation, and (iii) the Creation of Visual Worlds.

First, we propose fully automatic AI-based Analysis of 2D Videos and 3D Animated Content that leverages deep-learning techniques, with a specific focus on temporal and multimodal input data. In particular, we develop methods for automatic human recognition, pose estimation, and behavior understanding. We also propose lightweight learning methods based on statistical approaches to extract spatial relations between shapes from a single input.

Second, we develop Interactive Models to efficiently represent Shape and Motion. We specialize in integrating spatio-temporal constraints into real-time reactive virtual models for game-like applications, either through explicit procedural models or by discovering them via Reinforcement Learning. We also propose alternative, volume-based representations for shape modeling that rely on implicit surfaces. These models are suited to complex shape synthesis and advanced interactive behaviors (precise collision, deformation). Finally, we develop layered and coupled models of different spatial and temporal natures, adapted to the efficient simulation of large, multi-scale natural scenes.

Third, our models and analyses are aimed at the Creation and Authoring of Visual and Virtual Worlds. To this end, we propose Expressive Creation Methodologies relying on Sketching or Sculpting Gestures, as well as Sound and Multimodal Systems. These steps are supported by scene analysis, which enables suggestion systems and can go as far as assisting the narrative design of the scene. We further propose transfer methodologies between geometry, animation, and style, complementing generative models, in order to create lively and populated worlds with sufficient variety, or to explore the impact of parameters on a simulated world.

1. Analysis and Understanding of Visual Content
- Deep CNN, Human-centric video learning
- Automatic & multimodal understanding
- Light learning, spatial representation
2. Interactive Models for Shape and Motion
- Alternative representation (Field based, Implicit surfaces, ...)
- Spatio-temporal constraints
- Visual simulation, Layered models
- Behavioral simulation, Reinforcement learning
3. Creating and Authoring Visual Worlds
- Expressive creation: Sketching or Sculpting gestures, Sound, Multimodal system
- A-priori/learned knowledge constraints
- Narrative design, suggestion system
- Generation and style transfer, Visual transformers

Keywords - Computer Graphics, Computer Vision with Deep Learning, Generative AI, Animated Content, Shape and Motion, Interactive Creation, Visual Simulation, AI for Visual Computing.

Applications - Movies, Video Games, Animation Cinema, Natural Science, Medical Imaging, Archeology, Art & Sciences, Design, Fashion, CAD.

Specialized Keywords

- Graphics: Sketch-based Modeling, Virtual Sculpting, Character Animation, Natural Phenomena, Real-Time, Implicit Surface, Hybrid and Procedural Models.
- Vision: Human-centric Video Understanding, Cinematography Analysis, Visual Transformers, NeRF, Interior Scenes.
- Learning: Multi-Modal Learning, Generative Models, GANs, Diffusion Models, Reinforcement Learning, Lightweight Learning.

Team Expertise

The distinctive aspect of our team's methodology is to propose a global Visual Computing approach that couples Automatic Vision and Interactive Graphics methodologies. This allows us to tackle complex open scientific problems that mix the analysis of 2D content and the synthesis of 3D content. For instance, we develop generative approaches ranging from automatic learning from data (GANs, diffusion models, etc.) and reinforcement learning to alternative lightweight and efficient models relying on a-priori knowledge and user-centric design.

We are researchers with mixed expertise and backgrounds in Computer Graphics and Computer Vision. We jointly develop AI-based approaches and efficient representations to improve 2D video analysis and the generation of 3D animated virtual worlds.

At the LIX level, our specialities are:

- Video Analysis and Understanding
- Human Representation and Virtual Character Animation
- Interactive Creation
- Interactive Simulation of Multi-Scale Natural Scenes.

VISTA Recent Events

2025/06/01

Event: Xi Wang, new Faculty member at VISTA

Xi Wang, a Generative AI expert, is a new Tenure-Track Assistant Professor at VISTA. Xi obtained his PhD at the University of Rennes in 2022 and has been a postdoctoral researcher in Computational Cinematography at VISTA since 2023. He is now joining us in a permanent faculty position to develop his research on generative models, 3D vision, and computational cinematography.
We are thrilled to welcome Xi!

2025/05/22

Event: Vicky Kalogeiton, VISTA team leader

Vicky Kalogeiton is now the team leader of VISTA.

2025/05/20

Event: Vicky Kalogeiton - CVPR 2027 Program Chair

We are proud to announce that Vicky Kalogeiton will be Program Chair of the CVPR 2027 conference, the top-tier conference in Computer Vision and among the world's largest in AI, attracting more than 15,000 paper submissions every year.

2025/05/18

Award: Honorable Best Paper Award at Eurographics

Congratulations to Théo Cheynel for receiving an Honorable Best Paper Award at the Eurographics conference for the work "ReConForM: Real-time Contact-aware Motion Retargeting for more Diverse Character Morphologies", developed in collaboration with Kinetix (Thomas Rossi and Baptiste Bello-Gurlet), Damien Rohmer, and Marie-Paule Cani.

2025/05/17

Award: Eurographics Gold Medal awarded to Marie-Paule Cani

In recognition of her outstanding research contributions in the Eurographics domains, Marie-Paule Cani was awarded the Eurographics Gold Medal at Eurographics 2025.

VISTA Seminars

2025/11/27 (11am)

Finding needles in a haystack

Giorgos Tolias

This talk focuses on image-to-image retrieval at the finest level of granularity, i.e. instance-level, where the objective is to identify specific objects rather than broad categories. I will introduce ILIAS (CVPR 2025), a new large-scale benchmark designed to expose open challenges in this domain, such as retrieving small or heavily occluded objects within cluttered scenes. Building on these insights, I will present three distinct approaches that rely on local representations, each with different characteristics: (1) transformer-based architectures optimized for instance-level retrieval (AMES, ECCV 2024), (2) a lightweight, interpretable model with strong inductive biases for robust domain generalization (ELVIS, ongoing work), and (3) a training-free strategy to index multimodal language models for image similarity estimation (ongoing work). These methods reflect diverse design philosophies and trade-offs. The discussion will emphasize large-scale retrieval and, in particular, the critical balance between performance and memory efficiency.
Bio: Giorgos Tolias is an Associate Professor at CTU in Prague and leads a research team within the Visual Recognition Group (VRG). He received his PhD in 2013 from NTUA, Greece, under the supervision of Yannis Avrithis and Stefanos Kollias. From 2014 to 2015, he was a postdoctoral researcher at Inria Rennes, France, working with Hervé Jégou, and later joined CTU in Prague for a postdoc with Ondřej Chum. He received a Best Science Paper Award - Honorable Mention at BMVC 2017 and a Junior Star Starting Grant (2021-2025) from the Czech Science Foundation. His research focuses on computer vision, with emphasis on visual representation learning and instance-level recognition.

2025/10/16 (2pm)

PhD Defense: Design of tangled, branching, and growing organic 3D structures from a sketch

Tara Butler

Freehand sketching offers an intuitive and expressive means of visualizing, exploring, and communicating complex representations and ideas. This is particularly valuable in scientific fields like biology, where forms are often intricate, dynamic, and span multiple spatial scales. While freehand drawing is natural, flexible, and conceptually rich, most digital 3D modeling tools remain rigid, technically demanding, and require users to work through low-level geometric abstractions. This disconnect limits the ability to visually reason about biological systems or to effectively illustrate them in three dimensions. This thesis explores how sketching can evolve into a perceptually grounded, interactive medium for authoring 3D shapes, structures, and behaviors. Building on the natural use of lines, hatching, gestures, and layers, this work introduces new methods that reinterpret sketching as an expressive input for designing organic 3D forms, with a particular focus on applications in plant biology. To this end, we present four tools within a sketch-based modeling system built on skeleton-driven implicit surfaces. We begin with the perception of static shape, investigating how hatching in drawings conveys depth, a technique long used in scientific illustration. Through a user study, we extract rules linking hatch curvature, direction, and frequency to perceived depth variation, and use them to develop a computational model that synthesizes 3D organic shapes from a single 2D sketch containing contours and hatching. Second, we introduce an interactive interface inspired by geological core sampling, which allows users to control depth layering and spatial interactions in complex scenes such as root networks, tissues, or vascular systems. By visualizing and editing local depth arrangements using a core sample widget, users can intuitively reorder, nest, or deform overlapping 3D sketch-based surfaces. 
Next, we propose a system for turning 2D sketches into growing 3D branching structures by encoding topology as a Directed Acyclic Graph (DAG) and learning geometric relationships using a Gaussian Mixture Model (GMM). This compact representation supports the generation of diverse 3D branching structures that retain the style and intent of the original sketch while introducing biologically inspired variation. Finally, we address dynamic behavior through a kinematic animation system that interprets directional arrow gestures as flow paths, guided by the local geometry of implicit surfaces. This enables dynamic illustration of internal biological processes such as transport or motion. In conclusion, this work aims to extend sketching into a digital medium that remains as intuitive as pen and paper, yet is capable of illustrating 3D form, capturing growth, and representing dynamic, evolving phenomena. Our goal is to support scientists and educators in exploring ideas, explaining processes, and communicating clearly.

2025/07/03 (11am)

Latent Representations for Better Generative Image Modeling

Spyros Gidaris (Valeo.ai)

This talk explores how latent representations shape modern generative models. While latent spaces (like those in VQ-VAE and VQ-GAN) are central to today's generative architectures (from diffusion models to autoregressive approaches), their structure and properties are often overlooked. I will present three works that refine or leverage latent representations for better generative modeling. First, EQ-VAE addresses a key limitation in existing autoencoders used in latent-based generative models: their latent spaces lack equivariance to simple semantic-preserving transformations like rotation or scaling, making generation harder. We introduce a simple regularization method that enforces equivariance, reducing the complexity of the latent space without degrading reconstruction quality. This improves multiple state-of-the-art models (DiT, SiT, MaskGIT) and speeds up training. Next, ReDi integrates pretrained semantic features into latent diffusion models. Instead of just generating low-level image latents, we jointly model them with high-level semantic features (e.g., from DINOv2). This unified approach boosts image quality and training efficiency while enabling "Representation Guidance", a simple way to steer generation using learned semantics. Finally, DINO-Foresight tackles video prediction. We predict future frames in the semantic feature space of pretrained vision foundation models (e.g., DINOv2), avoiding pixel-level inefficiencies. This makes forecasting simpler, faster, and more robust, enabling flexible adaptation to downstream tasks. Together, these works highlight how better latent representations can simplify, accelerate, and improve generative modeling.

2025/07/03 (10am)

Expressive representations for digital art & computer-aided manufacturing

Emilie Yu (UCSB)

Digital representations of 3D objects allow people to create both digital artworks destined to be viewed on a screen and physical objects manufactured through computer-aided design and manufacturing. Designing well-suited digital representations is thus central to letting humans extend the range of what they can create through computer software and machines. In this talk, I will present four case studies in which leveraging specific digital representations and associated algorithms allowed us to design software that supports complex authoring workflows: by decomposing animation authoring into 2D and 3D components, we support the insertion of animated doodles into captured footage; by introducing a new primitive in VR painting, we can achieve more fine-grained color editing; by parameterizing patterns for crochet granny square garments, we enable crocheters to re-use material across garments; and by devising new primitives to represent machine motion, we allow for fine-grained control over fabrication machines. Throughout the presentation, I will emphasize high-level design decisions and practical research methods that guided us in developing adequate digital representations.

2025/05/28 (2pm)

Seeing Beyond What You Have: Integrated Intelligence Through Multisensor Systems

Zongwei Wu (PostDoc, University of Wurzburg)

In this talk, I will present our recent work on multisensor perception systems. I will begin by discussing individual sensors, such as depth and event-based sensors. Then, I will move on to our efforts in developing a unified approach with a particular focus on emergent alignment and robustness to missing modalities. Finally, I will highlight the potential of such a system and outline future directions.
Bio: Zongwei Wu is a PostDoc Researcher and junior research group leader at the Computer Vision Lab, University of Wurzburg, Germany. He received his diplôme d'ingénieur from the University of Technology of Compiègne in 2019 and earned a Ph.D. from Vibot EMR CNRS 6000, University of Burgundy, France in 2022. He was also a visiting scholar at CVL, ETH Zurich. His research focuses on multimodal models and multi-task reasoning for machine vision. He is a main organizer of the NTIRE workshop at CVPR 2024-2025 and has been acknowledged as an outstanding Associate Editor for IEEE RA-L.
