
VISTA Seminars

VISTA - Visual Worlds: Temporal Analysis, Animation and Authoring
2025/03/31 (5pm)
Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment
Minh-Quan Le (PhD, Stony Brook University)
While diffusion models are powerful in generating high-quality, diverse synthetic data for object-centric tasks, existing methods struggle with scene-aware tasks such as Visual Question Answering (VQA) and Human-Object Interaction (HOI) Reasoning, where it is critical to keep the scene attributes of generated images consistent with a multimodal context, i.e., a reference image with an accompanying text guidance query. To address this, we introduce Hummingbird, the first diffusion-based image generator which, given a multimodal context, generates highly diverse images w.r.t. the reference image while ensuring high fidelity by accurately preserving scene attributes, such as object interactions and spatial relationships, from the text guidance. Hummingbird employs a novel Multimodal Context Evaluator that simultaneously optimizes our formulated Global Semantic and Fine-grained Consistency Rewards to ensure generated images preserve the scene attributes of reference images in relation to the text guidance while maintaining diversity. As the first model to address the task of maintaining both diversity and fidelity given a multimodal context, we introduce a new benchmark formulation incorporating the MME Perception and Bongard HOI datasets. Benchmark experiments show Hummingbird outperforms all existing methods, achieving superior fidelity while maintaining diversity and validating its potential as a robust multimodal context-aligned image generator for complex visual tasks.
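To make the two-reward objective concrete, here is a minimal sketch of how a Global Semantic and a Fine-grained Consistency reward might be folded into one training loss; the evaluator interface, weights, and function names below are illustrative assumptions, not the paper's actual API:

```python
import torch

def context_alignment_loss(generated, reference, text_embeds, evaluator,
                           w_global=1.0, w_fine=1.0):
    """Hypothetical two-term reward objective in the spirit of Hummingbird's
    Multimodal Context Evaluator. The evaluator is assumed to expose:
      - global_score(img, txt): global semantic agreement with the guidance
      - fine_score(img, ref, txt): fine-grained consistency of scene
        attributes with the reference image under the text guidance
    """
    r_global = evaluator.global_score(generated, text_embeds)
    r_fine = evaluator.fine_score(generated, reference, text_embeds)
    # Both rewards are maximized, so the loss is their negated weighted sum.
    return -(w_global * r_global + w_fine * r_fine).mean()
```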
2025/03/31 (11am)
Discussion on Generative AI
Alexei/Alyosha Efros (UC Berkeley)
2024/12/19 (2pm)
A General Framework for Text Line Recognition
Raphael Baena (Ecole des Ponts)
In this seminar, I will present a quick overview of my Ph.D. research on transfer learning and generalization, followed by a detailed discussion of our recent NeurIPS paper on General Detection-based Text Line Recognition (DTLR). DTLR is a novel approach for recognizing text lines, whether printed or handwritten, across diverse scripts, including Latin, Chinese, and ciphered characters. Most handwritten text recognition (HTR) methods have focused on autoregressive decoding, predicting characters one after the other; DTLR instead casts recognition as character detection. Our method shows strong results across various scripts, even those typically addressed by specialized techniques. In particular, we achieve state-of-the-art performance for Chinese script recognition on the CASIA v2 dataset and for cipher recognition on the Borg and Copiale datasets. Finally, I will highlight several collaborative applications and extensions of this work with historians.
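As a rough sketch of the detection-based idea, suppose a (hypothetical) detector returns per-character boxes, class labels, and confidences for a line image; decoding then reduces to thresholding and reading detections off in order, with no autoregressive loop:

```python
def decode_detections(boxes, labels, scores, alphabet, conf_thresh=0.5):
    """Turn character detections into a transcription.

    boxes:  list of (x0, y0, x1, y1) in line-image coordinates
    labels: class indices into `alphabet`
    scores: detection confidences in [0, 1]
    Assumes a horizontal left-to-right script for simplicity.
    """
    kept = [(b, l) for b, l, s in zip(boxes, labels, scores) if s >= conf_thresh]
    kept.sort(key=lambda bl: bl[0][0])  # order by left edge of each box
    return "".join(alphabet[l] for _, l in kept)
```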
2024/12/13 (2pm)
Enhancing Human-Centred Visual Learning: Innovations in Vision Algorithms for Improved Human Understanding
Hyung Jin Chang (University of Birmingham)
The progress of artificial intelligence fundamentally depends on humans, both as inventors and beneficiaries. My research on human-centred visual learning focuses on developing vision-based algorithms that prioritise the usability and usefulness of AI systems by addressing human needs and requirements based on visual cues. A crucial aspect of this work involves understanding human body pose, hand pose, eye gaze, and object interaction, as they provide valuable insights into human actions and behaviours. During this talk, I will discuss recent studies conducted by my group, particularly our latest advancements in combining vision-based methodologies with language models. These include integrating human body pose, motion, hand movements, and gaze estimation with language models to enhance context-aware understanding and interaction. This multimodal approach significantly improves the interpretability and adaptability of AI systems. I will also highlight other advancements in multimodal data integration, such as audio and text-based hand-object pose and shape estimation, as well as face+eye gaze image synthesis. These innovative approaches not only enhance the accuracy and robustness of our algorithms but also open new avenues for intuitive human-AI interaction. Furthermore, I will explore the latest research trends in computer vision and demonstrate how these methodologies have been, and can be, applied to algorithm development. Additionally, we will reflect on potential future directions for this field. The applications leveraging these human-centred vision methodologies are poised to revolutionise the way AI understands and interacts with humans.
2024/12/04 (2pm)
Visit ScienceXGames
David Louapre (Ubisoft)
Visit of the VISTA team
2024/12/02 (2pm)
HDR Defense: Story-level multimodal generative AI - from understanding to generating visual data using multiple modalities
Vicky Kalogeiton
The jury will be composed of:
Raoul de Charette, Research Director, Inria (Reviewer)
Dimitris Samaras, Professor, Stony Brook University (Reviewer)
Josef Sivic, Distinguished Researcher, Czech Technical University (Reviewer)
Matthieu Cord, Professor, Sorbonne University (Examiner)
Juergen Gall, Professor, University of Bonn (Examiner)
Vincent Lepetit, Professor, ENPC (Examiner)
Elisa Ricci, Professor, University of Trento (Examiner)
Jakob Verbeek, Research Scientist, FAIR, Meta (Examiner)
2024/05/22 (2pm)
PhD Defense: Interactive 3D Modeling of Evolutionary and Emergent Bio-Inspired Shapes
David-Henri Garnier
Due to manufacturing constraints, Computer-Aided Design has primarily focused on combinations of mathematical functions and simple parametric forms. However, the landscape changed with the advent of 3D printing, which allows for high shape complexity. The cost of additive manufacturing is now dominated by part size and material used rather than complexity, paving the way for a reevaluation of 3D modeling practices, including interactive conception and increased complexity. Inspired by the self-organizing principles observed in living organisms, the field of morphogenesis presents an intriguing alternative for 3D modeling. Unlike traditional CAD systems relying on explicit user-defined parameters, morphogenetic models leverage dynamic processes that exhibit emergence, evolution, adaptation to the environment, or self-healing. The general purpose of this Ph.D. is to explore and develop new approaches to 3D modeling based on highly detailed evolutionary shapes inspired by morphogenesis. The thesis commences with an in-depth exploration of bio-inspired 3D modeling, encompassing various methodologies, challenges, and options for incorporating bio-inspired concepts into 3D modeling practices.
Subsequent chapters delve into specific morphogenesis models. In the first part, the focus extends to adapting a biologically inspired model, specifically Physarum polycephalum, into computer graphics for designing organic-like microstructures. This section offers a comprehensive methodological development, analyzes model parameters, and discusses potential applications in diverse fields such as additive manufacturing, design, and biology. In the second part, a novel approach is investigated, utilizing Reaction-Diffusion models to grow lattice-like and membrane-like structures within arbitrary shapes. The methodology is based on anisotropic Reaction-Diffusion systems and diffusion tensor fields, demonstrating applications in mechanical properties, validation through nonlinear analysis, user interaction, and scalability. Finally, the third part explores the application of deep learning techniques to learn the rules of morphogenesis processes, specifically Reaction-Diffusion. It begins by illustrating the richness offered by Reaction-Diffusion systems before delving into the training of Cellular Automata and Reaction-Diffusion rules to learn system parameters, resulting in robust and "life-like" behaviors.
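For intuition, a minimal isotropic Gray-Scott reaction-diffusion step is sketched below as a toy stand-in; the thesis works with anisotropic systems whose growth is steered by diffusion tensor fields, which this sketch does not reproduce:

```python
import numpy as np

def laplacian(Z):
    # 5-point stencil with periodic boundaries
    return (np.roll(Z, 1, 0) + np.roll(Z, -1, 0)
            + np.roll(Z, 1, 1) + np.roll(Z, -1, 1) - 4 * Z)

def gray_scott_step(U, V, Du=0.16, Dv=0.08, feed=0.035, kill=0.065, dt=1.0):
    """One explicit Euler step of the Gray-Scott model (illustrative values)."""
    uvv = U * V * V
    U += dt * (Du * laplacian(U) - uvv + feed * (1.0 - U))
    V += dt * (Dv * laplacian(V) + uvv - (feed + kill) * V)
    return U, V
```

Iterating this step from a randomly perturbed initial state grows the familiar spot and stripe patterns; replacing the scalar diffusion coefficients with spatially varying tensors is what lets such systems grow oriented, lattice-like structures.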
2024/03/21 (2pm)
Learning Based Simulation
Dinesh Manocha (University of Maryland)
Manufacturing design, virtual prototyping, and digital twin-generation tools create electronic representations of mechanical parts and structures that need to be tested for interconnectivity, functionality, and reliability. A key component of these systems is to perform physics-based simulations based on mathematical models and scientific solvers. Our group has exploited GPU parallelism and neural architectures to scale these technologies to complex systems and use them for real-time applications. In the late 1990s, ours was the first group to exploit GPU shading and rasterization capabilities for physics, geometric, and database computations; this field has now matured, and many of these technologies have been adopted by industry. Over the last few years, we have also explored using neural architectures and machine learning methods to accelerate rigid-body, fluid, cloth, acoustic, and traffic simulation. More recently, physics-based simulations have been used to generate synthetic datasets for training. This talk will give an overview of our work and highlight some significant challenges in performing fast and reliable simulations.
2024/02/29 (2pm)
Huawei Research in Computer Vision and Graphics
Celine Loscos
2023/12/04 (2pm)
PhD Defense: Robust Geometry Processing: Detecting and Handling Discontinuities in 3D Pointset Denoising and Mesh Parameterization
Jiayi Wei
Discontinuities in geometric data can take various forms, including global and explicit features like sharp edges, boundaries, and cuts, as well as local dissimilarities among neighboring samples. Dealing with discontinuities remains a common challenge in geometry processing. In this thesis, we present a cohesive framework for handling discontinuities in geometry processing tasks, based on the principles of robust statistics via line processes. Specifically, the research focuses on two particular applications. The first addresses 3D pointset denoising, where we propose a novel approach based on a non-linear optimization of the tangent spaces; using line processes to effectively separate outliers from inliers, it recovers reliably smooth results from outlier-ridden, very noisy pointsets. Similarly, the framework is applied to the joint optimization of mesh parameterization and seam placement, where the line process determines whether edges of the mesh should be cut or not. The framework offers a new and comprehensive perspective that interprets and consolidates many state-of-the-art techniques.
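The line-process idea can be illustrated on a toy least-squares problem: each residual receives a continuous inlier weight in [0, 1], and outliers are softly "cut" as their weight drops to zero. The closed-form weight below corresponds to the Geman-McClure robust kernel; the thesis applies the same machinery to tangent-space denoising and seam placement, not to this toy problem:

```python
import numpy as np

def line_process_weights(residuals, mu):
    # Minimizing l*r^2 + mu*(sqrt(l) - 1)^2 over l in [0, 1] gives:
    return (mu / (mu + residuals ** 2)) ** 2

def robust_mean(x, mu=1.0, iters=20):
    """Robust mean of 1D samples via iteratively reweighted least squares."""
    m = np.median(x)  # robust initialization
    for _ in range(iters):
        w = line_process_weights(x - m, mu)
        m = np.sum(w * x) / np.sum(w)
    return m
```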
2023/12/04 (10am)
PhD Defense: Human Locomotion in Natural Environments
Eduardo Alvarado
Designing the motion of virtual characters, as well as studying their interactions with the environment, is a key task for many applications. Such characters can be used to improve the realism of 3D movies or to create plausible reactive characters for video games. On a more scientific basis, controlled characters can be used to compute visual simulations from priors, or to discover the impact that humans may have on the environment in the short and long term. We consider the particularly challenging application of generating interactive yet plausible animations for humans in natural and dynamic scenes, taking into account the effect the characters have on the environment and the feedback the environment exerts on them in turn. Although these environments are particularly complex in terms of interactivity due to their diversity of elements, they also provide fantastic opportunities to find new methods for designing and modifying interactive animations that adapt to such scenes at different scales, and thus to use them in a wide range of applications for the creation of virtual content: from video games and VR to didactic applications such as the visualization of natural scenery in museums.
2023/11/30 (2pm)
PhD Defense: Multiagent Reinforcement Learning
Ariel Kwiatkowski
2023/11/20 (2pm)
Fundamental Problems with Motion Control Policies
Paul Kry (McGill)
Deep reinforcement learning (DRL) methods have demonstrated impressive results for skilled motion synthesis of physically based characters, and while these methods perform well in terms of tracking reference motions or achieving complex tasks, several concerns arise when evaluating the naturalness of the motion. Specifically, we note that policies can be too stiff, too strong, and too smart, and we present quantitative metrics for measuring the naturalness of motion produced by DRL control policies beyond their visual appearance. In this talk I will also discuss an approach for modifying the latent space of existing policies to allow new behaviors to be quickly learned from similar tasks in comparison to learning from scratch.
2023/10/09 (5pm)
Procedural Noise Functions
Pascal Guehl
Procedural noise is an essential tool in computer graphics for building appealing and realistic virtual scenes for movies and video games (e.g., texturing, modeling terrain and natural phenomena). From an artistic point of view, it can bring computer-generated images to life. But designing a synthesis algorithm that unifies all the expected properties, such as spectral control, high performance, anti-aliasing, and extreme compactness when dealing with solid, animated noise (e.g., 3D+time), remains a challenge: none of the existing noise generation algorithms covers all these properties satisfactorily. In particular, the extension to 3D+time has hardly been addressed in recent work, despite the fact that it underpins a large number of applications in the modeling of natural phenomena. We introduce a new algorithm based on waves that propagate randomly through space and time.
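A toy version of noise summed from waves propagating randomly through space and time might look as follows; the actual algorithm's spectral control, anti-aliasing, and compactness are not reproduced here, and all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64                                    # number of random waves
dirs = rng.normal(size=(N, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit propagation directions
freqs = rng.uniform(1.0, 4.0, N)          # spatial frequencies
speeds = rng.uniform(0.5, 2.0, N)         # propagation speeds
phases = rng.uniform(0.0, 2.0 * np.pi, N)

def wave_noise(p, t):
    """Animated solid noise at 3D point p (shape (3,)) and time t."""
    phase = freqs * (p @ dirs.T) - freqs * speeds * t + phases
    return np.sum(np.sin(phase)) / np.sqrt(N)  # normalize the sum

# Example: evaluate the noise at one space-time sample.
print(wave_noise(np.array([0.5, 1.0, 2.0]), t=0.3))
```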
2023/07/03 (2pm)
PhD Defense: Efficient Visual Simulation of Volcanic Phenomena
Maud Lastic
Offering tools to easily and efficiently create consistent natural phenomena is one of the main challenges in Computer Graphics, where visually plausible but controllable virtual effects are mandatory for 3D films, simulators, and games. The goal of this PhD is to propose novel multi-scale models for animating volcanic eruptions. These models are meant to be used by artists, which sets several goals: being plausible, meaning both a geometric resemblance to reality and the correct temporal dynamics; being fast, in order to be usable interactively and in video games; and being controllable, with a lightweight, easy-to-understand model that users can readily adapt to their needs. To animate explosive eruptions, we propose a model that takes their unique dynamics into account, resulting in ascending plumes that propagate upward and finally spread sideways, as well as pyroclastic flows spreading down the slopes of the volcano, depending on initial conditions. Our model combines two consistently coupled, simple sub-models: a minimalist Lagrangian simulation, used to represent dynamic horizontal slices of material ejected by the volcano and interacting with the surrounding air; and a procedural model that enhances the visual animation of the turbulent flow with multi-resolution details. We extend this model by combining it with an atmospheric model, in which several horizontal layers represent the atmosphere and we simulate the physical phenomena leading to the formation of clouds; several types of clouds can thus be animated and interact with a plume. Lastly, lava flows are among the most complex targeted phenomena, since they involve fluids evolving through a variety of behaviors while cooling down, from liquid to plastic and then to rigid states. The visual aspect of lava is often hybrid, with liquid parts carrying cooler elements and a deformable crust that folds and deforms. Existing methods were not able to handle the interaction between the visual state of the lava and the underlying flow, nor the formation and folding of deformable surface sheets. Therefore, rather than tackling pure simulation, we use an existing Eulerian simulation of lava flows as a base and build a geometrical simulation of the surface of the flow, letting folds appear based on the velocity and temperature of the flow. We texture the flow using time-consistent textures generated according to the thickness of the crust.
2023/06/12 (5pm)
Interactive Authoring of Terrain using Diffusion Models
James Gain
Generating heightfield terrains is a necessary precursor to the depiction of computer-generated natural scenes in a variety of applications. Authoring such terrains is made challenging by the need for interactive feedback, effective user control, and perceptually realistic output encompassing a range of landforms. We address these challenges by developing a terrain-authoring framework underpinned by an adaptation of diffusion models for conditional image synthesis, trained on real-world elevation data. This framework supports automated cleaning of the training set; authoring control through style selection and feature sketches; the ability to import and freely edit pre-existing terrains; and resolution amplification up to the limits of the source data. Our framework improves on previous machine-learning approaches by expanding landform variety beyond mountainous terrain to encompass cliffs, canyons, and plains; providing a better balance between terseness and specificity in user control; and improving the fidelity of global terrain structure and perceptual realism. This is demonstrated through drainage simulations and a user study testing perceived realism for different classes of terrain.
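For readers unfamiliar with conditional diffusion, a generic ancestral-sampling loop for a conditioned heightfield denoiser is sketched below; the denoiser interface (sketch and style inputs) and noise-schedule handling are assumptions, not the paper's exact sampler:

```python
import torch

@torch.no_grad()
def sample_terrain(denoiser, sketch, style, betas, shape=(1, 1, 256, 256)):
    """Generic DDPM ancestral sampling with conditioning at every step."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                    # start from pure noise
    for t in reversed(range(len(betas))):
        eps = denoiser(x, t, sketch, style)   # noise prediction, conditioned
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) \
            / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                  # normalized heightfield
```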
2023/04/03 (5pm)
From SLAM Robustness to Computational Cinematography
Xi Wang
In this talk, Xi Wang will present his research at the intersection of robotics, 3D vision, and computer graphics. His focus is on developing novel techniques in SLAM, 3D reconstruction, virtual camera control, and cinematography to open up multidisciplinary areas such as computational cinematography, neural network-based camera control and tracking, and content generation that leverages 3D and camera information. The talk will begin with his Ph.D. dissertation, supervised by Marc Christie and Eric Marchand at Univ Rennes, IRISA, Inria Rennes, which aimed to improve SLAM, specifically monocular visual SLAM systems, under varying illumination conditions; he proposed several interweaving solutions to enhance the robustness of visual SLAM systems. He will also discuss his contributions to the field of virtual cinematography, where the objective is to control the virtual camera motion in correlation with computer graphics animation and cinematic styles. The talk will conclude with a discussion of his current and future work, which focuses on multidisciplinary problems spanning 3D computer vision, virtual cinematography, and video content generation.
2023/02/20 (5pm)
Interval Arithmetics for Efficient Implicit Surface Computation
Kavosh Nakhaie Jazar (McGill University)
2023/02/06
VR + Mocap Real-Time Terrain Deformation
Eduardo Alvarado
2023/02/06 (5pm)
Emotion clustering in human faces
Julia K. Melgare (PUCRS, Brazil)
2023/02/06
Virtual Reality Headset
Tim Scheller
2023/01/30
SIGGRAPH Submission
2022/12/02 (2pm)
Archaeological study of the mobility of Homo heidelbergensis in the Tautavel Valley
Sophie Gregoire
2022/11/14
Virtual Terrain Erosion Simulation
Guillaume Cordonnier (Inria Sophia-Antipolis)