Our approach
Multimodal latent space

To efficiently handle multimodal generation of humans and cameras, we adopt a latent diffusion approach and train an autoencoder that aligns both modalities into a shared latent space, together with an auxiliary modality that bridges them: the on-screen framing of the human within the camera view.
First, we encode the human and camera modalities through a joint encoder; then, a lightweight learnable linear transform maps these embeddings into an on-screen framing latent.
Finally, three independent decoders reconstruct each modality from its latent, and the model is trained end-to-end using reconstruction losses.
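The sketch below illustrates one way such an autoencoder could be structured, assuming simple MLP encoders and decoders over flattened per-frame features; the module names, feature dimensions (e.g. `human_dim=263`, `camera_dim=7`), and equal loss weighting are illustrative assumptions, not details taken from the method itself.

```python
# Illustrative sketch of the shared latent autoencoder (assumed architecture).
import torch
import torch.nn as nn

class JointAutoencoder(nn.Module):
    def __init__(self, human_dim=263, camera_dim=7, latent_dim=256, framing_dim=4):
        super().__init__()
        # Joint encoder: concatenated human + camera features -> shared latent.
        self.encoder = nn.Sequential(
            nn.Linear(human_dim + camera_dim, 512), nn.GELU(),
            nn.Linear(512, 2 * latent_dim),  # human latent || camera latent
        )
        # Lightweight learnable linear transform: human/camera latents ->
        # on-screen framing latent (its linearity is what later yields the
        # projection used at sampling time).
        self.to_framing = nn.Linear(2 * latent_dim, latent_dim, bias=False)
        # Three independent decoders, one per modality.
        self.dec_human = nn.Sequential(nn.Linear(latent_dim, 512), nn.GELU(),
                                       nn.Linear(512, human_dim))
        self.dec_camera = nn.Sequential(nn.Linear(latent_dim, 512), nn.GELU(),
                                        nn.Linear(512, camera_dim))
        self.dec_framing = nn.Sequential(nn.Linear(latent_dim, 512), nn.GELU(),
                                         nn.Linear(512, framing_dim))

    def forward(self, human, camera):
        z = self.encoder(torch.cat([human, camera], dim=-1))
        z_h, z_c = z.chunk(2, dim=-1)   # human / camera latents
        z_f = self.to_framing(z)        # on-screen framing latent
        return self.dec_human(z_h), self.dec_camera(z_c), self.dec_framing(z_f)

def reconstruction_loss(model, human, camera, framing):
    # End-to-end training signal: reconstruct all three modalities.
    rec_h, rec_c, rec_f = model(human, camera)
    return (nn.functional.mse_loss(rec_h, human)
            + nn.functional.mse_loss(rec_c, camera)
            + nn.functional.mse_loss(rec_f, framing))
```

Keeping the map from the human/camera latents to the framing latent linear (here `to_framing`) is the design choice that makes a well-defined projection onto the framing subspace available at sampling time.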
Auxiliary sampling

Building on our shared multimodal latent space, we propose a latent diffusion framework that incorporates an auxiliary sampling technique to enhance coherence between human motion x and camera trajectories y via the on-screen framing z.
During sampling, the model's prediction is decomposed into a z-dependent component that guides the generation toward a coherent human-camera pair, and a complementary component that serves as an “unconditional” term. This decomposition leverages the linear transform linking the human and camera latents to the on-screen framing, yielding the orthogonal projection $\mathbf{P}_{\parallel}$ onto the framing subspace:
$$ \hat{\boldsymbol{\epsilon}}(\mathbf{x}_t, \mathbf{y}_t, \mathbf{c}, t) = \boldsymbol{\epsilon}(\mathbf{x}_t, \mathbf{y}_t, \emptyset, t) + w_z \mathbf{P}_{\parallel} \, \boldsymbol{\epsilon}(\mathbf{x}_t, \mathbf{y}_t, \emptyset, t) + w_c \Big( \boldsymbol{\epsilon}(\mathbf{x}_t, \mathbf{y}_t, \mathbf{c}, t) - \boldsymbol{\epsilon}(\mathbf{x}_t, \mathbf{y}_t, \emptyset, t) \Big) $$
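A minimal sketch of this guided prediction is given below, assuming a hypothetical `denoiser(x_t, y_t, c, t)` that predicts noise over the concatenated human/camera latents and that `A` is the weight matrix of the learnable linear transform to the framing latent; constructing $\mathbf{P}_{\parallel}$ as $A^\top (A A^\top)^{-1} A$ and the guidance weights `w_z`, `w_c` are illustrative assumptions.

```python
# Illustrative sketch of the auxiliary sampling step (assumed interfaces).
import torch

def parallel_projection(A: torch.Tensor) -> torch.Tensor:
    """Orthogonal projection onto the row space of A, i.e. the subspace of the
    joint human/camera latent that determines the on-screen framing:
    P = A^T (A A^T)^{-1} A."""
    return A.T @ torch.linalg.solve(A @ A.T, A)

@torch.no_grad()
def guided_epsilon(denoiser, x_t, y_t, c, t, P_par, w_z=1.0, w_c=2.5):
    """eps_hat = eps_uncond + w_z * P_par eps_uncond + w_c * (eps_cond - eps_uncond)."""
    eps_uncond = denoiser(x_t, y_t, None, t)  # "unconditional" term (c = ∅)
    eps_cond = denoiser(x_t, y_t, c, t)       # condition-aware term
    # z-dependent component: project the unconditional prediction onto the
    # framing subspace to steer toward a coherent human-camera pair.
    eps_framing = eps_uncond @ P_par.T
    return eps_uncond + w_z * eps_framing + w_c * (eps_cond - eps_uncond)
```

With the autoencoder sketched earlier, `P_par` could be obtained once as `parallel_projection(autoencoder.to_framing.weight.detach())` and reused at every denoising step.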