Our approach
Multimodal latent space

To efficiently handle multimodal generation of humans and cameras, we adopt a latent diffusion approach and train an autoencoder that aligns both modalities into a shared latent space, together with an auxiliary modality that bridges them: the on-screen framing of the human within the camera view.
First, we encode the human and camera modalities through a joint encoder; then, a lightweight learnable linear transform maps these embeddings into an on-screen framing latent.
Finally, three independent decoders reconstruct each modality from its latent, and the model is trained end-to-end using reconstruction losses.
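The sketch below illustrates one way such an autoencoder could be structured, assuming simple MLP encoders and decoders over flattened per-frame features; the module names, feature dimensions (e.g. `human_dim=263`, `camera_dim=7`), and equal loss weighting are illustrative assumptions, not details taken from the method itself.

```python
# Illustrative sketch of the shared latent autoencoder (assumed architecture).
import torch
import torch.nn as nn

class JointAutoencoder(nn.Module):
    def __init__(self, human_dim=263, camera_dim=7, latent_dim=256, framing_dim=4):
        super().__init__()
        # Joint encoder: concatenated human + camera features -> shared latent.
        self.encoder = nn.Sequential(
            nn.Linear(human_dim + camera_dim, 512), nn.GELU(),
            nn.Linear(512, 2 * latent_dim),  # human latent || camera latent
        )
        # Lightweight learnable linear transform: human/camera latents ->
        # on-screen framing latent (its linearity is what later yields the
        # projection used at sampling time).
        self.to_framing = nn.Linear(2 * latent_dim, latent_dim, bias=False)
        # Three independent decoders, one per modality.
        self.dec_human = nn.Sequential(nn.Linear(latent_dim, 512), nn.GELU(),
                                       nn.Linear(512, human_dim))
        self.dec_camera = nn.Sequential(nn.Linear(latent_dim, 512), nn.GELU(),
                                        nn.Linear(512, camera_dim))
        self.dec_framing = nn.Sequential(nn.Linear(latent_dim, 512), nn.GELU(),
                                         nn.Linear(512, framing_dim))

    def forward(self, human, camera):
        z = self.encoder(torch.cat([human, camera], dim=-1))
        z_h, z_c = z.chunk(2, dim=-1)   # human / camera latents
        z_f = self.to_framing(z)        # on-screen framing latent
        return self.dec_human(z_h), self.dec_camera(z_c), self.dec_framing(z_f)

def reconstruction_loss(model, human, camera, framing):
    # End-to-end training signal: reconstruct all three modalities.
    rec_h, rec_c, rec_f = model(human, camera)
    return (nn.functional.mse_loss(rec_h, human)
            + nn.functional.mse_loss(rec_c, camera)
            + nn.functional.mse_loss(rec_f, framing))
```

Keeping the map from the human/camera latents to the framing latent linear (here `to_framing`) is the design choice that makes a well-defined projection onto the framing subspace available at sampling time.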
Auxiliary sampling

Building on our shared multimodal latent space, we propose a latent diffusion framework that incorporates an auxiliary sampling technique to enhance coherence between human motion x and camera trajectories y via the on-screen framing z.
During sampling, the model's prediction is decomposed into a z-dependent component that guides the generation toward a coherent human-camera pair, and a complementary component that serves as an “unconditional” term. This decomposition leverages the linear transform linking the human and camera latents to the on-screen framing, yielding the orthogonal projection $\mathbf{P}_{\parallel}$ onto the framing subspace:
$$ \hat{\boldsymbol{\epsilon}}(\mathbf{x}_t, \mathbf{y}_t, \mathbf{c}, t) = \boldsymbol{\epsilon}(\mathbf{x}_t, \mathbf{y}_t, \emptyset, t) + w_z \mathbf{P}_{\parallel} \, \boldsymbol{\epsilon}(\mathbf{x}_t, \mathbf{y}_t, \emptyset, t) + w_c \Big( \boldsymbol{\epsilon}(\mathbf{x}_t, \mathbf{y}_t, \mathbf{c}, t) - \boldsymbol{\epsilon}(\mathbf{x}_t, \mathbf{y}_t, \emptyset, t) \Big) $$
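A minimal sketch of this guided prediction is given below, assuming a hypothetical `denoiser(x_t, y_t, c, t)` that predicts noise over the concatenated human/camera latents and that `A` is the weight matrix of the learnable linear transform to the framing latent; constructing $\mathbf{P}_{\parallel}$ as $A^\top (A A^\top)^{-1} A$ and the guidance weights `w_z`, `w_c` are illustrative assumptions.

```python
# Illustrative sketch of the auxiliary sampling step (assumed interfaces).
import torch

def parallel_projection(A: torch.Tensor) -> torch.Tensor:
    """Orthogonal projection onto the row space of A, i.e. the subspace of the
    joint human/camera latent that determines the on-screen framing:
    P = A^T (A A^T)^{-1} A."""
    return A.T @ torch.linalg.solve(A @ A.T, A)

@torch.no_grad()
def guided_epsilon(denoiser, x_t, y_t, c, t, P_par, w_z=1.0, w_c=2.5):
    """eps_hat = eps_uncond + w_z * P_par eps_uncond + w_c * (eps_cond - eps_uncond)."""
    eps_uncond = denoiser(x_t, y_t, None, t)  # "unconditional" term (c = ∅)
    eps_cond = denoiser(x_t, y_t, c, t)       # condition-aware term
    # z-dependent component: project the unconditional prediction onto the
    # framing subspace to steer toward a coherent human-camera pair.
    eps_framing = eps_uncond @ P_par.T
    return eps_uncond + w_z * eps_framing + w_c * (eps_cond - eps_uncond)
```

With the autoencoder sketched earlier, `P_par` could be obtained once as `parallel_projection(autoencoder.to_framing.weight.detach())` and reused at every denoising step.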