E.T. the Exceptional Trajectories:
Text-to-camera-trajectory generation
with character awareness

1LIX, Ecole Polytechnique, IP Paris; 2LIGM, Ecole des Ponts, CNRS, UGE; 3Inria, IRISA, CNRS, Univ. Rennes

ECCV 2024

Abstract

Stories and emotions in movies emerge through the effect of well-thought-out directing decisions, in particular camera placement and movement over time. Crafting compelling camera motion trajectories remains a complex iterative process, even for skilful artists. To tackle this, we propose the Exceptional Trajectories (E.T.) dataset, which pairs camera trajectories with character information and textual captions describing both the camera and the characters. To our knowledge, this is the first dataset of its kind. To demonstrate the potential of the E.T. dataset, we propose DIRECTOR, a diffusion-based approach that generates complex camera trajectories from textual captions describing the relation and synchronisation between the camera and the characters. Finally, to ensure robust and accurate evaluation, we train CLaTr, a language-trajectory feature representation, on the E.T. dataset and use it for metric computation. Our work represents a significant advancement in democratizing the art of cinematography for common users.

Video

The Exceptional Trajectories dataset (E.T.)

E.T. key properties

Cinematic Content: Realistic and cinematic camera trajectories extracted from real-world movies, offering a diverse range of visual styles.

Scale: Comprises 115K samples, 11M frames, and 120 hours of footage across 16,210 scenes, making it one of the largest datasets of its kind.

Controllability: Includes camera and character trajectories, as well as camera-only and camera-character captions, providing users with flexibility and personalized search capabilities.

Dataset creation pipeline

Data Extraction and Pre-processing: SLAHMR is used to extract 3D camera and character poses from each shot, followed by pre-processing steps like alignment, filtering, smoothing, and cropping.
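
As an illustration, below is a minimal sketch of the kind of smoothing and cropping such a pre-processing stage might apply; the function names, window size, and threshold are assumptions for illustration, not the exact E.T. pipeline.

import numpy as np

def smooth_trajectory(positions, window=5):
    """Moving-average smoothing of per-frame 3D camera positions, shape (T, 3).
    The window size is illustrative."""
    kernel = np.ones(window) / window
    return np.stack(
        [np.convolve(positions[:, d], kernel, mode="same") for d in range(3)],
        axis=1,
    )

def crop_static_ends(positions, eps=1e-3):
    """Drop leading/trailing frames where the camera barely moves."""
    speed = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    moving = np.where(speed > eps)[0]
    if len(moving) == 0:
        return positions
    return positions[moving[0] : moving[-1] + 2]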

Motion Tagging: Trajectories are partitioned into segments with pure camera motions, including static, lateral, vertical, and depth movements. A thresholding-based method is applied to velocity to identify the nature of the motion, resulting in coarse motion tags.
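
A minimal sketch of velocity thresholding for coarse motion tags follows; the threshold value, axis conventions, and tag names are assumptions for illustration.

import numpy as np

def tag_motion(velocity, static_thresh=0.02):
    """Coarse per-frame motion tags from camera velocity, shape (T, 3), in the
    camera frame: x = lateral, y = vertical, z = depth (illustrative conventions)."""
    tags = []
    for v in velocity:
        if np.linalg.norm(v) < static_thresh:
            tags.append("static")
            continue
        axis = int(np.argmax(np.abs(v)))  # dominant motion axis
        sign = np.sign(v[axis])
        tags.append([
            "truck right" if sign > 0 else "truck left",
            "boom top" if sign > 0 else "boom bottom",
            "push in" if sign > 0 else "pull out",
        ][axis])
    return tags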

Caption Generation: Rich textual descriptions of camera trajectories are generated by prompting an LLM to describe the camera motion, using the main character's trajectory as anchor points. The prompt includes context, instructions, constraints, and examples.
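
A hypothetical sketch of how such a prompt could be assembled; the wording and field contents below are illustrative, not the exact prompt used for E.T.

def build_caption_prompt(camera_tags, character_tags, examples):
    """Assemble an LLM prompt from context, instruction, constraint, and
    few-shot examples (all text here is illustrative)."""
    context = "You describe camera trajectories in cinematographic terms."
    instruction = (
        "Write one sentence describing the camera motion, using the main "
        f"character's motion as anchor points.\nCamera tags: {camera_tags}\n"
        f"Character tags: {character_tags}"
    )
    constraint = "Use only standard camera terms (truck, boom, push-in, static)."
    shots = "\n".join(f"Example: {e}" for e in examples)
    return "\n\n".join([context, instruction, constraint, shots])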


DIRECTOR

Architecture

DIRECTOR (DiffusIon tRansformEr Camera TrajectORy) generates camera trajectories conditioned on character trajectories and captions.


Director A: The conditioning is added to the context of the transformer input, using the global CLIP token for text and linear embedding for character trajectories.
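
A minimal PyTorch sketch of this in-context conditioning scheme; the dimensions, layer counts, and 6-D camera parameterisation are assumptions for illustration.

import torch
import torch.nn as nn

class DirectorA(nn.Module):
    """Sketch: conditioning tokens prepended to the transformer context."""
    def __init__(self, d_model=256, clip_dim=512, char_dim=3, cam_dim=6):
        super().__init__()
        self.text_proj = nn.Linear(clip_dim, d_model)  # global CLIP token
        self.char_proj = nn.Linear(char_dim, d_model)  # per-frame character pose
        self.traj_proj = nn.Linear(cam_dim, d_model)   # noisy camera trajectory
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=8)
        self.head = nn.Linear(d_model, cam_dim)

    def forward(self, noisy_traj, char_traj, clip_token):
        ctx = torch.cat(
            [self.text_proj(clip_token).unsqueeze(1),  # 1 text token
             self.char_proj(char_traj),                # T character tokens
             self.traj_proj(noisy_traj)], dim=1)       # T trajectory tokens
        out = self.backbone(ctx)
        return self.head(out[:, -noisy_traj.shape[1]:])  # denoised camera frames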

Director B: A DiT-like architecture in which the conditioning signals are concatenated into a single token and injected at each layer. Layer Norm is replaced with AdaLN, which uses this conditioning token to modulate the scale and bias of the normalization.
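
A minimal sketch of AdaLN-style conditioning, assuming a single conditioning token per sample; the dimensions are illustrative.

import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Sketch of AdaLN as in DiT: the conditioning token predicts a per-layer
    scale and shift applied to the normalized activations."""
    def __init__(self, d_model, d_cond):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(d_cond, 2 * d_model)

    def forward(self, x, cond):
        # x: (B, T, d_model); cond: (B, d_cond) token fusing text and character features
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)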

Director C: The full sequence length of conditioning is leveraged. The text and motion sequences are concatenated, pre-processed with transformer encoders, and incorporated into the model using a cross-attention block.
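
A minimal sketch of the cross-attention conditioning block, with illustrative dimensions; the pre-processing transformer encoders are omitted.

import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Sketch: trajectory tokens attend over the full text + character
    conditioning sequence."""
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, traj_tokens, cond_tokens):
        # cond_tokens: concatenated, encoder-processed text and character sequences
        attended, _ = self.attn(query=traj_tokens, key=cond_tokens, value=cond_tokens)
        return self.norm(traj_tokens + attended)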


Examples of controllability

The camera [trucks right / trucks left / booms top / booms bottom] while the character remains stationary.

Examples of diversity

While the character moves right, the camera performs a boom bottom.

Examples of complexity

While the character moves to the right, the camera [stays static and pushes in / trucks right and remains static] once the character stops.

Examples of character-awareness

The camera remains static as the character moves to the [left / right].


CLaTr

Given the lack of relevant metrics for camera trajectory generation, we extend text-image/motion generation metrics to text-trajectory generation. However, there is no commonly accepted text-trajectory feature embedding.

To address this, we introduce Contrastive Language-Trajectory Embedding (CLaTr), a CLIP-like approach trained on the E.T. dataset. CLaTr consists of trajectory and text encoders and a shared feature decoder, trained with reconstruction, KL, and embedding similarity losses.
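
A minimal sketch of the CLIP-style embedding-similarity term (the reconstruction and KL terms are omitted); the temperature value is illustrative.

import torch
import torch.nn.functional as F

def clatr_contrastive_loss(traj_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss between trajectory and text embeddings,
    shapes (B, D) each, with matched pairs on the diagonal."""
    traj_emb = F.normalize(traj_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = traj_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))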


BibTeX

@article{courant2024et,
  author  = {Robin Courant and Nicolas Dufour and Xi Wang and Marc Christie and Vicky Kalogeiton},
  title   = {E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness},
  journal = {arXiv},
  year    = {2024},
}