Automatically understanding funny moments (i.e., the moments that make people laugh) in
comedy videos is challenging, as they relate to various features, such as facial expressions, body
language, dialogue and culture. In this paper, we propose FunnyNet, a model that relies on cross- and
self-attention for both visual and audio data to predict funny moments in videos. Unlike most
methods that focus on text, with or without visual data, to identify funny moments, in this work,
in addition to visual cues, we exploit audio. Audio comes naturally with videos, and moreover it contains
cues associated with funny moments, such as intonation, pitch and pauses.
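The cross- and self-attention fusion of audio and visual features mentioned above can be sketched as follows. This is a minimal NumPy illustration of the general mechanism, not the authors' exact architecture; the token counts, embedding size, and single-head formulation are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
d = 64
audio = rng.standard_normal((10, d))   # 10 audio tokens (hypothetical)
visual = rng.standard_normal((16, d))  # 16 visual tokens (hypothetical)

# Cross-attention: audio queries attend over visual keys/values,
# enriching each audio token with visual context.
audio_enriched = attention(audio, visual, visual)

# Self-attention over the concatenated sequence refines the fused
# audio-visual representation before classification.
fused = np.concatenate([audio_enriched, visual], axis=0)
out = attention(fused, fused, fused)
print(out.shape)  # (26, 64)
```

In practice the queries, keys, and values would each pass through learned linear projections and multiple heads; the sketch keeps only the attention pattern that distinguishes cross- from self-attention.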
To acquire labels for training, we propose an unsupervised approach that spots and labels funny moments using laughter cues.
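One way to turn detected laughter into training labels is to mark the window that precedes each laughter burst as funny. The sketch below is a hypothetical illustration of this idea, not the paper's exact procedure; the function name, 8-second window, and overlap rule are assumptions.

```python
def label_clips(laughter_starts, video_len, clip_len=8.0):
    """Label fixed-length clips of a video as funny (1) or not (0).

    A clip is funny if it overlaps the window of length `clip_len`
    that immediately precedes a detected laughter onset.
    """
    # Windows preceding each laughter onset are presumed funny.
    funny = [(max(0.0, t - clip_len), t) for t in laughter_starts]
    labels = []
    t = 0.0
    while t + clip_len <= video_len:
        overlaps = any(s < t + clip_len and t < e for s, e in funny)
        labels.append((t, t + clip_len, 1 if overlaps else 0))
        t += clip_len
    return labels

# Laughter detected at 12s and 40s in a 60s video.
clips = label_clips([12.0, 40.0], 60.0)
print(clips)
```

The appeal of such a scheme is that sitcom laugh tracks provide the laughter signal for free, so no manual annotation of funny moments is needed.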
We provide experiments on five datasets: the sitcoms TBBT, MHD, MUStARD and Friends, and the TED talks dataset.
Extensive experiments and analysis show that FunnyNet successfully exploits visual and auditory
cues to identify funny moments, while our findings corroborate our claim that audio is more suitable
for funny moment prediction. FunnyNet sets the new state of the art for laughter detection with
audiovisual or multimodal cues on all datasets.