Automatically understanding funny moments (i.e., the moments that make people laugh) when watching comedy is challenging, as they relate to diverse features such as facial expressions, body language, dialogue and culture. In this paper, we propose FunnyNet, a model that relies on cross- and self-attention over both visual and audio data to predict funny moments in videos. Unlike most methods, which focus on text with or without visual data to identify funny moments, in this work we exploit audio in addition to visual cues. Audio comes naturally with videos, and moreover it contains higher-level cues associated with funny moments, such as intonation, pitch and pauses.
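To make the fusion idea concrete, below is a minimal sketch of how audio and visual token streams could be combined with cross- and self-attention. The dimensions, number of heads, pooling and classifier head are illustrative assumptions, not the exact FunnyNet architecture.

```python
# Hedged sketch: cross-attention between audio and visual tokens, followed by
# self-attention over the fused sequence and a clip-level "funny" classifier.
import torch
import torch.nn as nn

class AVFusionSketch(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Cross-attention in both directions: audio attends to visual and vice versa.
        self.cross_av = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_va = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Self-attention over the fused token sequence.
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, 1)  # one "funny" logit per clip

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens, visual_tokens: (batch, seq_len, dim)
        a, _ = self.cross_av(audio_tokens, visual_tokens, visual_tokens)
        v, _ = self.cross_va(visual_tokens, audio_tokens, audio_tokens)
        fused = torch.cat([a, v], dim=1)            # concatenate the two streams
        fused, _ = self.self_attn(fused, fused, fused)
        return self.classifier(fused.mean(dim=1))   # pooled clip-level prediction
```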
To acquire labels for training, we propose an unsupervised approach that spots and labels funny audio moments.
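As an illustration of this label-acquisition idea, the sketch below spots laughter-like bursts in a sitcom audio track with a simple energy heuristic; the `spot_laughter` helper, the RMS-energy criterion and the thresholds are hypothetical assumptions for illustration, not the paper's actual unsupervised method.

```python
# Hedged sketch: in sitcoms, canned laughter typically follows funny moments,
# so detecting laughter-like bursts in the audio track can yield "funny" labels
# without manual annotation.
import numpy as np
import librosa

def spot_laughter(path, sr=16000, frame_s=0.5, threshold=2.0):
    """Return start times (in seconds) of frames with unusually high RMS energy."""
    y, sr = librosa.load(path, sr=sr)
    hop = int(frame_s * sr)
    rms = librosa.feature.rms(y=y, frame_length=hop, hop_length=hop)[0]
    z = (rms - rms.mean()) / (rms.std() + 1e-8)   # z-score per frame
    return [i * frame_s for i, score in enumerate(z) if score > threshold]

# The clip immediately preceding each detected laughter burst would then be
# labelled as a funny moment for training.
```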
We provide experiments on five datasets: the sitcoms TBBT, MHD, MUStARD and Friends, and the TED-talk dataset UR-Funny.
Extensive experiments and analysis show that FunnyNet successfully exploits visual and auditory cues to identify funny moments, while our findings corroborate our claim that audio is more suitable than text for funny-moment prediction. FunnyNet sets the new state of the art for laughter detection with audiovisual or multimodal cues on all datasets.