Audiovisual Learning of Funny Moments in Videos
ACCV 2022 [Oral, Best Student Paper Award Honorable Mention]
Zhi-Song Liu*,1
Robin Courant*,2
Vicky Kalogeiton2
Caritas Institute of Higher Education, Hong Kong1
VISTA, LIX, Ecole Polytechnique, IP Paris2


Automatically understanding funny moments (i.e., the moments that make people laugh) when watching comedy is challenging, as they relate to various features, such as facial expressions, body language, dialogues and culture. In this paper, we propose FunnyNet, a model that relies on cross- and self-attention over both visual and audio data to predict funny moments in videos. Unlike most methods that focus on text, with or without visual data, to identify funny moments, in this work we exploit audio in addition to visual cues. Audio comes naturally with videos and contains higher-level cues associated with funny moments, such as intonation, pitch and pauses. To acquire labels for training, we propose an unsupervised approach that spots and labels funny audio moments. We provide experiments on five datasets: the sitcoms TBBT, MHD, MUStARD and Friends, and the TED-talk dataset UR-Funny. Extensive experiments and analysis show that FunnyNet successfully exploits visual and auditory cues to identify funny moments, while our findings corroborate our claim that audio is more suitable than text for funny moment prediction. FunnyNet sets the new state of the art for laughter detection with audiovisual or multimodal cues on all datasets.
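To illustrate the audiovisual fusion idea described above, here is a minimal sketch of cross-attention between the two modalities, where audio tokens attend over visual tokens. This is an illustrative NumPy sketch of standard scaled dot-product cross-attention, not the paper's exact architecture; all shapes and names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # (Tq, Tk) similarity scores
    weights = softmax(scores, axis=-1)      # each row sums to 1
    return weights @ values                 # (Tq, d) fused features

rng = np.random.default_rng(0)
audio = rng.standard_normal((8, 64))    # 8 audio tokens, dim 64 (illustrative)
visual = rng.standard_normal((16, 64))  # 16 visual tokens, dim 64 (illustrative)

# Audio queries attend over visual keys/values (audio-to-visual direction);
# the symmetric visual-to-audio direction would swap the two modalities.
fused = cross_attention(audio, visual, visual)
print(fused.shape)  # (8, 64)
```

The fused features keep the temporal resolution of the query modality, so a classifier head on top can predict funniness per audio segment while still conditioning on the visual stream.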


Paper and Supplementary Material

Audiovisual Learning of Funny Moments in Videos.

Zhi-Song Liu*, Robin Courant*
and Vicky Kalogeiton
In ACCV, 2022.



We would like to thank our colleagues Dim P. Papadopoulos and Xi Wang, who helped review and correct the manuscript. In addition, we thank the anonymous reviewers for their time and valuable comments.
Finally, thanks to Phillip Isola and Richard Zhang for the project page template; the code can be found here.