FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild
IJCV 2024
Zhi-Song Liu¹
Robin Courant²
Vicky Kalogeiton²
¹Computer Vision and Pattern Recognition Laboratory, LUT University, Finland
²VISTA, LIX, Ecole Polytechnique, IP Paris
[Paper]
[Code]
[Data]

Abstract

Automatically understanding funny moments (i.e., the moments that make people laugh) when watching comedy is challenging, as they relate to various features, such as body language, dialogue and culture. In this paper, we propose FunnyNet-W, a model that relies on cross- and self-attention over visual, audio and text data to predict funny moments in videos. Unlike most methods, which rely on ground truth data in the form of subtitles, we exploit modalities that come naturally with videos: (a) video frames, as they contain visual information indispensable for scene understanding; (b) audio, as it contains higher-level cues associated with funny moments, such as intonation, pitch and pauses; and (c) text automatically extracted with a speech-to-text model, as it can provide rich information when processed by a Large Language Model. To acquire labels for training, we propose an unsupervised approach that spots and labels funny audio moments. We provide experiments on five datasets: the sitcoms TBBT, MHD, MUStARD and Friends, and the TED talk dataset UR-Funny. Extensive experiments and analysis show that FunnyNet-W successfully exploits visual, auditory and textual cues to identify funny moments, while our findings reveal FunnyNet-W's ability to predict funny moments in the wild. FunnyNet-W sets a new state of the art for funny moment detection with multimodal cues on all datasets, both with and without ground truth information.
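
To make the cross-attention idea in the abstract concrete, the following is a minimal sketch of scaled dot-product cross-attention, in which tokens from one modality query tokens from another. It is not the FunnyNet-W implementation: the function names, the toy 2-dimensional "audio" and "visual" tokens, and the absence of learned projection matrices are all simplifying assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention where queries come from one modality
    and keys/values come from another (hypothetical helper, no learned weights)."""
    dim = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query token to every key token, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value tokens.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Toy inputs (assumed for illustration): 2 audio tokens, 3 visual tokens.
audio_tokens = [[1.0, 0.0], [0.0, 1.0]]
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Each audio token gathers information from the visual tokens.
audio_attends_visual = cross_attention(audio_tokens, visual_tokens, visual_tokens)
```

The output has one fused vector per audio token, each a convex combination of the visual tokens. A full model would typically add learned query, key and value projections and multiple heads, and would apply self-attention (queries, keys and values from the same sequence) to the fused result.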



Paper and Supplementary Material

FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild.
Zhi-Song Liu, Robin Courant and Vicky Kalogeiton.
In IJCV, 2024.




Acknowledgements

This work is supported by a DIM RFSI grant, a Hi!Paris collaborative project grant for V. Kalogeiton, the ANR projects WhyBehindScenes ANR-22-CE23-0007 and APATE ANR-22-CE39-0016, and the HPC resources of IDRIS under the allocation 2022-AD011013951 made by GENCI.
Finally, thanks to Phillip Isola and Richard Zhang for the project page template.