Andreas Wagner, "Learning Face-Voice Association", 7-2024
Original title:
Learning Face-Voice Association
Language of title:
English
Original abstract:
With the increasing popularity of social media platforms over the last decades, multimodal
learning has gained more and more interest from researchers. Millions of texts,
images, videos and audio recordings are posted online every day. Using this vast amount
of data, researchers have tackled multimodal tasks such as cross-modal retrieval, matching
and verification [1]. Face-Voice Association (FVA) is an example of such a multimodal
task, using audio and image data. The goal of Face-Voice
Association is to learn the underlying characteristics of face images and voice samples of
identities in order to accurately match faces to the voices of the same identity and vice
versa.
In this thesis, we conduct experiments on three different parts of the Face-Voice Association
pipeline. Existing work on FVA only uses late fusion, meaning that fusing the
modality-specific features is the last step before obtaining the logits. Therefore, we analyze
the impact of different fusion strategies such as early and middle fusion, as well as a
fusion-independent approach. Furthermore, modality-specific sub-networks are often
leveraged to extract discriminative embeddings. To investigate the impact of these
sub-networks on overall FVA performance, we evaluate four different
sub-networks. Finally, motivated by the success of hyperbolic embeddings in deep learning,
we analyze whether hyperbolic embeddings are suited for FVA and compare them to models
using Euclidean space. Our experiments show that FVA performance increases the later
we fuse the features. Additionally, we find that the choice of sub-networks for feature
extraction has a major impact on overall performance. Lastly, our experiments with hyperbolic
embeddings suggest that they may be well suited for the task of FVA.
However, more research is needed to draw a firm conclusion.
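
To make the comparison between hyperbolic and Euclidean embeddings more concrete, the following is a minimal sketch of how a distance between two embeddings could be computed in each space. It assumes the Poincaré ball model with curvature -1, which the abstract does not specify. The function names (exp_map_origin, poincare_distance, euclidean_distance) and the example vectors are hypothetical and are not taken from the thesis.

```python
import math


def exp_map_origin(v):
    """Map a Euclidean vector into the open unit ball (Poincare model, curvature -1).

    Assumption: the exponential map at the origin, exp_0(v) = tanh(|v|) * v / |v|.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(norm) / norm
    return [scale * x for x in v]


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def euclidean_distance(u, v):
    """Ordinary Euclidean distance, shown for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical 2-D "face" and "voice" embeddings, for illustration only.
face = exp_map_origin([0.3, 0.4])
voice = exp_map_origin([0.5, -0.2])
hyperbolic = poincare_distance(face, voice)
euclidean = euclidean_distance(face, voice)
```

In a setup like this, a smaller distance between a face embedding and a voice embedding would be read as stronger evidence that both belong to the same identity. Hyperbolic distances grow quickly as points approach the boundary of the ball, which is one reason such embeddings behave differently from Euclidean ones.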