Paul Primus, Gerhard Widmer,
"Fusing Audio and Metadata Embeddings Improves Language-based Audio"
in: Proceedings of the 32nd European Signal Processing Conference (EUSIPCO), Lyon, France, 2024
Original title:
Fusing Audio and Metadata Embeddings Improves Language-based Audio
Language of the title:
Original book title:
Proceedings of the 32nd European Signal Processing Conference (EUSIPCO), Lyon, France
Original abstract:
Matching raw audio signals with textual descriptions requires understanding the audio's content and the description's semantics and then drawing connections between the two modalities. This paper investigates a hybrid retrieval system that utilizes audio metadata as an additional clue to understand the content of audio signals before matching them with textual queries. We experimented with metadata often attached to audio recordings, such as keywords and natural-language descriptions, and we investigated late and mid-level fusion strategies to merge audio and metadata. Our hybrid approach with keyword metadata and late fusion improved the retrieval performance over a content-based baseline by 2.36 and 3.69 pp. mAP@10 on the ClothoV2 and AudioCaps benchmarks, respectively.
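The abstract does not state the fusion formula, so the following is only a minimal sketch of one common form of late fusion: each candidate is scored separately against the query by its audio embedding and by its metadata embedding, and the two similarity scores are combined by a weighted sum. The function names, the `weight` parameter, the use of cosine similarity, and the toy two-dimensional embeddings are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def late_fusion_scores(query_emb, audio_embs, metadata_embs, weight=0.5):
    """Combine per-modality similarities with a weighted sum (hypothetical scheme).

    weight = 0.0 uses audio only; weight = 1.0 uses metadata only.
    """
    scores = []
    for audio_emb, meta_emb in zip(audio_embs, metadata_embs):
        s_audio = cosine(query_emb, audio_emb)  # query vs. audio content
        s_meta = cosine(query_emb, meta_emb)    # query vs. keywords/description
        scores.append((1.0 - weight) * s_audio + weight * s_meta)
    return scores


def rank(scores):
    """Return candidate indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data (made up): two candidate recordings, one text query.
query = [1.0, 0.0]
audio_embs = [[0.6, 0.8], [0.8, 0.6]]
metadata_embs = [[1.0, 0.0], [0.0, 1.0]]

audio_only_ranking = rank(late_fusion_scores(query, audio_embs, metadata_embs, weight=0.0))
fused_scores = late_fusion_scores(query, audio_embs, metadata_embs, weight=0.5)
fused_ranking = rank(fused_scores)
```

In this toy setup the audio-only ranking prefers the second recording, while the fused ranking prefers the first because its metadata matches the query. This only illustrates how metadata can change a ranking; it says nothing about the size of the gains reported in the abstract.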