Paul Primus, Gerhard Widmer,
"CP-JKU's Submission to Task 6a of the DCASE2022 Challenge: A BART encoder-decoder for Automatic Audio Captioning trained via the Reinforce Algorithm and Transfer Learning"
in: Detection and Classification of Acoustic Scenes and Events 2022 Challenge (DCASE2022), 2022
Original title:
CP-JKU's Submission to Task 6a of the DCASE2022 Challenge: A BART encoder-decoder for Automatic Audio Captioning trained via the Reinforce Algorithm and Transfer Learning
Language of the title:
English
Original book title:
Detection and Classification of Acoustic Scenes and Events 2022 Challenge (DCASE2022)
Original abstract:
This technical report details the CP-JKU submission to the automatic audio captioning task of the 2022 DCASE challenge (task 6a). The objective of the task was to train a sequence-to-sequence model that automatically generates textual descriptions for given audio recordings. The approach described in this report enhances the BART-based encoder-decoder model used as the challenge's baseline system in three directions. First, the VGGish embedding model was replaced with a custom CNN10-like model that we pre-trained on AudioSet. Second, the BART encoder-decoder model was pre-trained on AudioCaps, which led to faster convergence. Finally, the best model was further fine-tuned by optimizing the non-differentiable CIDEr metric using the REINFORCE algorithm. Our best model achieves a SPIDEr score of 0.29 (single-model performance), which is an improvement of 6.6 pp. over the challenge's baseline score.
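The abstract's last step, optimizing a non-differentiable metric with REINFORCE, can be illustrated with a minimal sketch. It is not taken from the report: the function names are hypothetical, `toy_reward` is a simple word-overlap stand-in rather than the real CIDEr metric, and using the reward of a greedily decoded caption as the baseline is an assumption, since the abstract does not say which baseline (if any) was used.

```python
import math


def reinforce_loss(token_log_probs, reward, baseline=0.0):
    """Surrogate loss for one sampled caption.

    Its gradient with respect to the model parameters is the REINFORCE
    estimate: -(reward - baseline) * grad(sum of token log-probabilities).
    """
    advantage = reward - baseline
    return -advantage * sum(token_log_probs)


def toy_reward(candidate, reference):
    """Hypothetical stand-in for CIDEr (NOT the real metric):
    fraction of candidate words that also occur in the reference."""
    words = candidate.split()
    if not words:
        return 0.0
    reference_words = set(reference.split())
    return sum(word in reference_words for word in words) / len(words)


reference = "a dog barks in the distance"
sampled = "a dog barks loudly"   # caption sampled from the model
greedy = "a cat meows"           # greedy caption, used as baseline (assumption)

# Made-up per-token log-probabilities the model assigned to the sampled caption.
sampled_log_probs = [math.log(p) for p in (0.5, 0.4, 0.3, 0.2)]

r_sampled = toy_reward(sampled, reference)
r_greedy = toy_reward(greedy, reference)
loss = reinforce_loss(sampled_log_probs, r_sampled, baseline=r_greedy)
```

Because the sampled caption scores higher than the baseline here, minimizing `loss` raises the log-probability of the sampled tokens; a caption scoring below the baseline would be pushed down instead. In an actual training loop the log-probabilities would come from the captioning model and the loss would be backpropagated with an automatic differentiation framework.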