Linear Alignment of Vision-language Models for Image Captioning
Title language:
English
Original abstract:
Recently, vision-language models like CLIP have advanced the state of the art in a variety of multi-modal tasks including image captioning and caption evaluation. Many approaches adapt CLIP-style models to a downstream task by training a mapping network between CLIP and a language model. This is costly as it usually involves calculating gradients for large models. We propose a more efficient training protocol that fits a linear mapping between image and text embeddings of CLIP via a closed-form solution. This bypasses the need for gradient computation and results in a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods. Moreover, we propose two new learning-based image-captioning metrics that build on CLIPScore along with our linear mapping. Furthermore, we combine ReCap with our new metrics to design an iterative datastore-augmentation loop (DAL) based on synthetic captions. We evaluate ReCap on MS-COCO, Flickr30k, VizWiz, and MSRVTT. ReCap achieves performance comparable to state-of-the-art lightweight methods on established metrics while outperforming them on our new metrics, which are better aligned with human ratings on Flickr8k-Expert and Flickr8k-Crowdflower. Finally, we demonstrate that ReCap transfers well to other domains and that our DAL leads to a performance boost.
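To illustrate the kind of gradient-free, closed-form fit the abstract describes, the following is a minimal sketch in which a linear map between paired CLIP image and text embeddings is obtained via ridge regression. The function name, regularization strength, and random stand-in embeddings are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def fit_linear_mapping(image_emb: np.ndarray, text_emb: np.ndarray, lam: float = 1e-3) -> np.ndarray:
    """Fit a linear map W from image- to text-embedding space via ridge regression.

    image_emb: (n, d) CLIP image embeddings; text_emb: (n, d) paired CLIP text embeddings.
    Closed-form solution W = (X^T X + lam * I)^{-1} X^T Y -- no gradient computation needed.
    """
    d = image_emb.shape[1]
    gram = image_emb.T @ image_emb + lam * np.eye(d)
    return np.linalg.solve(gram, image_emb.T @ text_emb)

# Usage with random stand-in embeddings (real inputs would come from a CLIP encoder).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))   # image embeddings
Y = rng.normal(size=(1000, 512))   # paired caption embeddings
W = fit_linear_mapping(X, Y)
projected = X @ W                  # image embeddings mapped into text-embedding space
```

Because the solution is a single linear solve over the embedding dimension, fitting avoids backpropagation through either CLIP or the language model, which is what makes this style of adaptation so much cheaper than training a mapping network.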