NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › Research

42 Citations (Scopus)
82 Downloads (Pure)

Abstract

Natural language explanation (NLE) models aim to explain the decision-making process of a black-box system by generating natural language sentences that are human-friendly, high-level and fine-grained. Current NLE models explain the decision-making process of a vision or vision-language model (a.k.a. task model), e.g., a VQA model, via a language model (a.k.a. explanation model), e.g., GPT. Beyond the additional memory and inference time required by the task model, the task and explanation models are completely independent, which dissociates the explanation from the reasoning process used to predict the answer. We introduce NLX-GPT, a general, compact and faithful language model that can simultaneously predict an answer and explain it. We first pre-train on large-scale image-caption data for general image understanding, and then formulate the answer as a text prediction task along with the explanation. Without region proposals or a task model, the resulting framework attains better evaluation scores, contains far fewer parameters and is 15× faster than the current state-of-the-art model. We then address the problem of evaluating explanations, which are often generic, data-biased and can come in several forms. We therefore design two new evaluation measures: (1) explain-predict and (2) retrieval-based attack, a self-evaluation framework that requires no labels. Code is at: https://github.com/fawazsammani/nlxgpt.
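As a rough illustration of what "formulating the answer as a text prediction task along with the explanation" could look like, the sketch below joins an answer and its explanation into one target string and splits a generated string back into the two parts. The template wording ("the answer is ... because ...") and the names `build_target` and `parse_generation` are assumptions made for this example; they are not taken from the abstract and may differ from the released code.

```python
from typing import NamedTuple

# Hypothetical template pieces (assumed for illustration only).
ANSWER_PREFIX = "the answer is "
EXPLANATION_SEP = " because "


class Prediction(NamedTuple):
    answer: str
    explanation: str


def build_target(answer: str, explanation: str) -> str:
    """Join answer and explanation into one sequence for a language model to predict."""
    return f"{ANSWER_PREFIX}{answer.strip()}{EXPLANATION_SEP}{explanation.strip()}"


def parse_generation(text: str) -> Prediction:
    """Split a generated sequence back into its answer and explanation parts."""
    body = text.strip()
    if body.startswith(ANSWER_PREFIX):
        body = body[len(ANSWER_PREFIX):]
    # partition() splits on the first separator; sep is "" when none is found.
    answer, sep, explanation = body.partition(EXPLANATION_SEP)
    return Prediction(answer.strip(), explanation.strip() if sep else "")
```

Because the answer and the explanation sit in the same generated sequence, a single model produces both, which is the property the abstract contrasts with separate task and explanation models.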
Original language: English
Title of host publication: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Publisher: IEEE
Pages: 8322-8332
Number of pages: 11
ISBN (Electronic): 978-1-6654-6946-3
ISBN (Print): 978-1-6654-6947-0
DOIs
Publication status: Published - Jun 2022
Event: 2022 Conference on Computer Vision and Pattern Recognition - New Orleans, United States
Duration: 19 Jun 2022 to 24 Jun 2022

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2022-June
ISSN (Print): 1063-6919

Conference

Conference: 2022 Conference on Computer Vision and Pattern Recognition
Country/Territory: United States
City: New Orleans
Period: 19/06/22 to 24/06/22

Bibliographical note

Funding Information:
Acknowledgement: This research has been supported by the Research Foundation - Flanders (FWO) under the Project G0A4720N.

Publisher Copyright:
© 2022 IEEE.

Copyright:
Copyright 2022 Elsevier B.V., All rights reserved.
