Computational Construction Grammar and Procedural Semantics for Visual Dialog

Research output: Unpublished contribution to conference › Unpublished abstract

Abstract

Visual Dialog is the task in which a robotic agent holds a meaningful and coherent conversation with a human interlocutor, answering questions about the scene that it perceives (Das et al. 2017). A robotic agent must master a range of cognitive functions in order to participate in such multi-turn conversations. First, it must be able to process the visual input and determine the properties of the objects and events that it perceives. Second, it must be able to understand the precise meaning of questions formulated in natural language. Finally, it must be able to reason about information that was conveyed in earlier dialog turns, for example to resolve co-references. Current approaches to visual dialog fall short especially when it comes to this third capability (Kottur et al. 2018). Here, we present a novel methodology that overcomes this problem by (i) keeping track of relevant information from earlier turns (e.g. the topic of the turn and the entities that were mentioned) in a 'conversation memory', and (ii) designing a procedural semantic representation that effectively integrates this conversation memory into the agent's reasoning process. The methodology builds on the computational construction grammar approach to visual question answering introduced by Nevens et al. (2019), in which utterances are mapped onto an executable meaning representation using a computational construction grammar. This meaning representation is expressed in procedural semantics: the meaning of an utterance is represented as a network of cognitive operations that the agent needs to execute in order to find the answer (Winograd 1971; Johnson-Laird 1977). We extend the agent's inventory of cognitive operations with operations that add information to the conversation memory or retrieve information that was previously added. We have evaluated the methodology on the MNIST Dialog benchmark dataset (Seo et al. 2017) and the more challenging CLEVR Dialog dataset (Kottur et al. 2018). Using the symbolic annotations of the images, we achieve question-level accuracies of 100% and 99.99%, respectively. These results confirm that our methodology, based on computational construction grammar, procedural semantics, and the newly introduced conversation memory, is an excellent candidate for the natural language understanding component of visual dialog systems.
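
To make the approach concrete, the sketch below shows how a conversation memory could interact with a network of cognitive operations across two dialog turns. It is a minimal Python illustration under assumed names and data structures (ConversationMemory, filter_objects, the scene format are all hypothetical), not the authors' implementation, which builds such operation networks with a computational construction grammar.

# Minimal sketch, not the authors' implementation: procedural semantics
# with a conversation memory. All names and data structures are assumptions
# made for illustration.

class ConversationMemory:
    """Keeps track of relevant information from earlier dialog turns."""

    def __init__(self):
        self.turns = []  # one record per completed turn

    def add_turn(self, topic, entities):
        """Cognitive operation: store the topic and the mentioned
        entities of the current turn."""
        self.turns.append({"topic": topic, "entities": entities})

    def retrieve_entities(self):
        """Cognitive operation: retrieve the entities mentioned in the
        previous turn, e.g. to resolve a co-reference such as 'them'."""
        return self.turns[-1]["entities"] if self.turns else []


# Cognitive operations over a symbolic scene annotation (CLEVR-style).
def filter_objects(objects, attribute, value):
    return [obj for obj in objects if obj[attribute] == value]


def count_objects(objects):
    return len(objects)


scene = [
    {"shape": "cube", "color": "red"},
    {"shape": "sphere", "color": "red"},
    {"shape": "cube", "color": "blue"},
]
memory = ConversationMemory()

# Turn 1: "How many red objects are there?"
# Operation network: filter(color=red) -> count
red = filter_objects(scene, "color", "red")
print(count_objects(red))  # -> 2
memory.add_turn(topic=("color", "red"), entities=red)

# Turn 2: "How many of them are cubes?"
# Operation network: retrieve-from-memory -> filter(shape=cube) -> count
antecedent = memory.retrieve_entities()  # resolves 'them' to the red objects
cubes = filter_objects(antecedent, "shape", "cube")
print(count_objects(cubes))  # -> 1

In turn 2, the memory-retrieval operation is what resolves the co-reference 'them'; this is the kind of operation that the methodology adds to the agent's inventory.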
Original language: English
Publication status: Published - 16 Oct 2020
Event: Linguists’ Day – Taaldag – Journée Linguistique LSB 2020 - University of Namur, Namur, Belgium
Duration: 15 May 2020 → …

Conference

Conference: Linguists’ Day – Taaldag – Journée Linguistique LSB 2020
Country: Belgium
City: Namur
Period: 15/05/20 → …
