Hybrid AI for Visual Question Answering on CLEVR

Jens Nevens, Roxana Radulescu, Mathieu Reymond, Paul Van Eecke, Kyriakos Efthymiadis, Katrien Beuls

Research output: Chapter in Book/Report/Conference proceeding > Meeting abstract (Book)

Abstract

Visual Question Answering (VQA) aims to build systems that bring together
three fundamental properties of human intelligence: perception, natural language
understanding and reasoning. In a VQA task, a system receives both an image
and a natural language question as input and is tasked with finding the answer
to that question in the image.

In this demo, we introduce our novel approach to VQA using the CLEVR
dataset (Johnson et al., 2017). This dataset contains artificially generated images
of geometric objects, together with challenging questions that test a variety of
reasoning skills such as counting, spatial relations or logical operations. We
take inspiration from modular neural network approaches, such as Andreas et al.
(2015), and consider two system components: the program composer and the
program execution engine. The former maps the question to a program consisting
of primitive operations. The latter composes and executes these on the image.

The innovative aspect of our approach is the program composer. It is built
using Fluid Construction Grammar (FCG) (Steels, 2017) and uses linguistic
analysis to map a natural language question onto a meaning representation (i.e.
a program) that is directly executable on the image. The meaning representation
is composed of a number of primitive operations, implemented using small
neural networks or modules. The execution engine simply takes this program,
as composed by FCG, and executes it on the image to find the answer. No
additional processing is required.
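
As a rough illustration of what "a program consisting of primitive operations" can look like, the sketch below executes a small CLEVR-style program step by step. It is not the authors' implementation: in the system described above the program is produced by FCG and the primitives are neural modules operating on the image, whereas here the program is written by hand and the primitives are plain Python functions over a hypothetical symbolic scene. All names (PRIMITIVES, execute, SCENE, the operation names) are assumptions made for this example.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical symbolic scene: a stand-in for what perception modules
# would extract from a CLEVR image. Not part of the original system.
Scene = List[Dict[str, str]]
Step = Tuple[str, Optional[str]]  # (primitive name, optional argument)


def filter_attr(attribute: str) -> Callable[[Scene, str], Scene]:
    """Build a primitive that keeps objects whose attribute equals a value."""
    def op(objects: Scene, value: str) -> Scene:
        return [o for o in objects if o[attribute] == value]
    return op


def count(objects: Scene, _: Optional[str]) -> int:
    """Return how many objects are in the current set."""
    return len(objects)


def exist(objects: Scene, _: Optional[str]) -> bool:
    """Return True if the current set is non-empty."""
    return len(objects) > 0


def query(objects: Scene, attribute: str) -> str:
    """Return an attribute of the single object in the current set."""
    if len(objects) != 1:
        raise ValueError("query expects exactly one object")
    return objects[0][attribute]


# Illustrative inventory of primitives (assumed names).
PRIMITIVES: Dict[str, Callable[[Any, Any], Any]] = {
    "filter_color": filter_attr("color"),
    "filter_shape": filter_attr("shape"),
    "filter_size": filter_attr("size"),
    "count": count,
    "exist": exist,
    "query": query,
}


def execute(program: List[Step], scene: Scene) -> Any:
    """Run the primitives in order, feeding each result to the next step."""
    state: Any = scene
    for name, argument in program:
        state = PRIMITIVES[name](state, argument)
    return state


SCENE: Scene = [
    {"shape": "cube", "color": "red", "size": "large"},
    {"shape": "sphere", "color": "red", "size": "small"},
    {"shape": "cube", "color": "blue", "size": "small"},
]

# Hand-written program for "How many red cubes are there?"
HOW_MANY_RED_CUBES: List[Step] = [
    ("filter_color", "red"),
    ("filter_shape", "cube"),
    ("count", None),
]

if __name__ == "__main__":
    print(execute(HOW_MANY_RED_CUBES, SCENE))
```

Representing the program as an ordered list of named steps keeps the execution engine trivial, which mirrors the point made above that no additional processing is needed once the program has been composed.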

Given that FCG is an explainable, white-box system, participants will be able
to see how the linguistic analysis leads to an executable meaning representation.
Furthermore, we demonstrate how these operations are executed on images to
find the answer. Finally, we showcase the bidirectional processing capabilities of
FCG by generating many different questions starting from the same meaning
representation. The online, interactive demonstration of this system can be
found at http://fcg-net.org/demos/clevr-grammar.

Our goal is to build general, open-ended and interpretable systems for natural
language understanding and reasoning, that incorporate not only visual input but
also world knowledge. Such systems should actively recombine acquired skills,
in the form of modules, to solve unseen tasks. We take inspiration from existing
modular approaches but extend them towards a novel hybrid approach. In this
approach, we bring together symbolic and sub-symbolic modules, combining
their strengths. While sub-symbolic modules are good at handling complex data,
such as images, symbolic modules excel at higher-level reasoning tasks such as
planning.

We see many opportunities for such systems in a range of application areas.
Examples include intelligent security assistants that safeguard security prescriptions
in dangerous working environments, drone-based maintenance and inspection of
hard-to-reach places such as wind turbines, and complex search operations over
large archives of visual data used by broadcasting companies.

Original language: English
Title of host publication: BNAIC 2018 Preproceedings
Place of Publication: 's-Hertogenbosch
Pages: 171-172
Number of pages: 2
Publication status: Published - 8 Nov 2018
Event: 30th Benelux Conference on Artificial Intelligence - 's-Hertogenbosch, Netherlands
Duration: 8 Nov 2018 - 9 Nov 2018
https://bnaic2018.nl

Publication series

Name: Belgian/Netherlands Artificial Intelligence Conference
ISSN (Print): 1568-7805

Conference

Conference: 30th Benelux Conference on Artificial Intelligence
Abbreviated title: BNAIC 2018
Country/Territory: Netherlands
City: 's-Hertogenbosch
Period: 8/11/18 - 9/11/18

Keywords

  • artificial intelligence
  • visual question answering
  • hybrid AI
  • natural language understanding
