Visual Question Answering (VQA) aims to build systems that bring together
three fundamental properties of human intelligence: perception, natural language
understanding and reasoning. In a VQA task, a system receives both an image
and a natural language question as input and is tasked with finding the answer
to that question in the image.

In this demo, we introduce our novel approach to VQA using the CLEVR
dataset (Johnson et al., 2017). This dataset contains artificially generated images
of geometric objects, together with challenging questions that test a variety of
reasoning skills such as counting, spatial relations or logical operations. We
take inspiration from modular neural network approaches, such as Andreas et al.
(2015), and consider two system components: the program composer and the
program execution engine. The former maps the question to a program consisting
of primitive operations. The latter assembles these operations and executes them on the image.

The innovative aspect of our approach is the program composer. It is built
using Fluid Construction Grammar (FCG) (Steels, 2017) and uses linguistic
analysis to map a natural language question onto a meaning representation (i.e.
a program) that is directly executable on the image. The meaning representation
is composed of a number of primitive operations, implemented as small neural
networks or modules. The execution engine simply takes this program, as
composed by FCG, and executes it on the image to find the answer. No
additional processing is required.
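As a rough illustration of the pipeline described above, the sketch below executes a CLEVR-style program on a toy symbolic scene. All names here (`SCENE`, `PRIMITIVES`, `execute`) and the program format are invented for this example; in the actual system the composer is implemented in FCG and the primitive operations are small neural modules operating on image features rather than Python functions over symbols.

```python
# Hypothetical sketch of the composer / execution-engine pipeline.
# A toy symbolic scene stands in for the visual input.
SCENE = [
    {"shape": "cube",     "color": "red",  "size": "large"},
    {"shape": "sphere",   "color": "blue", "size": "small"},
    {"shape": "cylinder", "color": "red",  "size": "small"},
]

# Primitive operations; in the demo these are small neural networks.
PRIMITIVES = {
    "scene":        lambda _, objs: objs,
    "filter_color": lambda arg, objs: [o for o in objs if o["color"] == arg],
    "filter_shape": lambda arg, objs: [o for o in objs if o["shape"] == arg],
    "count":        lambda _, objs: len(objs),
}

def execute(program, scene):
    """Run a composed program (a list of (operation, argument) pairs) on a scene."""
    result = scene
    for op, arg in program:
        result = PRIMITIVES[op](arg, result)
    return result

# A program the composer might produce for "How many red things are there?"
program = [("scene", None), ("filter_color", "red"), ("count", None)]
print(execute(program, SCENE))  # → 2
```

The point of the list-of-primitives format is that the composer and the execution engine only need to agree on the operation vocabulary, not on each other's internals.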

Given that FCG is an explainable, white-box system, participants will be able
to see how the linguistic analysis leads to an executable meaning representation.
Furthermore, we demonstrate how these operations are executed on images to
find the answer. Finally, we showcase the bidirectional processing capabilities of
FCG by generating many different questions starting from the same meaning
representation. The online, interactive demonstration of this system can be
found at

Our goal is to build general, open-ended and interpretable systems for natural
language understanding and reasoning that incorporate not only visual input but
also world knowledge. Such systems should actively recombine acquired skills,
in the form of modules, to solve unseen tasks. We take inspiration from existing
modular approaches but extend them towards a novel hybrid approach. In this
approach, we bring together symbolic and sub-symbolic modules, combining
their strengths. While sub-symbolic modules are good at handling complex data,
such as images, symbolic modules excel at higher-level reasoning tasks.

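The division of labour in such a hybrid system can be sketched as follows. The toy "perceptual" classifier below stands in for a sub-symbolic module, and the reasoning step for a symbolic one; both functions and their names are invented for this illustration.

```python
def perceive_color(pixel):
    # Stand-in for a sub-symbolic module: maps raw (R, G, B) input to a symbol.
    r, g, b = pixel
    if r > max(g, b):
        return "red"
    if b > max(r, g):
        return "blue"
    return "other"

def all_same_color(symbols):
    # Symbolic module: higher-level reasoning over the extracted symbols.
    return len(set(symbols)) == 1

# Hybrid pipeline: perception grounds symbols, reasoning operates on them.
pixels = [(200, 10, 10), (180, 20, 30)]
symbols = [perceive_color(p) for p in pixels]
print(all_same_color(symbols))  # → True
```

The sub-symbolic side absorbs the noise and variability of the raw input, so the symbolic side can reason over a small, clean vocabulary of symbols.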
We see many opportunities for such systems in a range of application areas.
Examples include intelligent safety assistants that enforce safety regulations
in hazardous working environments, drone-based maintenance and inspection of
hard-to-reach places such as wind turbines, and complex search operations over
the large archives of visual data held by broadcasting companies.
Original language: English
Title: BNAIC 2018 Preproceedings
Place of production: 's-Hertogenbosch
Number of pages: 2
Status: Published - 8 Nov 2018
Event: 30th Benelux Conference on Artificial Intelligence, 's-Hertogenbosch, Netherlands
Duration: 8 Nov 2018 - 9 Nov 2018

Publication series
Name: Belgian/Netherlands Artificial Intelligence Conference
ISSN (print): 1568-7805

Conference: 30th Benelux Conference on Artificial Intelligence
Abbreviated title: BNAIC 2018


Title of contribution: Hybrid AI for Visual Question Answering on CLEVR
