Abstract
Visual Question Answering (VQA) aims to build systems that bring together
three fundamental properties of human intelligence: perception, natural language
understanding and reasoning. In a VQA task, a system receives both an image
and a natural language question as input and is tasked with finding the answer
to that question in the image.
In this demo, we introduce our novel approach to VQA using the CLEVR
dataset (Johnson et al., 2017). This dataset contains artificially generated images
of geometric objects, together with challenging questions that test a variety of
reasoning skills such as counting, spatial relations or logical operations. We
take inspiration from modular neural network approaches, such as Andreas et al.
(2015), and consider two system components: the program composer and the
program execution engine. The former maps the question to a program consisting
of primitive operations. The latter composes and executes these on the image.
The innovative aspect of our approach is the program composer. It is built
using Fluid Construction Grammar (FCG) (Steels, 2017) and uses linguistic
analysis to map a natural language question onto a meaning representation (i.e.
a program) that is directly executable on the image. The meaning representation
is composed of a number of primitive operations, implemented as small
neural networks or modules. The execution engine simply takes this program,
as composed by FCG, and executes this on the image to find the answer. No
additional processing is required.
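To make the two-component pipeline concrete, here is a minimal sketch (Python, not part of the demo itself) of a program of primitive operations being executed step by step. The operation names (get_context, filter_color, count) follow the CLEVR functional-program vocabulary; the symbolic scene, the PRIMITIVES table and the execute function are illustrative assumptions. In the actual system the primitives are small neural modules applied to the image, and the program is composed by FCG rather than written by hand.

```python
# Toy sketch: executing a composed program of CLEVR-style primitive
# operations. The real system applies neural modules to image features;
# here a hand-annotated symbolic scene stands in for the image.

SCENE = [  # hypothetical annotations of one CLEVR image
    {"shape": "cube", "color": "red", "size": "large"},
    {"shape": "sphere", "color": "red", "size": "small"},
    {"shape": "cylinder", "color": "blue", "size": "large"},
]

# Each primitive maps a set of objects (plus an optional argument)
# to a new set of objects or to a final answer value.
PRIMITIVES = {
    "get_context": lambda objs, arg: SCENE,
    "filter_color": lambda objs, arg: [o for o in objs if o["color"] == arg],
    "filter_shape": lambda objs, arg: [o for o in objs if o["shape"] == arg],
    "count": lambda objs, arg: len(objs),
    "query_shape": lambda objs, arg: objs[0]["shape"] if objs else None,
}

def execute(program):
    """Run a linear program of (operation, argument) pairs,
    threading each intermediate result into the next operation."""
    result = None
    for op, arg in program:
        result = PRIMITIVES[op](result, arg)
    return result

# A program the composer might produce for "How many red things are there?"
program = [("get_context", None), ("filter_color", "red"), ("count", None)]
print(execute(program))  # -> 2
```

The key design point is that the execution engine knows nothing about language: it only walks the program, so any question the composer can map to these primitives is answerable without further processing.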
Given that FCG is an explainable, white-box system, participants will be able
to see how the linguistic analysis leads to an executable meaning representation.
Furthermore, we demonstrate how these operations are executed on images to
find the answer. Finally, we showcase the bidirectional processing capabilities of
FCG by generating many different questions starting from the same meaning
representation. The online, interactive demonstration of this system can be
found at http://fcg-net.org/demos/clevr-grammar.
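The sketch below illustrates only the effect of this bidirectionality: one meaning representation yielding several surface questions. It is a deliberately crude approximation; FCG achieves this by applying the same grammar constructions in both comprehension and production, not with the hypothetical paraphrase templates used here.

```python
# Toy illustration of bidirectional processing: one meaning
# representation, many questions. The templates are illustrative
# stand-ins for FCG constructions.

meaning = [("get_context", None), ("filter_color", "red"), ("count", None)]

# Hypothetical paraphrases for a count-over-color-filter meaning.
TEMPLATES = [
    "How many {color} things are there?",
    "What number of {color} objects are there?",
    "Count the {color} things.",
]

def formulate(program):
    """Produce all questions that express the given program."""
    color = next(arg for op, arg in program if op == "filter_color")
    return [t.format(color=color) for t in TEMPLATES]

for question in formulate(meaning):
    print(question)
```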
Our goal is to build general, open-ended and interpretable systems for natural
language understanding and reasoning that incorporate not only visual input but
also world knowledge. Such systems should actively recombine acquired skills,
in the form of modules, to solve unseen tasks. We take inspiration from existing
modular approaches but extend them towards a novel hybrid approach. In this
approach, we bring together symbolic and sub-symbolic modules, combining
their strengths. While sub-symbolic modules are good at handling complex data,
such as images, symbolic modules excel at higher-level reasoning tasks such as
planning.
We see many opportunities for such systems in a range of application areas.
Examples include intelligent security assistants that safeguard compliance with safety
prescriptions in dangerous working environments, the maintenance and inspection of
hard-to-reach places such as wind turbines using drones, and complex search
operations over large archives of visual data used by broadcasting companies.
| Original language | English |
| --- | --- |
| Title of host publication | BNAIC 2018 Preproceedings |
| Place of Publication | ’s-Hertogenbosch |
| Pages | 171-172 |
| Number of pages | 2 |
| Publication status | Published - 8 Nov 2018 |
| Event | 30th Benelux Conference on Artificial Intelligence, ’s-Hertogenbosch, Netherlands. Duration: 8 Nov 2018 → 9 Nov 2018. https://bnaic2018.nl |
Publication series
| Name | Belgian/Netherlands Artificial Intelligence Conference |
| --- | --- |
| ISSN (Print) | 1568-7805 |
Conference
| Conference | 30th Benelux Conference on Artificial Intelligence |
| --- | --- |
| Abbreviated title | BNAIC 2018 |
| Country/Territory | Netherlands |
| City | ’s-Hertogenbosch |
| Period | 8/11/18 → 9/11/18 |
| Internet address | https://bnaic2018.nl |
Keywords
- artificial intelligence
- visual question answering
- hybrid AI
- natural language understanding
Projects
2 Finished

- FWOSB64: Hybrid AI for mapping between natural language utterances and their executable meanings
  Nevens, J., Beuls, K. & Nowé, A.
  1/01/19 → 31/12/22
  Project: Fundamental

- FWOAL785: Artificial Language Understanding in Robots (ATLANTIS)
  1/12/15 → 30/11/18
  Project: Fundamental