Thanks to recent advances, modern AI systems perform extremely well on some aspects of human intelligence, such as perception. Other aspects, such as natural language understanding and reasoning, remain more difficult for computational systems. In this project, we propose a system that brings together these three aspects: perception, natural language understanding, and reasoning. The system allows a user to ask questions, in natural language, about an image. Furthermore, the answer to a question need not be visible in the image: the system can also reason about additional knowledge related to the objects in the image.
As opposed to other approaches, we explicitly design our system to be explainable. That way, the system can explain how it found an answer to the question or why it could not find one. We argue that this is a crucial aspect of any AI system that interacts with human users.
The system we propose will not only push forward the state of the art in AI but can also be used in a wide variety of real-world applications. It can be used in any situation where something needs to be found in a large collection of visual material, simply by posing a question in natural language. This can be useful for surveillance, e.g. to find a suspect in a database of security footage; for media companies, e.g. to find a famous person in a large archive; or for maintenance, e.g. to inspect (parts of) wind turbines photographed by drones.