Explainable visual question answering using procedural semantic representations and human-interpretable concepts

Liesbet De Vos, Jens Nevens, Paul Van Eecke and Katrien Beuls

The state of the art in visual question answering is dominated by deep neural network models that map image-related questions directly to their answers. While these models often achieve high accuracy, they suffer from a number of important shortcomings: they exploit statistical biases learnt from huge amounts of training data, are not transparent and do not generalize well to out-of-distribution data. Here, we present an alternative approach based on procedural semantic representations and human-interpretable concepts. We validate our approach on the CLEVR visual question answering benchmark (Johnson et al. 2017) and demonstrate the model’s transparency, explainability and data-efficiency on both the conceptual and the linguistic level.

As an alternative to performing a direct mapping from a question to its answer, we divide the process into two steps. The first step maps a natural language question to its semantic representation, which takes the form of an executable query. The second step executes this query on a given image in order to retrieve the answer to the question. Importantly, both steps are performed by fully transparent models. The mapping from questions to queries is performed by an off-the-shelf computational construction grammar that achieves 100% accuracy on the CLEVR dataset (Nevens et al. 2019). The resulting queries are executed by an inventory of symbolic primitive operations that implement the different components of the query, in particular set operations such as filtering, querying and counting. The filtering and querying operations rely on concepts, such as cube, left, red, large and metal, which are in our case represented using fully transparent, multi-dimensional representations that were learnt in a data-efficient manner through situated communicative interactions (Nevens et al. 2020). The evaluation results show that we achieve 81.9% accuracy on the CLEVR benchmark dataset while keeping all aspects of the model fully explainable in human-interpretable categories. While both the system for representing and learning concepts and the grammar for mapping between questions and their procedural semantic representations were developed previously, we will present at CLIN for the first time how they can be integrated into a fully human-interpretable visual question answering system.
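
To make the second step more concrete, the following is a minimal sketch of how a query could be executed as a sequence of symbolic primitive operations over a scene. It is an illustration under stated assumptions and not the system described here: all names (SceneObject, CONCEPTS, filter_set, unique, query, count, execute) are hypothetical, the query is assumed to be a flat list of operations, and each concept is stood in for by a simple attribute check, whereas the approach above uses learnt multi-dimensional concept representations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical symbolic description of one object in an image."""
    shape: str
    color: str
    size: str
    material: str


# Assumption: a concept is modelled as a predicate over objects.
# In the described system, this check would be carried out by a learnt,
# multi-dimensional concept representation instead.
Concept = Callable[[SceneObject], bool]


def attribute_concept(attribute: str, value: str) -> Concept:
    """Build a stand-in concept that checks one symbolic attribute."""
    return lambda obj: getattr(obj, attribute) == value


CONCEPTS: Dict[str, Concept] = {
    "cube": attribute_concept("shape", "cube"),
    "red": attribute_concept("color", "red"),
    "large": attribute_concept("size", "large"),
    "metal": attribute_concept("material", "metal"),
}


def filter_set(objects: List[SceneObject], concept: str) -> List[SceneObject]:
    """Keep only the objects to which the named concept applies."""
    return [obj for obj in objects if CONCEPTS[concept](obj)]


def unique(objects: List[SceneObject]) -> SceneObject:
    """Return the single object in the set, or fail if it is not unique."""
    if len(objects) != 1:
        raise ValueError(f"expected exactly one object, got {len(objects)}")
    return objects[0]


def query(obj: SceneObject, attribute: str) -> str:
    """Look up the value of an attribute of a single object."""
    return getattr(obj, attribute)


def count(objects: List[SceneObject]) -> int:
    """Count the objects in a set."""
    return len(objects)


def execute(program: List[Tuple[str, Optional[str]]], scene: List[SceneObject]):
    """Run the operations in order, feeding each result into the next one."""
    state = list(scene)
    for operation, argument in program:
        if operation == "filter":
            state = filter_set(state, argument)
        elif operation == "unique":
            state = unique(state)
        elif operation == "query":
            state = query(state, argument)
        elif operation == "count":
            state = count(state)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return state


scene = [
    SceneObject("cube", "red", "large", "metal"),
    SceneObject("sphere", "blue", "small", "rubber"),
    SceneObject("cube", "red", "small", "rubber"),
]

# "How many red cubes are there?"
how_many_red_cubes = [("filter", "red"), ("filter", "cube"), ("count", None)]

# "What material is the large cube?"
material_of_large_cube = [
    ("filter", "large"), ("filter", "cube"), ("unique", None), ("query", "material"),
]

print(execute(how_many_red_cubes, scene))
print(execute(material_of_large_cube, scene))
```

Because every intermediate result in such a pipeline is a set of objects, a single object or a value, each step of the computation can be inspected, which is the kind of transparency the approach aims for.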


Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C. L., & Girshick, R. (2017). CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2901-2910).

Nevens, J., Van Eecke, P., & Beuls, K. (2019). Computational construction grammar for visual question answering. Linguistics Vanguard, 5(1).

Nevens, J., Van Eecke, P., & Beuls, K. (2020). From continuous observations to symbolic concepts: A discrimination-based strategy for grounded concept learning. Frontiers in Robotics and AI, 7, 84.