Multi-Modal Retrieval of Iconclass Codes using Deep Learning

Nikolay Banar, Walter Daelemans and Mike Kestemont

Iconclass is an iconographic classification thesaurus used to analyze the meaning, description and interpretation of artworks. It consists of 28,000 hierarchically ordered codes that correspond to specific subjects depicted in works of art, such as people, events and ideas. The attribution of Iconclass codes is a difficult task that requires advanced interpretive skills and in-depth knowledge of art history. Moreover, the sheer number of codes makes the task extremely challenging for both computers and human experts.

Deep learning has proved to be a state-of-the-art solution to many problems in cultural heritage. In this work, we aim to develop a multi-modal deep learning framework for the attribution of Iconclass codes to visual artworks. In the first part of our work, we investigate the feasibility of this task using a cross-modal retrieval model (SAEM), which consists of two branches processing visual and textual information, and we evaluate the contribution of each branch to Iconclass attribution. Next, we complement the cross-modal retrieval model with an additional layer that combines multiple feature sources: textual features extracted from the multilingual titles of the artworks and visual features extracted from photographic reproductions of the artworks. To evaluate our approach, we utilize a publicly available dataset of artworks with titles in English and Dutch. We demonstrate that the textual features contribute more strongly to performance than the visual features do; however, the visual features complement the linguistic features and improve the final results. In addition, we demonstrate that our model works equally well with Dutch and English. This finding can be important for institutions operating in low-resource languages.
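The fusion layer described above can be sketched as follows. This is a minimal, hypothetical illustration in PyTorch, not the authors' implementation: the feature dimensions, layer sizes, and the use of a single linear layer with cosine-normalized output are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Hypothetical sketch: combine textual features (from artwork
    titles) and visual features (from photographic reproductions)
    with one linear layer into a joint embedding space, in which
    Iconclass codes could then be retrieved by similarity.
    All dimensions are illustrative, not taken from the paper."""

    def __init__(self, text_dim=768, img_dim=2048, joint_dim=512):
        super().__init__()
        self.fuse = nn.Linear(text_dim + img_dim, joint_dim)

    def forward(self, text_feats, img_feats):
        # Concatenate the two feature sources and project them
        joint = self.fuse(torch.cat([text_feats, img_feats], dim=-1))
        # L2-normalize so retrieval can use cosine similarity
        return nn.functional.normalize(joint, dim=-1)

model = MultiModalFusion()
text = torch.randn(4, 768)    # stand-in for title embeddings
image = torch.randn(4, 2048)  # stand-in for image-branch features
emb = model(text, image)
print(emb.shape)  # torch.Size([4, 512])
```

Ranking candidate Iconclass codes would then reduce to a similarity search between these joint embeddings and code embeddings in the same space.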

In the second part of our work, we investigate transfer learning for our model to overcome the limited amount of training data, a common issue in the digital heritage domain. In deep learning, transfer learning consists of fine-tuning the weights of a pretrained network on a downstream dataset. However, our previous model could not be fully fine-tuned because its components were implemented in different deep learning frameworks. We therefore reimplement our model in PyTorch and fine-tune it on the same dataset. Finally, we demonstrate that fine-tuning boosts performance by a large margin. We will make our source code and models publicly available to support further research in the cultural heritage domain.
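The full fine-tuning enabled by the PyTorch reimplementation can be sketched as follows. This is a generic illustration of the technique, not the paper's actual training code: the encoder, head sizes, number of target codes, and learning rate are all assumed for the example.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of full fine-tuning in PyTorch: start from a
# pretrained encoder (a small stand-in module here), attach a new
# prediction head for the target Iconclass codes, and optimize ALL
# parameters end to end, rather than training only the new head.
encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU())  # stand-in for pretrained branches
head = nn.Linear(256, 100)  # 100 illustrative target codes
model = nn.Sequential(encoder, head)

for p in model.parameters():
    p.requires_grad = True  # nothing frozen: full fine-tuning

# A small learning rate is typical when fine-tuning pretrained weights
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# One illustrative gradient step on random stand-in data
x = torch.randn(8, 512)
y = torch.randint(0, 100, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```

Because every parameter has `requires_grad=True` and is passed to the optimizer, gradients flow through the pretrained encoder as well as the new head, which is exactly what a single-framework implementation makes possible.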