SGoaB relies on deep learning to train object detection and image recognition models, supplemented by natural language processing (NLP) to produce new or enriched image captions. One of its goals is to align image content with the descriptive text, based both on conventional object detection and on an analysis of the image’s pictorial semantics. The main approach comprises several steps:
A| Definition of relevant object classes for iconography.
B| At its core, object detection within images will rely on trained CNNs (Convolutional Neural Networks), combining the frequency of object tags in open datasets with knowledge bases (DBpedia, Wikidata, Wikimedia Commons).
C| Image segmentation will rely on Mask R-CNN (Mask Region-based Convolutional Neural Network) models, extended with several improvements that take painting style and the depicted action or motifs into account.
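Step B can be illustrated with a minimal sketch. All the numbers below are hypothetical, as is the fusion rule (a simple weighted geometric mean), which only stands in for whatever combination of detector scores, tag frequencies, and knowledge-base priors the system actually uses:

```python
import numpy as np

# Hypothetical CNN detection scores for candidate object classes.
cnn_scores = {"halo": 0.62, "horse": 0.88, "dragon": 0.55, "car": 0.40}

# Hypothetical priors: relative tag frequency in open datasets, and a
# knowledge-base plausibility score (e.g. "car" is implausible in a
# medieval iconographic scene).
tag_frequency = {"halo": 0.30, "horse": 0.25, "dragon": 0.10, "car": 0.01}
kb_prior      = {"halo": 0.90, "horse": 0.85, "dragon": 0.80, "car": 0.05}

def fuse(label, w_cnn=0.6, w_freq=0.2, w_kb=0.2):
    """Weighted geometric mean of the detector score and the two priors."""
    parts = np.array([cnn_scores[label], tag_frequency[label], kb_prior[label]])
    weights = np.array([w_cnn, w_freq, w_kb])
    return float(np.prod(parts ** weights))

ranked = sorted(cnn_scores, key=fuse, reverse=True)
print(ranked)  # implausible classes such as "car" drop to the bottom
```

The point of the sketch is only that knowledge-base evidence can demote detections that are visually plausible but iconographically unlikely.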
A| Image feature extractor
This is a pre-trained model. The next step takes the features extracted by this model as input.
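As a minimal, self-contained sketch of the role the feature extractor plays (the real system uses a pre-trained CNN; the random filters here only stand in for learned ones):

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(image, filters):
    """Toy CNN stage: valid 2-D convolution with each filter, ReLU,
    then global average pooling down to one number per filter."""
    h, w = image.shape
    k = filters.shape[1]
    features = []
    for f in filters:
        out = np.zeros((h - k + 1, w - k + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i+k, j:j+k] * f)
        features.append(np.maximum(out, 0).mean())  # ReLU + global average pool
    return np.array(features)

image = rng.random((32, 32))                 # stand-in for a painting crop
filters = rng.standard_normal((128, 3, 3))   # 128 random 3x3 filters
vec = extract_features(image, filters)
print(vec.shape)  # (128,) -- fixed length regardless of image size
```

Whatever the input resolution, the pooling step yields a fixed-length vector, which is exactly what the next stage consumes.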
B| Sequence Processor
This is a word embedding layer for handling text input, followed by a Long Short-Term Memory (LSTM) recurrent neural network layer and assisted by a language model to infer depicted objects.
Both the feature extractor and the sequence processor produce a fixed-length vector as output. These are merged and processed by a Dense layer to make the final prediction.
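The merge of the two branches can be sketched as a single forward pass. All shapes and weights below are illustrative placeholders (the actual model is trained end to end, and a mean of embeddings stands in for the LSTM's final state):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

vocab_size, embed_dim, hidden = 1000, 64, 256

# Branch A: fixed-length image feature vector from the extractor.
image_features = rng.random(hidden)

# Branch B: toy sequence processor -- embed the partial caption and
# average the embeddings (standing in for the LSTM's final state).
embedding = rng.standard_normal((vocab_size, embed_dim)) * 0.1
caption_ids = [2, 57, 311]                 # partial caption so far
seq_state = embedding[caption_ids].mean(axis=0)

# Merge: concatenate both fixed-length vectors, then a Dense layer
# predicts a distribution over the next caption word.
merged = np.concatenate([image_features, seq_state])
W = rng.standard_normal((vocab_size, merged.size)) * 0.01
b = np.zeros(vocab_size)
probs = softmax(W @ merged + b)
next_word = int(probs.argmax())
print(merged.shape, probs.shape)
```

Because both branches end in fixed-length vectors, concatenation followed by a Dense layer is all that is needed to couple image content with the caption generated so far.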
Where appropriate, caption generation performance is augmented by an attention mechanism that produces pertinent labels from the selective output of the LSTM network.
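A minimal sketch of such an attention step, assuming simple dot-product attention over per-timestep LSTM outputs (shapes and values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)

def attend(query, keys):
    """Dot-product attention: weight each timestep's LSTM output by its
    relevance to the current decoding state, then take the weighted sum."""
    scores = keys @ query                     # (timesteps,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over timesteps
    context = weights @ keys                  # weighted sum of outputs
    return context, weights

timesteps, hidden = 7, 32
lstm_outputs = rng.standard_normal((timesteps, hidden))  # one vector per step
decoder_state = rng.standard_normal(hidden)              # current query

context, weights = attend(decoder_state, lstm_outputs)
print(context.shape)  # (32,) -- a context vector focused on relevant steps
```

Instead of compressing the whole sequence into one final state, the decoder can thus look back selectively at whichever timesteps are most relevant to the label being produced.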