RESEARCH | Artificial Intelligence

SGoaB relies on deep learning to train object detection and image recognition models, supplemented by natural language processing (NLP) to produce new or enriched image captions. One of its goals is to align image content with the descriptive text, drawing on both conventional object detection and analysis of the image’s pictorial semantics. The main approach includes several steps:

...................................................

>> 1. Object detection

A| Definition of relevant object classes for iconography.

B| At its core, object detection within images will rely on a trained Convolutional Neural Network (CNN), but it will also incorporate semantic information to rule out anachronistic objects and, where possible, refine object labels.
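
The semantic filtering described in step B can be pictured as a post-processing pass over the raw detector output. The sketch below is only an illustration under stated assumptions, not the project's implementation: the `Detection` type, the `EARLIEST_YEAR` table, and the year values in it are hypothetical placeholders, and the CNN detector is assumed to have already produced labels and confidence scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str    # class name predicted by the CNN detector
    score: float  # detector confidence in [0, 1]


# Hypothetical knowledge base: earliest year in which an object class could
# plausibly be depicted. The values are placeholders for illustration only.
EARLIEST_YEAR = {
    "sword": -3000,
    "book": 100,
    "wristwatch": 1900,
}


def filter_anachronisms(detections, artwork_year, min_score=0.5):
    """Keep detections that are confident enough and plausible for the date."""
    kept = []
    for det in detections:
        if det.score < min_score:
            continue  # too uncertain to report
        earliest = EARLIEST_YEAR.get(det.label)
        if earliest is not None and earliest > artwork_year:
            continue  # the object class postdates the artwork
        kept.append(det)  # labels missing from the table are kept unchanged
    return kept


raw = [
    Detection("sword", 0.91),
    Detection("wristwatch", 0.62),
    Detection("book", 0.40),
]
plausible = filter_anachronisms(raw, artwork_year=1450)
print([d.label for d in plausible])
```

Here the wristwatch is dropped because it postdates the assumed artwork year, and the book is dropped for low confidence. A real system would presumably draw such constraints from an iconographic knowledge source rather than a hard-coded table.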

>> 2. Caption generation

A| Image feature extractor

This is a pre-trained model. The features it extracts from each image serve as input to the next step. We will also experiment with attention-based caption generation, using our own trained model as the encoder.

B| Sequence Processor

This is a word embedding layer for handling text input. It may be followed by a Long Short-Term Memory (LSTM) recurrent neural network layer and assisted by a language model to infer depicted objects.
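
To make the role of the word embedding layer concrete, the following sketch shows how a caption might be turned into a padded sequence of word indices and then into a sequence of vectors. It is a minimal stand-in written in plain Python: the helper names (`build_vocab`, `encode`, `make_embedding`, `embed`), the example caption, and the randomly initialised table are all assumptions for illustration. In practice the embedding would be learned during training, and the optional LSTM layer is not shown.

```python
import random


def build_vocab(captions):
    """Map each word to an integer index; index 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for caption in captions:
        for word in caption.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(caption, vocab, max_len):
    """Convert a caption to a fixed-length list of word indices."""
    ids = [vocab[w] for w in caption.lower().split() if w in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))  # right-pad with the padding index


def make_embedding(vocab_size, dim, seed=0):
    """Create an untrained embedding table with small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(ids, table):
    """Look up one embedding vector per word index."""
    return [table[i] for i in ids]


vocab = build_vocab(["saint holding a book"])  # hypothetical training caption
ids = encode("a saint", vocab, max_len=4)
table = make_embedding(len(vocab), dim=8)
vectors = embed(ids, table)
print(ids, len(vectors), len(vectors[0]))
```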

C| Decoder

Both the feature extractor and the sequence processor output a fixed-length vector. The two vectors are merged and passed through a Dense layer to make the final prediction. The decoder may also be implemented as a Recurrent Neural Network (RNN), such as a GRU or LSTM, with an attention mechanism.
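
The merge step can be sketched in a few lines of plain Python. This is an illustrative toy under explicit assumptions: the two vectors are merged by element-wise addition (concatenation would be an equally valid reading of "merged"), the weights are random rather than trained, and the function names, vector sizes, and vocabulary size are hypothetical. The RNN-with-attention variant is not shown.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_next_word(image_vec, text_vec, weights, bias):
    """Merge the two fixed-length vectors and score every vocabulary word."""
    if len(image_vec) != len(text_vec):
        raise ValueError("both inputs must have the same length to be added")
    merged = [a + b for a, b in zip(image_vec, text_vec)]  # assumed merge: addition
    return softmax(dense(merged, weights, bias))


rng = random.Random(0)
dim, vocab_size = 6, 5  # illustrative sizes only
image_vec = [rng.uniform(-1, 1) for _ in range(dim)]  # stand-in for image features
text_vec = [rng.uniform(-1, 1) for _ in range(dim)]   # stand-in for the encoded caption prefix
weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
bias = [0.0] * vocab_size

probs = predict_next_word(image_vec, text_vec, weights, bias)
next_word_id = max(range(vocab_size), key=probs.__getitem__)
print(next_word_id, round(sum(probs), 6))
```

At inference time a caption would be produced by repeating this prediction, appending the chosen word to the caption prefix, and re-encoding until an end token or a length limit is reached.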

...................................................