Toward scene narratives from object detection and image semantics

Mon, 08/02/2021 - 12:00

Part of the challenge inherent in building expert systems for automatic image captioning is that, although multiple objects can be successfully detected and labeled in an image (e.g. by Mask R-CNN), learning the relationships between detected objects in an image is not trivial. The Saint George on a Bike project’s context being that of sacred iconography between the 14th and end of the 18th centuries, such a task would correspond to Panofsky’s second level of interpretation of cultural heritage imagery [1]. An example of that would be for the image beholder to rightfully conclude that 13 men having supper with bread and wine (primary level of interpretation) represent the figure of Jesus Christ flanked by his 12 apostles in “The Last Supper” before his crucifixion in Jerusalem (secondary level of interpretation) as described in the New Testament of the Christian Bible.

Taken together as part of an image, detected objects are decorated by bounding boxes (BBx). BBx are bound together by visual relationships that carry meaning whether in an explicit figurative way or in a symbolic way. Humans are generally capable of detecting such visual relationships instantaneously, because they correspond to scene representations that to a large extent mirror their everyday perception of the world and their experience of the physical and visual interactions that take place in it. Machines however, are not so gifted. One may think of at least two ways to address that difficulty:

(i) a brute force approach would consist in training a deep learning model based on a more massive datasets made of millions of images. Every image would need to be decorated with BBx around extracted objects, their visual relationships correctly labeled, and the represented scene(s) correctly identified and labeled. The computational cost of such endeavor can rapidly become exorbitant.

(ii) implementation of specific heuristics so learning machines may rely on a rule-based algorithmic interpretation of pictorial syntax in an image. The ultimate objective would be to derive meaning from such things as the relative positions of BBx around detected objects, their relative sizes, and the nature of the detected objects' classes. Although dramatically cheaper, one can expect any rule based BBx visual relationship heuristic to be rather rigid and for that reason alone somewhat limited in its reach.

In this post, based on an original algorithm developed at BSC, we outline how the second approach may be applied to extract visual relationship information from the geometry and graphical distribution of bounding boxes in imagery. Visual relationships between multiple objects depicted in paintings usually point to:

- the dominant object (by size and position) as (at least one of) the image’s principal topics,

- overlapping objects (one object partially obfuscating another) as two pictorial entities possibly engaged in either co-presence or co-action in a limited space (a flock of sheep, a group of people fighting in a battle, a grove, a hamlet, a person riding a horse, etc),

- physically interacting objects, where utilization of one object by another is depicted: e.g. “knight [with] halo, [wielding] spear, [riding] horse” (4 somewhat overlapping BBxes), “women [seated on a] throne [wearing] crown, [holding] baby [on her lap]” (4 overlapping BBx),

- visually interacting objects: “upward-looking-man [with] aureola, [on] cross, [with] city [landscape in background]” (4 somewhat overlapping BBxes), “angel [facing] woman” (2 non-overlapping BBxes).

In all the above examples, italicized words are classes on which our Region-Convolutional Neural Network was trained, bold faced words represents the image’s main topic, and words in brackets result from our interpretation of visual relationships between BBxes. The detection of visually interacting objects will not implemented as the next step.

We illustrate such post-processing first with a 1668 painting of Saint Francis of Assisi by the Sevillian baroque painter Bartolomé Esteban Murillo (1617-1682).

Saint Francis of Assisi Crucifixion

On the right hand side, some of the objects of interest are detected and framed in as many BBxes, decorated with labels. For clarity not all BBxes are shown. In this example, the objects “crucifixion_1” and “monk_1” were correctly recognized as the main objects in the painting, and “crucifixion_1” was picked as the main topic based on its central location and on its size. In the same processing step the 100% overlap between “crown of thorns_1” and the head of the crucified figure establishes that the crucified figure represents Jesus of Nazareth. The sacred genre of the representation is suggested by the detection of cherubs (viz. object “angel_1”). The BBx analysis correctly concludes that the “angel_1” BBx overlaps that of object “book_1”, in the higher half of the composition in close proximity to object “crucifixion_1” but slightly in its background. Last in this simple example is the fact that object “monk_1” overlaps considerably with “crucifixion_1”. In that case our BBx-based post-processing cannot go beyond the observation that “monk_1” and Jesus on the Cross are in the same plane of representation, with “monk_1” situated beneath the cross.

A second example is a more complex scene, also of Christ on the Cross, by Rogier van der Weyden (1399-1464), painted circa 1440.

Christ on the Cross

As before the objects “crucifixion_1” and “prayer_1” were correctly recognized the main objects in the painting and “crucifixion_1” was picked as the main topic based on its central location and size. As before the location of “crown of thorn_1” establishes the crucified figure as Jesus of Nazareth. In support of that several angels are correctly detected and identified as being above and behind the main object “crucifixion_1”. By contrast all other figures are detected as being in the same representation plane as the figure on the cross, including “person_1” whose larger proportions could induce our algorithms to incorrectly conclude it is situated in the forefront of the painting. This is contradicted with its BBx’ baseline (BBx‘ lower horizontal boundary) which effectively places it slightly behind others and the cross. The close proximity of 3 characters on your left and at the center of the painting, with overlapping BBxes, and of one additional character “prayer_2” on your right hand side of the painting, with overlapping Bbxes, and of one additional character “prayer_2” on the right side of the painting, all beneath the center point of “crucifixion_1”, allows one to infer the probable representation of mourners.

The above analysis contributes to the emergence of a narrative based on pictorial semantics. The algorithmic treatment is in agreement with rules of composition (such as perspective, hierarchy of subjects, etc.) and plain spatial organization when dealing with relatively simple figurative scenes. This work is in progress and will be completed with the detection of body parts such as hands and limbs, as well as of line-of-sight so that we will soon be able to enrich our annotations with information about which way characters are facing and what other objects they may be touching.

------

[1] Erwin Panofsky’s three levels of interpretation applied to artworks are:
- 1st level of interpretation: primary or natural subject matter understanding (of pure forms: “I see 3 richly dressed men standing before a woman. They are in a house with a gallery. I see trees and clouds in the background.”)
- 2nd level of interpretation: secondary or conventional subject matter understanding (iconography: “13 men having supper is the Last Supper”; “A haloed man on a cross surrounded by lamenting men and woman, in the presence or not of two more crucified figures is the Crucifixion of Christ”)
- 3rd level of interpretation: tertiary or intrinsic meaning (iconology) draws from the fact that artistic representations are not isolated occurrences but artifacts caught in the flow of time, and as such are strongly influenced by their historical environment and the prevailing habitus. It seeks to answer the question: “why did the artist elect this artistic modality, those symbols and other aspects of the composition to represent the scene?”