This paper proposes a deep learning approach to turn image semantics into a graph of objects and relationships (nodes and edges). The quantitative evaluation shows that the problem is far from solved. The ground-truth graph for the following image has four nodes (Person, Dog, Frisbee, Frisbee) and three edges (relationships): a “playing with” edge between the person and dog nodes, a “catching” edge between the dog and frisbee nodes, and a “holding” edge between the person and frisbee nodes.
The proposed network is straightforward and simple to understand. In its first stage, the network uses an hourglass architecture to produce an output at a resolution similar to that of the input image. The generated heat maps predict the centers of objects and relationships. Along with the heat maps, the hourglass learns feature representations for these objects and relationships. In the figure below, the four white pixels in the first heat map indicate object centers. The second heat map has three white pixels marking the relationship centers.
The feature representations at these white pixels are extracted for further recognition and detection tasks. Object and relationship features are fed to two fully connected networks (FCN). The object FCN predicts the object class, bounding box, and ID embedding. The relationship FCN predicts the relationship class and the source and destination ID embeddings.
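The extraction step can be sketched as follows. This is only an illustration, not the paper's implementation: the function name `extract_peak_features`, the fixed `threshold` of 0.5, and the plain thresholding (with no non-maximum suppression) are all assumptions made here for clarity.

```python
import numpy as np

def extract_peak_features(heatmap, features, threshold=0.5):
    """Gather feature vectors at pixels whose heat-map score exceeds a threshold.

    heatmap:  (H, W) array of detection scores
    features: (H, W, C) array of per-pixel feature vectors
    threshold is a hypothetical value chosen for this sketch.
    """
    ys, xs = np.nonzero(heatmap > threshold)  # row-major order of detected pixels
    coords = np.stack([ys, xs], axis=1)       # (N, 2) pixel coordinates
    feats = features[ys, xs]                  # (N, C) one feature vector per detection
    return coords, feats

# Toy 4x4 heat map with two "white pixels" and 3-channel features.
heatmap = np.zeros((4, 4))
heatmap[1, 2] = 0.9
heatmap[3, 0] = 0.8
features = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)

coords, feats = extract_peak_features(heatmap, features)
```

Each row of `feats` would then be passed to the object or relationship FCN, depending on which heat map it came from.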
The ID embeddings are learned to associate relationships with objects, that is, edges with graph nodes. Ideally, a relationship's source and destination ID embeddings should match exactly those of the corresponding source and destination objects. For example, if the dog has ID = 1 and the person has ID = 2, then the relationship “playing with” should predict source ID embedding = 1 and destination ID embedding = 2. Unfortunately, such a hard constraint is difficult to enforce during training on images containing multiple objects and relationships. Thus, relationship embeddings are associated with the nearest embedded objects during training. The following figure depicts the associative embedding idea in R². R² is used for illustration purposes only; the embedding space used in the paper is R⁸.
In this diagram, a person with embedding ID (4, 1.5) is throwing a frisbee with embedding ID (3.5, 3.5). Ideally, the “throwing” relationship/edge should have source and destination embeddings of (4, 1.5) and (3.5, 3.5), respectively. Unfortunately, the source embedding predicted for the relationship is far from the person's embedding ID. The loss function penalizes such predictions and tries to “pull together” the embeddings.
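To make the nearest-embedding association concrete, here is a minimal sketch in R² using the object embeddings from the diagram. The predicted source and destination embeddings in `pred` are hypothetical values invented for this example, and `assign_to_nearest` is an illustrative helper, not a function from the paper.

```python
import numpy as np

def assign_to_nearest(pred_embeddings, obj_embeddings):
    """For each predicted embedding, return the index of the closest object embedding."""
    # Pairwise differences via broadcasting: shape (num_pred, num_obj, dim).
    diff = pred_embeddings[:, None, :] - obj_embeddings[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)  # Euclidean distances, (num_pred, num_obj)
    return dists.argmin(axis=1), dists

obj_embeddings = np.array([[4.0, 1.5],   # person (from the diagram)
                           [3.5, 3.5]])  # frisbee (from the diagram)

# Hypothetical predictions for the "throwing" edge: [source, destination].
pred = np.array([[3.0, 1.0],
                 [3.4, 3.6]])

idx, dists = assign_to_nearest(pred, obj_embeddings)
```

Here the imperfect source prediction (3.0, 1.0) still lands closer to the person than to the frisbee, so the edge is attached to the intended nodes even though the embeddings do not match exactly.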
This loss function has a degenerate solution in which all embeddings converge to a constant, i.e., h_i = constant for all i. A regularization loss term is added to overcome this limitation by “pushing” different embeddings apart.
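The interplay between the two terms can be sketched as below. The squared-distance “pull” term and the hinge-style “push” term with a `margin` are assumptions made for this illustration; the paper's exact formulation may differ.

```python
import numpy as np

def pull_loss(pred_embeddings, target_embeddings):
    """Mean squared distance between predicted embeddings and their target object embeddings."""
    return float(np.mean(np.sum((pred_embeddings - target_embeddings) ** 2, axis=1)))

def push_loss(obj_embeddings, margin=1.0):
    """Hinge penalty on pairs of distinct objects whose embeddings are closer than `margin`.

    The hinge form and the margin value are assumptions of this sketch.
    """
    n = len(obj_embeddings)
    if n < 2:
        return 0.0
    diff = obj_embeddings[:, None, :] - obj_embeddings[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    i, j = np.triu_indices(n, k=1)  # each unordered pair once
    return float(np.mean(np.maximum(0.0, margin - dists[i, j])))

# Degenerate solution: every embedding collapses to the same constant.
collapsed = np.zeros((3, 2))
# Well-separated embeddings.
spread = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])

collapsed_pull = pull_loss(collapsed, collapsed)  # pull term alone is perfectly satisfied
collapsed_push = push_loss(collapsed)             # but the push term penalizes the collapse
spread_push = push_loss(spread)                   # no penalty once objects are far apart
```

With the collapsed embeddings the pull term is zero, which is exactly why it cannot be used alone; the push term is what makes the collapse costly.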
The image-2-graph problem has multiple variants. The most difficult, called SGGen, provides no priors at all. A relaxed version, called SGCls, provides the ground-truth object bounding boxes during training. An even more relaxed version, called PredCls, provides the bounding boxes and their classes as well.
The originally proposed architecture requires no priors, yet variant architectures are proposed to incorporate such priors. Qualitative and quantitative results illustrate the problem's complexity and the proposed approach's superiority.
The approach is quantitatively evaluated by the number of ground-truth tuples that appear among the first K predicted tuples. In the experiments, both K = 50 and K = 100 are evaluated. From the reported results, the problem is far from solved.
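A simplified version of this recall-at-K metric might look like the following. It matches tuples by exact label equality only; any bounding-box overlap check that the actual benchmark applies is omitted, and the example predictions are made up for illustration.

```python
def recall_at_k(predicted, ground_truth, k):
    """Fraction of ground-truth tuples found among the top-k predictions.

    predicted:    list of (subject, predicate, object) tuples, sorted by descending score
    ground_truth: list of (subject, predicate, object) tuples
    Matching is by exact tuple equality, a simplification made for this sketch.
    """
    if not ground_truth:
        return 0.0
    top_k = set(predicted[:k])
    hits = sum(1 for triple in ground_truth if triple in top_k)
    return hits / len(ground_truth)

# Hypothetical ranked predictions for the person/dog/frisbee image.
predicted = [
    ("person", "holding", "frisbee"),
    ("dog", "on", "grass"),
    ("dog", "catching", "frisbee"),
    ("person", "playing with", "dog"),
]
ground_truth = [
    ("person", "playing with", "dog"),
    ("dog", "catching", "frisbee"),
    ("person", "holding", "frisbee"),
]

r_at_2 = recall_at_k(predicted, ground_truth, 2)
r_at_4 = recall_at_k(predicted, ground_truth, 4)
```

In this toy case only one of the three ground-truth tuples appears in the top 2, while all three appear in the top 4, which shows why recall rises with larger K.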
The details of the variant architectures that incorporate priors are very hand-wavy. The proposed solution is to append the priors as channels after “several” layers of convolution and pooling, to reduce the network's computational cost.
Detecting overlapping objects can be challenging. The proposed solution for handling them is good, but does a better approach exist? I expected a solution like processing object proposals at different resolutions, something similar to SSH: Single Stage Headless Face Detector by Najibi et al.