Bilinear CNN Models for Fine-grained Visual Recognition

Ahmed Taha
5 min read · Aug 17, 2018


Bilinear CNN was presented at ICCV 2015. It is a bit old, yet it has a few interesting concepts that I revisit in this article. The resources used to prepare this article are listed at the end. The concepts covered are FGVR, orderless descriptors, and the bilinear model formulation.

Fine-Grained Visual Recognition (FGVR)

FGVR is a classification task where inter-category visual differences are small and can be overwhelmed by factors such as pose, viewpoint, or location of the object in the image. For instance, the following image shows a California gull (left) and a Ringed-beak gull (right). The beak pattern difference is the key to a correct classification. Such a difference is tiny compared to the intra-category variations like pose and illumination.

Small annotated datasets are another FGVR challenge. Annotating FGVR datasets is expensive because of the expertise required. Three FGVR datasets illustrate the scale: CUB bird species (30 images per class), flower categories (10 images per class), and aircraft variants (100 images per class). This repository provides a nice summary of six FGVR datasets. Due to the aforementioned challenges, FGVR classification remains an open research problem, even with fine-tuned well-known architectures.

Orderless Descriptors

Bag of words (BoW) is a simple example of orderless descriptors. It counts detected features in an image and stores them in a histogram-like descriptor. The following image shows a BoW example where the visual-word vocabulary contains, for simplicity, four words: bike seat, light skin, violin part, and eye. In the first image, bike seat and violin part are missing, while eyes and light skin are detected multiple times. A normalized BoW descriptor looks something like [0.1 bike seat, 0.9 light skin, 0.1 violin part, 0.7 eye]. Descriptors constructed in this manner can be used to train an SVM classifier to distinguish these three classes.
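The counting-and-normalizing step can be sketched in a few lines of NumPy. The vocabulary and detection counts below are made up for illustration, not taken from the article's figure:

```python
import numpy as np

# Hypothetical four-word visual vocabulary, as in the BoW example above.
vocabulary = ["bike seat", "light skin", "violin part", "eye"]

def bow_descriptor(detected_words):
    """Count detections per visual word and normalize by the max count."""
    counts = np.array([detected_words.count(w) for w in vocabulary],
                      dtype=float)
    return counts / max(counts.max(), 1.0)

# A face image: many skin/eye detections, one spurious bike-seat match.
detections = ["light skin"] * 9 + ["eye"] * 7 + ["bike seat"]
descriptor = bow_descriptor(detections)
print(dict(zip(vocabulary, descriptor)))
```

Stacking such descriptors for a set of labeled images gives the feature matrix an SVM would be trained on.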

BoW is orderless because it stores no spatial information. In the first image of the BoW example (the female face), the eye is detected multiple times, but the descriptor tracks only the count; where the detections occurred is not stored. The bilinear CNN authors argue that orderless descriptors can be useful in certain problems like texture classification, where the location of a pattern adds little value compared to how often it occurs. Another merit of orderless descriptors is that each bit/unit stores more information, because it aggregates information from the entire image. For instance, the bike-seat counter can reflect multiple bike seats scattered across the image. Descriptors trained with CNN and fully connected (FC) layers have the opposite qualities.
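The "orderless" property is easy to demonstrate with sum pooling, one of the orderless pooling operations used later in the bilinear model. In this toy sketch (shapes are illustrative), moving an activation to a different spatial location leaves the pooled descriptor unchanged:

```python
import numpy as np

def sum_pool(feature_map):
    """Pool a C x H x W feature map into a C-dim orderless descriptor."""
    return feature_map.reshape(feature_map.shape[0], -1).sum(axis=1)

a = np.zeros((2, 4, 4))
b = np.zeros((2, 4, 4))
a[0, 0, 0] = 1.0   # activation in the top-left corner
b[0, 3, 3] = 1.0   # same activation in the bottom-right corner

# The two maps differ spatially, yet pool to identical descriptors.
print(np.array_equal(sum_pool(a), sum_pool(b)))  # True
```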

Fully connected features are “orderful” features: spatial information is stored, and each bit/unit stores less information because it corresponds to a particular image region. In the following image, a unit of a fully connected descriptor can be traced back to a particular region. It is a region, not a particular pixel, because of the subsampling and max/average pooling operations; without these operations, a descriptor unit could be traced back to a single image pixel.

The authors do not claim that orderless descriptors are generally better than “orderful” descriptors. They are better in problems like texture classification, scene classification, and FGVR, where spatial information is less valuable. The Fisher vector (FV), an orderless descriptor, is reported to outperform FC descriptors on the FGVR problem. While FV outperforms FC, training a neural network with FV in an end-to-end fashion is cumbersome because the gradients are difficult to compute. That is why the bilinear CNN model is introduced: it generalizes various orderless texture descriptors such as the Fisher vector, VLAD, and O2P, and it can be trained end-to-end, as presented in the next section.

Bilinear Model Formulation

To train a bilinear model, two CNNs are required to extract image features. The two CNNs are usually the early convolution layers of different, or the same, well-established architectures like AlexNet or VGG. Given an image I, the two CNNs (A, B) compute two feature maps F_A and F_B. In the following image, F_A has dimensionality C_A × W × H, where C_A is the number of channels and W and H are the width and height of the feature map, not of the original image. Reshape F_A into a C_A × L matrix, where L = W × H is the number of spatial locations; similarly, reshape F_B into C_B × L. At each of the L locations, take the outer product of the two channel vectors. This yields L matrices, each of size C_A × C_B.

To perform orderless pooling, the authors propose summing these L matrices. This results in a single C_A × C_B matrix, representing the whole image, that is finally reshaped into a 1D vector descriptor. The advantage of this formulation is that all operations are differentiable: the outer product and the summation pooling both have well-defined gradients, which enables end-to-end training. The outer-product gradient is well illustrated in the paper. A matrix-differentiation introduction, required to understand the gradient formulation, is out of this article's scope, so it is skipped.
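The reshape, per-location outer product, and sum pooling collapse into a single matrix multiplication, which is a minimal NumPy sketch of the formulation above (feature sizes are illustrative; the signed square-root and L2 normalization follow the paper's post-processing):

```python
import numpy as np

def bilinear_pool(fa, fb):
    """fa: (C_A, H, W), fb: (C_B, H, W) -> 1D descriptor of length C_A*C_B.

    Reshaping to (C, L) with L = H*W and computing fa @ fb.T is exactly
    the sum over the L locations of the per-location outer products.
    """
    ca, h, w = fa.shape
    cb = fb.shape[0]
    fa = fa.reshape(ca, h * w)
    fb = fb.reshape(cb, h * w)
    pooled = fa @ fb.T                         # (C_A, C_B) bilinear matrix
    desc = pooled.flatten()                    # 1D vector descriptor
    desc = np.sign(desc) * np.sqrt(np.abs(desc))   # signed square-root
    return desc / (np.linalg.norm(desc) + 1e-12)   # L2 normalization

fa = np.random.rand(256, 7, 7)   # e.g. conv features from CNN A
fb = np.random.rand(512, 7, 7)   # e.g. conv features from CNN B
print(bilinear_pool(fa, fb).shape)  # (131072,)
```

Because every step is a matrix operation, the gradients with respect to fa and fb are straightforward, which is what makes end-to-end training possible.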

At the end of the paper, the authors present a quantitative evaluation of B-CNN on multiple datasets, including CUB birds[3], aircraft[4], and cars[5].

The evaluated approaches are:

  1. FV-SIFT: Image SIFT features pooled using Fisher Vector descriptor
  2. FC-CNN[M]: M-Net CNN feature extractor followed by Fully Connected layer descriptor
  3. FC-CNN[D]: VGG-Net (Deep) CNN feature extractor followed by Fully Connected layer descriptor
  4. FV-CNN[D]: VGG-Net (Deep) CNN feature extractor followed by Fisher Vector descriptor
  5. B-CNN [D-M]: Bilinear Model with VGG(Deep) and M-Net CNN feature extractors followed by summation pooling

Conclusion: FV outperforms FC even without end-to-end training. This highlights the potential of orderless descriptors for FGVR applications.

B-CNN models outperform FV-CNN. This is expected because B-CNN is trained in an end-to-end fashion while being conceptually equivalent to FV-CNN: both are orderless features.

Resources List


[1] Bilinear CNN Models for Fine-grained Visual Recognition

[2] Improving Fine-Grained Visual Classification using Pairwise Confusion

[3] The Caltech-UCSD Birds-200-2011 Dataset

[4] Fine-grained visual classification of aircraft

[5] 3D object representations for fine-grained categorization

My Comments:

  • The paper is well organized, and the mathematical formulation is simple to understand given the required background
  • Multiple interesting concepts are introduced
  • I particularly like the detailed analysis of the experimental results and the in-depth comparison with previous work, even when the benchmark differs (comparing methods that do and do not use image part annotations).