I don’t fully buy the authors’ claim for FGVR. So, I agree with you that it is not clear why orderless pooling is useful.

But here is one possible argument.

In FGVR, there is a high probability that you will miss some features (a bird’s eye, beak, or head shape). These features are not only tiny but also very similar across different classes. The orderless argument claims that feature recognition (whether the feature exists or not) is more important than its detection (where it occurs). For instance, the fact that a red beak exists is more important than whether the red beak is spatially close to the bird’s eye. Thus, it is better to classify the image based on whether a feature exists (does a red beak exist?) than on spatial agreement between multiple features. A red beak recognized at the image’s top corner is as good as one recognized at the center.
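
To make the intuition concrete, here is a minimal sketch, not the authors' method. It assumes global max pooling as a stand-in for orderless pooling, a toy 4x4 feature map, and a hypothetical channel index `RED_BEAK` that fires on a red beak. It shows that the pooled descriptor is the same whether the feature fires at the top corner or at the center, while the raw spatial maps differ.

```python
import numpy as np

def orderless_pool(feature_map):
    """Global max pooling over all spatial positions: (H, W, C) -> (C,).

    Assumption: max pooling stands in for "orderless" pooling here;
    it keeps whether a feature fired and discards where it fired.
    """
    channels = feature_map.shape[-1]
    return feature_map.reshape(-1, channels).max(axis=0)

rng = np.random.default_rng(0)
H, W, C = 4, 4, 3
RED_BEAK = 0  # hypothetical channel index, for illustration only

# Weak background activations, all below 0.1.
background = rng.random((H, W, C)) * 0.1

# Same strong red-beak response, placed at two different locations.
top_corner = background.copy()
top_corner[0, 0, RED_BEAK] = 1.0

center = background.copy()
center[2, 2, RED_BEAK] = 1.0

pooled_corner = orderless_pool(top_corner)
pooled_center = orderless_pool(center)

# The spatial maps differ, but the orderless descriptors match.
maps_differ = not np.allclose(top_corner, center)
same_descriptor = np.allclose(pooled_corner, pooled_center)
print(maps_differ, same_descriptor)
```

A classifier that consumes the flattened spatial map would see two different inputs here, whereas one that consumes the pooled descriptor sees the same input in both cases.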

Sorry, I am not aware of any articles that give a better justification. It is very hard to prove these claims; they are mostly intuition-based.

I write reviews on computer vision papers. Writing tips are welcomed.
