I don’t fully buy the authors’ claim for FGVR. So, I agree with you that it is not clear why orderless pooling is useful.
But here is one possible argument.
In FGVR, there is a high probability you will miss some features (a bird’s eye, beak, head shape). These features are not only tiny, they are also very similar across different classes. The orderless argument claims that recognizing a feature (whether it exists or not) is more important than detecting it (where it occurs). For instance, the fact that a red beak exists is more important than whether the red beak is spatially close to the bird’s eye. Thus, it is better to classify the image based on whether a feature exists (does a red beak exist?) than on the spatial agreement between multiple features. A red beak recognized at the image’s top corner is as good as one recognized at the center.
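To make the intuition concrete, here is a minimal sketch in plain Python. It is my own toy example, not the authors’ implementation. It assumes a tiny 3x3 feature map with two hypothetical channels ("red beak" and "eye") and uses global max pooling as one simple form of orderless pooling.

```python
def orderless_max_pool(feature_map):
    """Global max pooling: per channel, keep the strongest response found anywhere."""
    channels = len(feature_map[0][0])
    return [
        max(cell[c] for row in feature_map for cell in row)
        for c in range(channels)
    ]


def flatten(feature_map):
    """Order-sensitive descriptor: concatenate all cells in spatial order."""
    return [value for row in feature_map for cell in row for value in cell]


def empty_map(height=3, width=3, channels=2):
    """Hypothetical helper: an all-zero feature map of shape height x width x channels."""
    return [[[0.0] * channels for _ in range(width)] for _ in range(height)]


# Channel 0 = "red beak" response, channel 1 = "eye" response (illustrative only).
image_a = empty_map()
image_a[0][0][0] = 0.9  # red beak detected at the top-left corner
image_a[1][1][1] = 0.7  # eye detected at the center

image_b = empty_map()
image_b[1][1][0] = 0.9  # same red beak response, now at the center
image_b[2][0][1] = 0.7  # same eye response, now at the bottom-left

pooled_a = orderless_max_pool(image_a)
pooled_b = orderless_max_pool(image_b)

print(pooled_a == pooled_b)                    # orderless descriptors agree
print(flatten(image_a) == flatten(image_b))    # order-sensitive descriptors differ
```

The pooled descriptor only records how strongly each feature fired, so both images look identical to the classifier, while the flattened descriptor also encodes where each feature fired and therefore differs.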
Sorry, I am not aware of any articles that give a better justification. These claims are very hard to prove; they are mostly intuition-based.