I prefer “bilinear is nothing but a fancier name to outer product right” over “bilinear pooling is nothing but a fancier name to outer product right”.
Pooling is the source of the orderless property, not the bilinear operation. Please excuse any mathematical slips in what follows.
Let's say you have two CNNs, A and B, whose outputs have dimensions WxHxM and WxHxN. The bilinear operation takes the outer product of the M-dimensional and N-dimensional feature vectors at each spatial location, so its output has dimension WxHxMxN. This output is “order-ful”: it preserves spatial information in the WxH dimensions. After pooling across all of the image’s locations (WxH), it becomes MxN, which is finally flattened into MNx1. Pooling is where it becomes orderless.
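To make the shapes concrete, here is a minimal NumPy sketch of the steps above. The function names, the toy sizes, and the use of sum pooling are my own assumptions for illustration (average pooling would be orderless in the same way); the random arrays just stand in for the outputs of A and B.

```python
import numpy as np

def bilinear_features(fa, fb):
    """Outer product at every location: (W,H,M), (W,H,N) -> (W,H,M,N)."""
    return np.einsum("whm,whn->whmn", fa, fb)

def bilinear_pool(fa, fb):
    """Sum the per-location outer products over W and H, then flatten to (M*N,)."""
    return bilinear_features(fa, fb).sum(axis=(0, 1)).reshape(-1)

rng = np.random.default_rng(0)
W, H, M, N = 4, 5, 3, 2                  # toy sizes, chosen arbitrarily
fa = rng.standard_normal((W, H, M))      # stand-in for the output of CNN A
fb = rng.standard_normal((W, H, N))      # stand-in for the output of CNN B

full = bilinear_features(fa, fb)         # still has the W x H layout
pooled = bilinear_pool(fa, fb)           # spatial layout is gone

# Shuffle the spatial locations, identically in both feature maps.
perm = rng.permutation(W * H)
fa_s = fa.reshape(W * H, M)[perm].reshape(W, H, M)
fb_s = fb.reshape(W * H, N)[perm].reshape(W, H, N)

print(full.shape, pooled.shape)
# The pooled descriptor does not notice the shuffle...
print(np.allclose(pooled, bilinear_pool(fa_s, fb_s)))
# ...but the un-pooled bilinear tensor does.
print(np.allclose(full, bilinear_features(fa_s, fb_s)))
```

The shuffle only changes the result before pooling, which is the sense in which the bilinear output is order-ful and the pooled output is orderless.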
This pooling idea is also used by standard classification architectures, i.e., it is used for general classification problems and not only in bilinear models. Check Table 1 in the “Densely Connected Convolutional Networks” paper; notice the final 7x7 global average pooling operation.
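As a rough illustration of that global average pooling step, the sketch below averages a 7x7 feature map over its spatial positions. The channel count is a made-up toy value, not the one from the paper, and the random array is only a placeholder for a real feature map.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 8                                    # hypothetical channel count
fmap = rng.standard_normal((7, 7, C))    # placeholder for the last 7x7 feature map

gap = fmap.mean(axis=(0, 1))             # global average pooling -> shape (C,)

# Flipping the map spatially rearranges the locations but not the average.
flipped = fmap[::-1, ::-1, :]
gap_flipped = flipped.mean(axis=(0, 1))

print(gap.shape)
print(np.allclose(gap, gap_flipped))
```

Just like the pooled bilinear descriptor, the result keeps one value per channel and discards where in the 7x7 grid each activation came from.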