Rethinking Attention with Performers — Part I

Ahmed Taha
7 min readOct 11, 2022

This article’s objective is to present a hand-wavy understanding of how Performers [1] work.

Transformers dominate the deep-learning literature in 2022. Unfortunately, Transformers suffer quadratic complexity in the self-attention layer. This has hindered transformers for long-input signals, i.e., large sequence L. Large sequences are not critical in NLP applications since most sentences have less than 40 words. Yet, large sequences are abundant in other applications such as protein sequencing [1] and high-resolution medical images [4]. Accordingly, there is a need for Efficient Transformers.

Figure 1: Vanilla self-attention computes the similarity (dot-product) between every query q and key k. This has quadratic complexity in terms of the input sequence length.

This is why Efficient Transformers literature is an active area of research. Tay et al. cluster these Efficient Transformers into categories as shown in Fig. 2. This article will focus on Performer from the Low Rank / Kernels category (blue circle).

Figure 2: Taxonomy of Efficient Transformer Architectures as defined by [3]

Performers are simple compared to alternative Efficient Transformers. Performers make minimal — if any — assumptions about the problem, e.g., the maximum sequence length L is not required. Furthermore, Performers introduce minimal hyperparameters. For these reasons, I am writing two articles to promote Performers. In this first article, I will present a hand-wavy picture of Performers, just an intuitive way to think about them. To do so, I will neither use the Performer’s paper notation nor dive into the technical hurdles tackled in this paper. In the second article, I will re-present Performers following the paper’s style and notation.

This article presents a Performer from a hashing perspective. Hashing is an easy concept that every software engineer understands. Those familiar with local sensitive hashing (LSH) will find this article easier to grasp. For those unfamiliar with LSH, we will use multiple hashing tables instead of one. Before diving into the hashing part, let us highlight the core idea first. The self-attention’s quadratic complexity stems from the need to compute the similarity between every pair of query-key. Basically, computing (Q × K)R^{L×L} is the source of all trouble in Eq. 1.

Equation 1: Vanilla self-attention has quadratic complexity in sequence length L. Normalizing factor (e.g., sqrt(d)) is dropped for brevity.

Vanilla self-attention computes (Q×K) first, then multiplies by V, i.e., the matrix multiplication order is ((Q×K)V). If we can change the order of these operations, we can evade the quadratic complexity. Concretely, if we multiply (K×V) first and then multiply by Q — (Q (K×V)) — our problem would be solved as shown in Fig. 3.

Figure 3: By changing the order of matrix multiplication, we reduce the self-attention complexity from quadratic (left) to linear (right). Dashed blocks indicate the order of computation. SM is the softmax kernel.

Yet, we can not simply do this because of the softmax operation on (Q×K). We need to get rid of the softmax first, so we can change the order of matrix multiplication. We can not simply remove this operation, but we can approximate it. We can regard softmax as a special function (kernel) that operates on two variables (Q, K). Then, we approximate this kernel using random feature maps ϕ as follows

Equation 2: We regard softmax (SM) as a kernel function that can be approximated by random feature maps, i.e., another univariate function.

This means we can re-write the self-attention operation as follows

Equation 3: We regard softmax as a kernel that can be approximated by a uni-variate function ϕ. This enables us to change the order of matrix multiplication and evade quadratic complexity.

So, how to find ϕ that approximates softmax? This is the main contribution of the Performers paper and it requires a lot of mathematical background. Yet, this article’s objective is to give a high-level understanding of Performers. Accordingly, I will use the hashing analogy.

Given L points, we need to compute LxL matrix to measure the similarity between every pair of points. Fortunately, we can approximate this operation using hashing. The Performer paper does not explicitly use the word hashing but the paper uses random features w. These random features w can be regarded as binary-hashing functions.

In Fig. 4, we have L=12 points (3 yellow, 3 green, 3 purple, and 3 red). An accurate pair-wise similarity costs O(12x12) computations, i.e., O(L²). To reduce this cost, we approximate this computation using a random feature w_1 as a binary-hashing function. Given a point x, this hashing function operates by projecting (dot-product) x on w, i.e., w^T x. Points with a similar (different) hashing sign will be assumed similar (different). Accordingly, the yellow and green points are assumed similar, while the purple and red points are assumed similar. Not a perfect solution, but it is a good start and has linear complexity!

Figure 4: A random feature w can be regarded as a binary hashing function that generates +ve/-ve hashes for its inputs.

Ideally, we want yellow and green points to be different. So, we introduce a second random feature w_2. By introducing a new random feature (hashing function) w_2, we have another hash table H_2 where yellow and purple points are similar, while green and red points are similar.

Figure 5: A random feature w_2 helps us separate yellow points from green points.

In our toy example, any individual hash table (H_1 or H_2) gives an inferior approximation but combining multiple hash tables gives a better approximation. By combining H_1 and H_2, we can conclude that yellow points are similar but different from the green, purple and red points as shown in Fig. 6.

Figure 6: We get a better pair-wise similarity approximation by combining multiple hashing tables.

Assuming a constant dimensionality d, an accurate pair-wise similarity costs O(L²), while our random-features approximation costs O(L×m) where m is the number of random features. In our toy example, the accurate pair-wise similarity costs 12x12, while the random-feature approximation costs 12x2, i.e., m=2. Clearly, the random feature approximation has linear complexity in terms of the sequence length L.

In our toy example, m=2 random features were enough. Yet, m should be bigger in complex embedding spaces. Another important observation is that while the random features w are supposed to be random, they are not entirely random. We prefer orthogonal (perpendicular) random features. Orthogonal features span (cover) larger embedding space and help differentiate between similar/different points. In Fig. 7, the orthogonal random features (w_1, w_2) can better approximate similarity between different points compared to non-orthogonal random features (w_1, w_3).

Figure 7: Orthogonal random features (w_1,w_2) are better than (w_1,w_3). Orthogonal random features span (cover) larger embedding space and help differentiate between different points.

Now, let us take another look at the Performer’s core equation

Equation 4: Feature maps ϕ used by Performers to approximate the softmax kernel.

We can see that we have m random features w on which we project (dot-product) our points x. So far, these random features w have been regarded as binary hashing functions that generate +ve/-ve hashes. Thus, our hashing tables contain two slots only. We can do better if our hashing table has more slots. For instance, it is better to assign positive points with different magnitudes (small and large) to different slots in our hashing table as shown in Fig. 8.

Figure 8: By quantizing/discretizing our hashes into finer slots, we can better approximate the similarity between different points.

To achieve this, we need an extra function f. For a given random feature w, this function f groups points with similar hashing magnitude and separates points with different hashing magnitude. So, I like to think of f as a step function as shown in Fig. 9. This perspective is good for a high-level understanding only. In Performers, f is selected to satisfy certain mathematical requirements. For instance, f should be differentiable and preferably with positive value for all inputs, i.e., f(x)≥0 ∀ x. Recall we are approximating softmax(Q,K) which outputs values [0,1]. So, f should be picked both to approximate the softmax and to remain stable for irrelevant features (points) which is the case for most queries and keys.

Figure 9: A step function example that assigns different hash values based on both sign and magnitude.

Accordingly, regarding f as a step function is good for high-level understanding only. For implementation and theoretical guarantees, f should be selected carefully.

My Final Comments

  1. The Performer paper [1] is a well-motivated worth reading paper. I highly recommend it even for those not interested in Transformers with linear complexity.
  2. I usually criticize papers in the My Comments section, but I want to criticize my article this time. First, this hashing perspective is not novel. Kitaev et al. [5] used a similar perspective to present the Reformers. Second, I had to violate some of the Performers' assumptions to maintain the hashing perspective/story. My goal is not to deliver an accurate explanation of Performer but to give a high-level picture. I believe the hashing perspective makes Performers less alien.
  3. Finally, I hope to do a better job presenting Performer's paper and its technicalities in a follow-up article, i.e., Part II.

References

[1] Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L. and Belanger, D., 2020. Rethinking attention with performers. arXiv preprint arXiv:2009.14794.

[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in neural information processing systems

[3] Tay, Y., Dehghani, M., Bahri, D. and Metzler, D., 2020. Efficient transformers: A survey. ACM Computing Surveys

[4] Taha, A., Truong Vu, Y.N., Mombourquette, B., Matthews, T.P., Su, J. and Singh, S., 2022. Deep is a Luxury We Don’t Have. In International Conference on Medical Image Computing and Computer-Assisted Intervention

[5] Kitaev, N., Kaiser, Ł. and Levskaya, A., 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.

--

--