(Jaegle et al., ICML 2021) is one of the recent models that I’m most excited about. The main motivation of the work is to enable a Transformer-like architecture to scale to very high-dimensional inputs (Vision Transformers are typically applied to image patches to overcome the computational complexity of self-attention). Many more efficient Transformer architectures have been proposed recently (see this paper for an overview), but their cost still depends on the length of the input, typically linearly.
In contrast, the Perceiver uses a latent array of a fixed dimensionality as its base representation (see above). This representation is then conditioned via cross-attention (as in a standard encoder-decoder model) on the much larger input array and then processed with a Transformer stack, in alternating fashion. If parameters are shared across Transformer blocks and cross-attention layers, the Perceiver can essentially be seen as an RNN with a Transformer at its core. It is also similar in spirit to the Universal Transformer (Dehghani et al., ICLR 2019), a model that applies the same Transformer block to an input multiple times.
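To make the alternating structure concrete, here is a minimal, dependency-free sketch. It is not the authors' implementation: it assumes single-head attention with no learned projections, layer normalisation, or MLPs, and the names `attention` and `perceiver_sketch` are made up for illustration. What it does show is that the latent array keeps its size no matter how long the input is.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; the output has one row per query."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, values)

def perceiver_sketch(inputs, latents, num_iterations=2):
    """Alternate cross-attention over the inputs with self-attention over the latents."""
    for _ in range(num_iterations):
        # Cross-attention: latents are the queries, the large input array
        # supplies keys and values. Cost grows with len(latents) * len(inputs).
        latents = add(latents, attention(latents, inputs, inputs))
        # Latent self-attention (a stand-in for the Transformer stack).
        # Cost grows with len(latents) ** 2 and does not depend on the input length.
        latents = add(latents, attention(latents, latents, latents))
    return latents

random.seed(0)
dim = 4
inputs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(1000)]  # large input array
latents = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(8)]    # small latent array
out = perceiver_sketch(inputs, latents)
print(len(out), len(out[0]))
```

Because every iteration of the loop reuses the same two operations, the sketch also mirrors the weight-shared, RNN-like reading of the model described above.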
The authors apply the Perceiver to three datasets across different modalities (ImageNet, video and audio, and 3D point clouds) and report performance competitive with the state of the art on all of them. You can also check out Yannic Kilcher’s video for a more visual introduction and contextualisation of the Perceiver.
Beyond the Transformer
Another recent trend has been the emergence of models that seek to replace the ubiquitous self-attention layer, most notably using multilayer perceptrons (MLPs). The MLP-Mixer (Tolstikhin et al., 2021) applies MLPs independently to image patches as well as across patches and achieves competitive results on image classification tasks. Liu et al. (2021) propose gMLP, a gated MLP architecture that achieves performance similar to Transformers on NLP and vision tasks.
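The two kinds of mixing in the MLP-Mixer can be sketched in a few lines. This is an illustrative sketch under simplifying assumptions rather than the paper's code: layer normalisation and biases are left out, the weights are random, and all function names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    """Element-wise sum, used for the skip connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def mlp(x, w1, w2):
    """Two-layer MLP applied to every row of x independently."""
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def mixer_block(x, token_w1, token_w2, channel_w1, channel_w2):
    """x has shape (num_patches, channels); the output has the same shape."""
    # Token mixing: after the transpose each row is one channel across all
    # patches, so the MLP mixes information across patches.
    x = add(x, transpose(mlp(transpose(x), token_w1, token_w2)))
    # Channel mixing: each row is one patch, processed independently.
    x = add(x, mlp(x, channel_w1, channel_w2))
    return x

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
num_patches, channels, hidden = 6, 4, 8
x = random_matrix(num_patches, channels, rng)
out = mixer_block(
    x,
    random_matrix(num_patches, hidden, rng), random_matrix(hidden, num_patches, rng),
    random_matrix(channels, hidden, rng), random_matrix(hidden, channels, rng),
)
print(len(out), len(out[0]))
```

Note that the token-mixing weights have a shape tied to `num_patches`, so in this sketch the number of patches is fixed once the weights are created.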
A recent model that is not MLP-based is FNet (Lee-Thorp et al., 2021), which uses 1D Fourier Transforms instead of self-attention to mix information at the token level. While the model is less expressive than self-attention based models such as BERT, it is much faster and still achieves competitive results in many settings.
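As a rough illustration of Fourier-based mixing, the sketch below applies a naive 1D discrete Fourier transform along the hidden dimension and then along the sequence dimension, and keeps only the real part. That composition reflects my reading of the paper and should be treated as an assumption; `dft` and `fourier_mixing` are illustrative names, and a real implementation would use an FFT rather than this O(n^2) loop.

```python
import cmath

def dft(values):
    """Naive O(n^2) discrete Fourier transform of a 1D sequence."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]

def fourier_mixing(x):
    """x has shape (seq_len, hidden); returns a real-valued array of the same shape."""
    # Transform along the hidden dimension (one token vector at a time)...
    rows = [dft(row) for row in x]
    # ...then along the sequence dimension (one column at a time), which is
    # the step that mixes information across tokens.
    cols = [dft(list(col)) for col in zip(*rows)]
    # Keep the real part and transpose back to (seq_len, hidden).
    return [[value.real for value in row] for row in zip(*cols)]

tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
mixed = fourier_mixing(tokens)
print(len(mixed), len(mixed[0]))
```

The mixing step has no learned parameters at all, which is one reason such a layer can be much cheaper than self-attention.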
Another thread of work in this area revisits the dominance of self-attention by applying the same treatment to convolutions (Tay et al., ACL 2021): It turns out that if CNNs are pre-trained the same way as Transformer models, they achieve competitive performance on many NLP tasks. They mainly underperform on tasks that require modelling relations across sentences (such as paraphrasing, NLI, or QA), tasks that are notably over-represented on standard benchmarks such as GLUE.
On a similar note, a recent paper (Dehghani et al., 2021) by some of the same authors argues that the tasks we focus on as part of a benchmark induce a bias in terms of the models that will succeed. If standard benchmarks such as GLUE were constructed differently, would we still have ended up with self-attention-based models dominating or would CNN-based models be much more common?
In sum, an MLP may unfortunately not be all you need. However, while the hegemony of self-attention may still endure, recent challengers based on MLPs, convolutions, and various other transformations encourage us to rethink the fundamental building blocks of our models.