NLP News

By Sebastian Ruder

GitHub Copilot, The Perceiver, Beyond the Transformer, Data augmentation, NL augmenter 🦎 → 🐍, Research communication




Hi all,
This newsletter is a bit delayed. I had to skip the last one as I needed a break after a busy period at the end of May (EMNLP and NeurIPS deadlines). At the same time, there’s so much happening that I’ve found it hard to catch up. Now I’m back, feeling more energized, and updating myself (and you) on what’s new.
I’ll discuss the biggest advances over the last months including GitHub Copilot, the Perceiver, and non-self-attention models.
I’ll also talk about something that I find challenging when writing this newsletter: striking the right balance between content that is timely and content that remains relevant in the long term. TL;DR: I’m planning to keep newsletters somewhat shorter in the future to have more time for in-depth blog posts.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.

May–July round-up
OpenAI Codex / GitHub Copilot
If you work with software, you’ve probably heard about the release of GitHub Copilot, a coding assistant based on Codex, a GPT language model fine-tuned on code from GitHub (see the paper). As far as I’m aware, this is one of the first products from a large company in which users directly interact with a large generative language model. Large language models are also used in many other applications, such as Google Search, but those applications typically combine them with a wide array of other signals.
There are a couple of interesting takeaways from the paper: One is that the model was not trained from scratch; rather, an existing GPT model (up to 12B parameters, so not the largest GPT-3 model; the deployed model may be larger, however) was adaptively fine-tuned on code from GitHub. In addition, they further fine-tune two variants of the model (Codex-S and Codex-G) to generate stand-alone functions and docstrings, respectively.
Given that language models are prone to reproducing their training data (Raffel et al., 2020), people have already found memorization issues with Copilot, such as copy-pasting a person’s contact info. One issue is that the Codex model was trained on all code on GitHub, including code with potentially problematic licenses. GitHub claims that code produced by an AI model is “fair use”—it’s controversial whether this is actually the case, given that the model may reproduce passages verbatim.
Another question is whether Copilot will be able to make a meaningful difference in the workflow of programmers. In a previous newsletter, I discussed current ML-on-code work. In particular, I highlighted a study which found that current models did not improve productivity or code quality when used for in-IDE code generation. A practical limitation of Copilot is that it only considers the code in the current file (rather than the entire codebase) and can thus only generate relatively self-contained code. It thus remains to be seen whether it will provide meaningful benefits beyond the capabilities of existing models.
GitHub markets Copilot as an “AI pair programmer”. Pair programming is essentially a form of dialogue grounded in code. Similar to interacting with a dialogue agent, one requirement for successfully completing a task is a shared foundation of meaning. A task-oriented dialogue agent needs to know about the relevant entities and intents that are necessary for, say, booking a restaurant. In the same vein, an effective pair programming assistant should also have knowledge of the underlying codebase and its functions and variables.
Similar to how conversational question answering has been a focus in the community of late (see this recent paper for an overview), a conversational pair programming task would be a great way to measure progress regarding not just whether a model can produce a given function but whether it can effectively collaborate with a human. Given the promise of such models for augmenting the programming workflow, expect to see much more work in this space.
The Perceiver
The Perceiver uses cross-attention to project a large input byte array to a small latent array, which is processed with a regular Transformer stack. Cross-attention and Transformer stacks are interleaved throughout the model, with optional parameter sharing.
The Perceiver (Jaegle et al., ICML 2021) is one of the recent models that I’m most excited about. The main motivation of the work is to enable a Transformer-like architecture to scale to very high-dimensional inputs (Vision Transformers, for instance, are typically applied to image patches to overcome the quadratic complexity of self-attention). There have been many recent, more efficient Transformer architectures (see this paper for an overview), but their cost still depends on the length of the input, typically linearly.
In contrast, the Perceiver uses a latent array of a fixed dimensionality as its base representation (see above). This representation is then conditioned via cross-attention (as in a standard encoder-decoder model) on the much larger input array and then processed with a Transformer stack, in alternating fashion. If parameters are shared across Transformer blocks and cross-attention layers, the Perceiver can essentially be seen as an RNN with a Transformer at its core. It is also similar in spirit to the Universal Transformer (Dehghani et al., ICLR 2019), a model that applies the same Transformer block to an input multiple times.
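To make the asymmetry concrete, here is a minimal numpy sketch of a single Perceiver-style cross-attention step. It is heavily simplified relative to the actual model: no learned query/key/value projections, a single head, and the latent Transformer stack that would follow each cross-attention step is omitted. The point it illustrates is that the cost of cross-attending is linear in the input size M, while any subsequent self-attention operates only on the small N × D latent array.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent, inputs, d_k):
    # latent: (N, D) queries; inputs: (M, D) keys/values, with M >> N.
    scores = latent @ inputs.T / np.sqrt(d_k)  # (N, M): cost linear in M
    return softmax(scores) @ inputs            # (N, D) updated latent

rng = np.random.default_rng(0)
M, N, D = 50_000, 256, 64         # huge input array, small latent array
inputs = rng.normal(size=(M, D))  # e.g. flattened pixels + position features
latent = rng.normal(size=(N, D))  # learned latent array (here: random)

# One Perceiver block: cross-attend into the latent; a Transformer stack
# (omitted here) would then process only the N x D latent array.
latent = cross_attention(latent, inputs, d_k=D)
print(latent.shape)  # (256, 64)
```

With parameter sharing across the repeated cross-attention and Transformer blocks, this loop over blocks is exactly the RNN-like view described above: the latent array plays the role of the recurrent state.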
The authors apply the Perceiver to three datasets across different modalities (ImageNet, video and audio, and 3D point clouds) and report performance competitive with the state of the art on all of them. You can also check out Yannic Kilcher’s video for a more visual introduction and contextualisation of the Perceiver.
Beyond the Transformer
Another recent trend has been the emergence of models that seek to replace the ubiquitous self-attention layer, most notably using multilayer perceptrons (MLPs). The MLP-Mixer (Tolstikhin et al., 2021) applies MLPs independently to image patches as well as across patches and achieves competitive results on image classification tasks. Liu et al. (2021) propose gMLP, a gated MLP architecture that achieves performance similar to Transformers on NLP and vision tasks.
A recent non-MLP-based model is FNet (Lee-Thorp et al., 2021), which applies Fourier Transforms along the token and hidden dimensions instead of self-attention to mix information across tokens. While the model is less expressive than self-attention-based models such as BERT, it is much faster and still achieves competitive results in many settings.
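The token-mixing step of FNet is simple enough to sketch in a few lines of numpy (this omits the feed-forward layers, residual connections, and normalization that surround it in the full model): a 2D FFT over the sequence and hidden dimensions, keeping only the real part. Note that it has no learned parameters at all.

```python
import numpy as np

def fourier_mixing(x):
    # FNet-style token mixing: 2D FFT over the sequence and hidden
    # dimensions of a (seq_len, hidden_dim) array, keeping the real part.
    return np.fft.fft2(x).real

rng = np.random.default_rng(0)
tokens = rng.normal(size=(128, 64))  # (seq_len, hidden_dim) embeddings
mixed = fourier_mixing(tokens)
print(mixed.shape)  # (128, 64)
```

Because the FFT is a fixed linear transform, this layer is both fast (O(n log n) in the sequence length) and trivially parallelizable, which is where FNet’s speed advantage comes from.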
Another thread of work in this area revisits the dominance of self-attention by applying the same treatment to convolutions (Tay et al., ACL 2021): It turns out that if CNNs are pre-trained the same way as Transformer models, they achieve competitive performance on many NLP tasks. They mainly underperform on tasks that require modelling relations across sentences (such as paraphrasing, NLI, or QA), tasks that are notably over-represented on standard benchmarks such as GLUE.
On a similar note, a recent paper (Dehghani et al., 2021) by some of the same authors argues that the tasks we focus on as part of a benchmark induce a bias in terms of the models that will succeed. If standard benchmarks such as GLUE were constructed differently, would we still have ended up with self-attention-based models dominating or would CNN-based models be much more common?
In sum, an MLP may unfortunately not be all you need. However, while the hegemony of self-attention may still endure, recent challengers based on MLPs, convolutions, and various other transformations encourage us to rethink the fundamental building blocks of our models.
Data augmentation, NL augmenter 🦎 → 🐍
Data augmentation is a common tool in computer vision but much less common in NLP, where the discrete nature of language makes it harder to apply transformations that preserve meaning. This recent survey (Feng et al., ACL Findings 2021) gives an overview of recent approaches in this area. One thing that is still missing for data augmentation in NLP is a unified benchmark and framework in which many different approaches can be tried and compared to each other.
Towards this goal, NL-Augmenter is a collaborative effort that aims to collect a wide range of transformations, perturbations, and filters that generate additional data either for training or to test model robustness.
It is motivated by recent efforts such as the Beyond the Imitation Game Benchmark (BIG-bench), a collaborative project that crowd-sourced tasks to probe large language models. The BIG-bench project has attracted a large amount of interest, with people proposing hundreds of tasks.
NL-Augmenter invites submissions via GitHub pull requests. Submitted transformations may augment data in diverse ways such as introducing spelling errors, translating to a different language, randomizing names and numbers, paraphrasing, etc. Some of my favourite submissions introduce transformations that randomly swap words that sound similar to each other, replace names with more gender and culturally diverse ones, or translate random words to another language.
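As an illustration of the kind of perturbation such a collection might contain, here is a toy spelling-error transformation. This is a hypothetical sketch, not the actual NL-Augmenter interface (which defines its own submission format in the repository): it swaps adjacent characters inside some words to simulate typos while leaving the word inventory intact.

```python
import random

def char_swap_perturbation(text, p=0.1, seed=0):
    # Toy perturbation: with probability p, swap two adjacent characters
    # inside a word (length > 3) to simulate a spelling error. Seeded for
    # reproducibility, as augmentation pipelines usually require.
    rng = random.Random(seed)
    words = []
    for w in text.split():
        if len(w) > 3 and rng.random() < p:
            i = rng.randrange(1, len(w) - 2)          # never touch word edges
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]   # swap chars i and i+1
        words.append(w)
    return " ".join(words)

print(char_swap_perturbation("data augmentation is less common in NLP", p=0.5))
```

A robustness-testing filter in the same spirit would then check whether a model’s prediction stays stable under such label-preserving perturbations.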
The submission deadline is September 1, 2021. If you are interested in data augmentation for NLP, this is a great chance to contribute to a large community effort.
The research communication continuum 🗣
I think a lot about how to communicate research effectively. One thing that is on my mind lately is what I call—for lack of a better word—the ‘research communication continuum’, essentially to what extent and in what form to discuss a given research topic. I have depicted different levels of content and the most common formats below in order of increasing complexity, but of course different formats are possible for each.
The research communication continuum
There are a lot of great sources to get highlights or digests of interesting new articles such as SkyNet Today’s Last Week in AI, Yannic Kilcher’s ML News, The Batch, Import AI, and many more.
To complement these resources, I have tried to focus more on in-depth discussions with this newsletter. However, such content is more time-consuming to write and takes time away from writing more comprehensive blog posts.
Going forward, I’m planning to focus more on slightly lighter, opinionated takes on current research themes rather than super in-depth discussions in this newsletter. That should give me more time to go deep in blog posts and to explore exciting research topics with you more regularly. Stay tuned!
Rethinking ML Papers 📝
Speaking of research communication, the Rethinking ML Papers workshop at ICLR 2021 explored just this topic and featured many luminaries of the ML communication space (if you registered at ICLR, you can view the content of the workshop here). My highlights were:
Fun papers
And Now For Something Completely Different…
If I fits, I sits.
If I fits I sits: A citizen science investigation into illusory contour susceptibility in domestic cats (Felis silvestris catus) (Applied Animal Behaviour Science, 2021) From a different subject area, this article is a large-scale study that capitalizes on two important trends: 1) citizen science, which emphasizes public participation and collaboration in research, and 2) the Internet’s love of cats. It turns out that cats not only prefer to sit in physical box-like spaces but also tend to do so when the enclosure is merely illusory, as with the Kanizsa square visual illusion.
Lecturers going the extra mile for their students by dressing up in NLP-themed costumes (here: ELMo)
Teaching a Massive Open Online Course on Natural Language Processing (Teaching NLP Workshop 2021) Teaching a new course, particularly during the COVID pandemic, can be incredibly challenging. This is a nice example of a massive open online NLP course taught by lecturers from Moscow. In the paper, they share their 12-week syllabus, consisting of lectures covering both fundamentals and recent work, real-time coding sessions, and interviews with experts. One thing that surely contributed to the course’s success is the thematically fitting wardrobe choice of the lecturers, who dressed in Sesame Street kigurumis (see above).
Sebastian Ruder (@seb_ruder)

Regular analyses of advances in natural language processing and machine learning.
