EACL, ICLR, NAACL papers round-up, Research reality checks, ML on code

NLP News
Hi there,
I’ve spent the last days catching up on papers—before inevitably falling behind again once ACL papers are released. With EACL just behind us, ICLR happening right now, and NAACL still some time away, here’s a round-up of some of my favourite papers from each conference so far.
This newsletter also contains some reflections on research progress (including discussions of improvements to optimizers and Transformers) and an overview of applying machine learning to source code.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.

Inspired by an xkcd comic, compilations of “typical” papers in different disciplines such as the above have been making the rounds on social media over the last couple of days. My favourite ones cover machine learning, economics, and neuroscience.
Looking beyond the meme, the exercise of coming up with typical paper titles encourages us to consider what narratives and contributions we are perhaps tired of seeing in the current literature (note: not all of the above titles fall into this category). For instance, if a title describes a direct application of BERT or T5 to a new task, are you still likely to read the abstract?
Conversely, if you are on the verge of submitting a paper that falls into such a category, it is worthwhile to think about how you can subvert a reader’s (or reviewer’s) expectation. Does your new method shed new light on what these models learn or where they fail? Did you discover an interesting interaction in the model’s learning dynamics?
Finally, if you are just starting a project, consider steering it away from such well-trodden paths and onto terrain that may yield more surprising and interesting narratives.
EACL, ICLR, NAACL papers round-up 📝
Language Modelling as a Multi-Task Problem Most settings in which multi-task learning (MTL) is commonly studied can seem artificial: tasks often share little information and are sometimes even entirely independent. This paper studies multi-task learning in a more natural setting by viewing language modelling through the MTL lens as a collection of different linguistic tasks. They focus on a subset of such tasks, negative polarity items (NPIs). These are words such as any and either that occur only in negative contexts. Another interesting aspect of the paper is that they use the area between learning curves to measure how much the information from other tasks (here: NPIs) helps during training.
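To make the area-between-learning-curves idea concrete, here is a minimal sketch. It is not the paper's code: the function name, the toy loss values, and the choice of the trapezoidal rule are my own assumptions. It only assumes that both curves were logged at the same training steps.

```python
def area_between_curves(steps, baseline_loss, multitask_loss):
    """Trapezoidal area between two loss curves sampled at the same steps.

    A positive value means the multi-task run had lower loss on average.
    """
    if not (len(steps) == len(baseline_loss) == len(multitask_loss)):
        raise ValueError("all sequences must have the same length")
    area = 0.0
    for i in range(1, len(steps)):
        width = steps[i] - steps[i - 1]
        gap_prev = baseline_loss[i - 1] - multitask_loss[i - 1]
        gap_curr = baseline_loss[i] - multitask_loss[i]
        area += 0.5 * width * (gap_prev + gap_curr)
    return area


# Toy numbers, invented purely for illustration.
steps = [0, 100, 200, 300]
baseline = [5.0, 4.0, 3.5, 3.2]
multitask = [5.0, 3.6, 3.2, 3.1]
gain = area_between_curves(steps, baseline, multitask)
```

A single number like this summarizes the benefit over the whole training run rather than only at the final checkpoint.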
Keep Learning: Self-supervised Meta-learning for Learning from Inference Dynamic evaluation (Krause et al., 2017) has been used to achieve state-of-the-art results on some language modelling benchmarks by updating models based on the context of test examples. For standard classification tasks, this method is not directly applicable. This paper proposes to fine-tune a model on the most confident predictions at test time, together with a combination of class balanced filtering, meta-learning, and regularization. They also study the online setting where test examples arrive one after another. The method shows improvements over strong baselines such as BERT in all settings. Overall, even though our current models are powerful, they can still be improved by adapting to the distribution at test time.
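As a rough illustration of the first two ingredients (confident predictions plus class-balanced filtering), here is a small sketch. The function name, the threshold, and the per-class cap are hypothetical choices of mine, and the meta-learning and regularization parts of the paper are not shown.

```python
from collections import defaultdict


def select_pseudo_labels(probabilities, threshold=0.9, per_class=2):
    """Pick confident test-time predictions, capped per predicted class.

    probabilities: one list of class probabilities per test example.
    Returns sorted (example_index, predicted_label) pairs to fine-tune on.
    """
    by_class = defaultdict(list)
    for index, probs in enumerate(probabilities):
        label = max(range(len(probs)), key=probs.__getitem__)
        confidence = probs[label]
        if confidence >= threshold:
            by_class[label].append((confidence, index))

    selected = []
    for label, candidates in by_class.items():
        # Keep only the most confident examples of each class so that
        # a frequent class cannot dominate the pseudo-labelled set.
        candidates.sort(reverse=True)
        selected.extend((index, label) for _, index in candidates[:per_class])
    return sorted(selected)


# Toy model outputs for five test examples and two classes.
outputs = [[0.95, 0.05], [0.6, 0.4], [0.02, 0.98], [0.99, 0.01], [0.93, 0.07]]
pseudo_labelled = select_pseudo_labels(outputs)
```

The selected pairs would then serve as training examples for a few additional fine-tuning steps.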
Maximal Multiverse Learning for Promoting Cross-Task Generalization of Fine-Tuned Language Models During multi-task learning, we train multiple classifier heads on top of one model to perform different tasks. Are multiple classifier heads also useful when doing a single task? What about if you have 100s of them? To keep this setting computationally efficient during inference, this paper proposes to start training around 1000 classifier heads and to prune them during training so that only the best-performing classifiers are retained. The key is to enforce the classifier heads to be orthogonal. What was surprising to me is that orthogonality does not lead to diverse heads (heads make mostly the same predictions) but encourages robust hidden representations that can be used by many classifiers. This is in contrast to how orthogonality is typically used in domain adaptation, to encourage representations between different domains to be dissimilar (Bousmalis et al., 2016), or in semi-supervised learning, where we used it to learn diverse representations with tri-training (Ruder and Plank, 2017).
AdapterFusion: Non-destructive task composition for transfer learning Adapters (Houlsby et al., 2019) are an effective way to learn parameter-efficient task-specific representations but using the information of multiple existing adapters is not straightforward. This paper proposes a contextual selection of relevant adapter layers via self-attention. The method improves performance particularly on tasks with little data. In line with prior work on intermediate task transfer learning (Pruksachatkun et al., 2020), adapters from high-resource tasks such as MNLI and QQP are often selected. The method has also recently been used to combine representations from 140 domain adapters (Rückle et al., 2020). Leveraging the information from so many different domains would be prohibitive using large pre-trained models. Adapters and AdapterFusion thus give rise to settings where the information from many separately trained experts can be combined in an efficient way.
SCoRe: Pre-Training for Context Representation in Conversational Semantic Parsing This is a great example of a paper that injects inductive biases into a pre-trained model that are relevant for a particular downstream task via behavioural fine-tuning on dedicated data and objectives. Specifically, the model is made more suitable for conversational semantic parsing by fine-tuning on synthetic data to a) predict the corresponding database operation for each column and table name (to encourage alignment between natural language and database schema) and b) predict how operations change between dialogue turns. The combination of synthetic data or natural data + task-specific objectives is very powerful and something we will likely see more of.
PMI-Masking: Principled masking of correlated spans Masked language modelling, which randomly masks subwords, is the standard objective for pre-training large language models. First introduced in a GitHub commit in the BERT repo, masking whole words instead of subwords was found to lead to better performance. Although a study of different unsupervised objectives in the T5 paper (Section 3.3) did not highlight any clear differences between the pre-training objectives, masking contiguous spans has become more common (Joshi et al., 2020). This paper puts such span masking approaches on a more principled foundation and proposes a masking strategy based on pointwise mutual information (PMI), which jointly masks a token n-gram if its subwords have a high co-occurrence probability in the corpus relative to the tokens’ individual occurrence probabilities. They also show that with smaller vocabulary sizes, the performance of a model with subword masking deteriorates much more quickly than that of a model with whole word masking. The key insight is that random subword masking is wasteful: it overtrains on easy subword tasks (for example predicting the subword “igen” in “e-igen-val-ue”) and undertrains on harder whole-word tasks (predicting “eigenvalue” given the surrounding context).
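For intuition, here is a minimal sketch of plain bigram PMI, log p(a, b) / (p(a) p(b)), on a toy corpus. It only illustrates the underlying quantity; the paper's measure for longer n-grams is more involved, and the function name and corpus here are made up.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return PMI scores for every adjacent token pair in the sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    scores = {}
    for (first, second), count in bigrams.items():
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_unigrams
        p_second = unigrams[second] / n_unigrams
        scores[(first, second)] = math.log(p_pair / (p_first * p_second))
    return scores


corpus = "new york is big . new york is old . the dog is big".split()
pmi = bigram_pmi(corpus)
# Pairs that co-occur more often than their parts suggest would be
# candidates for joint masking, e.g. pmi[("new", "york")] > pmi[("is", "big")].
```

Note that rare pairs can receive very high PMI from a single occurrence, so in practice one would likely combine the score with a minimum frequency cutoff.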
Multi-timescale Representation Learning in LSTM Language Models This is a nicely motivated paper that starts from an empirical observation: temporal dependencies in natural language tend to decay following a power law (Lin and Tegmark, 2016). Based on this, it derives how LSTM language models could model this power law decay and then shows that LSTMs trained on English in fact approximate the relevant distribution. A nice aspect is that this gives rise to a theoretically motivated model enhancement: enforcing the distribution explicitly (by setting the forget gate biases to some constant) leads to improvements in perplexity.
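One way to see why a forget gate bias corresponds to a timescale: if we assume the forget gate sits at a roughly constant value f = sigmoid(b) (ignoring its input-dependent part, which is my simplification), the cell state decays as f^t = exp(-t / T) with T = -1 / ln f. The helper names below are hypothetical and this is not the paper's code.

```python
import math


def forget_bias_for_timescale(timescale):
    """Bias b such that sigmoid(b) decays memory with the given timescale."""
    gate = math.exp(-1.0 / timescale)
    return math.log(gate / (1.0 - gate))  # inverse of the sigmoid


def timescale_of_bias(bias):
    """Timescale T = -1 / ln(sigmoid(bias)) implied by a constant gate."""
    gate = 1.0 / (1.0 + math.exp(-bias))
    return -1.0 / math.log(gate)


short_memory_bias = forget_bias_for_timescale(5.0)
long_memory_bias = forget_bias_for_timescale(500.0)
```

Under this reading, assigning different units different fixed biases gives them different timescales, and a suitable mixture of exponential decays can approximate a power law.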
I am also a co-author on two ICLR papers. In Rethinking Embedding Coupling in Pre-trained Language Models, we identify how decoupling embeddings during pre-training can lead to more parameter-efficient models for fine-tuning and inference. We also propose RemBERT, a rebalanced mBERT that outperforms XLM-R and mT5-Large. Our key insight is that allocating additional capacity during pre-training that specializes to the pre-training task makes the rest of the model more transferable. Code and checkpoints for the model should soon be online. In Long Range Arena: A Benchmark for Efficient Transformers, we evaluate a wide range of recent efficient Transformer models on a new benchmark suite that requires dealing with long contexts. The models that strike the best balance between speed and performance are BigBird (Zaheer et al., 2020) and Performer (Choromanski et al., 2021).
For more ICLR papers, have a look at my discussion of the ICLR outstanding papers in the last newsletter.
Hurdles to Progress in Long-form Question Answering This is a great paper that goes the extra mile. They propose a sparse, retrieval-augmented Transformer that obtains state-of-the-art results on the ELI5 long-form QA dataset (Fan et al., 2019). They could have left it at that; after all, a state-of-the-art system is pretty convincing already. However, when analysing the answers the model generated, conditioned on the retrieved documents, they find that it actually does not use the documents that it retrieves. Specifically, they find that replacing retrieved documents with randomly sampled ones has almost no effect on the quality of the generated answers. Overall, they attribute this behaviour in part to a training / test overlap. They also show that ROUGE-L is not a good measure for evaluating long-form answers and highlight that even human evaluation is challenging in this setting.
Representing Numbers in NLP: a Survey and a Vision Numbers are pervasive in language but most existing work treats them as any other token. As a result, models are largely unable to reason with numbers in a robust way. This paper provides an overview of different methods that have been used to encode and decode numbers in NLP. What I particularly liked is their taxonomy, which categorizes downstream tasks based on whether they deal with exact (birds have two legs) or approximate (Jon is about 180 cm tall) quantities and whether numbers are abstract (2 + 3 = 5) or grounded (2 apples + 3 apples = 5 apples). Thinking of downstream tasks in this more fine-grained way reveals more clearly what aspects of numerical reasoning models can do reasonably well and where they fail.
How Many Data Points is a Prompt Worth? I like this paper because it has clear practical value for NLP practitioners. Prompts are a tool to incorporate useful inductive bias via domain expertise into the data by leveraging the power of pre-trained language models. You can see an example of a QA dataset with a prompt consisting of a pattern (in bold) and a question (in italics) below:
“Posthumous marriage – Posthumous marriage (or necrogamy) is a marriage in which one of the participating members is deceased. It is legal in France and similar forms are practiced in Sudan and China. Since World War I, France has had hundreds of requests each year, of which many have been accepted. Based on the previous passage, can u marry a dead person in france ? <MASK>”
The prediction of the model for the <MASK> token is then mapped to a class using a verbalizer (here “Yes”: True, “No”: False). While prompts can be used in zero-shot settings as in the GPT-3 paper, in most practical settings we would like to fine-tune the model using the prompt. This paper does a side-by-side comparison of standard fine-tuning of a masked language model with and without using a prompt. They find that prompts are generally very beneficial: they can be worth between 3500 data points (MNLI) and 280 data points (RTE). Overall, they are most useful in low and medium-data scenarios. So if you are prototyping a new NLP application using a pre-trained language model, it is worth drafting a number of prompts before labelling 100s of examples.
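The pattern and verbalizer from the example can be sketched in a few lines. The function names and the score dictionary are hypothetical; in a real setup the scores would come from a masked language model's output at the <MASK> position, which is not shown here.

```python
VERBALIZER = {"Yes": True, "No": False}


def build_prompt(passage, question, mask_token="<MASK>"):
    """Wrap a passage and question in the pattern from the example above."""
    return f"{passage} Based on the previous passage, {question} ? {mask_token}"


def verbalize(mask_token_scores, verbalizer=VERBALIZER):
    """Map the best-scoring verbalizer word at the mask to its class.

    Words outside the verbalizer are ignored, however high they score.
    """
    best_word = max(
        verbalizer, key=lambda word: mask_token_scores.get(word, float("-inf"))
    )
    return verbalizer[best_word]


prompt = build_prompt(
    "Posthumous marriage is legal in France.",
    "can u marry a dead person in france",
)
# Made-up scores standing in for the model's prediction at the mask.
label = verbalize({"Yes": 2.1, "No": -0.3, "Maybe": 5.0})
```

Restricting the prediction to the verbalizer words is what turns the language model's vocabulary-sized output into a classifier.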
Research reality checks ꩜
“Sometimes innovation is only old ideas reappearing in new guises […]. But the new costumes are better made, of better materials, as well as more becoming: so the research is not so much going round in circles as ascending a spiral.” Karen Spärck Jones (2001)
Karen Spärck Jones’ quote is as true today as it was 20 years ago. Our costumes du jour might consist of more layers and of more robust materials, but they dress ideas that are much older. The number of new ideas is limited. We all know that most deep learning models are secretly three or more logistic regressions in a trench coat. Even this joke was made before.
Karen’s image of an ever-ascending spiral is nevertheless reassuring as it implies that we make consistent progress. Despite winters and detours, research in ML and NLP has made tremendous advances in recent years. In her presidential address at ACL 2018, Marti Hearst suggested that citation analysis indicates that progress in research looks more like the below than a spiral.
Research progress as intertwined staircases (Credit: Marti Hearst)
Whether we view research developing in a spiral or in many intertwined staircases, it is clear that research climbs in many directions at the same time. Let’s now look more closely at two of these directions: improvements to standard optimizers such as Nesterov momentum and Adam, and improvements to the Transformer. In each case, have we actually been making progress?
Optimizer reality check
Gradient descent optimization algorithms are the workhorses of deep learning optimization. Among the most popular variants are Nesterov momentum and Adam. However, some things have changed about how current models are optimized since these methods first emerged. For one, it is now possible to train models with much larger batch sizes (up to 65,536).
Such large-batch settings are mostly relevant when pushing the limits of how fast one can train large models to a certain performance target (as measured on the MLPerf benchmark). While this setting is currently mostly of interest to a few places with lots of compute, the compute required to train cutting-edge models generally falls rapidly. Recently, two new algorithms, LARS and LAMB (You et al., 2017; 2019), were proposed to train models effectively with such large batch sizes. They apply layer-wise normalization before each gradient update to gradient descent with momentum and to Adam, and were demonstrated on ResNet-50 and BERT models respectively.
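To give a feel for what layer-wise normalization means here, below is a deliberately stripped-down sketch of a LARS-style step: each layer's gradient is rescaled by the ratio of the weight norm to the gradient norm. Momentum, weight decay, and the Adam statistics used in LAMB are all omitted, and the function names are my own.

```python
import math


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


def layerwise_scaled_step(layers, grads, learning_rate=0.1, eps=1e-8):
    """One plain gradient step with a per-layer trust ratio ||w|| / ||g||."""
    updated = []
    for weights, grad in zip(layers, grads):
        w_norm, g_norm = l2_norm(weights), l2_norm(grad)
        # Fall back to an unscaled step if either norm is zero.
        if w_norm > 0 and g_norm > 0:
            trust_ratio = w_norm / (g_norm + eps)
        else:
            trust_ratio = 1.0
        updated.append(
            [w - learning_rate * trust_ratio * g for w, g in zip(weights, grad)]
        )
    return updated


weights = [[3.0, 4.0]]
small = layerwise_scaled_step(weights, [[0.3, 0.4]])
large = layerwise_scaled_step(weights, [[30.0, 40.0]])
```

In the toy example, gradients that differ by a factor of 100 in magnitude produce (almost) the same update, since the step size is tied to the norm of the layer's weights rather than to the raw gradient scale.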
In a recent study, Nado et al. (2021) find that both Nesterov momentum and Adam can match or exceed the results of these recent optimizers at large batch sizes. They also establish a new state of the art for fastest speed when pre-training BERT. The key was to apply the same optimization tricks used in the LARS pipeline to the classic algorithms and to properly tune them. Specifically, the learning rate schedule is really important with a small step budget. For BERT pre-training, fixing an issue in the BERT open-source pre-training code led to increased stability.
In sum, standard optimizers are strong baselines even in large-scale settings, and so far there is no evidence that more recent optimizers scale better.
Transformer reality check
The Transformer is the most successful recent architecture in NLP, arguably even in the entire field of ML since ResNet (He et al., 2016). Given its ubiquity, it is unsurprising that many possible extensions of it have been proposed, modifying its activations, embeddings, softmax, attention, among many others.
Unfortunately, it is common for baselines to be under-tuned (see for example Melis et al. (2019) for a reflection on the evaluation of LSTM LMs) and for most of the computation budget to be spent on tuning the new method. This often leads to an over-estimation of the benefit of the proposed method. For instance, Dodge et al. (2019) found that many recent model comparisons would have reached a different conclusion if authors had used more (or less) computation for each method.
With regard to the Transformer, Narang et al. (2021) recently evaluated a wide range of Transformer modifications in the same codebase. They found that most modifications do not meaningfully improve the model’s performance. The few modifications that led to significant improvements were generally relatively minor changes, for example using GeGLU (Hendrycks and Gimpel, 2016) instead of the ReLU activation function and replacing layer norm with RMS normalization (Zhang and Sennrich, 2019). Most of the architecture modifications that led to improvements such as the Switch Transformer (Fedus et al., 2021), mixture-of-experts (Lepikhin et al., 2020), and product key memories (Lample et al., 2019) require significantly more parameters. Interestingly, embedding decoupling, which we proposed in the ICLR paper mentioned in the previous section, generally improved performance with only a minor increase in parameters.
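For reference, the two minor changes mentioned above are small enough to sketch in plain Python. This is a minimal illustration assuming the usual definitions (RMS normalization divides by the root mean square instead of subtracting the mean, and GeGLU gates a linear projection with a GELU-activated one). Biases are omitted and the weight names are hypothetical.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale x by its root mean square; no mean subtraction as in layer norm."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def gelu(v):
    """Exact GELU: v times the standard normal CDF at v."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matvec(matrix, x):
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def geglu(x, w_gate, w_value):
    """GELU(W_gate x) multiplied element-wise with (W_value x)."""
    gate = [gelu(v) for v in matvec(w_gate, x)]
    value = matvec(w_value, x)
    return [g * v for g, v in zip(gate, value)]


normalized = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
hidden = geglu([1.0, 0.0], w_gate=[[1.0, 0.0]], w_value=[[2.0, 0.0]])
```

Both are drop-in replacements inside a Transformer block, which fits the observation that the changes that helped were relatively minor ones.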
A corollary of the above results is that the original Transformer may be close to a local optimum in terms of the underlying model architecture. Hill climbing on the Transformer architecture may thus provide limited overall benefits; instead, we may have to restart and consider fundamentally different architectures.
ML on code 👩‍💻
What is it? ML on code is the broad area of applying ML techniques to source code. The main objective is to automate parts of the software engineering workflow. Common tasks are generating documentation, comments, or git commit messages, summarizing a code snippet, recommending API usage examples (see Elnaggar et al., 2021 for some example datasets), automatically detecting vulnerabilities in source code (Russell et al., 2018), and synthesizing entire programs from natural language descriptions (Polosukhin and Skidanov, 2018).
What have people been doing? Source code can be treated as a structured language. Given that a large amount of unlabelled source code is publicly available on websites such as GitHub, researchers have applied methods from NLP to learn useful representations. For instance, there are variants of BERT (Kanade et al., 2020) and of T5 (Elnaggar et al., 2021) that have been pre-trained on source code.
What’s new? Recently, a number of new methods for pre-training general-purpose representations of source code have been proposed. The most interesting aspect of these methods is that they leverage information about the structure and idiosyncrasies of source code. Zügner et al. (2021) give the model access to distances in the abstract syntax tree of the source code. Jain et al. (2020) and Roziere et al. (2021) propose new pre-training objectives that rely on source code transformations. The former generate syntactically diverse programs with the same functionality and encourage similar representations between them. The latter train a model to revert obfuscated source code (where names of functions and variables have been replaced by uninformative names). This is a much harder pre-training task compared to standard masked language modelling.
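To illustrate what obfuscated training data might look like, here is a toy sketch using Python's ast module (ast.unparse needs Python 3.9 or later). It is not the authors' pipeline: the class name and the FUNC_/VAR_ naming scheme are assumptions, and it only renames functions, arguments, and assigned variables, leaving builtins and names defined later in the file untouched.

```python
import ast


class IdentifierObfuscator(ast.NodeTransformer):
    """Replace function, argument, and variable names with uninformative ones."""

    def __init__(self):
        self.mapping = {}

    def _alias(self, name, prefix):
        if name not in self.mapping:
            used = sum(1 for a in self.mapping.values() if a.startswith(prefix))
            self.mapping[name] = f"{prefix}_{used}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name, "FUNC")
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg, "VAR")
        return node

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            node.id = self._alias(node.id, "VAR")
        elif node.id in self.mapping:
            # Only rename names we have seen defined, so builtins survive.
            node.id = self.mapping[node.id]
        return node


def obfuscate(source):
    transformer = IdentifierObfuscator()
    tree = transformer.visit(ast.parse(source))
    return ast.unparse(tree), transformer.mapping


SOURCE = """
def add_all(values):
    total = 0
    for item in values:
        total = total + item
    return total
"""
obfuscated, mapping = obfuscate(SOURCE)
```

The obfuscated program would be the model input and the mapping back to the original names the prediction target, which forces the model to infer what a piece of code does from its structure alone.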
Are these models good enough to be useful to actual developers? Xu et al. (2021) recently conducted a study on in-IDE code generation using a plugin developed for that purpose. While developers generally had a positive experience interacting with the system, the authors did not find any conclusive evidence that pointed to increased productivity, better code quality, or improved program correctness. So while these models are getting more powerful, there is still some way to go in terms of improving their effectiveness as developer assistants.
Sebastian Ruder (@seb_ruder)

Regular analyses of advances in natural language processing and machine learning.
