Language Modelling as a Multi-Task Problem
Most settings in which multi-task learning (MTL) is commonly studied can seem artificial: tasks often share little information and are sometimes even entirely independent. This paper studies multi-task learning in a more natural setting by viewing language modelling through the MTL lens as a collection of different linguistic tasks. The authors focus on a subset of such tasks, negative polarity items (NPIs): words such as "any" that occur only in negative contexts. Another interesting aspect of the paper is that they use the area between learning curves to measure how much the information from other tasks (here: NPIs) helps during training.
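As a rough sketch of this evaluation idea, the area between two learning curves can be approximated with the trapezoidal rule (the function and variable names below are my own; the paper's exact setup may differ):

```python
def area_between_curves(xs, with_task, without_task):
    """Approximate the area between two learning curves via the
    trapezoidal rule. A positive area means that training with the
    auxiliary task (here: NPI-related data) helped on average."""
    assert len(xs) == len(with_task) == len(without_task)
    area = 0.0
    for i in range(1, len(xs)):
        width = xs[i] - xs[i - 1]
        gap_left = with_task[i - 1] - without_task[i - 1]
        gap_right = with_task[i] - without_task[i]
        area += width * (gap_left + gap_right) / 2
    return area

# Accuracy measured at increasing amounts of training data.
steps = [0, 100, 200, 300]
multi = [0.5, 0.7, 0.8, 0.85]   # with the auxiliary task
single = [0.5, 0.6, 0.7, 0.8]   # without it
print(area_between_curves(steps, multi, single))
```

A larger area indicates that the auxiliary information accelerated learning throughout training, not just at convergence.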
Keep Learning: Self-supervised Meta-learning for Learning from Inference
Dynamic evaluation (Krause et al., 2017)
has been used to achieve state-of-the-art results on some language modelling benchmarks by updating models based on the context of test examples. For standard classification tasks, this method is not directly applicable. This paper proposes to fine-tune a model on its most confident predictions at test time, combined with class-balanced filtering, meta-learning, and regularization. They also study the online setting where test examples arrive one after another. The method shows improvements over strong baselines such as BERT in all settings. Overall, even though our current models are powerful, they can still be improved by adapting to the distribution at test time.
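A minimal sketch of the class-balanced filtering step, assuming we keep the top-k most confident test predictions per predicted class as pseudo-labels for fine-tuning (the function and data layout are my own illustration, not the paper's exact procedure):

```python
def select_confident(predictions, per_class):
    """Class-balanced filtering: keep the `per_class` most confident
    test predictions for each predicted class. The selected examples
    would then serve as pseudo-labels for test-time fine-tuning.
    `predictions` is a list of (example_id, predicted_class, confidence)."""
    by_class = {}
    for ex_id, label, conf in predictions:
        by_class.setdefault(label, []).append((conf, ex_id))
    selected = []
    for label, items in by_class.items():
        items.sort(reverse=True)  # most confident first
        for conf, ex_id in items[:per_class]:
            selected.append((ex_id, label))
    return selected

preds = [(0, "pos", 0.9), (1, "pos", 0.6), (2, "neg", 0.95), (3, "pos", 0.8)]
print(select_confident(preds, per_class=1))
```

Balancing per class prevents the adaptation loop from collapsing onto whichever class the model is already over-confident about.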
Maximal Multiverse Learning for Promoting Cross-Task Generalization of Fine-Tuned Language Models
During multi-task learning, we train multiple classifier heads on top of one model to perform different tasks. Are multiple classifier heads also useful when doing a single task? What about if you have hundreds of them? To keep this setting computationally efficient during inference, this paper proposes to start training with around 1,000 classifier heads and to prune them during training so that only the best-performing classifiers are retained. The key is to enforce that the classifier heads are orthogonal. What surprised me is that orthogonality does not lead to diverse heads (the heads make mostly the same predictions) but instead encourages robust hidden representations that can be used by many classifiers. This is in contrast to how orthogonality is typically used in domain adaptation, to encourage representations of different domains to be dissimilar (Bousmalis et al., 2016), or in semi-supervised learning, where we used it to learn diverse representations with tri-training (Ruder and Plank, 2017).
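To make the orthogonality constraint concrete, here is an illustrative penalty that is zero when all classifier head weight vectors are mutually orthogonal (a generic sketch of the idea, not the paper's exact loss):

```python
def orthogonality_penalty(heads):
    """Sum of squared dot products between all pairs of unit-normalized
    classifier head weight vectors. The penalty is zero exactly when the
    heads are mutually orthogonal; adding it to the training loss pushes
    the heads apart in weight space."""
    def normalize(v):
        norm = sum(x * x for x in v) ** 0.5
        return [x / norm for x in v]

    units = [normalize(h) for h in heads]
    penalty = 0.0
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            dot = sum(a * b for a, b in zip(units[i], units[j]))
            penalty += dot ** 2
    return penalty

print(orthogonality_penalty([[1.0, 0.0], [0.0, 2.0]]))  # orthogonal -> 0.0
print(orthogonality_penalty([[1.0, 0.0], [1.0, 0.0]]))  # identical  -> 1.0
```

Note that orthogonal weight vectors constrain the heads, not their predictions, which is consistent with the paper's observation that the heads end up agreeing while the shared representation becomes more robust.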
AdapterFusion: Non-destructive task composition for transfer learning
Adapters (Houlsby et al., 2019) are an effective way to learn parameter-efficient task-specific representations, but using the information of multiple existing adapters is not straightforward. This paper proposes a contextual selection of relevant adapter layers via self-attention. The method improves performance particularly on tasks with little data. In line with prior work on intermediate-task transfer learning (Pruksachatkun et al., 2020), adapters from high-resource tasks such as MNLI and QQP are often selected. The method has also recently been used to combine representations from 140 domain adapters (Rückle et al., 2020). Leveraging the information from so many different domains would be prohibitive using large pre-trained models. Adapters and AdapterFusion thus give rise to settings where the information from many separately trained experts can be combined in an efficient way.
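A much-simplified sketch of the fusion step: score each adapter's output against the current hidden state, softmax the scores, and take the weighted sum. (The actual AdapterFusion method learns query/key/value projections per layer; everything below is my own minimal illustration.)

```python
import math

def fuse_adapters(query, adapter_outputs):
    """Combine the outputs of several adapters with attention: score
    each adapter output against the query via a dot product, softmax
    the scores, and return the weighted sum plus the weights."""
    scores = [sum(q * a for q, a in zip(query, out)) for out in adapter_outputs]
    max_s = max(scores)                      # stabilize the softmax
    exps = [math.exp(s - max_s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    fused = [sum(w * out[d] for w, out in zip(weights, adapter_outputs))
             for d in range(dim)]
    return fused, weights

fused, weights = fuse_adapters([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights)  # the adapter aligned with the query gets more weight
```

Because the weights are computed per input, different examples can draw on different adapters, which is what makes the selection "contextual".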
SCoRe: Pre-Training for Context Representation in Conversational Semantic Parsing
This is a great example of a paper that injects inductive biases into a pre-trained model that are relevant for a particular downstream task via behavioural fine-tuning
on dedicated data and objectives. Specifically, the model is made more suitable for conversational semantic parsing by fine-tuning on synthetic data to a) predict the corresponding database operation for each column and table name (to encourage alignment between natural language and database schema) and b) predict how operations change between dialogue turns. The combination of synthetic data or natural data + task-specific objectives is very powerful
and something we will likely see more of.
PMI-Masking: Principled masking of correlated spans
Masked language modelling, which randomly masks subwords, is the standard objective for pre-training large language models. First introduced in a GitHub commit in the BERT repo, masking whole words instead of subwords was found to lead to better performance. Although a study of different unsupervised objectives in the T5 paper
(Section 3.3) did not highlight any clear differences between the pre-training objectives, masking contiguous spans has become more common (Joshi et al., 2020
). This paper puts such span masking approaches on a more principled foundation and proposes a masking strategy based on pointwise mutual information
(PMI), which jointly masks a token n-gram if its subwords have a high co-occurrence probability in the corpus relative to the tokens’ individual occurrence probabilities. They also show that with smaller vocabulary sizes, the performance of a model with subword masking deteriorates much more quickly than that of a model with whole word masking. The key insight is that random subword masking is wasteful
: it overtrains on easy subword tasks (for example predicting the subword “igen” in “e-igen-val-ue”) and undertrains on harder whole-word tasks (predicting “eigenvalue” given the surrounding context).
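The PMI criterion itself is simple to state in code. The sketch below scores a candidate n-gram from corpus counts (an illustration of the underlying quantity; the paper's actual masking strategy builds on a refined version of this score):

```python
import math

def pmi(ngram_count, unigram_counts, total_tokens):
    """Pointwise mutual information of an n-gram: the log of its
    observed probability relative to the product of its tokens'
    individual probabilities. High PMI means the tokens strongly
    co-occur, making the n-gram a good candidate for joint masking."""
    p_ngram = ngram_count / total_tokens
    p_independent = 1.0
    for count in unigram_counts:
        p_independent *= count / total_tokens
    return math.log(p_ngram / p_independent)

# A collocation: the bigram occurs almost every time either token does.
high = pmi(ngram_count=90, unigram_counts=[100, 100],
           total_tokens=1_000_000)
# A chance pairing of two frequent tokens scores far lower.
low = pmi(ngram_count=10, unigram_counts=[10_000, 10_000],
          total_tokens=1_000_000)
print(high > low)  # True
```

Masking whole high-PMI spans removes the shortcut of predicting a subword from its siblings, forcing the model to rely on the wider context.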
Multi-timescale Representation Learning in LSTM Language Models
This is a nicely motivated paper that starts from an empirical observation: temporal dependencies in natural language tend to decay following a power law
(Lin and Tegmark, 2016
). Based on this, it derives how LSTM language models could model this power law decay and then shows that LSTMs trained on English in fact approximate the relevant distribution. A nice aspect is that this gives rise to a theoretically motivated model enhancement: enforcing the distribution explicitly (by setting the forget gate biases to some constant) leads to improvements in perplexity.
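One standard back-of-the-envelope way to see the link between forget gate biases and timescales (the functions below are my own sketch, not the paper's derivation): if the forget gate saturates to f = sigmoid(b), state decays as f^t, giving a characteristic timescale T ≈ 1 / (1 − f), which can be inverted to choose b for a desired T.

```python
import math

def forget_bias_for_timescale(T):
    """Bias b such that a saturated forget gate f = sigmoid(b) yields
    a characteristic memory timescale of roughly T steps:
    T ~ 1 / (1 - f)  =>  b = log(T - 1)."""
    return math.log(T - 1)

def timescale_for_bias(b):
    """Inverse mapping: the timescale implied by a forget gate bias."""
    f = 1 / (1 + math.exp(-b))  # sigmoid
    return 1 / (1 - f)

b = forget_bias_for_timescale(100)
print(round(timescale_for_bias(b)))  # recovers the timescale: 100
```

Fixing the biases of different units to a spread of such values yields a mixture of exponential timescales, which is how an LSTM can approximate the power-law decay observed in natural language.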
I am also a co-author on two ICLR papers. In Rethinking Embedding Coupling in Pre-trained Language Models
, we identify how decoupling embeddings during pre-training can lead to more parameter-efficient models for fine-tuning and inference. We also propose RemBERT, a rebalanced mBERT that outperforms XLM-R and mT5-Large. Our key insight is that allocating additional capacity during pre-training that specializes to the pre-training task makes the rest of the model more transferable
. Code and checkpoints for the model should soon be online. In Long Range Arena: A Benchmark for Efficient Transformers
, we evaluate a wide range of recent efficient Transformer models on a new benchmark suite that requires dealing with long contexts. The models that strike the best balance between speed and performance are BigBird (Zaheer et al., 2020) and Performer (Choromanski et al., 2021).
For more ICLR papers, have a look at my discussion of the ICLR outstanding papers
in the last newsletter.
Hurdles to Progress in Long-form Question Answering
This is a great paper that goes the extra mile. They propose a sparse, retrieval-augmented Transformer that obtains state-of-the-art results on the ELI5 long form QA dataset (Fan et al., 2019)
. They could have left it at that; after all, a state-of-the-art system is pretty convincing already. However, when analysing the answers the model generated, conditioned on the retrieved documents, they find that it actually does not use the documents that it retrieves
. Specifically, they find that replacing the retrieved documents with randomly sampled ones has almost no effect on the quality of the generated answers. They attribute this behaviour in part to train/test overlap. They also highlight that ROUGE-L is not a good measure for evaluating long-form answers and that even human evaluation is challenging in this setting.
Representing Numbers in NLP: a Survey and a Vision
Numbers are pervasive in language but most existing work treats them as any other token. As a result, models are largely unable to reason with numbers in a robust way. This paper provides an overview of different methods that have been used to encode and decode numbers in NLP. What I particularly liked is their taxonomy, which categorizes downstream tasks based on whether they deal with exact (birds have two legs) or approximate (Jon is about 180 cm tall) quantities and whether numbers are abstract (2 + 3 = 5) or grounded (2 apples + 3 apples = 5 apples)
. Thinking of downstream tasks in this more fine-grained way reveals more clearly what aspects of numerical reasoning models can do reasonably well and where they fail.
How Many Data Points is a Prompt Worth?
I like this paper because it has clear practical value for NLP practitioners. Prompts are a tool to incorporate useful inductive bias via domain expertise into the data by leveraging the power of pre-trained language models. Below is an example from a QA dataset with a prompt consisting of a pattern (in bold) and a question (in italics):
“Posthumous marriage – Posthumous marriage (or necrogamy) is a marriage in which one of the participating members is deceased. It is legal in France and similar forms are practiced in Sudan and China. Since World War I, France has had hundreds of requests each year, of which many have been accepted. Based on the previous passage, can u marry a dead person in france ? <MASK>”
The model's prediction for the <MASK> token is then mapped to a class using a verbalizer (here "Yes" → True, "No" → False). While prompts can be used in zero-shot settings as in the GPT-3 paper, in most practical settings we would like to fine-tune the model using the prompt. This paper does a side-by-side comparison of standard fine-tuning of a masked language model with and without a prompt. They find that prompts are generally very beneficial: they can be worth between 280 data points (RTE) and 3,500 data points (MNLI). Overall, they are most useful in low- and medium-data scenarios. So if you are prototyping a new NLP application using a pre-trained language model, it is worth drafting a number of prompts before labelling hundreds of examples.
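The pattern-plus-verbalizer machinery is easy to sketch. The function and verbalizer below mirror the boolean-QA example above; the names and exact wrapping are my own illustration:

```python
def apply_prompt(passage, question, mask_token="<MASK>"):
    """Wrap an input in a prompt pattern; the masked language model
    then fills in the mask token."""
    return (f"{passage} Based on the previous passage, "
            f"{question} ? {mask_token}")

# The verbalizer maps the model's token prediction to a class label.
VERBALIZER = {"Yes": True, "No": False}

def verbalize(mask_prediction):
    return VERBALIZER[mask_prediction]

prompt = apply_prompt("It is legal in France.",
                      "can you marry a dead person in france")
print(prompt)
print(verbalize("Yes"))  # -> True
```

The prompt turns the classification task into the masked-word prediction task the model was pre-trained on, which is where the "free" extra data points come from.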