Highlights, new tasks & graph ML in 2021; Safer pre-trained models; Embeddings: Larger ≠ better
I hope you've had a good start to the new year. This newsletter covers my and others' highlights of 2021. I also discuss recent pre-trained models that put more emphasis on safety and recent text similarity models where large is not always better.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.
Looking back at 2021 👀
ML and NLP Research Highlights of 2021 💡
I wrote up some of my research highlights in 2021 in this post. Overall, most of the trends I observed revolved around pre-trained models and their capabilities—how to train them more effectively, how to do few-shot learning with them, how to use them efficiently, how to evaluate them, using them for new applications such as program synthesis, etc. What were your highlights? You can share them by replying to the tweet below and I'll summarize them in the next newsletter.
New ML Tasks in 2021 💽
Another area I'm quite excited about is when ML is used to do new things. While such applications can be practically useful, such as using AlphaFold 2.0 to accelerate the drug discovery process, I particularly enjoyed unconventional tasks or tasks that provide a new perspective on existing research areas. Here are my favourites from 2021:
BIG-bench contains a smorgasbord of diverse, sometimes quirky tasks for probing language models (LMs). Predicting checkmate? ✅ Guessing movies based on emojis? ✅ Reasoning in a fantasy world? ✅ The latter task includes examples like:
As an amputee you experience phantom arm syndrome. Then one day you realize you can use it to punch ghosts. Your left arm is amputated but you still have your right arm. Do you use your left arm to hit the late Elvis Presley to make him stop bothering you? Answers: Yes / No
I don't know about you but I would prefer my ML model to resolve its problems without resorting to punching ghosts 👻.
Cryptic crossword puzzles Solving cryptic crossword puzzles is a task that has attracted recent interest in the form of two datasets and an associated BIG-bench task. Crossword AIs have recently surpassed humans in a tournament, but cryptic clues are still very challenging for current models: they require an understanding of both semantics and wordplay, such as correctly identifying and resolving an anagram (see below for an example from Cryptonite).
Reconstructing ancient texts The task of masked language modeling, filling in missing tokens in a text, lends itself directly to predicting missing tokens in the transliterated texts of ancient Akkadian clay tablets (Lazar et al., 2021). Such a setting is arguably more interesting than language modeling on the Penn Treebank—and trained models are practically useful by assisting experts in transcribing texts in extinct languages.
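To make the fill-in-the-blank idea concrete, here is a deliberately simplified, count-based stand-in for a neural masked language model: it predicts a masked token from its immediate left and right neighbours based on corpus counts. The corpus and tokens are made up for illustration; real systems like the one in the paper use Transformer-based models.

```python
from collections import Counter

def train_fill_mask(corpus):
    """Count which token appears between each (left, right) context pair."""
    counts = {}
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(1, len(tokens) - 1):
            ctx = (tokens[i - 1], tokens[i + 1])
            counts.setdefault(ctx, Counter())[tokens[i]] += 1
    return counts

def fill_mask(counts, sentence):
    """Replace the single [MASK] token with the most frequent token
    observed in the same (left, right) context."""
    tokens = sentence.split()
    i = tokens.index("[MASK]")
    ctx = (tokens[i - 1], tokens[i + 1])
    best, _ = counts[ctx].most_common(1)[0]
    return " ".join(tokens[:i] + [best] + tokens[i + 1:])

corpus = [
    "the king built a temple",
    "the king built a palace",
    "the king built a temple",
]
counts = train_fill_mask(corpus)
print(fill_mask(counts, "the king [MASK] a temple"))  # the king built a temple
```

A neural masked language model makes the same kind of prediction but conditions on the full sentence rather than only the adjacent tokens, which is what makes it useful for fragmentary tablets.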
Decontextualization is a new NLP task that requires rewriting an in-context sentence to be interpretable out of context (Choi et al., 2021). This means dealing with various phenomena such as resolving coreferences and anaphora and adding relevant modifiers or necessary background information. Decontextualization is useful, for instance, in the context of question answering: instead of providing a sentence answer, which may be difficult to understand without the surrounding context, models can produce a sentence that stands on its own. See below for an example of what decontextualization looks like in practice.
Text-based NP enrichment is a new information extraction task that focuses on extracting all relations (that are mediated by prepositions) between noun phrases in a text. NP enrichment unifies and complements many existing entity-related tasks such as relation extraction, semantic role labeling, entity linking, coreference resolution, etc. You can see what the annotation for this task looks like in the example below (Elazar et al., 2021).
I am particularly excited by the newer tasks that explicitly go beyond core NLP tasks such as coreference resolution, which can be relatively narrow. Coreference resolution, for instance, is still useful to probe a model's reasoning abilities, e.g., as part of the Winograd schema and later instantiations such as Winogender for gender bias and WinoGrande for commonsense reasoning. However, as models become more powerful, we can apply them to a broader, more general set of problems, which may also be more practically useful.
I am constantly amazed by the emerging capabilities of models in ML and NLP and the new settings where they are applied, and I'm excited about the new things that we will be able to do this year.
Geometric & Graph ML in 2021 📐
Graph machine learning is one of the hottest emerging areas in ML. Graph ML methods are useful in a variety of domains, from modelling network data to molecules, interactions in physics, relations between entities, mathematical graphs, etc. Michael Bronstein and Petar Veličković interviewed experts in the area on their impressions of 2021 and predictions for 2022. The article is a great read for anyone who wants to get up to speed in this area. For a quick overview, you can check out their take-home messages highlighting, among others, the importance of message passing (networks that update the hidden states of nodes based on information from adjacent nodes), the challenges of reasoning and generalisation, and the combination of Transformers with graph neural networks.
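As a minimal illustration of message passing (my own toy example, not from the article): each node's new state aggregates its own state with those of its neighbours. Mean aggregation is just one simple choice; real graph neural networks use learned transformations.

```python
import numpy as np

def message_passing_step(states, adj):
    """One round of message passing: each node averages its own state
    with the states of its neighbours (mean aggregation)."""
    n = len(states)
    new_states = np.zeros_like(states)
    for i in range(n):
        neighbours = [j for j in range(n) if adj[i][j]]
        msgs = states[[i] + neighbours]  # include the node's own state
        new_states[i] = msgs.mean(axis=0)
    return new_states

# A path graph 0 - 1 - 2 with scalar node features
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
states = np.array([[1.0], [0.0], [0.0]])
states = message_passing_step(states, adj)
print(states)  # node 1 now carries information from node 0
```

After one step, information from node 0 has reached node 1; after a second step it would reach node 2, which is why depth controls how far information propagates in such networks.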
Papers with Code 2021 👩💻
Papers with Code, one of the best resources for finding results, papers, and code in ML—which was also recently integrated into the ACL Anthology (see the code and data section at the bottom of a paper, such as this one)—highlights the top trending papers, libraries, and datasets of 2021. The most talked-about paper proposes a method to synthesize new views of an image from arbitrary camera angles, which captured people's attention with the demo below featuring impressive synthesized camera shots.
My 2021 👨💻
My main threads of research in the past year were parameter efficiency (how can we make pre-trained models more efficient?), cross-lingual generalisation (how can multilingual models generalise better to under-represented languages?), and multilingual evaluation (see my Google Scholar for the detailed publications). Some of the most fulfilling work was collaborating with passionate researchers such as from the Masakhane community on building datasets in their own languages. I'm looking forward to doing more of this in 2022.
Like many, I had ups and downs. I've had less energy ⚡️ to do things outside of work, so have been less active on Twitter and only written blog posts and newsletters infrequently. I also regretted not being able to meet people from our community in person.
Overall, I'm hopeful for the new year and that things will slowly start going back to normal. I'm excited to write more in my spare time again and I'm looking forward to seeing many of you in person, at conferences or similar events.
Safer Pre-trained Models 😷
Prior work has found that pre-trained models are biased and can generate discriminatory or even toxic language. Ensuring safe responses is thus an important aspect of the development of such models. Recent models such as LaMDA, InstructGPT, and Gopher developed by Google, OpenAI, and DeepMind respectively emphasize safety in their model evaluation and training. A common recipe is to fine-tune pre-trained models on data labeled with safety ratings by human annotators—either by training a reward model and fine-tuning with reinforcement learning or by training a detector and filtering out unsafe responses.
For LaMDA, crowdworkers annotate model responses based on different safety criteria. The model is then fine-tuned both to generate dialogue responses as well as to predict the annotated safety labels. This multi-task setting is not only more efficient but also enables sharing information between the tasks. At test time, candidate responses where the model predicts a low safety rating are filtered out. The authors find that this fine-tuning setting significantly improves the safety of generated responses.
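The filtering step can be sketched as follows. Note that `safety_score` here is a hypothetical stand-in: in LaMDA, the safety prediction is a task of the fine-tuned model itself rather than a separate function.

```python
def filter_candidates(candidates, safety_score, threshold=0.8):
    """Keep only candidate responses whose predicted safety score
    clears the threshold; the system then picks among the survivors."""
    return [c for c in candidates if safety_score(c) >= threshold]

# Toy stand-in for a learned safety predictor (purely illustrative)
def toy_safety_score(response):
    return 0.1 if "insult" in response else 0.95

candidates = ["Here is a helpful answer.", "Here is an insult."]
safe = filter_candidates(candidates, toy_safety_score)
print(safe)  # ['Here is a helpful answer.']
```

The appeal of this design is that generation quality and safety can be traded off at inference time simply by adjusting the threshold, without retraining the model.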
For InstructGPT, GPT-3 is first fine-tuned on demonstrations of annotators following instructions in a supervised setting. In a second step, raters rank multiple outputs of the fine-tuned model, and these rankings are used to train a reward model. Finally, the model is fine-tuned based on the output of the reward model using reinforcement learning. In evaluations, the outputs of InstructGPT are significantly preferred over GPT-3's outputs, and InstructGPT has replaced GPT-3 in the API.
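The reward model in such pipelines is typically trained with a pairwise ranking loss: it should score the response the rater preferred higher than the rejected one. Here is a numpy sketch of the standard formulation (my illustration, not OpenAI's code):

```python
import numpy as np

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected): the loss is small when the
    reward model already scores the preferred response higher."""
    diff = reward_chosen - reward_rejected
    return -np.log(1.0 / (1.0 + np.exp(-diff)))

# The loss shrinks as the margin between preferred and rejected grows
low = pairwise_ranking_loss(2.0, 0.0)   # model agrees with the rater
high = pairwise_ranking_loss(0.0, 2.0)  # model disagrees with the rater
print(low, high)
```

Minimizing this loss over many ranked pairs yields a scalar reward function that the reinforcement learning step can then optimize against.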
For Gopher, the authors perform an extensive analysis of the toxicity and bias of the model. They find that larger models respond to toxic prompts with more toxic output but do not amplify training-data toxicity when unprompted. They also observe that large models are prone to bias against subgroups in a few-shot setting and that larger models are not able to overcome limitations in the coverage of dialects.
Overall, prior work as well as these recent efforts demonstrate that we cannot just pre-train models and expect them to produce safe or harmless responses. Instead, safety and inclusion need to be key design criteria that are included as part of the development of such models. This requires clearly enumerating and defining potential safety risks, collecting and annotating relevant data as well as explicitly training models to demonstrate safe behaviour. For recent reviews that highlight potential risks associated with language models, have a look here. I hope to see safety being considered as a design criterion and evaluation dimension in more work going forward.
Embeddings: Larger ≠ better 🏋️♀️
Nils Reimers analyzes embeddings from OpenAI's recently released embeddings endpoint. OpenAI provides embeddings in different sizes, from 1,024 to 12,288 dimensions. He evaluates them on three downstream tasks—text similarity, text search, and code search.
He finds that the text similarity models perform much worse than state-of-the-art models such as all-mpnet-base-v2 and all-roberta-large-v1—MPNet and RoBERTa models respectively fine-tuned on 1B sentence pairs. They are also 6 points weaker than extremely small models with just 22M parameters, such as all-MiniLM-L6-v2, which can run in a browser. On text search, they perform competitively but not quite at the level of the state of the art.
At the same time, due to their high dimensionality, the OpenAI embeddings are much slower than existing embedding models that have up to 768 dimensions and take up much more memory. He highlights that encoding the 21M passages of English Wikipedia in 384-dimensional embeddings requires about 16 GB while using 12,288 dimensions requires 516 GB of memory. Not only does retrieval using high-dimensional embeddings consume much more memory but it is also much slower than using smaller models.
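The memory figures follow directly from the embedding sizes; the back-of-the-envelope calculation below assumes 2 bytes per dimension (fp16), which matches the numbers above:

```python
def index_size_gb(num_passages, dims, bytes_per_dim=2):
    """Memory to store one embedding per passage (fp16 = 2 bytes/dim)."""
    return num_passages * dims * bytes_per_dim / 1e9

passages = 21_000_000  # English Wikipedia passages
small = index_size_gb(passages, 384)     # ~16 GB
large = index_size_gb(passages, 12_288)  # ~516 GB
print(round(small), round(large))
```

Since memory grows linearly with dimensionality, a 32× larger embedding means a 32× larger index—before even considering the cost of searching it.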
Retrieval is also important for recent retrieval-augmented models such as Retro, which retrieve from corpora of up to 2T tokens using frozen BERT representations (1,024 dimensions for BERT-large). Encoding such corpora with 12,288 dimensions would be prohibitive. Text similarity and retrieval-style tasks are among the few settings these days where more parameters do not give you more bang for your buck; instead, for most realistic applications, low-dimensional performant embeddings are the way to go. Check out Nils' library sentence-transformers as well as the above models for efficient, powerful sentence representations.
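The speed cost of high dimensionality is easy to see: brute-force retrieval computes a similarity against every passage, so each query costs O(num_passages × dims). A minimal numpy sketch with random stand-in embeddings (illustrative only; real systems use approximate nearest-neighbour indices):

```python
import numpy as np

def retrieve(query, corpus_embeddings, top_k=3):
    """Cosine similarity reduces to a dot product on L2-normalised
    embeddings; each query costs O(num_passages * dims)."""
    scores = corpus_embeddings @ query  # (num_passages,)
    top = np.argsort(-scores)[:top_k]
    return top, scores[top]

rng = np.random.default_rng(0)
dims = 384
corpus = rng.normal(size=(1000, dims))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit length
query = corpus[42]  # a query identical to passage 42
top, scores = retrieve(query, corpus)
print(top[0])  # 42 — the matching passage ranks first
```

Swapping 384 dimensions for 12,288 multiplies the work of that matrix-vector product by 32, which is exactly the slowdown the analysis points to.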