<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[NLP News]]></title><description><![CDATA[Regular analyses of advances in natural language processing and machine learning.]]></description><link>https://newsletter.ruder.io</link><image><url>https://s3.amazonaws.com/revue/profiles/images/000/027/670/thumb/sebastian_ruder_profile_photo_square.jpg?1616408613</url><title>NLP News</title><link>https://newsletter.ruder.io</link></image><generator>Substack</generator><lastBuildDate>Wed, 08 Apr 2026 14:28:37 GMT</lastBuildDate><atom:link href="https://newsletter.ruder.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Sebastian Ruder]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[nlpnewsletter@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[nlpnewsletter@substack.com]]></itunes:email><itunes:name><![CDATA[Sebastian Ruder]]></itunes:name></itunes:owner><itunes:author><![CDATA[Sebastian Ruder]]></itunes:author><googleplay:owner><![CDATA[nlpnewsletter@substack.com]]></googleplay:owner><googleplay:email><![CDATA[nlpnewsletter@substack.com]]></googleplay:email><googleplay:author><![CDATA[Sebastian Ruder]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Evolving Landscape of LLM Evaluation]]></title><description><![CDATA[Throughout recent years, LLM capabilities have outpaced evaluation benchmarks. This is not a new development. What is new is that the set of standard LLM evals has further narrowed&#8212;and there are questions regarding the reliability of even this small set of benchmarks.]]></description><link>https://newsletter.ruder.io/p/the-evolving-landscape-of-llm-evaluation</link><guid isPermaLink="false">https://newsletter.ruder.io/p/the-evolving-landscape-of-llm-evaluation</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Mon, 13 May 2024 18:54:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe92dc248-5d16-43b1-81db-02616bce3e58_2384x1286.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Edit, May 16: Added mention of <strong>Benchmarking Benchmark Leakage in Large Language Models</strong> (<a href="https://arxiv.org/abs/2404.18824">Xu et al., 2024</a>).</p><p>Throughout recent years, LLM capabilities have outpaced evaluation benchmarks. This is not a new development.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> The set of canonical LLM evals has further narrowed to a small set of benchmarks such as <a href="https://arxiv.org/abs/2009.03300">MMLU</a> for general natural language understanding, <a href="https://arxiv.org/abs/2110.14168">GSM8k</a> for mathematical reasoning, and <a href="https://arxiv.org/abs/2107.03374">HumanEval</a> for code, among others.
Recently, concerns regarding the reliability of even this small set of benchmarks have emerged.</p><div class="pullquote"><p><em>"Datasets are the telescopes of our field."&#8212;<a href="https://youtu.be/t_A36DDcG_0?t=964&amp;ref=ruder.io">Aravind Joshi</a></em></p></div><p>Without reliable benchmarks, we are effectively flying blind. With many public benchmarks no longer seen as hallmarks of objectivity, ML researchers and practitioners increasingly rely on their intuition to assess a model&#8217;s &#8216;vibe&#8217;, i.e., what interactions with it <em>feel</em> like.</p><p>Let&#8217;s explore how we got here and what the path forward may look like.</p><h3>Looking Back on Benchmarking</h3><p>Benchmarks have been integral to the development of ML and NLP (see <a href="https://www.ruder.io/nlp-benchmarking/#a-brief-history-of-benchmarking">my previous post</a> for a brief history of ML benchmarking). Breakthrough architectures such as <a href="https://arxiv.org/abs/1512.03385">ResNets</a> or <a href="https://arxiv.org/abs/1706.03762">Transformers</a> first captured people&#8217;s attention through their impressive ImageNet and WMT results respectively. Benchmarks such as <a href="https://paperswithcode.com/dataset/mnist">MNIST</a>, <a href="https://paperswithcode.com/dataset/cifar-100">CIFAR-100</a>, and <a href="https://paperswithcode.com/dataset/imagenet">ImageNet</a> have been around for more than a decade and are still in use today.</p><p>In recent years, in light of rapidly improving model capabilities, the time from a benchmark&#8217;s creation to its saturation&#8212;when model performance exceeds human performance&#8212;has shrunk dramatically.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!aQpW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ada3dd7-cff8-4c63-b582-7686d31f5e20_2046x984.png" alt="Benchmark saturation over time for popular benchmarks"><figcaption class="image-caption">Benchmark saturation over time for popular benchmarks. Initial performance and human performance are normalised to -1 and 0 respectively (<a href="https://aclanthology.org/2021.naacl-main.324.pdf">Kiela et al., 2021</a>). Performance on MNIST and Switchboard saturated only after 20+ years, while performance on GLUE and SQuAD 2.0 saturated within 1&#8211;2 years.</figcaption></figure></div><p>Beyond faster saturation, two further problems are contributing to the current benchmark crisis: memorization and overfitting.</p><h3>Memorization</h3><p>Most popular benchmarks are either directly available on the web or may have been uploaded in different forms to GitHub or other platforms. Current models are trained on much of the Internet, with newer models trained on more recent snapshots of CommonCrawl. Unless benchmark data is specifically filtered out of the pre-training corpus, models are invariably exposed to test data during pre-training. While they may not memorize every example, training on the data makes it more likely that the model produces the correct prediction.</p><p>As a result, LLMs including Aquila2 and Qwen models <a href="https://arxiv.org/abs/2404.18824">repeat training and even test examples from MATH and GSM8k verbatim</a>. GPT models perform much better on coding problems <a href="https://arxiv.org/abs/2310.10628">released before their pre-training data cut-off</a>. LLMs also <a href="https://arxiv.org/abs/2312.16337">perform much better on datasets released before the pre-training data cut-off</a>&#8212;and barely improve over a majority baseline (!) on uncontaminated classification tasks in zero- and few-shot settings<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><p>What can we do to mitigate memorization? <a href="https://arxiv.org/abs/2404.00699">Ravaut et al. (2024)</a> highlight some best practices to reduce contamination in their survey:</p><ol><li><p><strong>Encrypting evaluation datasets</strong>. The authors of the <a href="https://arxiv.org/abs/2309.16575">MTOB</a> dataset (discussed in a <a href="https://newsletter.ruder.io/p/true-zero-shot-mt">previous newsletter</a>) did this. The downside: if the dataset is shared unencrypted at any point, it is hard to contain it again.</p></li><li><p><strong>Scanning newly released evaluation datasets</strong>. By checking new test sets for overlap with pre-training data (a minimal version of such a check is sketched below), we can make sure to only evaluate on uncontaminated data.</p></li><li><p><strong>Preventing data leakage to closed-source APIs</strong>. Evaluating via closed-source APIs inadvertently leaks data to the provider, whether or not the data is available online.</p></li></ol>
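<p>As a rough illustration of the second point, a first-pass scan for verbatim leakage can be as simple as measuring n-gram overlap between benchmark examples and a sample of the pre-training corpus. The sketch below is illustrative only: the 13-gram window and whitespace tokenization are arbitrary choices, not a recipe from the survey.</p><pre><code>def ngrams(text, n=13):
    """Return the set of whitespace-tokenized n-grams of a string."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(0, len(tokens) - n + 1))}

def flag_contaminated(benchmark_examples, corpus_docs, n=13):
    """Flag benchmark examples whose n-grams also appear in a corpus sample.

    benchmark_examples: list of strings (e.g. questions plus reference answers)
    corpus_docs: iterable of strings sampled from the pre-training corpus
    """
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams.update(ngrams(doc, n))
    flagged = []
    for idx, example in enumerate(benchmark_examples):
        overlap = ngrams(example, n).intersection(corpus_ngrams)
        if overlap:
            flagged.append((idx, len(overlap)))
    return flagged

# Anything flagged here should be excluded from the eval, or at least reported.
# flagged = flag_contaminated(test_set_texts, crawl_sample_texts)</code></pre>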
<h3>Overfitting</h3><p>With increased attention afforded to LLMs and billions of dollars in funding at play, the pressure to do well on public benchmarks has increased. A couple of percentage points on an established benchmark such as <a href="https://arxiv.org/pdf/2009.03300">MMLU</a> can make or break an investor presentation or convince potential customers to try a model.</p><p>As a result, there is a risk of overfitting to public benchmarks if models are optimized to do well on them. One route to overfitting is the creation of synthetic training data, which may inadvertently reflect the use cases in the test data rather than a broader set of model applications. For example, <a href="https://arxiv.org/abs/2405.00332">a recent study</a> found that several model families, such as Phi and Mistral, show evidence of systematic overfitting on the GSM8k grade-school math dataset.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Hmpz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe92dc248-5d16-43b1-81db-02616bce3e58_2384x1286.png" alt="Drop in performance between GSM8k and GSM1k across models"><figcaption class="image-caption">Models arranged by their drop in performance between GSM8k and the newly created GSM1k (lower is worse). Mistral and Phi models show a drop of 10% on GSM1k compared to GSM8k (<a href="https://arxiv.org/abs/2405.00332">Zhang et al., 2024</a>).</figcaption></figure></div><p>For benchmarks such as <a href="https://arxiv.org/abs/2306.05685">MT-Bench</a> that employ a specific model (often GPT4) as evaluator, there is the additional risk of overfitting to the biases of the evaluator. If we optimize our model to do well on MT-Bench, we may end up training on GPT4-created data. This can lead to higher scores on GPT4-rated evals but worse performance when tested by humans, as models <a href="https://arxiv.org/abs/2305.15717">mimic GPT4&#8217;s style but not other aspects such as its factuality</a>.</p><p>Training on such model-created synthetic data may thus lead to gains and improve evals in the short term; in the long term, it can lead to unanticipated biases and blind spots in the user experience.</p><p>Taking into account the possibility of both memorization and overfitting, results on popular static evaluation benchmarks should be taken with a grain of salt.</p><h2>To Vibe or Not to Vibe?</h2><p>Instead of blindly trusting public benchmarks or result tables in press releases, it is thus more important than ever to run your own tests. LLMs are versatile, and different people have different preferences and use cases. <a href="https://aclanthology.org/2020.emnlp-main.393/">Utility is in the eye of the user</a>. However, very few have the means to exhaustively evaluate and compare many different LLMs.</p><p>The currently preferred evaluation benchmark for such &#8216;vibe-based&#8217; evals is <a href="https://chat.lmsys.org/">Chatbot Arena</a>, which crowd-sources user ratings in blind A/B test conversations with various LLMs. The platform enables large-scale user testing and aggregates win rates across 10,000s of conversations in a central leaderboard.</p>
<p>Chatbot Arena is not perfect. Humans can be fooled and may prefer a response due to a variety of factors, including its formatting, style, or type of humor, which can be exploited and optimized for. Chatbot Arena covers a narrow range of domains, and conversations vary wildly in quality. Nevertheless, it provides an uncontaminated evaluation of chat user interactions (I&#8217;ve <a href="https://newsletter.ruder.io/p/command-r#%C2%A7chatbot-arena">written previously</a> about it); you can also check out Nathan Lambert&#8217;s thoughts on Chatbot Arena in <a href="https://www.interconnects.ai/p/chatbotarena-the-future-of-llm-evaluation">his recent post</a>.</p><h2>The Future of Evaluation</h2><p>The time when benchmarks lasted multiple decades has passed. Going forward, we will rely less on public benchmark results. Instead, the <strong>ability to evaluate a model directly for a specific downstream use case will be much more important</strong>.</p><p>This requires <strong>a) knowledge of how to efficiently and robustly evaluate LLMs</strong>; <strong>b) the infrastructure</strong> to enable this; and <strong>c) domain expertise</strong> to know how the problem can be modeled using an LLM. Institutions will be less likely to share such evals given that a release means they will become contaminated, reducing their usefulness.</p><p><strong>Benchmark creators should mitigate the risk of contamination</strong> in the design of their test data as much as possible. <strong>Internet data should be used sparingly</strong>&#8212;and not as the source of the solution. <strong>New benchmark data should ideally be created by humans</strong> from scratch.</p><p>As we already evaluate LLMs the way we assess humans&#8212;on standardized exams and general aptitude and proficiency tests&#8212;<strong>we should also follow a similar design process for future LLM evals: regularly updated tests, assuming access to all prior exam data and accounting for a willingness to exploit loopholes</strong>.</p><p>Overall, we need to rethink the way we evaluate LLMs and shift to efficient evaluation processes that can keep pace with model advances.
</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The <a href="https://aiindex.stanford.edu/wp-content/uploads/2021/11/2021-AI-Index-Report_Master.pdf">AI Index Report 2021</a> mentions this and I&#8217;ve previously written about <a href="https://www.ruder.io/nlp-benchmarking/">challenges in NLP benchmarking</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The authors examined less powerful LLMs up to GPT3.5-Turbo (March 2023 release). It is likely that more powerful LLMs would exhibit stronger zero-shot and few-shot capabilities.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Command R+]]></title><description><![CDATA[The Top Open-Weights LLM + RAG and Multilingual Support]]></description><link>https://newsletter.ruder.io/p/command-r</link><guid isPermaLink="false">https://newsletter.ruder.io/p/command-r</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Mon, 15 Apr 2024 17:23:52 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5a71de9d-6d4e-4c69-b464-1ae8ebbc20d9_2000x1125.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This post is an update on what I&#8217;ve been up to since I joined Cohere. I&#8217;ve had fun contributing to the launch of <a href="https://txt.cohere.com/command-r/">Command R</a> and <a href="https://txt.cohere.com/command-r-plus-microsoft-azure/">R+</a>, the latest Cohere models. I&#8217;ll discuss more details once the tech report is out; in the meantime I&#8217;ll share what I&#8217;m most excited about.</p><p>Command R+ is ranked as the top open-weights model on <a href="https://chat.lmsys.org/?leaderboard">Chatbot Arena</a>, even outperforming some versions of GPT-4. 
Why is this exciting?</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!QldV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46722be1-5670-43e9-9d5e-f24fced20621_1756x694.jpeg" alt="Chatbot Arena leaderboard"><figcaption class="image-caption">Chatbot Arena leaderboard as of April 9, 2024 (source: <a href="https://x.com/lmsysorg/status/1777630133798772766">lmsys.org</a>).</figcaption></figure></div><h2><strong>Chatbot Arena</strong></h2><p>Let&#8217;s talk about why we should care about Chatbot Arena rankings in the first place. I&#8217;ve written in the past about <a href="https://www.ruder.io/nlp-benchmarking/">challenges in NLP benchmarking</a>. Pre-LLM benchmarks such as <a href="https://super.gluebenchmark.com/">SuperGLUE</a> mostly consist of classification tasks and no longer provide sufficient signal to differentiate the latest generation of LLMs. More recent benchmarks such as <a href="https://arxiv.org/abs/2306.05685">MT-Bench</a> consist of small samples of open-ended questions and rely on LLMs as evaluators, which have their <a href="https://arxiv.org/abs/2305.17926">own sets</a> <a href="https://arxiv.org/abs/2307.03025">of biases</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p><a href="https://arxiv.org/abs/2009.03300">MMLU</a>, one of the most widely used benchmarks, consists of 14k multiple-choice questions sourced from public sources covering 57 domains and has been featured prominently in <a href="https://openai.com/research/gpt-4">GPT-4</a>, <a href="https://www.anthropic.com/news/claude-3-family">Claude 3</a>, and <a href="https://mistral.ai/news/mistral-large/">Mistral Large</a> posts. The data is <a href="https://www.youtube.com/watch?v=hVade_8H8mE">not without errors</a>, however, and given its release in 2020, the training data of recent models is likely at <a href="https://arxiv.org/abs/2310.10628">least partially contaminated</a>.</p><p><a href="https://chat.lmsys.org/">Chatbot Arena</a> is a platform where users rate conversations in a blind A/B test. They can continue the conversation until they choose a winner. Of course, short user interactions often do not reveal more advanced model capabilities, and annotators can be <a href="https://arxiv.org/abs/2305.15717">fooled by authoritative but non-factual answers</a>. Nevertheless, this is the closest we currently have to an assessment of realistic user interactions. As models are always evaluated on new user conversations, there is no risk of data contamination.</p><p>Command R+ outperforms versions of GPT-4 on Chatbot Arena while being much cheaper to use.
It also does well on use cases that are under-represented in Chatbot Arena such as RAG, tool use, and multilinguality.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!WAOP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e12e01-1fd0-451e-ae6c-5e4a6a0ceeab_2595x900.png" alt="Capability comparison and token costs for Command R+, Mistral-Large, and GPT4-turbo"><figcaption class="image-caption">(left) Performance comparison of Command R+, Mistral-Large, and GPT4-turbo on three key capabilities: Multilingual, RAG, and Tool Use. (right) Comparison of input and output token costs per million tokens for models available on Azure. <a href="https://txt.cohere.com/command-r-plus-microsoft-azure/">Source</a></figcaption></figure></div><h2><strong>A GPT-4 Level Model on Your Computer</strong></h2><p>Command R+ consists of 104B parameters with <a href="https://huggingface.co/CohereForAI/c4ai-command-r-plus">publicly available weights</a>. This is the first time that a model close to GPT-4 performance is available for research use. With the right setup, Command R+ can generate text at a <a href="https://x.com/huggingface/status/1778847929295679713">rate of 111 tokens/s</a> (!)
when deployed locally.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> To understand how to effectively prompt the model, check out the <a href="https://docs.cohere.com/docs/prompting-command-r">prompting guide</a>.</p>
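<p>If you want to try the open weights yourself, a minimal local-inference sketch with Hugging Face Transformers might look as follows. This is illustrative only: it assumes a recent transformers release with Command R+ support and enough GPU memory for the 104B parameters (in practice, quantized weights are the more realistic option on a single machine).</p><pre><code>import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/c4ai-command-r-plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" shards the weights across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain RAG in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))</code></pre>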
<p>I&#8217;m excited about what this means for the open-source community and research, with the gap between closed-source and open-weight models closing and SOTA-level conversational models becoming more easily accessible.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!A6sk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc31cece-3494-44fa-b08f-21638bb7d28a_1158x764.jpeg" alt="Gap between closed-source and open-weights models on Chatbot Arena over time"><figcaption class="image-caption">The gap between closed-source and open-weights models on Chatbot Arena is closing (source: <a href="https://x.com/maximelabonne/status/1779801605702836454">Maxime Labonne</a>).</figcaption></figure></div><p>Other recently released models such as <a href="https://huggingface.co/databricks/dbrx-instruct">DBRX</a> (132B parameters), <a href="https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1-4bit">Mixtral 8x22B</a> (176B parameters), and <a href="https://x.ai/blog/grok-os">Grok-1</a> (314B parameters) are based on a Mixture-of-Experts (MoE) architecture, trading off inference speed for memory costs. While these models only activate a subset of parameters for each token, they still require storing all parameters in memory, which makes them harder to use locally. So far, they are either not available on <a href="https://chat.lmsys.org/">Chatbot Arena</a> or rank much lower than Command R+.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>Command R+ comes with a non-commercial license. If you want to self-host or fine-tune it for commercial purposes, <a href="https://x.com/aidangomez/status/1775878627231515049">we&#8217;ll work with you</a> to find an arrangement that works for you.</p><h2>RAG and Tool Use</h2><p>While Command R+ can be used as a chatbot, it has been designed for enterprise use. Faithful and verifiable responses are especially important in an enterprise setting. Reducing hallucinations and providing trustworthy responses are important research challenges. There are different ways to mitigate hallucinations, ranging from debiasing and model editing to specialized decoding strategies (see <a href="https://arxiv.org/abs/2311.05232">Huang et al. (2023)</a> for an overview).</p>
<p>Retrieval-augmented generation (RAG; <a href="https://arxiv.org/abs/2005.11401">Lewis et al., 2020</a>), which conditions the LLM&#8217;s generation on retrieved documents, is the most practical paradigm IMO. Command R+ uses RAG with in-line citations to provide grounded responses.</p><p>However, evaluation of the quality and trustworthiness of such responses is challenging and has motivated the development of new evaluation frameworks such as Attributable to Identified Sources (AIS; <a href="https://direct.mit.edu/coli/article/49/4/777/116438/Measuring-Attribution-in-Natural-Language">Rashkin et al., 2023</a>).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> On our internal human evaluation measuring citation fidelity, Command R+ outperforms GPT4-turbo. On public multi-hop QA benchmarks, it outperforms models at the same price point such as Claude 3 Sonnet and Mistral-large.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p>
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ykm3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf8c88bf-3b02-46da-b97d-a665ba80d9ee_2365x950.png" alt="Citation fidelity and multi-hop QA results for Command R+ and other models"><figcaption class="image-caption">(left) Human head-to-head preference results using a holistic grading scheme combining text fluency, citation quality, and overall utility. (right) Accuracy of multi-hop REACT agents powered by various models with access to the same search tools retrieving from Wikipedia (HotpotQA) and the Internet (Bamboogle and StrategyQA). <a href="https://txt.cohere.com/command-r-plus-microsoft-azure/">Source</a></figcaption></figure></div><p>You can easily use RAG via the <a href="https://docs.cohere.com/docs/retrieval-augmented-generation-rag">API</a>, either over the Internet or over your own documents. A complete RAG workflow additionally involves document search and reranking; see <a href="https://github.com/cohere-ai/notebooks/blob/main/notebooks/Vanilla_RAG.ipynb">this Colab</a> for an example RAG setup on Wikipedia.</p>
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Example of basic RAG usage with Command models using the Cohere API (<a href="https://docs.cohere.com/docs/retrieval-augmented-generation-rag">source</a>).</figcaption></figure></div><p>In enterprise settings, seamless integrations with existing APIs and services is crucial. I&#8217;ve <a href="https://newsletter.ruder.io/p/tool-augmented-llms">written before</a> about the promise of tool-augmented models. Tools can help decompose complex problems and make LLMs outputs more interpretable by enabling users to look at the trace of API calls. Command R+ has been trained for zero-shot multi-step tool use. 
<p>In enterprise settings, seamless integration with existing APIs and services is crucial. I&#8217;ve <a href="https://newsletter.ruder.io/p/tool-augmented-llms">written before</a> about the promise of tool-augmented models. Tools can help decompose complex problems and make LLM outputs more interpretable by enabling users to look at the trace of API calls. Command R+ has been trained for zero-shot multi-step tool use. On public tool use benchmarks, it outperforms GPT4-turbo.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!lJP4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5650f97-8118-4abc-903d-6046ee1981ac_1400x900.png" alt="Conversational tool-use and function-calling results"><figcaption class="image-caption">Conversational tool-use and single-turn function-calling evaluations using Microsoft&#8217;s ToolTalk (Hard) benchmark (<a href="https://arxiv.org/abs/2311.10775?ref=txt.cohere.com">Farn &amp; Shin, 2023</a>) and Berkeley's Function Calling Leaderboard (BFCL) (<a href="https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html?ref=txt.cohere.com">Yan et al., 2024</a>). <a href="https://txt.cohere.com/command-r-plus-microsoft-azure/">Source</a></figcaption></figure></div><p>The recommended way to leverage multi-step tool use with Command R+ is via <a href="https://python.langchain.com/docs/integrations/providers/cohere?ref=txt.cohere.com">LangChain</a>. To teach the model to use a new tool, you only need to provide the name, definition (a Python function), and the arguments schema.
The model can then be used as a ReAct agent in LangChain with a range of tools (see <a href="https://github.com/cohere-ai/notebooks/blob/main/notebooks/Vanilla_Multi_Step_Tool_Use.ipynb">this Colab</a> for an example workflow).</p>
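<p>As a rough sketch of what this looks like, the snippet below registers a single custom tool and wraps Command R+ as a ReAct agent. The tool and its return value are invented for illustration, and the module paths and the <code>create_cohere_react_agent</code> entry point are taken from the langchain_cohere integration as I understand it, so verify them against the Colab linked above.</p><pre><code>from langchain.agents import AgentExecutor
from langchain_cohere import ChatCohere
from langchain_cohere.react_multi_hop.agent import create_cohere_react_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool


@tool
def fx_rate(currency_pair: str) -> float:
    """Return the latest exchange rate for a currency pair such as 'EUR/USD'."""
    # Hypothetical stand-in: a real tool would call an FX API here.
    return 1.07


llm = ChatCohere(model="command-r-plus", temperature=0)
prompt = ChatPromptTemplate.from_template("{input}")

# The agent decides which tools to call, possibly over multiple steps,
# and grounds its final answer in the observed tool outputs.
agent = create_cohere_react_agent(llm=llm, tools=[fx_rate], prompt=prompt)
executor = AgentExecutor(agent=agent, tools=[fx_rate], verbose=True)

result = executor.invoke({"input": "How many US dollars are 250 euros right now?"})
print(result["output"])
</code></pre>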
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Example of using Command R+ in LangChain with different tools including Internet search, vector store search, and Python execution.</figcaption></figure></div><p>I hope that strong support for RAG and tool use in an open-weights model will lead to progress in important research directions, some of which I have outlined <a href="https://newsletter.ruder.io/i/136248186/the-future-of-tool-augmented-llms">here</a>. If you want the most efficient solution for RAG, <a href="https://txt.cohere.com/command-r/">Command R</a> demonstrates highly competitive RAG and tool-use performance at cheaper cost (35B-parameter weights are <a href="https://huggingface.co/CohereForAI/c4ai-command-r-v01">publicly available</a>).</p><h2>Multilingual</h2><p>Command R+ works well in languages beyond English. It was <a href="https://huggingface.co/CohereForAI/c4ai-command-r-plus#model-details">pre-trained on 23 languages</a>, with our main focus on 10 key language of global business: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, Simplified Chinese, and Arabic. 
In our evaluations on translation tasks, it is competitive with GPT4-turbo.</p>
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Comparison of models on FLoRES (in French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, and Chinese) and WMT23 (in German, Japanese, and Chinese) translation tasks.</figcaption></figure></div><p>Command R+ has been designed with multilinguality in mind. Its tokenizer is much less English-centric than others and compresses text in non-English languages much better than both the Mistral and OpenAI tokenizers.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> As LLM providers charge based on the number of input/output tokens, tokenizer choice directly impacts API costs for users. 
At the same cost-per-token, if one LLM generates 2x as many tokens as another, the API costs will also be twice as large.</p>
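<p>A tiny back-of-the-envelope calculation makes this concrete; the price and token counts below are invented for illustration, with the 1.67x ratio taken from the Japanese comparison in the figure that follows.</p><pre><code># Same text, same price per token, different tokenizers: the less efficient
# tokenizer makes every request proportionally more expensive.
price_per_1k_tokens = 0.003  # hypothetical price, identical for both providers

tokens_efficient_tokenizer = 1000    # e.g., a multilingual-friendly tokenizer on a Japanese text
tokens_inefficient_tokenizer = 1670  # e.g., 1.67x as many tokens for the same text

cost_efficient = tokens_efficient_tokenizer / 1000 * price_per_1k_tokens
cost_inefficient = tokens_inefficient_tokenizer / 1000 * price_per_1k_tokens

print(f"Efficient tokenizer:   ${cost_efficient:.4f} per request")
print(f"Inefficient tokenizer: ${cost_inefficient:.4f} per request")
print(f"Relative cost: {cost_inefficient / cost_efficient:.2f}x")  # ~1.67x
</code></pre>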
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Comparison of the number of tokens produced by the Cohere, Mistral (Mixtral), and OpenAI tokenizers for different languages (as a multiple of the number of tokens produced by the Cohere tokenizer). The Cohere tokenizer produces much fewer tokens to represent the same text, with particularly large reductions on non-Latin script languages. For instance, in Japanese, the OpenAI tokenizer outputs 1.67x as many tokens as the Cohere tokenizer.</figcaption></figure></div><p><a href="https://aclanthology.org/2023.emnlp-main.614/">Ahia et al. (2023)</a> highlighted that such over-segmentation leads to &#8220;double unfairness&#8221;: higher API prices and lower utility (reduced performance) for many languages. In comparison, Command R+ is much more equitable. I hope that companies will take into account the impact of tokenization and other design choices on API costs in future LLMs.</p><p>Given the focus on the Latin script in existing models, I particularly want to highlight Command R+&#8217;s performance in some prominent non-Latin script languages: Japanese, Korean, and Chinese. We evaluated on translation tasks as well as language-specific benchmarks such as <a href="https://github.com/Stability-AI/FastChat">Japanese MT-Bench</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. 
Command R+ outperforms Claude 3 Sonnet and Mistral Large<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> and is competitive with GPT 4-Turbo.</p>
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Japanese evaluation on FLoRES, WMT23, and Japanese MT-Bench (<a href="https://x.com/seb_ruder/status/1778386288313315833">source</a>).</figcaption></figure></div><p>We see similar trends for evaluations in <a href="https://x.com/seb_ruder/status/1778386446493135058">Korean</a> and <a href="https://x.com/seb_ruder/status/1778386610201014339">Chinese</a>. On <a href="https://x.com/lmsysorg/status/1778555685598568946">Chinese Chatbot Arena</a>, Command R+ is only behind GPT4 and Claude 3 Opus, models that are 2&#8211;3x more expensive. It&#8217;s been exciting to read the feedback from speakers of different language communities using Command R+.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.ruder.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">NLP News is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Conclusion</h2><p>Overall, I&#8217;m really excited about Command R+&#8217;s capabilities and the future of LLMs. We will be pushing its multilingual capabilities to make it useful in many languages used in business. I&#8217;ve been using Command R+ with RAG via <a href="https://dashboard.cohere.com/playground/chat">Cohere&#8217;s playground</a> for my exploratory Internet searches and creative tasks in English and German and have been impressed by its quality. Feel free to try it in your language&#8212;and share your feedback in the comments or via email. 
I&#8217;d love to hear what works or doesn&#8217;t work for you.</p><p>I&#8217;m also excited to hear from you if you&#8217;d like to explore using Command R+ for your (multilingual) business applications.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>LLM evaluators prefer longer responses or are affected by the order in which responses are presented.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Evaluation results on other public benchmarks can be found <a href="https://huggingface.co/CohereForAI/c4ai-command-r-plus#evaluations">here</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>This is on 4x A100 GPUs using a highly optimized open-source backend. Quantized versions of Command R+ can be run on 2x 3090 GPUs at <a href="https://www.reddit.com/r/LocalLLaMA/comments/1c3stb8/comment/kzlj0kk/?utm_source=share&amp;utm_medium=web3x&amp;utm_name=web3xcss&amp;utm_term=1&amp;utm_content=share_button">around 10 tokens/s</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>DBRX ranks at #26 on Chatbot Arena as of the publication date of this post.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Things used to be easier when the default setting was purely extractive QA and accuracy and exact match (EM) were the go-to metrics (see, for instance, <a href="https://research.google/pubs/natural-questions-a-benchmark-for-question-answering-research/">Natural Questions</a> or <a href="https://arxiv.org/abs/2003.05002">TyDi QA</a>). 
With generative models, automatically identifying whether a (possibly verbose) response answers a question is much more challenging.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Evaluation for HotpotQA and Bamboogle is done via a committee-of-LLMs-as-judges to reduce evaluator bias.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>The Anthropic tokenizer is not public so we could not compare to them.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>We use GPT4 as an evaluator so this evaluation is biased towards GPT4-turbo.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Mistral Large doesn&#8217;t officially support these languages so its results are expectedly lower.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[True Zero-shot MT]]></title><description><![CDATA[Teaching Machines a New Language Like Humans]]></description><link>https://newsletter.ruder.io/p/true-zero-shot-mt</link><guid isPermaLink="false">https://newsletter.ruder.io/p/true-zero-shot-mt</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Tue, 27 Feb 2024 09:46:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ffbde3-d3fc-49b9-a596-ba4ca2c0ae4c_1958x914.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Little over a week ago, <a href="https://x.com/JeffDean/status/1758149033473020081?s=20">Gemini 1.5 reported</a> close to human-level performance on <a href="https://arxiv.org/abs/2309.16575">MTOB</a>, a recent challenging translation dataset. In this newsletter, we&#8217;ll dig into this result, explore true zero-shot machine translation (MT), and consider how to teach LLMs a new language like humans.</p><div><hr></div><h2>Low-resource MT</h2><p>To set the scene, let&#8217;s first consider what it means for a language to be considered &#8220;low-resource&#8221;. As with LLMs, the performance of MT models depends on the amount of training data&#8212;both parallel and monolingual&#8212;available in a given language. As a result, there is a <a href="https://aclanthology.org/2020.acl-main.560/">gulf between languages with lots of data and languages with little data</a> in common pre-training corpora. The latter are typically referred to as &#8220;low-resource&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>To bridge the gap between resource-rich and resource-poor languages and make machine translation more accessible, new translation benchmarks have been created that cater specifically to low-resource languages. 
The <a href="https://www2.statmt.org/wmt23/">Conference on Machine Translation (WMT)</a> now regularly hosts shared tasks on low-resource MT such as for <a href="https://www2.statmt.org/wmt23/indic-mt-task.html">Indic</a> and <a href="https://www2.statmt.org/wmt23/african-mt-task.html">African languages</a>; workshops such as <a href="https://turing.iimas.unam.mx/americasnlp/2023_st.html">AmericasNLP</a> support <a href="https://aclanthology.org/2023.americasnlp-1.23/">indigenous languages</a>; and large-scale decentralized collaborations such as <a href="https://www.masakhane.io/">Masakhane</a>, <a href="https://github.com/SEACrowd/seacrowd-datahub">SEACrowd</a> and <a href="https://cohere.com/research/aya">Aya</a> created MT datasets for <a href="https://aclanthology.org/2020.findings-emnlp.195/">African languages</a>, <a href="https://aclanthology.org/2023.eacl-main.57/">Indonesian languages</a>, and <a href="https://arxiv.org/abs/2402.06619">100+ languages</a> respectively. Recently, <a href="https://github.com/facebookresearch/flores/blob/main/flores200/README.md">FLORES-200</a> expands translation data coverage to 200 languages. Beyond theses efforts, through extensive work on data cleaning, filtering, and language identification, researchers have been able to obtain data and train MT models for 1000+ languages (<a href="https://arxiv.org/abs/2205.03983">Bapna et al., 2022</a>; <a href="https://arxiv.org/abs/2207.04672">NLLB Team, 2022</a>).</p><p>LLMs are typically trained on parallel data (<a href="https://aclanthology.org/2021.acl-short.87/">Kale et al., 2021</a>) and are increasingly used for translation (<a href="https://aclanthology.org/2023.acl-long.859/">Vilar et al., 2023</a>). However, in light of ever-larger pre-training datasets, the opacity of pre-training data, and challenges of language identification on the web (<a href="https://aclanthology.org/2020.coling-main.579/">Caswell et al., 2020</a>), it is unclear how much data LLMs have seen in a low-resource language during pre-training. Chances are that most LLMs have seen <em>some</em> data in most languages that are available on the web.</p><p>From an experimental perspective, it is thus not straightforward to assess how much data LLMs <em>actually</em> need to learn to translate in a new language. While we could restrict the languages a model is trained on and examine its performance on a held-out language with increasing amounts of training data, we know that most LLM capabilities only emerge at scale. So how can we study the translation abilities of fully pre-trained LLMs in a controlled setting?</p><div><hr></div><h2>True Zero-shot MT</h2><p>To study this setting, we have to look at translating a language that was <em>truly unseen</em> during pre-training.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> For simplicity, I use the term <strong>true zero-shot MT</strong> to designate the setting of translating to a language with no pre-training data using only in-context learning data.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>While we can find data for 1500+ languages on the web (<a href="https://arxiv.org/abs/2205.03983">Bapna et al., 2022</a>), there are around 7000+ languages spoken around the world. 
So the remaining 5500 languages with little or no presence online are potential candidates for our target language.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>As expected, for these languages there is very little data available that has been used in standard LLM pipelines. Interestingly, the resources that <em>are</em> available for these languages are similar to those humans might use to learn a <a href="https://en.wikipedia.org/wiki/Second_language">second language (L2)</a>, including:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><ol><li><p>a list of words and their translations to learn a language&#8217;s vocabulary;</p></li><li><p>paired sentences to learn about word usage, word order, and some morphology;</p></li><li><p>a grammar book to study the structure of the language.</p></li></ol><p>Let&#8217;s take a look at how these resources can be used by LLMs:</p><h3>1. A Bilingual Word List</h3>
srcset="https://substackcdn.com/image/fetch/$s_!36sO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92e197c-ae1e-4709-b629-1117b2448496_2204x378.png 424w, https://substackcdn.com/image/fetch/$s_!36sO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92e197c-ae1e-4709-b629-1117b2448496_2204x378.png 848w, https://substackcdn.com/image/fetch/$s_!36sO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92e197c-ae1e-4709-b629-1117b2448496_2204x378.png 1272w, https://substackcdn.com/image/fetch/$s_!36sO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92e197c-ae1e-4709-b629-1117b2448496_2204x378.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Database coverage of bilingual word lists in <a href="https://panlex.org/">PanLex</a>.</figcaption></figure></div><p>While lists of words and their translation only teach us a limited amount about a new language, they are available for a huge number of languages thanks to projects such as <a href="https://panlex.org/">PanLex</a>. However, for many languages, they only contain translation pairs for a small number of terms such as the numbers from 1&#8211;10.</p><p>Bilingual lexicons have been a core resource for aligning word embeddings across languages (<a href="https://arxiv.org/abs/1706.04902">Ruder et al., 2019</a>). They also have been used to bootstrap unsupervised MT systems (<a href="https://arxiv.org/abs/1710.11041">Artetxe et al., 2018</a>), though, translating text word-by-word does not get you far, even if you account for word order differences (<a href="https://arxiv.org/abs/1711.00043">Lample et al., 2018</a>).</p><p>On the other hand, even in the best LLMs, cross-lingual alignment of words in the vocabulary is still not perfect. Bilingual lexicons are thus a potential viable&#8212;and large-coverage&#8212;form of supervision. There are multiple ways in which they can be used with LLMs:</p><ul><li><p><strong>data augmentation</strong>: &#8220;noisy&#8221; code-mixed samples are created by replacing words in source language sentences with their target language translations (<a href="http://Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation">Wang et al., 2022</a>, <a href="https://aclanthology.org/2022.naacl-main.58/">Reid &amp; Artetxe, 2022</a>);</p></li><li><p><strong>lexical prompting</strong>: prepending <code>&lt;source word, translation&gt;</code> pairs to the prompt for source words that occur in the input (<a href="https://arxiv.org/abs/2302.07856">Ghazvininejad et al., 2023</a>);</p></li><li><p><strong>parallel data</strong>: <code>&lt;source word, translation&gt;</code> pairs are treated as &#8220;sentence&#8221; pairs and added to the existing parallel data.</p></li></ul><p><a href="https://aclanthology.org/2023.emnlp-main.26/">Jones et al. (2023)</a> compare several augmentations. They find that they provide similar gains for unsupervised machine translation and that high-quality bilingual lexicons such as <a href="https://github.com/google-research/url-nlp/tree/main/gatitos">GATITOS</a> are crucial. Recently, <a href="https://arxiv.org/abs/2402.02113">Koto et al. 
<p><a href="https://aclanthology.org/2023.emnlp-main.26/">Jones et al. (2023)</a> compare several augmentations. They find that these provide similar gains for unsupervised machine translation and that high-quality bilingual lexicons such as <a href="https://github.com/google-research/url-nlp/tree/main/gatitos">GATITOS</a> are crucial. Recently, <a href="https://arxiv.org/abs/2402.02113">Koto et al. (2024)</a> used PanLex to extend sentiment lexicons to more languages.</p><h3>2. Few Parallel Sentences</h3><p>Parallel data is the bread and butter of MT research. MT models and LLMs are trained on millions of parallel sentences. Prior work on low-resource MT such as for Nepali&#8211;English used 100,000s of parallel sentences (<a href="https://aclanthology.org/D19-1632/">Guzm&#225;n et al., 2019</a>).</p><p>More recent work studied the impact of around 1&#8211;6k professionally translated sentences across different low-resource languages (<a href="https://aclanthology.org/D19-1632/">Maillard et al., 2023</a>). They find that even a few thousand high-quality parallel sentences are helpful, while multilingual training and back-translation are crucial to achieve good performance.</p><p>But what can you do if you only have very few parallel sentences for a language? Such scenarios are common in the form of <em>Rosetta Stone</em> puzzles in Linguistic Olympiads.</p>
<figure><figcaption class="image-caption">An example <em>Rosetta Stone</em> puzzle <a href="https://aclanthology.org/2020.acl-main.115/">(Sahin et al., 2020)</a>.</figcaption></figure><p>These puzzles are designed so that only a minimum number of parallel sentences are required to deduce the relevant translations to and from the target language (Chickasaw above). <a href="https://aclanthology.org/2020.acl-main.115/">Sahin et al. (2020)</a> collected 100 such puzzles. LLMs such as GPT-3.5 Turbo still fall short of solving these puzzles, with an Exact Match score of around 20%, and even advanced prompting methods are not helpful (<a href="https://aclanthology.org/2023.rocling-1.33/">Lin et al., 2023</a>).</p><h3>3. A Grammar Book</h3><p>While both bilingual lexicons and parallel sentences are well-known data sources for training MT models, grammar books are much less common. A reference grammar is a key product of linguistic fieldwork, which can also result in other artifacts such as language learning materials, including alphabet books, learners&#8217; guides, etc.</p><p>Many reference grammars are available as books in PDF format.
As part of sourcing open-source OCR data from Google Books for <a href="https://github.com/google-research/xtreme-up">XTREME-UP</a> (<a href="https://aclanthology.org/2023.findings-emnlp.125/">Ruder et al., 2023</a>), we came across many public-domain books in under-represented languages that described the grammar or vocabulary of the language and were created as part of missionary efforts and linguistic fieldwork.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>Compared to these, <em>A grammar of Kalamang</em> (<a href="https://langsci-press.org/catalog/book/344">Visser, 2022</a>), used in MTOB, is a more recent reference grammar. Reference grammars are <a href="https://guides.tricolib.brynmawr.edu/c.php?g=285265&amp;p=1900305">publicly available</a> in <a href="https://linguistic-typology.org/grammarwatch/all/by-macro-area/">100s of languages</a> and present a promising and&#8212;with the exception of MTOB&#8212;untapped resource for studying the language acquisition of LLMs.</p>
srcset="https://substackcdn.com/image/fetch/$s_!tQ8h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ffbde3-d3fc-49b9-a596-ba4ca2c0ae4c_1958x914.png 424w, https://substackcdn.com/image/fetch/$s_!tQ8h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ffbde3-d3fc-49b9-a596-ba4ca2c0ae4c_1958x914.png 848w, https://substackcdn.com/image/fetch/$s_!tQ8h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ffbde3-d3fc-49b9-a596-ba4ca2c0ae4c_1958x914.png 1272w, https://substackcdn.com/image/fetch/$s_!tQ8h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ffbde3-d3fc-49b9-a596-ba4ca2c0ae4c_1958x914.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An excerpt from <em>A grammar of Kamalang</em>.</figcaption></figure></div><h3>4. Putting Everything Together: MTOB</h3><p>MTOB (Machine Translation from One Book; <a href="https://arxiv.org/abs/2309.16575">Tanzer et al., 2023</a>) is a recent dataset that provides the three above resources for <a href="https://en.wikipedia.org/wiki/Kalamang_language">Kalamang</a>, an endangered language spoken by less than 200 people<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>, which is essentially absent from pre-training corpora.</p><p>In addition, it is not closely related to other languages with many speakers (which is important to measure true zero-shot performance) and uses the Latin script, which makes it easy to process with LLMs. The authors obtained the permission of the Kalamang-speaking community for using their data, which is crucial for this type of work.</p><p>Compared to standard MT evaluation where native speaker translations are the benchmark to beat, producing native-level translations may not be feasible using only the provided resources. 
<p>Compared to standard MT evaluation, where native speaker translations are the benchmark to beat, producing native-level translations may not be feasible using only the provided resources. The authors thus provide another human baseline: The first author Garrett Tanzer learned how to translate Kalamang from scratch by reading the grammar book for 10+ hours and then used the parallel data and the Internet as reference when performing the translation task over the course of several weeks. <em>That</em> is dedication to human evaluation!</p><p>In their evaluation, all LLMs underperform the human baseline, with Claude 2 performing best in comparison. Recently, <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf">Gemini 1.5 Pro</a>, using its extremely long context window, was able to use in-context learning on the Kalamang resources to improve substantially on the English-&gt;Kalamang translation task. Let&#8217;s look at what these results mean for the field overall and for under-represented languages in particular.</p><div><hr></div><h2>Future Impact</h2><h4>Long Context Modeling</h4><p>It&#8217;s quite remarkable how long-context models have progressed over the last few years. Being able to translate into a new language using only in-context learning based on existing linguistic resources is an impressive feat. Nevertheless, parameter-efficient fine-tuning and retrieval-augmented generation (RAG) are important baselines for this setting that can put the in-context learning performance in perspective.</p><h4>Long Context Datasets</h4><p>While many challenging long-context benchmarks are synthetic in nature, including the most challenging tasks in Long Range Arena (<a href="https://openreview.net/forum?id=qVyeW-grC2k">Tay et al., 2021</a>) and the needle-in-the-haystack tasks in the Gemini 1.5 evaluation, datasets like MTOB provide a more realistic setting using standard linguistic resources. In addition, they enable comparison to a human baseline and to a (potentially unattainable) human native speaker. Given the increasing popularity of long-context models, I hope we see more long-context datasets grounded in realistic data and human-level comparisons.</p><h4>Under-represented Languages</h4><p>While these results highlight the potential of LLMs to learn to translate with very little data, many under-represented languages are primarily spoken; they do not have a written tradition or a standardized orthography. Text-based NLP technology is thus of limited use to them. For these languages, multi-modal LLMs will be an important foundation. Given the powerful capabilities of LLMs, we should design evaluations with the needs of the language communities in mind.
<p>Let&#8217;s look at what these results mean for the field overall and for under-represented languages in particular.</p><div><hr></div><h2>Future Impact</h2><h4>Long Context Modeling</h4><p>It&#8217;s quite remarkable how long-context models have progressed over the last few years. Being able to translate into a new language using only in-context learning based on existing linguistic resources is an impressive feat. Nevertheless, parameter-efficient fine-tuning and retrieval-augmented generation (RAG) are important baselines for this setting that can put the in-context learning performance in perspective.</p><h4>Long Context Datasets</h4><p>While many challenging long-context benchmarks are synthetic in nature, including the most challenging tasks in Long Range Arena (<a href="https://openreview.net/forum?id=qVyeW-grC2k">Tay et al., 2021</a>) and the needle-in-the-haystack tasks in the Gemini 1.5 evaluation, datasets like MTOB provide a more realistic setting using standard linguistic resources. In addition, they enable comparison to a human baseline and to a (potentially unattainable) human native speaker. Given the increasing popularity of long-context models, I hope we see more long-context datasets grounded in realistic data and human-level comparisons.</p><h4>Under-represented Languages</h4><p>While these results highlight the potential of LLMs to learn to translate with very little data, many under-represented languages are primarily spoken; they do not have a written tradition or a standardized orthography. Text-based NLP technology is thus of limited use to them. For these languages, multi-modal LLMs will be an important foundation. Given the powerful capabilities of LLMs, we should design evaluations with the needs of the language communities in mind. A promising area is to support &#8216;contact languages&#8217;, including creoles and regional varieties of standardized languages (<a href="https://aclanthology.org/2022.acl-long.539/">Bird, 2022</a>).</p><h4>NLP and Cognitive Science</h4><p>In order to study language acquisition in language models, benchmarks such as the <a href="https://babylm.github.io/">BabyLM Challenge</a> employ a developmentally plausible corpus including mostly transcribed speech and a limited number of tokens. However, the embodied, interactive, and multi-modal nature of <a href="https://en.wikipedia.org/wiki/First_language">first language (L1)</a> acquisition is challenging to replicate with current models. L2 acquisition in the form of true zero-shot MT may be a more accessible testbed to study how a model learns a new language based on limited linguistic resources.</p><h4>Interpretability and Model Understanding</h4><p>Grammar books and language learning resources similarly provide a means to analyze how LLMs acquire new information and how they use their existing knowledge to reason over new inputs. They can be used to better understand models&#8217; inner workings via tasks such as grammar induction (<a href="https://openreview.net/forum?id=H1xPR3NtPB">Kim et al., 2020</a>). For instance, an interesting question is whether the sentences a model retrieves via RAG are similar to those consulted by a human when translating into a new language.</p>
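<p>As a toy illustration of that question (my own sketch, not code from any of the cited papers), one could rank the parallel example pairs by similarity to a new source sentence, here with simple word overlap instead of the learned retriever a real RAG baseline would use, and compare the retrieved pairs with the sentences a human translator consults.</p><pre><code># Toy retrieval sketch: rank parallel example pairs by word overlap with a new
# source sentence. Placeholder data; a real system would use learned embeddings.

def retrieve(source_sentence, parallel_pairs, k=2):
    source_words = set(source_sentence.lower().split())
    def overlap(pair):
        english, _ = pair
        return len(source_words.intersection(english.lower().split()))
    return sorted(parallel_pairs, key=overlap, reverse=True)[:k]

parallel_pairs = [
    ("I see the person.", "KALAMANG_SENTENCE_1"),
    ("The child is swimming.", "KALAMANG_SENTENCE_2"),
    ("We eat fish in the evening.", "KALAMANG_SENTENCE_3"),
]
print(retrieve("The person eats fish.", parallel_pairs))
</code></pre>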
<h4>NLP and Linguistics</h4><p>Making the most of such resources, both in obtaining and understanding the data and in interpreting model results, requires collaborating with linguists. MTOB provides an example of what such a collaboration can look like in practice, with the linguist actively participating in the research and co-authoring the paper. Such inter-disciplinary collaborations, while challenging and complex, are often a breath of fresh air&#8212;so I hope to see more of them in the future.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Note that the term &#8220;low-resource&#8221; is typically a misnomer as many low-resource languages have data available; it may just not be easily accessible due to being in different formats, in another modality, etc. I&#8217;ll still use it here as we will strictly talk about the data available in pre-training for these languages.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>We have studied transfer to languages with unseen scripts in prior work (Pfeiffer et al., 2021), though many of these scripts have likely been seen by more recent LLMs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>I use the term <strong>true zero-shot MT</strong> to contrast with <strong>zero-shot MT</strong> (<a href="https://arxiv.org/abs/1611.04558">Johnson et al., 2016</a>), which refers to the setting where a language has monolingual but no parallel data in pre-training. The term also relates to the usage of the term <strong>true few-shot learning</strong> (<a href="https://arxiv.org/abs/2105.11447">Perez et al., 2021</a>), which does not use any other held-out examples.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>There are many factors that affect the choice of the target language for true zero-shot MT: Many languages don&#8217;t have a standardized orthography; others use a different script, which may complicate processing; connecting with native speakers and finding data may be difficult for many languages; language similarity plays an important role for learning a new language; finally, some communities do not approve the use of MT for their language.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Note that this is a text-centric perspective of zero-shot MT. In practice, for such local languages we can also expect the existence of raw speech with translations (<a href="https://aclanthology.org/2022.acl-long.539/">Bird, 2022</a>).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>We excluded these books from our data as they are mainly in English with examples in the target language rather than the monolingual target language data that we wanted to collect.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Kalamang is spoken in the Western part of the island of New Guinea&#8212;part of Indonesia&#8212;while the Eastern part of the island is part of Papua New Guinea. Incredibly, Papua New Guinea and Indonesia are also the two countries with the <a href="https://en.wikipedia.org/wiki/Number_of_languages_by_country">most languages in the world</a>. 
See (<a href="https://aclanthology.org/2022.acl-long.500/">Koto et al., 2022</a>) for an overview of NLP challenges for Indonesian languages and this <a href="https://aclanthology.org/2023.ijcnlp-tutorials.2/">AACL 2023 Tutorial</a> for the current status of NLP in Southeast Asia. </p></div></div>]]></content:encoded></item><item><title><![CDATA[Thoughts on the 2024 AI Job Market]]></title><description><![CDATA[And Why I Joined Cohere]]></description><link>https://newsletter.ruder.io/p/thoughts-on-the-2024-ai-job-market</link><guid isPermaLink="false">https://newsletter.ruder.io/p/thoughts-on-the-2024-ai-job-market</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Mon, 12 Feb 2024 09:51:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ac597596-ba0b-4a7b-8aaa-6b04118d690e_1400x814.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It&#8217;s crazy how the AI and NLP landscape has evolved over the last five years. 5 years ago, around the time I finished my PhD, if you wanted to work on cutting-edge natural language processing (NLP), your choice was relatively limited.</p><p>Recently, I decided to go on the job market again, which has become much more diverse. In this post, I want to highlight some macro trends that I observed and the reasons that I joined my new company, <a href="https://cohere.com/">Cohere</a>, which may be helpful in guiding your own job search.</p><p><em>Note: This post reflects my personal opinions and not those of my previous and current employers. It is written from my perspective as a Europe-based researcher focused on NLP. If you are interested in AI companies but have a different skillset, some of these thoughts should still be relevant to you.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.ruder.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">NLP News is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>AI Job Market Trends</h2><h3>1. Research has become more applied.</h3><p>In the past, most problems at the forefront of ML and NLP were firmly in the purview of fundamental or basic research. As models were not powerful enough, datasets reflected simplified evaluation settings that were feasible at the time and typically far removed from applications. In order to work on such cutting-edge problems, you generally had the choice of joining academia or going to a handful of big tech labs (Google Research/Brain, DeepMind, FAIR, MSR, etc).</p><p>For research advances to make their way into products could take a team months or years of dedicated work&#8212;if it succeeded at all. 
An exception is machine translation where research breakthroughs such as the emergence of <a href="https://blog.research.google/2006/04/statistical-machine-translation-live.html">statistical MT</a> and <a href="https://blog.research.google/2016/09/a-neural-network-for-machine.html">neural MT</a> resulted in concrete product improvements.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> On the other hand, applied research departments directly worked on improving a specific application.</p><p>In light of the emergence of pre-training (<a href="https://thegradient.pub/nlp-imagenet/">NLP&#8217;s &#8216;ImageNet moment&#8217;</a>) and models becoming more powerful, the gap between fundamental and applied research in NLP has consistently narrowed: The integration of <a href="https://aclanthology.org/N19-1423/">BERT</a>-based representations led to <a href="https://blog.google/products/search/search-language-understanding-bert/">one of the biggest quality leaps</a> in Google Search history and the recent generation of large language models (LLMs) enabled a plethora of new applications.</p><p>Problems that were previously in the domain of basic research (how to measure generation quality, how to teach models to reason, how to learn with long-range dependencies, etc) now impact real-world applications. As a result, new advances in research have the potential to have a much broader impact. This leads to many new opportunities and emerging research areas. However, it also means that researchers must consider challenges regarding the safe and responsible use of such technology.</p><p>As the Generative AI space heats up, new research breakthroughs are perceived to provide an edge over the competition. As a result, publishing them has become more challenging. In addition, with increasing proximity to product, doing purely curiosity-driven research has become more difficult: when most other research has immediate product impact, how do you justify working on an unproven direction? For researchers, this requires balancing short-term impact with long-term research potential, with the scale tilting to the former.</p><p>To encourage long-term research in this application-oriented climate, companies need to create an environment that still rewards open-ended directions. With companies focusing on applications, researchers in academia should focus on unexplored directions and look to the <a href="https://en.wikipedia.org/wiki/Blue_skies_research">blue skies</a>. 
While compute requirements for research have increased, there are plenty of less compute-intensive directions:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;235a46d3-f6d1-428d-a4fd-a766512a9b66&quot;,&quot;caption&quot;:&quot;Update Dec 30: Added mentions of BabyLM and the Languini Kitchen.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;NLP Research in the Era of LLMs&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:7965403,&quot;name&quot;:&quot;Sebastian Ruder&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/17fdd4c3-a575-4fe4-b58e-d876b78bfe2f_2416x2416.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2023-12-19T09:53:36.237Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe230fdb3-f786-4328-9065-b226e80bea6f_2085x805.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://newsletter.ruder.io/p/nlp-research-in-the-era-of-llms&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:138941502,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:55,&quot;comment_count&quot;:4,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;NLP News&quot;,&quot;publication_logo_url&quot;:&quot;https://s3.amazonaws.com/revue/profiles/images/000/027/670/thumb/sebastian_ruder_profile_photo_square.jpg?1616408613&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3>2. Startups are a serious alternative to a PhD.</h3><p>When people have asked me for advice in the past on whether they should do a PhD, I generally told them that it&#8217;s well worth the time investment. Not only does it unlock research jobs that require a PhD, it also is a great way to focus on your personal growth.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>In light of the applied nature of current research problems, there is another path that exposes you to cutting-edge AI work: joining a startup. To be sure, startups&#8212;particularly early-stage ones&#8212;are not for everyone. They require a person with a certain type of mindset and motivation: Someone who enjoys solving real-world problems and having a tangible, direct impact; who can work autonomously and without much guidance; who thrives in a hectic, unstructured environment and can handle ambiguity.</p><p>But if you&#8217;re comfortable in these conditions, you can acquire certain skills and knowledge much faster than in a typical PhD. You will also get hands-on experience with emerging methodologies such as instruction and preference tuning, red-teaming, LLM alignment, etc, which can prove invaluable for your career. 
Of course, in a startup, your work is typically determined by the company&#8217;s needs rather than your own interests so you need to be flexible.</p><p>A PhD is still the best option for you if you would like to follow your own curiosity and focus on your personal development; if you enjoy to go deep and fully dedicate yourself to a topic; if you value collaboration and mentorship; and if you enjoy being creative and coming up with genuinely new ideas. A scientific mindset as well as other research-related skills such as designing ablations and testing for hypotheses, publishing, and developing <a href="https://colah.github.io/notes/taste/">research taste</a>, among many others, are also more easily learned during a PhD.</p><h3>3. ML has become less open and more polarized.</h3><p>An amazing attribute of the ML community has been that much of ML development and research has been conducted in the open. The top ML journal JMLR was <a href="https://www.jmlr.org/history.html">founded in 2000</a> with the goal to provide open access to its publications. Conferences provide free access to their proceedings online. ML frameworks such as TensorFlow and PyTorch were originally developed by companies and then open-sourced. In NLP, journals and conferences are similarly open-access and open-source libraries such as <a href="https://github.com/huggingface/transformers">Transformers</a> are common building blocks.</p><p>Early pre-trained models such as <a href="https://arxiv.org/abs/1802.05365">ELMo</a>, <a href="https://arxiv.org/abs/1801.06146">ULMFiT</a>, <a href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf">GPT</a>, and <a href="https://arxiv.org/abs/1910.10683">T5</a> were open-sourced as this enabled wide-spread adoption. However, this landscape of radical openness has shifted. Stalwarts of open-source AI such as OpenAI and Google have gradually released less information about their models. Starting with the first generation of LLMs, models such as <a href="https://arxiv.org/abs/2005.14165">GPT-3</a> and <a href="https://arxiv.org/abs/2204.02311">PaLM</a> were increasingly locked behind APIs&#8212;but papers still described the architecture and data in detail. More recent models such as <a href="https://arxiv.org/abs/2303.08774">GPT-4</a>, <a href="https://arxiv.org/abs/2305.10403">PaLM-2</a>, and <a href="https://arxiv.org/abs/2312.11805">Gemini</a> are not only closed-source but the corresponding papers reveal nothing about the architecture and training data.</p><p>This lack of knowledge sharing may impede progress in AI development. Fortunately, other companies and organizations continued to release a <a href="https://huggingface.co/blog/2023-in-llms">steady stream of open-source LLMs</a>. Still, even among open-source models there is a spectrum of openness. For instance, the exact composition of training data often remains a secret. Few models such as <a href="https://bigscience.huggingface.co/blog/bloom">BLOOM</a> or the recently released <a href="https://allenai.org/olmo">OLMo</a> are truly open. Among the big tech companies, <a href="https://www.theverge.com/2024/1/18/24042354/mark-zuckerberg-meta-agi-reorg-interview">Facebook</a> and <a href="https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/">Microsoft</a> showed a renewed commitment to open-source. 
Even Apple&#8212;with its reputation of being secretive&#8212;has been <a href="https://www.thestack.technology/apple-quietly-open-sources-key-ai-tools/">quietly open-sourcing AI projects</a>.</p><p>For industry researchers, the trend towards closed-source means that it has become harder to publish. In the past, researchers at top AI industry labs were often able to publish a steady stream of publications similar to their academic counterparts. For LLM-related papers, this stream is reduced to a trickle and new advances may eventually see the light of day only as patents rather than research publications. Additionally, it has become more difficult to publish individual contributions as advances are more likely to be absorbed into large collaborations producing a single report.</p><h3>4.  Research is concentrated in large projects.</h3><p>The average number of authors on a publication has <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0149504">steadily increased</a>. Starting in particle physics, <a href="https://www.nature.com/articles/d41586-019-03862-0">author numbers have surged</a> in recent years due to massive global collaborations such as the <a href="https://www.nature.com/articles/nature.2015.17567">Large Hadron Collider</a>. The emergence of LLMs brought this trend to ML and NLP. Recent examples of such large-scale collaborations are <a href="https://arxiv.org/abs/2211.05100">BLOOM</a> (300+ authors), <a href="https://arxiv.org/abs/2303.08774">GPT-4</a> (200+ authors), and <a href="https://arxiv.org/abs/2312.11805">Gemini</a> (900+ authors). While several successful LLMs have been produced by small teams, the number of authors of an LLM has generally increased with the number of its parameters.</p><p>LLM projects not only require people with research skills but also strong software engineers that can design systems that scale efficiently to 100s of billions of parameters and trillions of tokens. In addition, LLMs require disparate sets of expertise including data processing, optimization, fine-tuning, RL, evaluation, safety, infrastructure, multi-modality, etc. As a result and due to their strategic importance, the size of teams working on the latest generation of LLMs has rapidly increased.</p><p>This size contrasts with the previous generation of AI breakthroughs such as <a href="https://www.nature.com/articles/nature16961">AlphaGo</a>, which were executed by much smaller, focused teams. Such size poses challenges for the effective execution and prioritization, increasing friction and making it more difficult to quickly make decisions. A less direct downside of the increasing number of people getting absorbed into LLM-related research is that other research directions that do not directly relate to the latest generation of LLMs such as the development of Transformer alternatives are deprioritized.</p><h3>5.  
More companies, more opportunities.</h3><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HK5D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48afa389-0cc3-4afb-975a-32899d16cabf_1400x814.png"><img src="https://substackcdn.com/image/fetch/$s_!HK5D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48afa389-0cc3-4afb-975a-32899d16cabf_1400x814.png" width="1400" height="814" alt=""></a><figcaption class="image-caption">Generative AI startups (Credit: <a href="https://medium.com/@dawncapitalteam/not-all-hype-generative-ai-stars-emerging-in-europe-453b335907b4">Dawn Capital</a>).</figcaption></figure></div><p>The advent of LLMs led to a wave of new companies leveraging this technology&#8212;and prompted existing companies to figure out how to incorporate these models into their products. YC, the prolific startup incubator, has already funded more than <a href="https://www.ycombinator.com/companies/industry/generative-ai">100 generative AI startups</a>. A recent <a href="https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier#key-insights">McKinsey report</a> estimated that Generative AI&#8217;s impact on productivity could add trillions of dollars to the global economy, with most of the expected value concentrated across four use cases: customer operations, marketing and sales, software engineering, and R&amp;D.</p><p>Generative AI, however, is still just at the beginning. Many research challenges remain, including mitigating hallucinations and ensuring trustworthiness and attribution, aligning models to reliably elicit desired behavior, ensuring robust reasoning, etc. In order to effectively use LLMs for business use cases, we furthermore need to <a href="https://www.forbes.com/sites/peterbendorsamuel/2024/01/08/reasons-why-generative-ai-pilots-fail-to-move-into-production/?sh=551933d36b4a">successfully conduct pilot studies</a>, assess biases and risks, define suitable guardrails, rethink core business processes, and develop new skills in the workforce, among others.</p><p>With all of these new AI companies, it is difficult to choose the one that is the best fit for you.</p><div><hr></div><h2>Why I Joined Cohere</h2><p>Below, I highlight the criteria that led me to join Cohere. Many of these considerations are personal but I hope they may be useful to you as inspiration or to guide your own job search.</p><h3><strong>1.  Openness and community</strong></h3><p>In addition to building powerful proprietary enterprise models, Cohere supports openness and inclusion through its non-profit research arm, Cohere for AI (C4AI). 
C4AI&#8217;s openness is not an after-thought but part of Cohere&#8217;s DNA: the idea for C4AI goes back to FOR.ai, a decentralized ML collaboration&#8212;which I highlighted in a <a href="https://newsletter.ruder.io/p/nlp-progress-restrospectives-and-look-ahead-new-nlp-courses-independent-research-initiatives-interviews-lots-of-resources-217744?utm_source=%2Fsearch%2Ffor.ai&amp;utm_medium=reader2">2020 newsletter</a>&#8212;initiated by Cohere founders Aidan Gomez and Ivan Zhang, among others. C4AI <a href="https://txt.cohere.com/c4ai-2023/">published more than 30 papers in 2023</a>. Talented researchers from a diversity of backgrounds are mentored by researchers across the organization in their <a href="https://txt.cohere.com/c4ai-scholars-program/">Scholars Program</a>. C4AI organizes large-scale community initiatives including <a href="https://txt.cohere.com/aya-multilingual/">Aya</a>, a massive collaboration to develop a large multilingual open-source instruction tuning dataset and model. In addition, Cohere also invests in programs that make ML more accessible such as <a href="https://docs.cohere.com/docs/llmu">LLM University</a>.</p><h3>2.  A mature start-up</h3><p>At an early-stage startup, you can move fast but things are hectic and unstructured. In big tech, there are established tools and processes available but bureaucracy can impede progress.</p><p>Cohere occupies a middle ground. It has been around for a while and had time to build structure and processes, which make it easy to hit the ground running and directly have an impact without getting bogged down by unrelated tasks. The core components of the LLM pipeline are firmly in place and are being refined and iterated upon.</p><p>At the same time, the team is small enough so that you can have impact across the entire LLM stack and own crucial parts of the pipeline. There is little friction and red tape, which makes it easy to prototype and test new improvements and to collaborate across teams.</p><h3>3.  Enabling remote work</h3><p>With a new baby, having a flexible working arrangement that would allow me to work remotely was very important to me. However, not every company that allows its employees to work from home is set up for remote work. It&#8217;s good to be aware: Does the culture support remote work or will you miss out on conversations at the micro-kitchen? Do the tools enable working remotely? Are there opportunities to meet in person?</p><p>For companies with multiple offices, an important factor is where the company&#8217;s main office or a project&#8217;s center of gravity is located as it may be harder to achieve the same level of impact if you are in a satellite office or working remotely. In the same vein, working with colleagues in similar timezones often facilitates collaboration. At the beginning of your career, it is often a good idea to work in-person as it will be easier to seek advice and learn from colleagues.</p><p>At Cohere, half of all employees are remote. Tools and benefits support remote work and events provide equal access for distributed teams. Cohere has offices in Toronto, London, San Francisco, and Palo Alto that you can visit to work with team members in-person. Many people working on ML and modeling at Cohere are based in Europe or in the US East Coast so it&#8217;s easy to collaborate across these timezones. </p><h3>4.  Alignment</h3><p>Another thing that was important to me is to be aligned with the overall goals of the company. 
While most companies are profit-driven, some are more serious about having a positive impact on society than others. This is reflected in the way they develop their technology (do they prioritize safety and ethics and put appropriate safeguards in place?) as well as the programs they organize, causes they support, and the way they interact with the community.</p><p>Cohere aims to develop AI models in a responsible manner to serve humanity, an important objective. Making models more accessible across languages is not only a personal objective but serves this overarching goal as it enables Cohere&#8217;s customers to reach their users across the world.</p><h3>5.  Team and culture</h3><p>Cohere&#8217;s team is world-class. I had worked with a few at Google DeepMind and knew others from conferences or by reputation. Cohere has teams with deep expertise across many LLM areas such as pre-training, RL, retrieval augmentation and search. The latter are crucial for knowledge-intensive and enterprise use cases. Beyond research, senior leaders at Cohere have experience in building and scaling products and businesses to billions of users.</p><p>In addition, I enjoyed all my interactions and conversations with Cohere employees. Everyone I got to know was kind, humble, and genuine. The culture is collaborative and everyone is aligned towards the same objective.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> People are motivated by helping each other succeed.</p><p>For more information about careers at Cohere, check out the <a href="https://cohere.com/careers">careers page</a>. Cohere is <a href="https://jobs.lever.co/cohere/">hiring across many roles</a>.</p><p><em>For a job market perspective from an RLHF researcher, check out </em><span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Nathan Lambert&quot;,&quot;id&quot;:10472909,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fdda47b96-836a-4b95-99a0-f0ec744d4245_2316x2316.jpeg&quot;,&quot;uuid&quot;:&quot;ce7e82ae-b9a7-47bd-8f07-381fd5437ae5&quot;}" data-component-name="MentionToDOM"></span> &#8216;s <a href="https://www.interconnects.ai/p/ai-research-job-market">Interconnects post</a>.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Other examples include OCR, image recognition, <a href="https://en.wikipedia.org/wiki/Predictive_text">predictive text</a>, <a href="https://blog.research.google/2016/05/announcing-syntaxnet-worlds-most.html">dependency parsing</a> (when it was still widely used as an intermediate step in NLP pipelines), among others.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>I&#8217;ve collected some advice in <a href="https://www.ruder.io/10-tips-for-research-and-a-phd/">10 tips for research and a PhD</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div 
class="footnote-content"><p>A company&#8217;s culture is difficult to assess from the outside. Try to ask current and former employees about their work environment and read reviews of the company online. </p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[The Big Picture of AI Research]]></title><description><![CDATA[A Workshop Retrospective]]></description><link>https://newsletter.ruder.io/p/the-big-picture-of-ai-research</link><guid isPermaLink="false">https://newsletter.ruder.io/p/the-big-picture-of-ai-research</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Thu, 18 Jan 2024 07:09:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!nIVS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f7ffd8-aa45-40f5-b4f5-c962b89fca2a_1676x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>More papers on AI are published than ever before<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> but each paper tends to only present its part of the picture&#8212;and it becomes difficult to recognize the larger story to which a paper is connected. To encourage the community to explore broader research narratives, we<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> co-organized the <a href="https://www.bigpictureworkshop.com/">Big Picture Workshop</a> at <a href="https://2023.emnlp.org/">EMNLP 2023</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> We received a number of <a href="https://www.bigpictureworkshop.com/">high-quality submissions</a> that distill important research topics, from narrative understanding to modern generation techniques.</p><p>My favorite part of the workshop, however, were the invited talks. We had asked researchers from different labs working on the same topic to reflect on and consolidate their often disagreeing contributions. The result were talks that were more nuanced and engaging than the&#8212;typically one-sided&#8212;scientific presentations. 
The talks covered topics from in-context learning to attention as explanation and morality.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nIVS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f7ffd8-aa45-40f5-b4f5-c962b89fca2a_1676x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nIVS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f7ffd8-aa45-40f5-b4f5-c962b89fca2a_1676x800.png 424w, https://substackcdn.com/image/fetch/$s_!nIVS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f7ffd8-aa45-40f5-b4f5-c962b89fca2a_1676x800.png 848w, https://substackcdn.com/image/fetch/$s_!nIVS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f7ffd8-aa45-40f5-b4f5-c962b89fca2a_1676x800.png 1272w, https://substackcdn.com/image/fetch/$s_!nIVS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f7ffd8-aa45-40f5-b4f5-c962b89fca2a_1676x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nIVS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f7ffd8-aa45-40f5-b4f5-c962b89fca2a_1676x800.png" width="1456" height="695" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4f7ffd8-aa45-40f5-b4f5-c962b89fca2a_1676x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:695,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2195068,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nIVS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f7ffd8-aa45-40f5-b4f5-c962b89fca2a_1676x800.png 424w, https://substackcdn.com/image/fetch/$s_!nIVS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f7ffd8-aa45-40f5-b4f5-c962b89fca2a_1676x800.png 848w, https://substackcdn.com/image/fetch/$s_!nIVS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f7ffd8-aa45-40f5-b4f5-c962b89fca2a_1676x800.png 1272w, https://substackcdn.com/image/fetch/$s_!nIVS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f7ffd8-aa45-40f5-b4f5-c962b89fca2a_1676x800.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The audience at the Big Picture Workshop.</figcaption></figure></div><div><hr></div><p><em>I recently joined Cohere to help solve real-world problems with LLMs. <a href="https://cohere.com/careers">We are hiring!</a> </em></p><div><hr></div><h2>What Does In-Context Learning Need?</h2><p>In-context learning (ICL) is one of the most important emerging phenomena of LLMs but it is still not clearly understood what factors contribute to its success. At EMNLP 2022, two papers with seemingly contradictory hypotheses were published: In &#8220;<a href="https://aclanthology.org/2022.emnlp-main.759.pdf">Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?</a>&#8221;, <a href="https://shmsw25.github.io/">Sewon Min</a> and others argued that random labels perform similarly to using ground-truth labels in the demonstrations. On the other hand, <a href="https://scholar.google.com/citations?user=BqaWtH8AAAAJ&amp;hl=en">Kang Min Yoo</a> and <a href="http://ids.snu.ac.kr/site/members/M_Junyeob_Kim.html">Junyeob Kim</a> found that ground-truth labels <em>are</em> important in &#8220;<a href="https://aclanthology.org/2022.emnlp-main.155.pdf">Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations</a>&#8221;. 
So, do we need ground-truth labels for ICL or not?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GRgi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad2fd2b-9a4b-408c-a4f6-f220c93afacb_1940x830.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GRgi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad2fd2b-9a4b-408c-a4f6-f220c93afacb_1940x830.png 424w, https://substackcdn.com/image/fetch/$s_!GRgi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad2fd2b-9a4b-408c-a4f6-f220c93afacb_1940x830.png 848w, https://substackcdn.com/image/fetch/$s_!GRgi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad2fd2b-9a4b-408c-a4f6-f220c93afacb_1940x830.png 1272w, https://substackcdn.com/image/fetch/$s_!GRgi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad2fd2b-9a4b-408c-a4f6-f220c93afacb_1940x830.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GRgi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad2fd2b-9a4b-408c-a4f6-f220c93afacb_1940x830.png" width="1456" height="623" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ad2fd2b-9a4b-408c-a4f6-f220c93afacb_1940x830.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:623,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:479214,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GRgi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad2fd2b-9a4b-408c-a4f6-f220c93afacb_1940x830.png 424w, https://substackcdn.com/image/fetch/$s_!GRgi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad2fd2b-9a4b-408c-a4f6-f220c93afacb_1940x830.png 848w, https://substackcdn.com/image/fetch/$s_!GRgi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad2fd2b-9a4b-408c-a4f6-f220c93afacb_1940x830.png 1272w, https://substackcdn.com/image/fetch/$s_!GRgi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ad2fd2b-9a4b-408c-a4f6-f220c93afacb_1940x830.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In their talk (<a href="https://drive.google.com/file/d/16_VrwrVz5UG-SoYgBYG4nWRZewriz1HX/view">slides</a>), Sewon and Junyeob shed light on this conundrum. ICL with random labels is more sensitive than using ground-truth labels&#8212;but still works with careful prompting. However, small deteriorations in performance can be observed, which are not negligible in real-world applications. Nevertheless, although their impact may vary across setups, the correctness of labels is still one of the core components of successful ICL.</p><p>They argue that the main reason why ICL without ground-truth labels works is because ICL activates priors from pre-training rather than learning new tasks on-the-fly. What I found very insightful is that they even put their findings in the context of a more recent work, &#8220;<a href="https://arxiv.org/abs/2303.03846">Larger language models do in-context learning differently</a>&#8221;, which hypothesizes that overriding such semantic priors is an emerging ability in larger models. Upon closer inspection of that paper&#8217;s results, however, they found that no model with flipped labels performs better than random. 
<p>So with standard ICL, even large models are unable to override their pre-training priors.</p><h2>Is "Attention = Explanation"?</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!krHJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b7ea75e-9a8c-466d-a47b-532fc541c135_2621x1086.png"><img src="https://substackcdn.com/image/fetch/$s_!krHJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b7ea75e-9a8c-466d-a47b-532fc541c135_2621x1086.png" width="1456" height="603" alt=""></a></figure></div><p>In 2019, there was a series of papers memorably titled &#8220;<a href="https://aclanthology.org/N19-1357/">Attention is not Explanation</a>&#8221; and &#8220;<a href="https://aclanthology.org/D19-1002/">Attention is not not Explanation</a>&#8221; that studied whether attention is useful as a faithful explanation of model predictions.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> In their joint talk (<a href="https://drive.google.com/file/d/1jcVezEjj_OXP7eLXd5xy8J37LLMpJ1A_/view">slides</a>), <a href="https://successar.github.io/">Sarthak Jain</a> and <a href="https://sarahwie.github.io/">Sarah Wiegreffe</a>, the first authors of the two papers, reconciled their findings and contextualized them with regard to recent developments in the field.</p><p>So, is &#8220;attention = explanation&#8221;? Putting things into perspective, both authors highlight that attention mechanisms in LSTM networks can serve as faithful explanation under certain conditions; there is no one-size-fits-all answer. However, faithfulness evaluation is difficult due to the lack of a ground truth.</p><p>But how useful is attention for explaining model predictions today? They highlighted that attention is no longer very useful for instance-level explanations but that it still matters for understanding the mechanisms underlying general-purpose Transformers beyond specific models, datasets, and tasks.</p>
<p>Understanding attention is thus still important in this context.</p><h2>Can Machines Learn Morality?</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zwAS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a7753ab-fbe2-4267-a0b1-448d7d027d26_3880x2332.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!zwAS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a7753ab-fbe2-4267-a0b1-448d7d027d26_3880x2332.jpeg" width="1456" height="875" alt=""></a></figure></div><p>In 2021, <a href="https://liweijiang.me/">Liwei Jiang</a> led researchers from AI2 in training an ML model, <a href="https://arxiv.org/abs/2110.07574">Delphi</a>, to reason about ethical judgements. To train the model, they created a new dataset, Commonsense Norm Bank, containing 1.7M examples of descriptive judgements on everyday situations. This research program was critiqued by <a href="https://twitter.com/zeeraktalat">Zeerak Talat</a> and others (&#8220;<a href="https://aclanthology.org/2022.naacl-main.56/">On the Machine Learning of Ethical Judgments from Natural Language</a>&#8221;). In 2023, Liwei and others created <a href="https://arxiv.org/abs/2309.00779">Value Kaleidoscope</a>, a new dataset to model potentially conflicting human values involved in decision-making.</p><p>In their joint talk (<a href="https://drive.google.com/file/d/1GIx653ht4vJ7g6Eny2fXDXH932WO-QIB/view">slides</a>), both discussed their research agendas and the challenges of teaching morality and ethics to AI models. They also engaged in a higher-level meta discussion on disagreements in science. 
They observed that science and conflict go hand in hand and found honesty and good-faith behavior key to resolving such situations.</p><h2>The Vision Thing</h2><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xPuD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d8e5bb-f76f-4553-9974-d523f55407f0_3894x2748.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!xPuD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d8e5bb-f76f-4553-9974-d523f55407f0_3894x2748.jpeg" width="1456" height="1028" alt=""></a></figure></div><p>Finally, Raymond J. Mooney gave an excellent invited talk (<a href="https://drive.google.com/file/d/1OAqt-1qFoHT8oNYB6NCyaqd7Skv37oik/view">slides</a>) in which he discussed the importance of finding and pursuing your research passion. He reviewed his changing research vision over the last 40+ years, which started with explanation-based learning and took him to bridging ML and NLP, ML for semantic parsing, and more recently grounded NLP, language and code, and language and 3D animation. It&#8217;s rare to get such a personal and inspired account of the motivations behind changes in a research vision from a luminary of the field.</p><p>For any aspiring researcher, this talk is a treasure chest full of useful and practical advice. I would highly recommend watching the <a href="https://us06web.zoom.us/rec/play/Plet9ua6wpWaNkPZXP5ioR2NGDgM5eucGW5Z7qjGdIG3FoI7NrVYwAYFnS6QevZ_67Y_q-ID7IRCUBVw.NGvDC97qxjofkCf0?canPlayFromShare=true&amp;from=share_recording_detail&amp;startTime=1701908520000&amp;componentName=rec-play&amp;originRequestUrl=https%3A%2F%2Fus06web.zoom.us%2Frec%2Fshare%2FRnBM-pPFJKaCxH_4FE0ehJyPw3ZfLxNWe_9SCkylWR40KIDW5y_bey4D_PJ8g2TC.7Dy5zKUm59N50z6y%3FstartTime%3D1701908520000">recording</a> (start: 0:07:00).</p><h2>Final Takeaways</h2><p>In summary, this was one of my favorite workshops that I&#8217;ve attended or organized. It&#8217;s a breath of fresh air when talks are more than just an oral recapitulation of a paper. At its best, research is a collaborative and sometimes argumentative conversation. Scientific publications are a culmination of this process but for various reasons<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> do not provide the full picture.</p><p>This workshop started as an experiment. We wanted to see whether we could turn scientific talks into something more akin to a debate that sheds light on a topic from different perspectives. Overall, the experiment was a success. The speakers did a stellar job presenting and contextualizing their research. The audience was engaged, speakers fielded a flood of questions for each talk, and we received a lot of positive feedback on the overall format. The speakers deserve special thanks, however! 
They put in extra effort by having multiple meetings to prepare and sync with their counterpart. Some of them had never spoken with each other before, so the workshop also served as an opportunity to form new connections.</p><p>The Big Picture Workshop won&#8217;t return at an NLP conference this year, but we hope to bring it back in 2025. Feel free to reach out to <a href="https://yanaiela.github.io/">Yanai Elazar</a> with ideas and feedback. If you enjoyed the format, you are welcome to organize a similar workshop at another venue. All materials (including the proposal, task tracker, email templates, etc.) are <a href="https://www.bigpictureworkshop.com/open-workshop">available online</a>. While it does take more effort to present a topic from different perspectives, we hope future workshops and presenters will consider taking a more debate-style approach to their talks.</p><p><em>Thank you to the sponsors of the workshop, Amazon, Google, and HuggingFace! Thanks to Yanai Elazar for feedback on this post.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>See the <a href="https://arxiv.org/stats/monthly_submissions">arXiv monthly submission stats</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p><a href="https://yanaiela.github.io/">Yanai Elazar</a>, <a href="https://aetting.github.io/">Allyson Ettinger</a>, <a href="https://norakassner.github.io/">Nora Kassner</a>, <a href="https://nasmith.github.io/">Noah A. 
Smith</a>, and I.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>The workshop was inspired by other workshops that seek to add more nuance to the research conversation such as <a href="https://ml-retrospectives.github.io/">ML Retrospectives</a> and the <a href="https://aclanthology.org/volumes/2023.insights-1/">Workshop on Insights from Negative Results</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The recording is available <a href="https://us06web.zoom.us/rec/play/Plet9ua6wpWaNkPZXP5ioR2NGDgM5eucGW5Z7qjGdIG3FoI7NrVYwAYFnS6QevZ_67Y_q-ID7IRCUBVw.NGvDC97qxjofkCf0?canPlayFromShare=true&amp;from=share_recording_detail&amp;startTime=1701908520000&amp;componentName=rec-play&amp;originRequestUrl=https%3A%2F%2Fus06web.zoom.us%2Frec%2Fshare%2FRnBM-pPFJKaCxH_4FE0ehJyPw3ZfLxNWe_9SCkylWR40KIDW5y_bey4D_PJ8g2TC.7Dy5zKUm59N50z6y%3FstartTime%3D1701908520000">here</a>. Approximate timestamps for the talks: Raymond J. Mooney (0:07:00); Sarah &amp; Sarthak (1:53:00); Liwei &amp; Zeerak (4:25:00); Sewon &amp; Junyeob (6:55:00).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>These were accompanied by Medium posts by <a href="https://medium.com/@yuvalpinter/attention-is-not-not-explanation-dbc25b534017">Yuval Pinter</a> and <a href="https://medium.com/@byron.wallace/thoughts-on-attention-is-not-not-explanation-b7799c4c3b24">Byron Wallace</a> and more recently revisited in &#8220;<a href="https://aclanthology.org/2022.acl-long.269/">Is Attention Explanation? An Introduction to the Debate</a>&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Due to space or time limitations, to create a convincing narrative, bias or lack of awareness of the authors, etc.</p></div></div>]]></content:encoded></item><item><title><![CDATA[NLP Research in the Era of LLMs]]></title><description><![CDATA[5 Key Research Directions Without Much Compute]]></description><link>https://newsletter.ruder.io/p/nlp-research-in-the-era-of-llms</link><guid isPermaLink="false">https://newsletter.ruder.io/p/nlp-research-in-the-era-of-llms</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Tue, 19 Dec 2023 09:53:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe230fdb3-f786-4328-9065-b226e80bea6f_2085x805.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Update Dec 30</em>: Added mentions of BabyLM and the Languini Kitchen.</p><div><hr></div><p>NLP research has undergone a paradigm shift over the last year. A range of large language models (LLMs) has validated the unreasonable effectiveness of scale<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. 
Currently, the state of the art on most benchmarks is held by LLMs that are expensive to fine-tune and prohibitive to pre-train outside of a few industry labs.</p><p>In the past, a barrier to doing impactful research has often been a lack of awareness of fruitful research areas and compelling hypotheses to explore<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. In contrast, <strong>NLP researchers today are faced with a constraint that is much harder to overcome: compute.</strong></p><p>In an era where running state-of-the-art models requires a garrison of expensive GPUs, what research is left for academics, PhD students, and newcomers to NLP without such deep pockets? Should they focus on the analysis of black-box models and niche topics ignored by LLM practitioners?</p><p>In this newsletter, I first argue why the current state of research is not as bleak&#8212;rather the opposite! I will then highlight five research directions that are important for the field and do not require much compute. I take inspiration from the following reviews of research directions in the era of LLMs:</p><ul><li><p><a href="https://arxiv.org/abs/2304.06035">Togelius &amp; Yannakakis. (Mar 2023). Choose Your Weapon: Survival Strategies for Depressed AI Academics</a></p></li><li><p><a href="https://arxiv.org/abs/2305.12544">Ignat et al. (May 2023). A PhD Student's Perspective on Research in NLP in the Era of Very Large Language Models</a></p></li><li><p><a href="https://arxiv.org/abs/2310.20633">Li et al. (Oct 2023). Defining a New NLP Playground</a></p></li><li><p><a href="https://arxiv.org/abs/2311.05020">Saphra et al. (Nov 2023). First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models</a></p></li><li><p><a href="https://aclanthology.org/2023.emnlp-main.0.pdf">Manning (Dec 2023). Academic NLP research in the Age of LLMs: Nothing but blue skies! </a><em><a href="https://aclanthology.org/2023.emnlp-main.0.pdf#page=34">EMNLP 2023 Keynote talk</a>, <a href="https://underline.io/events/431/sessions/16480/lecture/91315-academic-nlp-research-in-the-age-of-llms-nothing-but-blue-skies">recording</a> (requires EMNLP 2023 registration)</em></p></li></ul><p>I highly recommend these for different perspectives on current LLM research and for a broader overview of research topics beyond the ones presented in this article.</p><div><hr></div><h2>A Cause for Optimism</h2><p>Research is cyclical. Computer scientist and <a href="https://www.aclweb.org/adminwiki/index.php?title=ACL_Lifetime_Achievement_Award_Recipients">ACL lifetime achievement award recipient</a> Karen Sp&#228;rck Jones <a href="https://aclanthology.org/www.mt-archive.info/Zampolli-1994-Sparck-Jones.pdf">wrote in 1994</a>:</p><blockquote><p><em>Those [&#8230;] who had been around for a long time, can see old ideas reappearing in new guises [&#8230;]. 
But the new costumes are better made, of better materials, as well as more becoming: so research is not so much going round in circles as ascending a spiral.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vqXl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e65b73-0428-415b-be9f-7d536e6eea69_1155x372.png"><img src="https://substackcdn.com/image/fetch/$s_!vqXl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91e65b73-0428-415b-be9f-7d536e6eea69_1155x372.png" width="1155" height="372" alt=""></a></figure></div><p>In the same vein, <a href="https://arxiv.org/abs/2311.05020">Saphra et al. (2023)</a> highlight the similarities between the current era of LLMs and the Statistical Machine Translation (SMT) era where translation performance was shown to scale by training a phrase-based language model on more and more web data.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fbaw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa703d864-1d08-476e-b449-a5b680a65afb_1846x983.png"><img src="https://substackcdn.com/image/fetch/$s_!Fbaw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa703d864-1d08-476e-b449-a5b680a65afb_1846x983.png" width="1456" height="775" alt=""></a><figcaption class="image-caption">Results slide of Franz Och&#8217;s keynote talk at a 2005 MT workshop. Credit: <a href="https://arxiv.org/abs/2311.05020">Saphra et al. 
(2023)</a>.</figcaption></figure></div><p>More recently, we have seen the success of scale with the advent of word embeddings in 2013 and the emergence of pre-trained LMs in 2018.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> In all cases, academic research was not left in the dust but went on to make contributions that shaped the next era, from KenLM (<a href="https://aclanthology.org/W11-2123/">Heafield, 2011</a>), an efficient LM library that enabled academics to outperform industry MT systems, to the word2vec alternative GloVe (<a href="https://aclanthology.org/D14-1162/">Pennington et al., 2014</a>), to pre-trained LMs developed in non-profits and academia such as ELMo (<a href="https://aclanthology.org/N18-1202/">Peters et al., 2018</a>) and ULMFiT (<a href="https://aclanthology.org/P18-1031/">Howard &amp; Ruder, 2018</a>).</p><p><strong>The main lesson here is that while massive compute often achieves breakthrough results, its usage is often inefficient. Over time, improved hardware, new techniques, and novel insights provide opportunities for dramatic compute reduction.</strong></p><p>In his <a href="https://smerity.com/articles/2018/limited_compute.html">2018 article</a>, Stephen Merity provides two examples of this trend where the first instance of a method was exorbitantly compute-intensive while only a year later, compute costs were dramatically reduced:</p><ul><li><p>New York Times (2012): "<a href="https://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html">How Many Computers to Identify a Cat? </a><strong><a href="https://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html">16,000 (CPU cores)</a></strong>"<br>One year later: "<strong><a href="http://proceedings.mlr.press/v28/coates13.html">three servers each with two quad-core CPUs and four Nvidia GeForce GTX 680 GPUs</a></strong>"</p></li><li><p>Neural Architecture Search: "<strong>32,400-43,200 GPU hours</strong>"<br>Just over a year later: "<strong>single Nvidia GTX 1080Ti GPU</strong>, the search for architectures takes <strong>less than 16 hours</strong>" (1000x less) (<a href="https://arxiv.org/abs/1802.03268">paper</a>)</p></li></ul><p>We could argue why the same trend may not be true for this era of LLMs. After all, new techniques can also be scaled up and scale ultimately prevails as we know. In addition, the current trend of closed-source models makes it harder to build on them.</p><p>On the other hand, new powerful open-source models are still released regularly<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. Companies are also incentivized to invest in the development of smaller models in order to reduce inference cost. 
Finally, we are starting to see the limits of scale on the horizon: recent LLMs are reaching the limits of text data online and repeating data eventually leads to diminishing returns (<a href="https://arxiv.org/abs/2305.16264">Muennighoff et al., 2023</a>) while Moore&#8217;s law is <a href="https://www.investopedia.com/terms/m/mooreslaw.asp">approaching its physical limits</a>.</p><p>There are already recent examples that require a fraction of compute by using new methods and insights, demonstrating that this trend also holds in the era of LLMs:</p><ul><li><p>FlashAttention (<a href="https://arxiv.org/abs/2205.14135">Dao et al., 2022</a>) provides drastic speedups over standard attention through clever hardware optimization.</p></li><li><p>Parameter-efficient fine-tuning methods (see our <a href="https://tinyurl.com/modular-fine-tuning-tutorial?ref=ruder.io">EMNLP 2022 tutorial</a> for an overview) including <a href="https://github.com/adapter-hub/adapters">adapters</a> such as LoRA (<a href="https://arxiv.org/abs/2106.09685">Hu et al., 2021</a>) and QLoRA (<a href="https://arxiv.org/abs/2305.14314">Dettmers et al., 2023</a>) enable fine-tuning LLMs on a single GPU.</p></li><li><p><a href="https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/">Phi-2</a>, a new 2.7B-parameter LLM released last week matches or outperforms models up to 25x its size.</p></li></ul><p>In the near term, the largest models using the most compute will continue to be the most capable. <strong>However, there remains a lot of room for innovation by focusing on strong smaller models and on areas where compute requirements will inexorably be eroded by research progress.</strong></p><p>While LLM projects typically require an exorbitant amount of resources, it is important to remind ourselves that <strong>research does not need to assemble full-fledged massively expensive systems in order to have impact</strong>. 
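</p><p>To make the parameter-efficient fine-tuning point above concrete, here is a minimal sketch of adding LoRA adapters to a causal LM with the Hugging Face <code>peft</code> library. The base checkpoint, target modules, and hyperparameters are illustrative assumptions rather than a recommended recipe; QLoRA additionally quantizes the frozen base model so that fine-tuning fits on a single GPU.</p><pre><code>import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Illustrative base checkpoint; any decoder-only model works the same way.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA injects small low-rank matrices into the attention projections;
# only these are trained while the base weights stay frozen.
lora_config = LoraConfig(
    r=8,                # rank of the low-rank update
    lora_alpha=16,      # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
</code></pre><p>The wrapped model can then be passed to a standard training loop; only the small adapter weights need to be stored and shared afterwards.</p><p>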
In his EMNLP 2023 keynote, Chris Manning made the apt analogy that, in the same vein, aerospace engineering students are not expected to engineer a new airplane during their studies.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k77e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6bd6fa4-bfbf-4af2-b66d-44417de1ce42_2560x1447.webp"><img src="https://substackcdn.com/image/fetch/$s_!k77e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6bd6fa4-bfbf-4af2-b66d-44417de1ce42_2560x1447.webp" width="1456" height="823" alt=""></a><figcaption class="image-caption">Members of the MIT aerospace engineering lab prepare a model of a super-efficient commercial aircraft for testing.</figcaption></figure></div><p>With that in mind, let&#8217;s look at five important research areas that require less compute.</p><div><hr></div><h2>1.  Efficient Methods</h2><p>Rather than waiting for compute costs to go down, making LLMs more efficient can have a wide impact. When we talk about efficiency, we often think about making the model architecture itself more efficient. In fact, most work on efficient Transformers has focused on a specific component: the attention mechanism (<a href="https://arxiv.org/abs/2009.06732">Tay et al., 2022</a>).</p><p><strong>However, when thinking about efficiency, it is useful to consider the entire LLM stack.</strong> Important components ripe for improvement are:</p><ol><li><p><strong>Data collection and preprocessing</strong>: improving data efficiency by better filtering and data selection (see the sketch below).</p></li><li><p><strong>Model input</strong>: faster, more informed tokenization; better word representations via character-level modeling</p></li><li><p><strong>Model architecture</strong>: better scaling towards long-range sequences; more effective use of memory</p></li><li><p><strong>Training</strong>: more efficient methods to train small-scale LLMs via more effective distillation, better learning rate schedules and restarts, (partial) model compression, model surgery, etc.</p></li><li><p><strong>Downstream task adaptation</strong>: improved parameter-efficient fine-tuning; automatic prompt and chain-of-thought design; <a href="https://www.modulardeeplearning.com/">modular methods</a>; improved RLHF</p></li><li><p><strong>Inference</strong>: early predictions; prompt compression; human-in-the-loop interactions</p></li><li><p><strong>Data annotation</strong>: model-in-the-loop annotation; automatic arbitration and consolidation of annotations</p></li><li><p><strong>Evaluation</strong>: efficient automatic metrics; efficient benchmarks</p></li></ol><p>Given the wide range of LLM applications, it is increasingly <strong>important to consider the &#8216;human&#8217; part in efficiency</strong>: from annotation, to learning from human preferences, to interacting with users, can we make the stages where human and LLM data intersect more efficient and reliable?</p>
<p><strong>Sparsity and low-rank approximations are two general principles</strong> that have been applied in a wide range of efficient methods (see our <a href="https://arxiv.org/abs/2302.11529">modular deep learning survey</a> for an overview) and are thus useful sources of inspiration: are there components that are modeled with an excess number of parameters and could be approximated instead? Are there computations involving multiple steps that can be shortened?</p><p>In the age of LLMs, the clearest indicator that an efficient method works is that it reduces the coefficient (in other words, lowers the slope) of the corresponding scaling law, as seen for instance in <a href="https://arxiv.org/abs/2203.15556">Hoffmann et al. (2022)</a>.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GVQS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe230fdb3-f786-4328-9065-b226e80bea6f_2085x805.png"><img src="https://substackcdn.com/image/fetch/$s_!GVQS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe230fdb3-f786-4328-9065-b226e80bea6f_2085x805.png" width="1456" height="562" alt=""></a><figcaption class="image-caption">More compute-efficient training leads to improved scaling of the Chinchilla model (<a href="https://arxiv.org/abs/2203.15556">Hoffmann et al., 2022</a>) compared to the predictions of the earlier scaling law by <a href="https://arxiv.org/abs/2001.08361">Kaplan et al. (2020)</a>. </figcaption></figure></div><p>But how can we validate a scaling law without massive compute? By prioritizing experimentation in small-scale regimes.</p><div><hr></div><h2>2.  Small-scale Problems</h2><p>While it is generally prohibitive to apply a new method directly to the largest model, using it on a smaller, representative model can serve as a useful prototype and proof of concept.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> These days in particular, <strong>one should not underestimate the pace of the ML and NLP community, which is receptive to and quick to adopt compelling new ideas.</strong></p><p>For instance, the recently proposed DPO method (<a href="https://arxiv.org/abs/2305.18290">Rafailov et al., 2023</a>) used a relatively small-scale experimental setting in the paper (GPT-2-large fine-tuned on IMDb reviews, among others). 
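</p><p>DPO replaces the reinforcement learning step of RLHF with a simple classification-style loss over pairs of preferred and dispreferred responses. A minimal sketch of the loss in PyTorch follows; the inputs are summed log-probabilities of each response under the policy and under a frozen reference model, and the value of beta is illustrative.</p><pre><code>import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -&gt; torch.Tensor:
    """Direct Preference Optimization loss (Rafailov et al., 2023)."""
    # Implicit rewards: how much more likely the policy makes a response
    # than the frozen reference model does, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
</code></pre><p>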
As the code was <a href="https://github.com/eric-mitchell/direct-preference-optimization">open-sourced</a> and compatible with common LLM frameworks, community members quickly applied it to more recent models such as <a href="https://huggingface.co/blog/dpo-trl">Llama-2</a> and <a href="https://huggingface.co/blog/alvarobartt/notus-7b-v1">Zephyr</a>.</p><p>Expect to see more of this mode of operation: <strong>academic researchers developing new methods that&#8212;after small-scale validation&#8212;are shared with the community for further experimentation and scaling up.</strong></p><p>Methods can also be developed on benchmarks that measure compute and sample efficiency and are designed with compute constraints in mind. Examples include the BabyLM Challenge (<a href="https://arxiv.org/abs/2301.11796">Warstadt et al., 2023</a>)&#8212;which focuses on sample-efficient pre-training on a developmentally plausible corpus of 10M and 100M tokens&#8212;and the Languini Kitchen (<a href="https://arxiv.org/abs/2309.11197">Stani&#263; et al., 2023</a>), which compares models based on equivalent compute.</p><p>Another setting where a focus on a small scale is increasingly valuable is analysis and model understanding. Through pre-training, models learn a wide array of natural language understanding capabilities&#8212;but under exactly what conditions these capabilities emerge remains unclear.</p><p>Large-scale pre-training, due to the massive nature of most of the components involved, mostly resists a controlled examination.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> <strong>Instead, controlled small and synthetic settings that allow probing of specific hypotheses will be increasingly important to understand how LLMs learn and acquire capabilities.</strong> Such settings can include synthetic language such as bigram data (<a href="https://arxiv.org/abs/2306.00802">Bietti et al., 2023</a>) or &#8220;fake&#8221; English (<a href="https://openreview.net/forum?id=HJeT3yrtDr">K et al., 2020</a>), highly curated and domain-specific data, and data satisfying certain (distributional) characteristics; as well as more interpretable models such as small Transformers, backpack language models (<a href="https://aclanthology.org/2023.acl-long.506/">Hewitt et al., 2023</a>), and neural additive models (<a href="https://arxiv.org/abs/2004.13912">Agarwal et al., 2021</a>).</p><p>LLM mechanisms whose emergence is still poorly understood include the following:</p><ul><li><p><strong>in-context learning</strong>: &#8216;burstiness&#8217; and the highly skewed distribution of language data are important (<a href="https://arxiv.org/abs/2205.05055">Chan et al., 2022</a>) but the in-context learning ability can also disappear again during training (<a href="https://arxiv.org/abs/2311.08360">Singh et al., 2023</a>)</p></li><li><p><strong>chain-of-thought prompting</strong>: local structure in the training data is important (<a href="https://openreview.net/pdf?id=rcXXNFVlEn">Prystawski et al., 2023</a>) but we don&#8217;t know how this relates to natural language data</p></li><li><p><strong>cross-lingual generalization</strong>: limited parameters, shared special tokens, shared position embeddings, and a common masking strategy contribute to multilinguality (<a href="https://aclanthology.org/2020.acl-main.421/">Artetxe et al., 2019</a>; <a href="https://aclanthology.org/2020.emnlp-main.358/">Dufter &amp; Sch&#252;tze, 2020</a>) but it is 
unclear how this extends to diverse natural language data and typologically diverse languages</p></li><li><p><strong>other types of emerging abilities</strong> (see for instance <a href="https://openreview.net/pdf?id=ITw9edRDlD">Schaeffer et al., 2023</a>)</p></li></ul><p>Rather than trying to make large-scale settings smaller to reduce the amount of compute necessary to study them, we can also focus on settings that are intrinsically small-scale due to constraints on the data available.</p><div><hr></div><h2>3.  Data-constrained Settings</h2><p>While the largest LLMs are pre-trained on trillions of tokens, the downstream applications we would like to apply them to are often more limited in the data available to them.</p><p>This is true for many interdisciplinary areas such as NLP for Science, Education, Law, and Medicine. In many of these domains, there is very little high-quality data easily accessible online. <strong>LLMs thus must be combined with domain-specific strategies to achieve the biggest impact.</strong> See <a href="https://arxiv.org/abs/2310.20633">Li et al. (2023)</a> for a brief review of directions in NLP+X applications.</p><p>Another area where data is notoriously limited is multilinguality. <strong>For many languages, the amount of text data online is limited&#8212;but data may be available in other formats such as lexicons, undigitized books, podcasts, and videos.</strong> This requires new strategies to collect&#8212;and create&#8212;high-quality data. Furthermore, many languages and dialects are more commonly spoken than written, which makes multi-modal models important to serve such languages.</p><p><strong>As we reach the limits of data available online, even &#8220;high-resource&#8221; languages will face data constraints.</strong> New research will need to engage with these constraints rather than assuming an infinite-scale setting.</p><p>While few-shot prompting enables seamless application to many downstream tasks, it is insufficient to teach a model about the nuances of more complex applications and is <a href="https://docs.google.com/presentation/d/1seHOJ7B0bQEPJ3LBW5VmruMCILiVRoPb8nmU2OS-Eqc/edit?ref=ruder.io#slide=id.g1a37bfe6b5e_3_92">limited in other ways</a>. Alternatively, parameter-efficient fine-tuning enables a more holistic adaptation using little compute. Such fine-tuning&#8212;when updates are constrained to a subset of model parameters&#8212;gives rise to <a href="https://arxiv.org/abs/2302.11529">modular models</a>.</p><p>Given the diversity of LLM application areas and capabilities to master, another interesting direction is thus to leverage multiple modular &#8216;experts&#8217; by learning to disentangle and combine the skills and knowledge learned across different domains.</p><p>Such modeling advances, however, are of little use if we do not have reliable means to evaluate them.</p><div><hr></div><h2>4.  Evaluation</h2><blockquote><p><em>"[...] benchmarks shape a field, for better or worse. Good benchmarks are in alignment with real applications, but bad benchmarks are not, forcing engineers to choose between making changes that help end users or making changes that only help with marketing."&#8212;David A. Patterson; foreword to <a href="https://www.springer.com/gp/book/9783030417048?ref=ruder.io">Systems Benchmarking (2020)</a></em></p></blockquote><p>In 2021, a common sentiment was that NLP models had outpaced the benchmarks to test for them. 
I reviewed the situation in <a href="https://www.ruder.io/nlp-benchmarking/">this article</a>; not much has changed since then. More recent benchmarks designed to evaluate LLMs such as <a href="https://crfm.stanford.edu/helm/latest/#/leaderboard">HELM</a> (<a href="https://arxiv.org/abs/2211.09110">Liang et al., 2022</a>) and Super-NaturalInstructions (<a href="https://aclanthology.org/2022.emnlp-main.340/">Wang et al., 2022</a>) still mainly consist of standard NLP tasks&#8212;most of them sentence-level&#8212;while others such as MMLU (<a href="https://arxiv.org/abs/2009.03300">Hendrycks et al., 2021</a>) and AGIEval (<a href="https://arxiv.org/abs/2304.06364">Zhong et al., 2023</a>) focus on exams. These benchmarks do not reflect the diverse range of tasks where we would like to apply LLMs.</p><p>Another phenomenon to be aware of is leaderboard contamination: benchmark data that is available online is likely to have been included in the pre-training data of LLMs, making evaluation unreliable. Benchmarks should thus keep evaluation data secret or receive regular updates.</p><blockquote><p><em>"When you can measure what you are speaking of and express it in numbers, you know that on which you are discussing. But when you cannot measure it and express it in numbers, your knowledge is of a very meagre and unsatisfactory kind."</em>&#8212;Lord Kelvin</p></blockquote><p>In addition, existing automatic metrics are ill-suited for more complex downstream applications and open-ended natural language generation tasks. LLMs can be incorporated into automatic metrics (<a href="https://aclanthology.org/2023.emnlp-main.153/">Liu et al., 2023</a>) but one must be aware of&#8212;and mitigate&#8212;their biases. For complex tasks, it may be useful to decompose them into subtasks that are easier to evaluate, for instance, via behavioral tests (<a href="https://aclanthology.org/2023.acl-long.396/">Hlavnova &amp; Ruder, 2023</a>).</p><p>As applications become more elaborate, even human evaluation, traditionally perceived to be the gold standard for any data, becomes less reliable. Disagreements may be less an indicator of &#8216;annotation noise&#8217; than a sign of different perspectives (<a href="https://aclanthology.org/Q19-1043/">Pavlick &amp; Kwiatkowski, 2019</a>). For specialized applications, only domain experts may be qualified enough to provide accurate feedback. Leveraging and aggregating the feedback of a diverse set of annotators from different backgrounds is thus more important than ever.</p><div><hr></div><h2>5.  Reasoning</h2><p>Reasoning requires the use of logic to combine new and existing information in order to arrive at a conclusion. With LLMs demonstrating surprising arithmetic and logical reasoning abilities, reasoning has received renewed attention and was well-represented in NeurIPS 2023 papers (see my previous newsletter): </p><p><em>&#128196; <a href="https://nlpnewsletter.substack.com/p/neurips-2023-primer">NeurIPS 2023 Primer</a></em></p><p>Given that LLMs frequently hallucinate and struggle to generate code or plans that are directly executable, augmenting them with external tools or small domain-specific models is a promising direction to make them more robust. For instance, Parsel (<a href="https://nlpnewsletter.substack.com/p/neurips-2023-primer">Zelikman et al., 2023</a>) decomposes a code generation task into LLM-generated subfunctions that can be tested against input-output constraints using a code execution module.</p><p>Many complex real-world applications require different forms of reasoning, so evaluating models&#8217; reasoning abilities in realistic scenarios is an important challenge. Given that many real-world problems require weighing different options and preferences, it will be crucial to enable LLMs to present different solutions to users and to incorporate different cultural backgrounds into their decision-making. <a href="https://arxiv.org/abs/2305.12544">Ignat et al. (2023)</a> highlight other interesting research directions related to reasoning.</p><div><hr></div><p>This post presented a selection of five research directions that are particularly important in my opinion&#8212;but in truth there are a plethora of potential opportunities to explore (see the other reviews at the beginning of this post). Now is the time to look beyond standard NLP tasks and be ambitious. 
After all:</p><blockquote><p><em>&#8220;Shoot for the moon. Even if you miss, you'll land among the stars.&#8221;<br>&#8212;Norman Vincent Peale</em></p></blockquote><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Also known as <a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">Rich Sutton&#8217;s bitter lesson</a>, in other words, &#8220;<em>the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great</em>&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>I shared &#8216;<a href="https://www.ruder.io/requests-for-research/">requests for research</a>&#8217; in 2018 where each task required a few GPUs at most for training a neural network from scratch.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>For an overview of NLP milestones until 2018, check out my <a href="https://www.ruder.io/a-review-of-the-recent-history-of-nlp/">&#8220;Review of the Neural History of NLP&#8221;</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Hugging Face referred to <a href="https://huggingface.co/blog/2023-in-llms">2023 as the year of open LLMs</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>For example, we recently trained mmT5 (<a href="https://arxiv.org/abs/2305.14224">Pfeiffer et al., 2023</a>), a modular version of multilingual T5 that dramatically outperforms its counterpart at small and base parameter sizes, demonstrating the benefits of modularity at this scale.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Nevertheless, certain areas such as memorization can be studied during large-scale pre-training by modifying the pre-training scheme appropriately, for instance via the insertion of &#8216;canaries&#8217; as in PaLM 2 (<a href="https://arxiv.org/abs/2305.10403">Anil et al., 2023</a>).</p></div></div>]]></content:encoded></item><item><title><![CDATA[📄 EMNLP 2023 Primer]]></title><description><![CDATA[In this newsletter, I&#8217;ll discuss a selection of exciting papers and workshops I&#8217;m looking forward to at EMNLP 2023 and the trends I observed.]]></description><link>https://newsletter.ruder.io/p/emnlp-2023-primer</link><guid isPermaLink="false">https://newsletter.ruder.io/p/emnlp-2023-primer</guid><pubDate>Tue, 05 Dec 2023 07:36:08 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b6eb16fc-914b-404e-b5a8-df37ae73f090_3584x2240.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://2023.emnlp.org/">EMNLP 
2023</a>, one of the biggest NLP conferences takes place this week from Dec 6&#8211;10 in Singapore.</p><p>In this newsletter, I&#8217;ll discuss a selection of exciting papers and workshops I&#8217;m looking forward to at the conference. Here are the main trends I observed (based on the small sample of papers I discuss here and those I came across online):</p><ol><li><p><strong>Instruction-tuned LMs and LLMs are everywhere.</strong> Similar to <a href="https://www.ruder.io/acl-2021-highlights/">earlier years</a> where BERT was ubiquitous, instruction-tuned language models (LMs) and large language models (LLMs) are used in almost every paper.</p></li><li><p><strong>Evaluation based on LLMs is increasingly common.</strong> While some papers employ automatic evaluation based on GPT-4, new metrics that are proposed are based on LLMs in zero-shot prompted or fine-tuned settings.</p></li><li><p><strong>Prompt usage is getting more creative.</strong> Beyond a standard prompt template, prompts are getting increasingly complex and specialized to the desired setting. Techniques such as <a href="https://arxiv.org/abs/2201.11903">chain-of-thought prompting</a> are common tools. </p></li><li><p><strong>Multilinguality&nbsp;is increasingly popular.</strong> I came across a substantial number of papers studying multilingual settings, which indicates that LLMs are still limited in non-English settings and that making LLMs more multilingual is an important direction. </p></li></ol><p>On the other hand, I did not come across many papers that tried to analyze LLM properties (using a synthetic setup, for instance) or that used external models or tools to augment LLMs (<em>please point me to papers that I missed</em>). This is in contrast to NeurIPS 2023 where such papers were more common (see the below newsletter).</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;34bdeed0-343f-4f04-ad63-341ba765c114&quot;,&quot;caption&quot;:&quot;NeurIPS 2023, arguably this year&#8217;s biggest AI conference takes place in two weeks from Dec 10&#8211;16 in New Orleans. 3586 papers were accepted to the conference, which are available online. In this newsletter, I&#8217;ll discuss a selection of 20 papers related to natural language processing (NLP) that caught my eye, with a focus on oral and spotlight papers. 
Here&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;&#128196; NeurIPS 2023 Primer&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:7965403,&quot;name&quot;:&quot;Sebastian Ruder&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/17fdd4c3-a575-4fe4-b58e-d876b78bfe2f_2416x2416.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2023-12-01T15:51:54.818Z&quot;,&quot;cover_image&quot;:null,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://nlpnewsletter.substack.com/p/neurips-2023-primer&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:138865653,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:34,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;NLP News&quot;,&quot;publication_logo_url&quot;:&quot;https://s3.amazonaws.com/revue/profiles/images/000/027/670/thumb/sebastian_ruder_profile_photo_square.jpg?1616408613&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>I&#8217;ll be attending the conference in-person so say &#8220;hi&#8221; if you&#8217;re there. </p><p><em>Papers with a &#8224; are presented in Findings of EMNLP (rather than the main conference). I am an author/co-organizer on papers/events indicated with a *.</em></p><h3>Workshops</h3><p><strong><a href="https://genbench.org/workshop/">GenBench Workshop</a>, Dec 6.</strong>  Generalization is crucial to ensure robust model behavior but how good generalization looks like and how it should be evaluated is still not well understood. The GenBench workshop on (benchmarking) generalization in NLP aims to catalyse research on generalization and how to measure it in the NLP community. Accepted papers study generalization or are <a href="https://github.com/google/BIG-bench">BIG-bench</a>-style <a href="https://genbench.org/cbt/">collaborative benchmarking tasks</a> (CBT). The <a href="https://genbench.org/workshop_programme/">program</a> consists of invited talks, CBT spotlights as well as oral presentations and posters.</p><p><strong><a href="https://nlposs.github.io/2023/index.html">Workshop for NLP Open-Source Software (NLP-OSS)</a>, Dec 6.</strong>  In light of the increasing number of closed-source LLMs, it is important to continue to promote an open culture of sharing knowledge, data, and software, from which the NLP community has benefited greatly. This workshop aims to further the sharing of insights regarding the creation and development of NLP open-source software. Invited talks feature important NLP open-source projects including <a href="https://github.com/CarperAI/trlx">trlX</a>, a framework for large-scale open-source RLHF and <a href="https://github.com/aisingapore/sealion">SEA-LION</a>, LLMs pre-trained for Southeast Asian languages.</p><p><strong><a href="https://www.bigpictureworkshop.com/">The Big Picture Workshop: Crafting a Research Narrative</a>, Dec 7</strong>*.&nbsp; In research, we &#8220;stand on the shoulders of giants&#8221;. However, given the number and rapid pace of published papers, it has become increasingly difficult, to recognize the larger story to which a paper is connected. The Big Picture Workshop aims to explore and distill such broader research narratives. 
We have a <a href="https://www.bigpictureworkshop.com/accepted-papers">diverse set of accepted papers</a> that provide insightful syntheses of different threads of research. On the workshop day, we&#8217;ll try out a new presentation format where we have researchers from different groups working on the same topic critically reflect on and discuss their work.</p><p><strong><a href="https://sigtyp.github.io/ws2023-mrl.html">Multilingual Representation Learning Workshop (MRL)</a>, Dec 7*</strong>.  This workshop provides a forum to discuss work to improve NLP in low-resource and under-represented languages. The large number of accepted papers and Findings papers explore a diverse set of methods, from meta-learning to tokenization and instruction tuning. In addition, shared task on multilingual multi-task information retrieval provided new data for NER and QA for a typologically diverse set of languages. The workshop day is jam-packed with excellent invited talks, poster, shared task and best paper sessions.</p><h3>Unanswerability and attribution in QA</h3><p>In question answering (QA), a crucial challenge for current LLMs is hallucinating answers. A scenario where such hallucinations are common is when questions do not have an answer. To deal with hallucinations, a promising strategy is to train the model to attribute the answer to relevant references.</p><p><strong><a href="https://arxiv.org/abs/2310.11877">The Curious Case of Hallucinatory (Un)answerability: Finding Truths in the Hidden States of Over-Confident Large Language Models (Slobodkin et al.)</a>.</strong>  This work shows that LLMs are aware of the concept of (un)answerability and that the representation of the first decoded token provides a strong indicator whether a question is answerable (removing this information in the first token significantly decreases performance). Furthermore, mentioning that the question is unanswerable in the prompt improves performance. <strong>These results highlight that developing better decoding methods will also help make LLMs more factual.</strong></p><p><strong><a href="https://arxiv.org/abs/2305.14332">Evaluating and Modeling Attribution for Cross-Lingual Question Answering (Muller et al.)</a>*.</strong>  This paper introduces attribution for cross-lingual question answering where the document supporting the generated answer may be in a different language than the question and answer. It creates the <a href="https://github.com/google-research/google-research/tree/master/xor_attriqa">XOR-AttriQA dataset</a> to measure attribution of SOTA QA models across 5 languages. Surprisingly, a large portion of generated answers are not attributable to any retrieved passage (up to 47% of <em>correctly</em> predicted answers in Japanese are not attributable). <strong>Current QA systems are thus often right but without any evidence, making them untrustworthy. </strong>Multilingual LLMs can be used to accurately detect attribution (which can complement string-based evaluation metrics) and can be used to rerank generated answers, improving QA performance. <strong>Key research directions are a) improving retrieval of cross-lingual passages and b) designing robust LLM-based metrics for QA evaluation.</strong></p><h3>Instruction tuning</h3><p>Instruction tuning is a common way to improve LLMs for downstream settings and to align them to human behavior. However, current instruction tuning datasets still have their limitations (see the below newsletters for an overview). 
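</p><p>For readers newer to the area, it is worth spelling out what instruction tuning looks like at the data level. Below is a minimal sketch: format an (instruction, response) pair with a template and mask the prompt tokens out of the loss. The template, field names, and tokenizer here are illustrative assumptions, not taken from any particular dataset.</p><pre><code>
# Minimal instruction-tuning data sketch: build input ids for "prompt + response"
# and mask the prompt positions so the loss is only computed on the response.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder causal LM tokenizer

example = {
    "instruction": "Summarize the following sentence in five words.",
    "input": "The committee postponed the vote because several members were absent.",
    "output": "Vote postponed due to absences.",
}

prompt = (
    f"### Instruction:\n{example['instruction']}\n\n"
    f"### Input:\n{example['input']}\n\n### Response:\n"
)
prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
response_ids = tokenizer(example["output"] + tokenizer.eos_token,
                         add_special_tokens=False).input_ids

input_ids = prompt_ids + response_ids
# -100 is the conventional ignore index, so only response tokens contribute to the loss
labels = [-100] * len(prompt_ids) + response_ids
</code></pre><p>Training then proceeds as standard causal language modeling on <code>input_ids</code> with the masked <code>labels</code>; the datasets covered in this section differ mainly in where the (instruction, response) pairs come from and how diverse they are. 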
</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;3b14f3fa-4a60-4888-98bd-ae852d0f70be&quot;,&quot;caption&quot;:&quot;NLP and ML have gone through several phases of how models are trained in recent years. With the arrival of pre-trained models such as BERT, fine-tuning pre-trained models for downstream tasks became the norm. The increasing capabilities of ever larger models then enabled in-context learning via prompting. Recently, instruction tuning has become the newe&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;&#129489;&#8205;&#127979; Instruction Tuning Vol. 1&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:7965403,&quot;name&quot;:&quot;Sebastian Ruder&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/17fdd4c3-a575-4fe4-b58e-d876b78bfe2f_2416x2416.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2023-10-04T08:00:22.414Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8e6cae-c3f0-4776-a355-a7d9b149b6bc_718x929.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://nlpnewsletter.substack.com/p/instruction-tuning-vol-1&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:136684903,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:52,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;NLP News&quot;,&quot;publication_logo_url&quot;:&quot;https://s3.amazonaws.com/revue/profiles/images/000/027/670/thumb/sebastian_ruder_profile_photo_square.jpg?1616408613&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;b3ec4ddf-6c9b-4606-99e7-5c467fcccfc0&quot;,&quot;caption&quot;:&quot;Last month, we covered the first generation of instruction-tuning datasets that have been mainly based on existing NLP tasks. This month, we cover the latest datasets that are now much closer to real-world use cases&#8212;but still have their limitations!&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Instruction Tuning Vol. 
2&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:7965403,&quot;name&quot;:&quot;Sebastian Ruder&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/17fdd4c3-a575-4fe4-b58e-d876b78bfe2f_2416x2416.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2023-11-15T10:13:42.126Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64293fa7-f3f0-4cd6-80e5-01cdad4493b4_1597x980.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://nlpnewsletter.substack.com/p/instruction-tuning-vol-2&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:137635802,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:19,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;NLP News&quot;,&quot;publication_logo_url&quot;:&quot;https://s3.amazonaws.com/revue/profiles/images/000/027/670/thumb/sebastian_ruder_profile_photo_square.jpg?1616408613&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><strong><a href="https://arxiv.org/abs/2305.14045">The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning (Kim et al.).</a></strong>  This paper introduces the Chain-of-Thought (CoT) Collection, which augments the <a href="https://arxiv.org/abs/2301.13688">Flan collection</a> covering 1,060 tasks with 1.84M chain-of-thought rationales. This amount of chain-of-thought instruction tuning data is particularly useful for smaller LMs and improves their performance on reasoning tasks including BIG-bench Hard and the multilingual MGSM benchmark.&nbsp; </p><h3>Task Adaptation</h3><p>While LLMs achieve very strong performance in a zero-shot setting, it is necessary to fine-tune them on task data to achieve the best performance. Keeping models updated as the distribution changes and encoding task knowledge efficiently across many settings are key challenges in this area.</p><p><strong><a href="https://arxiv.org/abs/2305.15076">Meta-Learning Online Adaptation of Language Models (Hu et al.)</a>.</strong>  Keeping LLMs up-to-date is an important challenge as it is prohibitive to re-train these models. This paper hypothesizes that when continual fine-tuning a model on a stream of documents, the learning signal of important documents may be drowned out. To ameliorate this, the authors propose to meta-train a small model to reweigh the LM loss for each token during online fine-tuning in order to maximize the QA model&#8217;s performance after a single weighted update. They find that this dynamic weighting significantly outperforms standard fine-tuning and weighting heuristics.</p><p><strong><a href="https://arxiv.org/abs/2311.11077">Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning (Poth et al.)*.</a></strong>  Full fine-tuning of LLMs has become prohibitive and requires parameter-efficient methods instead. This demo paper presents <a href="https://github.com/adapter-hub/adapters">Adapters</a>, a library for <a href="https://www.modulardeeplearning.com/">parameter-efficient and modular learning</a> with LLMs and the successor to <code>adapter-transformers</code>. 
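</p><p>To give a flavor of what such modular methods actually do, here is a stripped-down LoRA-style layer in plain PyTorch. This is a rough sketch of the general idea only, not the Adapters API.</p><pre><code>
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-style module: the pre-trained weight stays frozen and
    only a low-rank update (scaled by alpha / r) is trained."""
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pre-trained weights
        self.lora_a = nn.Linear(base_linear.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base_linear.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # the update starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))             # only lora_a and lora_b receive gradients
</code></pre><p>Libraries like Adapters wrap this kind of logic (and many variations of it) behind a unified interface. 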
Adapters integrates 10 diverse modular methods such as <a href="https://arxiv.org/abs/2104.08691">prompt tuning</a>, <a href="https://arxiv.org/abs/2101.00190">prefix tuning</a>, <a href="https://arxiv.org/abs/2106.04647">Compacter</a>, <a href="https://arxiv.org/abs/2106.09685">LoRA</a>, and <a href="https://proceedings.neurips.cc/paper_files/paper/2022/file/0cde695b83bd186c1fd456302888454c-Paper-Conference.pdf">(IA)&#179;</a> into 20 state-of-the-art models for NLP, vision, and multi-modal applications. <strong>It supports a range of operations on these modules such as grouping, stacking, fusing, splitting, and parallelizing, among others, which enable a variety of modeling approaches and research directions.</strong></p><p><strong><a href="about:blank">Outlier Dimensions Encode Task-Specific Knowledge (Rudman et al.).</a></strong>  This paper shows that outlier dimensions (dimensions with a variance that is significantly higher than the average) in LLMs persist during fine-tuning. They also find that just using the embedding value of such a high-variance dimension with a linear threshold can achieve performance similar to using the full model for some tasks and models. <strong>We already know that LLMs capture task knowledge in a low-dimensional subspace (see <a href="https://aclanthology.org/2021.acl-long.568/">Aghajanyan et al., 2021</a>, for instance)&#8212;but the observation that the subspace can be 1D for some settings can motivate the development of new efficient methods.</strong></p><h3>NLG Evaluation</h3><p>As LLMs are increasingly applied to generate natural language text, we need better metrics to evaluate their performance. One of the most promising directions is to use LLMs themselves as part of the metric, whether in a zero-shot setting or fine-tuned on relevant data.</p><p><strong><a href="https://arxiv.org/abs/2303.16634">G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al.).</a></strong>  This paper proposes G-Eval, a framework for NLG evaluation using LLMs as reference-free metrics.&nbsp;Given a description of the task and the evaluation criteria, they first generate a more detailed CoT-style description of the evaluation steps using an LLM. All descriptions are then concatenated with the input example and fed to the LLM. Rather than directly predicting a score for each evaluation criterion, the authors observed that they obtain better measurements if they instead take the sum of all candidate scores weighted with their probability. On summarization, the framework achieves a higher correlation with human judgements than existing metrics.</p><p><strong><a href="https://arxiv.org/abs/2305.14282">INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback (Xu et al.).</a></strong>  This paper proposes InstructScore, a fine-grained reference-based explainable metric for NLG evaluation using LLMs. To fine-tune the LLM as the metric, they first collect unlabeled sentences. They then specify the number of errors, error types, and their severity labels for each sentence and ask GPT-4 to generate an incorrect sentence containing the errors matching the criteria and an explanation for each error. LLaMa is then fine-tuned on the generated data to identify and explain the errors in the incorrect sentence compared to the reference. LLaMa is further refined using feedback from GPT-4 regarding the correctness of the generated explanations. 
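</p><p>As an aside on the G-Eval scoring trick described above: weighting each candidate score by its probability simply turns the rating into an expected value over the scale. A tiny sketch, where the probabilities are made-up placeholders standing in for the token probabilities one would read off the LLM:</p><pre><code>
# G-Eval-style scoring sketch: take the expectation over candidate scores
# instead of the single score the model happens to output.
candidate_scores = [1, 2, 3, 4, 5]
score_probs = {1: 0.02, 2: 0.08, 3: 0.25, 4: 0.45, 5: 0.20}  # placeholder values

weighted_score = sum(s * score_probs[s] for s in candidate_scores)
print(round(weighted_score, 2))  # 3.73 with these placeholder probabilities
</code></pre><p>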
In practice, InstructScore achieved similar or higher correlation with human judgements than existing metrics on translation and NLG tasks.</p><h3>Multilingual Models</h3><p>While current LLMs excel on many tasks for English, performance is still much worse on languages with limited data. We thus require models that perform well for such languages and new methods to effectively scale models to these languages.</p><p><strong><a href="https://arxiv.org/abs/2311.05640">FinGPT: Large Generative Models for a Small Language (Luukkonen et al.).</a></strong>  This paper is a comprehensive study of training LLMs for a small language (Finnish) including the collection of a diverse dataset, monolingual training at different model sizes (up to 13B parameters), adaptation of an existing multilingual language model (BLOOM) to the new language, and creation of a language-specific benchmark. The trained models outperform all previous models for Finnish while the language-adapted multilingual model outperforms the monolingual models. <strong>Overall, this is a nice blueprint of how LLMs can be trained for medium-resource languages.</strong></p><p><strong><a href="https://arxiv.org/abs/2305.14224">mmT5: Modular Multilingual Pre-Training Solves Source Language Hallucinations (Pfeiffer et al.)</a><a href="https://arxiv.org/abs/2305.11938">*&#8224;</a><a href="https://arxiv.org/abs/2305.14224">.</a></strong>  This paper proposes mmT5, the first modular multilingual generative model. The mT5-style model is pre-trained with language-specific modules and dramatically outperforms mT5 at similar parameter sizes while matching or outperforming XLM-R. <strong>Importantly, the model&#8217;s modularity enables more direct control over its outputs.</strong> While mT5 generates text in the correct language in only 7% of cases for zero-shot cross-lingual summarization, mmT5 generates text in the correct language in 99% (!) of cases.</p><p><strong><a href="https://arxiv.org/abs/2301.10472">XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models (Liang et al.).</a></strong>  This paper proposes XLM-V, an XLM-R-style model covering 100 languages that is pre-trained with a 1M vocabulary. To create the vocabulary, vocabularies of languages are first clustered (<a href="https://aclanthology.org/2020.emnlp-main.367/">Chung et al., 2020</a>), clusters are allocated capacity corresponding to their average log probability (<a href="https://aclanthology.org/2021.emnlp-main.257/">Zheng et al., 2021</a>), and sentencepiece models are trained for each cluster and then combined. While pre-training with a 1M vocabulary is 2.5x slower than with a 250k vocabulary, the resulting model outperforms a (reimplemented) XLM-R. </p><p><strong><a href="https://arxiv.org/abs/2304.08865">Romanization-based Large-scale Adaptation of Multilingual Language Models (Purkayastha et al.)</a>*&#8224;</strong><em><strong>.</strong></em>  This paper explores the potential of large-scale transliteration to enable multilingual LMs to deal with under-represented languages. In particular, the paper romanizes (i.e., maps UTF-8 characters to Latin characters) text using <a href="https://github.com/isi-nlp/uroman">uroman</a> across 14 diverse languages, which is then used to adapt multilingual LMs. 
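</p><p>The preprocessing step itself is easy to picture. The paper uses uroman; purely as an illustration of what romanization does to the input, the <code>unidecode</code> package can serve as a rough stand-in that maps arbitrary scripts to Latin-script approximations:</p><pre><code>
# Illustration only: uroman is the tool actually used in the paper; unidecode
# is a rough stand-in to show what mapping text into Latin script looks like.
from unidecode import unidecode   # pip install unidecode

print(unidecode("Москва"))   # prints "Moskva"
print(unidecode("Αθήνα"))    # a Latin-script approximation of the Greek
# The romanized text, rather than the original script, is what the
# multilingual LM is then adapted on.
</code></pre><p>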
Romanization is particularly useful in the most challenging setups: on languages with unseen scripts and with limited training data.</p><h3>Multilingual Datasets and Evaluation</h3><p>A key challenge for multilingual NLP is the lack of evaluation datasets and studies that accurately assess the performance of multilingual models. The creation of new datasets and the development of new evaluation measures and analyses is thus an important research direction.</p><p><strong><a href="https://arxiv.org/abs/2305.14235">Multilingual Large Language Models Are Not (Yet) Code-Switchers (Zhang et al.).</a></strong>  This paper evaluates LLMs on three code-switching tasks: sentiment analysis (English-{Spanish, Malayalam, Tamil}), translation (English-Hindi), and word-level language identification (English-Hindi, Standard Arabic-Egyptian Arabic). They observe that smaller fine-tuned multilingual LMs (XLM-R, mDeBERTa) still outperform zero-shot prompted LLMs on these tasks.</p><p><strong><a href="https://arxiv.org/abs/2305.11938">XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages (Ruder et al.)*&#8224;</a>.</strong>  XTREME-UP is a new benchmark focusing on user-centric tasks in under-represented languages with realistic amounts of available data. The benchmark includes impactful multi-modal tasks such as ASR and OCR, which we make accessible for text-only models by providing baseline system outputs (in addition to the original audio and image inputs). We created new data for a range of different tasks and updated standard tasks such as QA and NER to make them more practically relevant. We find that multilingual fine-tuned models still outperform few-shot prompted models on most tasks and that character-level modeling is beneficial. Overall, there is still a lot of headroom left to improve performance on under-represented languages.</p><p><strong><a href="https://arxiv.org/abs/2310.14557">The Skipped Beat: A Study of Sociopragmatic Understanding in LLMs for 64 Languages (Zhang et al.).</a></strong>  This paper introduces SPARROW, a multilingual multi-task benchmark spanning 169 datasets from different online platforms to measure <em>sociopragmatic</em> understanding in LLMs (i.e., how well they perform on tasks related to social interactions such as sentiment analysis, emotion detection, etc). They observe that the fine-tuned models outperform zero-shot prompted models as well as ChatGPT. LLMs perform particularly poorly on humor and antisocial language detection and ChatGPT performs poorly across most languages in comparison to the best model.</p><p><strong><a href="https://arxiv.org/abs/2302.08956">AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages (Muhammad et al.)*.</a></strong>  This paper introduces AfriSenti, a sentiment analysis benchmark consisting of 110k+ tweets in 14 African languages. The dataset was used in the <a href="https://afrisenti-semeval.github.io/">AfriSenti SemEval-2023 Shared Task</a>. Data collection and annotation challenges included a lack of support for African languages by the Twitter API, lack of tone markings, frequent code-mixing and dialects, sarcasm and ambiguities, and a lack of annotators and a reliable Internet connection. 
The strongest model, <a href="https://aclanthology.org/2022.coling-1.382/">AfroXLM-R</a>, achieves 67.2 accuracy across all languages, leaving ample room for improvement.</p><p><strong><a href="https://arxiv.org/abs/2310.04928">Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU (Koto et al.).</a></strong>  This paper introduces IndoMMLU, the first benchmark on Indonesian language and culture consisting of 15k questions from primary school to university entrance exams. Among the 24 evaluated models, GPT-3.5 is the only one that passes primary school exams while no LLM demonstrates familiarity with local Indonesian languages and culture. The language exams also enable assessing the level of Indonesian language proficiency. For grades 7 and above, GPT-3.5 fails to pass the exam while other models only pass grades 1&#8211;3.</p><p><strong><a href="https://arxiv.org/abs/2211.00142">TaTA: A Multilingual Table-to-Text Dataset for African Languages (Gehrmann et al.)*&#8224;.</a></strong>  This paper proposes <a href="https://github.com/google-research/url-nlp/tree/main/tata">Table-to-Text in African languages (TaTA)</a>, the first large multilingual table-to-text dataset with a focus on African languages. TaTA was created by transcribing figures and associated text in bilingual reports by the <a href="https://dhsprogram.com/">DHS Program</a>, which were then professionally translated to make the dataset fully parallel. We find that less than half of the outputs from an mT5-XXL-based model are understandable and attributable to the source data. We also observe that existing metrics perform poorly for multilingual table-to-text generation and introduce a new learned metric that achieves a high correlation with human judgements.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.ruder.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">NLP News is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><em>What papers did you find exciting at EMNLP 2023? Let me know in the comments.</em></p>]]></content:encoded></item><item><title><![CDATA[📄 NeurIPS 2023 Primer]]></title><description><![CDATA[A Round-up of 20 Exciting LLM-related Papers]]></description><link>https://newsletter.ruder.io/p/neurips-2023-primer</link><guid isPermaLink="false">https://newsletter.ruder.io/p/neurips-2023-primer</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Fri, 01 Dec 2023 15:51:54 GMT</pubDate><enclosure url="https://s3.amazonaws.com/revue/profiles/images/000/027/670/thumb/sebastian_ruder_profile_photo_square.jpg?1616408613" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://nips.cc/">NeurIPS 2023</a>, arguably this year&#8217;s biggest AI conference takes place in two weeks from Dec 10&#8211;16 in New Orleans. 
<a href="https://blog.neurips.cc/2023/11/20/neurips-newsletter-october-2023/">3586 papers</a> were accepted to the conference, which are <a href="https://openreview.net/group?id=NeurIPS.cc/2023/Conference">available online</a>.</p><p>In this newsletter, I&#8217;ll discuss a selection of 20 papers related to natural language processing (NLP) that caught my eye, with a focus on oral and spotlight papers. Here are the main trends I observed:</p><ol><li><p><strong>Most NLP work at NeurIPS is related to large language models (LLMs).</strong> While there are some papers that do not employ LLMs or use a different setting (see <a href="https://openreview.net/forum?id=ez6Cb0ZGzG">Suhr &amp; Artzi</a> below, for instance), papers still presented their contributions in the context of LLMs.</p></li><li><p><strong>Synthetic setups to analyze LLM properties are becoming more common.</strong> This is because it is computationally prohibitive to run many different pre-training experiments. Investigated properties range from the emergence of in-context learning and learning using global statistics to chain-of-thought reasoning.</p></li><li><p><strong>Aligning models based on human preferences received a lot of attention.</strong> Papers particularly focused on improving RLHF and studying alignment to specific personality traits and beliefs.</p></li><li><p><strong>A comprehensive understanding of in-context learning still remains elusive. </strong>Papers studied different aspects of in-context learning such as whether it persists during training and using a Bayesian perspective. </p></li><li><p><strong>Reasoning is still challenging with current models.</strong> Papers focused on improving performance on various types of reasoning tasks including pragmatic, graph-based, algorithmic, compositional, and planning-based reasoning.</p></li><li><p><strong>External tools are increasingly used to improve LLMs&#8217; reasoning abilities.</strong> These range from external verifiers to code execution modules. </p></li></ol><p><em>Note that some of the methods proposed in these papers such as <a href="https://github.com/eric-mitchell/direct-preference-optimization">DPO</a> and <a href="https://github.com/artidoro/qlora">QLoRA</a> have already been successfully used in LLM applications.</em></p><div><hr></div><h3>Rethinking LLMs</h3><p>This is one of my favorite topics as these papers encourage us to rethink our fundamental assumptions regarding LLMs and provide new insights and perspectives on their inner workings.</p><p><strong><a href="https://openreview.net/forum?id=NiQTy0NW1L">Lexinvariant Language Models (Huang et al.).</a>  </strong>One of the fundamental characteristics of LMs that hasn&#8217;t changed since the <a href="https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf">first neural LM paper</a> is the one-to-one mapping between tokens and embeddings. This paper studies whether language modeling can also be done with models that are &#8216;lexinvariant&#8217;, i.e., that do not have fixed token embeddings but assign the <em>lexical permutation</em> of each sequence the same probability.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> This seems like a strong limitation&#8212;but it can serve as a useful inductive bias for recovering substitution cyphers (via an MLP probe) and in-context symbol manipulation. 
In practice, tokens are encoded using random Gaussian vectors and sampled so that the same token has the same representation <em>within</em> a sequence but different representations <em>across</em> sequences. <strong>While this method is mainly of theoretical interest, using it as regularization by using random embeddings only for a subset of tokens improves results on some BIG-bench tasks.</strong></p><div><hr></div><h3>Learning from Human Feedback</h3><p>Given the proliferation of different pre-trained LLMs, researchers and practitioners are increasingly looking to improve the next step in the LLM pipeline: learning from human feedback, which is important for maximizing performance on downstream tasks but also for LLM alignment.</p><p><strong><a href="https://openreview.net/pdf?id=HPuSIXJaa9">Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al.).</a>  </strong>Reinforcement learning from human feedback (RLHF) is the preferred approach to update LLMs to align with target preferences but is quite complex (it requires first training a reward model and then updating the LLM with RL based on the reward model) and can be unstable. This paper proposes Direct Preference Optimization (DPO), which shows that the same objective can be optimized via a simple classification-based objective on the preference data&#8212;without any RL! An important component of the objective is a dynamic, per-example importance weight. <strong>DPO has the potential to make aligning LLMs with human preferences much more seamless&#8212;and will thus be important for safety research.</strong></p><p><strong><a href="https://openreview.net/forum?id=CSbGXyCswu">Fine-Grained Human Feedback Gives Better Rewards for Language Model Training (Wu et al.).</a>  </strong>This paper addresses another limitation of RLHF, which does not allow the integration of more fine-grained feedback regarding which parts of the generated response are erroneous. The paper proposes Fine-grained RLHF, which <strong>a)</strong> uses a dense reward model (a reward for every output sentence rather than for the entire output) and <strong>b)</strong> incorporates multiple reward models for diverse feedback. They experiment on detoxification and long-form question answering where they see improved results compared to RLHF and supervised fine-tuning. Importantly, as providing human preference judgements for RLHF is a complex annotation task, providing more fine-grained feedback is actually <em>not</em> more time-intensive. <strong>Expect to see more approaches experimenting with various reward models at different granularities.</strong></p><p><strong><a href="https://openreview.net/forum?id=ez6Cb0ZGzG">Continual Learning for Instruction Following from Realtime Feedback (Suhr &amp; Artzi).</a></strong>  This paper tackles continual learning from human feedback in a collaborative 3D world environment. They demonstrate a simple approach using a contextual bandit to update the model&#8217;s policy using binary rewards. Over 11 rounds of training and deployment, instruction execution accuracy improves from 66.7% to 82.1%. Empirically, in this setting, the feedback data provides a similar amount of learning signal as the supervised data. 
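</p><p>The core learning rule in this kind of setup fits in a few lines. The sketch below is a hedged, REINFORCE-style illustration with a toy policy and binary feedback, not the authors&#8217; exact implementation:</p><pre><code>
import torch

# Push up the log-probability of actions that received positive user feedback.
policy_logits = torch.zeros(3, requires_grad=True)   # toy 3-action policy
optimizer = torch.optim.SGD([policy_logits], lr=0.1)

action, reward = 1, 1.0        # user feedback: action 1 was judged correct
log_probs = torch.log_softmax(policy_logits, dim=-1)
loss = -reward * log_probs[action]

optimizer.zero_grad()
loss.backward()
optimizer.step()               # action 1 becomes more likely next time
</code></pre><p>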
<strong>While their setting differs from the standard text-based scenario, it provides a sketch for how instruction-following agents can continually learn from human feedback.</strong> In the future, we will likely see approaches that utilize more expressive feedback such as via natural language.</p><div><hr></div><h3>LLM Alignment</h3><p>In order to ensure that LLMs are most useful, it is crucial to align them with the specific guidelines, safety policies, personality traits and beliefs that are relevant for a given downstream setting. To do this, we first need to understand what tendencies LLMs already encode&#8212;and then develop methods to steer them appropriately.</p><p><strong><a href="https://openreview.net/forum?id=I9xE1Jsjfx">Evaluating and Inducing Personality in Pre-trained Language Models (Jiang et al.).</a>  </strong>This paper proposes to assess the personality of LLMs based on the <a href="https://en.wikipedia.org/wiki/Big_Five_personality_traits">Big Five personality traits</a> known from psychology. Building on existing questionnaires, they create multiple-choice question answering examples where the LLM must choose how accurately statements such as &#8220;You love to help others&#8221; describe it. Each statement is associated with a personality trait. Crucially, it is less important whether a model scores highly on a specific trait but whether it exhibits a <em>consistent</em> personality, that is, whether it responds similarly to all questions associated with the trait. Only the largest models exhibit consistent personality traits that are similar to those of humans. <strong>It will be interesting to better understand under what conditions personality traits emerge and how consistent personalities can be encoded in smaller models.</strong></p><p><strong><a href="https://openreview.net/forum?id=CbsJ53LdKc">In-Context Impersonation Reveals Large Language Models' Strengths and Biases (Salewski et al.).</a></strong>  There has been a lot of anecdotal evidence that prompting LLMs to impersonate domain experts (e.g., &#8220;you are an expert programmer&#8221;, etc) improved models&#8217; capabilities. This paper studies such in-context impersonation across different tasks and finds indeed that LLMs impersonating domain experts perform better than LLMs impersonating non-domain experts. Impersonation is also useful to detect implicit biases. For instance, LLMs impersonating a man describe cars better than ones prompted to be a woman (based on CLIP&#8217;s ability to match an image to a category using the generated description of the category). <strong>Overall, impersonation is a useful tool to analyze LLMs&#8212;but may reinforce biases when used for (system) prompts.</strong></p><p><strong><a href="https://openreview.net/forum?id=O06z2G18me">Evaluating the Moral Beliefs Encoded in LLMs (Scherer et al.).</a>  </strong>This paper studies how moral beliefs are encoded in LLMs with regard to both high-ambiguity (&#8220;Should I tell a white lie?&#8221;) and low-ambiguity scenarios (&#8220;Should I stop for a pedestrian on the road?&#8221;) scenarios. They evaluate 28 (!) different LLMs and find that <strong>a)</strong> in unambiguous scenarios most models align with commonsense while in ambiguous cases, most models express uncertainty; <strong>b)</strong> models are sensitive to the wording of the question; and <strong>c)</strong> some models exhibit clear preferences in ambiguous scenarios&#8212;and closed-source models have similar preferences. 
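</p><p>For context on how questionnaire-style evaluations like these are often administered: one common protocol is to score each answer option by the log-probability the model assigns to it and pick the highest-scoring one. The sketch below illustrates that generic recipe with a placeholder model; each of the papers above uses its own protocol.</p><pre><code>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def option_logprob(prompt, option):
    """Sum of log-probabilities of the option tokens given the prompt
    (token alignment is approximate, which is fine for a sketch)."""
    ids = tok(prompt + option, return_tensors="pt").input_ids
    prompt_len = len(tok(prompt).input_ids)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, ids.shape[1]):
        total += logprobs[0, pos - 1, ids[0, pos]].item()
    return total

prompt = "Statement: You love to help others. This describes me "
options = ["very accurately", "not at all"]
choice = max(options, key=lambda o: option_logprob(prompt, o))
</code></pre><p>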
<strong>While the evaluation relies on a heuristic mapping of output sequences to actions, <a href="https://github.com/ninodimontalcino/moralchoice">the data</a> is useful for further research of LLMs&#8217; moral beliefs.</strong></p><div><hr></div><h3>LLM Pre-training</h3><p>Pre-training is the most compute-intensive part of LLM pipelines and is thus harder to study at scale. Nevertheless, innovations such as new scaling laws improve our understanding of pre-training and inform future training runs.</p><p><strong><a href="https://openreview.net/pdf?id=j5BuTrEj35">Scaling Data-Constrained Language Models (Muennighoff et al.).</a>  </strong>As the amount of text online is limited, this paper investigates scaling laws in data-constrained regimes, in contrast to the scaling laws by <a href="https://arxiv.org/abs/2203.15556">Hoffmann et al. (2022)</a>, which focused on scaling without repeating data. The authors observe that training for up to 4 epochs on repeated data performs similarly to training on unique data. With more repetition, however, the value of additional training rapidly diminishes. In addition, augmenting the pre-training data with code meaningfully increases the pre-training data size. <strong>In sum, whenever we don&#8217;t have infinite amounts of pre-training data, we should train smaller models for more (up to 4) epochs.</strong></p><div><hr></div><h3>LLM Fine-tuning</h3><p>Fine-tuning large models with back-propagation is expensive so these papers propose methods to make fine-tuning more efficient, either using <a href="https://www.modulardeeplearning.com/">parameter-efficient fine-tuning</a> methods or without computing gradients (zeroth-order optimization). </p><p><strong><a href="https://openreview.net/pdf?id=OUIFPHEgJU">QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al.).</a></strong>  This paper proposes QLoRA, a more memory-efficient (but <a href="https://lightning.ai/pages/community/lora-insights/">slower</a>) version of LoRA that uses several optimization tricks to save memory. They train a new model, Guanaco, that is fine-tuned only on a single GPU for 24h and outperforms previous models on the Vicuna benchmark. <strong>Overall, QLoRA enables using much fewer GPU memory for fine-tuning LLMs. </strong>Concurrently, other methods such as <a href="https://github.com/johnsmith0031/alpaca_lora_4bit">4-bit LoRA quantization</a> have been developed that achieve similar results.</p><p><strong><a href="https://openreview.net/pdf?id=Vota6rFhBQ">Fine-Tuning Language Models with Just Forward Passes (Malladi et al.).</a>  </strong>This paper proposes a memory-efficient zeroth-order optimizer (MeZO) as a more memory-efficient version of a classic zeroth-order optimizer that uses differences of loss values to estimate gradients. In practice, the method achieves similar performance to fine-tuning on several tasks but requires 100x more optimization steps (while being faster at each iteration). <strong>Nevertheless, it is surprising that such zeroth-order optimization works with very large models in the first place, demonstrating the robustness of such models, and is an interesting direction for future work.</strong></p><div><hr></div><h3>Emergent Abilities and In-context Learning</h3><p>Certain abilities of LLMs such as in-context learning and arithmetic reasoning have been shown to be present only in the largest models. 
It is still unclear how these abilities are acquired during training and what specific properties lead to their emergence, motivating many studies in this area.</p><p><strong><a href="https://openreview.net/pdf?id=ITw9edRDlD">Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al.)</a>.  </strong>Emergent abilities are abilities that are present in large-scale models but not in smaller models and are hard to predict. Rather than being a product of models&#8217; scaling behavior, this paper argues that emergent abilities are mainly an artifact of the choice of metric used to evaluate them. Specifically, nonlinear and discontinuous metrics can lead to sharp and unpredictable changes in model performance. Indeed, the authors find that when accuracy is changed to a continuous metric for arithmetic tasks where emergent behavior was previously observed, performance improves smoothly instead. <strong>So while emergent abilities may still exist, they should be properly controlled and researchers should consider how the chosen metric interacts with the model.</strong></p><p><strong><a href="https://arxiv.org/abs/2311.08360">The Transient Nature of Emergent In-Context Learning in Transformers (Singh et al.).</a>   </strong><a href="https://arxiv.org/abs/2205.05055">Chan et al. (2022)</a> have previously shown that the distributional properties of language data (specifically, &#8216;burstiness&#8217; and a highly skewed distribution) play an important role in the emergence of in-context learning (ICL) in LLMs. Prior work also generally assumes that once the ICL ability has been acquired, it is retained by the model as learning progresses. This paper uses <a href="https://github.com/brendenlake/omniglot">Omniglot</a>, a synthetic image few-shot dataset to show cases where ICL emerges&#8212;only to be subsequently lost while the loss continues to decrease. On the other hand, L2 regularization seems to help the model retain its ICL ability. <strong>It is still unclear, however, how ICL emerges and if transience can be observed during LLM pre-training on real-world natural language data.</strong></p><p><strong><a href="https://openreview.net/pdf?id=rcXXNFVlEn">Why think step by step? Reasoning emerges from the locality of experience (Prystawski et al.).</a>  </strong>This paper investigates why and how chain-of-thought reasoning<a href="https://arxiv.org/abs/2201.11903"> (Wei et al., 2022)</a> is useful in LLMs using a <a href="https://github.com/benpry/why-think-step-by-step">synthetic setup</a>. Similar to <a href="https://arxiv.org/abs/2205.05055">Chan et al. (2022)</a>, they study distributional properties of the pretraining data. They find that chain-of-thought reasoning is only useful when the training data is locally structured. In other words, when examples are about closely connected topics as is common in natural language. They find that chain-of-thought reasoning is helpful because it incrementally chains local statistical dependencies that are frequently observed in training. 
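</p><p>As a quick reminder of what chain-of-thought prompting changes in practice, the only difference is in the prompt and target text itself; the example below is illustrative and not taken from the paper:</p><pre><code>
direct_prompt = (
    "Q: Ann has 3 boxes with 4 apples each. She gives away 5 apples. "
    "How many apples are left? A:"
)

cot_prompt = (
    "Q: Ann has 3 boxes with 4 apples each. She gives away 5 apples. "
    "How many apples are left?\n"
    "A: Let's think step by step. 3 boxes with 4 apples each is 12 apples. "
    "Giving away 5 leaves 12 - 5 = 7. The answer is 7."
)
# The model is primed to emit the intermediate steps (the locally connected
# statistical dependencies above) before the final answer.
</code></pre><p>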
<strong>It is still unclear, however, when chain-of-thought reasoning emerges during training and what are the properties of downstream tasks where it is most useful.</strong></p><p><strong><a href="https://arxiv.org/abs/2301.11916">Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning (Wang et al.).</a>  </strong>This paper frames in-context learning with LLMs as topic modeling where the generated tokens are conditioned on a latent topic (concept) variable, which captures format and task information. To make this computationally efficient, they use a smaller LM to learn the latent concepts via prompt tuning on the full demonstration data. They then select the examples that achieve the highest probability under the prompt-tuned model as demonstrations for in-context learning, which improves over other selection baselines. <strong>This is a further data point that shows that examples that are probable based on a latent concept of a task are useful demonstrations.</strong> This likely isn&#8217;t the full picture, however, and it will be interesting to see how this formulation relates to other data similarity and diversity measures.</p><p><strong><a href="https://openreview.net/forum?id=3X2EbBLNsk">Birth of a Transformer: A Memory Viewpoint (Bietti et al.).</a> </strong> This study investigates how LLMs learn in-context learning as well as the ability to use more general knowledge in a synthetic setup. The setup consists of sequences generated by a bigram LM where some bigrams require the local context to infer them while others require global statistics. They find that two-layer transformers (but not one-layer transformers) develop an induction head consisting of a &#8220;circuit&#8221; of two attention heads to predict in-context bigrams. They then freeze some of the layers to study the model&#8217;s training dynamics, finding that global bigrams are learned first and that the induction head learns appropriate memories in a top-down fashion. <strong>Overall, this paper sheds further light on how in-context learning can emerge in LMs.</strong></p><div><hr></div><h3>Reasoning</h3><p>Reasoning tasks that require the systematic chaining or composition of different pieces of information are one of the most important problems for current LLMs to solve. Their challenging nature and the diversity of domains where they are relevant makes them a fruitful area for research. </p><p><strong><a href="https://openreview.net/forum?id=X6dEqXIsEW">On the Planning Abilities of Large Language Models - A Critical Investigation (Valmeekam et al.).</a>  </strong>Planning and sequential decision making are important for a wide range of applications. This paper studies whether LLMs can generate simple plans for <a href="https://github.com/karthikv792/LLMs-Planning">commonsense planning tasks</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. Used on their own, only 12% of the plans generated by the best LLM are directly executable. However, when combined with an automated planning algorithm that can identify and remove errors in the LLM-generated plan, LLMs do much better than using a random or empty initial plan. LLMs plans can also be improved via prompting based on feedback from an external verifier. 
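</p><p>That verifier-in-the-loop pattern is simple to sketch. In the snippet below, <code>llm</code> and <code>verifier</code> are hypothetical callables standing in for a model API and an external plan validator; the paper&#8217;s actual pipeline differs in the details:</p><pre><code>
def refine_plan(task, llm, verifier, max_rounds=3):
    """Hedged sketch of LLM planning with an external verifier: feed the
    verifier's error message back into the prompt and ask for a revision."""
    plan = llm(f"Produce a plan for: {task}")
    for _ in range(max_rounds):
        ok, error = verifier(task, plan)
        if ok:
            return plan
        plan = llm(
            f"Task: {task}\nPrevious plan: {plan}\n"
            f"Verifier feedback: {error}\nRevise the plan."
        )
    return plan
</code></pre><p>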
On the other hand, the benefits of LLMs disappear when action names are unintelligible and cannot be easily inferred with common sense, which indicates a lack of abstract reasoning ability. <strong>Overall, while initial plans of LLMs can be useful as a starting point, LLM-based planning currently works best mainly in conjunction with external tools.</strong></p><p><strong><a href="https://openreview.net/forum?id=UDqHhbqYJV">Can Language Models Solve Graph Problems in Natural Language? (Wang et al.)</a>  </strong>The authors create <a href="https://github.com/Arthur-Heng/NLGraph">NLGraph</a>, a benchmark of graph-based reasoning problems described in natural language and evaluate LLMs on it. They find that LLMs demonstrate impressive preliminary graph reasoning abilities&#8212;37&#8211;58% above random baselines. However, in-context learning and advanced prompting strategies (chain-of-thought prompting and others) are mostly ineffective on more complex graph problems and LLMs are susceptible to spurious correlations. More specialized graph prompting strategies, however, improve results. <strong>Expect to see combinations of standard graph-based methods (such as those applied to <a href="https://paperswithcode.com/dataset/fb15k">FB15k</a>) and LLMs and research on methods scaling LLMs to larger graphs.</strong></p><p><strong><a href="https://openreview.net/forum?id=5bWW9Eop7l">The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs (Ruis et al.).</a>  </strong>This paper studies whether LLMs exhibit a particular type of pragmatic inference, <a href="https://en.wikipedia.org/wiki/Implicature#:~:text=Article%20Talk,it%20is%20not%20literally%20expressed.">implicature</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. The authors evaluate whether models can understand implicature by measuring whether they assign a higher probability to a statement that contains the correct inference compared to the incorrect one. They find that both instruction tuning and chain-of-thought prompting are important for such pragmatic understanding and that the largest model, GPT-4 reaches human-level performance. <strong>We will likely see more work on different types of pragmatic understanding as these are crucial for seamless and human-like conversations.</strong></p><p><strong><a href="https://openreview.net/forum?id=qd9qcbVAwQ">Parsel&#128013;: Algorithmic Reasoning with Language Models by Composing Decompositions (Zelikman et al.).</a></strong>  This paper introduces <a href="https://github.com/ezelikman/parsel">Parsel</a>, a framework to implement complex programs with code LLMs. An LLM is first used to generate natural language function descriptions in a simple intermediate language. For each function description, the model then generates implementation candidates. The function candidates are then tested against input-output constraints and composed to form the final program. Using GPT-4 as LLM, this framework increased performance on HumanEval from 67% to 85%. <strong>This is a great example of how LLMs can be used as a building block together with domain-specific knowledge and tools for much improved performance. 
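</strong></p><p>The testing step at the heart of such pipelines is also easy to sketch. Below, the generated function is assumed to be called <code>solution</code> and the helpers are hypothetical; Parsel&#8217;s actual implementation composes and tests whole call graphs:</p><pre><code>
def passes_tests(candidate_src, tests):
    """Sketch of Parsel-style filtering: run a generated implementation
    against input-output constraints and keep it only if all of them pass."""
    namespace = {}
    try:
        exec(candidate_src, namespace)       # define the candidate function
        fn = namespace["solution"]           # assumed function name
        return all(fn(x) == expected for x, expected in tests)
    except Exception:
        return False

tests = [((2, 3), 5), ((0, 0), 0)]
candidate = "def solution(args):\n    return args[0] + args[1]"
print(passes_tests(candidate, tests))        # True
</code></pre><p><strong>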
</strong>In addition, expect to see more work on breaking down complex decision problems into subproblems that are more easily solvable using LLMs.</p><p><strong><a href="https://openreview.net/forum?id=Fkckkr3ya8">Faith and Fate: Limits of Transformers on Compositionality (Dziri et al.).</a></strong> This paper investigates the compositional reasoning abilities of LLMs. They formulate compositional reasoning tasks as computation graphs in order to quantify their complexity. They find that full computation subgraphs appear significantly more frequently in the training data for correctly predicted test examples than for incorrectly predicted ones, indicating that models learn to match subgraphs rather than developing systematic reasoning skills. Using a scratchpad and grokking (training beyond overfitting) similarly do not improve performance on more complex problems. <strong>Overall, current LLMs still struggle with composing operations into correct reasoning paths.</strong></p><p><em>Which NeurIPS papers did you find most exciting? Let me know in the comments.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Under such a model, the phrase <code>&#8220;a big banana&#8221;</code> has the same probability as the phrase <code>&#8220;e cop cekeke&#8221;</code> as both are the same given the permutation <code>{a: e, b: c, i: o, g: p, n:k}</code>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>These tasks consist of a problem domain, an initial state, and a goal state. The problem domain consists of a set of actions, for instance, to pick up blocks.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>An example of an implicature is the exchange &#8220;Did you go to the shop today?&#8221; &#8220;It was closed&#8221; where the second utterance implies &#8220;no&#8221; as the answer to the question.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Instruction Tuning Vol. 
2]]></title><description><![CDATA[We cover the latest generation of instruction-tuning datasets that are much closer to real-world use cases&#8212;but still have their limitations!]]></description><link>https://newsletter.ruder.io/p/instruction-tuning-vol-2</link><guid isPermaLink="false">https://newsletter.ruder.io/p/instruction-tuning-vol-2</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Wed, 15 Nov 2023 10:13:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64293fa7-f3f0-4cd6-80e5-01cdad4493b4_1597x980.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last month, we covered the first generation of instruction-tuning datasets that have been mainly based on existing NLP tasks.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;8d13b0ae-fa0b-4167-8032-58cc4b456880&quot;,&quot;caption&quot;:&quot;NLP and ML have gone through several phases of how models are trained in recent years. With the arrival of pre-trained models such as BERT, fine-tuning pre-trained models for downstream tasks became the norm. The increasing capabilities of ever larger models then enabled in-context learning via prompting. Recently, instruction tuning has become the newe&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;&#129489;&#8205;&#127979; Instruction Tuning Vol. 1&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:7965403,&quot;name&quot;:&quot;Sebastian Ruder&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/17fdd4c3-a575-4fe4-b58e-d876b78bfe2f_2416x2416.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2023-10-04T08:00:22.414Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8e6cae-c3f0-4776-a355-a7d9b149b6bc_718x929.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://nlpnewsletter.substack.com/p/instruction-tuning-vol-1&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:136684903,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:44,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;NLP News&quot;,&quot;publication_logo_url&quot;:&quot;https://s3.amazonaws.com/revue/profiles/images/000/027/670/thumb/sebastian_ruder_profile_photo_square.jpg?1616408613&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>This month, we cover the latest datasets that are now much closer to real-world use cases&#8212;but still have their limitations! </p><div><hr></div><h2>Initiatives to Get Involved in AI Research</h2><p>Before we get into these, here are a few initiatives to get involved in AI research:</p><ul><li><p><strong><a href="https://aya.for.ai/">The Aya Project</a>:</strong> Help build state-of-the-art multilingual LLMs and make LLMs accessible in your language. 
<a href="https://txt.cohere.com/aya-multilingual/">Aya launched in January 2023</a> and is now in the final stage of dataset creation, so now is the time to <a href="https://sites.google.com/corp/cohere.com/aya-en/home?authuser=0">get involved</a> to make a difference!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yFm4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48f89242-521e-4e54-9cf4-5c42e9da21af_960x540.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yFm4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48f89242-521e-4e54-9cf4-5c42e9da21af_960x540.jpeg 424w, https://substackcdn.com/image/fetch/$s_!yFm4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48f89242-521e-4e54-9cf4-5c42e9da21af_960x540.jpeg 848w, https://substackcdn.com/image/fetch/$s_!yFm4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48f89242-521e-4e54-9cf4-5c42e9da21af_960x540.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!yFm4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48f89242-521e-4e54-9cf4-5c42e9da21af_960x540.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yFm4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48f89242-521e-4e54-9cf4-5c42e9da21af_960x540.jpeg" width="960" height="540" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48f89242-521e-4e54-9cf4-5c42e9da21af_960x540.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40009,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yFm4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48f89242-521e-4e54-9cf4-5c42e9da21af_960x540.jpeg 424w, https://substackcdn.com/image/fetch/$s_!yFm4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48f89242-521e-4e54-9cf4-5c42e9da21af_960x540.jpeg 848w, https://substackcdn.com/image/fetch/$s_!yFm4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48f89242-521e-4e54-9cf4-5c42e9da21af_960x540.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!yFm4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48f89242-521e-4e54-9cf4-5c42e9da21af_960x540.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong><a href="https://github.com/SEACrowd">SEACrowd</a>:</strong> Make NLP and LLM models more accessible in South-East Asian (SEA) languages by <a href="https://x.com/AlhamFikri/status/1719263648382853380?s=20">gathering and standardizing SEA datasets</a>. Contributions earn merch &#128085; and co-authorship &#128221;!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zzUt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80439b83-b147-413b-a9bc-e861f2cc6cdf_1920x1080.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zzUt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80439b83-b147-413b-a9bc-e861f2cc6cdf_1920x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zzUt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80439b83-b147-413b-a9bc-e861f2cc6cdf_1920x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zzUt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80439b83-b147-413b-a9bc-e861f2cc6cdf_1920x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!zzUt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80439b83-b147-413b-a9bc-e861f2cc6cdf_1920x1080.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zzUt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80439b83-b147-413b-a9bc-e861f2cc6cdf_1920x1080.jpeg" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80439b83-b147-413b-a9bc-e861f2cc6cdf_1920x1080.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:271310,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zzUt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80439b83-b147-413b-a9bc-e861f2cc6cdf_1920x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zzUt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80439b83-b147-413b-a9bc-e861f2cc6cdf_1920x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zzUt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80439b83-b147-413b-a9bc-e861f2cc6cdf_1920x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!zzUt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80439b83-b147-413b-a9bc-e861f2cc6cdf_1920x1080.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong><a href="https://tropical.probabilistic.ai/">Tropical ProbAI</a>:</strong> Applications are open for the first Tropical Probabilistic AI School from 29 January &#8211; 2 February 2024 in Rio de Janeiro, Brazil. <a href="https://tropical.probabilistic.ai/application/">Application deadline is November 24</a>. 
Tropical ProbAI is also <a href="https://tropical.probabilistic.ai/#sponsors">looking for sponsors</a> to help bring AI schools to the under-served Latin American community</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_fS4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94310b8c-01a4-40e5-bfe0-4a69942a284f_2314x1174.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_fS4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94310b8c-01a4-40e5-bfe0-4a69942a284f_2314x1174.png 424w, https://substackcdn.com/image/fetch/$s_!_fS4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94310b8c-01a4-40e5-bfe0-4a69942a284f_2314x1174.png 848w, https://substackcdn.com/image/fetch/$s_!_fS4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94310b8c-01a4-40e5-bfe0-4a69942a284f_2314x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!_fS4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94310b8c-01a4-40e5-bfe0-4a69942a284f_2314x1174.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_fS4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94310b8c-01a4-40e5-bfe0-4a69942a284f_2314x1174.png" width="1456" height="739" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94310b8c-01a4-40e5-bfe0-4a69942a284f_2314x1174.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:739,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3130973,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_fS4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94310b8c-01a4-40e5-bfe0-4a69942a284f_2314x1174.png 424w, https://substackcdn.com/image/fetch/$s_!_fS4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94310b8c-01a4-40e5-bfe0-4a69942a284f_2314x1174.png 848w, https://substackcdn.com/image/fetch/$s_!_fS4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94310b8c-01a4-40e5-bfe0-4a69942a284f_2314x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!_fS4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94310b8c-01a4-40e5-bfe0-4a69942a284f_2314x1174.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ul><p><em>Are there any other projects or initiatives that people should be aware of? Let me know.</em></p><div><hr></div><h2>Characteristics of Instruction-Tuning Data</h2><p>Now, let&#8217;s get on with instruction tuning. There are a few things to consider when using these recent datasets:</p><ol><li><p><strong>Data source:</strong> How was the data obtained? Most datasets have been generated using ChatGPT. They may thus inherit biases of the source model or may be noisy. Human-written examples are more expensive to obtain but are more high quality.</p></li><li><p><strong>Data quality:</strong> Was any filtering done to improve the quality of the generated data? In most cases, filtering is based on simple heuristics or a pre-trained model, which can result in noisy data. The authors of <a href="https://arxiv.org/abs/2304.07327">OpenAssistant Conversations</a> went the extra mile and obtained human-annotated data quality labels. &#128170;</p></li><li><p><strong>Domain and language coverage:</strong> Most datasets cover general QA-style use cases and are in English. However, similar methods can be used to obtain data in other domains or languages.</p></li><li><p><strong>Number of dialog turns:</strong> A <a href="https://en.wikipedia.org/wiki/Turn-taking">dialog turn</a> is an utterance by one speaker. Most datasets are single-turn, i.e., they consist of a prompt and a single response. Multi-turn data may be necessary to train a more conversational model.</p></li><li><p><strong>License terms:</strong> Data generated using OpenAI models is subject to the <a href="https://openai.com/policies/terms-of-use">OpenAI terms of use</a>, which prohibit using the data to develop competing models. 
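<p>To make the data quality point more tangible, here is a minimal sketch of the kind of heuristic filtering that is often applied to generated instruction-response pairs; the specific checks and thresholds are illustrative assumptions, not taken from any of the datasets below:</p><pre><code># Illustrative heuristic filter for generated instruction-response pairs.
raw_pairs = [
    {"instruction": "Explain overfitting in one sentence.",
     "response": "Overfitting means a model memorizes training data and fails to generalize."},
    {"instruction": "Hi", "response": "Hello!"},  # too short, will be dropped
]

def keep_example(example, seen_instructions):
    instruction = example["instruction"].strip().lower()
    response = example["response"].strip()
    long_enough = len(instruction.split()) >= 3 and len(response.split()) >= 5
    not_refusal = not response.lower().startswith(("as an ai", "i cannot"))
    is_new = instruction not in seen_instructions
    seen_instructions.add(instruction)
    return long_enough and not_refusal and is_new

seen = set()
filtered = [ex for ex in raw_pairs if keep_example(ex, seen)]
print(len(filtered))  # keeps only the first example</code></pre>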
<div><hr></div><h2>The Latest Instruction-Tuning Datasets</h2><p>Let&#8217;s now take a look at the latest generation of instruction-tuning datasets:</p><ol><li><p><strong><a href="https://github.com/tatsu-lab/stanford_alpaca#data-release">Alpaca data</a> <a href="https://crfm.stanford.edu/2023/03/13/alpaca.html">(Taori et al., March 2023)</a></strong>: 52k English instruction examples generated using OpenAI&#8217;s text-davinci-003 with self-instruct (see the previous post for a discussion).</p><p>The authors applied <a href="https://github.com/tatsu-lab/stanford_alpaca#data-generation-process">some modifications</a> to simplify the data generation pipeline and lower costs&#8212;the final data cost less than $500 to generate! </p></li><li><p><strong>Evol-Instruct <a href="https://arxiv.org/abs/2304.12244">(Xu et al., April 2023)</a></strong>: A rewritten set of 250k English instruction-response pairs based on the Alpaca data. Instructions are rewritten a) to make them more complex or b) to create a new, more specialized instruction by prompting ChatGPT. In a second step, ChatGPT is used to generate the corresponding responses. Low-quality instruction-response pairs are filtered using heuristics. 
This process is repeated three times.</p></li><li><p><strong>Vicuna ShareGPT data <a href="https://lmsys.org/blog/2023-03-30-vicuna/">(Chiang et al., March 2023)</a></strong>: 70k English conversations shared by users and scraped from <a href="https://sharegpt.com/">sharegpt.com</a>. Pre-processing involved converting HTML to markdown, filtering out low-quality samples, and splitting lengthy conversations into smaller segments. Compared to the above single-turn datasets, the ShareGPT conversations often consist of multiple turns and are thus more useful for training a model to leverage the context of the conversation. The conversations may be owned by the users, so their use is potentially problematic.</p></li><li><p><strong><a href="https://github.com/project-baize/baize-chatbot">Baize data</a> <a href="https://arxiv.org/abs/2304.01196">(Xu et al., April 2023)</a>: </strong>54k and 57k English multi-turn dialog examples (3.4 turns on average) generated with ChatGPT using questions from Quora and StackOverflow datasets respectively as seeds. ChatGPT simulates both the human and AI participants of the conversation.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> In addition, they also generated 47k dialogs in the medical domain based on <a href="https://github.com/abachaa/MedQuAD">MedQuAD</a> questions. </p><div class="captioned-image-container"><figure><figcaption class="image-caption">Multi-turn dialogue data generated using ChatGPT based on a seed from Quora <a href="https://arxiv.org/abs/2304.01196">(Xu et al., 2023)</a>.</figcaption></figure></div></li><li><p><strong><a href="https://huggingface.co/datasets/databricks/databricks-dolly-15k">databricks-dolly-15k</a> (<a href="https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm">Conover et al., April 2023</a>):</strong> 15k English instruction-following examples written by Databricks employees. 
Crucially, both instructions and answers are human-generated. This is in contrast to the other datasets above where instructions and/or answers are generated by ChatGPT. Examples cover 7 use cases: open QA, closed QA, information extraction and summarization of Wikipedia data, brainstorming, classification, and creative writing. Compared to the other datasets above, the data is released under a permissive license that also allows for commercial use (a short loading sketch follows the list below).</p><div class="captioned-image-container"><figure><figcaption class="image-caption">An Open QA example in <code>databricks-dolly-15k</code>.</figcaption></figure></div></li><li><p><strong><a href="https://huggingface.co/datasets/OpenAssistant/oasst1">OpenAssistant Conversations</a> (<a href="https://arxiv.org/abs/2304.07327">K&#246;pf et al., April 2023</a>):</strong> 11k crowd-sourced multilingual instruction-following conversations (for 52k examples, only the prompts are available). Human annotators generated messages for both the assistant and the human participant. The data differs in several aspects from the other datasets: <strong>1)</strong> it is multilingual (42.8% of examples are in English, 31.4% in Spanish, and the rest in other languages); <strong>2)</strong> annotators rated the quality of prompts and responses (460k quality ratings overall); and <strong>3)</strong> the annotators were provided with detailed guidelines, both for writing prompts and for acting as the assistant. The data uses a permissive license, which allows commercial use.</p></li><li><p><strong>LIMA data <a href="https://arxiv.org/abs/2305.11206">(Zhou et al., May 2023)</a>:</strong> 1k training and 300 test prompt&#8211;response pairs mostly sampled from StackExchange, wikiHow and the Pushshift Reddit dataset, with around 400 written by the paper authors. 
A nice observation of this study is that <strong>training on this small set of curated instruction data outperforms training on the much larger, noisier Alpaca data</strong>.</p><div class="captioned-image-container"><figure><figcaption class="image-caption">Examples from the manually authored portion of the LIMA dataset.</figcaption></figure></div></li></ol>
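<p>If you want to inspect one of these datasets yourself, a short sketch along the following lines should work, using <code>databricks-dolly-15k</code> as an example (the dataset id and the field names <code>instruction</code>, <code>context</code>, <code>response</code>, and <code>category</code> are assumptions based on the dataset card and may change):</p><pre><code>from datasets import load_dataset

# Load the instruction-tuning data from the Hugging Face Hub.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
example = dolly[0]

# Assemble a simple prompt; some use cases (e.g., closed QA) provide a context.
prompt = example["instruction"]
if example["context"]:
    prompt = example["context"] + "\n\n" + prompt
print(prompt)
print("---")
print(example["category"], "->", example["response"])</code></pre>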
<h2>Takeaways</h2><p><strong>&#9989; Quality &gt; quantity.  </strong>As <a href="https://arxiv.org/abs/2305.11206">Zhou et al. (2023)</a> observe, training on a small set of high-quality data outperforms instruction-tuning on larger, noisier data. Using more diverse prompts and quality filtering both improve performance.</p><p>&#129489;&#8205;&#127891; <strong>Imitation != mastery.</strong>  Models that are instruction-tuned on ChatGPT-generated data mimic ChatGPT&#8217;s style (and may thus fool human raters!) but not its factuality (<a href="https://arxiv.org/abs/2305.15717">Gudibande et al., May 2023</a>). They perform worse on standard benchmarks. Using stronger base models is the best way to address this. </p><p>&#127963;&#65039; <strong>The stronger the base, the better.</strong>  More powerful base models also produce stronger instruction-tuned models <a href="https://arxiv.org/abs/2306.04751">(Wang et al., June 2023)</a>.</p><p>&#129351; <strong>The combination wins.</strong>  Combining multiple instruction-tuning datasets results in the best average performance across tasks <a href="https://arxiv.org/abs/2306.04751">(Wang et al., June 2023)</a>. Dataset mixing and developing <a href="https://www.modulardeeplearning.com/">modular instruction-tuned models</a> are thus important research directions.</p><h2>Future Directions</h2><p><strong>Understanding instruction-tuning.</strong>  While we have seen a proliferation of instruction-tuning datasets, we still lack a clear understanding of what makes a good instruction and good instruction&#8211;response pairs. There is much anecdotal knowledge when it comes to creating good model prompts&#8212;but to my knowledge it is unclear how instruction&#8211;following data can be created at scale in a more principled manner.</p><p><strong>Improving data quality.</strong>  To improve model performance, we need to develop more reliable methods to identify high-quality examples and filter out undesirable ones. 
In a similar vein, it is important to develop methods that allow us to identify how a particular instance affects model behavior and alignment at test time.</p><p><strong>Evaluating instruction-tuned models.</strong>  In light of the biases of both human and automatic evaluations, there is no clear gold standard for how to evaluate instruction-tuned models. Evaluating a model on a set of tests that can be efficiently and automatically evaluated is one way to side-step this issue; see LMentry (<a href="https://aclanthology.org/2023.findings-acl.666/">Efrat et al., ACL 2023</a>), M2C (<a href="https://aclanthology.org/2023.acl-long.396/">Hlavnova &amp; Ruder, ACL 2023</a>), IFEval (<a href="https://arxiv.org/abs/2311.07911">Zhou et al., Nov 2023</a>), etc., but these are restricted to a certain set of use cases. In general, it is crucial to design evaluations with a target application in mind.</p><p><em>Are there any exciting developments or directions that I missed? Let me know.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://arxiv.org/abs/2304.01196">Xu et al. (2023)</a> refer to this setting as &#8216;self-chat&#8217;. This is a continuation of work on machine-to-machine dialogue modeling (see, e.g., <a href="https://aclanthology.org/D18-1547/">Budzianowski et al., 2018</a>) and dialogue self-play (<a href="https://aclanthology.org/N18-3006/">Shah et al., 2018</a>).</p></div></div>]]></content:encoded></item><item><title><![CDATA[🌍⏳ Do LMs Represent Space and Time?]]></title><description><![CDATA[In this post, we&#8217;ll take a closer look at the question &#8220;Do LMs represent space and time?&#8221; inspired by a recent paper.]]></description><link>https://newsletter.ruder.io/p/do-lms-represent-space-and-time</link><guid isPermaLink="false">https://newsletter.ruder.io/p/do-lms-represent-space-and-time</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Mon, 09 Oct 2023 08:25:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xaEE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe690953d-a135-40b5-95ba-0e2cc616466e_1148x499.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this post, we&#8217;ll take a closer look at the question &#8220;Do LMs represent space and time?&#8221; inspired by a recent paper. 
We&#8217;ll look at how spatial and temporal information has been encoded in LMs, what this means for practical applications, as well as other aspects such as the encoding of fine-grained spatial information and how this varies across cultures.</p><div><hr></div><h4><strong>Language Models Represent Space and Time (Gurnee &amp; Tegmark, Oct &#8216;23)</strong></h4><p>In a <a href="https://arxiv.org/abs/2310.02207">recent paper</a>, Gurnee and Tegmark show that LLMs (Llama-2 models specifically) learn linear representations of space and time. <strong>What does this mean exactly?</strong></p><p>The general setup looks like this:</p><ol><li><p>The authors process the names of places and historical figures with Llama-2. They create their own dataset, sourced from Wikipedia, for this.</p></li><li><p>They then take the hidden state of the last token of the entity (for each layer) as a representation of the entity name.</p></li><li><p>Finally, they train a linear probe (a one-layer MLP) on the representation to predict its coordinates (latitude and longitude) or year of death.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p></li></ol><p>They find that spatial and temporal information in Llama-2 can indeed be recovered with a linear probe, that larger models are better at encoding this information, and that representations in the upper layers (from the middle layer to the last) achieve the highest accuracy. In other words, <strong>models learn a representation of places that&#8212;after a linear transformation&#8212;is more or less consistent with their location on a map.</strong></p><div class="captioned-image-container"><figure><figcaption class="image-caption">Coordinates of place names of different continents predicted by a linear probe on Llama-2-70B&#8217;s layer 50. Predicted locations are close to their actual locations (R&#178;=0.92).</figcaption></figure></div>
<p>That&#8217;s a pretty cool finding and visualization. However, how surprising is it that current LLMs encode a map-like representation of places? On the whole, not very.</p><h4>Spatial Relations in word2vec</h4><p>Alexander Doria already <a href="https://x.com/Dorialexander/status/1709685322857042145?s=20">pointed out on Twitter</a> that this is not a new observation and that geographic relationships have already been encoded by much older models. The classic example is the word analogy task for word embeddings, which identifies representations in the embedding space using a simple vector offset such as<br><code>Paris - France + Italy = ? </code> where the answer (i.e., the nearest neighbor in the embedding space) is expected to be Rome in this case.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> In the embedding literature, models generally performed very well at encoding such semantic relations.</p>
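<p>With classic embeddings, this offset can be computed directly. A minimal sketch using gensim&#8217;s downloader and pre-trained GloVe vectors (the <code>glove-wiki-gigaword-100</code> model name is an assumption about what is available, and is not something used in the work discussed here):</p><pre><code>import gensim.downloader as api

# Pre-trained GloVe vectors; tokens in this model are lowercased.
vectors = api.load("glove-wiki-gigaword-100")

# paris - france + italy: the top neighbor should be close to "rome".
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))</code></pre>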
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!cEb5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe11e57c-27f1-4dcd-85b2-5a3003a3f548_1492x843.png" alt=""><figcaption class="image-caption">Skip-gram vectors of countries and corresponding capitals projected by PCA show their linear relationship <a href="https://arxiv.org/abs/1310.4546">(Mikolov et al., 2013)</a>.</figcaption></figure></div><p>These well-known relationships are between countries and their capitals. <strong>What about more fine-grained information such as the spatial coordinates investigated by Gurnee and Tegmark?</strong></p><h4>Analyzing Geographic Knowledge</h4><p>There is an existing thread of research that has focused on injecting and analyzing geographic knowledge in models. <a href="https://aclanthology.org/D18-1469">Hovy and Purschke (2018)</a> learn continuous representations of German cities using doc2vec. More recently, <a href="https://arxiv.org/abs/2203.08565">Hofmann et al. (2022)</a> adapt pre-trained BERT models in different languages with geographic knowledge by predicting geolocations on geo-labeled data, while <a href="https://arxiv.org/abs/2212.10408">Faisal and Anastasopoulos (2022)</a> probe the geographic knowledge in GPT-2, mGPT, and BLOOM.
In these studies, models do well at predicting place coordinates and produce map-like representations.</p>
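<p>For intuition, here is a minimal sketch of what such a coordinate probe looks like. It assumes you have already extracted one hidden-state vector per place name from a single layer of an LLM and stored the true coordinates alongside them; the file names and the ridge regularization strength are placeholders, not the setup of any specific paper.</p><pre><code class="language-python"># Minimal linear-probe sketch (assumed data): predict (lat, lon) from LLM hidden states.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# hidden_states: one vector per place name, e.g. taken from one layer of an LLM.
# coords: the true (latitude, longitude) for each place. Both files are assumed to exist.
hidden_states = np.load("place_hidden_states.npy")  # shape: (n_places, hidden_dim)
coords = np.load("place_coords.npy")                # shape: (n_places, 2)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, coords, test_size=0.2, random_state=0
)

# The probe is just a regularized linear map; the LLM itself is not fine-tuned.
probe = Ridge(alpha=1.0).fit(X_train, y_train)
print("R^2 on held-out places:", r2_score(y_test, probe.predict(X_test)))
</code></pre>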
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Clustered city embeddings with 3, 5, and 8 clusters <a href="https://aclanthology.org/D18-1469/">(Hovy &amp; Purschke, 2018)</a>.</figcaption></figure></div><p>There is other work that focuses specifically on the task of geolocation where even simple one-hidden layer MLPs can do well <a href="https://aclanthology.org/P17-2033/">(Rahimi et al., 2017)</a>. In light of this prior work, it is unsurprising that the latest LLMs encode spatial information. In addition, <strong>encoding spatial information does not seem to be a property that emerges only with sufficient model size</strong>. In order to know whether recent LLMs are actually more spatially aware than prior models, it is thus important to compare them to prior models and on established tasks such as user geolocation.</p><p>Overall, studies such as the one by Gurnee and Tegmark are crucial to get a better understanding of LLMs. However, rather than focusing solely on work on LLMs, these studies would benefit from being aware of and leveraging prior work as a source of baselines, evaluation datasets, and methods.</p><h4>LLMs as Geographic Information Systems</h4><p>As LLMs capture a surprising amount of geographic information, they may be useful for a range of other geography-related applications such as a <a href="https://en.wikipedia.org/wiki/Geographic_information_system">geographic information system (GIS)</a>. A GIS is a computer system that stores, checks, analyzes, and displays geographic data&#8212;something that could be emulated by an LLM. 
<a href="https://arxiv.org/abs/2305.06453">Li and Ning (May 2023)</a> show through GIS case studies how LLMs can be used, for instance, to identify the population living close to hazardous waste facilities and to map their distribution, among other applications.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I551!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed2eee3-f4eb-4eb1-81f6-623205bdfb48_1308x738.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I551!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed2eee3-f4eb-4eb1-81f6-623205bdfb48_1308x738.png 424w, https://substackcdn.com/image/fetch/$s_!I551!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed2eee3-f4eb-4eb1-81f6-623205bdfb48_1308x738.png 848w, https://substackcdn.com/image/fetch/$s_!I551!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed2eee3-f4eb-4eb1-81f6-623205bdfb48_1308x738.png 1272w, https://substackcdn.com/image/fetch/$s_!I551!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed2eee3-f4eb-4eb1-81f6-623205bdfb48_1308x738.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I551!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed2eee3-f4eb-4eb1-81f6-623205bdfb48_1308x738.png" width="1308" height="738" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aed2eee3-f4eb-4eb1-81f6-623205bdfb48_1308x738.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:738,&quot;width&quot;:1308,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:401134,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I551!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed2eee3-f4eb-4eb1-81f6-623205bdfb48_1308x738.png 424w, https://substackcdn.com/image/fetch/$s_!I551!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed2eee3-f4eb-4eb1-81f6-623205bdfb48_1308x738.png 848w, https://substackcdn.com/image/fetch/$s_!I551!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed2eee3-f4eb-4eb1-81f6-623205bdfb48_1308x738.png 1272w, https://substackcdn.com/image/fetch/$s_!I551!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faed2eee3-f4eb-4eb1-81f6-623205bdfb48_1308x738.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Results generated by ChatGPT for counting the population living near hazardous waste facilities. (a) Solution graph, (b) Python code, and (c) returned population count and generated map <a href="https://arxiv.org/abs/2305.06453">(Li &amp; Ning, 2023)</a>.</figcaption></figure></div><p>Beyond accurately encoding and reasoning with geographic data, the use of LLMs as GIS thus also requires LLMs to interface with auxiliary tools such as data readers, calculators, code execution, and visualization, which I covered in a previous newsletter:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;f8417ad1-1254-4933-96b3-82c2ff98648b&quot;,&quot;caption&quot;:&quot;Hi all, It&#8217;s great to get back to writing regularly. Writing a newsletter provides me with an opportunity to delve into topics I&#8217;m excited about. In this edition, we&#8217;ll explore tool use&#8212;arguably one of the hottest new capabilities of LLMs. 
<h4>Encoding Fine-grained Spatial Relations</h4><p>LLMs encode spatial information on a <em>macro level</em>&#8212;related to cities and places&#8212;but what about<em> fine-grained spatial relations</em> such as whether something is behind, next to, to the left of, or above something else? Are these also encoded in a consistent manner?</p><p>Prior work in this area <a href="https://arxiv.org/abs/1807.01670">(Ramalho et al., 2018)</a> generated images from their textual descriptions using a VAE to learn to encode spatial relations from natural language. Recently, <a href="https://arxiv.org/abs/2307.03678">Ji and Gao (July 2023)</a> evaluated GPT-2 and BERT on their ability to encode geometric attributes, achieving up to 73% accuracy on spatial relations. For the largest LLMs, I am only aware of case studies that show awareness of certain spatial relations, such as <a href="https://arxiv.org/abs/2303.12712">Bubeck et al. (2023)</a>, so there is potential for more work in this area.</p>
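<p>Evaluations of this kind can be set up with very little machinery. The sketch below builds a handful of made-up spatial-relation questions and scores a model against them; <code>ask_model</code> is a hypothetical placeholder for whatever interface you use to query a model, and the lenient substring matching is only one possible scoring choice.</p><pre><code class="language-python"># Tiny sketch of a spatial-relation check. "ask_model" is a placeholder for whatever
# interface you use to query a model (pipeline, API call, etc.).
examples = [
    ("The lamp is to the left of the sofa. What is to the right of the lamp?", "sofa"),
    ("The cat is under the table. What is above the cat?", "table"),
    ("The key is inside the drawer. What contains the key?", "drawer"),
]

def evaluate(ask_model):
    correct = 0
    for question, answer in examples:
        prediction = ask_model(question)
        # Count a prediction as correct if it mentions the gold object.
        if answer.lower() in prediction.lower():
            correct += 1
    return correct / len(examples)

# Example usage with a trivial stand-in "model" that always answers "the sofa":
print(evaluate(lambda q: "the sofa"))
</code></pre>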
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8qwZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd2929c-09ce-4eb2-938b-fad07584087c_1443x388.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8qwZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd2929c-09ce-4eb2-938b-fad07584087c_1443x388.png 424w, https://substackcdn.com/image/fetch/$s_!8qwZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd2929c-09ce-4eb2-938b-fad07584087c_1443x388.png 848w, https://substackcdn.com/image/fetch/$s_!8qwZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd2929c-09ce-4eb2-938b-fad07584087c_1443x388.png 1272w, https://substackcdn.com/image/fetch/$s_!8qwZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd2929c-09ce-4eb2-938b-fad07584087c_1443x388.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8qwZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd2929c-09ce-4eb2-938b-fad07584087c_1443x388.png" width="1443" height="388" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dbd2929c-09ce-4eb2-938b-fad07584087c_1443x388.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:388,&quot;width&quot;:1443,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74811,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8qwZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd2929c-09ce-4eb2-938b-fad07584087c_1443x388.png 424w, https://substackcdn.com/image/fetch/$s_!8qwZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd2929c-09ce-4eb2-938b-fad07584087c_1443x388.png 848w, https://substackcdn.com/image/fetch/$s_!8qwZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd2929c-09ce-4eb2-938b-fad07584087c_1443x388.png 1272w, https://substackcdn.com/image/fetch/$s_!8qwZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbd2929c-09ce-4eb2-938b-fad07584087c_1443x388.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">GPT-4 navigates a map interactively. Left: True map and exploration path of GPT-4. Right: The map generated by GPT-4 based on the prompt &#8220;<code>Can you draw a pyplot showing the position of the rooms/places and connect them using lines?</code>&#8221;. GPT-4 is able to track the locations and visualize them correctly <a href="https://arxiv.org/abs/2303.12712">(Bubeck et al., 2023)</a>.</figcaption></figure></div><h4>Encoding Time</h4><p>Regarding the encoding of time, it is important to look beyond synthetic tasks and to practical applications for evaluation. Given that the world we live in is constantly changing, it is critical to ensure that models reflect up-to-date information about the world. Prior work has used language modeling <a href="https://arxiv.org/abs/2102.01951">(Lazaridou et al., 2021)</a> and question answering <a href="https://aclanthology.org/2021.emnlp-main.586/">(Zhang &amp; Choi, 2021)</a> for model evaluation.</p><p>More recently, <a href="https://aclanthology.org/2023.acl-long.828/">Tan et al. (ACL 2023)</a> introduced a new temporal reasoning QA benchmark that assesses models on three levels of temporal reasoning: 1) relations between different times; 2) relations between times and events; and 3) relations between different events. Particularly time-event and event-event reasoning is still challenging even for the latest LLMs.</p><h4>Encoding Space and Time Across Cultures</h4><p>The way spatial and temporal information is expressed differs across languages and cultures. In Swahili, time is based on sunset and sunrise rather than a.m. and p.m. For example, 11.30 am in standard time is 5.30 in the morning in Swahili time. For a recent paper <a href="https://aclanthology.org/2023.acl-long.396/">(Hlavnova &amp; Ruder, ACL 2023)</a>, we evaluated LLMs on different types of reasoning across languages and found that they did much worse on languages with different time expressions such as Swahili. 
Similarly, models&#8217; understanding of time can also be evaluated based on their ability to ground time expressions, i.e., to map culture-specific expressions such as &#8220;morning&#8221; in English or &#8220;manh&#227;&#8221; in Portuguese to specific hours of the day <a href="https://aclanthology.org/2022.findings-acl.224/">(Shwartz, 2022)</a>.</p>
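<p>To make the Swahili example above concrete, here is a tiny sketch of the conversion: the hour simply shifts by six relative to the Western clock. The day-period labels below are simplified; actual usage distinguishes more periods such as alfajiri, mchana, jioni, and usiku.</p><pre><code class="language-python"># Convert a 24-hour clock time to East African (Swahili) time,
# which counts hours from 6 a.m. (sunrise) and 6 p.m. (sunset).
def to_swahili_time(hour_24, minute):
    swahili_hour = (hour_24 - 6) % 12
    if swahili_hour == 0:
        swahili_hour = 12
    # Simplified label: "asubuhi" (morning) for roughly 6 a.m. to noon, otherwise a generic label.
    period = "asubuhi (morning)" if 6 <= hour_24 < 12 else "mchana/jioni/usiku"
    return f"{swahili_hour}:{minute:02d} {period}"

print(to_swahili_time(11, 30))  # 11:30 a.m. -> "5:30 asubuhi (morning)"
</code></pre>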
<p>For spatial information, datasets such as MarVL <a href="https://aclanthology.org/2021.emnlp-main.818/">(Liu et al., 2021)</a> and Crossmodal-3600 <a href="https://aclanthology.org/2022.emnlp-main.45/">(Thapliyal et al., 2022)</a> can be used to investigate models&#8217; visual perception across cultures&#8212;but I&#8217;m not aware of any datasets that enable an analysis of cross-cultural encoding of spatial information.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Ps9m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0aa54832-618b-47f2-b1e2-3da74540a583_1257x489.png" alt=""><figcaption class="image-caption">Captions in three different languages for an image in Crossmodal-3600. Captions are created by native speakers in each language and are thus free of translation artifacts.</figcaption></figure></div><p>I hope you found this short review of space and time representations in language models interesting. Did I miss any interesting work in this space? What are your favorite observations and insights about how LLMs encode information? Let me know in the comments.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Training a linear classifier on the hidden representations of a model (without further fine-tuning) is standard methodology and has been used extensively to analyze BERT (<a href="https://arxiv.org/abs/2002.12327">Rogers et al., 2020</a>).
For more information on probing pre-trained models, check out <a href="https://people.cs.umass.edu/~miyyer/cs685/slides/probes.pdf">these slides</a> by Mohit Iyyer (Intro to NLP Spring 2023; based on slides from Tu Vu).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>While there are well-known issues with the analogy formulation (see, for instance, <a href="https://ojs.aaai.org/index.php/AAAI/article/view/17524">Garneau et al., 2021</a>), it can help illustrate relationships encoded in the embedding space.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[🧑‍🏫 Instruction Tuning Vol. 1]]></title><description><![CDATA[The most popular instruction datasets]]></description><link>https://newsletter.ruder.io/p/instruction-tuning-vol-1</link><guid isPermaLink="false">https://newsletter.ruder.io/p/instruction-tuning-vol-1</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Wed, 04 Oct 2023 08:00:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8e6cae-c3f0-4776-a355-a7d9b149b6bc_718x929.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>NLP and ML have gone through several phases of how models are trained in recent years. With the arrival of pre-trained models such as BERT, fine-tuning pre-trained models for downstream tasks became the norm. The increasing capabilities of ever larger models then enabled in-context learning via prompting. Recently, instruction tuning has become the newest method to make LLMs useful in practice.</p><p>In this edition, we will cover some of the most popular datasets for instruction tuning. The next editions will cover the latest instruction datasets and instruction-tuned models.</p><div><hr></div><h3>What is Instruction Tuning?</h3><p>The main difference between instruction tuning and standard supervised fine-tuning lies in the data that the model is trained on. 
Whereas supervised fine-tuning trains models on <strong>input examples and their corresponding outputs</strong>, instruction tuning <strong>augments input-output examples with instructions</strong>, which enables instruction-tuned models to generalize more easily to new tasks.</p>
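<p>To make the difference concrete, here is a minimal, made-up example of the two data formats; the field names and the way the instruction is serialized into a prompt are illustrative rather than those of any particular dataset.</p><pre><code class="language-python"># Illustrative (made-up) example of the difference in training data formats.

# Standard supervised fine-tuning: the model sees only input-output pairs for one task.
sft_example = {
    "input": "The movie was a complete waste of time.",
    "output": "negative",
}

# Instruction tuning: the same pair is wrapped in a natural-language instruction,
# so the model can be trained on many tasks and generalize to unseen instructions.
instruction_example = {
    "instruction": "Classify the sentiment of the following review as positive or negative.",
    "input": "The movie was a complete waste of time.",
    "output": "negative",
}

# One typical way to serialize such an example into a single training prompt:
prompt = f"{instruction_example['instruction']}\n\nInput: {instruction_example['input']}\nOutput:"
target = instruction_example["output"]
print(prompt, target)
</code></pre>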
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!pza7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4935aa8-722e-4461-9e9e-88341b5558f6_1379x446.png" alt=""><figcaption class="image-caption">Pretrain-finetuning vs prompting vs instruction tuning (<a href="https://openreview.net/forum?id=gEZrGCozdqR">Wei et al., 2022</a>).</figcaption></figure></div><p>Methods differ based on how the instruction tuning data is constructed. <a href="https://arxiv.org/abs/2308.10792">Zhang et al. (2023)</a> provide a good overview of existing instruction datasets. Existing datasets fall roughly into two main categories: a) instructions are added to existing NLP tasks; and b) data from (a) is used to condition a model to generate new instruction-input-output tuples. Let&#8217;s now look at some of the most popular instruction datasets:</p><ol><li><p><strong><a href="https://instructions.apps.allenai.org/">Natural Instructions</a> (<a href="https://aclanthology.org/2022.acl-long.244/">Mishra et al., 2022</a>)</strong>: 193k instruction-output examples sourced from 61 existing English NLP tasks. The crowd-sourcing instructions from each dataset are aligned to a common schema. <strong>Instructions are thus more structured</strong> compared to other datasets.
<strong>Outputs are relatively short</strong>, however, which makes the data less useful for generating long-form content.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!O7tL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e03bdc-0c37-45ef-8b03-114ed4d57187_1467x868.png" alt=""><figcaption class="image-caption">Two instances of Natural Instructions. The instruction schema covers multiple fields including a definition, things to avoid, and positive and negative examples.</figcaption></figure></div>
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Two instances of Natural Instructions. The instruction schema covers multiple fields including a definition, things to avoid, and positive and negative examples.</figcaption></figure></div></li><li><p><strong>Natural Instructions v2 / Super-Natural Instructions (<a href="https://aclanthology.org/2022.emnlp-main.340/">Wang et al., 2022</a>)</strong>: A crowd-sourced collection of instruction data based on existing NLP tasks and simple synthetic tasks. It includes 5M examples across 76 tasks in 55 languages. Compared to Natural Instructions, <strong>instructions are simplified</strong>; they consist of a task definition and positive and negative examples with explanations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!foDw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8e6cae-c3f0-4776-a355-a7d9b149b6bc_718x929.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!foDw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8e6cae-c3f0-4776-a355-a7d9b149b6bc_718x929.png 424w, https://substackcdn.com/image/fetch/$s_!foDw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8e6cae-c3f0-4776-a355-a7d9b149b6bc_718x929.png 848w, https://substackcdn.com/image/fetch/$s_!foDw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8e6cae-c3f0-4776-a355-a7d9b149b6bc_718x929.png 1272w, https://substackcdn.com/image/fetch/$s_!foDw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8e6cae-c3f0-4776-a355-a7d9b149b6bc_718x929.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!foDw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8e6cae-c3f0-4776-a355-a7d9b149b6bc_718x929.png" width="718" height="929" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d8e6cae-c3f0-4776-a355-a7d9b149b6bc_718x929.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:929,&quot;width&quot;:718,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:256374,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!foDw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8e6cae-c3f0-4776-a355-a7d9b149b6bc_718x929.png 424w, https://substackcdn.com/image/fetch/$s_!foDw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8e6cae-c3f0-4776-a355-a7d9b149b6bc_718x929.png 848w, https://substackcdn.com/image/fetch/$s_!foDw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8e6cae-c3f0-4776-a355-a7d9b149b6bc_718x929.png 1272w, https://substackcdn.com/image/fetch/$s_!foDw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d8e6cae-c3f0-4776-a355-a7d9b149b6bc_718x929.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An instance of Super-Natural Instructions. Instructions consist of a task definition and positive and negative examples with an explanation.</figcaption></figure></div></li><li><p><strong>Unnatural Instructions (<a href="https://aclanthology.org/2023.acl-long.806/">Honovich et al., 2023</a>)</strong>: An automatically collected instruction dataset of 240k examples where InstructGPT (text-davinci-002) is prompted with three Super-Natural Instructions examples&#8212;consisting of an instruction, input, possible output constraints&#8212;and asked to generate a new example. 
The output is generated separately by conditioning on the generated instruction, input, and constraints. The generated instructions are then further paraphrased by prompting the model. Unnatural Instructions covers a <strong>more diverse set of tasks</strong> than Super-Natural Instructions; while many examples reflect classical NLP tasks, it also includes examples of other interesting tasks.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Kz8N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb808851-1521-4293-932f-f5da0704227e_1447x902.png" alt=""><figcaption class="image-caption">Examples of eight interesting generated instructions that differ from classical NLP tasks in Unnatural Instructions.</figcaption></figure></div>
</li><li><p><strong>Self-Instruct <a href="https://aclanthology.org/2023.acl-long.754/">(Wang et al., 2023)</a></strong>: Similar to Unnatural Instructions, Self-Instruct consists of 82k examples automatically generated using InstructGPT conditioned on a set of <a href="https://github.com/yizhongw/self-instruct/blob/main/human_eval/user_oriented_instructions.jsonl">seed task examples</a> (175 tasks in total; one example per task; 8 examples are sampled for in-context learning). Self-Instruct decouples the example generation by first generating the instruction, then the input (conditioned on the instruction), and then the output. For classification tasks, the authors first generate the possible output labels and then condition the input generation on each class label to avoid biasing towards a specific label.
While the generated instructions are mostly valid, the <strong>generated outputs are often noisy</strong>.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!pJBM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff0c024-6976-40c5-8438-39a963fcaf63_1538x399.png" alt=""><figcaption class="image-caption">Examples of generated instruction-input-output tuples in Self-Instruct.</figcaption></figure></div>
fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Examples of generated instruction-input-output tuples in Self-Instruct.</figcaption></figure></div></li><li><p><strong><a href="https://huggingface.co/datasets/bigscience/P3">P3</a> (<a href="https://openreview.net/forum?id=9Vrb9D0WI4">Public Pool of Prompts; Sanh et al., 2022</a>)</strong>: A crowd-sourced collection of prompts for 177 English NLP tasks. For each dataset, about 11 different prompts are available on average, which enables studying the impact of different prompt formulations. Compared to the instructions in the above instruction datasets, <strong>P3 prompts are often shorter and less elaborate</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Efrh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F995dcc96-3ae0-4c56-9a09-0c7d642b0822_1339x451.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Efrh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F995dcc96-3ae0-4c56-9a09-0c7d642b0822_1339x451.png 424w, https://substackcdn.com/image/fetch/$s_!Efrh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F995dcc96-3ae0-4c56-9a09-0c7d642b0822_1339x451.png 848w, https://substackcdn.com/image/fetch/$s_!Efrh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F995dcc96-3ae0-4c56-9a09-0c7d642b0822_1339x451.png 1272w, https://substackcdn.com/image/fetch/$s_!Efrh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F995dcc96-3ae0-4c56-9a09-0c7d642b0822_1339x451.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Efrh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F995dcc96-3ae0-4c56-9a09-0c7d642b0822_1339x451.png" width="1339" height="451" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/995dcc96-3ae0-4c56-9a09-0c7d642b0822_1339x451.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:451,&quot;width&quot;:1339,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:130448,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Efrh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F995dcc96-3ae0-4c56-9a09-0c7d642b0822_1339x451.png 424w, https://substackcdn.com/image/fetch/$s_!Efrh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F995dcc96-3ae0-4c56-9a09-0c7d642b0822_1339x451.png 848w, https://substackcdn.com/image/fetch/$s_!Efrh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F995dcc96-3ae0-4c56-9a09-0c7d642b0822_1339x451.png 1272w, https://substackcdn.com/image/fetch/$s_!Efrh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F995dcc96-3ae0-4c56-9a09-0c7d642b0822_1339x451.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">P3 prompt templates for two existing NLP tasks. Prompt templates use fields of the raw data (e.g., <em>{Document}</em>) and template metadata (e.g., <em>{Choices[label]}</em>).</figcaption></figure></div></li><li><p><strong><a href="https://github.com/bigscience-workshop/xmtf">xP3, xP3mt</a> <a href="https://aclanthology.org/2023.acl-long.891/">(Muennighoff et al., 2023)</a></strong>: An extension of P3 including 19 multilingual datasets and 11 code datasets, with English prompts. 
They also release a machine-translated version of the data (xP3mt), which contains prompts automatically translated into 20 languages. Fine-tuning on multilingual tasks with English prompts improves performance beyond fine-tuning on English instruction data alone.</p>
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Prompt templates in P3, xP3, and xP3mt. Prompts are in English in P3 and xP3. </figcaption></figure></div></li><li><p><strong>Flan 2021 / Muffin (<a href="https://openreview.net/forum?id=gEZrGCozdqR">Wei et al., 2022</a>): </strong>Prompts for 62 English text datasets, with 10 prompt templates for each task. For classification tasks, an <code>OPTIONS</code> suffix is appended to the input in order to indicate output constraints.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B6fz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb4e28d-424d-450d-a29e-598d883d1fb2_1229x500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B6fz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb4e28d-424d-450d-a29e-598d883d1fb2_1229x500.png 424w, https://substackcdn.com/image/fetch/$s_!B6fz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb4e28d-424d-450d-a29e-598d883d1fb2_1229x500.png 848w, https://substackcdn.com/image/fetch/$s_!B6fz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb4e28d-424d-450d-a29e-598d883d1fb2_1229x500.png 1272w, https://substackcdn.com/image/fetch/$s_!B6fz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb4e28d-424d-450d-a29e-598d883d1fb2_1229x500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B6fz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb4e28d-424d-450d-a29e-598d883d1fb2_1229x500.png" width="1229" height="500" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ceb4e28d-424d-450d-a29e-598d883d1fb2_1229x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:1229,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:201003,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B6fz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb4e28d-424d-450d-a29e-598d883d1fb2_1229x500.png 424w, https://substackcdn.com/image/fetch/$s_!B6fz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb4e28d-424d-450d-a29e-598d883d1fb2_1229x500.png 848w, https://substackcdn.com/image/fetch/$s_!B6fz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb4e28d-424d-450d-a29e-598d883d1fb2_1229x500.png 1272w, https://substackcdn.com/image/fetch/$s_!B6fz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb4e28d-424d-450d-a29e-598d883d1fb2_1229x500.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Instructions, inputs, and outputs for three tasks in Flan 2021.</figcaption></figure></div></li><li><p><strong>Flan 2022 (<a href="https://arxiv.org/abs/2210.11416">Chung et al., 2022</a><a href="https://arxiv.org/abs/2301.13688">)</a></strong>: A combination of Flan 2021, P3, Super-Natural Instructions, and additional reasoning, dialog, and program synthesis datasets. 
The nine new reasoning datasets additionally include chain-of-thought (CoT; <a href="https://openreview.net/forum?id=_VjQlMeSB_J">Wei et al., 2022</a>) annotations.</p>
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Flan 2022 instructions enable fine-tuning with and without exemplars (few-shot vs zero-shot) and with and without chain-of-thought.</figcaption></figure></div></li><li><p><strong>Opt-IML Bench (<a href="https://arxiv.org/abs/2212.12017">Iyer et al., 2022</a>)</strong>: A combination of Super-Natural Instructions, P3, and Flan 2021. They additionally include dataset collections on cross-task transfer, knowledge grounding, dialogue, and a larger number of chain-of-thought datasets.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yx0G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456a516-cb35-421f-8423-bebe296da46d_1399x597.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yx0G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456a516-cb35-421f-8423-bebe296da46d_1399x597.png 424w, https://substackcdn.com/image/fetch/$s_!Yx0G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456a516-cb35-421f-8423-bebe296da46d_1399x597.png 848w, https://substackcdn.com/image/fetch/$s_!Yx0G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456a516-cb35-421f-8423-bebe296da46d_1399x597.png 1272w, https://substackcdn.com/image/fetch/$s_!Yx0G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456a516-cb35-421f-8423-bebe296da46d_1399x597.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yx0G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456a516-cb35-421f-8423-bebe296da46d_1399x597.png" width="1399" height="597" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3456a516-cb35-421f-8423-bebe296da46d_1399x597.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:597,&quot;width&quot;:1399,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:187553,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yx0G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456a516-cb35-421f-8423-bebe296da46d_1399x597.png 424w, https://substackcdn.com/image/fetch/$s_!Yx0G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456a516-cb35-421f-8423-bebe296da46d_1399x597.png 848w, https://substackcdn.com/image/fetch/$s_!Yx0G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456a516-cb35-421f-8423-bebe296da46d_1399x597.png 1272w, https://substackcdn.com/image/fetch/$s_!Yx0G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3456a516-cb35-421f-8423-bebe296da46d_1399x597.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Different prompt formulations of the COPA task in Opt-IML Bench.</figcaption></figure></div></li></ol><p><a href="https://arxiv.org/abs/2301.13688">Longpre et al. 
<p><a href="https://arxiv.org/abs/2301.13688">Longpre et al. (2023)</a> provide a nice overview of the timeline of some of the above datasets as well as some of their core attributes:</p>
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Timeline of public instruction tuning datasets (<a href="https://arxiv.org/abs/2301.13688">Longpre et al., 2023</a>).</figcaption></figure></div><div><hr></div><h3>Important Aspects of Instruction Data</h3><p><a href="https://arxiv.org/abs/2301.13688">Longpre et al. (2023)</a> and <a href="https://arxiv.org/abs/2212.12017">Iyer et al. (2022)</a> ablate several important aspects of instruction data, which we highlight in the following.</p><p><strong>Mixing few-shot settings.</strong>  Training with mixed zero-shot and few-shot prompts significantly improves performance in both settings.  </p><p><strong>Task diversity.</strong>  Large models benefit from continuously increasing the number of tasks.</p><p><strong>Data augmentation.</strong>  Augmenting the data such as by inverting inputs and outputs (e.g., turning a question answering task into a question generation task) is beneficial.</p><p><strong>Mixing weights.</strong>  When using a combination of instruction tuning datasets, appropriately tuning the mixing weights is important.</p><p>While the above datasets are mainly derived from classical NLP tasks, recent datasets such as Baize (<a href="https://arxiv.org/abs/2304.01196">Xu et al., 2023</a>), OpenAssistant Conversations (<a href="https://arxiv.org/abs/2304.07327">K&#246;pf et al., 2023</a>) and others cover a more diverse set of applications and domains. We will discuss these in the next edition. Stay tuned! &#128075;</p>]]></content:encoded></item><item><title><![CDATA[🛠 Tool-Augmented LLMs]]></title><description><![CDATA[In this edition, we&#8217;ll explore tool use&#8212;arguably one of the hottest new capabilities of LLMs. We&#8217;ll look at types of tools, benefits of tool use, recent developments, and future directions.]]></description><link>https://newsletter.ruder.io/p/tool-augmented-llms</link><guid isPermaLink="false">https://newsletter.ruder.io/p/tool-augmented-llms</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Mon, 28 Aug 2023 09:51:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695d0fa9-1845-4db0-ad45-81720e2a657e_1587x879.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi all,</p><p>It&#8217;s great to get back to writing regularly. 
Writing a newsletter provides me with an opportunity to delve into topics I&#8217;m excited about. In this edition, we&#8217;ll explore <strong>tool use</strong>&#8212;arguably one of the hottest new capabilities of LLMs. We&#8217;ll look at types of tools, benefits of tool use, recent developments, and future directions.</p><div><hr></div><p><strong>Update 24.09.23</strong>: &#127472;&#127479; This article has been <a href="https://jiho-ml.com/llm-tools-ruder/">translated into Korean by Park Ji Ho</a>. Thanks!</p><div><hr></div><h3>What is Tool Use?</h3><p>Language models are useful for a wide range of applications such as creative content generation, virtual assistants, customer support, and search. By definition, however, they are limited to producing natural language, which does not allow them to interact with the real world.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>This can be ameliorated by allowing the model to access <em>external tools</em>&#8212;by predicting special tokens or commands. A tool can take various forms: it can be <strong>a)</strong> the model itself or another neural network; <strong>b)</strong> a retrieval component such as a search engine; <strong>c)</strong> a symbolic computation or code module; or <strong>d)</strong> a module for controlling a physical robot or virtual agent, as discussed in the previous newsletter, <a href="https://nlpnewsletter.substack.com/p/generative-agents-forums-for-foundation">&#129302;&#128483; Generative Agents, &#127963; Forums for Foundation Models</a>.</p><p>More broadly, a tool can be an arbitrary API. 
Below are three examples of tools that can be useful for language modeling: <code>question answering</code>, <code>machine translation</code>, and a <code>calculator</code>. <a href="https://arxiv.org/abs/2302.07842">Mialon et al. (2023)</a> provide a great overview of this emerging topic in their survey.</p>
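<p>To make &#8220;predicting special tokens or commands&#8221; concrete, here is a minimal sketch of how inline tool calls in generated text might be detected and executed. The <code>[Tool(argument)]</code> syntax and the <code>TOOLS</code> registry are illustrative assumptions, loosely in the style of Toolformer, not an actual implementation.</p><pre><code>import re

# Hypothetical tool registry: each tool maps a string argument to a string result.
TOOLS = {
    "Calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy arithmetic only
    "QA": lambda question: "(answer from a QA system)",                # placeholder
    "MT": lambda text: "(translation from an MT system)",              # placeholder
}

CALL_PATTERN = re.compile(r"\[(\w+)\((.*?)\)\]")

def execute_tool_calls(generated_text: str) -> str:
    """Replace every [Tool(argument)] span the LM emitted with the tool's result."""
    def run(match: re.Match) -> str:
        name, argument = match.group(1), match.group(2)
        return TOOLS[name](argument) if name in TOOLS else match.group(0)
    return CALL_PATTERN.sub(run, generated_text)

# Example in the spirit of Toolformer: the LM defers the arithmetic to the calculator tool.
print(execute_tool_calls("Out of 1400 participants, 400 (or [Calculator(400/1400)]) passed the test."))
</code></pre>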
class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2E83!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d235f79-dd2d-43d2-944d-139eb060796e_855x129.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2E83!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d235f79-dd2d-43d2-944d-139eb060796e_855x129.png 424w, https://substackcdn.com/image/fetch/$s_!2E83!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d235f79-dd2d-43d2-944d-139eb060796e_855x129.png 848w, https://substackcdn.com/image/fetch/$s_!2E83!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d235f79-dd2d-43d2-944d-139eb060796e_855x129.png 1272w, https://substackcdn.com/image/fetch/$s_!2E83!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d235f79-dd2d-43d2-944d-139eb060796e_855x129.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2E83!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d235f79-dd2d-43d2-944d-139eb060796e_855x129.png" width="855" height="129" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d235f79-dd2d-43d2-944d-139eb060796e_855x129.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:129,&quot;width&quot;:855,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:35275,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2E83!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d235f79-dd2d-43d2-944d-139eb060796e_855x129.png 424w, https://substackcdn.com/image/fetch/$s_!2E83!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d235f79-dd2d-43d2-944d-139eb060796e_855x129.png 848w, https://substackcdn.com/image/fetch/$s_!2E83!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d235f79-dd2d-43d2-944d-139eb060796e_855x129.png 1272w, https://substackcdn.com/image/fetch/$s_!2E83!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d235f79-dd2d-43d2-944d-139eb060796e_855x129.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HIOf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0827a03b-b6e9-48d9-95f6-20c5ea77b1fa_834x133.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HIOf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0827a03b-b6e9-48d9-95f6-20c5ea77b1fa_834x133.png 424w, https://substackcdn.com/image/fetch/$s_!HIOf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0827a03b-b6e9-48d9-95f6-20c5ea77b1fa_834x133.png 848w, https://substackcdn.com/image/fetch/$s_!HIOf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0827a03b-b6e9-48d9-95f6-20c5ea77b1fa_834x133.png 1272w, https://substackcdn.com/image/fetch/$s_!HIOf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0827a03b-b6e9-48d9-95f6-20c5ea77b1fa_834x133.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HIOf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0827a03b-b6e9-48d9-95f6-20c5ea77b1fa_834x133.png" width="834" height="133" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0827a03b-b6e9-48d9-95f6-20c5ea77b1fa_834x133.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:133,&quot;width&quot;:834,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40807,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HIOf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0827a03b-b6e9-48d9-95f6-20c5ea77b1fa_834x133.png 424w, https://substackcdn.com/image/fetch/$s_!HIOf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0827a03b-b6e9-48d9-95f6-20c5ea77b1fa_834x133.png 848w, https://substackcdn.com/image/fetch/$s_!HIOf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0827a03b-b6e9-48d9-95f6-20c5ea77b1fa_834x133.png 1272w, https://substackcdn.com/image/fetch/$s_!HIOf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0827a03b-b6e9-48d9-95f6-20c5ea77b1fa_834x133.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Examples of different tools used by Toolformer (<a href="https://arxiv.org/abs/2302.04761">Schick et al., 2023</a>). 
From top to bottom: <code>question answering</code>, <code>machine translation</code>, <code>calculator</code>.</figcaption></figure></div><h3>Benefits of Tools</h3><p>Tools provide a practical way to address some of the limitations of current LLMs:</p><p>&#10060; LLMs are bad at math (e.g., <a href="https://arxiv.org/abs/2103.03874">Hendrycks et al., 2021</a>).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> &#9989; Calling a <code>calculator</code> may improve models&#8217; arithmetic capabilities.</p><p>&#10060; LLMs&#8217; pre-training data quickly becomes outdated. &#9989; Calling a <code>search engine</code> allows the LLM to produce up-to-date information.</p><p>&#10060; LLMs may hallucinate information. &#9989; Allowing an LLM to <code>cite</code> its sources may improve its trustworthiness.</p><p>&#10060; LLMs are black boxes. &#9989; A trace of the API calls an LLM used to obtain a prediction provides some degree of interpretability.</p><h3>How to Teach Tool Use</h3><p>Many tools are just an API call away&#8212;but how do we teach an LLM to use them? <strong>Few-shot prompting</strong> is a standard way to condition current models. However, a few-shot prompt may not provide enough supervision to enable an LLM to effectively use a tool, particularly if tools have complex arguments or multiple tools are required.</p><p>Instead of showing a few demonstrations of tool use to a model, we can provide it with <strong>tool documentation</strong>. While a demonstration showcases how a tool should be used for a specific task, documentation describes the general functionality of different tools. <a href="https://arxiv.org/abs/2308.00675">Hsieh et al. (2023)</a> find that tool documentation outperforms few-shot prompting with demonstrations on new domains.</p>
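<p>The distinction can be sketched as two ways of building the same prompt. The tool descriptions, demonstrations, and prompt format below are invented for illustration and are not the actual prompts used by Hsieh et al. (2023).</p><pre><code># Illustrative only: these strings are invented, not the prompts from Hsieh et al. (2023).
TOOL_DOC = (
    "search(query): returns the top web results for the query.\n"
    "calculator(expression): evaluates an arithmetic expression."
)

DEMONSTRATIONS = (
    "Question: What is 17 * 24?\n"
    "Plan: calculator('17 * 24')\n"
    "Question: Who won the 2022 World Cup?\n"
    "Plan: search('2022 World Cup winner')"
)

def build_prompt(question: str, use_documentation: bool) -> str:
    # Documentation-based prompting describes what each tool does in general;
    # demonstration-based prompting shows question/tool-use-plan pairs instead.
    context = TOOL_DOC if use_documentation else DEMONSTRATIONS
    return f"{context}\n\nQuestion: {question}\nPlan:"

print(build_prompt("How many seconds are there in a leap year?", use_documentation=True))
</code></pre>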
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c30527eb-4409-4a27-a8d9-f24909222573_1499x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:319185,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D_DQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30527eb-4409-4a27-a8d9-f24909222573_1499x642.png 424w, https://substackcdn.com/image/fetch/$s_!D_DQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30527eb-4409-4a27-a8d9-f24909222573_1499x642.png 848w, https://substackcdn.com/image/fetch/$s_!D_DQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30527eb-4409-4a27-a8d9-f24909222573_1499x642.png 1272w, https://substackcdn.com/image/fetch/$s_!D_DQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc30527eb-4409-4a27-a8d9-f24909222573_1499x642.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Two types of information for prompting LLMs for tool use. Few-shot prompting with demonstrations (left), i.e., &lt;input, output&gt; pairs consisting of questions and their corresponding output tool-use plan. Documentation (right) provides descriptions of tool functionality (Hsieh et al., 2023). 
</figcaption></figure></div><p><strong>Fine-tuning on data augmented with API calls</strong> seems like the preferred choice.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> As many API calls are possible for an example, the data can be filtered to only retain &#8216;correct&#8217; API calls. In practice, an LLM can be prompted in a few-shot manner, and API calls that do <em>not</em> lead to the correct final prediction are discarded. <a href="https://arxiv.org/abs/2205.12255">Parisi et al. (2022)</a>, for instance, generate sample API calls for <a href="https://research.google/pubs/pub47761/">Natural Questions</a> examples. The API calls are executed and used to produce a model response. Examples where the model produced an incorrect output are filtered out, and the model is fine-tuned on the updated dataset augmented with API calls.</p><p><a href="https://arxiv.org/abs/2302.04761">Schick et al. (2023)</a> apply a similar strategy to an unlabeled text dataset (a subset of Common Crawl). Rather than only retaining API calls that lead to correct responses, they retain calls that <em>reduce the LLM&#8217;s loss</em> over the next tokens. As annotating large unlabeled texts with calls from multiple APIs is expensive, they use heuristics that inform when each API should be selected.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p>
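<p>The two filtering criteria can be sketched as follows. The <code>lm_loss</code> and <code>answer_is_correct</code> helpers, and the margin <code>tau</code>, are assumptions standing in for a language model&#8217;s loss computation and an answer checker; this is a simplified illustration of the idea, not the authors&#8217; implementations.</p><pre><code>def lm_loss(prefix: str, continuation: str) -> float:
    """Placeholder: average negative log-likelihood the LM assigns to
    `continuation` given `prefix`. Assumed for illustration."""
    raise NotImplementedError

def answer_is_correct(model_answer: str, gold_answer: str) -> bool:
    """Placeholder answer checker for labeled data (Parisi et al.-style filtering)."""
    raise NotImplementedError

def keep_call_supervised(model_answer: str, gold_answer: str) -> bool:
    # Keep an API call only if it leads to a correct final prediction.
    return answer_is_correct(model_answer, gold_answer)

def keep_call_self_supervised(text: str, position: int, call_with_result: str,
                              tau: float = 1.0) -> bool:
    # Schick et al.-style filtering on unlabeled text: keep the API call only if
    # inserting the call and its result reduces the LM's loss over the
    # following tokens by at least a margin tau (tau is an assumed hyperparameter).
    prefix, continuation = text[:position], text[position:]
    loss_without_call = lm_loss(prefix, continuation)
    loss_with_call = lm_loss(prefix + call_with_result, continuation)
    return loss_without_call - loss_with_call >= tau
</code></pre>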
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92c5b688-7290-42e8-997b-abb4d3223603_1839x381.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:302,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:195198,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zQ4O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c5b688-7290-42e8-997b-abb4d3223603_1839x381.png 424w, https://substackcdn.com/image/fetch/$s_!zQ4O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c5b688-7290-42e8-997b-abb4d3223603_1839x381.png 848w, https://substackcdn.com/image/fetch/$s_!zQ4O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c5b688-7290-42e8-997b-abb4d3223603_1839x381.png 1272w, https://substackcdn.com/image/fetch/$s_!zQ4O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c5b688-7290-42e8-997b-abb4d3223603_1839x381.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Schick et al. (2023) augment an unlabeled text dataset with API calls by 1) sampling API calls for random positions in the text via few-shot prompting; 2) executing the API calls; 3) filtering out all API calls that do not reduce the LLM&#8217;s loss over the next tokens; and adding all remaining API calls to the text.</figcaption></figure></div><p>Models can also be trained using reinforcement learning with hard-coded reward functions or from human feedback (RLHF) although this may lead to instability issues during training.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><h3>Platforms for Tool-Augmented LLMs</h3><p>Given the versatility of current models, tool-augmented LLMs have quickly captured researchers&#8217; attention, with multiple recent papers claiming that tool use paves the way towards artificial general intelligence (AGI; <a href="https://arxiv.org/abs/2304.08244">Li et al., 2023</a>; <a href="https://arxiv.org/abs/2304.04370">Ge et al., 2023</a>). A central challenge for tool-augmented LLMs is the accessibility of APIs and models. The following platforms for tool-augmented LLMs have been proposed recently:</p><ul><li><p><strong><a href="http://taskmatrix.ai/">TaskMatrix.AI</a></strong> (<a href="https://arxiv.org/abs/2303.16434">March 2023</a>), a vision for an ecosystem that enables LLMs to seamlessly interface with millions of APIs. Their framework includes a base LLM, an API platform, and an API search engine. The authors envision that models mainly learn how to use APIs using RLHF, which may be difficult to scale to millions of APIs. 
They include a case study using ChatGPT to interface with the PowerPoint API.</p></li></ul><ul><li><p><strong>API-Bank</strong> (<a href="https://arxiv.org/abs/2304.08244">April 2023</a>), a benchmark to evaluate the tool use of LLMs in a few-shot prompting setting. In order to make tool use in the few-shot setting feasible, the model needs to produce a query for an API search engine, which returns documentation for the most relevant API.</p></li><li><p><strong><a href="https://github.com/agiresearch/OpenAGI">OpenAGI</a></strong> (<a href="https://arxiv.org/abs/2304.04370">April 2023</a>), a benchmark consisting of synthetic multi-step multi-modal datasets that require chaining calls to different domain-specific models. Models can be evaluated in zero-shot, few-shot, fine-tuning, or RL-based settings.</p></li><li><p><strong><a href="https://github.com/Gentopia-AI/Gentopia">Gentopia</a></strong> (<a href="https://arxiv.org/abs/2308.04030">August 2023</a>), a platform for creating and sharing tool-augmented agents.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LEz-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695d0fa9-1845-4db0-ad45-81720e2a657e_1587x879.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LEz-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695d0fa9-1845-4db0-ad45-81720e2a657e_1587x879.png 424w, https://substackcdn.com/image/fetch/$s_!LEz-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695d0fa9-1845-4db0-ad45-81720e2a657e_1587x879.png 848w, https://substackcdn.com/image/fetch/$s_!LEz-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695d0fa9-1845-4db0-ad45-81720e2a657e_1587x879.png 1272w, https://substackcdn.com/image/fetch/$s_!LEz-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695d0fa9-1845-4db0-ad45-81720e2a657e_1587x879.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LEz-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695d0fa9-1845-4db0-ad45-81720e2a657e_1587x879.png" width="1456" height="806" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/695d0fa9-1845-4db0-ad45-81720e2a657e_1587x879.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:806,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:535136,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LEz-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695d0fa9-1845-4db0-ad45-81720e2a657e_1587x879.png 424w, 
https://substackcdn.com/image/fetch/$s_!LEz-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695d0fa9-1845-4db0-ad45-81720e2a657e_1587x879.png 848w, https://substackcdn.com/image/fetch/$s_!LEz-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695d0fa9-1845-4db0-ad45-81720e2a657e_1587x879.png 1272w, https://substackcdn.com/image/fetch/$s_!LEz-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F695d0fa9-1845-4db0-ad45-81720e2a657e_1587x879.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Overview of TaskMatrix.AI. (1) A foundation model generates an outline of the solution based on which the API Selector selects the most relevant API. (2) The LLM generates an API call, which is executed against the API. </figcaption></figure></div><h3>Looking Back</h3><p>It is inspiring to look back to see how far the field has progressed in just a few years. There are a few trends and developments in particular that have brought us to where we are.</p><p><strong>Tool use then and now.</strong> The idea of having a model interface with auxiliary modules is not new. For instance, the Neural Programmer-Interpreter (<a href="https://arxiv.org/abs/1511.06279">Reed &amp; de Freitas, 2016</a>) required a complex neural network architecture to learn to execute different domain-specific programs; for equipping BERT with a calculator (<a href="https://arxiv.org/abs/1909.00109">Andor et al., 2019</a>), vector-based operations for a limited set of arithmetic operations were defined. What has changed is that current LLMs are much more versatile than prior models, which allows the use of arbitrary APIs.</p><p><strong>Embeddings&#8594;modules&#8594;tools.</strong> 5 years ago, we had approaches that learned to select the best combination of <em>embeddings</em> for a given task (e.g., <a href="https://aclanthology.org/D18-1176/">Kiela et al., 2018</a>). 
Last year, approaches selected new parameter-efficient <em>modules</em> for a given task (e.g., <a href="https://aclanthology.org/2022.acl-long.433/">Mao et al., 2022</a>). Now we are at a stage where models learn to select and use entire models and arbitrary <em>black-box tools</em>.</p><p><strong>Chitchat&#8594;goal-oriented dialogue.</strong> End-to-end goal or task-oriented dialogue has been a challenging task in NLP for a long time (<a href="https://arxiv.org/abs/1605.07683">Bordes et al., 2016</a>). While prior models have already queried database information based on their belief states (<a href="https://proceedings.neurips.cc/paper/2020/hash/e946209592563be0f01c844ab2170f0c-Abstract.html">Hosseini-Asl et al., 2020</a>), tool-augmented LLMs will be able to more seamlessly transition from chitchat to goal-oriented dialogue.</p><h3>The Future of Tool-Augmented LLMs</h3><p>Looking ahead, there are several challenges and directions for tool-augmented LLMs:</p><ol><li><p><strong>Making APIs accessible for model use.</strong> There are millions of APIs available that models can interact with. API platforms (see above) as well as <a href="https://openai.com/blog/chatgpt-plugins">ChatGPT Plugins</a> and others aim to centralize access to APIs, which may risk locking in users. To ensure research progress in this area, it will be key to ensure that a standard set of APIs are available openly and freely to use.</p></li><li><p><strong>API search and extensibility.</strong> The problem of finding the most relevant API is similar to finding the most appropriate skill for virtual assistants such as Alexa (<a href="https://aclanthology.org/P18-1206/">Kim et al., 2018</a>). It will be key to have a search component that reliably returns the most relevant API from a growing API pool as well as enabling LLMs to be easily extended with new tools.</p></li><li><p><strong>Learning to use tools.</strong> How to best teach an LLM to use tools remains an open problem. The approach of <a href="https://arxiv.org/abs/2302.04761">Schick et al. (2023)</a> is restricted to using a single tool at a time and requires tool-specific heuristics in order to augment a dataset efficiently. It will be important to investigate methods that can provide (multi-step) supervision and scale to 100s and 1000s of APIs.</p></li><li><p><strong>Pre-training tool-augmented LLMs.</strong> Given the diversity of APIs and their use cases, it makes sense to dedicate larger budgets to training tool-augmented LLMs. While pre-trained models can be fine-tuned for tool use, pre-training a tool-augmented LLM allows the model to off-load certain behavior early in training and focus on learning what is not captured by the tools. </p></li><li><p><strong>Improving reasoning and problem decomposition.</strong> Reasoning and tool use are closely intertwined (<a href="https://arxiv.org/abs/2302.07842">Mialon et al., 2023</a>). In order to call the right APIs for a problem, it needs to be decomposed into potentially simpler subtasks. How to best decompose open-ended problems is an open challenge.</p></li><li><p><strong>Compensating for API errors and preventing error cascades.</strong> API calls to other models or tools such as search engines may produce erroneous results, which can lead to downstream failures. 
LLMs should learn to assess the reliability of APIs and recover from API failures.</p></li><li><p><strong>Gaining a better understanding of tool use.</strong> Many aspects of how models learn to use and interface with tools are poorly understood. For instance, it is unclear to what extent models use predicted reasoning steps to support the final prediction (<a href="https://arxiv.org/abs/2212.08286">Yu et al., 2022</a>). It is thus important to develop analysis methods and diagnostic tools together with new tool-augmented LLMs.</p></li></ol><p>Overall, tool use allows us to address some of the current models&#8217; limitations and has the potential to make them more capable <em>and</em> more interpretable at the same time. I&#8217;m excited to see what future progress in this area will look like. </p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The lack of grounding in the real world, which tool use helps address, has been highlighted as a limitation of LLMs in the past (<a href="https://aclanthology.org/2020.acl-main.463/">Bender &amp; Koller, 2020</a>).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>While recent models have much improved mathematical capabilities, they are not yet able to solve graduate-level math problems (<a href="https://arxiv.org/abs/2301.13867">Frieder et al., 2023</a>). They are, however, useful as assistants to mathematicians and to guide human intuition (<a href="https://www.nature.com/articles/s41586-021-04086-x">Davies et al., 2021</a>).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>We can also refer to this as &#8216;<a href="https://www.ruder.io/recent-advances-lm-fine-tuning/#behavioural-fine-tuning">behavioral fine-tuning</a>&#8217; as we aim to teach the model something about its intended target behavior.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>For instance, texts should only be considered for the <code>calculator</code> tool if they contain at least three numbers.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>See <a href="https://arxiv.org/abs/2302.07842">Mialon et al. 
(2023)</a> for an overview of this area.</p></div></div>]]></content:encoded></item><item><title><![CDATA[🤖🗣 Generative Agents, 🏛 Forums for Foundation Models]]></title><description><![CDATA[This newsletter discusses components for building generative agents and publication venues for large language models (LLMs).]]></description><link>https://newsletter.ruder.io/p/generative-agents-forums-for-foundation</link><guid isPermaLink="false">https://newsletter.ruder.io/p/generative-agents-forums-for-foundation</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Mon, 21 Aug 2023 08:44:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TtNQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2c7c4c-72eb-48b2-8245-99c018d858c3_1524x815.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi all,</p><p>This newsletter discusses components and implications for building generative agents and publication norms and venues for large language models (LLMs). </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.ruder.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">NLP News is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>&#129302;&#128483; Generative Agents</h2><p>Throughout the history of AI, people have been fascinated with the prospect of interacting with AI agents. Traditionally, interactions have been restricted to short time horizons and individual conversations such as with <a href="https://en.wikipedia.org/wiki/ELIZA">ELIZA</a> or the <a href="https://en.wikipedia.org/wiki/Turing_test">Turing test</a>. The rise of large language models (LLMs) has made it easier to develop persona-based bots (with <a href="https://www.howtogeek.com/881659/how-to-create-chatgpt-personas-for-every-occasion/">ChatGPT</a> or <a href="https://beta.character.ai/">Character.ai</a>, for instance) that have a consistent personality but these bots similarly adhere to the persona one conversation at a time.</p><p>At the same time, we have already seen glimpses of LLM-powered world-building in applications such as <a href="https://aidungeon.io/">AI dungeon</a>. While it is possible to interact with non-player characters and events in text adventures, characters only exist in relation to the player. <strong>What does it take then to generate a world where agents pursue their own goals and objectives regardless of player input?</strong> It turns out, LLMs (+ a game engine) are all you need&#8212;at least for the small town sandbox created by Park et al. 
in <a href="https://arxiv.org/abs/2304.03442">Generative Agents: Interactive Simulacra of Human Behavior</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TtNQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2c7c4c-72eb-48b2-8245-99c018d858c3_1524x815.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TtNQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2c7c4c-72eb-48b2-8245-99c018d858c3_1524x815.png 424w, https://substackcdn.com/image/fetch/$s_!TtNQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2c7c4c-72eb-48b2-8245-99c018d858c3_1524x815.png 848w, https://substackcdn.com/image/fetch/$s_!TtNQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2c7c4c-72eb-48b2-8245-99c018d858c3_1524x815.png 1272w, https://substackcdn.com/image/fetch/$s_!TtNQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2c7c4c-72eb-48b2-8245-99c018d858c3_1524x815.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TtNQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2c7c4c-72eb-48b2-8245-99c018d858c3_1524x815.png" width="1456" height="779" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f2c7c4c-72eb-48b2-8245-99c018d858c3_1524x815.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:779,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1364421,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TtNQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2c7c4c-72eb-48b2-8245-99c018d858c3_1524x815.png 424w, https://substackcdn.com/image/fetch/$s_!TtNQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2c7c4c-72eb-48b2-8245-99c018d858c3_1524x815.png 848w, https://substackcdn.com/image/fetch/$s_!TtNQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2c7c4c-72eb-48b2-8245-99c018d858c3_1524x815.png 1272w, https://substackcdn.com/image/fetch/$s_!TtNQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2c7c4c-72eb-48b2-8245-99c018d858c3_1524x815.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The Smallville town sandbox where LLM agents go about their day and interact with each other. An LLM is used to convert an action description to an emoji sequence that is displayed to the user (Park et al., 2023).</figcaption></figure></div><p>In the paper, Park et al. simulate a population of 25 LLM agents in a The Sims-like sandbox environment. Each agent consists of a one-paragraph natural language description depicting its identity and multiple components, each enabled by a LLM (ChatGPT)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>:</p><p><strong>Memory and retrieval</strong>. A memory stream records events perceived by the agent (with a natural language description, event timestamp, and most recent access timestamp). A retriever retrieves a set of memory objects using a query description based on three factors:</p><ol><li><p><em>recency</em> (based on exponential decay since the memory was last retrieved);</p></li><li><p><em>importance</em> (the LLM assigns an importance score based on the memory&#8217;s description);</p></li><li><p>and <em>relevance</em> (based on embedding similarity with the query).</p></li></ol><p>It is nice to see the return of explicit memory mechanisms, similar in spirit to the key-value memory of <a href="https://arxiv.org/abs/1410.5401">Neural Turing Machines</a> and later work. An advantage of this type of memory is interpretability: The contents of the memory of each agent are human-readable and can be inspected at every point in time.</p><p><strong>Reflection.</strong> Beyond simply remembering past events, an agent should be able to reflects on its experiences and generate higher-level thoughts. To this end, the LLM is queried using the 100 most recent memory records and asked to suggest 3 high-level questions to ask about the subjects in the memory records. For each question, memory records are then retrieved and the model is asked to generate high-level insights based on the retrieved records. 
<p><strong>Reflection.</strong> Beyond simply remembering past events, an agent should be able to reflect on its experiences and generate higher-level thoughts. To this end, the LLM is queried with the 100 most recent memory records and asked to suggest three high-level questions about the subjects that appear in those records. For each question, memory records are then retrieved and the model is asked to generate high-level insights based on the retrieved records. The generated insights are then added to the memory, enabling the agent to recursively generate higher-level observations about prior reflections.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R-4x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19dd1402-06c1-4393-bc3c-204a3be0b773_1539x772.png"><img src="https://substackcdn.com/image/fetch/$s_!R-4x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19dd1402-06c1-4393-bc3c-204a3be0b773_1539x772.png" width="1456" height="730" alt=""></a><figcaption class="image-caption">A reflection tree for Klaus Mueller. Reflections such as &#8220;Klaus Mueller is dedicated to research&#8221; are generated based on observations from the memory and can be generated recursively based on prior reflections.</figcaption></figure></div>
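<p>The reflection loop itself fits in a few lines of pseudocode. The prompts below are paraphrased, and the <code>llm</code>, <code>retrieve</code>, and <code>embed</code> callables are the same hypothetical interfaces as in the retrieval sketch above, not the paper&#8217;s actual code:</p><pre><code>def reflect(llm, retrieve, embed, memories, now, n_questions=3):
    recent = sorted(memories, key=lambda m: m["created"])[-100:]
    notes = "\n".join(m["description"] for m in recent)

    # 1) Ask for salient high-level questions about the recent records.
    questions = llm(
        f"Given the following observations:\n{notes}\n"
        f"What are the {n_questions} most salient high-level questions "
        "we can ask about the subjects mentioned above?"
    ).splitlines()[:n_questions]

    # 2) For each question, retrieve supporting memories and distill an insight.
    for question in questions:
        evidence = retrieve(memories, embed(question), now, k=15)
        statements = "\n".join(m["description"] for m in evidence)
        insight = llm(
            f"Statements:\n{statements}\n"
            "What high-level insight can you infer from the statements above?"
        )
        # 3) Insights are stored as ordinary memories, so later reflections can
        #    build on earlier ones recursively. (Importance would itself be
        #    scored by the LLM; a fixed value is used here for brevity.)
        memories.append({"description": insight, "created": now,
                         "last_accessed": now, "importance": 8,
                         "embedding": embed(insight)})
</code></pre>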
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A reflection tree for Klaus Mueller. Reflections such as &#8220;Klaus Mueller is dedicated to research&#8221; are generated based on observations from the memory and can be generated recursively based on prior reflections.</figcaption></figure></div><p><strong>Planning and reacting.</strong> To plan what an agent does each day, the authors generate a plan for each day recursively by first generating a rough sketch based on the agent&#8217;s description and a summary of their previous day. The LLM decomposes the plan then first into hour-long and again into 5&#8211;15 minute chunks.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> During the day, the LLM is prompted with the perceived observations to decide if and how it should react to a situation. If a reaction is to engage in dialogue with another agent, then the LLMs generate the utterances conditioned on their memories about each other and the previous dialogue history.</p><p>LLMs are truly ubiquitous in this work: They are used for retrieval, to assign importance to memories, to suggest what to reflect on, to generate insights, to convert actions to emojis, to generate plans and dialogue, etc. Even the game environment itself is powered by LLMs: In order to determine the appropriate location for each action, the LLM is queried recursively. When an agent executes an action on an object, the LLM is asked about the state of the object. If anything, this demonstrates the versatility of current models.</p><p>It is sobering that as a field, we are at a point where many elements of a complex ML agent stack can be replaced with an LLM and an appropriately worded prompt. Most of the above use cases required training a specialized system not long ago, from producing descriptions and explanations of video game actions (<a href="https://arxiv.org/abs/1901.03729">Ehsan et al., 2019</a>) to mapping from text to emoji (<a href="https://aclanthology.org/W16-6208/">Eisner et al., 2016</a>). Of course, it will be useful to specialize the model to improve performance for specific use cases such as retrieval or dialogue. 
While the authors ablate each of the above three components via a human evaluation, I am missing a more fine-grained analysis that highlights for which use cases the LLM is most brittle, which could inform future investigations.</p><p>It is unsurprising that the project in its current form costs thousands of dollars to run with just 25 agents. With smaller open-source models, the costs should drop dramatically. The code for the framework in the paper can be found <a href="https://github.com/joonspk-research/generative_agents">here</a>. Inspired by the paper, researchers from Andreessen Horowitz also made available <a href="https://github.com/a16z-infra/AI-town">AI Town</a>, an MIT-licensed starter kit for building and customizing your own virtual town populated by AI characters.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LRRi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edef242-a007-4eec-aad6-0086586676ea_2908x1862.png"><img src="https://substackcdn.com/image/fetch/$s_!LRRi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7edef242-a007-4eec-aad6-0086586676ea_2908x1862.png" width="1456" height="932" alt=""></a><figcaption class="image-caption">AI Town, an open-source framework for building virtual towns populated by AI agents.</figcaption></figure></div>
<p>The paper acts as a blueprint for what a framework of interacting AI agents may look like. It is worth pointing out that current models have a range of biases and may exhibit other forms of undesirable behavior that require further study. Nevertheless, there are compelling research directions in the area of multi-agent LLM systems:</p><ol><li><p><strong>Emergent communication.</strong> The emergence of communicative behavior between AI agents has mainly been studied using simple reference games (<a href="https://openreview.net/forum?id=AUGBfDIV9rL">Chaabouni et al., 2022</a>). What can richer simulated environments tell us about how agents learn to communicate? Can we simulate emergent communication in other settings such as across different languages?</p></li><li><p><strong>Nature of communication.</strong> Can we use these environments to study aspects of how humans communicate such as <a href="http://www.ello.uos.de/field.php/Sociolinguistics/Linguisticaccommodation">speaker accommodation</a>, <a href="https://www.sciencedirect.com/science/article/abs/pii/0378216694900132">register change</a>, and <a href="https://en.wikipedia.org/wiki/Code-switching">code-switching</a>? 
</p></li><li><p><strong>Nature of human behavior and social interaction.</strong> Can we study factors involved in other types of higher-level behavior such as planning, organization, co-operation, collaboration, and deception?</p></li><li><p><strong>Multi-modal environments.</strong> Can we extend environments to incorporate audio and visual inputs and assess how these additional modalities affect the agents&#8217; behavior?</p></li><li><p><strong>The expressiveness of simulations.</strong> What kind of methods are necessary to simulate even richer interactions? What model scale and capabilities would be necessary to simulate agents and events at the scale of our world?</p></li></ol><h2>&#127963; Forums for Foundation Models</h2><p>There was some <a href="https://twitter.com/denny_zhou/status/1689366382566326272">discussion on Twitter</a> recently about initiating a new conference dedicated specifically to LLMs. LLMs are an interesting technology because they are at the intersection of many different areas of computer science and society including NLP and ML, human-computer interaction, ethics, law, government, education, etc.</p><p>So far, work on LLMs has been published in a range of different venues. Much of the foundational work on LLMs (for example, <a href="https://aclanthology.org/N18-1202/">ELMo</a>, <a href="https://aclanthology.org/P18-1031/">ULMFiT</a>, <a href="https://aclanthology.org/N19-1423/">BERT</a>) has been published in NLP (*ACL) conferences, which remain the most topically relevant venue.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> The NLP community has studied topics, which are of increasing importance for LLMs, such as automatic evaluation of generative models (<a href="https://arxiv.org/abs/2202.06935">Gehrmann et al., 2022</a>) for decades. Recently, <a href="https://twitter.com/xwang_lk/status/1689709900312756225?s=20">researchers</a> have taken issue with the anonymity period<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> but there are <a href="https://docs.google.com/forms/d/e/1FAIpQLSdCYqojzzW-vtjuOryybdziF0ywPSYBJOZYRGT6WWvGwQH7CA/viewform">initiatives</a> to rethink it.</p><p>LLMs have also increasingly become more popular at ML venues. <a href="https://nips.cc/virtual/2020/oral/18545">GPT-3</a> won an outstanding paper award at NeurIPS 2020 while NeurIPS 2022 featured outstanding papers on <a href="https://nips.cc/virtual/2022/poster/53031">scaling laws</a> and <a href="https://nips.cc/virtual/2022/poster/55659">large-scale image&#8211;text datasets</a>. There are also a range of dedicated smaller venues such as the <a href="https://sites.google.com/mila.quebec/scaling-laws-workshop?pli=1">Neural Scaling Laws workshop series</a>. There is even a new conference on <a href="https://www.iaria.org/conferences2024/ComGPTMB24.html">Generative Pre-trained Transformer Models and Beyond</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>The above venues mainly cater to a computer science audience. 
As LLMs are increasingly used in user-facing applications, work on LLMs is also published in venues associated with other areas such as the <a href="https://arxiv.org/abs/2304.03442">ACM Symposium on User Interface Software and Technology</a> for <a href="https://arxiv.org/abs/2304.03442">the paper</a> in the previous section. Given the societal impact of LLMs, it is thus important for any venue to enable an inter-disciplinary dialogue that ensures LLMs are developed in a safe, responsible, and user-appropriate manner.</p><p>Besides <em>where</em> to publish, the other issue is <em>who</em> can publish. At the moment, much of the work on LLMs is conducted by labs with large compute budgets. For the previous generation of pre-trained models, an entire array of papers from a diverse range of institutions focused on gaining a better understanding of them.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> Such work is needed for the latest generation of large models but is currently prohibitive for many labs. With the release of new powerful open-source models such as <a href="https://arxiv.org/abs/2307.09288">Llama-2</a> combined with better methods for <a href="https://arxiv.org/abs/2208.07339">compute-efficient inference</a>, we are heading in a promising direction, however.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The broader family of LLMs that can learn to act and use auxiliary modules is known as <a href="https://arxiv.org/abs/2302.07842">augmented language models</a>. We will discuss these more in-depth in the next edition.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>We have also recently seen plans in other areas such as summarization where question-answer pairs have been used as intermediate representation for conditional generation (<a href="https://arxiv.org/abs/2207.00397">Narayan et al., 2022</a>), for instance.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>ACL 2023 had a <a href="https://2023.aclweb.org/calls/main_conference/">track dedicated to LLMs</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p><a href="https://www.aclweb.org/adminwiki/index.php/ACL_Policies_for_Submission,_Review_and_Citation">Submitted papers are not allowed</a> to be posted, updated, or discussed online from 1 month before the submission deadline until the time of notification (around 4 months in total). 
</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>However, a focus on a specific architecture (and a name associated with a specific model family) may be potentially restrictive in the future.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>This area of study has been commonly referred to as <a href="https://aclanthology.org/2020.tacl-1.54/">BERTology</a>.</p></div></div>]]></content:encoded></item><item><title><![CDATA[✨ Flashier Attention, 🤐 Gzip Classifiers]]></title><description><![CDATA[This newsletter discusses attention that enables modeling long sequences and simple but surprisingly competitive classifiers such as based on gzip compression.]]></description><link>https://newsletter.ruder.io/p/flashier-attention-gzip-classifiers</link><guid isPermaLink="false">https://newsletter.ruder.io/p/flashier-attention-gzip-classifiers</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Mon, 14 Aug 2023 08:07:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde66d733-eaa7-4d76-aea0-1954664ee3c5_2126x1236.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi all,</p><p>This newsletter has been on hiatus for some time due to some personal developments (I became a father recently &#128118;). So besides tackling an avalanche of dirty nappies, I&#8217;ll also wade again through the sea of arXiv research papers.</p><p>As before, in each edition, I will analzye a couple of topics (each consisting of one or multiple papers) by contextualizing them and reflecting on what they entail for the future of NLP and large language models (LLMs).</p><p><em>Note: If you enjoy synthesizing information and connecting your work to the broader research manifold, then you might also be interested in <a href="https://www.bigpictureworkshop.com/">The Big Picture Workshop</a>, which I&#8217;m co-organizing at EMNLP 2023 (submission deadline: September 1). </em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.ruder.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading NLP News! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>&#10024; Flashier Attention: Towards long-range models via hardware optimization</h2><p>The current generation of large language models is based on the Transformer architecture. One of the limitations of Transformers is that the time and memory complexity of self-attention, one of its core building blocks, is quadratic in the sequence length. 
<strong>This makes scaling to very long sequences such as books, many documents, or rich interaction data prohibitive.</strong></p><p>Numerous efficient attention methods have been proposed over the years (see <a href="https://arxiv.org/abs/2009.06732">this survey</a> for an overview). Most of these methods seek to reduce the number of operations by approximating the attention algorithm, for instance, by sparsifying the attention matrix. However, the <em>theoretical</em> efficiency gains of these methods rarely translate into practical speed-ups, as sparse-matrix operations are still poorly supported by current accelerators.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V8vg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e9cf474-c35c-4240-bbd9-a74d2ed16829_1746x1290.png"><img src="https://substackcdn.com/image/fetch/$s_!V8vg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e9cf474-c35c-4240-bbd9-a74d2ed16829_1746x1290.png" width="728" height="538" alt=""></a><figcaption class="image-caption">A taxonomy of efficient Transformer architectures (Tay et al., 2022).</figcaption></figure></div>
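<p>To see where the quadratic cost comes from, here is a deliberately naive self-attention in NumPy: the full matrix of pairwise query&#8211;key scores is materialized, so compute and memory grow quadratically with the sequence length. This is purely illustrative and not how optimized kernels are implemented:</p><pre><code>import numpy as np

def naive_attention(q, k, v):
    """q, k, v: arrays of shape (n, d); returns an (n, d) array."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                    # (n, n) matrix: the bottleneck
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ v

# Doubling n quadruples the size of `scores`: at n = 4096 this is already a
# 4096 x 4096 matrix per attention head, before even considering the backward pass.
</code></pre>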
<p>FlashAttention (<a href="https://arxiv.org/abs/2205.14135">arXiv May &#8216;22</a>) is a method that takes into account the <em>hardware</em> constraints of attention and reduces the number of times the slow GPU memory needs to be accessed during the attention computation. Specifically, it <strong>a)</strong> computes the softmax incrementally and <strong>b)</strong> stores the softmax normalization factor for faster recomputation in the backward pass. For a more detailed explanation, check out <a href="https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad">this article</a> by Aleksa Gordi&#263;.</p>
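<p>The incremental (&#8216;online&#8217;) softmax is the key idea: attention can be accumulated block by block while carrying only a running maximum and a running normalization factor, so the full score matrix never needs to be materialized. Here is a single-query NumPy sketch of the recurrence, ignoring tiling over queries, the GPU memory hierarchy, and the backward pass:</p><pre><code>import numpy as np

def online_softmax_attention(q, k, v, block=128):
    """One query q of shape (d,); keys and values k, v of shape (n, d)."""
    d = q.shape[0]
    m = -np.inf              # running maximum of the scores seen so far
    normalizer = 0.0         # running softmax normalization factor
    acc = np.zeros(d)        # running (unnormalized) weighted sum of values

    for start in range(0, k.shape[0], block):
        s = k[start:start + block] @ q / np.sqrt(d)   # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)     # rescale the statistics accumulated so far
        p = np.exp(s - m_new)         # unnormalized probabilities for this block
        normalizer = normalizer * scale + p.sum()
        acc = acc * scale + p @ v[start:start + block]
        m = m_new
    return acc / normalizer   # same result as the quadratic version, up to float error
</code></pre><p>Roughly speaking, FlashAttention applies this recurrence tile by tile in fast on-chip memory for all queries at once, and keeps the running statistics around so that the backward pass can recompute the attention weights cheaply instead of storing the full matrix.</p>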
<p>FlashAttention achieves speed-ups of 2&#8211;3x for different Transformer models and has enabled scaling Transformers to longer sequence lengths. It is the first method that achieved non-random performance on the challenging Path-X task of <a href="https://github.com/google-research/long-range-arena">Long-Range Arena</a>, a benchmark for assessing efficient Transformers. It is now available as a plug-in replacement for attention in <a href="https://github.com/Dao-AILab/flash-attention/blob/main/usage.md">many ML frameworks</a>.</p>
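<p>In practice, using such a kernel is often a one-line change. In PyTorch 2.x, for instance, <code>torch.nn.functional.scaled_dot_product_attention</code> can dispatch to a fused FlashAttention-style backend when hardware and dtypes allow it; the snippet below is a generic sketch rather than code from any specific framework:</p><pre><code>import torch
import torch.nn.functional as F

def attention(q, k, v, causal=True):
    # q, k, v: (batch, heads, seq_len, head_dim). PyTorch dispatches to a fused,
    # memory-efficient backend (FlashAttention-style) when hardware and dtypes
    # allow it, and falls back to the standard implementation otherwise.
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
q = k = v = torch.randn(1, 8, 4096, 64, device=device, dtype=dtype)
out = attention(q, k, v)   # no full 4096 x 4096 score matrix on the fused path
</code></pre>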
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A positive and negative example of Long-Range Arena&#8217;s Pathfinder task, which requires deciding whether two points are connected by a path of dashes. Path-X scales Pathfinder to sequences of length 16k (i.e., 128x128 images), a sequence length that was until recently prohibitive for existing approaches. </figcaption></figure></div><p>Recently, Tri Dao, the first author of FlashAttention, proposed FlashAttention-2 (<a href="https://arxiv.org/abs/2307.08691">arXiv July &#8216;23</a>), which offers additional hardware optimizations by <strong>a)</strong> reducing the number of FLOPs; <strong>b)</strong> parallelizing not only over batch size and number of heads but also over sequence length; and <strong>c)</strong> further partitioning operations within each thread.</p><p>The second optimization in particular enables scaling to very long sequence lengths (as Trio Dao explains in <a href="https://www.adept.ai/blog/flashier-attention">this blog post</a>). Methods like FlashAttention-2 that make modeling of long sequences feasible have several implications for LLMs and efficient Transformers:</p><ol><li><p><strong>Models will be able to use much longer inputs. </strong>This lifts restrictions on users and developers. For instance, developers no longer need to artificially split or truncate long documents or come up with elaborate ways to reconcile model outputs from different subsets of the input.</p></li><li><p><strong>Existing applications will get revisited and updated.</strong> Many existing NLP applications such as multi-document summarization and open-domain question answering have been subject to input length restrictions. New benchmarks and methods will reframe these tasks in light of long-context modeling.</p></li><li><p><strong>Long-sequence modeling enables new applications and research directions.</strong> Modeling of extremely long sequences such as entire books, rich user interaction, and other forms of longitudinal sequential data is now possible. Credit assignment over such long sequences is still a challenge, however, as well as understanding and optimizing long-range memory (see, e.g., <a href="http://aclanthology.lst.uni-saarland.de/2020.acl-main.672/">Rae &amp; Razavi, 2020</a>). 
Long-range sequence modeling may also be a boon for more granular representations such as character-level models (<a href="https://aclanthology.org/2022.tacl-1.5/">Clark et al., 2022</a>; <a href="https://openreview.net/forum?id=JtBRnrlOEFN">Tay et al., 2022</a>), which have been hindered by limits on sequence length.</p></li><li><p><strong>Hardware constraints will play a bigger role in influencing future advances. </strong>This has been the case throughout the history of ML but will become more pronounced in light of the huge compute budgets necessary to train current models. We will see more methods that achieve gains in efficiency through clever hardware utilization; some recent examples: <a href="https://proceedings.neurips.cc/paper/2020/hash/81f7acabd411274fcf65ce2070ed568a-Abstract.html">TinyTL</a> (memory reduction by only updating biases), <a href="https://arxiv.org/abs/2106.09685">LoRA</a> (adapters without additional inference latency), and <a href="https://arxiv.org/abs/2305.14314">QLoRA</a> (quantized adapters).</p></li></ol><div><hr></div><h2>&#129296; Gzip: The case for simple models</h2><p>In light of the excitement around pre-trained Transformer models, it may be easy to forget that there is a huge range of other architectures and models, which may be more suitable depending on what considerations are important for a given use case.</p><p>I am generally a fan of papers that diverge from the beaten track and explore or revisit under-explored methods. During my PhD, for instance, I explored classic semi-supervised learning methods (<a href="https://aclanthology.org/P18-1096/">ACL 2018</a>). Another paper in this vein that I enjoyed is by Tay et al. (<a href="https://aclanthology.org/2021.acl-long.335/">ACL 2021</a>), who compared pre-trained convolutions to pre-trained Transformers.</p><p>Recently, another contrarian paper attracted attention on Twitter. 
It proposes using a gzip-based compression and distance metric with a <em>k</em>-nearest-neighbor classifier for text classification (<a href="https://aclanthology.org/2023.findings-acl.426/">Findings of ACL 2023</a>).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> Remarkably, the authors show competitive performance with more complex non-pre-trained methods on six text classification datasets and even seemingly outperform BERT on five out-of-distribution datasets.</p>
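<p>For intuition, this is what the approach boils down to: compress each text with gzip, measure similarity via the normalized compression distance, and take a <em>k</em>-nearest-neighbor vote. The snippet below is a toy sketch based on the paper&#8217;s description (with made-up example data), not the authors&#8217; released implementation:</p>
<pre><code class="language-python">
import gzip
from collections import Counter

def clen(s: str) -> int:
    # Length of the gzip-compressed text, a crude proxy for Kolmogorov complexity.
    return len(gzip.compress(s.encode()))

def ncd(a: str, b: str) -> float:
    # Normalized compression distance between two texts.
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def knn_predict(text: str, train: list, k: int = 3) -> str:
    # Plain kNN with a majority vote. (Counting a prediction as correct whenever
    # *either* of the top-2 neighbors matches, as in the released evaluation
    # script, would inflate accuracy relative to this.)
    neighbors = sorted(train, key=lambda pair: ncd(text, pair[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [
    ("the team won the match last night", "sports"),
    ("stocks fell sharply amid rate fears", "finance"),
    ("the striker scored twice in the final", "sports"),
    ("the central bank raised interest rates", "finance"),
]
print(knn_predict("a late goal decided the game", train))
</code></pre>
<p>Even this toy version makes the trade-off visible: there is nothing to train, but every prediction compresses the test text against the entire training set, which is what makes the method slow on larger datasets.</p>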
<div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/108288a9-9ad5-4790-948d-f8b45def65d8_2164x578.png" alt=""><figcaption class="image-caption">Test accuracy of the gzip-based method compared to other text classification methods on out-of-distribution (OOD) datasets (Jiang et al., 2023).</figcaption></figure></div><p>The authors set a good example by releasing the source code of their method, enabling others to experiment and to try to reproduce their results. However, researchers such as Ken Schutte quickly identified some inconsistencies with their method. Specifically, the <a href="https://kenschutte.com/gzip-knn-paper/">authors report an &#8216;oracle&#8217; top-2 accuracy for their method</a> rather than the standard accuracy used by the baselines; in addition, there are <a href="https://kenschutte.com/gzip-knn-paper2/">issues with the underlying source datasets</a>, which lead to unreliable results, and computing the distance metric for large training sets is excessively slow. <a href="https://twitter.com/nlopitz/status/1689267192095703040?s=20">Juri Opitz</a> also observes that a <a href="https://arxiv.org/abs/2307.15002">bag-of-words-based distance metric outperforms</a> gzip-based distance.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>Unfortunately, as usual, there is &#8216;no free lunch&#8217; in machine learning&#8212;at least when it comes to non-pre-trained models. It is unlikely that a simple method such as gzip-based compression outperforms pre-trained models that have learned much richer representations unless it has been designed with specific use cases in mind.</p><p>Nevertheless, simplicity is a virtue. 
<strong>The promise of simple models is interpretability, which has proven elusive in the current era of black-box behemoths.</strong> Two of my favorite &#8216;classic&#8217; examples of deceptively simple methods that punch above their weight are the Deep Averaging Network (<a href="https://aclanthology.org/P15-1162/">ACL 2015</a>) and the feature-based classifier used to examine the CNN/Daily Mail dataset (<a href="https://aclanthology.org/P16-1223/">ACL 2016</a>). The latter demonstrates the advantage of interpretability, highlighting that simple features are sufficient for many reading comprehension questions. </p><p>A family of models that expertly embodies this notion of simplicity and interpretability is Generalized Additive Models (GAMs)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, which consist of a linear combination of single-feature functions:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;g(\\mathbb{E}[y]) = &#946; + f_1(x_1) + f_2(x_2) + \\ldots + f_K(x_K) &quot;,&quot;id&quot;:&quot;VPJRUEXXMT&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>g</em> is the &#8216;link&#8217; function such as a logistic function. Recently, Neural Additive Models (NAMs; <a href="https://arxiv.org/abs/2004.13912">NeurIPS 2021</a>), a neural extension of GAMs, have been proposed, which learn a linear combination of single-feature <em>neural networks</em>. We have not seen much of these models in practice for NLP applications due to the high dimensionality of language features and the complexity of natural language.</p>
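<p>To give a flavor of what this looks like in code, here is a toy PyTorch sketch of a NAM for binary classification (my own minimal illustration, not the official implementation): one small network per feature, whose scalar contributions are summed with a bias and squashed by a sigmoid link.</p>
<pre><code class="language-python">
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    # Small MLP mapping a single scalar feature to a scalar contribution f_k(x_k).
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):  # x: (batch, 1)
        return self.net(x)

class NAM(nn.Module):
    def __init__(self, num_features: int):
        super().__init__()
        self.feature_nets = nn.ModuleList([FeatureNet() for _ in range(num_features)])
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):  # x: (batch, num_features)
        # Each feature is scored independently, so a prediction decomposes into
        # per-feature contributions that can be inspected or plotted.
        contributions = [f(x[:, i:i + 1]) for i, f in enumerate(self.feature_nets)]
        logits = torch.cat(contributions, dim=1).sum(dim=1) + self.bias
        return torch.sigmoid(logits), contributions

model = NAM(num_features=4)
probs, contribs = model(torch.randn(8, 4))
print(probs.shape, len(contribs))  # torch.Size([8]) 4
</code></pre>
<p>Because every contribution depends on a single feature, the learned shape functions can be visualized directly, which is where the interpretability in the example below comes from.</p>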
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de66d733-eaa7-4d76-aea0-1954664ee3c5_2126x1236.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:846,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:416168,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r-pb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde66d733-eaa7-4d76-aea0-1954664ee3c5_2126x1236.png 424w, https://substackcdn.com/image/fetch/$s_!r-pb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde66d733-eaa7-4d76-aea0-1954664ee3c5_2126x1236.png 848w, https://substackcdn.com/image/fetch/$s_!r-pb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde66d733-eaa7-4d76-aea0-1954664ee3c5_2126x1236.png 1272w, https://substackcdn.com/image/fetch/$s_!r-pb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde66d733-eaa7-4d76-aea0-1954664ee3c5_2126x1236.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An example of understanding individual NAM predictions for credit scores. For an input, each feature net returns a contribution term, which are added up and passed through the link function for prediction, allowing for easy interpretability.</figcaption></figure></div><p>Nevertheless, as we continue to design models that are most useful to real-world users, it is important to remember that huge pre-trained models may not be the silver bullet for every problem. 
We should be aware that there are glass-box models that are more interpretable and efficient and may thus be a better fit for applications where these characteristics are important.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>How to optimally compress text is a long-standing problem. The <a href="http://prize.hutter1.net/">Hutter Prize</a>, for instance, asks participants to compress enwik9 as much as possible. Abhinav Upadhyay compares alternative compression algorithms to gzip in <a href="https://codeconfessions.substack.com/p/lz77-is-all-you-need">this post</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>It is important to remind ourselves that this follow-up work was mainly possible because the authors released their code&#8212;kudos to them!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Generalized linear models such as logistic regression are a special form of GAMs.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Moving to Substack ➡️, Scaling Up 📈, Image Generation 🖼]]></title><description><![CDATA[Hi all,]]></description><link>https://newsletter.ruder.io/p/moving-to-substack-scaling-up-image</link><guid isPermaLink="false">https://newsletter.ruder.io/p/moving-to-substack-scaling-up-image</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Sun, 06 Nov 2022 21:31:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KbvJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fcec2e37d-a4d8-499d-a66e-44ebf1832d49_740x330.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi all,</p><p>Welp. Twitter <a href="https://www.theinformation.com/briefings/twitter-will-shut-down-newsletter-product-revue-by-year-end">plans to shut down Revue</a>, my previous newsletter platform. I hope you like our new home, <a href="https://substack.com/">Substack</a>. &#129303; I&#8217;m actually quite excited about the change. Substack allows comments on posts. We can now have actual discussions rather than just one-on-one email exchanges. Yay!</p><p>If you are already subscribed, everything should stay the same (I hope). If you are not, have a look around and see if you like things. All newsletter issues will stay free as before.</p><p>Here is some coverage of recent work on <strong>scaling up language models</strong> and <strong>text-based image generation</strong> that I did not get around to publishing (updated with some recent results).</p><h1>Scaling up results &#128200;</h1><p>As language models have become larger, <a href="https://ruder.io/nlp-benchmarking/">they have outgrown some of the datasets we used to evaluate them</a>. <a href="https://arxiv.org/abs/2206.04615">BIG-Bench</a>, a two-year collaboration consisting of 204 tasks created by 442 authors, aims to provide a diverse collection of tasks for the evaluation of current and future models.</p><p>Using 204 tasks for evaluation can be quite unwieldy, however, so recent work has already focused on <a href="https://arxiv.org/abs/2210.09261">23 particularly challenging tasks</a> where prior models did not outperform human annotators. It turns out that if you use <a href="https://arxiv.org/abs/2201.11903">chain-of-thought prompting</a>&#8212;a recent method where you prompt a model to predict intermediate reasoning steps before producing the final answer&#8212;then recent large models surpass the average human annotator on up to 17/23 tasks. It seems we need to keep looking for more challenging evaluation tasks for large language models.</p>
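<p>As a toy illustration of the difference (the example follows the canonical one from the chain-of-thought paper rather than the benchmark tasks themselves), the exemplar in the prompt simply spells out the intermediate steps before giving the final answer:</p>
<pre><code>
Standard few-shot exemplar:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.

Chain-of-thought exemplar:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
   5 + 6 = 11. The answer is 11.
</code></pre>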
<p>One thing that we have learned from training and evaluating large language models is that "scale generally works". In other words, making models larger and training on more data usually leads to improved performance.</p><p>In some cases, this can lead to new behaviour. A small model may be bad at adding two numbers while a larger model may suddenly be able to perform the task with reasonable accuracy. In such cases, the smooth performance curves that we would expect based on <a href="https://arxiv.org/abs/2001.08361">scaling laws</a> are instead step-functions that transition from random to above-random performance at a certain model scale.</p>
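<p>As a point of reference for what we would "expect" here: the scaling laws linked above (Kaplan et al., 2020) describe smooth power-law improvements of the pre-training loss, roughly L(N) &#8776; (N<sub>c</sub>/N)<sup>&#945;<sub>N</sub></sup> as a function of parameter count N, with a small exponent (&#945;<sub>N</sub> &#8776; 0.076 in that paper). Emergent abilities are striking precisely because downstream task accuracy does not follow such a smooth curve but jumps at some scale.</p>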
<div class="captioned-image-container"><figure><img src="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/cec2e37d-a4d8-499d-a66e-44ebf1832d49_740x330.jpeg" alt=""><figcaption class="image-caption">Emergent few-shot abilities of recent large language models on three benchmarks. 3-digit addition and subtraction, 2-digit multiplication (left); transliteration from the International Phonetic Alphabet (IPA) (middle); performance on the Word in Context (WiC) task (right) (<a href="https://arxiv.org/abs/2206.07682">Wei et al., 2022</a>).</figcaption></figure></div><p>Such behaviour has been observed in recent large model papers such as <a href="https://arxiv.org/abs/2201.08239">LaMDA</a>, <a href="https://arxiv.org/abs/2005.14165">GPT-3</a>, <a href="https://arxiv.org/pdf/2112.11446.pdf">Gopher</a>, or <a href="https://arxiv.org/abs/2204.02311">PaLM</a>. <a href="https://arxiv.org/abs/2206.07682">Wei et al. (2022)</a> provide an overview of other types of emergent abilities that have been identified so far in current models.</p><p>Besides such emergent abilities, is there any other behaviour that may be surprising as we scale models up? 
The <a href="https://github.com/inverse-scaling/prize">Inverse Scaling Prize</a>, a recent competition awarded a prize of up to $100,000 if you find a task where performance goes down &#128201; as models get larger.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-05y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d404b5a-a919-41f8-a806-fe7a35c3768c_740x387.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-05y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d404b5a-a919-41f8-a806-fe7a35c3768c_740x387.jpeg 424w, https://substackcdn.com/image/fetch/$s_!-05y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d404b5a-a919-41f8-a806-fe7a35c3768c_740x387.jpeg 848w, https://substackcdn.com/image/fetch/$s_!-05y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d404b5a-a919-41f8-a806-fe7a35c3768c_740x387.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!-05y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d404b5a-a919-41f8-a806-fe7a35c3768c_740x387.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-05y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d404b5a-a919-41f8-a806-fe7a35c3768c_740x387.jpeg" width="740" height="387" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/0d404b5a-a919-41f8-a806-fe7a35c3768c_740x387.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:387,&quot;width&quot;:740,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:30678,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-05y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d404b5a-a919-41f8-a806-fe7a35c3768c_740x387.jpeg 424w, https://substackcdn.com/image/fetch/$s_!-05y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d404b5a-a919-41f8-a806-fe7a35c3768c_740x387.jpeg 848w, https://substackcdn.com/image/fetch/$s_!-05y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d404b5a-a919-41f8-a806-fe7a35c3768c_740x387.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!-05y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F0d404b5a-a919-41f8-a806-fe7a35c3768c_740x387.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A good way to narrow the search space is to look for tasks where an inductive bias is more important than more parameters. A minimal example for this is highlighted by <a href="https://twitter.com/hardmaru/status/1541656926647623680">@hardmaru</a>. Given that the evaluation will focus on large language models, anything that <em>cannot</em> be learned from large amounts of text may be a good starting point. Nevertheless, large language models provide a good initialisation for various sequence modelling tasks such as RL (<a href="http://can%20wikipedia%20help%20offline%20reinforcement%20learning/?">Reid et al., 2022</a>) or can be easily learned together with other sequence tasks (<a href="https://arxiv.org/pdf/2205.06175.pdf">Reed et al., 2022</a>). So simply the fact that something is hard to learn from text alone does not mean that large LMs will not be able to learn it.</p><p>Scaling effects may also disappear or appear differently when investigated in a larger-scale setting. For some of the Inverse Scaling Prize winners, <a href="https://arxiv.org/abs/2211.02011">Wei et al. (2022)</a> observe that the inverse scaling effect goes away when evaluated with 2x larger models and 5x more training compute. They also highlight the usefulness of chain-of-thought prompting to protect against inverse scaling. 
Inverse scaling thus depends on the specific training and test setting of a model.</p><div class="captioned-image-container"><figure><img src="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/a3e0fca6-6743-4515-b167-0ade34522635_2236x928.png" alt=""><figcaption class="image-caption">For some tasks where inverse scaling has been observed, the effect disappears with increasing model scale (<a href="https://arxiv.org/abs/2211.02011">Wei et al., 2022</a>).</figcaption></figure></div><h1>Image Generation is Heating Up &#128293;</h1><h2><strong>Imagen and Parti</strong></h2><p>After <a href="https://openai.com/dall-e-2/">DALL-E 2</a>, <a href="https://blog.google/technology/research/how-ai-creates-photorealistic-images-from-text/?utm_source=tw&amp;utm_medium=social&amp;utm_campaign=og&amp;utm_content=&amp;utm_term=">two new text-to-image models</a> have been released by Google, <a href="https://imagen.research.google/">Imagen</a> (pronounced "imagine") and <a href="https://parti.research.google/">Parti</a> (pronounced <a href="https://twitter.com/jasonbaldridge/status/1541847877315698689">"par-tee"</a>). Similar to DALL-E 2, Imagen is a <a href="https://newsletter.ruder.io/issues/palm-dall-e-2-chinchilla-chain-of-thought-prompting-values-and-culture-in-nlp-845878">diffusion model</a>. Parti, on the other hand, is auto-regressive and more like a standard encoder-decoder language model (LM). 
It learns to generate sequences of visual tokens, which are then converted by a <a href="https://ai.googleblog.com/2022/05/vector-quantized-image-modeling-with.html">ViT-VQGAN</a> model into an actual image.</p><p>As the blog post highlights, limitations of current text-to-image models include their inability to count reliably, to follow precise spatial instructions, and to deal with complex prompts.</p><p>Given the potential of these models, it is not difficult to envision strategies to shore up each of these weaknesses. These can include collecting data that explicitly represents each phenomenon, using distant supervision with large amounts of data, and more explicit representations of number or spatial attributes in current models.</p><p>Another thing that can be challenging with current models is to generate natural text. 
For instance, DALL-E often produces gibberish, as can be seen below.</p><div class="captioned-image-container"><figure><img src="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/bdc6b53d-e308-4c6f-b5b1-8499a02213ee_740x709.jpeg" alt=""><figcaption class="image-caption">"Two farmers talking about vegetables, with subtitles"&#8212;as imagined by DALL-E 2.</figcaption></figure></div><p><a href="https://arxiv.org/abs/2206.00169">Daras et al. (2022)</a> fed some of the gibberish text DALL-E 2 produced back to the model as a prompt, discovering that the model indeed generates images related to vegetables. They argue that many such terms are consistent across prompts and that DALL-E 2 thus has a "secret" vocabulary (a "secret language" in the <a href="https://twitter.com/giannis_daras/status/1531693093040230402?s=20&amp;t=dqCtszWwzSjT0Kie8Ig1Rg">original Twitter thread</a>). In practice, such studies should be taken with a large grain of salt &#129474;. <a href="https://twitter.com/benjamin_hilton/status/1531780892972175361">Benjamin Hilton</a> highlights that most terms are not consistent across different prompts. So while terms like "Apoploe vesrreaitais" may kind of look like a biological name for birds and thus result in mostly bird-related generations, most other terms produced by DALL-E 2 are just noise&#8212;and definitely do not have the properties of a language such as grammar, morphology, etc.</p><p>After all, we know that models learning to play referential games often come up with degenerate communication protocols consisting of arbitrary symbols (<a href="https://openreview.net/forum?id=AUGBfDIV9rL">Chaabouni et al., 2022</a> is a nice recent study of the importance of scale for emergent communication). 
So I hope we can turn down the hype and clickbaiting; let&#8217;s not go back to a time of newspapers reporting on <a href="https://www.techtimes.com/articles/212124/20170730/facebook-ai-invents-language-that-humans-cant-understand-system-shut-down-before-it-evolves-into-skynet.htm">AIs inventing a language that humans don't understand</a> &#128580;.</p><p>Larger variants of more recent models like Parti are also better at producing natural text&#8212;it seems this is another emergent ability of large multi-modal models.</p><div class="captioned-image-container"><figure><img src="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/b8ec1a51-76d9-426a-8165-3c083e3e7f2b_740x204.png" alt=""><figcaption class="image-caption">Emergent ability of producing natural text in images by Parti.</figcaption></figure></div><p>If you are envious of the high-quality generation of these closed large models, fear not. <a href="https://huggingface.co/spaces/dalle-mini/dalle-mini">DALL-E mini</a> is an open-source alternative created by <a href="https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini-Generate-images-from-any-text-prompt--VmlldzoyMDE4NDAy">Boris Dayma</a> that <a href="https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mini-Explained-with-Demo--Vmlldzo4NjIxODA">reproduces DALL-E</a> with a smaller architecture.</p><p>The model has become very popular lately, with <a href="https://www.businessinsider.com/dall-e-mini">many websites</a> covering how to get around high traffic errors. My favourite parts about DALL-E mini are a <a href="https://www.reddit.com/r/weirddalle/">reddit page</a> and a <a href="https://twitter.com/weirddalle">Twitter account</a> that chronicle the often hilariously weird and surreal creations of the model.</p><div class="twitter-embed"><a href="https://twitter.com/weirddalle/status/1534549407537963010">Tweet from Weird Dall-E Generations (@weirddalle), 8 June 2022</a></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://newsletter.ruder.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading NLP News! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[PaLM 🌴, DALL-E 2 👨‍🎨, Chinchilla 🐭, Chain-of-thought prompting ⛓💭✍️, Values and Culture in NLP 🏛]]></title><description><![CDATA[Hi all,This newsletter covers PaLM, DALL-E 2, and Chinchilla, chain-of-thought prompting, and the role of values and culture in NLP.This edition is somewhat delayed as I've been busy with planning a move (I'll be flying &#128747; to Germany tomorrow; say hi &#128075; if you're in Berlin) and exhausted by current events. I hope that you are all staying safe in these trying times &#127482;&#127462;.I really appreciate your feedback, so let me know what you&#160;love &#10084;&#65039; and hate &#128148; about this edition. Simply hit reply on the issue.Click here to view the newsletter in your browser.If you were referred by a friend, click&#160;here&#160;to subscribe. If you enjoyed this issue,&#160;give it a&#160;tweet&#160;&#128038;.]]></description><link>https://newsletter.ruder.io/p/palm-dall-e-2-chinchilla-chain-of-thought-prompting-values-and-culture-in-nlp-845878</link><guid isPermaLink="false">https://newsletter.ruder.io/p/palm-dall-e-2-chinchilla-chain-of-thought-prompting-values-and-culture-in-nlp-845878</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Sat, 16 Apr 2022 09:00:04 GMT</pubDate><enclosure url="https://s3.amazonaws.com/revue/items/images/015/285/179/original/Screenshot_2022-04-15_at_11.39.10.png?1650019353" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi all,</p><p>This newsletter covers PaLM, DALL-E 2, and Chinchilla, chain-of-thought prompting, and the role of values and culture in NLP.</p><p>This edition is somewhat delayed as I've been busy with planning a move (I'll be flying &#128747; to Germany tomorrow; say hi &#128075; if you're in Berlin) and exhausted by current events. I hope that you are all staying safe in these trying times &#127482;&#127462;.</p><p>I really appreciate your feedback, so let me know what you&nbsp;love &#10084;&#65039; and hate &#128148; about this edition. Simply hit reply on the issue.</p><p>Click <strong><a href="https://newsletter.ruder.io/issues/palm-dall-e-2-chinchilla-chain-of-thought-prompting-values-and-culture-in-nlp-845878/3dcf8c68-12d0-4dbe-9f32-e683e9e2b2cc">here</a></strong> to view the newsletter in your browser.</p><p>If you were referred by a friend, click&nbsp;<a href="http://newsletter.ruder.io/?utm_campaign=NLP%20News&amp;utm_medium=email&amp;utm_source=Revue%20newsletter">here</a>&nbsp;to subscribe. 
If you enjoyed this issue,&nbsp;give it a&nbsp;<a href="https://twitter.com/intent/tweet?url=http%3A%2F%2Fnewsletter.ruder.io%2Fissues%2Fpalm-dall-e-2-chinchilla-chain-of-thought-prompting-values-and-culture-in-nlp-845878%2F3dcf8c68-12d0-4dbe-9f32-e683e9e2b2cc&amp;via=revue&amp;text=PaLM%20%F0%9F%8C%B4%2C%20DALL-E%202%20%F0%9F%91%A8%E2%80%8D%F0%9F%8E%A8%2C%20Chinchilla%20%F0%9F%90%AD%2C%20Chain-of-thought%20prompting%20%E2%9B%93%F0%9F%92%AD%E2%9C%8D%EF%B8%8F%2C%20Values%20and%20Culture%20in%20NLP%20%F0%9F%8F%9B%20by%20%40seb_ruder&amp;related=revue">tweet</a>&nbsp;&#128038;.</p><div><hr></div><h2>This Model Can Understand Your Jokes &#129322;</h2><p>The emergence of large pre-trained models has fundamentally changed the face and nature of progress in ML and NLP. The underlying methods have not changed dramatically; neural networks have already been pre-trained <a href="https://www.science.org/doi/pdf/10.1126/science.1127647?casa_token=jLTYgnADGqoAAAAA:VRlOXPfH5XsCgxieiLeXkrznIjKwKzvCgo3LEA_VaIr5-ZQU8v6cgGaXLZSYbMgNb6u0F8_54vsU">more than</a> <a href="https://proceedings.neurips.cc/paper/2006/file/5da713a690c067105aeb2fae32403405-Paper.pdf">15 years ago</a>. However, the recent scale of model size and data have enabled unprecedented&#8212;and indeed unexpected&#8212;capabilities.</p><p>Two recent models showcase the impressive progress in vision and NLP: OpenAI's <a href="https://openai.com/dall-e-2/">DALL-E 2</a> and Google's <a href="https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html">PaLM</a>. Both can be seen as the most recent milestone in a line of ever larger pre-trained models such as <a href="https://openai.com/blog/dall-e/">DALL-E</a>, <a href="https://arxiv.org/abs/1910.10683">T5</a> and <a href="https://arxiv.org/abs/2005.14165">GPT-3</a>, among others.</p><h2>DALL-E 2</h2><p><a href="https://cdn.openai.com/papers/dall-e-2.pdf?ref=hackernoon.com">DALL-E 2</a> consists of two components: a prior that generates a <a href="https://openai.com/blog/clip/">CLIP</a> image embedding based on a text description and a diffusion-based decoder that generates an image conditioned on an image embedding (see <a href="https://newsletter.ruder.io/issues/palm-dall-e-2-chinchilla-chain-of-thought-prompting-values-and-culture-in-nlp-845878/3dcf8c68-12d0-4dbe-9f32-e683e9e2b2cc">this overview</a> of CLIP and diffusion models). The <a href="https://cdn.openai.com/papers/dall-e-2.pdf?ref=hackernoon.com">paper</a>, unfortunately, is light on details regarding the composition and amount of training data. 
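</p><p>Conceptually, generation is a two-stage pipeline. The following minimal sketch illustrates the flow described above; the function names are placeholders for illustration, not the API of the actual system:</p><pre><code># Two-stage text-to-image generation as described for DALL-E 2 (sketch only;
# clip_text_encoder, prior, and decoder are assumed placeholder components).

def generate_image(caption, clip_text_encoder, prior, decoder):
    # 1. Encode the caption with CLIP's text encoder.
    text_embedding = clip_text_encoder(caption)
    # 2. The prior maps the text embedding to a CLIP image embedding.
    image_embedding = prior(text_embedding)
    # 3. The diffusion-based decoder generates an image conditioned on
    #    that image embedding.
    return decoder(image_embedding)
</code></pre><p>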
The resulting model produces more photorealistic and faithful images than its predecessor (see below).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://s3.amazonaws.com/revue/items/images/015/285/179/original/Screenshot_2022-04-15_at_11.39.10.png?1650019353" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://s3.amazonaws.com/revue/items/images/015/285/179/original/Screenshot_2022-04-15_at_11.39.10.png?1650019353 424w, https://s3.amazonaws.com/revue/items/images/015/285/179/original/Screenshot_2022-04-15_at_11.39.10.png?1650019353 848w, https://s3.amazonaws.com/revue/items/images/015/285/179/original/Screenshot_2022-04-15_at_11.39.10.png?1650019353 1272w, https://s3.amazonaws.com/revue/items/images/015/285/179/original/Screenshot_2022-04-15_at_11.39.10.png?1650019353 1456w" sizes="100vw"><img src="https://s3.amazonaws.com/revue/items/images/015/285/179/original/Screenshot_2022-04-15_at_11.39.10.png?1650019353" data-attrs="{&quot;src&quot;:&quot;https://s3.amazonaws.com/revue/items/images/015/285/179/original/Screenshot_2022-04-15_at_11.39.10.png?1650019353&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://s3.amazonaws.com/revue/items/images/015/285/179/original/Screenshot_2022-04-15_at_11.39.10.png?1650019353 424w, https://s3.amazonaws.com/revue/items/images/015/285/179/original/Screenshot_2022-04-15_at_11.39.10.png?1650019353 848w, https://s3.amazonaws.com/revue/items/images/015/285/179/original/Screenshot_2022-04-15_at_11.39.10.png?1650019353 1272w, https://s3.amazonaws.com/revue/items/images/015/285/179/original/Screenshot_2022-04-15_at_11.39.10.png?1650019353 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Images generated by DALL-E 2.</figcaption></figure></div><p>While DALL-E 2 is able to generate impressive images, it still has weaknesses. As it relies on image&#8211;caption pairs for training, it may perform poorly when generating images that require more fine-grained visual reasoning such as <a href="https://twitter.com/pretendsmarts/status/1347824192557879297?s=20&amp;t=VkEKXaY2doa8RAEJYykwrw">counting</a>.</p><p>In NLP, there has been a debate on whether a language model trained only on unsupervised data can ever truly understand natural language (see <a href="https://newsletter.ruder.io/issues/ml-and-nlp-starter-toolkit-low-resource-nlp-toolkit-can-a-lm-understand-natural-language-the-next-generation-of-nlp-benchmarks-254211">this overview</a> and <a href="https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00412/107385/Provable-Limitations-of-Acquiring-Meaning-from">this recent principled paper</a>). Multi-modal models are grounded by definition. So are there any intrinsic limitations to the capabilities of language-and-vision models trained on image&#8211;caption alignments? 
One could argue that in order to learn truly multi-modal representations, a model must not only learn from depictions of the real world but <a href="https://arxiv.org/abs/2004.10151">must be able to interact with it</a>.</p><h2>PaLM</h2><p>On the language side, <a href="https://arxiv.org/abs/2204.02311">PaLM</a> is a 540B parameter decoder-only pre-trained Transformer model trained on multilingual&#8212;but heavily skewed towards English&#8212;data from the web as well as GitHub code. The model is evaluated in a few-shot setting on a battery of tasks where it generally outperforms the prior SOTA. On fine-tuning on SuperGLUE, the model handily outperforms the best decoder-only model and is competitive with encoder-decoder models (which generally perform better in such a fine-tuned setting).</p><h2>Joke explanations</h2><p>What I found most impressive, however, are some of the qualitative examples of model behaviour. For instance, the model is exceptionally good at explaining jokes. You can judge for yourself below. In each case, the model was prompted with just two example joke explanations and then had to generate its own.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://s3.amazonaws.com/revue/items/images/015/285/505/original/Screenshot_2022-04-15_at_12.07.14.png?1650020964" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://s3.amazonaws.com/revue/items/images/015/285/505/original/Screenshot_2022-04-15_at_12.07.14.png?1650020964 424w, https://s3.amazonaws.com/revue/items/images/015/285/505/original/Screenshot_2022-04-15_at_12.07.14.png?1650020964 848w, https://s3.amazonaws.com/revue/items/images/015/285/505/original/Screenshot_2022-04-15_at_12.07.14.png?1650020964 1272w, https://s3.amazonaws.com/revue/items/images/015/285/505/original/Screenshot_2022-04-15_at_12.07.14.png?1650020964 1456w" sizes="100vw"><img src="https://s3.amazonaws.com/revue/items/images/015/285/505/original/Screenshot_2022-04-15_at_12.07.14.png?1650020964" data-attrs="{&quot;src&quot;:&quot;https://s3.amazonaws.com/revue/items/images/015/285/505/original/Screenshot_2022-04-15_at_12.07.14.png?1650020964&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://s3.amazonaws.com/revue/items/images/015/285/505/original/Screenshot_2022-04-15_at_12.07.14.png?1650020964 424w, https://s3.amazonaws.com/revue/items/images/015/285/505/original/Screenshot_2022-04-15_at_12.07.14.png?1650020964 848w, https://s3.amazonaws.com/revue/items/images/015/285/505/original/Screenshot_2022-04-15_at_12.07.14.png?1650020964 1272w, https://s3.amazonaws.com/revue/items/images/015/285/505/original/Screenshot_2022-04-15_at_12.07.14.png?1650020964 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Joke explanations by PaLM.</figcaption></figure></div><p>It will be interesting to see what this means for tasks such as sarcasm and irony detection, which have been mainstays of competitions such as <a 
href="https://codalab.lisn.upsaclay.fr/competitions/1340">SemEval</a>. I had previously considered these tasks to be still far out of reach of current model capabilities. Such anecdotal evidence naturally does not mean that these tasks are solved but that we may need more sophisticated and robust benchmarks to assess model performance.</p><p>Similarly, explaining jokes is not something that I would have expected current models to be able to do. Consequently, there may be an array of applications that have so far been infeasible where models might be able to add value. We can thus expect to see more work that explores how we can leverage such models for previous unexplored applications. For a large publicly available model to experiment with, check out <a href="https://arxiv.org/abs/2204.06745">GPT-NeoX-20B</a>.</p><p>Buoyed by these latest advances in NLP, there is a <a href="https://www.forbes.com/sites/robtoews/2022/03/27/a-wave-of-billion-dollar-language-ai-startups-is-coming/?sh=7a3c8e272b14">wave of new NLP startups</a> that tackle a diverse set of applications, from search to writing assistants, content moderation, and many more.</p><h2>Training Compute-Efficient LMs &#128045;</h2><p>While large language models are becoming more powerful, they are also becoming increasingly hard to use due to their huge size. In conjunction with scaling models, it is thus key to make advances in a) compression large models to smaller sizes and b) training more compute-efficient models to begin with.</p><p>Regarding the latter, researchers from DeepMind recently observed that <a href="https://arxiv.org/abs/2203.15556">current large language models are significantly under-trained</a>. They noticed that for the most compute-efficient training, when doubling the model size the number of training tokens should also be doubled. This is a much larger scaling rate than that predicted by <a href="https://arxiv.org/abs/2001.08361">previous scaling laws</a>. Their new 70B-parameter model, Chinchilla outperforms models of up to 530B parameters by training on much more data (1.4T vs 300B tokens).</p><p>Such an under-training phenomenon of large models is not entirely new. For <a href="https://arxiv.org/abs/1907.11692">RoBERTa</a>, the authors similarly observed that BERT was significantly under-trained and that longer training improves its performance.</p><p>Given the non-linear improvement and emergence of new capabilities with large model sizes, it will be key to investigate what is necessary to retain such impressive few-shot and reasoning capabilities at smaller model sizes.</p><p>Another direction I am excited about is the modularization of huge models: For most practical applications, not all capabilities of a huge model are truly relevant. How then can we isolate and compress a huge model's domain and task-specific knowledge in a small model that excels on the downstream task? Similarly, how can we efficiently leverage only the parts of the pre-trained model that are necessary for the downstream setting or bootstrap a strong small model using a large pre-trained model? 
For more thoughts on such a modular perspective of model development, check out Colin Raffel's <a href="https://colinraffel.com/blog/a-call-to-build-models-like-we-build-open-source-software.html">call to build open-source models like we build software</a>.</p><h2>Chain-of-Thought Prompting &#9939;&#128173;&#9997;&#65039;</h2><p>Despite the increasing power and capabilities of pre-trained models, the way we use and interact with them has not changed much. In-context prompting, pioneered by GPT-3 (see <a href="https://ruder.io/ml-highlights-2021/index.html#4prompting">this overview</a>) has been one of the most significant recent developments. However, we are still only scratching the surface of how to best extract information from pre-trained LMs and how to prime them for downstream tasks.</p><p>A method that enabled PaLM to perform particularly well on reasoning tasks is <em><a href="https://arxiv.org/abs/2201.11903">chain-of-thought prompting</a></em>. Rather than training a model to predict the answer, chain-of-thought prompting augments the prompt with an explanation of the reasoning steps to arrive at the answer as can be seen below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://s3.amazonaws.com/revue/items/images/015/288/828/original/chain-of-thought_prompting.png?1650030117" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://s3.amazonaws.com/revue/items/images/015/288/828/original/chain-of-thought_prompting.png?1650030117 424w, https://s3.amazonaws.com/revue/items/images/015/288/828/original/chain-of-thought_prompting.png?1650030117 848w, https://s3.amazonaws.com/revue/items/images/015/288/828/original/chain-of-thought_prompting.png?1650030117 1272w, https://s3.amazonaws.com/revue/items/images/015/288/828/original/chain-of-thought_prompting.png?1650030117 1456w" sizes="100vw"><img src="https://s3.amazonaws.com/revue/items/images/015/288/828/original/chain-of-thought_prompting.png?1650030117" data-attrs="{&quot;src&quot;:&quot;https://s3.amazonaws.com/revue/items/images/015/288/828/original/chain-of-thought_prompting.png?1650030117&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://s3.amazonaws.com/revue/items/images/015/288/828/original/chain-of-thought_prompting.png?1650030117 424w, https://s3.amazonaws.com/revue/items/images/015/288/828/original/chain-of-thought_prompting.png?1650030117 848w, https://s3.amazonaws.com/revue/items/images/015/288/828/original/chain-of-thought_prompting.png?1650030117 1272w, https://s3.amazonaws.com/revue/items/images/015/288/828/original/chain-of-thought_prompting.png?1650030117 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Standard prompting vs chain-of-thought prompting. The "chain of thought" is highlighted.</figcaption></figure></div><p>In a few-shot setting, these explanations can be manually written for a few examples. 
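</p><p>To make this concrete, here is a minimal sketch of such a manually written few-shot chain-of-thought prompt (the exemplar follows the arithmetic style used in the paper; the complete() function stands in for any text-completion API and is only assumed here for illustration):</p><pre><code># A few-shot chain-of-thought prompt: the exemplar contains the reasoning
# steps, not just the final answer. Wording is illustrative.
COT_PROMPT = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""

def answer(complete):
    # The model is expected to imitate the exemplar: first the reasoning
    # steps, then a final line of the form "The answer is ...".
    return complete(COT_PROMPT)
</code></pre><p>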
Prompted this way, the model learns to generate a similar explanation, which is particularly useful on more challenging reasoning problems.</p><h2>Related approaches</h2><p>Chain-of-thought prompting can be seen as continuing several prior lines of research. While explanations have been most commonly used to improve interpretability, <a href="https://aclanthology.org/P19-1487.pdf">Rajani et al. (2019)</a> train a model to automatically generate explanations during training and inference, achieving a new SOTA on a commonsense reasoning dataset.</p><p>In a similar vein, <a href="https://arxiv.org/abs/2112.00114">Nye et al. (2021)</a> train a model to write the intermediate computation steps of an arithmetic problem to a "scratchpad". For summarization, <a href="https://aclanthology.org/2021.tacl-1.88/">Narayan et al. (2021)</a> train a model to generate an entity chain (an ordered sequence of entities mentioned in the reference summary). At test time, the model first generates the entity chain before generating the summary.</p><p>There are other ways to improve learning with such intermediate outputs. <a href="https://arxiv.org/abs/2203.11171">Wang et al. (2022)</a> exploit the diversity of reasoning paths by sampling multiple chains of thought and then ensembling the final model predictions. As obtaining explanations for a large number of examples is expensive, <a href="https://arxiv.org/abs/2203.14465">Zelikman et al. (2022)</a> generate explanations for a large dataset by bootstrapping a model in the few-shot setting and only retaining explanations that lead to correct answers.</p><p>Using explanations, rationales, or a description of reasoning steps works empirically, but a more principled theory of how models leverage such rationales is still missing. In particular, it would be interesting to investigate to what extent a model's reasoning conforms to the reasoning steps preferred by humans (although the model can also be trained to perform more human-like reasoning, similar to <a href="https://openai.com/blog/instruction-following/">InstructGPT</a>).</p><h2>Interventions</h2><p>Beyond interpretability, generating an intermediate output enables the user to intervene on a model's predictions. <a href="https://aclanthology.org/2021.tacl-1.88/">Narayan et al. (2021)</a> demonstrate this by removing entities from the entity chain that were not seen in the original input, which improves the faithfulness of the generated summary. As a side-effect, such intermediate-output methods provide an interface and the potential to modulate and steer the predictions of otherwise black-box models. We can thus expect work focusing on whether such rationales truly explain model behaviour, similar to the <a href="https://aclanthology.org/P19-1282/">debate around</a> <a href="https://arxiv.org/abs/1902.10186">the explainability</a> <a href="https://arxiv.org/abs/1908.04626">of attention</a>.</p><h2>Outlook</h2><p>Overall, chain-of-thought prompting and related methods offer a glimpse of the untapped potential of current models. They also present an important and relatively compute-efficient research direction that can bring large improvements on top of state-of-the-art models. In this research, domain expertise is particularly important as it enables the development of strategies, reasoning steps, or alternative input methods that are particularly suited to an application. 
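</p><p>As one concrete illustration of building on top of chains of thought at inference time, here is a rough sketch of the self-consistency idea of Wang et al. (2022) mentioned above; sample_completion and extract_answer are assumed placeholder functions, and the hyperparameters are arbitrary:</p><pre><code># Self-consistency (sketch): sample several reasoning paths and take a
# majority vote over the final answers they arrive at.
from collections import Counter

def self_consistent_answer(prompt, sample_completion, extract_answer,
                           num_samples=10):
    answers = []
    for _ in range(num_samples):
        chain = sample_completion(prompt, temperature=0.7)  # sampled reasoning path
        answers.append(extract_answer(chain))  # e.g. the text after "The answer is"
    # Return the most frequent final answer across the sampled chains.
    return Counter(answers).most_common(1)[0][0]
</code></pre><p>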
Prompts also do not need to be restricted to input&#8211;output pairs or explanations and can be much richer, including things to avoid, rules of thumb, positive or negative examples, etc as in the schema of <a href="https://arxiv.org/abs/2104.08773">Mishra et al. (2022)</a> below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://s3.amazonaws.com/revue/items/images/015/289/344/original/instruction_schema.png?1650031891" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://s3.amazonaws.com/revue/items/images/015/289/344/original/instruction_schema.png?1650031891 424w, https://s3.amazonaws.com/revue/items/images/015/289/344/original/instruction_schema.png?1650031891 848w, https://s3.amazonaws.com/revue/items/images/015/289/344/original/instruction_schema.png?1650031891 1272w, https://s3.amazonaws.com/revue/items/images/015/289/344/original/instruction_schema.png?1650031891 1456w" sizes="100vw"><img src="https://s3.amazonaws.com/revue/items/images/015/289/344/original/instruction_schema.png?1650031891" data-attrs="{&quot;src&quot;:&quot;https://s3.amazonaws.com/revue/items/images/015/289/344/original/instruction_schema.png?1650031891&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://s3.amazonaws.com/revue/items/images/015/289/344/original/instruction_schema.png?1650031891 424w, https://s3.amazonaws.com/revue/items/images/015/289/344/original/instruction_schema.png?1650031891 848w, https://s3.amazonaws.com/revue/items/images/015/289/344/original/instruction_schema.png?1650031891 1272w, https://s3.amazonaws.com/revue/items/images/015/289/344/original/instruction_schema.png?1650031891 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The schema used in the Natural Instructions dataset of Mishra et al. (2022).</figcaption></figure></div><h2>Values and Culture in NLP &#127963;</h2><p>One of the most important and far-reaching recent insights is that language models inherit the biases of the data they are trained on (see <a href="https://ruder.io/ml-highlights-2021/index.html#10bias">this overview</a>). Over time, our understanding of such biases has become more nuanced. Beyond generating toxic language when <a href="https://aclanthology.org/2020.findings-emnlp.301.pdf">conditioned with certain prompts</a>, recent work has turned to investigating a model's <em>ideology and values</em>.</p><p><a href="https://arxiv.org/abs/2201.10474">Gururangan et al. (2022)</a> aim to identify the language ideology encoded in GPT-3 by analyzing what type of language the quality filter used in GPT-3 is biased against. They replicate it and apply it to a corpus of U.S. high school newspapers (augmented with demographic information). 
They find that the filter favours text from authors who originate from regions with better educational attainment, urban centres, larger schools, and higher valued homes.</p><p>Looking at specific cultural values, <a href="https://arxiv.org/abs/2203.13722">Arora et al. (2022)</a> converted the questions of <a href="https://en.wikipedia.org/wiki/Hofstede%27s_cultural_dimensions_theory">Hofstede's cultural dimensions survey</a> and of the <a href="https://www.worldvaluessurvey.org/wvs.jsp">World Values survey</a> into prompts that were presented to multilingual language models. They find that the models exhibit differences in cultural values and that the values exhibited by the models are not in line with the values of the survey participants.</p><p><a href="https://arxiv.org/abs/2203.07785">Johnson et al. (2022)</a> investigate the values of GPT-3 by prompting it to summarize a range of culturally diverse texts. They examine the generated summaries and highlight problematic summaries and ones where the expressed values conflict with the original text.</p><p>These works demonstrate that beyond investigating bias related to specific lexical terms in current models, we also must be aware of the underlying values encoded in the model and expressed in the generated text. After all, we would not want our models to hold views that are outdated or disrespectful in certain cultural settings. However, the best way to investigate and robustly identify such values is still an open question.</p><p>The impact of culture in NLP goes beyond ideology and values. For a great overview of the cultural dimensions that are relevant in NLP, have a look at <a href="https://arxiv.org/abs/2203.10020">this survey</a>. The authors define four cultural dimensions of relevance: linguistic form and style (how things are expressed in language), common ground (shared knowledge based on which people reason and communicate), "aboutness" (what information is relevant or meaningful to people), and objectives or values (what people strive for).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://s3.amazonaws.com/revue/items/images/015/290/223/original/cultural_dimensions.png?1650034527" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://s3.amazonaws.com/revue/items/images/015/290/223/original/cultural_dimensions.png?1650034527 424w, https://s3.amazonaws.com/revue/items/images/015/290/223/original/cultural_dimensions.png?1650034527 848w, https://s3.amazonaws.com/revue/items/images/015/290/223/original/cultural_dimensions.png?1650034527 1272w, https://s3.amazonaws.com/revue/items/images/015/290/223/original/cultural_dimensions.png?1650034527 1456w" sizes="100vw"><img src="https://s3.amazonaws.com/revue/items/images/015/290/223/original/cultural_dimensions.png?1650034527" data-attrs="{&quot;src&quot;:&quot;https://s3.amazonaws.com/revue/items/images/015/290/223/original/cultural_dimensions.png?1650034527&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://s3.amazonaws.com/revue/items/images/015/290/223/original/cultural_dimensions.png?1650034527 424w, https://s3.amazonaws.com/revue/items/images/015/290/223/original/cultural_dimensions.png?1650034527 848w, https://s3.amazonaws.com/revue/items/images/015/290/223/original/cultural_dimensions.png?1650034527 1272w, https://s3.amazonaws.com/revue/items/images/015/290/223/original/cultural_dimensions.png?1650034527 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The role of culture in NLP.</figcaption></figure></div>]]></content:encoded></item><item><title><![CDATA[Highlights, new tasks & graph ML in 2021; Safer pre-trained models; Embeddings: Larger ≠ better]]></title><description><![CDATA[Hi all,I hope you've had a good start to the new year. This newsletter covers my and others' highlights of 2021. I also discuss recent pre-trained models that put more emphasis on safety and recent text similarity models where large is not always better. I really appreciate your feedback, so let me know what you&#160;love &#10084;&#65039; and hate &#128148; about this edition. Simply hit reply on the issue.Click here to view the newsletter in your browser.If you were referred by a friend, click&#160;here&#160;to subscribe. If you enjoyed this issue,&#160;give it a&#160;tweet&#160;&#128038;.]]></description><link>https://newsletter.ruder.io/p/highlights-new-tasks-graph-ml-in-2021-safer-pre-trained-models-embeddings-larger-better-900649</link><guid isPermaLink="false">https://newsletter.ruder.io/p/highlights-new-tasks-graph-ml-in-2021-safer-pre-trained-models-embeddings-larger-better-900649</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Mon, 31 Jan 2022 18:00:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!s2L_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0e246b0-416e-494b-915a-d5096d639de9_1746x988.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi all,</p><p>I hope you've had a good start to the new year. This newsletter covers <strong>my and others' highlights of 2021</strong>. I also discuss <strong>recent pre-trained models that put more emphasis on safety</strong> and recent <strong>text similarity models</strong> where <strong>large is not always better</strong>.</p><p>I really appreciate your feedback, so let me know what you&nbsp;love &#10084;&#65039; and hate &#128148; about this edition. Simply hit reply on the issue.</p><p>Click <strong><a href="https://newsletter.ruder.io/issues/highlights-new-tasks-graph-ml-in-2021-safer-pre-trained-models-embeddings-larger-better-900649/dc29f6b6-6586-4bc0-8c12-f7c27196e79b">here</a></strong> to view the newsletter in your browser.</p><p>If you were referred by a friend, click&nbsp;<a href="http://newsletter.ruder.io/?utm_campaign=NLP%20News&amp;utm_medium=email&amp;utm_source=Revue%20newsletter">here</a>&nbsp;to subscribe. 
If you enjoyed this issue,&nbsp;give it a&nbsp;<a href="https://twitter.com/intent/tweet?url=http%3A%2F%2Fnewsletter.ruder.io%2Fissues%2Fhighlights-new-tasks-graph-ml-in-2021-safer-pre-trained-models-embeddings-larger-better-900649%2Fdc29f6b6-6586-4bc0-8c12-f7c27196e79b&amp;via=revue&amp;text=Highlights%2C%20new%20tasks%20%26%20graph%20ML%20in%202021%3B%20Safer%20pre-trained%20models%3B%20Embeddings%3A%20Larger%20%E2%89%A0%20better%20by%20%40seb_ruder&amp;related=revue">tweet</a>&nbsp;&#128038;.</p><div><hr></div><h2>Looking back at 2021 &#128064;</h2><h2>ML and NLP Research Highlights of 2021 &#128161;</h2><p>I wrote up some of my research highlights in 2021 in <a href="https://ruder.io/ml-highlights-2021/">this post</a>. Overall, most of the trends I observed revolved around pre-trained models and their capabilities&#8212;how to <strong>train them more effectively</strong>, how to do <strong>few-shot learning</strong> with them, how to <strong>use them efficiently</strong>, how to <strong>evaluate</strong> them, using them for new applications such as <strong>program synthesis</strong>, etc. What were your highlights? You can share them by replying to the tweet below and I'll summarize them in the next newsletter.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://twitter.com/seb_ruder/status/1485710121816805377&quot;,&quot;full_text&quot;:&quot;ML and NLP Research Highlights of 2021\n\nThese are the research areas and papers I found most exciting &amp;amp; inspiring in 2021.\n\n&quot;,&quot;username&quot;:&quot;seb_ruder&quot;,&quot;name&quot;:&quot;Sebastian Ruder&quot;,&quot;profile_image_url&quot;:&quot;&quot;,&quot;date&quot;:&quot;Mon Jan 24 20:24:30 +0000 2022&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:425,&quot;like_count&quot;:1447,&quot;impression_count&quot;:0,&quot;expanded_url&quot;:{&quot;url&quot;:&quot;https://ruder.io/ml-highlights-2021/&quot;,&quot;image&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/f0e246b0-416e-494b-915a-d5096d639de9_1746x988.png&quot;,&quot;title&quot;:&quot;ML and NLP Research Highlights of 2021&quot;,&quot;description&quot;:&quot;This post summarizes progress across multiple impactful areas in ML and NLP in 2021.&quot;,&quot;domain&quot;:&quot;ruder.io&quot;},&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><h2>New ML Tasks in 2021 &#128189;</h2><p>Another area I'm quite excited about is when ML is used to do new things. While such applications can be practically useful, such as using AlphaFold 2.0 to accelerate the drug discovery process, I particularly enjoyed unconventional tasks or tasks that provide a new perspective on existing research areas. Here are my favourites from 2021:</p><p><strong><a href="https://github.com/google/BIG-bench">BIG-bench</a></strong> contains a smorgasbord of diverse, sometimes quirky tasks for probing language models (LMs). <a href="https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/checkmate_in_one">Predicting checkmate</a>? &#9989; <a href="https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/emoji_movie">Guessing movies based on emojis</a>? &#9989; <a href="https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/fantasy_reasoning">Reasoning in a fantasy world</a>? &#9989; The latter task includes examples like:</p><blockquote><p>As an amputee you experience phantom arm syndrome. 
Then one day you realize you can use it to punch ghosts. Your left arm is amputated but you still have your right arm.&nbsp;<em>Do you use your left arm to hit the late Elvis Presley to make him stop bothering you? </em>Answers:&nbsp;<strong>Yes </strong>/ No</p></blockquote><p>I don't know about you but I would prefer my ML model to resolve their problems without resorting to punching ghosts &#128123;.</p><p><strong>Cryptic crossword puzzles</strong> Solving cryptic crossword puzzles is a task that has attracted recent interest in the form of <a href="https://aclanthology.org/2021.emnlp-main.344/">two</a> <a href="https://openreview.net/forum?id=136ihvjd0sJ">datasets</a> and an associated <a href="https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/cryptonite">BIG-bench task</a>. Crossword AIs have <a href="https://www.wired.com/story/crossword-ai-humans-way-with-words/">recently surpassed humans in a tournament</a> but cryptic clues are still very challenging for current models as these require both an understanding of semantics as well as wordplay (see below for an example from <a href="https://aclanthology.org/2021.emnlp-main.344/">Cryptonite</a>) that requires correctly identifying and resolving an anagram.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://s3.amazonaws.com/revue/items/images/013/838/150/original/cryptic_clue_example.png?1643539730" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://s3.amazonaws.com/revue/items/images/013/838/150/original/cryptic_clue_example.png?1643539730 424w, https://s3.amazonaws.com/revue/items/images/013/838/150/original/cryptic_clue_example.png?1643539730 848w, https://s3.amazonaws.com/revue/items/images/013/838/150/original/cryptic_clue_example.png?1643539730 1272w, https://s3.amazonaws.com/revue/items/images/013/838/150/original/cryptic_clue_example.png?1643539730 1456w" sizes="100vw"><img src="https://s3.amazonaws.com/revue/items/images/013/838/150/original/cryptic_clue_example.png?1643539730" data-attrs="{&quot;src&quot;:&quot;https://s3.amazonaws.com/revue/items/images/013/838/150/original/cryptic_clue_example.png?1643539730&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://s3.amazonaws.com/revue/items/images/013/838/150/original/cryptic_clue_example.png?1643539730 424w, https://s3.amazonaws.com/revue/items/images/013/838/150/original/cryptic_clue_example.png?1643539730 848w, https://s3.amazonaws.com/revue/items/images/013/838/150/original/cryptic_clue_example.png?1643539730 1272w, https://s3.amazonaws.com/revue/items/images/013/838/150/original/cryptic_clue_example.png?1643539730 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Reconstructing ancient texts</strong> The task of masked language modeling, filling in missing tokens in a text, lends itself directly to predicting missing tokens in the transliterated texts of ancient Akkadian clay tablets (<a 
href="https://aclanthology.org/2021.emnlp-main.384/">Lazar et al., 2021</a>). Such a setting is arguably more interesting than language modeling on the Penn Treebank&#8212;and trained models are practically useful by assisting experts in transcribing texts in extinct languages.</p><p><strong>Decontextualization</strong> is a new NLP task that requires rewriting an in-context sentence to be interpretable out of context (<a href="https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00377/100685/Decontextualization-Making-Sentences-Stand-Alone">Choi et al., 2021</a>). This means dealing with various phenomena such as resolving coreferences and anaphora, adding relevant modifiers or necessary background information. Decontextualization is useful, for instance, in the context of question answering: instead of providing a sentence answer, which may be difficult to understand without the surrounding context, models can produce a sentence that stands on its own. See below for an example of how decontextualization looks like in practice.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://s3.amazonaws.com/revue/items/images/013/839/326/original/decontextualization_example.png?1643546811" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://s3.amazonaws.com/revue/items/images/013/839/326/original/decontextualization_example.png?1643546811 424w, https://s3.amazonaws.com/revue/items/images/013/839/326/original/decontextualization_example.png?1643546811 848w, https://s3.amazonaws.com/revue/items/images/013/839/326/original/decontextualization_example.png?1643546811 1272w, https://s3.amazonaws.com/revue/items/images/013/839/326/original/decontextualization_example.png?1643546811 1456w" sizes="100vw"><img src="https://s3.amazonaws.com/revue/items/images/013/839/326/original/decontextualization_example.png?1643546811" data-attrs="{&quot;src&quot;:&quot;https://s3.amazonaws.com/revue/items/images/013/839/326/original/decontextualization_example.png?1643546811&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://s3.amazonaws.com/revue/items/images/013/839/326/original/decontextualization_example.png?1643546811 424w, https://s3.amazonaws.com/revue/items/images/013/839/326/original/decontextualization_example.png?1643546811 848w, https://s3.amazonaws.com/revue/items/images/013/839/326/original/decontextualization_example.png?1643546811 1272w, https://s3.amazonaws.com/revue/items/images/013/839/326/original/decontextualization_example.png?1643546811 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">An example decontextualization. The sentence to decontextualize is in gray.</figcaption></figure></div><p><strong>Text-based NP enrichment</strong> is a new information extraction task that focuses on extracting <em>all</em> relations (that are mediated by prepositions) between noun phrases in a text. 
NP enrichment unifies and complements many existing entity-related tasks such as relation extraction, semantic role labeling, entity linking, coreference resolution, etc. You can see how the annotation for this task looks like in the example below (<a href="https://arxiv.org/abs/2109.12085">Elazar et al., 2021</a>).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://s3.amazonaws.com/revue/items/images/013/839/532/original/np_enrichment.png?1643548163" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://s3.amazonaws.com/revue/items/images/013/839/532/original/np_enrichment.png?1643548163 424w, https://s3.amazonaws.com/revue/items/images/013/839/532/original/np_enrichment.png?1643548163 848w, https://s3.amazonaws.com/revue/items/images/013/839/532/original/np_enrichment.png?1643548163 1272w, https://s3.amazonaws.com/revue/items/images/013/839/532/original/np_enrichment.png?1643548163 1456w" sizes="100vw"><img src="https://s3.amazonaws.com/revue/items/images/013/839/532/original/np_enrichment.png?1643548163" data-attrs="{&quot;src&quot;:&quot;https://s3.amazonaws.com/revue/items/images/013/839/532/original/np_enrichment.png?1643548163&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://s3.amazonaws.com/revue/items/images/013/839/532/original/np_enrichment.png?1643548163 424w, https://s3.amazonaws.com/revue/items/images/013/839/532/original/np_enrichment.png?1643548163 848w, https://s3.amazonaws.com/revue/items/images/013/839/532/original/np_enrichment.png?1643548163 1272w, https://s3.amazonaws.com/revue/items/images/013/839/532/original/np_enrichment.png?1643548163 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Preposition-mediated relations between noun phrases in a text.</figcaption></figure></div><p>I am particularly excited by the newer tasks that explicitly go beyond core NLP tasks such as coreference resolution, which can be relatively narrow. Coreference resolution, for instance, is still useful to probe a model's reasoning abilities, e.g., as part of the <a href="https://en.wikipedia.org/wiki/Winograd_schema_challenge">Winograd schema</a> and later instantiations such as <a href="https://aclanthology.org/N18-2002/">Winogender</a> for gender bias and <a href="https://arxiv.org/abs/1907.10641">WinoGrande</a> for commonsense reasoning. However, as models become more powerful, we can apply them to a broader, more general set of problems, which may also be more practically useful.</p><p>I am constantly amazed by the emerging capabilities of models in ML and NLP and the new settings where they are applied and excited for the new things that we will be able to do this year.</p><h2><strong>Geometric &amp; Graph ML in 2021 &#128208;</strong></h2><p>Graph machine learning is one of the hottest emerging areas in ML. 
Graph ML methods are useful in a variety of domains, from modelling network data to molecules, interactions in physics, relations between entities, mathematical graphs, etc. Michael Bronstein and Peter Veli&#269;kovi&#263; interviewed experts in the area on their impressions of 2021 and predictions for 2022. <a href="https://towardsdatascience.com/predictions-and-hopes-for-geometric-graph-ml-in-2022-aa3b8b79f5cc">The article</a> is a great read for anyone who wants to get up to speed in this area. For a quick overview, you can check out <a href="https://towardsdatascience.com/predictions-and-hopes-for-geometric-graph-ml-in-2022-aa3b8b79f5cc#3eff">their take-home messages</a> highlighting, among others, the <strong>importance of message-passing</strong>&#8212;networks that update the hidden states of nodes based on information from adjacent nodes&#8212;, <strong>challenges of reasoning and generalisation</strong>, the <strong>combination of Transformers with graph neural networks</strong>, etc.</p><h2>Papers with Code 2021 &#128105;&#8205;&#128187;</h2><p>Papers with Code, one of the best resources for finding results, papers, and code in ML&#8212;which is also recently integrated into the ACL Anthology (see the code and data section at the bottom of a paper, such as <a href="https://aclanthology.org/P18-1031/">this one)</a>&#8212;highlights the top trending papers, libraries, and datasets of 2021. The <a href="https://arxiv.org/abs/2110.06635">most talked about paper</a> proposes a method to synthesize new views for an image from arbitrary camera angles, which captured people's attention with the below demo featuring impressive synthesized camera shots.</p><div id="youtube2-WJRyu1JUtVw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;WJRyu1JUtVw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/WJRyu1JUtVw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2><strong>My 2021 &#128104;&#8205;&#128187;</strong></h2><p>My main threads of research in the past year were <strong>parameter efficiency</strong> (how can we make pre-trained models more efficient?), <strong>cross-lingual generalisation</strong> (how can multilingual models generalise better to under-represented languages?), and <strong>multilingual evaluation </strong>(see <a href="https://scholar.google.de/citations?hl=en&amp;user=8ONXPV8AAAAJ&amp;view_op=list_works&amp;sortby=pubdate">my Google Scholar</a> for the detailed publications). Some of the most fulfilling work was <strong>collaborating with passionate researchers</strong> such as from the <a href="https://www.masakhane.io/">Masakhane community</a> on <strong>building datasets in their own languages</strong>. I'm looking forward to doing more of this in 2022.</p><p>Like many, I had ups and downs. I've had less energy &#9889;&#65039; to do things outside of work, so have been less active on Twitter and only written blog posts and newsletters infrequently. I also regretted not being able to meet people from our community in person.</p><p>Overall, I'm hopeful for the new year and that things will slowly start going back to normal. 
I'm excited to write more in my spare time again and I'm looking forward to seeing many of you in person, at conferences or similar events.</p><h2>Safer Pre-trained Models &#128567;</h2><p>Prior work has found that pre-trained models are biased and can generate discriminatory or even toxic language. Ensuring safe responses is thus an important aspect of the development of such models. Recent models such as <a href="https://arxiv.org/abs/2201.08239">LaMDA</a>, <a href="https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf">InstructGPT</a>, and <a href="https://arxiv.org/abs/2112.11446">Gopher</a> developed by Google, OpenAI, and DeepMind respectively emphasize safety in their model evaluation and training. A common recipe is to <strong>fine-tune pre-trained models on data labeled with safety ratings by human annotators</strong>&#8212;using a reward model + RL or by training a detector and filtering out unsafe responses.</p><p>For <a href="https://arxiv.org/abs/2201.08239">LaMDA</a>, crowdworkers annotate model responses based on different safety criteria. The model is then fine-tuned both to generate dialogue responses as well as to predict the annotated safety labels. This multi-task setting is not only more efficient but also enables sharing information between the tasks. At test time, candidate responses where the model predicts a low safety rating are filtered out. The authors find that this fine-tuning setting significantly improves the safety of generated responses.</p><p>For <a href="https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf">InstructGPT</a>, GPT-3 is first fine-tuned on the demonstrations of annotators following instructions in a supervised setting. In a second step, raters rank multiple outputs of the fine-tuned model, which is used to train a reward model. Finally, the model is fine-tuned based on the output of the reward model using reinforcement learning. In an evaluation, the outputs of InstructGPT are significantly preferred over GPT-3's outputs while InstructGPT replaces GPT-3 in the API.</p><p>For <a href="https://arxiv.org/abs/2112.11446">Gopher</a>, the authors perform an extensive analysis of the toxicity and bias of the model. They find that larger models increase the toxicity of toxic input but do not amplify training data toxicity when unprompted. They also observe that large models are prone to bias against subgroups in a few-shot setting and that larger models are not able to overcome limitations in the coverage of dialects.</p><p>Overall, prior work as well as these recent efforts demonstrate that we cannot just pre-train models and expect them to produce safe or harmless responses. Instead, safety and inclusion need to be key design criteria that are included as part of the development of such models. This requires clearly enumerating and defining potential safety risks, collecting and annotating relevant data as well as explicitly training models to demonstrate safe behaviour. For recent reviews that highlight potential risks associated with language models, have a look <a href="https://ruder.io/ml-highlights-2021/#10bias">here</a>. 
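</p><p>As a rough illustration of the detect-and-filter part of this recipe, the sketch below generates several candidate responses and keeps only those a learned safety detector rates highly, similar in spirit to the filtering described for LaMDA; generate and safety_score are assumed placeholder functions and the threshold is arbitrary:</p><pre><code># Filtering candidate responses with a learned safety detector (sketch).
def safe_response(prompt, generate, safety_score,
                  num_candidates=8, threshold=0.9):
    candidates = [generate(prompt) for _ in range(num_candidates)]
    # Keep only candidates that the detector considers sufficiently safe.
    safe = [c for c in candidates if safety_score(c) >= threshold]
    if not safe:
        return "I'm sorry, I can't help with that."  # fallback response
    return safe[0]  # e.g. the first (or highest-ranked) safe candidate
</code></pre><p>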
I hope to see safety being considered as a design criterion and evaluation dimension in more work going forward.</p><h2>Embeddings: Larger &#8800; better &#127947;&#65039;&#8205;&#9792;&#65039;</h2><p>Nils Reimers <a href="https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9">analyzes embeddings</a> from OpenAI's recently released embeddings endpoint. OpenAI provides embeddings in different sizes, from 1,024&#8211;12,288 dimensions. He evaluates them on three downstream tasks&#8212;text similarity, text search, and code search.</p><p>He finds that the text similarity models perform much worse than state-of-the-art models such as <a href="https://huggingface.co/sentence-transformers/all-mpnet-base-v2">all-mpnet-base-v2</a> and <a href="https://huggingface.co/sentence-transformers/all-roberta-large-v1">all-roberta-large-v1</a>&#8212;<a href="https://arxiv.org/abs/2004.09297">MPNet</a> and <a href="https://arxiv.org/abs/1907.11692">RoBERTa</a> models respectively fine-tuned on 1B sentence pairs. They are also 6 points weaker than extremely small models with just 22M parameters that can run in a browser such as <a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2">all-MiniLM-L6-v2</a>. On text search, they perform competitively but not quite at the level of the state of the art.</p><p>At the same time, due to their high dimensionality, the OpenAI embeddings are much slower than existing embedding models that have up to 768 dimensions and take up much more memory. He highlights that encoding the 21M passages of English Wikipedia in 384-dimensional embeddings requires about 16 GB while using 12,288 dimensions requires 516 GB of memory. Not only does retrieval using high-dimensional embeddings consume much more memory but it is also much slower than using smaller models.</p><p>Retrieval is also important for recent retrieval-augmented models such as <a href="https://arxiv.org/abs/2112.04426">Retro</a>, which retrieve from corpora of up to 2T tokens using frozen BERT representations (1,028 dimensions for BERT-large). Encoding such corpora with 12,288 dimensions would be prohibitive. Text similarity and retrieval-style tasks are one of the few settings these days where more parameters does not give you more bang for your buck; instead, for most realistic applications, low-dimensional performant embeddings are the way to go. Check out Nils' library <a href="https://www.sbert.net/">sentence-transformers</a> as well as the above models for efficient, powerful sentence representations.</p>]]></content:encoded></item><item><title><![CDATA[Pre-training + Massive Multi-tasking, Benchmarking in NLP, EMNLP primer, 🤗 NLP Course, ACL 2021 recap, ]]></title><description><![CDATA[Hi all, First off, some personal news: I've moved from DeepMind to Google Research this week. Because of this move, the past months have been quite busy. In light of this, I decided to pause the newsletter over the last couple of months. I plan to continue with it in a more sustainable manner.I'll be continuing to work on multilingual NLP, with a focus on under-represented languages, particularly those in Sub-Saharan Africa. On this note, if you are thinking of doing research in this area, I can't think of a better thing to do than apply for a funding opportunity via the Lacuna Fund. 
The call this time includes a mentorship program in collaboration with the amazing people from Masakhane.I really appreciate your feedback, so let me know what you&#160;love &#10084;&#65039; and hate &#128148; about this edition. Simply hit reply on the issue.Click here to view the newsletter in your browser.If you were referred by a friend, click&#160;here&#160;to subscribe. If you enjoyed this issue,&#160;give it a&#160;tweet&#160;&#128038;.]]></description><link>https://newsletter.ruder.io/p/pre-training-massive-multi-tasking-benchmarking-in-nlp-emnlp-primer-nlp-course-acl-2021-recap-709680</link><guid isPermaLink="false">https://newsletter.ruder.io/p/pre-training-massive-multi-tasking-benchmarking-in-nlp-emnlp-primer-nlp-course-acl-2021-recap-709680</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Sat, 06 Nov 2021 21:55:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/h_600,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F46797142-67ff-4a9d-8cd0-da41b7e2dcca_1159x549.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi all,</p><p>First off, some personal news: I've moved from DeepMind to Google Research this week. Because of this move, the past months have been quite busy. In light of this, I decided to pause the newsletter over the last couple of months. I plan to continue with it in a more sustainable manner.</p><p>I'll be continuing to work on multilingual NLP, with a focus on under-represented languages, particularly those in Sub-Saharan Africa. On this note, if you are thinking of doing research in this area, I can't think of a better thing to do than <strong>apply for a funding opportunity</strong> via the <a href="https://lacunafund.org/apply/">Lacuna Fund</a>. The call this time includes a mentorship program in collaboration with the amazing people from <a href="https://www.masakhane.io/">Masakhane</a>.</p><p>I really appreciate your feedback, so let me know what you&nbsp;love &#10084;&#65039; and hate &#128148; about this edition. Simply hit reply on the issue.</p><p>Click <strong><a href="https://newsletter.ruder.io/issues/pre-training-massive-multi-tasking-709680/05e59718-2554-4a0c-84d2-4e1572a020a2">here</a></strong> to view the newsletter in your browser.</p><p>If you were referred by a friend, click&nbsp;<a href="http://newsletter.ruder.io/?utm_campaign=NLP%20News&amp;utm_medium=email&amp;utm_source=Revue%20newsletter">here</a>&nbsp;to subscribe. If you enjoyed this issue,&nbsp;give it a&nbsp;<a href="https://twitter.com/intent/tweet?url=http%3A%2F%2Fnewsletter.ruder.io%2Fissues%2Fpre-training-massive-multi-tasking-benchmarking-in-nlp-emnlp-primer-nlp-course-acl-2021-recap-709680%2F05e59718-2554-4a0c-84d2-4e1572a020a2&amp;via=revue&amp;text=Pre-training%20%2B%20Massive%20Multi-tasking%2C%20%20Benchmarking%20in%20NLP%2C%20EMNLP%20primer%2C%20%F0%9F%A4%97%20NLP%20Course%2C%20ACL%202021%20recap%2C%20%20by%20%40seb_ruder&amp;related=revue">tweet</a>&nbsp;&#128038;.</p><div><hr></div><h2>Pre-training + Massive Multi-tasking &#128145;</h2><p>Multi-task learning (MTL), training a model on several tasks at once and sharing information is a general method that is fundamental to training neural networks. <a href="https://www.cs.cornell.edu/~caruana/mlj97.pdf">Rich Caruana's 1997 paper</a> is one of the best introductions to this topic and as relevant today as it was back then. 
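</p><p>In its classic form, this means a shared encoder with one output head and one loss per task, trained on the sum of the task losses. Below is a minimal sketch in PyTorch (the architecture is purely illustrative); it is exactly this kind of hand-engineered, task-specific setup that the text-to-text format discussed further below does away with:</p><pre><code># Classic multi-task learning (sketch): shared encoder, per-task heads,
# and a training loss that sums the individual task losses.
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, hidden_size, num_labels_per_task):
        super().__init__()
        # Shared encoder reused by all tasks.
        self.encoder = nn.GRU(input_size=hidden_size, hidden_size=hidden_size,
                              batch_first=True)
        # One task-specific classification head per task.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, n) for n in num_labels_per_task])

    def forward(self, x, task_id):
        _, h = self.encoder(x)             # shared representation
        return self.heads[task_id](h[-1])  # task-specific prediction

def multitask_loss(model, batches):
    # batches: iterable of (task_id, inputs, labels), one entry per task.
    loss_fn = nn.CrossEntropyLoss()
    return sum(loss_fn(model(x, task_id), y) for task_id, x, y in batches)
</code></pre><p>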
For more recent overviews, you can check out <a href="https://arxiv.org/abs/1706.05098">my survey from 2017</a> or a <a href="https://arxiv.org/abs/2009.09796">survey from 2020</a> that I enjoyed.</p><p>Research in multi-task learning has long shown that models trained on many tasks <strong>learn representations that generalize better to new ones</strong>. A common problem in multi-task learning, however, is minimizing negative transfer, i.e. how to make sure that dissimilar tasks do not hurt each other.</p><p>In recent years, despite much work on alternative training objectives, the NLP community has gravitated towards a single pre-training objective<em> to rule them all</em>, <strong>masked language modelling (MLM)</strong>. Much recent work has focused on ways to adapt and improve it (e.g., <a href="https://openreview.net/forum?id=3Aoft6NWFej">Levine et al., 2021</a>). Even the next-sentence-prediction objective used in BERT has slowly been phased out (<a href="https://aclanthology.org/2020.emnlp-main.403/">Aroca-Ouellette &amp; Rudzicz, 2020</a>).</p><p>Recently, there has been a flurry of papers showing not only that multi-task learning helps pre-trained models, but that <strong>gains are larger when more tasks are used</strong>. Such massive multi-task learning settings cover up to around 100 tasks, going beyond earlier work that covered around 50 tasks (<a href="https://arxiv.org/abs/2101.11038">Aghajanyan et al., 2021</a>).</p><p>A key reason for this convergence of papers is that <strong>multi-task learning is much easier with recent models</strong>, even across many tasks. This is because many recent models such as T5 and GPT-3 <strong>use a text-to-text format</strong>. Gone are the days of hand-engineered, task-specific loss functions for multi-task learning. Instead, each task only needs to be expressed in a suitable text-to-text format, and models will be able to learn from it without any changes to the underlying model.</p><p>The newly proposed approaches differ in terms of <em>how</em> and <em>when</em> multi-task learning is applied. One choice is <strong>fine-tuning an existing pre-trained model</strong> on a collection of multiple tasks, i.e. <a href="https://ruder.io/recent-advances-lm-fine-tuning/#behavioural-fine-tuning">behavioural fine-tuning</a>. This is done by T0 (<a href="https://arxiv.org/abs/2110.08207">Sanh et al., 2021</a>), one of the first outcomes of the <a href="https://bigscience.huggingface.co/">BigScience workshop</a>, which uses T5, and by FLAN (<a href="https://arxiv.org/abs/2109.01652">Wei et al., 2021</a>), which uses a GPT-3-like pre-trained model. Both papers describe a unified template and instruction format into which they convert existing datasets. BigScience open-sources their collection of prompts <a href="https://github.com/bigscience-workshop/promptsource">here</a>. Both papers <strong>report large improvements in terms of zero-shot and few-shot performance</strong> compared to state-of-the-art models like T5 and GPT-3.</p><p><a href="https://arxiv.org/abs/2110.15943">Min et al. (2021)</a> propose a different fine-tuning setting that optimizes for <em>in-context learning</em>: instead of fine-tuning a model on examples of a task directly, they provide the concatenation of <em>k+1</em> examples to a model as input <em>x_1, y_1, ..., x_k, y_k, x_{k+1}</em> and train the model to predict the label of the <em>k+1</em>-th example, <em>y_{k+1}</em>. 
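Concretely, a single such training instance might be assembled along the following lines (the helper function, the example texts, and the verbalizer are illustrative, not taken from the paper):</p><pre><code>def build_in_context_instance(demos, query_text, query_label, verbalizer):
    # demos: list of (text, label) pairs playing the role of x_1, y_1, ..., x_k, y_k
    blocks = [f"{text}\n{verbalizer[label]}" for text, label in demos]
    blocks.append(query_text)            # x_{k+1}
    source = "\n\n".join(blocks)         # the concatenated model input
    target = verbalizer[query_label]     # y_{k+1}, the label the model is trained to predict
    return source, target

source, target = build_in_context_instance(
    demos=[("A masterpiece of a film.", 1), ("Two hours I will never get back.", 0)],
    query_text="Gripping from start to finish.",
    query_label=1,
    verbalizer={0: "negative", 1: "positive"},
)
print(target)  # -> "positive"</code></pre><p>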
They similarly report improvements in zero-shot transfer.</p><p>In contrast to the previous approaches, ExT5 (<a href="https://openreview.net/pdf?id=Vzh1BFUCiIX">Anonymous et al., 2021</a>) <em>pre-trains</em> a model on a large collection of tasks. They observe that using multiple tasks during pre-training is better than during fine-tuning and that <strong>multi-task pre-training combined with MLM is significantly more sample-efficient</strong> than just using MLM (see below).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Knjt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F46797142-67ff-4a9d-8cd0-da41b7e2dcca_1159x549.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Knjt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F46797142-67ff-4a9d-8cd0-da41b7e2dcca_1159x549.png 424w, https://substackcdn.com/image/fetch/$s_!Knjt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F46797142-67ff-4a9d-8cd0-da41b7e2dcca_1159x549.png 848w, https://substackcdn.com/image/fetch/$s_!Knjt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F46797142-67ff-4a9d-8cd0-da41b7e2dcca_1159x549.png 1272w, https://substackcdn.com/image/fetch/$s_!Knjt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F46797142-67ff-4a9d-8cd0-da41b7e2dcca_1159x549.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Knjt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F46797142-67ff-4a9d-8cd0-da41b7e2dcca_1159x549.png" width="1159" height="549" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/46797142-67ff-4a9d-8cd0-da41b7e2dcca_1159x549.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:549,&quot;width&quot;:1159,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Knjt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F46797142-67ff-4a9d-8cd0-da41b7e2dcca_1159x549.png 424w, https://substackcdn.com/image/fetch/$s_!Knjt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F46797142-67ff-4a9d-8cd0-da41b7e2dcca_1159x549.png 848w, 
https://substackcdn.com/image/fetch/$s_!Knjt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F46797142-67ff-4a9d-8cd0-da41b7e2dcca_1159x549.png 1272w, https://substackcdn.com/image/fetch/$s_!Knjt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F46797142-67ff-4a9d-8cd0-da41b7e2dcca_1159x549.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">SuperGLUE score of ExT5-LARGE vs T5-LARGE as a function of number of pre-training steps</figcaption></figure></div><p>On the whole, these papers highlight the <strong>benefit of combining </strong><em><strong>self-supervised</strong></em><strong> pre-training with </strong><em><strong>supervised</strong></em><strong> multi-task learning</strong>. While multi-task fine-tuned models were always somewhat inferior to single-task models on small task collections such as GLUE&#8212;with a few exceptions (<a href="https://aclanthology.org/P19-1441/">Liu et al., 2019</a>; <a href="https://aclanthology.org/P19-1595/">Clark et al., 2019</a>)&#8212;<strong>multi-task models may soon hold state-of-the-art results on many benchmarks</strong>. Given the availability and open-source nature of datasets in a unified format, we can imagine a <strong>virtuous cycle</strong> where newly created high-quality datasets are used to train more powerful models on increasingly diverse task collections, which could then be used in-the-loop to create more challenging datasets.</p><p>In light of the increasingly multi-task nature of such models, <strong>what then does it mean to do zero-shot learning</strong>? In current training setups, datasets from certain tasks such as NLI are excluded from training in order to ensure a fair zero-shot scenario at test time. As open-source multi-task models trained on many existing tasks become more common, it will be increasingly difficult to guarantee a setting where a model has not seen examples of a similar task. 
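Today, that guarantee is typically approximated by holding out entire task clusters from the training mixture rather than individual datasets; a rough sketch of the bookkeeping (the cluster and dataset names are illustrative, not an exact T0 or FLAN mixture):</p><pre><code># Group datasets into task clusters; hold out whole clusters for "zero-shot" evaluation.
TASK_CLUSTERS = {
    "sentiment": ["imdb", "yelp_polarity"],
    "summarization": ["xsum", "cnn_dailymail"],
    "nli": ["anli", "rte", "cb"],
}
HELD_OUT_CLUSTERS = {"nli"}

train_mixture = [dataset
                 for cluster, datasets in TASK_CLUSTERS.items()
                 if cluster not in HELD_OUT_CLUSTERS
                 for dataset in datasets]
zero_shot_eval = [dataset
                  for cluster in HELD_OUT_CLUSTERS
                  for dataset in TASK_CLUSTERS[cluster]]</code></pre><p>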
In this context, few-shot learning or the full supervised setting may become the preferred evaluation paradigms.</p><h2>Benchmarking in NLP &#129351;</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!UrXv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F89ad1f90-3cb6-41b6-b307-c58c528b8c49_1724x1232.png" width="1456" height="1040" alt=""><figcaption class="image-caption">&#8212;Most ML and NLP models, probably</figcaption></figure></div><p>Many talks and papers at ACL 2021 made reference to the current state of NLP benchmarking, which has seen existing benchmarks largely outpaced by rapidly improving pre-trained models, in spite of such models still being far away from true human-level natural language understanding.</p><p>My favourite resources on this topic from the conference were:</p><ul><li><p>Chris Potts's keynote on <a href="https://www.youtube.com/watch?v=t_A36DDcG_0">Reliable characterizations of NLP systems as a social responsibility</a>.</p></li><li><p><a href="https://www.youtube.com/watch?v=FXCSWvIsdEE">Rada Mihalcea's presidential address</a>, where she emphasises evaluation beyond accuracy.</p></li><li><p>Samuel Bowman and George Dahl's position paper asking "<a href="https://aclanthology.org/2021.naacl-main.385.pdf">What Will it Take to Fix Benchmarking in Natural Language Understanding?</a>"</p></li></ul><p>I've also written a <a href="https://ruder.io/nlp-benchmarking/">longer blog post</a> that provides a broader overview of different perspectives, challenges, and potential solutions to improve benchmarking in NLP.</p><h2>EMNLP 2021 primer &#127965;</h2><p>EMNLP papers are now available in the <a href="https://aclanthology.org/events/emnlp-2021/">ACL anthology</a>. I'll be attending the conference virtually. I'm particularly looking forward to the virtual poster sessions (as these feel closest to an in-person conference experience).</p><p>In terms of the conference program, I was particularly excited to see <strong>Visually Grounded Reasoning across Languages and Cultures</strong> (<a href="https://aclanthology.org/2021.emnlp-main.818/">Liu et al., 2021</a>) receive the Best Paper Award. 
The paper highlights biases in the concepts and images in ImageNet and proposes a new multilingual dataset for Multicultural Reasoning over Vision and Language (<a href="https://marvl-challenge.github.io/">MaRVL</a>; see below for examples). We need more work that creates such high-quality datasets in a culturally diverse setting.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!U4Kj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F1353c808-15ed-4711-9228-7dd99c1136b5_915x1077.png" width="915" height="1077" alt=""><figcaption class="image-caption">Examples from MaRVL in Tamil (a) and Swahili (b).</figcaption></figure></div><p>If this is your first time at a conference, I would recommend prioritizing meeting and connecting with people. Approach others and introduce yourself (whether in person or at the virtual GatherTown). Go to posters, both those that are well attended and those where only the presenter is present; the latter often make for a more insightful and stimulating discussion. Above all else, have fun (we all need it after the last two years). 
If you are presenting something at the conference or just want to say hi, feel free to send me a message.</p><p>You can also find me doing the following things during the conference:</p><ul><li><p>November 9: presenting a poster (<a href="https://underline.io/events/192/posters?eventSessionId=8269">poster session link</a>) and talk (<a href="https://underline.io/events/192/sessions?eventSessionId=7831">oral session link</a>) for <strong>XTREME-R</strong> (<a href="https://aclanthology.org/2021.emnlp-main.802/">paper link</a>)</p></li><li><p>November 10: presenting a keynote on multilingual evaluation at the <strong>Eval4NLP workshop</strong> (<a href="https://underline.io/events/192/sessions?eventSessionId=7851">conference link</a>, <a href="https://eval4nlp.github.io/">external link</a>)</p></li><li><p>November 11: co-presenting a hybrid tutorial on <strong>Multi-Domain Multilingual Question Answering</strong> (<a href="https://underline.io/events/192/sessions?eventSessionId=7844">conference session link</a>; <a href="https://docs.google.com/presentation/d/1cML2FBrevPF0BtQu5QH_BL3dUbfn0egYyhTbOgwqlbA/edit?usp=sharing">slides link</a>) as well as co-hosting <strong>The 1st Workshop on Multi-lingual Representation Learning</strong> (<a href="https://underline.io/events/192/sessions?eventSessionId=7861">conference link</a>; <a href="https://sites.google.com/corp/view/mrl-2021/home">external link</a>)</p></li></ul><p>Also check out the following EMNLP 2021 together with collaborators:</p><ul><li><p><strong><a href="https://aclanthology.org/2021.emnlp-main.699/">IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation</a></strong>: The first benchmark (and new models) for NLG in Indonesian.</p></li><li><p><strong><a href="https://aclanthology.org/2021.emnlp-main.800/">UNKs Everywhere: Adapting Multilingual Language Models to New Scripts</a></strong>: New methods + a system evaluation to adapt pre-trained models to unseen scripts.</p></li><li><p><strong><a href="https://aclanthology.org/2021.findings-emnlp.63/">Efficient Test Time Adapter Ensembling for Low-resource Language Varieties</a></strong> (Findings): A simple ensembling-based method to use existing language adapters for adapting to unseen languages and language varieties.</p></li><li><p><strong><a href="https://aclanthology.org/2021.findings-emnlp.410/">MAD-G: Multilingual Adapter Generation for Efficient Cross-Lingual Transfer</a></strong> (Findings): A method based on hyper-networks to generate language adapters for cross-lingual transfer.</p></li></ul><h2>Upcoming NLP course by Hugging Face &#129303;</h2><p>One thing that is great about doing NLP these days is that there are a lot of resources to help you get started as well as a lot of tooling and infrastructure to work easily with state-of-the-art models.</p><p>A nice resource for learning about using NLP is the <a href="https://huggingface.co/course/chapter0?fw=pt">NLP course by Hugging Face</a>. Four chapters are currently available, with more to be released by November 16. 
The new chapters discuss how to create datasets and tokenizers as well as how to deal with many standard NLP settings, such as fine-tuning a model for translation, summarization, or question answering (see below).</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!2853!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fda85eaa9-036e-4076-88cc-4df81b5a0e18_1405x908.png" width="1405" height="908" alt=""><figcaption class="image-caption">The Question answering chapter of the NLP course</figcaption></figure></div><h2>ACL 2021 recap &#127963;</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!5zkB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F2f6b76cf-d0c4-4cb8-95cc-63a53cd68522_1232x618.png" width="1232" height="618" alt=""><figcaption class="image-caption">"Hot topics" in ACL 2021 papers compared to ACL 2018</figcaption></figure></div><p>This should come as no surprise but it's still interesting to see that <strong>among the 14 "hot" topics of 2021 (see above) were 5 pre-trained models</strong> (BERT, RoBERTa, BART, GPT-2, XLM-R) and 1 general "Language models" topic. These models are essentially all variants of the same Transformer architecture. 
This serves as a useful reminder that the community may be overfitting to a particular setting and that it may be worthwhile to look beyond the standard Transformer model (see <a href="https://newsletter.ruder.io/issues/github-copilot-the-perceiver-beyond-the-transformer-data-augmentation-nl-augmenter-research-communication-527358">my recent newsletter</a> for some inspiration).</p><p>It seems everyone is exhausted by virtual conferences at this point as I wasn't able to find any write-ups of people's highlights of ACL 2021, in contrast to past conferences (I also didn't manage to finish mine).</p><p>If you're attending EMNLP 2021, I hope you'll share your highlights, experience, and insights with the community.</p>]]></content:encoded></item><item><title><![CDATA[ICML round-up, Open collaboration, CLIP art, Internet augmentation, New GLUE-style benchmarks]]></title><description><![CDATA[Hi all,This newsletter covers some of my favourite papers from ICML 2021, a discussion of open collaboration, art generated by the CLIP model, how to leverage information from the Internet in your models, and new benchmarks in the style of GLUE.FYI, I'll be at ACL 2021 virtually this week. Ping me on Gathertown or send me an email if you would like to chat. I'm co-author on two papers on parameter-efficient multi-task learning and monolingual vs multilingual models, which will be presented by the first authors on 12 pm, August 2 and 11 am, August 3 respectively.I really appreciate your feedback, so let me know what you&#160;love &#10084;&#65039; and hate &#128148; about this edition. Simply hit reply on the issue.Click here to view the newsletter in your browser.If you were referred by a friend, click&#160;here&#160;to subscribe. If you enjoyed this issue,&#160;give it a&#160;tweet&#160;&#128038;.]]></description><link>https://newsletter.ruder.io/p/icml-round-up-open-collaboration-clip-art-internet-augmentation-new-glue-style-benchmarks-692828</link><guid isPermaLink="false">https://newsletter.ruder.io/p/icml-round-up-open-collaboration-clip-art-internet-augmentation-new-glue-style-benchmarks-692828</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Mon, 02 Aug 2021 08:30:01 GMT</pubDate><enclosure url="https://s3.amazonaws.com/revue/items/images/010/341/874/original/open_collab.png?1627838359" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi all,</p><p>This newsletter covers some of my favourite papers from ICML 2021, a discussion of open collaboration, art generated by the CLIP model, how to leverage information from the Internet in your models, and new benchmarks in the style of GLUE.</p><p>FYI, I'll be at <a href="https://2021.aclweb.org/">ACL 2021</a> virtually this week. Ping me on Gathertown or send me an email if you would like to chat. I'm co-author on two papers on <a href="https://aclanthology.org/2021.acl-long.47/?utm_campaign=NLP%20News&amp;utm_medium=email&amp;utm_source=Revue%20newsletter">parameter-efficient multi-task learning</a> and <a href="https://aclanthology.org/2021.acl-long.243/?utm_campaign=NLP%20News&amp;utm_medium=email&amp;utm_source=Revue%20newsletter">monolingual vs multilingual models</a>, which will be presented by the first authors on 12 pm, August 2 and 11 am, August 3 respectively.</p><p>I really appreciate your feedback, so let me know what you&nbsp;love &#10084;&#65039; and hate &#128148; about this edition. 
Simply hit reply on the issue.</p><p><em>Click <strong><a href="https://newsletter.ruder.io/issues/icml-round-up-open-collaboration-clip-art-internet-augmentation-new-glue-style-benchmarks-692828/ffafaca9-360c-41b2-a31e-7bf8eabc5619">here</a></strong> to view the newsletter in your browser.</em></p><p>If you were referred by a friend, click&nbsp;<a href="http://newsletter.ruder.io/?utm_campaign=NLP%20News&amp;utm_medium=email&amp;utm_source=Revue%20newsletter">here</a>&nbsp;to subscribe. If you enjoyed this issue,&nbsp;give it a&nbsp;<a href="https://twitter.com/intent/tweet?url=http%3A%2F%2Fnewsletter.ruder.io%2Fissues%2Ficml-round-up-open-collaboration-clip-art-internet-augmentation-new-glue-style-benchmarks-692828%2Fffafaca9-360c-41b2-a31e-7bf8eabc5619&amp;via=revue&amp;text=ICML%20round-up%2C%20Open%20collaboration%2C%20CLIP%20art%2C%20Internet%20augmentation%2C%20New%20GLUE-style%20benchmarks%20by%20%40seb_ruder&amp;related=revue">tweet</a>&nbsp;&#128038;.</p><div><hr></div><h2>ICML round-up &#128209;</h2><p><strong><a href="http://proceedings.mlr.press/v139/lin21b/lin21b.pdf">Straight to the Gradient: Learning to Use Novel Tokens for Neural Text Generation</a></strong> Neural generative models, despite their popularity, are known to suffer from some deficiencies, such as a tendency to generate frequent tokens. Popular methods to address this, such as top-<em>k</em> sampling <a href="https://aclanthology.org/P18-1082.pdf">(Fan et al., 2018)</a> or nucleus sampling (<a href="https://arxiv.org/abs/1904.09751">Holtzman et al., 2020</a>), focus on <em>decoding</em>. This paper proposes ScaleGrad, which re-scales the token probabilities during training to encourage the model to focus on novel tokens, i.e. ones that have not been generated before. ScaleGrad seems to improve performance on some open-ended as well as directed generation tasks. Of course, just focusing on novel tokens may be too simplistic. <strong>Overall, modifying the loss function with regard to particular sets of tokens may be a useful way to inject additional inductive biases into a model, such as which entities or attributes to focus on.</strong></p><p><strong><a href="http://proceedings.mlr.press/v139/price21a/price21a.pdf">Dense for the Price of Sparse: Improved Performance of Sparsely Initialized Networks via a Subspace Offset</a></strong> This paper nicely highlights the <em>competing</em> <em>priorities</em> when training sparse networks: The current trend of identifying 'lottery tickets', i.e. sparse subnetworks that can be trained on their own from scratch and which perform similarly to full networks, is motivated by computational efficiency. However, such methods require computing a score for all parameters in the full model to determine whether they should be pruned. <strong>It is thus still necessary to store and compute with the full model on device.</strong> To reduce on-device storage costs, the authors propose expressing a network layer as the sum of a sparse matrix and a fast transform. Another thing that I found interesting was the notion of matching vs extreme sparsity (<a href="https://openreview.net/forum?id=Ig-VyQc-MLK">Frankle et al., 2021</a>): the former is the sparsity setting where pruned models perform comparably to the full model, while in the latter setting the performance of pruned models deteriorates. 
<strong>Strong pruned models should aim to strike a Pareto-optimal sparsity&#8211;performance trade-off.</strong></p><p><strong><a href="http://proceedings.mlr.press/v139/d-ascoli21a/d-ascoli21a.pdf">ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases</a></strong> This paper is a nice example of how different inductive biases can be combined to make a model more expressive depending on how much data is available. Vision Transformers using self-attention have outperformed CNNs when trained on large datasets as self-attention is more expressive. But CNNs still perform better when trained on smaller datasets due to their inductive bias. The authors propose a slightly modified self-attention layer, which is initialized to act as a convolutional layer. <strong>This way, the model retains the useful inductive bias early during training and can become more expressive if necessary later on.</strong> Such a soft inductive bias may also be of interest for efficient language Transformers, which could limit the attention span early in training (<a href="https://aclanthology.org/P19-1032.pdf">Sukhbaatar et al., 2019</a>).</p><p><strong><a href="http://proceedings.mlr.press/v139/zhao21c/zhao21c.pdf">Calibrate Before Use: Improving Few-Shot Performance of Language Models</a></strong> This paper highlights the instability of prompt-based learning. In particular, prompt-based models are sensitive to the format of the prompt, training examples, and order of examples. A key problem is that the model favours certain answers over others, e.g. answers that are frequent in the prompt, appear towards the end of the prompt, or are frequent in its pre-training data. To address this, the authors propose to first estimate the model's bias using a content-free prompt. The model's predictions can then be recalibrated so that the class scores for the content-free prompt are uniform. <strong>While calibration doesn't alleviate the need for prompt engineering, it reduces the variance when dealing with different prompts.</strong> Given the current popularity of prompt-based methods, this may make working with prompt-based models a lot easier.</p><p><strong><a href="http://proceedings.mlr.press/v139/davis21a/davis21a.pdf">Catformer: Designing Stable Transformers via Sensitivity Analysis</a></strong> This paper introduces the concept of the sensitivity of an architecture, which measures how an architecture's output varies when its parameters are randomly perturbed. The authors also relate this measure to how difficult to train certain architectures are. They then propose a simple modification to the Transformer, which replaces residual connections with concatenation and is more stable to train on a set of reinforcement learning tasks. So far, training difficulty was often discussed anecdotally. <strong>A more principled measure of training difficulty such as sensitivity is a step towards designing not only models that are more powerful but that are also easier to use in real-world applications.</strong></p><p><strong><a href="http://proceedings.mlr.press/v139/koh21a/koh21a.pdf">WILDS: A Benchmark of in-the-Wild Distribution Shifts</a></strong> This is a very diverse benchmark to test how well ML methods generalize across distribution shifts on a wide variety of domains and data modalities. It covers domains as diverse as camera trap photos, cell images, molecular graphs, online comments, and code. 
<strong>If you are working on robust, modality-agnostic ML methods, then this is the dataset to evaluate on.</strong></p><h2>Open collaboration &#129309;</h2><p>At ICML, I attended the <a href="https://mlcollective.org/icml-2021-open-collab-social/">Social on Open Collaboration in ML Research</a>, hosted by <a href="https://mlcollective.org/">ML Collective</a>, among others. During the event, people shared a diverse range of external collaboration experiences, many of them relating to work done as part of independent research collectives.</p><p>Connor Leahy talked about <a href="https://www.eleuther.ai/about/">EleutherAI</a>, a grassroots collective of researchers who developed not only <a href="https://github.com/EleutherAI/gpt-neo">GPT-Neo</a>, an open-source LM in the style of GPT-3 but also worked on BioML research and ML-generated art (read more about the art below)&#8212;all of this in the past year. <a href="https://blog.eleuther.ai/year-one/">This blog post</a> provides a great overview of their progress so far. To join or contribute, you can head over to their <a href="https://www.eleuther.ai/get-involved/">Discord</a>.</p><p>Edward Elson Kosasih talked about his research as part of <a href="https://mlcollective.org/">ML Collective</a> (MLC), a nonprofit organization dedicated to making ML research accessible. He led a team that <a href="https://arxiv.org/abs/2106.15529">worked on graph neural networks</a> as part of the <a href="https://ogb.stanford.edu/kddcup2021/">Open Graph Benchmark Large Scale Challenge</a>. In order to get involved with MLC, you can <a href="https://mlcollective.org/community/">join their Discord</a>.</p><p>Matthias Gall&#233; discussed the <a href="https://bigscience.huggingface.co/en/#!index.md">BigScience project</a>, also known as The Summer of Language Models 21, a one-year long research workshop on very large language models. The project aims to create and share a large multilingual dataset and to train a very large language model. A diverse set of working groups are dedicated to different parts of the data and model creation process, from data sourcing to prompt engineering, dealing with metadata, and retrieval. To get up to speed on the progress so far, you can watch updates from the first event from July 30, 2021 <a href="https://bigscience.huggingface.co/en/#!pages/events.md">here</a>. To join the project, fill out the form <a href="https://bigscience.huggingface.co/en/#!pages/contact.md">here</a>.</p><p>Salomon Kabongo talked about the work of <a href="https://www.masakhane.io/">Masakhane</a>. Masakhane is a grassroots organisation that aims to strengthen NLP research in African languages. So far, they have released models and datasets for diverse tasks such as machine translation, named entity recognition, and others in many African languages. To get involved, <a href="https://www.masakhane.io/contact">join the Google group and Slack channel</a>.</p><p>On the whole, my impression is that ML and NLP have become much more accessible, in part thanks to research collaborations such as the above, which are open to anyone as long as you're excited and motivated to contribute. Other collaboration opportunities are the <a href="https://forums.fast.ai/">fast.ai</a> or the <a href="https://discuss.huggingface.co/">HuggingFace</a> communities. 
If you are looking to work in ML or NLP and need collaborators and guidance, I encourage you to join one of the above collaborations.</p><p>For conducting academic collaborations, I shared some lessons of my first external collaboration (and first long paper during my PhD) with Barbara Plank (see below).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://s3.amazonaws.com/revue/items/images/010/341/874/original/open_collab.png?1627838359" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://s3.amazonaws.com/revue/items/images/010/341/874/original/open_collab.png?1627838359 424w, https://s3.amazonaws.com/revue/items/images/010/341/874/original/open_collab.png?1627838359 848w, https://s3.amazonaws.com/revue/items/images/010/341/874/original/open_collab.png?1627838359 1272w, https://s3.amazonaws.com/revue/items/images/010/341/874/original/open_collab.png?1627838359 1456w" sizes="100vw"><img src="https://s3.amazonaws.com/revue/items/images/010/341/874/original/open_collab.png?1627838359" data-attrs="{&quot;src&quot;:&quot;https://s3.amazonaws.com/revue/items/images/010/341/874/original/open_collab.png?1627838359&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://s3.amazonaws.com/revue/items/images/010/341/874/original/open_collab.png?1627838359 424w, https://s3.amazonaws.com/revue/items/images/010/341/874/original/open_collab.png?1627838359 848w, https://s3.amazonaws.com/revue/items/images/010/341/874/original/open_collab.png?1627838359 1272w, https://s3.amazonaws.com/revue/items/images/010/341/874/original/open_collab.png?1627838359 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">A rough timeline and takeaways of my first external collaboration</figcaption></figure></div><h2>CLIP art &#127912;</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://s3.amazonaws.com/revue/items/images/010/343/521/original/how_clip_generates_art.png?1627847043" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://s3.amazonaws.com/revue/items/images/010/343/521/original/how_clip_generates_art.png?1627847043 424w, https://s3.amazonaws.com/revue/items/images/010/343/521/original/how_clip_generates_art.png?1627847043 848w, https://s3.amazonaws.com/revue/items/images/010/343/521/original/how_clip_generates_art.png?1627847043 1272w, https://s3.amazonaws.com/revue/items/images/010/343/521/original/how_clip_generates_art.png?1627847043 1456w" sizes="100vw"><img src="https://s3.amazonaws.com/revue/items/images/010/343/521/original/how_clip_generates_art.png?1627847043" 
data-attrs="{&quot;src&quot;:&quot;https://s3.amazonaws.com/revue/items/images/010/343/521/original/how_clip_generates_art.png?1627847043&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://s3.amazonaws.com/revue/items/images/010/343/521/original/how_clip_generates_art.png?1627847043 424w, https://s3.amazonaws.com/revue/items/images/010/343/521/original/how_clip_generates_art.png?1627847043 848w, https://s3.amazonaws.com/revue/items/images/010/343/521/original/how_clip_generates_art.png?1627847043 1272w, https://s3.amazonaws.com/revue/items/images/010/343/521/original/how_clip_generates_art.png?1627847043 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://ml.berkeley.edu/blog/posts/clip-art/?utm_campaign=NLP%20News&amp;utm_medium=email&amp;utm_source=Revue%20newsletter">How CLIP Generates Art (Credit: Charlie Snell)</a></figcaption></figure></div><p>CLIP art, not to be confused with the sometimes slightly cheesy <a href="https://en.wikipedia.org/wiki/Clip_art">type of graphic art</a> often used for illustration purposes, relates to art produced using the <a href="https://openai.com/blog/clip/">CLIP model</a> by OpenAI. CLIP was trained with a contrastive objective to match text with corresponding images. As a result, CLIP is very good at judging which caption best reflects an image, which can be used for zero-shot classification on ImageNet. Alternatively, the model also can be used to gauge which image best suits a description. This is how CLIP is used to generate art (see above), by steering the output of a separate generative model through back-propagation until the model generates an image that matches the description as closely as possible, according to CLIP.</p><p><a href="https://ml.berkeley.edu/blog/posts/clip-art/">This article</a> by Charlie Snell does a great job of charting the development of the art scene that has evolved around this method&#8212;and the often dreamy, impressionistic or psychedelic images that it has produced. The cool thing is that CLIP works with any generative model so the possibilities the method presents develop and grow more diverse as generative models become more powerful. 
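The optimization loop at the heart of most of these pieces is remarkably small. Below is a minimal, illustrative sketch that directly optimizes raw pixels against a text prompt using OpenAI's <code>clip</code> package; real systems such as CLIPDraw instead optimize vector strokes or a generator's latent code and add augmentations and proper input normalization, so treat this only as a sketch of the idea.</p><pre><code>import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 for a simple backprop loop
for p in model.parameters():
    p.requires_grad_(False)

text = clip.tokenize(["a beautiful epic wondrous fantasy painting of the ocean"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Optimize the image itself; a generator's parameters or latents would go here instead.
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(300):
    image_feat = model.encode_image(image.clamp(0, 1))
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    loss = -(image_feat * text_feat).sum()  # maximize cosine similarity with the prompt
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()</code></pre><p>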
One of my favourite images is the one below of a "a beautiful epic wondrous fantasy painting of the ocean"&nbsp;generated by <a href="https://twitter.com/RiversHaveWings/status/1410020043178446848">@RiversHaveWings</a> using <a href="https://arxiv.org/abs/2106.14843">CLIPDraw</a> + CLIP.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://s3.amazonaws.com/revue/items/images/010/343/690/original/E5Fls_lUYAEEMKW.jpeg?1627847837" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://s3.amazonaws.com/revue/items/images/010/343/690/original/E5Fls_lUYAEEMKW.jpeg?1627847837 424w, https://s3.amazonaws.com/revue/items/images/010/343/690/original/E5Fls_lUYAEEMKW.jpeg?1627847837 848w, https://s3.amazonaws.com/revue/items/images/010/343/690/original/E5Fls_lUYAEEMKW.jpeg?1627847837 1272w, https://s3.amazonaws.com/revue/items/images/010/343/690/original/E5Fls_lUYAEEMKW.jpeg?1627847837 1456w" sizes="100vw"><img src="https://s3.amazonaws.com/revue/items/images/010/343/690/original/E5Fls_lUYAEEMKW.jpeg?1627847837" data-attrs="{&quot;src&quot;:&quot;https://s3.amazonaws.com/revue/items/images/010/343/690/original/E5Fls_lUYAEEMKW.jpeg?1627847837&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://s3.amazonaws.com/revue/items/images/010/343/690/original/E5Fls_lUYAEEMKW.jpeg?1627847837 424w, https://s3.amazonaws.com/revue/items/images/010/343/690/original/E5Fls_lUYAEEMKW.jpeg?1627847837 848w, https://s3.amazonaws.com/revue/items/images/010/343/690/original/E5Fls_lUYAEEMKW.jpeg?1627847837 1272w, https://s3.amazonaws.com/revue/items/images/010/343/690/original/E5Fls_lUYAEEMKW.jpeg?1627847837 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><a href="https://twitter.com/RiversHaveWings/status/1410020043178446848?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1410020043178446848%7Ctwgr%5E%7Ctwcon%5Es1_&amp;ref_url=https%3A%2F%2Fdisqus.com%2Fembed%2Fcomments%2F%3Fbase%3Ddefaultf%3Dmlab-blogt_u%3Dhttps3A2F2Fml.berkeley.edu2Fblog2Fposts2Fclip-art2Ft_e%3D5Bobject20Object5Dt_d%3DAlien20Dreams3A20An20Emerging20Art20Scenet_t%3D5Bobject20Object5Ds_o%3Ddefaultversion%3D7302391be467f75d298eac65b5cfa2cc&amp;utm_campaign=NLP%20News&amp;utm_medium=email&amp;utm_source=Revue%20newsletter">"a beautiful epic wondrous fantasy painting of the ocean" by @RiversHaveWings</a></figcaption></figure></div><h2>Internet augmentation &#128187;</h2><p>Current large language models are trained on large amounts of unlabelled data, mainly from the web. However, they do not yet leverage everything the Internet has to offer. In other words, there are many types of signals on the web that are currently not used for learning. Barbara Plank has called this <a href="https://bplank.github.io/#FortuitousData">fortuitous data</a>, as such data is often available by accident or good fortune.</p><p>A great example of such fortuitous data is the HTML structure underlying web pages. 
Such structure can both provide a useful learning signal for a model and serve as a scaffold for generating prompts, by letting the model auto-complete the HTML structure of a document. <a href="https://arxiv.org/abs/2107.06955">Aghajanyan et al. (2021)</a> recently proposed the cleverly named HTLM, a large language model trained on HTML structure. They show that the model excels at zero-shot natural language generation using structured HTML-based prompts. In addition, they propose to control the size of the generated output sequence by using size hints, noisy estimates of the length of the generated span inserted right after the MASK token.</p><p>Another recent example of leveraging more of what the web has to offer is an extension of <a href="https://ruder.io/research-highlights-2020/index.html#2-retrieval-augmentation">retrieval augmentation</a> to the Internet. Specifically, rather than learning to retrieve relevant information only from a large corpus of unlabelled text, a model can learn to retrieve from the entire Internet. To make this feasible, <a href="https://arxiv.org/abs/2107.07566">Komeili et al. (2021)</a> learn to generate a search query based on the context of a dialogue. They then condition on the search results to generate a response. The resulting Internet-augmented dialogue model outperforms both the use of retrieval augmentation and no augmentation.</p><p>Other forms of information that have so far been neglected are: a) the information from hyperlinked pages, which could be used for conditioning during training; b) hyperlink patterns, to learn which information to trust; c) multi-modal context on webpages, to ground representations; d) snapshots of webpages over time, to learn re-writing; e) timestamps, for modelling time (<a href="https://arxiv.org/abs/2106.15110">Dhingra et al., 2021</a>); and f) content by the same users across multiple websites, for authorship and style modelling, among others.</p><h2>New GLUE-style benchmarks &#127963;</h2><p>Since the development of models that learn general representations, mainly via self-supervised learning, it has become common to evaluate such models on benchmarks comprising a diverse set of tasks. The most prominent of these are arguably the <a href="https://gluebenchmark.com/">GLUE</a> and <a href="https://super.gluebenchmark.com/">SuperGLUE</a> benchmarks. Following in their footsteps, benchmarks serving a wide array of settings and languages have been proposed.</p><p>I'm particularly excited about two recent additions to this ever-growing evaluation environment. <strong>F</strong>ew-shot <strong>L</strong>anguage <strong>E</strong>valuation across (<strong>X</strong>) many transfer types (FLEX; <a href="https://arxiv.org/abs/2107.07170">Bragg and Cohan et al., 2021</a>) is a benchmark focused on few-shot NLP, something I hoped to <a href="https://ruder.io/requests-for-research/index.html#fewshotlearningfornlp">encourage in 2018</a>. 
It not only covers the standard meta-learning/few-shot learning setup with separate meta-training and meta-test portions but also captures the current zeitgeist by including zero-shot evaluation based on textual descriptions.</p><p>The second benchmark is the <strong>S</strong>peech processing <strong>U</strong>niversal <strong>Per</strong>formance <strong>B</strong>enchmark (SUPERB; <a href="https://arxiv.org/abs/2105.01051">Yang et al., 2021</a>), which aims to do for speech what GLUE has done for NLP, by providing a general platform to evaluate self-supervised speech models on 10 different tasks. The benchmark covers core speech tasks, from modelling content (phonemes, transcription, keywords) and speakers (identification, verification, and diarization) to dealing with semantics (intents, slot filling) and paralinguistic features (emotions). Such a standardization will likely open the door to the development of more powerful self-supervised speech models.</p><p>Another type of benchmark I'm excited about is one that covers many tasks in a given language. I've recently had the chance to contribute to two such benchmarks: <a href="https://openreview.net/forum?id=JH61CD7afTv">LiRo</a> for NLU tasks in Romanian and <a href="https://arxiv.org/abs/2104.08200">IndoNLG</a> for NLG tasks in Indonesian. Facilitating evaluation on a diverse set of tasks in a given language is, in my opinion, one of the best ways to incentivise progress in that language.</p>]]></content:encoded></item><item><title><![CDATA[GitHub Copilot, The Perceiver, Beyond the Transformer, Data augmentation, NL augmenter 🦎 → 🐍, Research communication]]></title><description><![CDATA[Hi all,This newsletter is a bit delayed. I had to skip the last one as I had to take a break after a busy period in the end of May (EMNLP and NeurIPS deadlines). At the same time, there's so much happening that I've found it hard to catch up. Now I'm back, feeling more energized, and updating myself (and you) on what's new.I'll discuss the biggest advances over the last months including GitHub Copilot, the Perceiver, and non-self-attention models.I'll also talk about something that is challenging for me when writing this newsletter: striking the right balance between content that is both timely but also relevant in the long-term. TL;DR: I'm planning to keep newsletters somewhat shorter in the future to have more time for in-depth blog posts.I really appreciate your feedback, so let me know what you&#160;love &#10084;&#65039; and hate &#128148; about this edition. Simply hit reply on the issue.Click here to view this newsletter in your browser.If you were referred by a friend, click&#160;here&#160;to subscribe. If you enjoyed this issue,&#160;give it a&#160;tweet&#160;&#128037;.]]></description><link>https://newsletter.ruder.io/p/github-copilot-the-perceiver-beyond-the-transformer-data-augmentation-nl-augmenter-research-communication-527358</link><guid isPermaLink="false">https://newsletter.ruder.io/p/github-copilot-the-perceiver-beyond-the-transformer-data-augmentation-nl-augmenter-research-communication-527358</guid><dc:creator><![CDATA[Sebastian Ruder]]></dc:creator><pubDate>Mon, 19 Jul 2021 09:00:02 GMT</pubDate><enclosure url="https://s3.amazonaws.com/revue/items/images/010/136/581/original/perceiver.png?1626454011" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi all,</p><p>This newsletter is a bit delayed. 
I had to skip the last one, as I needed a break after a busy period at the end of May (EMNLP and NeurIPS deadlines). At the same time, there's so much happening that I've found it hard to catch up. Now I'm back, feeling more energized, and updating myself (and you) on what's new.</p><p>I'll discuss the biggest advances over the last months including GitHub Copilot, the Perceiver, and non-self-attention models.</p><p>I'll also talk about something that is challenging for me when writing this newsletter: striking the right balance between content that is timely and content that remains relevant in the long term. <strong>TL;DR:</strong> I'm planning to keep newsletters somewhat shorter in the future to have more time for in-depth blog posts.</p><p>I really appreciate your feedback, so let me know what you&nbsp;love &#10084;&#65039; and hate &#128148; about this edition. Simply hit reply on the issue.</p><p><em>Click <strong><a href="https://newsletter.ruder.io/issues/github-copilot-the-perceiver-beyond-the-transformer-data-augmentation-nl-augmenter-research-communication-527358/a35ce65f-a2ee-4ba6-9316-e617f788fbf7">here</a></strong> to view this newsletter in your browser.</em></p><p>If you were referred by a friend, click&nbsp;<a href="http://newsletter.ruder.io/?utm_campaign=NLP%20News&amp;utm_medium=email&amp;utm_source=Revue%20newsletter">here</a>&nbsp;to subscribe. If you enjoyed this issue,&nbsp;give it a&nbsp;<a href="https://twitter.com/intent/tweet?url=http%3A%2F%2Fnewsletter.ruder.io%2Fissues%2Fgithub-copilot-the-perceiver-beyond-the-transformer-data-augmentation-nl-augmenter-research-communication-527358%2Fa35ce65f-a2ee-4ba6-9316-e617f788fbf7&amp;via=revue&amp;text=GitHub%20Copilot%2C%20The%20Perceiver%2C%20Beyond%20the%20Transformer%2C%20Data%20augmentation%2C%20NL%20augmenter%20%F0%9F%A6%8E%20%E2%86%92%20%F0%9F%90%8D%2C%20Research%20communication%20by%20%40seb_ruder&amp;related=revue">tweet</a>&nbsp;&#128037;.</p><div><hr></div><h2>May&#8211;July round-up</h2><h2>OpenAI Codex / GitHub Copilot</h2><p>If you are working with software, then you've probably heard about the release of <a href="https://copilot.github.com/">GitHub Copilot</a>, a coding assistant based on Codex, a GPT language model fine-tuned on code from GitHub (see <a href="https://arxiv.org/abs/2107.03374">the paper</a>). As far as I'm aware, this is one of the first products from a large company in which users directly interact with a large generative language model. Large language models are also used in many other applications such as <a href="https://blog.google/products/search/search-language-understanding-bert/">Google Search</a>, but such applications typically include a wide array of other signals.</p><p>There are a couple of interesting takeaways from the <a href="https://arxiv.org/abs/2107.03374">paper</a>: one is that the model was not trained from scratch; instead, an existing GPT model (up to 12B parameters, so not the largest GPT-3 model; the deployed model may be larger, however) was <a href="https://ruder.io/recent-advances-lm-fine-tuning/#adaptive-fine-tuning">adaptively fine-tuned</a> on code from GitHub. 
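</p><p>The snippet below gives a rough idea of what such adaptive fine-tuning looks like in code; it assumes the Hugging Face transformers library, a small GPT-2 checkpoint, and a toy in-memory corpus, and it says nothing about Codex's actual data pipeline or scale:</p><pre><code># Minimal sketch of adaptively fine-tuning an existing causal LM on code
# (illustrative only; assumes transformers and a toy corpus).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# `code_snippets` stands in for a corpus of source files scraped from GitHub.
code_snippets = ["def add(a, b):\n    return a + b\n"]

model.train()
for snippet in code_snippets:
    batch = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)
    # standard language-modelling objective: predict the next token of the code
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
</code></pre><p>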
In addition, they <a href="https://ruder.io/recent-advances-lm-fine-tuning/#behavioural-fine-tuning">further fine-tune</a> two variants of the model (Codex-S and Codex-G) to generate stand-alone functions and docstrings respectively.</p><p>Given that language models are prone to reproduce inputs (<a href="https://arxiv.org/abs/2012.07805">Raffel et al., 2020</a>) people have already found memorization issues with Copilot such as <a href="https://mobile.twitter.com/kylpeacock/status/1410749018183933952?utm_campaign=NLP%20News&amp;utm_medium=email&amp;utm_source=Revue%20newsletter">copy-pasting a person's contact info</a>. One issue is that the Codex model was trained on all code on GitHub including code with potentially problematic licenses. <a href="https://twitter.com/alexjc/status/1410524416660889607?s=20">GitHub</a> <a href="https://twitter.com/natfriedman/status/1409914420579344385?s=20">claims</a> that code produced by an AI model is "fair use"&#8212;it's controversial whether this is actually the case, given that the model may reproduce passages verbatim.</p><p>Another question is whether Copilot will be able to make a meaningful difference in the workflow of programmers. In a <a href="https://newsletter.ruder.io/issues/eacl-iclr-naacl-papers-round-up-research-reality-checks-ml-on-code-592784">previous newsletter</a>, I discussed current ML on code work. In particular, I highlighted <a href="https://arxiv.org/abs/2101.11149">a study</a>, which found that current models did not improve productivity or code quality when used for in-IDE code generation. A practical limitation of Copilot is that it only considers the code in the current file (rather than in the entire codebase) and can thus only generate relatively self-contained code. So it remains to be seen whether it will provide meaningful benefits beyond the capabilities of existing models.</p><p>GitHub markets Copilot as an "AI pair programmer". <a href="https://en.wikipedia.org/wiki/Pair_programming">Pair programming</a> is essentially a form of dialogue grounded in code. Similar to interacting with a dialogue agent, one requirement for successfully completing a task is a shared foundation of meaning. A task-oriented dialogue agent needs to know about the relevant entities and intents that are necessary for, say, booking a restaurant. In the same vein, an effective pair programming assistant should also have knowledge of the underlying codebase and its functions and variables.</p><p>Similar to how conversational question answering has been a focus in the community of late (see <a href="https://arxiv.org/abs/2106.00874">this recent paper</a> for an overview), a conversational pair programming task would be a great way to measure progress regarding not just whether a model can produce a given function but whether it can effectively collaborate with a human. 
Given the promise of such models for augmenting the programming workflow, expect to see much more work in this space.</p><h2><strong>The Perceiver</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://s3.amazonaws.com/revue/items/images/010/136/581/original/perceiver.png?1626454011" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://s3.amazonaws.com/revue/items/images/010/136/581/original/perceiver.png?1626454011 424w, https://s3.amazonaws.com/revue/items/images/010/136/581/original/perceiver.png?1626454011 848w, https://s3.amazonaws.com/revue/items/images/010/136/581/original/perceiver.png?1626454011 1272w, https://s3.amazonaws.com/revue/items/images/010/136/581/original/perceiver.png?1626454011 1456w" sizes="100vw"><img src="https://s3.amazonaws.com/revue/items/images/010/136/581/original/perceiver.png?1626454011" data-attrs="{&quot;src&quot;:&quot;https://s3.amazonaws.com/revue/items/images/010/136/581/original/perceiver.png?1626454011&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://s3.amazonaws.com/revue/items/images/010/136/581/original/perceiver.png?1626454011 424w, https://s3.amazonaws.com/revue/items/images/010/136/581/original/perceiver.png?1626454011 848w, https://s3.amazonaws.com/revue/items/images/010/136/581/original/perceiver.png?1626454011 1272w, https://s3.amazonaws.com/revue/items/images/010/136/581/original/perceiver.png?1626454011 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The Perceiver uses cross-attention to project a large input byte array to a small latent array, which is processed with a regular Transformer stack. Cross-attention and Transformer stacks are interleaved throughout the model, with optional parameter sharing.</figcaption></figure></div><p><a href="https://arxiv.org/abs/2103.03206">The Perceiver</a> (Jaegle et al., ICML 2021) is one of the recent models that I'm most excited about. The main motivation of the work is to enable a Transformer-like architecture to scale to very high-dimensional inputs (<a href="https://arxiv.org/abs/2010.11929">Vision Transformers</a> are typically applied to image patches to overcome the computational complexity of self-attention). There have been a lot of recent more efficient Transformer architectures (see <a href="https://arxiv.org/abs/2009.06732">this paper</a> for an overview) but these still depend on the length of the input, typically linearly.</p><p>In contrast, the Perceiver uses a latent array of a fixed dimensionality as its base representation (see above). This representation is then conditioned via cross-attention (as in a standard encoder-decoder model) on the much larger input array and then processed with a Transformer stack, in alternating fashion. If parameters are shared across Transformer blocks and cross-attention layers, the Perceiver can essentially be seen as an RNN with a Transformer at its core. 
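</p><p>The following is a compressed sketch of that core idea in PyTorch: a small, learned latent array cross-attends to a much longer input array, and the expensive self-attention then operates only over the latents. It is an assumption-laden illustration, not a faithful re-implementation of the paper's architecture:</p><pre><code># Compressed sketch of the Perceiver's core trick (illustrative, not the full model):
# a small learned latent array cross-attends to a long input array, and a standard
# Transformer stack then refines only the latents.
import torch
import torch.nn as nn

class PerceiverSketch(nn.Module):
    def __init__(self, num_latents=256, dim=512, heads=8, depth=6):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        encoder_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.latent_transformer = nn.TransformerEncoder(encoder_layer, num_layers=depth)

    def forward(self, inputs):
        # inputs: (batch, input_len, dim); input_len may be tens of thousands of elements
        batch_size = inputs.shape[0]
        latents = self.latents.unsqueeze(0).expand(batch_size, -1, -1)
        # cross-attention costs O(num_latents * input_len) rather than O(input_len ** 2)
        latents, _ = self.cross_attn(latents, inputs, inputs)
        # self-attention only runs over the small latent array
        return self.latent_transformer(latents)
</code></pre><p>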
It is also similar in spirit to the <a href="https://arxiv.org/abs/1807.03819">Universal Transformer</a> (Dehghani et al., ICLR 2019), a model that applies the same Transformer block to an input multiple times.</p><p>The authors apply the Perceiver to three datasets across different modalities (ImageNet, video and audio, and 3D point clouds) and report performance competitive with the state of the art on all of them. You can also check out <a href="https://www.youtube.com/watch?v=P_xeshTnPZg">Yannic Kilcher's video</a> for a more visual introduction and contextualisation of the Perceiver.</p><h2><strong>Beyond the Transformer</strong></h2><p>Another recent trend has been the emergence of models that seek to replace the ubiquitous self-attention layer, most notably using <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">multilayer perceptrons</a> (MLPs). The <a href="https://arxiv.org/abs/2105.01601">MLP-Mixer</a> (Tolstikhin et al., 2021) applies MLPs independently to image patches as well as across patches and achieves competitive results on image classification tasks. <a href="https://arxiv.org/abs/2105.08050">Liu et al. (2021)</a> propose gMLP, a gated MLP architecture that achieves performance similar to Transformers on NLP and vision tasks.</p><p>A non-MLP based recent model is <a href="https://arxiv.org/abs/2105.03824">FNet</a> (Lee-Thorp, 2021), which uses 1D Fourier Transforms instead of self-attention to mix information at the token level. While the model is less expressive than self-attention based models such as BERT, it is much faster and still achieves competitive results in many settings.</p><p>Another thread of work in this area revisits the dominance of self-attention by applying the same treatment to convolutions (<a href="https://arxiv.org/abs/2105.03322">Tay et al., ACL 2021</a>): It turns out that if CNNs are pre-trained the same way as Transformer models, they achieve competitive performance on many NLP tasks. They mainly underperform on tasks that require modelling relations across sentences (such as paraphrasing, NLI, or QA), tasks that are notably over-represented on standard benchmarks such as GLUE.</p><p>On a similar note, a recent paper (<a href="https://arxiv.org/abs/2107.07002">Dehghani et al., 2021</a>) by some of the same authors argues that the tasks we focus on as part of a benchmark induce a bias in terms of the models that will succeed. If standard benchmarks such as GLUE were constructed differently, would we still have ended up with self-attention-based models dominating or would CNN-based models be much more common?</p><p>In sum, an MLP may unfortunately not be all you need. However, while the hegemony of self-attention may still endure, recent challengers based on MLPs, convolutions, and various other transformations encourage us to rethink the fundamental building blocks of our models.</p><h2>Data augmentation, NL augmenter &#129422; &#8594; &#128013;</h2><p>Data augmentation is a common tool used in computer vision but much less common in NLP. NLP is more challenging for augmentation due to the discrete nature of language, which also makes it harder to preserve meaning across transformations. This recent survey (<a href="https://arxiv.org/abs/2105.03075">Feng et al., ACL Findings 2021</a>) gives an overview of recent approaches in this area. 
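</p><p>To give a flavour of how lightweight some of these transformations can be, here is a toy example of two rule-based perturbations (random deletion and random swap) in the spirit of the community effort discussed below; real submissions are considerably more careful about preserving meaning:</p><pre><code># Toy rule-based augmentations for NLP (illustrative; real transformations aim to
# preserve meaning much more carefully).
import random

def random_deletion(sentence, p=0.1):
    tokens = sentence.split()
    kept = [t for t in tokens if random.random() >= p] or [random.choice(tokens)]
    return " ".join(kept)

def random_swap(sentence, num_swaps=1):
    tokens = sentence.split()
    if len(tokens) > 1:
        for _ in range(num_swaps):
            i, j = random.sample(range(len(tokens)), 2)
            tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)

print(random_deletion("data augmentation is less common in NLP than in computer vision"))
print(random_swap("data augmentation is less common in NLP than in computer vision"))
</code></pre><p>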
In particular, one thing that current data augmentation approaches in NLP still lack is a unified benchmark and framework where many different approaches can be tried and compared to each other.</p><p>Towards this goal, <a href="https://gem-benchmark.com/nl_augmenter">NL-Augmenter</a> is a collaborative effort that aims to collect a wide range of transformations, perturbations, and filters that generate additional data either for training or to test model robustness.</p><p>It is motivated by recent efforts such as the Beyond the Imitation Game Benchmark (<a href="https://github.com/google/BIG-bench">BIG-bench</a>), a collaborative project that crowd-sourced tasks to probe large language models. The BIG-bench project has attracted a large amount of interest, with people <a href="https://github.com/google/BIG-bench/pulls?q=is%3Apr">proposing hundreds of tasks</a>.</p><p>NL-Augmenter invites submissions <a href="https://github.com/GEM-benchmark/NL-Augmenter">via GitHub pull requests</a>. Submitted transformations may augment data in diverse ways, such as introducing spelling errors, translating to a different language, randomizing names and numbers, or paraphrasing. Some of my favourite submissions introduce transformations that <a href="https://github.com/GEM-benchmark/NL-Augmenter/pull/100">randomly swap words that sound similar to each other</a>, <a href="https://github.com/GEM-benchmark/NL-Augmenter/pull/84">replace names with more gender and culturally diverse ones</a>, or <a href="https://github.com/GEM-benchmark/NL-Augmenter/pull/40">translate random words to another language</a>.</p><p>The submission deadline is September 1, 2021. If you are interested in data augmentation for NLP, this is a great chance to contribute to a large community effort.</p><h2>The research communication continuum &#128483;</h2><p>I think a lot about how to communicate research effectively. One thing that is on my mind lately is what I call&#8212;for lack of a better word&#8212;the 'research communication continuum', essentially <strong>to what extent</strong> and <strong>in what form</strong> to discuss a given research topic. 
I have depicted different levels of content and the most common formats below in order of increasing complexity, but of course different formats are possible for each.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://s3.amazonaws.com/revue/items/images/009/947/160/original/research_communication_continuum.png?1625399201" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://s3.amazonaws.com/revue/items/images/009/947/160/original/research_communication_continuum.png?1625399201 424w, https://s3.amazonaws.com/revue/items/images/009/947/160/original/research_communication_continuum.png?1625399201 848w, https://s3.amazonaws.com/revue/items/images/009/947/160/original/research_communication_continuum.png?1625399201 1272w, https://s3.amazonaws.com/revue/items/images/009/947/160/original/research_communication_continuum.png?1625399201 1456w" sizes="100vw"><img src="https://s3.amazonaws.com/revue/items/images/009/947/160/original/research_communication_continuum.png?1625399201" data-attrs="{&quot;src&quot;:&quot;https://s3.amazonaws.com/revue/items/images/009/947/160/original/research_communication_continuum.png?1625399201&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://s3.amazonaws.com/revue/items/images/009/947/160/original/research_communication_continuum.png?1625399201 424w, https://s3.amazonaws.com/revue/items/images/009/947/160/original/research_communication_continuum.png?1625399201 848w, https://s3.amazonaws.com/revue/items/images/009/947/160/original/research_communication_continuum.png?1625399201 1272w, https://s3.amazonaws.com/revue/items/images/009/947/160/original/research_communication_continuum.png?1625399201 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The research communication continuum</figcaption></figure></div><p>There are a lot of great sources to get highlights or digests of interesting new articles such as SkyNet Today's <a href="https://lastweekin.ai/">Last Week in AI</a>, Yannic Kilcher's <a href="https://www.youtube.com/c/YannicKilcher/videos">ML News</a>, <a href="https://read.deeplearning.ai/the-batch/">The Batch</a>, <a href="https://jack-clark.net/">Import AI</a>, and many more.</p><p>To complement these resources, I have tried to focus more on in-depth discussions with this newsletter. However, such content is more time-consuming to write and takes time away from writing more comprehensive blog posts.</p><p>Going forward, I'm planning to focus more on slightly lighter, opinionated takes on current research themes rather than super in-depth discussions in this newsletter. That should give me more time to go deep in blog posts and to explore exciting research topics with you more regularly. 
Stay tuned!</p><h2>Rethinking ML Papers &#128221;</h2><p>Talking about research communication, the <a href="https://rethinkingmlpapers.github.io/">Rethinking ML Papers</a> workshop at ICLR 2021 explored just this topic and featured many luminaries of the ML communication space (if you registered at ICLR, you can view the content of the workshop <a href="https://iclr.cc/virtual/2021/workshop/2142">here</a>). My highlights were:</p><ul><li><p>Lilian Weng who <a href="https://lilianweng.github.io/lil-log/">writes excellent overview articles</a> about a diverse set of topics, from contrastive representation learning to controllable text generation. She talked about how you can <a href="https://twitter.com/rethinkmlpapers/status/1390700606772371458">catch up with the field by writing a high-quality ML blog post</a> (also one of my motivations for writing blog posts).</p></li><li><p>Terence Parr who writes <a href="https://explained.ai/decision-tree-viz/index.html">visual, lucid articles</a> explaining core ML concepts. His post on <a href="https://explained.ai/decision-tree-viz/index.html">visualizing decision trees</a> is still one of the most visually intuitive illustrations of the method that I have seen. He spoke about the <a href="https://twitter.com/rethinkmlpapers/status/1390707375481491466?s=20">role of visualization in ML</a>.</p></li><li><p>David Ha, <a href="https://otoro.net/ml/">interactive visualization virtuoso</a>, talked about <a href="https://twitter.com/rethinkmlpapers/status/1390654809649332227?s=20">interactive web demos of ML models</a>.</p></li><li><p>Jay Alammar, of <a href="https://jalammar.github.io/illustrated-transformer/">Illustrated Transformer</a> fame, discussed <a href="https://twitter.com/JayAlammar/status/1386952976925958145?s=20">how to communicate ML research via illustrated and interactive web articles</a>.</p></li></ul><h2>Fun papers</h2><blockquote><p><em>And Now For Something Completely Different...</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://s3.amazonaws.com/revue/items/images/010/136/542/original/1-s2.0-S0168159121001258-gr4.jpg?1626453845" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://s3.amazonaws.com/revue/items/images/010/136/542/original/1-s2.0-S0168159121001258-gr4.jpg?1626453845 424w, https://s3.amazonaws.com/revue/items/images/010/136/542/original/1-s2.0-S0168159121001258-gr4.jpg?1626453845 848w, https://s3.amazonaws.com/revue/items/images/010/136/542/original/1-s2.0-S0168159121001258-gr4.jpg?1626453845 1272w, https://s3.amazonaws.com/revue/items/images/010/136/542/original/1-s2.0-S0168159121001258-gr4.jpg?1626453845 1456w" sizes="100vw"><img src="https://s3.amazonaws.com/revue/items/images/010/136/542/original/1-s2.0-S0168159121001258-gr4.jpg?1626453845" data-attrs="{&quot;src&quot;:&quot;https://s3.amazonaws.com/revue/items/images/010/136/542/original/1-s2.0-S0168159121001258-gr4.jpg?1626453845&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://s3.amazonaws.com/revue/items/images/010/136/542/original/1-s2.0-S0168159121001258-gr4.jpg?1626453845 424w, https://s3.amazonaws.com/revue/items/images/010/136/542/original/1-s2.0-S0168159121001258-gr4.jpg?1626453845 848w, https://s3.amazonaws.com/revue/items/images/010/136/542/original/1-s2.0-S0168159121001258-gr4.jpg?1626453845 1272w, https://s3.amazonaws.com/revue/items/images/010/136/542/original/1-s2.0-S0168159121001258-gr4.jpg?1626453845 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">If I fits, I sits.</figcaption></figure></div><p><strong><a href="https://www.sciencedirect.com/science/article/pii/S0168159121001258">If I fits I sits: A citizen science investigation into illusory contour susceptibility in domestic cats (Felis silvestris catus)</a></strong> <strong>(Applied Animal Behaviour Science, 2021)</strong> From a different subject area, this article is a large-scale study that capitalizes on two important trends: 1) citizen science, which emphasizes public participation and collaboration in research, and 2) the Internet's love of cats. It turns out that cats not only prefer to sit in physical box-like spaces but also tend to do so if enclosures are illusory, such as the <a href="https://en.wikipedia.org/wiki/Illusory_contours">Kanizsa square</a> visual illusion.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://s3.amazonaws.com/revue/items/images/010/153/948/original/elmo_costume.png?1626610324" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://s3.amazonaws.com/revue/items/images/010/153/948/original/elmo_costume.png?1626610324 424w, https://s3.amazonaws.com/revue/items/images/010/153/948/original/elmo_costume.png?1626610324 848w, https://s3.amazonaws.com/revue/items/images/010/153/948/original/elmo_costume.png?1626610324 1272w, https://s3.amazonaws.com/revue/items/images/010/153/948/original/elmo_costume.png?1626610324 1456w" sizes="100vw"><img src="https://s3.amazonaws.com/revue/items/images/010/153/948/original/elmo_costume.png?1626610324" data-attrs="{&quot;src&quot;:&quot;https://s3.amazonaws.com/revue/items/images/010/153/948/original/elmo_costume.png?1626610324&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://s3.amazonaws.com/revue/items/images/010/153/948/original/elmo_costume.png?1626610324 424w, https://s3.amazonaws.com/revue/items/images/010/153/948/original/elmo_costume.png?1626610324 848w, https://s3.amazonaws.com/revue/items/images/010/153/948/original/elmo_costume.png?1626610324 1272w, https://s3.amazonaws.com/revue/items/images/010/153/948/original/elmo_costume.png?1626610324 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Lecturers going the extra mile for their students by dressing up in NLP-themed costumes (here: ELMo)</figcaption></figure></div><p><strong><a href="https://aclanthology.org/2021.teachingnlp-1.2/">Teaching a Massive Open Online Course on Natural 
Language Processing (Teaching NLP Workshop 2021)</a></strong> Teaching a new course, particularly during the COVID pandemic, can be incredibly challenging. This is a nice example of a massive open online NLP course taught by lecturers from Moscow. In the paper, they share their 12-week syllabus, consisting of lectures covering both fundamentals and recent work, real-time coding sessions, and interviews with experts. One thing that surely contributed to the course's success is the lecturers' thematically fitting wardrobe: they dressed in Sesame Street kigurumis (see above).</p>]]></content:encoded></item></channel></rss>