ICML round-up, Open collaboration, CLIP art, Internet augmentation, New GLUE-style benchmarks
Hi all,
This newsletter covers some of my favourite papers from ICML 2021, a discussion of open collaboration, art generated by the CLIP model, how to leverage information from the Internet in your models, and new benchmarks in the style of GLUE.
FYI, I'll be at ACL 2021 virtually this week. Ping me on Gathertown or send me an email if you would like to chat. I'm a co-author on two papers, on parameter-efficient multi-task learning and on monolingual vs multilingual models, which will be presented by the first authors at 12 pm on August 2 and at 11 am on August 3, respectively.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.
If you were referred by a friend, click here to subscribe. If you enjoyed this issue, give it a tweet 🐦.
ICML round-up 📑
Straight to the Gradient: Learning to Use Novel Tokens for Neural Text Generation
Neural generative models, despite their popularity, are known to suffer from some deficiencies, such as a tendency to generate frequent tokens. Popular methods to address this, such as top-k sampling (Fan et al., 2018) or nucleus sampling (Holtzman et al., 2020), focus on decoding. This paper proposes ScaleGrad, which re-scales the token probabilities during training to encourage the model to focus on novel tokens, i.e. ones that have not been generated before. ScaleGrad seems to improve performance on some open-ended as well as directed generation tasks. Of course, just focusing on novel tokens may be too simplistic. Overall, modifying the loss function with regard to particular sets of tokens may be a useful way to inject additional inductive biases into a model, such as which entities or attributes to focus on.
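To make the general idea concrete, here is a toy sketch that simply up-weights the cross-entropy loss for target tokens that have not yet appeared in the sequence. Note that this is only an illustration of loss re-weighting, not the paper's actual ScaleGrad formulation, which re-scales token probabilities during training; the `novel_weight` hyperparameter is made up.

```python
import torch
import torch.nn.functional as F

def novelty_weighted_loss(logits, targets, novel_weight=2.0):
    """Toy loss that up-weights target tokens not seen earlier in the sequence.

    logits: (seq_len, vocab_size) model outputs for one sequence
    targets: (seq_len,) gold token ids
    """
    losses = F.cross_entropy(logits, targets, reduction="none")  # (seq_len,)
    weights = torch.ones_like(losses)
    seen = set()
    for t, tok in enumerate(targets.tolist()):
        if tok not in seen:          # token is "novel" at this position
            weights[t] = novel_weight
        seen.add(tok)
    return (weights * losses).mean()

# Example usage with random data
logits = torch.randn(10, 100, requires_grad=True)
targets = torch.randint(0, 100, (10,))
loss = novelty_weighted_loss(logits, targets)
loss.backward()
```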
Dense for the Price of Sparse: Improved Performance of Sparsely Initialized Networks via a Subspace Offset
This paper nicely highlights the competing priorities when training sparse networks: the current trend of identifying 'lottery tickets', i.e. sparse subnetworks that can be trained on their own from scratch and that perform similarly to full networks, is motivated by computational efficiency. However, such methods require computing a score for all parameters in the full model to determine whether they should be pruned. It is thus still necessary to store and compute with the full model on device. To reduce on-device storage costs, the authors propose to express a network layer as the sum of a sparse matrix and a fast transform. Another thing I found interesting was the notion of matching vs extreme sparsity (Frankle et al., 2021): the former is the sparsity regime where pruned models perform comparably to the full model, while in the latter regime, the performance of pruned models deteriorates. Strong pruned models should aim to strike a Pareto-optimal sparsity–performance trade-off.
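As a rough sketch of what a "sparse matrix plus fast transform" layer could look like (using a fixed DCT as the fast transform and a random sparsity pattern; the paper's exact construction, initialization, and choice of sparsity pattern will differ):

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.fft import dct

class SparsePlusTransform(nn.Module):
    """Toy layer: output = (mask * W) x + DCT(x).

    Only the sparsely masked weights W are trained; the DCT acts as a
    fixed, parameter-free "fast transform" offset. A real implementation
    would apply the DCT with an O(d log d) routine instead of a dense matrix.
    """
    def __init__(self, dim, density=0.05):
        super().__init__()
        # Fixed random sparsity pattern for the trainable part
        mask = (torch.rand(dim, dim) < density).float()
        self.register_buffer("mask", mask)
        self.weight = nn.Parameter(torch.randn(dim, dim) * 0.01)
        # Orthonormal DCT-II matrix as the fixed fast transform
        dct_mat = torch.tensor(dct(np.eye(dim), axis=0, norm="ortho"), dtype=torch.float32)
        self.register_buffer("dct", dct_mat)

    def forward(self, x):  # x: (batch, dim)
        return x @ (self.mask * self.weight).T + x @ self.dct.T

layer = SparsePlusTransform(64)
out = layer(torch.randn(8, 64))
```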
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
This paper is a nice example of how different inductive biases can be combined to make a model more expressive depending on how much data is available. Vision Transformers using self-attention have outperformed CNNs when trained on large datasets, as self-attention is more expressive. But CNNs still perform better when trained on smaller datasets due to their inductive bias. The authors propose a slightly modified self-attention layer, which is initialized to act as a convolutional layer. This way, the model retains the useful inductive bias early during training and can become more expressive later on if necessary. Such a soft inductive bias may also be of interest for efficient language Transformers, which could limit the attention span early in training (Sukhbaatar et al., 2019).
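As a rough illustration of such a soft inductive bias (a simplified, single-head toy, not ConViT's exact gated positional self-attention), an attention head can mix content-based attention with a purely positional attention pattern via a learned gate that is initialized to favour the positional, convolution-like part:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Toy single-head attention mixing content and positional scores."""
    def __init__(self, dim, seq_len, pos_init=3.0):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        # Learnable positional attention logits; a local, convolution-like
        # pattern could be used as their initialization
        self.pos_logits = nn.Parameter(torch.zeros(seq_len, seq_len))
        # Gate initialized so sigmoid(gate) is close to 1: start mostly positional
        self.gate = nn.Parameter(torch.tensor(pos_init))

    def forward(self, x):                      # x: (batch, seq_len, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        content = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        positional = F.softmax(self.pos_logits, dim=-1)
        lam = torch.sigmoid(self.gate)
        attn = lam * positional + (1 - lam) * content
        return attn @ v

out = GatedAttention(dim=32, seq_len=16)(torch.randn(2, 16, 32))
```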
Calibrate Before Use: Improving Few-Shot Performance of Language Models
This paper highlights the instability of prompt-based learning. In particular, prompt-based models are sensitive to the format of the prompt, training examples, and order of examples. A key problem is that the model favours certain answers over others, e.g. answers that are frequent in the prompt, appear towards the end of the prompt, or are frequent in its pre-training data. To address this, the authors propose to first estimate the model's bias using a content-free prompt. The model's predictions can then be recalibrated so that the class scores for the content-free prompt are uniform. While calibration doesn't alleviate the need for prompt engineering, it reduces the variance when dealing with different prompts. Given the current popularity of prompt-based methods, this may make working with prompt-based models a lot easier.
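The recalibration step itself is simple. Below is a minimal sketch of the idea, dividing the class scores by those obtained for a content-free input such as "N/A" and renormalizing; the probability values are made up for illustration.

```python
import numpy as np

def calibrate(probs, content_free_probs):
    """Rescale class probabilities so the content-free input becomes uniform."""
    calibrated = np.asarray(probs) / np.asarray(content_free_probs)
    return calibrated / calibrated.sum()

# Hypothetical LM scores over the labels ["negative", "positive"]
p_cf = [0.7, 0.3]   # scores for the content-free prompt ending in "N/A"
p_x  = [0.6, 0.4]   # scores for a real input
print(calibrate(p_x, p_cf))   # the apparent "negative" bias is corrected
```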
Catformer: Designing Stable Transformers via Sensitivity Analysis
This paper introduces the concept of the sensitivity of an architecture, which measures how much an architecture's output varies when its parameters are randomly perturbed. The authors also relate this measure to how difficult certain architectures are to train. They then propose a simple modification to the Transformer, which replaces residual connections with concatenation and is more stable to train on a set of reinforcement learning tasks. So far, training difficulty has often been discussed anecdotally. A more principled measure of training difficulty such as sensitivity is a step towards designing models that are not only more powerful but also easier to use in real-world applications.
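A crude empirical proxy for this notion (not the paper's formal definition of sensitivity) is to perturb a model's parameters with small Gaussian noise and measure the average relative change in its output:

```python
import copy
import torch

def output_sensitivity(model, x, sigma=0.01, n_samples=10):
    """Average relative output change under random parameter perturbations."""
    with torch.no_grad():
        base = model(x)
        changes = []
        for _ in range(n_samples):
            perturbed = copy.deepcopy(model)
            for p in perturbed.parameters():
                p.add_(sigma * torch.randn_like(p))   # add Gaussian noise
            changes.append(torch.norm(perturbed(x) - base) / torch.norm(base))
    return torch.stack(changes).mean()

model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16))
print(output_sensitivity(model, torch.randn(4, 16)))
```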
WILDS: A Benchmark of in-the-Wild Distribution Shifts
This is a very diverse benchmark to test how well ML methods generalize across distribution shifts on a wide variety of domains and data modalities. It covers domains as diverse as camera trap photos, cell images, molecular graphs, online comments, and code. If you are working on robust, modality-agnostic ML methods, then this is the dataset to evaluate on.
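If I recall the wilds Python package correctly, getting started looks roughly like this; the dataset names and exact API are best double-checked against the official documentation.

```python
# pip install wilds
from wilds import get_dataset

# Download a WILDS dataset (e.g. camera trap photos) and grab its splits
dataset = get_dataset(dataset="iwildcam", download=True)
train_data = dataset.get_subset("train")
test_data = dataset.get_subset("test")
```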
Open collaboration 🤝
At ICML, I attended the Social on Open Collaboration in ML Research, hosted by ML Collective, among others. During the event, people shared a diverse range of external collaboration experiences, many of them relating to work done as part of independent research collectives.
Connor Leahy talked about EleutherAI, a grassroots collective of researchers that has not only developed GPT-Neo, an open-source LM in the style of GPT-3, but also worked on BioML research and ML-generated art (read more about the art below), all within the past year. This blog post provides a great overview of their progress so far. To join or contribute, you can head over to their Discord.
Edward Elson Kosasih talked about his research as part of ML Collective (MLC), a nonprofit organization dedicated to making ML research accessible. He led a team that worked on graph neural networks as part of the Open Graph Benchmark Large Scale Challenge. In order to get involved with MLC, you can join their Discord.
Matthias Gallé discussed the BigScience project, also known as The Summer of Language Models 21, a one-year-long research workshop on very large language models. The project aims to create and share a large multilingual dataset and to train a very large language model. A diverse set of working groups is dedicated to different parts of the data and model creation process, from data sourcing to prompt engineering, dealing with metadata, and retrieval. To get up to speed on the progress so far, you can watch the updates from the first event on July 30, 2021 here. To join the project, fill out the form here.
Salomon Kabongo talked about the work of Masakhane. Masakhane is a grassroots organisation that aims to strengthen NLP research in African languages. So far, they have released models and datasets for diverse tasks such as machine translation, named entity recognition, and others in many African languages. To get involved, join the Google group and Slack channel.
On the whole, my impression is that ML and NLP have become much more accessible, in part thanks to research collaborations such as the above, which are open to anyone who is excited and motivated to contribute. Other collaboration opportunities include the fast.ai and HuggingFace communities. If you are looking to work in ML or NLP and need collaborators and guidance, I encourage you to join one of the above collaborations.
On the topic of academic collaborations, I shared some lessons from my first external collaboration (and first long paper during my PhD) with Barbara Plank (see below).
CLIP art 🎨
CLIP art, not to be confused with the sometimes slightly cheesy type of graphic art often used for illustration purposes, refers to art produced using the CLIP model by OpenAI. CLIP was trained with a contrastive objective to match text with corresponding images. As a result, CLIP is very good at judging which caption best reflects an image, which can be used for zero-shot classification on ImageNet. Alternatively, the model can also be used to gauge which image best suits a description. This is how CLIP is used to generate art (see above): the output of a separate generative model is steered via back-propagation until it produces an image that matches the description as closely as possible, according to CLIP.
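Below is a minimal sketch of this guidance loop using OpenAI's clip package. For simplicity it optimizes raw pixels directly rather than the latent input of a separate generative model, which is what produces the more striking results; the prompt, step count, and learning rate are arbitrary choices.

```python
# pip install torch git+https://github.com/openai/CLIP.git
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device, jit=False)
model = model.float()              # keep everything in fp32 for simplicity
model.requires_grad_(False)        # only the image is optimized

text = clip.tokenize(["a beautiful epic wondrous fantasy painting of the ocean"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Optimize a 224x224 image directly; CLIP-guided art instead optimizes the
# latent input of a generative model (VQGAN, diffusion, CLIPDraw strokes, ...)
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

for step in range(200):
    optimizer.zero_grad()
    img_feat = model.encode_image((image.clamp(0, 1) - mean) / std)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = -(img_feat * text_feat).sum()   # maximize cosine similarity with the text
    loss.backward()
    optimizer.step()
```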
This article by Charlie Snell does a great job of charting the development of the art scene that has evolved around this method, and of the often dreamy, impressionistic, or psychedelic images it has produced. The cool thing is that CLIP works with any generative model, so the possibilities the method presents grow more diverse as generative models become more powerful. One of my favourite images is the one below of "a beautiful epic wondrous fantasy painting of the ocean", generated by @RiversHaveWings using CLIPDraw + CLIP.
Internet augmentation 💻
Current large language models are trained on large amounts of unlabelled data, mainly from the web. However, they do not yet leverage everything the Internet has to offer. In other words, there are many types of signals on the web that are currently not used for learning. Barbara Plank has called this fortuitous data, as such data is often available by accident or good fortune.
A great example of such fortuitous data is the HTML structure underlying web pages. Such structure can both provide a useful learning signal for a model and be useful for generating prompts, by letting the model auto-complete the HTML structure of a document. Aghajanyan et al. (2021) recently proposed the cleverly named HTLM, a large language model trained on HTML structure. They show that the model excels at zero-shot natural language generation using structured HTML-based prompts. In addition, they propose to control the size of the generated output sequence using size hints, noisy estimates of the length of the generated span inserted right after the MASK token.
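To give a flavour of such structured prompts, here is a hypothetical example (not HTLM's exact mask or size-hint syntax), where a model would be asked to infill a document's title from its body:

```python
# Hypothetical HTML prompt; "<mask>" is a placeholder whose exact form
# (and the syntax of size hints) follows the paper rather than this sketch.
prompt = """
<html>
  <head><title><mask></title></head>
  <body><p>Researchers release a large language model trained on the
  structure of web pages, enabling zero-shot generation via HTML prompts.</p></body>
</html>
"""
```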
Another recent example of leveraging more of what the web has to offer is an extension of retrieval augmentation to the Internet. Specifically, rather than learning to retrieve relevant information only from a large corpus of unlabelled text, a model can learn to retrieve from the entire Internet. To make this feasible, Komeili et al. (2021) learn to generate a search query based on the context of a dialogue. They then condition on the search results to generate a response. The resulting Internet-augmented dialogue model outperforms both a model that uses standard retrieval augmentation and one that uses no augmentation.
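Schematically, the pipeline looks something like the sketch below; generate_query, generate_response, and DummySearcher are hypothetical stand-ins for the paper's trained query generator, response generator, and search engine API.

```python
# Trivial dummies for illustration only; in the paper these are learned models
# and a real search engine.
class DummySearcher:
    def search(self, query, top_k=5):
        return [f"(web result {i} for '{query}')" for i in range(top_k)]

def generate_query(dialogue_history):
    return dialogue_history[-1]                      # e.g. reuse the last user turn

def generate_response(dialogue_history, documents):
    return f"Response grounded in {len(documents)} retrieved documents ..."

def internet_augmented_reply(dialogue_history, searcher, top_k=5):
    query = generate_query(dialogue_history)          # 1. query from dialogue context
    documents = searcher.search(query, top_k=top_k)   # 2. retrieve from the live web
    return generate_response(dialogue_history, documents)  # 3. condition on the results

print(internet_augmented_reply(["Who won Euro 2020?"], DummySearcher()))
```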
Other forms of information that have so far been neglected include: a) the information from hyperlinked pages, which could be used for conditioning during training; b) hyperlink patterns, to learn which information to trust; c) multi-modal context on webpages, to ground representations; d) snapshots of webpages over time, e.g. to learn re-writing; e) timestamps, for modelling time (Dhingra et al., 2021); and f) content by the same users across multiple websites, for authorship and style modelling.
New GLUE-style benchmarks 🏛
Since the development of models that learn general representations, mainly via self-supervised learning, it has become common to evaluate such models on benchmarks comprising a diverse set of tasks. The most prominent of these are arguably the GLUE and SuperGLUE benchmarks. Following in their footsteps, benchmarks serving a wide array of settings and languages have been proposed.
I'm particularly excited about two recent additions to this ever-growing evaluation environment. Few-shot Language Evaluation across (X) many transfer types (FLEX; Bragg and Cohan et al., 2021) is a benchmark focused on few-shot NLP, something I hoped to encourage in 2018. It not only covers the standard meta-learning/few-shot learning setup with separate meta-training and meta-test portions but also captures the current zeitgeist by including zero-shot evaluation based on textual descriptions.
The second benchmark is the Speech processing Universal Performance Benchmark (SUPERB; Yang et al., 2021), which aims to do for speech what GLUE has done for NLP by providing a general platform to evaluate self-supervised speech models on 10 different tasks. The benchmark covers core speech tasks, from modelling content (phonemes, transcription, keywords) and speakers (identification, verification, and diarization) to dealing with semantics (intents, slot filling) and paralinguistic features (emotions). Such standardization will likely open the door to the development of more powerful self-supervised speech models.
Another type of benchmark I'm excited about is one that covers many tasks in a given language. I've recently had the chance to contribute to two such benchmarks: LiRo for NLU tasks in Romanian and IndoNLG for NLG tasks in Indonesian. In my opinion, facilitating evaluation on a diverse set of tasks in a given language is one of the best ways to incentivise progress in that language.