COVID-19, Hutter Prize, Compression = AGI?, BERT, Green AI

NLP News
Hi all,
This newsletter is a bit delayed due to some adjustments in light of the ongoing coronavirus pandemic. I hope you are all safe. Hopefully this newsletter can brighten your day a bit.
This edition includes new results from NLP-Progress, a discussion about COVID-19 and what you can do to help, an update of the venerable Hutter Prize, which uses compression as a test for AGI, the latest resources around BERT and monolingual BERT models, an introduction to Green AI, and as usual lots of other resources, blog posts, and papers.
Contributions 💪 If you have written or have come across something that would be relevant to the community, hit reply on the issue so that it can be shared more widely. I’d particularly love to share anything related to using NLP to deal with the ongoing pandemic.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.
If you were referred by a friend, click here to subscribe. If you enjoyed this issue, give it a tweet 🐦.

NLP Progress 🔢
Updates during the last month include:
If you are looking for other datasets, have a look at The Big Bad NLP Database, which currently covers 314 datasets.
COVID-19, you, and the world 😷
The coronavirus pandemic impacts all of us. This post by Our World in Data gives an excellent, data-centric and regularly updated overview of the crisis and the current research on COVID-19. In addition, you should read this post from fast.ai’s Jeremy Howard and Rachel Thomas, which provides a no-nonsense take on the crisis and outlines in clear terms what needs to be done.
What can you do? The most important thing is to slow the spread of the virus (‘flattening the curve’; see below). You should wash your hands frequently, avoid crowds and large gatherings, work from home if at all possible, and generally distance yourself from others.
Flattening the curve (credit: Our World in Data)
Take care of your sanity. With everything bad that is going on, it is important that you take care of yourself, both physically and mentally. Take time off to recharge. Focus on the positive. Work on something that excites you or where you feel you can make a difference. Don’t panic. There is always another deadline.
Participate in the COVID-19 Global Hackathon. The hackathon is supported by Facebook, Giphy, Microsoft, Pinterest, Slack, TikTok, Twitter and WeChat and focuses on building something with a technology of your choice to help address the ongoing crisis.
Help explore the CORD-19 dataset. The Allen Institute for AI has recently released the COVID-19 Open Research Dataset (CORD-19), a free resource that contains 44k+ articles about COVID-19 and the coronavirus family of viruses. Kaggle is hosting the COVID-19 Open Research Dataset Challenge, which provides a series of important questions to inspire research using CORD-19. If you are interested in question answering, text mining, summarization, etc., then this is probably the most impactful dataset and task to work on these days.
If you are a domain expert who needs help using open-source scientific computing software more effectively, then the COVID-19 Open-Source Help Desk may also be useful.
Compression = Artificial General Intelligence? 🤖
Most people who have worked with text data have heard of the Hutter Prize. The prize, introduced in 2006 by Artificial General Intelligence researcher Marcus Hutter (disclaimer: Hutter is now at DeepMind), offered a total of €50,000 in prize money. The goal of the competition was to compress enwik8, 100MB of English Wikipedia, to as small a file size as possible. The previous record holder managed to compress the file by a factor of about 6.54. Since its introduction, enwik8 has become a standard benchmark for character-level language modelling, with a Transformer-XL achieving a bits-per-character score of 0.94.
In the era of larger models and more compute, after almost 14 years, the Hutter Prize has expanded: the new objective is to compress enwik9, 1GB of English Wikipedia, and the prize pool is now €500,000.
But what is so special about compression in the first place? Compression has been argued to be a more tangible way to measure “intelligence”. If you are very good at compressing something, then you have arguably retained something about the underlying structure of the input. In the case of compressing English Wikipedia, one can argue that you would need to understand something about the English language and have acquired some degree of world knowledge in order to compress the data to a very small size.
Researchers have also hypothesized that compression plays a role in language acquisition. Compression is closely linked to generalization in deep learning in the Information Bottleneck theory and has been proposed for evaluating sample efficiency in transfer learning.
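To make the two numbers in play here concrete, below is a minimal sketch of the compression ratio (the quantity the Hutter Prize targets) and bits per character (the language-modelling metric) using Python's standard zlib. This is an off-the-shelf general-purpose compressor, of course, nowhere near a prize-worthy one:

```python
import zlib

def compression_stats(text: str):
    """Return (compression ratio, bits per character) for a piece of text."""
    raw = text.encode("utf-8")
    packed = zlib.compress(raw, level=9)
    ratio = len(raw) / len(packed)     # enwik8's record is a factor of ~6.54
    bpc = 8 * len(packed) / len(text)  # comparable to LM bits-per-character scores
    return ratio, bpc

# Highly repetitive text compresses extremely well; natural text much less so.
sample = "the quick brown fox jumps over the lazy dog " * 200
ratio, bpc = compression_stats(sample)
print(f"ratio: {ratio:.2f}, bits/char: {bpc:.2f}")
```

The intuition from the paragraph above falls out directly: the more structure the compressor captures, the higher the ratio and the lower the bits per character, which is exactly what a better language model achieves.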
BERT things 🦍
BERT is as popular as ever in the NLP community. For a glimpse into the scale of this adoption, have a look at this repo of 350+ BERT-related papers. If you quickly want to get up to speed on what BERT can and cannot do, check out this primer on BERTology that synthesizes 40 analysis studies. For a comprehensive survey on current pre-trained language models, have a look at this recent paper.
BERT is also popular in the multilingual space. New language-specific BERT models emerge almost every week, such as Polbert, the Polish version of BERT. For an overview of the currently available language-specific BERT models, have a look at the Bert Lang Street website and this paper. If you are looking for an evaluation for your language-specific BERT model, question answering is a good benchmark. This post gives an overview of the current SQuAD-like datasets in multiple languages.
In terms of evaluation on multiple languages, we have also just released XTREME, a massively multilingual multi-task benchmark for evaluating the cross-lingual generalisation ability of pre-trained multilingual models. A blog post and the website will be released soon.
Green AI ♻️
The growing size of models and the increase in compute have several significant negative consequences, such as a surprisingly large carbon footprint and a large financial cost, which makes it difficult for researchers from less well-funded institutions to contribute.
Green AI, which focuses on making efficiency part of the evaluation criteria of our models, is thus an important research direction. For an overview of this emerging topic, read this position paper by the Allen Institute for AI.
A recent benchmark for the efficient pre-training and fine-tuning of NLP models is HULK. Similar to previous efficiency-oriented benchmarks such as DAWNBench, HULK measures the time and cost of pre-training a model from scratch or fine-tuning a model from a pre-trained checkpoint to a certain multi-task performance.
You can also take part in the SustaiNLP competition (hosted as part of EMNLP 2020), which will provide a shared task focusing on evaluating a trade-off between performance and efficiency on SuperGLUE.
Talks and slides 🗣
Transfer Learning in NLP 👨‍💻 These slides by Thomas Wolf provide a nice walk-through of recent papers and research directions focusing on important topics in NLP such as model size, computational efficiency, model evaluation, fine-tuning, out-of-domain generalization, sample efficiency, common sense, and inductive biases.
Resources and datasets 📚
Text Summarization Papers 📖 An exhaustive list of papers related to text summarization from top NLP and ML conferences of the last eight years. It even includes a paper retrieval system to find the top cited papers (the top one is A Neural Attention Model for Abstractive Sentence Summarization from EMNLP 2015) and papers related to certain topics.
Tools ⚒
mt-dnn ⚙️ This PyTorch package by Microsoft implements state-of-the-art models that were featured in many recent publications from the team, perhaps most prominently the MT-DNN model (ACL 2019) that held the state of the art on GLUE.
Articles and blog posts 📰
Huge models such as BERT-large test the limits of a GPU (Credit: Amit Chaudhary)
Visual Paper Summary: ALBERT (A Lite BERT) 🖼 A visual summary of ALBERT by Amit Chaudhary that walks us through how to get from BERT to ALBERT with hand-drawn diagrams.
Isn't tokenization easy? (Credit: Cathal Horan)
Tokenizers: How machines read 📖 A surprisingly in-depth overview of tokenization by Cathal Horan that covers the intricacies of classic tokenization as well as recent subword variants such as BPE, unigram subword tokenization, WordPiece, and SentencePiece.
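The BPE merge loop at the heart of several of these tokenizers fits in a few lines. Here is a toy sketch (the corpus and number of merges are illustrative; production tokenizers handle boundary cases that this version ignores):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation.
    (str.replace can over-merge in pathological cases; fine for a toy.)"""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges = []
for _ in range(3):  # learn three merge rules
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    merges.append(best)
print(merges)  # most frequent pairs merged first, e.g. ('e', 's') then ('es', 't')
```

Each learned merge becomes part of the tokenizer: at inference time the same merges are replayed in order on new words, which is why frequent words end up as single tokens while rare words fall back to smaller subword pieces.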
How to Pick Your Grad School 👩‍🎓 An incredibly detailed guide by Tim Dettmers on how to choose where to do your PhD (with a focus on ML and NLP). The post is nuanced and not only looks at what makes most sense from a career perspective but also how to pick a school that helps you grow as a person, that gives you the stability to succeed during a PhD, and that provides you with new experiences.
Questions to Ask a Prospective Ph.D. Advisor on Visit Day, With Thorough and Forthright Explanations 👩‍🏫 On a similar topic, this post by Andrew Kuznetsov focuses on questions that you should ask a prospective supervisor during a visit day, such as whether they are hands-on or hands-off, how the lab is structured, the nature of the lab meetings, etc.
Fast, scalable and accurate NLP: Why TFX is a perfect match for deploying BERT 🛠 This blog post by SAP’s Concur Labs focuses on how TensorFlow Extended (TFX), an end-to-end platform for ML models in production, can be used to deploy BERT fast and efficiently.
From PyTorch to JAX: towards neural net frameworks that purify stateful code 🔧 Sabrina Mielke provides a didactic walk-through on how to construct an LSTM language model in JAX with many clear code samples and explanations.
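The pattern the post advocates (threading state through pure functions instead of mutating objects) can be sketched without JAX at all; the names below are illustrative, not JAX API:

```python
def init_state(start: int = 0) -> dict:
    """All 'hidden' state lives in an explicit, inert data structure."""
    return {"count": start}

def step(state: dict, increment: int):
    """Pure update: takes the state in and returns a new state (plus an
    output) instead of mutating anything in place."""
    new_state = {"count": state["count"] + increment}
    return new_state, new_state["count"]

state = init_state()
for inc in [1, 2, 3]:
    state, out = step(state, inc)
print(state["count"])  # → 6
```

Because `step` has no side effects, a framework is free to trace, compile, or differentiate it; this is the property that JAX transformations like `jit` and `grad` rely on, and it is why stateful PyTorch-style modules need the restructuring the post describes.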
Why do we still use 18-year-old BLEU? ⚖️ Ehud Reiter laments the continued use of the flawed BLEU metric, even though metrics with higher correlation to human judgements have been reported. He argues that NLP researchers are more interested in reporting a better score and do not really care whether the score is meaningful. I feel—or would hope, at least—that if there was a better candidate metric, more people would point this out in their papers and reviews, which would slowly lead to the adoption of said metric.
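For context on what BLEU actually computes, here is a bare-bones single-reference sketch: the geometric mean of modified n-gram precisions times a brevity penalty. Real implementations (e.g. sacreBLEU) add smoothing and multi-reference support that this toy version omits:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference sentence BLEU. Without smoothing, any zero n-gram
    precision (common for short sentences) makes the whole score 0."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # 'modified' precision: clip each n-gram's count by the reference count
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * geo_mean

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(round(bleu(cand, ref, max_n=2), 3))  # → 0.707 (bigram-level score)
```

The sketch makes the critique concrete: BLEU only rewards surface n-gram overlap, so a fluent paraphrase with different wording can score lower than a disfluent near-copy of the reference.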
Transformers are Graph Neural Networks 📊 This post shows the connections between graph neural networks (GNN) and Transformers. It discusses intuitions behind model architectures in the NLP and GNN communities and outlines potential directions where both communities could work together to drive progress.
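The correspondence is easy to see in code: single-head dot-product attention is message passing on a fully connected graph, where each token aggregates the values of all tokens weighted by query-key similarity. A dependency-free sketch (lists of lists stand in for tensors):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Each query 'node' aggregates all value 'nodes', weighted by
    softmax-normalised, scaled query-key dot products: message passing
    on a fully connected graph over the tokens."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# A query equally similar to both keys simply averages the two values.
print(attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]]))  # → [[0.5, 0.5]]
```

Seen this way, a GNN that restricts which nodes exchange messages (e.g. only syntactic neighbours) is just attention with a sparse graph instead of a complete one, which is one of the bridges the post explores.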
An updated overview of recent gradient descent algorithms 📉 John Chen gives an overview of the recent generation of gradient descent algorithms, including AdamW, QHAdam, YellowFin, AggMo, and Demon. He evaluates their performance on both vision and language benchmarks and recommends which algorithm to use nowadays.
Papers + blog posts 📑
Speeding Up Transformer Training and Inference By Increasing Model Size (blog post, paper) Eric Wallace discusses the surprising observation that large models train faster than their smaller counterparts, that is, with fewer gradient steps. This difference is meaningful even if we factor in the extra computational cost of larger models. The finding runs counter to the common wisdom that smaller models are easier (and faster) to fit.
Paper picks 📄
Learning the Difference that Makes a Difference with Counterfactually-Augmented Data (ICLR 2020) This paper creates minimal counterfactual training examples (i.e. examples with opposite labels) for sentiment analysis and natural language inference with human annotators (for IMDb and SNLI respectively). Minimal here means that annotators are instructed to make as few changes as possible to an example. As we would expect, they find that models trained on the original data fail on counterfactual examples. The main other important finding is that models trained on the counterfactually augmented data perform better than models trained on comparable quantities of original data. This shows that having “harder” examples or examples that teach the model something useful are better than random “easy” examples.
Break It Down: A Question Understanding Benchmark (TACL 2020) This paper proposes the Question Decomposition Meaning Representation (QDMR), which consists of an ordered list of steps (in natural language) that are required for answering a question. The authors demonstrate that such QDMRs can be annotated at scale and release the Break dataset, which contains 83k pairs of questions and their QDMRs. Training on QDMRs improves performance on open-domain QA on HotpotQA and QDMRs can be converted to a pseudo-SQL formal language, so might be useful for semantic parsing.
Sebastian Ruder @seb_ruder

Regular analyses of advances in natural language processing and machine learning.

Created with Revue by Twitter.