BERT, GPT-2, XLNet, NAACL, ICML, arXiv, EurNLP
Hi all,
A lot has been going on in the past month. This newsletter contains new stuff about BERT, GPT-2, and (the very recent) XLNet, as well as things from NAACL and ICML and, as always, exciting blog posts, articles, papers, and resources. (Edit: Sorry about that. Seems like an earlier version of the intro went out via email.)
EurNLP Registrations and applications for travel grants for the first European NLP Summit will be open soon, so stay tuned. One student author per accepted abstract is eligible for a travel grant. An abstract is only one page long and can describe ongoing or published work. What are you waiting for?
Contributions 💪 If you have written or have come across something that would be relevant to the community, hit reply on the issue so that it can be shared more widely.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.
If you were referred by a friend, click here to subscribe. If you enjoyed this issue, give it a tweet 🐦.
Slides 🖼
Troubleshooting Deep Neural Networks 🔧 A field guide to fixing your neural network model by Josh Tobin.
Meta-learning for Computer Vision 🤖 Slides from the tutorial on Meta-Learning for Computer Vision at CVPR 2019.
Language learning and processing in people and machines 🧒 Slides of the NAACL 2019 tutorial of the same name.
Talks 🗣
An AI Pioneer Explains the Evolution of Neural Networks 🐒 This is a fantastic interview with Geoffrey Hinton in which he takes us through the history of neural networks. He discusses a range of other topics, such as reasoning, adversarial examples, consciousness, what neural networks can teach us about the human brain, dreams, and capsule networks.
Self-supervised learning 👩💻 Self-supervised learning is becoming more popular in the NLP community. Here are recordings of talks from the self-supervised learning workshop at ICML 2019 from Andrew Zisserman, Abhinav Gupta, and Alexei Efros. For more on self-supervised learning, here are slides from a tutorial by Andrew Zisserman.
Resources 📚
OpenITI corpus 📖 An open access corpus of thousands of premodern and modern Arabic texts. Now there's no excuse anymore to just work on European languages.
Tools ⚒
InterpretML ⬛️ An open-source package by Microsoft for training interpretable models and explaining blackbox systems. The library is based on an Explainable Boosting Machine model.
semantic 💻 If you're interested in ML on Code then this is for you. A Haskell library and command line tool by GitHub for parsing, analyzing, and comparing source code.
Stable Baselines 📊 A fork of OpenAI Baselines that is meant to be user-friendly (with an sklearn-like syntax). It allows you to define and train RL agents in a single line and comes with documentation that includes examples and Colab notebooks, as well as an RL zoo with more than 100 pretrained agents.
To arXiv or not to arXiv 📄
With conference season in full swing, there've been many discussions—both online and offline—about the nature of blind review in the arXiv era. In this essay, Matt Gardner proposes a series of incremental steps to fix the situation. His main suggestion is to make the posting of anonymous (rather than non-anonymous) preprints prevalent by replacing arXiv with OpenReview.
BERT-ology 🐵
BERT won the best paper award at NAACL 2019 (see my NAACL 2019 highlights). Interest in BERT is so high that many papers have moved from analyzing standard architectures such as LSTMs or Transformers to trying to understand what is going on inside a single model, BERT. Here's a selection of recent tidbits:
Only a few of the attention heads are actually necessary. Recent work (Voita et al., ACL 2019; Michel et al., 2019) shows that many of the attention heads in BERT can be pruned with only a small performance penalty.
BERT models linguistically relevant features. Lin et al. (2019) analyze the hierarchy of learned representations in BERT, while Coenen et al. (2019) analyze the geometry of learned representations.
Whole word masking. BERT tokenizes the text into subwords. Previously, these subwords were masked independently, which makes the objective easier. Masking all subwords that belong to a word makes the pretraining task harder and improves performance.
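To make the whole word masking idea concrete, here is a minimal sketch in plain Python. It assumes WordPiece-style tokens where a `##` prefix marks the continuation of a word. The function name and the single masking probability are illustrative assumptions, and the sketch leaves out the other details of BERT's actual masking procedure.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask whole words: every subword piece of a chosen word is masked together.

    Hypothetical helper for illustration, not BERT's actual preprocessing code.
    """
    rng = random.Random(seed)
    # Group token indices into words: a piece starting with "##"
    # continues the word that the previous piece belongs to.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        # One random draw per word, not per subword.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked, words

tokens = ["the", "phil", "##am", "##mon", "sang"]
masked, words = whole_word_mask(tokens, mask_prob=0.5, seed=1)
print(masked)
```

Because the random draw happens once per word, the pieces `phil`, `##am`, and `##mon` are always either all masked or all left intact.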
Talking GPT-2 🗣
There have been a lot of cool applications of language models over recent months. One of the coolest is Talk with Transformer, which enables you to use the medium-sized GPT-2 as a writing assistant. Say good-bye to writer's block!
If you're wondering what it would look like if GPT-2 were talking with itself, then the r/SubSimulatorGPT2 subreddit provides a treasure trove of Reddit posts, including ones such as "what do you call two people with the same name? Joe and John".
The original announcement of GPT-2 drew a lot of controversy due to the amount of PR and the decision not to release the model because it was "too dangerous". One aspect, however, was often ignored in later discussions: regardless of whether or not the model should be released, it is important to start the debate. It is important to emphasize this again (as done by Connor Leahy, who developed a similar model) and to continue this debate as the capabilities of our models grow and we increasingly have to grapple with ethical questions.
XLNet 👑
The king is dead. Long live the king. BERT's reign might be coming to an end. XLNet, a new model by people from CMU and Google, outperforms BERT on 20 tasks (with a similar number of parameters but trained on more data). Instead of predicting masked words independently as in BERT, the model predicts target words based on different orders of source words. This allows it to model more dependencies in the data. The paper is very well written and provides some nice examples and intuition for how the model works, so it's definitely worth a read.
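For intuition about predicting words under different orders, here is a rough sketch: sample a random order over positions and let each position see only the positions that come earlier in that order. The function name is hypothetical, and the sketch covers only this visibility pattern, not the rest of the XLNet architecture or training setup.

```python
import random

def permutation_mask(seq_len, seed=0):
    """Sample a prediction order and return which positions each position may see.

    Hypothetical helper: can_see[i][j] is True when position j comes
    before position i in the sampled order.
    """
    rng = random.Random(seed)
    order = list(range(seq_len))
    rng.shuffle(order)
    # rank[pos] is the step at which position pos gets predicted.
    rank = {pos: step for step, pos in enumerate(order)}
    can_see = [[rank[j] < rank[i] for j in range(seq_len)]
               for i in range(seq_len)]
    return order, can_see

order, can_see = permutation_mask(4, seed=3)
print("prediction order:", order)
for i, row in enumerate(can_see):
    print(i, [j for j, visible in enumerate(row) if visible])
```

Sampling a new order for every training example means that, over time, each word is predicted from many different subsets of its context, both to its left and to its right.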
Among its many interesting aspects, XLNet uses discriminative fine-tuning (layer-wise learning rate decay) when fine-tuning on SQuAD (thanks to Thomas Wolf for pointing this out).
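As a quick illustration of layer-wise learning rate decay, the sketch below assigns the top layer the full learning rate and multiplies it by a decay factor for every layer further down. The function name and the numbers are made up for the example and are not the settings from the XLNet paper.

```python
def layerwise_learning_rates(num_layers, top_lr, decay):
    """Return one learning rate per layer, from the input layer (index 0) to the top.

    The top layer gets top_lr; each layer below it is scaled by decay once more.
    """
    return [top_lr * decay ** (num_layers - 1 - layer)
            for layer in range(num_layers)]

# Illustrative values only.
for layer, lr in enumerate(layerwise_learning_rates(4, top_lr=1e-3, decay=0.5)):
    print(f"layer {layer}: lr = {lr:.6f}")
```

The lower layers, which tend to hold more general features, are thus updated more gently than the task-specific layers near the output.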
NAACL and ICML 🏛
You can read this blog post for some of my highlights from NAACL. Ted Pedersen summarized important points from discussions about the nature of SemEval. David Abel again extensively summarized ICML sessions, mostly focusing on RL.
If you are going to any conference this year, consider writing a summary both for yourself and the rest of us. There are so many parallel sessions these days that everyone's experience and highlights will be different.
Articles and blog posts 📰
Approximating Wasserstein distances with PyTorch ✍️ This blog post by Daniel Daza introduces the optimal transport problem. It describes how the problem can be solved using Sinkhorn iterations and how these can be computed in PyTorch. Sinkhorn iterations have recently been used in different settings, such as bilingual lexicon induction, and are a generally useful paradigm for determining the distance between two distributions in a differentiable way.
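For a feel of what Sinkhorn iterations do, here is a small sketch in plain Python rather than PyTorch. It is not the code from the blog post: the function name, the regularisation strength `eps`, and the toy histograms are assumptions made for the example.

```python
import math

def sinkhorn(a, b, cost, eps=0.5, n_iters=200):
    """Entropy-regularised optimal transport between histograms a and b.

    Returns the transport plan and its cost. Illustrative sketch only.
    """
    n, m = len(a), len(b)
    # Gibbs kernel: cheap moves get a large weight, expensive moves a small one.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    dist = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, dist

a = [0.2, 0.8]                     # source histogram
b = [0.6, 0.4]                     # target histogram
cost = [[0.0, 1.0], [1.0, 0.0]]    # cost of moving mass between bins
plan, dist = sinkhorn(a, b, cost)
print(plan, dist)
```

Every step is a multiplication, division, or sum, which is why the same loop written with tensors can be backpropagated through. A smaller `eps` brings the result closer to the unregularised transport cost but typically needs more iterations.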
Learning to Drive Smoothly in Minutes 🏎 An RL approach applied to a small racing car. It combines feature extraction (with a VAE) and RL (Soft Actor-Critic) with a realistic reward function (safety driver feedback). It also adds a trick to avoid shaky behavior.
An End-to-End Speech-to-Speech Translation Model 🗣 A blog post about Translatotron, a new attentive sequence-to-sequence model by Google that translates speech-to-speech without using an intermediate text representation. The model, however, is still weaker (in terms of BLEU) than the standard pipeline approach.
Introducing Metadata Enhanced ULMFiT 📰 A case study on how incorporating metadata into models (by prepending the information to the input with special tokens) can enhance the performance of state-of-the-art text classification models such as ULMFiT. Here's the follow-up post that demonstrates even bigger performance gains.
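The prepending trick can be sketched in a few lines. The `xxmeta_` token format and the function name below are invented for illustration and may well differ from what the post actually uses.

```python
def prepend_metadata(text, metadata):
    """Prefix the text with one special token plus value per metadata field.

    The "xxmeta_<field>" token format is a hypothetical choice for this sketch.
    """
    prefix = " ".join(f"xxmeta_{key} {value}" for key, value in metadata.items())
    return f"{prefix} {text}" if prefix else text

print(prepend_metadata("battery died after a week", {"brand": "acme", "category": "phones"}))
```

The classifier then sees the metadata as ordinary tokens at the start of the sequence, so no change to the model architecture is needed.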
Collaboration & Credit Principles 👨‍👩‍👧‍👦 Christopher Olah makes recommendations on how we can build trust when we collaborate. His most important pieces of advice are:
Be generous.
Use author contribution statements.
Put "author order not finalized" if it hasn't been.
A Transformer Chatbot Tutorial with TensorFlow 2.0 💬 This article shows how to preprocess the Cornell Movie-Dialogs Corpus using TensorFlow Datasets, how to implement MultiHeadAttention, and how to implement a Transformer with the Functional API.
The Best and Most Current of Modern Natural Language Processing 📝 Victor Sanh from HuggingFace provides an up-to-date list of some of the cutting-edge papers in NLP.
The ICML 2019 Code-at-Submit-Time Experiment 🖥 The program chairs of ICML 2019 ran a successful experiment in which they encouraged voluntary code submission at submission time. Overall, about 36% of submitted papers and 67% of camera-ready papers provided code.
Goodhart’s Law: Are Academic Metrics Being Gamed? 👩‍🏫 This post on The Gradient by Michael Fire finds that traditional citation measures have become targets and that by making papers shorter and collaborating with more authors, researchers are able to produce more papers in the same amount of time.
Papers + blog posts
Papers should be accompanied by blog posts (see Rachel Thomas' advice in the last issue). Here are some particularly well-written paper + blog post combinations. The first two get bonus points for including code samples of the proposed method.
1000x Faster Data Augmentation (paper) Population Based Augmentation is a new algorithm that quickly learns how to augment data during training with 1000x less compute compared to SOTA approaches such as AutoAugment. The key idea is to use population based training to generate an augmentation schedule for each training epoch.
Targeted Dropout (paper) A new method for training a neural network so that it is robust to later pruning. Instead of dropping units randomly, weights are selected based on a sparsity criterion. Weights that frequently occur in the dropped sets can later be easily pruned.
Language, trees, and geometry in neural networks (paper) An analysis of some of the properties of parse tree embeddings both theoretically and in BERT.
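To make the Targeted Dropout idea above more tangible, here is a simplified sketch on a flat list of weights. It assumes the sparsity criterion is weight magnitude and uses hypothetical parameter names. It only illustrates the selection step and is not the paper's implementation.

```python
import random

def targeted_weight_dropout(weights, target_frac=0.5, drop_prob=0.5, seed=0):
    """Drop only among the smallest-magnitude weights.

    target_frac: fraction of weights that are candidates for dropping.
    drop_prob: probability of zeroing each candidate in this training step.
    Both names are assumptions for this sketch.
    """
    rng = random.Random(seed)
    k = int(len(weights) * target_frac)
    # Candidates are the k weights with the smallest absolute value.
    candidates = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    dropped = list(weights)
    for i in candidates:
        if rng.random() < drop_prob:
            dropped[i] = 0.0
    return dropped, set(candidates)

weights = [0.9, -0.05, 0.4, 0.01]
dropped, candidates = targeted_weight_dropout(weights, target_frac=0.5, drop_prob=1.0)
print(dropped, candidates)
```

Because the large weights are never dropped during training, the network learns not to rely on the small ones, which is what makes removing them afterwards cheap.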
Paper picks 📄
Check out this blog post for some of my favourite papers from NAACL 2019.