Bigger vs. smaller models, powerful vs. dumb models
Hi all,
The theme of this newsletter is juxtapositions: training ever bigger models (GPT-2 8B) vs. making models smaller (via distillation or compression); powerful models (see Tools ⚒) vs. dumb models à la Clever Hans, i.e. models that only appear to be able to perform complex tasks (see Articles and blog posts 📰). Besides these themes, there are as always many other interesting tools, blog posts, and papers.
Contributions 💪 If you have written or have come across something that would be relevant to the community, hit reply on the issue so that it can be shared more widely.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.
If you were referred by a friend, click here to subscribe. If you enjoyed this issue, give it a tweet 🐦.
Slides 🖼
RepL4NLP 2019 Speaker slides 👩🏫 The slides from the invited talks at the 4th Workshop on Representation Learning for NLP by Marco Baroni, Mohit Bansal, Raquel Fernandez, and Yulia Tsvetkov.
Talks 🗣
ACL 2019 talks 👨👩👧👦 Talks from all sessions at ACL 2019 are available online. If you missed the conference, this is a great chance to catch up with the latest research.
Resources 📚
A Selective Overview of Deep Learning 🤖 An overview of deep learning, from common models and training techniques to more recent directions such as investigations into depth, over-parametrization, and the generalization power of neural networks.
Bigger models, smaller models 💗
As outlined in a recent blog post expanding on the NAACL 2019 Transfer Learning tutorial, we can expect pretrained models to continue getting bigger. Researchers from NVIDIA provide the latest milestone in this line with GPT-2 8B, a language model that has 8.3B parameters, 24x the size of BERT and 5.6x the size of GPT-2. However, simply scaling up these models does not directly translate to significant downstream improvements: GPT-2 8B performs slightly better than GPT-2 on WikiText-103, but worse on the Lambada dataset.
In light of these enormous model sizes, we are seeing more approaches that look to make gargantuan models smaller and usable without a data centre full of GPUs. This post by Rasa's Sam Sucik gives a nice overview of model compression, including quantization, pruning, and distillation, and also discusses how quantization can be used in TensorFlow Lite.
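If you want to try the quantization route yourself, here is a minimal sketch of post-training quantization with the TensorFlow Lite converter; the saved-model path is a placeholder, and the snippet follows the TensorFlow Lite docs rather than Sam's post.

```python
import tensorflow as tf

# Load a trained model (the path is a placeholder) and convert it to
# TF Lite with default post-training quantization, which stores the
# weights in 8 bits and shrinks the model considerably.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Write the quantized model to disk; it can then be run with the
# TF Lite interpreter on mobile or edge devices.
with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```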
For distillation, Dima Shulga provides a clear example of how a big model (BERT) can be distilled into a much smaller model (logistic regression). The resulting model is a lot better than a logistic regression model trained from scratch and only slightly worse than the fine-tuned BERT.
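To get a feel for the mechanics, here is a minimal sketch of the simplest distillation variant, in which the student is trained on the teacher's predictions instead of the gold labels. The `bert_predict_proba` helper is a hypothetical stand-in for the fine-tuned BERT, and the tiny dataset is purely illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the fine-tuned BERT teacher: in practice this
# would return BERT's predicted probability of the positive class.
def bert_predict_proba(texts):
    return np.array([0.9 if ("great" in t or "loved" in t) else 0.1 for t in texts])

texts = ["great movie", "terrible plot", "loved it", "waste of time"]

# Featurize the texts for the small student model.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Simplest distillation variant: train the student on the teacher's
# predictions (thresholded to hard pseudo-labels) instead of gold labels.
teacher_labels = (bert_predict_proba(texts) > 0.5).astype(int)
student = LogisticRegression().fit(X, teacher_labels)

print(student.predict(vectorizer.transform(["loved the movie"])))
```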
Tools ⚒
GluonNLP 🐵 The new version of GluonNLP, an NLP package for MXNet, features a BERT Base comparable with the original BERT Large, specialized BERT versions, new models (ERNIE, GPT-2, ESIM, etc.), and more datasets.
GPT-2 774M 🤖 OpenAI released a bigger version of their GPT-2 language model. They also discuss lessons from coordinating with the research community on publication norms.
XLM 🌍 It is not only monolingual models that are getting more powerful, but also cross-lingual ones: new pretrained cross-lingual language models that outperform multilingual BERT are now available in 100 languages.
spacy-pytorch-transformers 🤖 If you're using spaCy and have been waiting to incorporate pretrained models in your applications, then look no further than spacy-pytorch-transformers. It allows you to use models like BERT in spaCy by interfacing with Hugging Face's PyTorch implementations. The library also aligns the transformer features with spaCy's linguistic tokenization, so you can apply the features to the actual words, instead of just wordpieces.
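A rough usage sketch, assuming you have installed the package and one of the BERT-based models from the announcement; the model name `en_pytt_bertbaseuncased_lg` is my assumption from the release notes, so check the repo for the exact names.

```python
import spacy

# Assumes `pip install spacy-pytorch-transformers` and the BERT-based
# model from the announcement have been installed beforehand; the model
# name below is an assumption, not verified against the package.
nlp = spacy.load("en_pytt_bertbaseuncased_lg")
doc = nlp("Transformer features are aligned with spaCy's tokens.")

# Because the features are aligned to real words (not wordpieces),
# per-token vectors and document similarity work as usual.
print(doc[0].vector.shape)
print(doc.similarity(nlp("Wordpieces stay hidden behind the usual API.")))
```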
YouTokenToMe 📝 Speaking of tokenization: wordpieces are great, but can be quite slow at times, particularly when they are learned on very large corpora. YouTokenToMe is an unsupervised text tokenizer that implements byte pair encoding and is much, much (up to 90x) faster in training and tokenization than both fastBPE and SentencePiece.
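A minimal sketch of the API; the file paths and vocabulary size are placeholders.

```python
import youtokentome as yttm

# Train a BPE model on a raw text corpus; the paths and vocabulary size
# are placeholders for your own data.
yttm.BPE.train(data="train.txt", vocab_size=5000, model="bpe.model")

# Load the trained model and tokenize new text into subword ids or pieces.
bpe = yttm.BPE(model="bpe.model")
print(bpe.encode(["tokenization can be fast"], output_type=yttm.OutputType.ID))
print(bpe.encode(["tokenization can be fast"], output_type=yttm.OutputType.SUBWORD))
```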
Blackstone 💎 If you are working with spaCy and legal documents, then Blackstone is for you. It is a spaCy model and library for processing long-form, unstructured legal text. As far as I'm aware, it is the first open-source model trained for use on legal text, so should be a great starting point if you're working in this area.
Snorkel v0.9 🏊♂️ If you generally want to get more out of your data, then you should take a look at the new version of Snorkel, the state-of-the-art toolkit for programmatically building and managing training datasets. It introduces a unified, modular framework that should allow you to manage your training data and leverage weak supervision a lot more easily.
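To give a flavour of the programmatic approach, here is a small sketch with two toy labeling functions whose noisy votes are combined by Snorkel's label model. The import paths follow the v0.9 tutorials as far as I recall, so treat them as assumptions and double-check against the docs.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier, LabelModel

ABSTAIN, SPAM, HAM = -1, 1, 0

# Labeling functions encode labelling heuristics as code instead of
# hand-annotated labels.
@labeling_function()
def lf_contains_link(x):
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    return HAM if len(x.text.split()) < 5 else ABSTAIN

df = pd.DataFrame({"text": ["check out http://spam.example", "thanks!", "free money http://x"]})

# Apply the labeling functions and combine their noisy votes into
# training labels with the label model.
applier = PandasLFApplier([lf_contains_link, lf_short_message])
L_train = applier.apply(df)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train)
print(label_model.predict(L_train))
```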
TensorFlow Text 📖 Textual data becomes a first-class citizen in TensorFlow 2.0 with TensorFlow Text, a collection of text-related classes and ops, including preprocessing ops that run directly as part of the TensorFlow graph.
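Here is a small sketch of what in-graph preprocessing looks like; the whitespace tokenizer is just one example of the available ops.

```python
import tensorflow as tf
import tensorflow_text as text

# Tokenization runs as regular TensorFlow ops, so it can live inside the
# same graph (or tf.function) as the model itself instead of a separate
# preprocessing step.
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(
    ["Text is a first-class citizen.", "No separate preprocessing step."]
)
print(tokens.to_list())  # a RaggedTensor of UTF-8 tokens per input string
```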
Articles and blog posts 📰
NLP's Clever Hans Moment has Arrived 🐴 Benjamin Heinzerling reviews a recent ACL 2019 paper showing that BERT only exploits superficial cues on an argument reasoning comprehension task and discusses what this means for NLP more broadly. He suggests that in order to prevent our models from exploiting superficial cues, we need to improve and ablate our datasets and consider the consistency of model predictions.
The Illustrated GPT-2 🌈 If you want to deeply understand how GPT-2 and other Transformer language models work, then this post by Jay Alammar (author of The Illustrated Transformer) is your best bet. It provides superb visuals and crisp and clear explanations of the inner workings of the Transformer.
Becoming One With the Data 🧘♀️ This blog post by Sayak Paul focuses on important aspects of the data science and machine learning process, in particular familiarizing yourself with the data, transforming the raw dataset, deriving insights from the data, and establishing human baselines.
NAACL ’19 Notes: Practical Insights for NLP Applications 👩💻 In this (so far) two-part series aimed at NLP practitioners, Nikita Zhiltsov gives an overview of topics ranging from transfer learning and cross-lingual representations to text similarity, classification, NLG and sequence labelling (Part II).
The Real Challenge of Real-World Reinforcement Learning: The Human Factor 👤 In this extensive post, Stefan Riezler discusses how we can involve humans in RL for NLP. The main approaches are counterfactual learning and estimating reward from human feedback.
How I became a machine learning practitioner 👨💻 OpenAI's Greg Brockman muses on what it took for him to become an ML practitioner. Importantly, his "biggest blocker was a mental barrier—getting ok with being a beginner again".
Open-source Xenophobic Tweet Classifier ⚠️ This post by Abraham Starosta and Tanner Gilligan discusses the process of creating a xenophobic tweet classifier, from creating a dataset, to building and evaluating the model.
A Discussion of Adversarial Examples Are Not Bugs, They Are Features 🐛 This discussion article on Distill is a great example of rigorous scientific discourse, featuring six comments on the original paper that address misunderstandings, clarify claims, replicate results, and outline new directions.
Generic Neural Elastic Search: From bert-as-service and Go Way Beyond 🏛 Han Xiao not only highlights important trends in current ML, such as the increasing focus on pretrained models and end-to-end applications, but also describes a generic neural-net-enabled IR system that can scale to an arbitrary number of encoders.
Creating a Pop Music Generator with the Transformer 🎶 Andrew Shaw outlines in this post how you can train a Transformer model to generate (pop) music. He converts music files into a sequence of tokens (music notes) and then treats the generation task as language modelling. You can try the model out here.
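Stripped of the music-specific details, the core idea is plain next-token prediction over note events. Here is a toy PyTorch sketch; the vocabulary and the tiny LSTM are illustrative stand-ins, not Andrew's actual pipeline.

```python
import torch
import torch.nn as nn

# Toy vocabulary of note events; a real pipeline would encode pitch,
# duration, and timing from MIDI files into tokens like these.
vocab = ["<pad>", "C4", "E4", "G4", "C5", "rest"]
stoi = {tok: i for i, tok in enumerate(vocab)}

# A tiny language model over note tokens (an LSTM here to keep the
# sketch short; the post itself uses a Transformer).
embed = nn.Embedding(len(vocab), 32)
lstm = nn.LSTM(32, 64, batch_first=True)
head = nn.Linear(64, len(vocab))

# One step of next-token prediction, exactly as in text language modelling.
seq = torch.tensor([[stoi["C4"], stoi["E4"], stoi["G4"], stoi["C5"]]])
hidden, _ = lstm(embed(seq))
logits = head(hidden)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, len(vocab)), seq[:, 1:].reshape(-1)
)
loss.backward()
```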
Papers + blog posts 📑
What makes a good conversation? How controllable attributes affect human judgments (paper) Abigail See provides an extensive discussion of her paper in which she uses chitchat dialogue as a setting to control attributes of generated text and evaluate the conversational quality.
On the Variance of the Adaptive Learning Rate and Beyond (paper) Although not written by the paper's authors, this blog post gives a nice overview of Rectified Adam (RAdam), which aims to reduce the variance of Adam in the early stages of training, with experiments using fast.ai.
Paper picks 📄
Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems We know that current metrics for natural language generation (NLG) evaluation such as BLEU or ROUGE do not work very well, as they do not capture the diversity of generated text. This paper takes a refreshingly novel approach: instead of evaluating a single turn of a dialogue model, it evaluates multiple turns in a self-play scenario where the dialog system talks to itself. A combination of proxies such as sentiment and semantic coherence on the conversation trajectory serves as the metric, which is shown to correlate better with the human-rated quality of the dialogue agent than past metrics do. Overall, this paper takes a first step towards what might very well become the next stage in NLG evaluation: multi-turn evaluation.
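To make the setup concrete, here is a rough sketch of the self-play loop; the bot interface and the proxy functions are hypothetical stand-ins, and the paper combines its proxies more carefully than a plain average.

```python
# Hypothetical interfaces: `bot(history)` returns the next utterance given
# the conversation so far, while `sentiment(text)` and `coherence(prev, curr)`
# return proxy scores in [0, 1].
def self_play_score(bot, sentiment, coherence, opening="Hi!", num_turns=10):
    # Let the dialog system talk to itself for a fixed number of turns.
    history = [opening]
    for _ in range(num_turns):
        history.append(bot(history))

    # Score the whole trajectory with per-turn proxies and combine them
    # (a plain average here; the paper uses a more careful combination).
    sent = sum(sentiment(turn) for turn in history) / len(history)
    coh = sum(coherence(a, b) for a, b in zip(history, history[1:])) / (len(history) - 1)
    return 0.5 * sent + 0.5 * coh
```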
Language as an Abstraction for Hierarchical Deep Reinforcement Learning One of the most distinguishing features of human language is its compositionality, which many NLP models try, but so far mostly fail, to exploit. Nevertheless, the compositional structure of language is useful as an abstraction to improve generalization. This paper leverages this in an interesting way for hierarchical RL by learning an instruction-following low-level policy and a high-level policy that can reuse the abstractions across tasks. Another benefit of using language as an abstraction is that it may be more interpretable than other compositional structures, which would facilitate human-machine interaction.