GPT-2, Sequence generation in arbitrary order
Hi all,
This newsletter's spotlight topics are GPT-2, OpenAI's recent language model, and sequence generation in arbitrary order. Besides these, there are again lots of resources, tools, articles, blog posts, and papers to explore.
Some personal news 📰 I have defended my PhD and joined Google DeepMind in London. I'm planning to continue writing this newsletter every month, but future editions might be more compact.
Contributions 💪 If you have written or have come across something that would be relevant to the community, hit reply on the issue so that it can be shared more widely.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.
If you were referred by a friend, click here to subscribe. If you enjoyed this issue, give it a tweet 🐦.
Slides and talks 🗣
Integrating Domain-Knowledge into Deep Learning 🏛 Russ Salakhutdinov's slide deck discusses how we can incorporate domain knowledge into model architectures and learning algorithms, with many examples from reading comprehension.
Artificial Intelligence needs all of us 👩👱‍♂️ An insightful TEDx talk from fast.ai's Rachel Thomas that explains why AI should be accessible to all of us.
GPT-2
Last month, OpenAI released GPT-2 (short for Generative Pretrained Transformer-2), a new language model building on their previous approach. This one is—you guessed it—bigger and trained on a lot more data (by about a factor of 10). The model achieved state-of-the-art results on a number of language modelling datasets (to be taken with a large grain of salt: the comparable state-of-the-art models are generally trained only on the corresponding training data). The model also showed impressive zero-shot results on a number of tasks (with some clever tricks, such as appending "TL;DR:" to the input for summarization). These results follow the recent trend of ever larger pretrained language models, which most recently featured Google's BERT. What personally excited me were the long-form (but cherry-picked) samples generated by the model, such as an article about unicorns in the Andes (the first sample in the blog post). These samples look extremely coherent—if you're not concentrating.
OpenAI decided not to release the parameters of the pretrained model, citing its potential for malicious use (such as generating fake news, impersonation, etc.). This decision—which was accompanied by news articles from The Verge, Wired, The Register, and others with fear-mongering headlines such as "The AI Text Generator That's Too Dangerous to Make Public"—sparked controversy online. Many ML and NLP experts such as Anima Anandkumar, Delip Rao, Jeremy Howard, Hugh Zhang, Zachary Lipton, Robert Munro, Ryan Lowe, and Oren Etzioni took positions in dedicated posts, which are well worth reading if you are interested in the potential for malicious use of the current level of NLP technology. TWiML&AI's Sam Charrington also hosted a panel on the controversy that is worth listening to.
Personally, I think openness is critical for AI's continued progress (in terms of accessing papers, sharing data, replicating experiments, and releasing models). Going against this openness prevents good actors from developing defenses and prevents the research community from better understanding the model. It also sets a precedent that will slow progress. Having a discussion about malicious use cases is useful, but we are missing crucial information if we are not allowed to evaluate and assess this potential. I hope OpenAI continues to engage with the community and that this won't be the end of the conversation.
Besides malicious use, bias remains an issue even with these very big models. Until GPT-2 is fully released or someone reproduces it, you can play around with the smaller released version of GPT-2, which HuggingFace has already incorporated into their pretrained BERT framework.
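If you want to try the "TL;DR:" trick yourself with the smaller released model, a minimal sketch might look like the following. This is written against HuggingFace's current API (method names differ across library versions), and the article text is a placeholder:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the publicly released small GPT-2 model (117M parameters).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Zero-shot summarization trick: append "TL;DR:" to the article
# and let the language model continue from there.
article = "A herd of unicorns was discovered in a remote valley ..."  # placeholder
input_ids = tokenizer.encode(article + "\nTL;DR:", return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=input_ids.shape[1] + 60,  # up to 60 generated tokens
        do_sample=True,
        top_k=40,  # top-k sampling, as used for the GPT-2 samples
    )

# Print only the continuation, i.e. the "summary".
print(tokenizer.decode(output[0][input_ids.shape[1]:]))
```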
Sequence generation in arbitrary order
If multiple research labs come up with a similar idea concurrently, then it's often worth taking a closer look. One such recent idea is generating text in arbitrary order. With the advent of BiLSTMs, encoders could process text both from left to right and in reverse order. More recently, self-attention models such as the Transformer do not prescribe any particular order at all, but enable looking at all relevant words at once. However, decoders are still required to generate text one token at a time from left to right. Let's look at the three recent papers on this topic in more detail:
Insertion Transformer: Flexible Sequence Generation via Insertion Operations Stern et al. propose a Transformer that can insert tokens at arbitrary positions during decoding; training with a balanced binary tree order additionally allows many tokens to be generated in parallel.
Non-Monotonic Sequential Text Generation Welleck et al. propose a method that generates a word at an arbitrary position and then recursively generates words to its left and right in a binary tree. The model is trained with imitation learning.
Insertion-based Decoding with Automatically Inferred Generation Order Gu et al. propose to learn the generation order in a Transformer by modelling it as a latent variable.
While they differ in how they execute the method, all three follow the same core idea: allowing an arbitrary generation order via insertion operations, often structured as some form of binary tree. These ideas are also similar to non-autoregressive NMT, an exciting approach from last year that proposed to generate all output words of a sentence in parallel. A related idea is to generate a sentence and then edit it iteratively (Guu et al., TACL 2018; Wang et al., ACL 2018). We'll likely see more approaches that experiment not only with how a model should process its input, but also with how it should produce its output. Some of these might be closer to the way humans write text—for instance, by starting with a main message or a sketch and then iteratively expanding on it—and might enable novel interactive applications.
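To make the shared idea concrete, here is a toy greedy insertion decoder. This is a sketch of the general technique, not any paper's actual method; score_fn is a hypothetical stand-in for a trained insertion model:

```python
import random

def insertion_decode(score_fn, vocab, max_steps=20):
    """Toy insertion-based decoding.

    At each step, score inserting every vocabulary token into every
    slot of the partial output and apply the best-scoring insertion.
    The three papers differ mainly in how the insertion model is
    trained (binary tree supervision, imitation learning, or a
    latent generation order).
    """
    seq = []
    for _ in range(max_steps):
        candidates = [(pos, tok) for pos in range(len(seq) + 1) for tok in vocab]
        pos, tok = max(candidates, key=lambda c: score_fn(seq, *c))
        if tok == "<eos>":  # the model decides the sequence is complete
            break
        seq.insert(pos, tok)
    return seq

# With a random scorer standing in for a trained model:
print(insertion_decode(lambda seq, pos, tok: random.random(),
                       vocab=["a", "cat", "sat", "<eos>"]))
```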
Resources 📚
awesome-machine-learning-interpretability 📄 A list of ML interpretability resources including a nice flowchart on which method to use for which use case.
MIT Science of Deep Learning Course 👩‍🎓 The course aims to bridge the theory and practice of deep learning. It is ongoing, and new detailed lecture notes are uploaded regularly.
How Google Fights Disinformation 🗞 A 30-page white paper on how Google fights disinformation across Google Search, Google News, YouTube, and Google Ads. For a digest, read Dare Obasanjo's Twitter thread.
The Definitive ‘what do I ask/look for’ in a PhD Advisor Guide 👩‍🏫 Andrew Kuznetsov provides a useful guide on what questions to ask while interviewing for a PhD program.
Data Visualization—A practical introduction 📊 In this free online book, Kieran Healy introduces the ideas and methods of data visualisation in a lucid way.
Tools ⚒
Lingvo 🛠 A TensorFlow framework for sequence modelling ("lingvo" means "language" in Esperanto) that started out with a focus on NLP; it also supports distillation, GANs, and multi-task models. Many recent state-of-the-art NLP and speech papers have been implemented in Lingvo.
Predicting Movie Review Sentiment with BERT 🍿 A Colab notebook that gets you started predicting movie review sentiment with BERT.
scispacy 🔬 A Python package containing spaCy models for processing biomedical, scientific or clinical text.
mindsdb 🏛 A Python framework that strives for simplicity, certainty, and explainability in training neural networks. It enables users to train models and provides information that helps them understand when they can trust a model's predictions.
LIGHT 🎮 A large-scale fantasy text adventure game research platform for training agents that can both talk and act, interacting either with other models or with humans.
Giant Language model Test Room 👩‍🔬 A tool that visualizes the footprint of a language model on input text, helping to detect whether a text might be machine-generated rather than written by a human.
Articles and blog posts 📰
10 breakthrough technologies 🔬 MIT features 10 breakthrough technologies in 2019—according to Bill Gates. One of them covers "smooth-talking AI assistants" that can perform conversation-based tasks such as booking a restaurant reservation or coordinating a package drop-off. The post estimates that they will be available in 1-2 years, which seems reasonable for narrow domains.
How to Choose Your First AI Project 🤖 Andrew Ng gives valuable tips on how a company should choose its first AI project, such as choosing an initial project that can be done quickly and has a high chance of success in order to get the flywheel turning as soon as possible.
Neural Language Understanding of People’s Names 📛 Matthew Henderson describes the approach PolyAI uses to track people's names in contact centre conversations and how they went from using a list of names to an end-to-end neural approach.
Beyond Local Pattern Matching: Recent Advances in Machine Reading ⁉ Peng Qi and Danqi Chen reflect on the rapid progress the NLP community has been making in teaching machines to read and answer questions, and discuss recent datasets and potential future directions.
Character Level NLP 🤖 An in-depth blog post about the advantages and drawbacks of working at the character level in NLP.
In Favor of Developing Ethical Best Practices in AI Research 📑 An opinion piece by Stanford AI researchers on promoting ethical best practices, in particular to avoid unintended negative consequences of their work.
Your Next Game Night Partner? A Computer 🎯 This article describes AI2's recent agent, which plays a Pictionary-style game collaboratively with a human partner. Unlike automated players in board games like chess or Go, AI2’s player communicates using pictures, phrases, and concepts. You can play the game yourself here.
Meta-Learning in 50 Lines of JAX 💻 Eric Jang shows how the MAML meta-learning algorithm can be implemented in about 50 lines of Python code using the JAX library.
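The core trick is that jax.grad can differentiate through the inner gradient step. Here is a condensed sketch of the MAML meta-gradient (not Eric's exact code; the toy regression loss and parameter shapes are assumptions):

```python
import jax
import jax.numpy as jnp

def task_loss(params, x, y):
    # Toy linear-regression loss; stands in for any task loss.
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

def inner_update(params, x, y, alpha=0.1):
    # One SGD step on a task's support set.
    grads = jax.grad(task_loss)(params, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - alpha * g, params, grads)

def maml_loss(params, x_s, y_s, x_q, y_q):
    # Adapt on the support set, then evaluate on the query set.
    # Taking the gradient of this function w.r.t. params backpropagates
    # through the inner update, which is exactly the MAML objective.
    return task_loss(inner_update(params, x_s, y_s), x_q, y_q)

meta_grad = jax.jit(jax.grad(maml_loss))

# Hypothetical usage with dummy data:
params = {"w": jnp.zeros(3), "b": jnp.zeros(())}
x = jax.random.normal(jax.random.PRNGKey(0), (5, 3))
y = jnp.ones(5)
grads = meta_grad(params, x, y, x, y)
```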
Seven Myths in Machine Learning Research 🔭 Oscar Chang discusses seven ML myths, such as "ML researchers do not use the test set for validation".
Learning through Auxiliary Tasks 👩‍🎓 In this post, Vivien gives an overview of learning with auxiliary tasks and introduces a simple gradient-based approach that outperforms comparable multi-task approaches and mitigates negative transfer.
Exploring BERT's Vocabulary 🇩🇪🇬🇧🇫🇷 Judit Ács analyzes BERT's multilingual word piece vocabulary—in particular, the consequences of using a shared word piece vocabulary across many languages.
Working AI: In the Lab with NLP PhD Student Abigail See 👩‍🏫 An insightful interview with Abigail See on her daily routine, tech stack, teaching CS224n, and more.
Yann LeCun Cake Analogy 2.0 🎂 An update to Yann LeCun's infamous cake analogy from NIPS 2016. In the new image, unsupervised learning is replaced by self-supervised learning. Language modelling, for instance, can be considered an example of self-supervised learning.
Conversational AI ‒ but where is the I? 💬 Nikolai Rozanov argues that simply exhaustively enumerating all possibilities in narrow domains is not enough and that we need to solve the hard problems of conversational understanding.
Paper picks 📄
Deep Learning for Video Game Playing (arXiv 2019) In this extensive review, Justesen et al. describe how recent advances in Deep Learning have been applied to video games. Definitely worth a read if you're interested in reinforcement learning or video games.
An Empirical Study of Example Forgetting during Deep Neural Network Learning (ICLR 2019) Toneva et al. analyze when examples are forgotten (i.e. go from being correctly to incorrectly classified) during training, across several datasets. They find that:
certain examples are forgotten with high frequency; others not at all;
a dataset's (un)forgettable examples generalize across architectures;
a significant fraction of examples can be omitted from the training data based on forgetting dynamics.
These results may help us better understand which examples are useful for neural networks and how to mitigate catastrophic forgetting.
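Forgetting events are straightforward to track in your own training loop. A small sketch, assuming you record per-example correctness after each epoch:

```python
import numpy as np

def count_forgetting_events(correct_per_epoch):
    """Count forgetting events per training example.

    correct_per_epoch: bool array of shape (num_epochs, num_examples)
    where entry (t, i) indicates whether example i was classified
    correctly at epoch t. A forgetting event is a transition from
    correct to incorrect between consecutive epochs.
    """
    correct = np.asarray(correct_per_epoch, dtype=bool)
    forgotten = correct[:-1] & ~correct[1:]  # correct at t, wrong at t + 1
    return forgotten.sum(axis=0)

# Toy check: example 0 is never forgotten; example 1 is forgotten once.
accuracy_log = np.array([[True, True],
                         [True, False],
                         [True, True]])
print(count_forgetting_events(accuracy_log))  # -> [0 1]
```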