Accelerating science, memorizing vs learning to look things up, Schmidhuber's 2010s, Greek BERT, ARC, Illustrated Reformer, Annotated GPT-2, oLMpics
Hi all,
This newsletter discusses accelerating science, memorizing vs learning to look things up, and a Schmidhuber-centric view of the last decade. It also features slides on transfer learning and Deep Learning essentials, multiple translation corpora (speech-to-text, comprehensive translations for language learning), a Greek BERT, and ARC. Finally, it includes the blog posts and papers that I particularly enjoyed reading over the past months, including the Illustrated Reformer and the Annotated GPT-2, an analysis of NLP and ML papers in 2019, and oLMpics.
Contributions 💪 If you have written or have come across something that would be relevant to the community, hit reply on the issue so that it can be shared more widely.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.
If you were referred by a friend, click here to subscribe. If you enjoyed this issue, give it a tweet 🐦.
NLP Progress 🔢
Updates during the last month include:
Code for the Mogrifier LSTM (ICLR 2020), SOTA on PTB and WikiText-2, is now online
New SOTA models for text simplification
New dataset on Gendered Ambiguous Pronoun (GAP) resolution
Results from the Dialogue System Technology Challenge 8, SOTA on the Ubuntu IRC data
New reading comprehension datasets in French, Russian, Chinese, and Korean
Accelerating science 🔬
Machine learning is already used across the sciences, from astrophysics to high energy density physics, to train models that mimic the output of slower simulators. Large speedups are common; the challenge is making the models' predictions accurate enough to be useful in practice. Current ML models typically need large amounts of training data, which is expensive to obtain in this setting, as some simulators may take days to produce a single output. The latest approach in this line employs efficient neural architecture search. What is particularly exciting is that the resulting models need only a few thousand training examples, and in one case (a global aerosol-climate model) only a few dozen.
Memorizing vs learning to look things up 📖
If we want to recall a fact, there are typically two strategies: we either learn it by heart or remember where to find it (such as by querying StackOverflow or searching our paper management system). While the first approach is fast and does not require any additional resources, our memory may occasionally fool us. In contrast, the second approach may take longer but provides us with additional evidence.
Current deep neural networks make a similar trade-off when answering complex questions: they either try to store all knowledge in a huge number of parameters or learn how to retrieve documents to use as evidence. For the first approach, it was recently shown that T5, a huge neural network with 11B parameters, can store enough knowledge in its parameters to outperform previous retrieval-based systems on open-domain question answering tasks. Around the same time, REALM, a model of the second kind, achieved another significant improvement by learning the retrieval mechanism as part of pretraining.
While T5 could be made even bigger to store more facts in its parameters, increasing the number of parameters quickly gets prohibitively expensive. Retrieval also has additional benefits: it is much more interpretable than querying a black box and enables seamless updates of the underlying knowledge corpus.
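To make the retrieval strategy concrete, here is a minimal, purely illustrative retrieve-then-read sketch that uses TF-IDF similarity as the retriever (REALM instead learns a neural retriever jointly with pretraining); the toy documents and question are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy knowledge corpus and question, invented for illustration.
documents = [
    "REALM learns its retriever jointly with masked language model pretraining.",
    "T5 stores factual knowledge implicitly in up to 11B parameters.",
    "Open-domain QA systems answer questions without a given context passage.",
]
question = "How does REALM learn which documents to retrieve?"

# Score each document against the question and keep the best match as evidence.
vectorizer = TfidfVectorizer().fit(documents + [question])
scores = cosine_similarity(vectorizer.transform([question]),
                           vectorizer.transform(documents))
evidence = documents[scores.argmax()]
print(evidence)  # a full system would now pass this passage to a reader model
```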
A decade of Schmidhuber 👨‍🔬
In his second blog post and tweet, Jürgen Schmidhuber (pronounced "You_again Shmidhoobuh") gives a somewhat biased overview of the last decade, focusing mainly on advances enabled by research from his lab. Nevertheless, it is educational to view recent advances through this lens: feed-forward neural networks as limited versions of RNNs and LSTMs; ResNets as a special case of highway networks; and GANs as an application of the Curiosity Principle. It is also useful to remind ourselves that, despite the increasing popularity of attention and Transformers, the LSTM is "arguably the most commercial AI achievement" and has received more citations per year than any other computer science paper of the 20th century.
However, Schmidhuber neglects to mention the importance of attention (he does not cite the milestone attention paper by Bahdanau et al.). He also opines that "today, very few commercial NN applications are still based on unsupervised pre-training", disregarding the recent wave of unsupervised pre-trained models in natural language processing and other domains that power applications such as Google Search.
Talks and slides 🗣
AAAI-20 Tutorial: Recent Advances in Transferable Representation Learning 🤖 In these slides, Muhao Chen, Kai-Wei Chang, and Dan Roth present NLP methods that use retrofitting, joint learning, and self-supervised alignment to learn representations from multilingual and multi-relational data.
Deep Learning Essentials (Part 1, Part 2) 🏛 In this two-part slide deck, Ruslan Salakhutdinov gives an overview of deep learning fundamentals, from supervised learning to deep generative models.
Resources and datasets 📚
CoVoST 💬 CoVoST is a diverse multilingual speech-to-text translation corpus by Facebook that includes speech in 11 languages (French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese), their transcripts and English translations. If you are interested in speech-to-text applications or translation of spoken language, then this is a great starting point.
The Missing Semester of Your CS Education 💻 An MIT curriculum that teaches you the tools you need to do computer science in practice: how to master the command line, use a powerful text editor, handle version control, and much more.
CS 287 Advanced Robotics – Fundamental Knowledge 🤖 This exam study handout for Pieter Abbeel's course summarizes, in around 20 pages, the main math behind key RL techniques such as Value Iteration, Policy Iteration, policy gradients, TRPO, Q-learning, and different optimization methods.
Tools ⚒
GreekBERT is the latest BERT variant, this time for the Greek language. It was trained on the Greek Wikipedia, the Greek part of the European Parliament proceedings, and Greek text in Common Crawl.
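If you want to try it out, such a model can typically be loaded with the Hugging Face transformers fill-mask pipeline; in the sketch below, the model identifier is an assumption, so check the official GreekBERT release for the exact name.

```python
from transformers import pipeline

# Model identifier is assumed -- check the GreekBERT release for the exact name.
fill_mask = pipeline("fill-mask", model="nlpaueb/bert-base-greek-uncased-v1")

# "Athens is the [MASK] of Greece." -- the model should rank "capital" highly.
for prediction in fill_mask("Η Αθήνα είναι η [MASK] της Ελλάδας."):
    print(prediction["token_str"], round(prediction["score"], 3))
```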
ARC The Abstraction and Reasoning Challenge, originally proposed by François Chollet and now hosted on Kaggle with a $20,000 prize, tasks models with learning complex, abstract patterns from just a few examples. Some of the rules that need to be learned are quite intricate; you can explore the tasks in ARC with this interactive website.
Articles and blog posts 📰
An Opinionated Guide to ML Research 🗺 John Schulman shares advice on how to choose ML research problems and organize your time. The advice touches on developing good taste for which problems to work on, climbing incrementally towards high goals, knowing when to switch problems, and the importance of personal development.
Curriculum for Reinforcement Learning 👩‍🏫 Lilian Weng explores four ways to use a curriculum to help RL models learn to solve complicated tasks:
using a teacher model to guide a student;
using asymmetric self-play (one agent setting tasks for the other to solve);
automatically generating goals with a GAN;
building the curriculum from latent skills and trajectories in the skill space.
Contrastive Self-Supervised Learning 🤳 Ankesh Anand gives an overview of recent contrastive methods such as Deep InfoMax, Contrastive Predictive Coding, and MoCo, which learn by distinguishing between positive and negative examples (in contrast to generative models), with a focus on models for computer vision.
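Most of these methods optimize some variant of the InfoNCE objective; below is a generic, hedged PyTorch sketch of that loss, not the exact formulation of any one method.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive, negatives, temperature=0.1):
    """Pull each query towards its positive and away from its negatives.
    query, positive: (batch, dim); negatives: (batch, n_neg, dim)."""
    query, positive = F.normalize(query, dim=-1), F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (query * positive).sum(-1, keepdim=True)          # (batch, 1)
    neg = torch.einsum("bd,bnd->bn", query, negatives)      # (batch, n_neg)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(len(query), dtype=torch.long)      # positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 16, 128))
```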
2020 Duolingo Shared Task on Simultaneous Translation And Paraphrase for Language Education 🦉 If you are interested in machine translation and language learning, then consider taking part in this shared task. For language learning, it is often useful to have multiple plausible translations so that learners' responses can be graded against a large set of human-curated acceptable translations. Compared to other datasets, the shared task provides data in 5 language pairs with comprehensive translations. The task is also interesting from a paraphrasing perspective, as high-quality automatic translations of each input sentence are provided, which can be used to generate paraphrases.
Illustrating the Reformer 🦍 An illustrated guide to the Reformer by Alireza Dirafzoon, which provides a nice walk-through of the key ingredients of this efficient Transformer, such as locality-sensitive hashing and reversible layers.
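As a taste of one of those ingredients, here is a short sketch of the angular locality-sensitive hashing idea (a random rotation followed by an argmax over signed projections) that the Reformer uses to bucket similar queries and keys together; this is a simplified illustration, not the paper's full multi-round scheme.

```python
import torch

def lsh_buckets(vectors, n_buckets, seed=0):
    """Assign each vector a bucket so that vectors with high cosine similarity
    tend to share a bucket. vectors: (n, dim); n_buckets must be even."""
    torch.manual_seed(seed)
    rotation = torch.randn(vectors.size(-1), n_buckets // 2)
    rotated = vectors @ rotation                       # (n, n_buckets // 2)
    return torch.cat([rotated, -rotated], dim=-1).argmax(dim=-1)

queries = torch.randn(16, 64)
print(lsh_buckets(queries, n_buckets=8))  # attention is then restricted within buckets
```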
Quantifying Independently Reproducible Machine Learning 🧙‍♂️ Edward Raff gives an account of his efforts to independently reproduce 255 ML papers (which I previously highlighted in this newsletter) and shares his key findings in this article in The Gradient.
ML and NLP Publications in 2019 📑 Marek Rei (with data from Jonas Pfeiffer and Andrew Caines) shares a deep analysis of publication trends at ML and NLP venues in 2019. This year, for the first time, the analysis also includes statistics for individual countries. The data is available here.
How to train a new language model from scratch using Transformers and Tokenizers 🛠 This post shows how to train a new language model from scratch, using a "small" model (a 6-layer Transformer) and Esperanto (a constructed language).
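The recipe roughly looks like the sketch below (file path, vocabulary size, and hyperparameters are illustrative, and the exact API may differ across library versions): train a byte-level BPE tokenizer on the raw corpus, then pretrain a small RoBERTa-style masked language model on top of it.

```python
import os
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig, RobertaForMaskedLM

# 1. Train a byte-level BPE tokenizer on the raw text (file path is illustrative).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["esperanto_corpus.txt"], vocab_size=52_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
os.makedirs("esperberto", exist_ok=True)
tokenizer.save_model("esperberto")

# 2. Configure a small 6-layer RoBERTa-style masked language model.
config = RobertaConfig(vocab_size=52_000, num_hidden_layers=6,
                       hidden_size=768, num_attention_heads=12)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
# The post then plugs model and tokenizer into the Trainer with a
# masked language modelling data collator to run pretraining.
```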
Yoshua Bengio’s blog – first words ✈️ Yoshua Bengio has started blogging (if this doesn't convince you to start a blog, then I don't know what does). His first blog post focuses on the importance of remote presentations to minimize air travel in order to reduce the carbon footprint of the community. You can sign the petition here—I did.
Fundamentals of NLP (Chapter 1): Tokenization, Lemmatization, Stemming, and Sentence Segmentation 📒 A great start to a new series by Elvis Saravia that teaches you basic concepts of NLP in an interactive Colaboratory notebook.
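As a quick taste of those concepts, here is a small NLTK-based sketch (not the notebook's own code) covering sentence segmentation, tokenization, stemming, and lemmatization.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt")
nltk.download("wordnet")

text = "The striped bats were hanging on their feet. Then they flew away."
sentences = nltk.sent_tokenize(text)                 # sentence segmentation
tokens = nltk.word_tokenize(sentences[0])            # tokenization
stems = [PorterStemmer().stem(t) for t in tokens]    # crude suffix stripping
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]  # dictionary-based lookup
print(sentences, tokens, stems, lemmas, sep="\n")
```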
Top Trends of Graph Machine Learning in 2020 📈 This post analyzes 150+ papers on graph neural networks (GNNs) and highlights key trends:
a more solid theoretical understanding of GNNs;
cool new applications of GNNs;
knowledge graphs becoming more popular;
new frameworks for graph embeddings.
The Annotated GPT-2 👩‍💻 Aman Arora follows in the footsteps of The Annotated Transformer to bring us an annotated version of GPT-2 that carefully explains key excerpts of the GPT-2 code base.
Papers + blog posts 📑
BERT, ELMo, & GPT-2: How contextual are contextualized word representations? (Blog post, paper) Kawin Ethayarajh analyses contextual representations in recent models and finds the following:
The representations of all words in all layers occupy only a narrow cone in the embedding space.
Upper layers produce more context-specific representations.
Less than 5% of the variance of contextual embeddings is explained by a static embedding (so embeddings are very contextual).
If we create new static embeddings by taking the first principal component of a word's contextualized representations in a lower layer, the resulting embeddings outperform GloVe and fastText on analogies (a rough sketch of this idea follows below).
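To illustrate that last point, here is a rough sketch of how such a static embedding could be derived, assuming a recent version of the transformers library; the layer choice, example sentences, and the use of a plain SVD as a stand-in for PCA are my own simplifications, not the paper's exact procedure.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def static_embedding(word, sentences, layer=1):
    """Collect a word's contextual vectors across sentences and return the
    top singular vector as a simple stand-in for the first principal component."""
    vectors = []
    word_id = tokenizer.convert_tokens_to_ids(word)   # assumes a single-token word
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).hidden_states[layer][0]   # (seq_len, dim)
        positions = (inputs["input_ids"][0] == word_id).nonzero().flatten()
        vectors.extend(hidden[positions].numpy())
    _, _, vt = np.linalg.svd(np.stack(vectors), full_matrices=False)
    return vt[0]

vec = static_embedding("bank", ["I sat on the bank of the river.",
                                "She deposited the cheque at the bank."])
```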
Polyglot Word Embeddings Discover Language Clusters (Blog post, paper) Shriphani Palakodety shows how multilingual skip-gram representations can be used for unsupervised language identification via clustering. The approach has been applied to analyse text from a refugee crisis and a crisis between two nuclear adversaries. I particularly appreciate the discussion of how to select the number of clusters, which often goes unsaid.
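A toy version of the idea, assuming gensim and scikit-learn and an invented three-language mini-corpus (the real experiments use large mixed-language social media text): train skip-gram embeddings on the mixed corpus and cluster the resulting word vectors.

```python
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Invented mini-corpus with one sentence per language, repeated for more counts.
sentences = [
    "the cat sat on the mat".split(),
    "le chat est sur le tapis".split(),
    "die katze sitzt auf der matte".split(),
] * 100

# sg=1 selects the skip-gram architecture.
model = Word2Vec(sentences, vector_size=50, sg=1, min_count=1, window=3, epochs=20)
words = model.wv.index_to_key
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(model.wv[words])
for word, cluster in sorted(zip(words, clusters), key=lambda x: x[1]):
    print(cluster, word)  # words from the same language should mostly co-cluster
```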
Paper picks 📄
Massive vs. Curated Word Embeddings for Low-Resourced Languages. The Case of Yorùbá and Twi (LREC 2020) This is an extensive case study of how current embedding models (both static and contextual) fare on the extremely low-resource African languages Yorùbá and Twi. For both languages, the authors collect both noisy and clean pretraining data online. For evaluation, they use WordSim-353 (translated) and an NER task. Current models perform comparatively poorly on both languages.
oLMpics - On what Language Model Pre-training Captures The authors propose eight reasoning tasks, which require operations such as comparison, conjunction, and composition to evaluate the capabilities of current pretrained language models. As it is often hard to tell what a probe captures in isolation, they employ zero-shot and control baselines to control for the effect of fine-tuning on the task dataset. Very thoughtful! They find that different LMs have qualitatively different reasoning abilities, e.g. RoBERTa succeeds in tasks where BERT fails. They also find that reasoning abilities are context-dependent (e.g. based on expected scale of numbers) and that current models fail on about half of all tasks. Overall, this is an extensive study and well worth reading.