Experimental results and error reporting, Ethics and NLP, Distillation vol. 2, SemEval 2020

Sep 16, 2019

Hi all,

This edition is about better ways to report experimental results and model errors, ethics and NLP, and how to make your BERT model more efficient. We also cover the SemEval 2020 tasks and a particularly extensive number of resources, including ones on causality and academic job search. You'll find as usual several high-quality blog posts and interesting papers—all topped off with a pinch of magic 🧙‍♂️.

Contributions 💪 If you have written or have come across something that would be relevant to the community, hit reply on the issue so that it can be shared more widely.

I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.

Magic 🧙‍♂️

Any sufficiently advanced technology is indistinguishable from magic. —Arthur C. Clarke

Magic 🧙‍♀️ according to Arthur C. Clarke is still far out of reach with current models—although they may on occasion appear more sophisticated than they actually are (the Clever Hans effect, as discussed in the last newsletter).

In the meantime, we can at least enjoy generated text about magic, such as the delightful Harry Potter 🧙‍♂️–NLP paper 📄 cross-over below, brought to you by Jonathan Fly and GPT-2 fine-tuned on arXiv papers by HuggingFace:

Harry Potter-NLP cross-over (credit: Jonathan Fly)

Reporting errors and experimental results in NLP 🔬

Two recent EMNLP 2019 papers observe that the current standard procedures for reporting experimental results and model errors respectively are flawed and propose improvements.

Error analysis

Standard method: Randomly select 50–100 incorrectly answered questions and roughly label them into N error groups.

Problems: small sample size, subjective, imprecise, true cause is not tested

Proposed solution: Errudite (Wu et al., 2019). Uses a domain-specific language to extract clearly defined attributes from examples of the entire dataset. Counterfactual analysis via rewrite rules. The accompanying blog post provides more details (and a UI, see below).
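
To make the contrast with hand-labelling a small sample concrete, here is a minimal sketch in plain Python of grouping errors by a precisely defined attribute over every example. It does not use Errudite's actual DSL or API; the toy records, the question_type attribute and the error_rate_by_group helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical QA predictions; in practice this would be the full dev set.
examples = [
    {"question": "Who wrote Hamlet?", "gold": "Shakespeare", "pred": "Shakespeare"},
    {"question": "When was Hamlet written?", "gold": "1600", "pred": "1599"},
    {"question": "Who founded Rome?", "gold": "Romulus", "pred": "Remus"},
    {"question": "When did Rome fall?", "gold": "476", "pred": "476"},
]

def question_type(example):
    """Attribute: the first word of the question, lower-cased."""
    return example["question"].split()[0].lower()

def error_rate_by_group(examples, attribute):
    """Return {group: (num_errors, group_size, error_rate)} over all examples."""
    counts = defaultdict(lambda: [0, 0])  # group -> [errors, total]
    for ex in examples:
        group = attribute(ex)
        counts[group][1] += 1
        if ex["pred"] != ex["gold"]:
            counts[group][0] += 1
    return {g: (err, total, err / total) for g, (err, total) in counts.items()}

report = error_rate_by_group(examples, question_type)
for group, (errors, total, rate) in sorted(report.items()):
    print(f"{group}: {errors}/{total} wrong ({rate:.0%})")
```

Because the attribute is a function rather than a judgment call, the same grouping can be recomputed on the whole dataset and for any other model.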

Reporting experimental results

Standard method: Training multiple instantiations of each model, choosing the best model of each type based on validation performance, and comparing their performance on test data.

Problems: the best performance found is a function of the computational budget; comparing models with different budgets for tuning hyperparameters can yield different conclusions

Proposed method: Report expected validation accuracy as a function of hyperparameter tuning budget (Dodge et al., 2019). Recommendations for improving scientific reporting in the form of a checklist (see below).

Presence of checklist items across 50 randomly sampled EMNLP 2018 papers that involved modeling experiments (Dodge et al., 2019).
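
The idea of reporting expected validation performance as a function of tuning budget can be sketched as follows. This is a minimal version based on the empirical distribution of validation scores from a random hyperparameter search, as I understand the estimator; the function name and the toy scores are assumptions, and the paper should be consulted for the exact formulation.

```python
def expected_max_performance(scores, budget):
    """Expected best score among `budget` draws (with replacement)
    from the empirical distribution of observed validation scores."""
    values = sorted(scores)
    n = len(values)
    expected = 0.0
    for i, v in enumerate(values, start=1):
        # P(max of `budget` draws is the i-th smallest value)
        # = P(all draws <= v_i) - P(all draws <= v_{i-1})
        prob = (i / n) ** budget - ((i - 1) / n) ** budget
        expected += v * prob
    return expected

# Hypothetical validation accuracies from four random-search trials.
scores = [0.5, 0.6, 0.7, 0.8]
curve = [expected_max_performance(scores, b) for b in range(1, 5)]
print(curve)
```

With a budget of one trial the estimate equals the mean score, and it rises towards the best observed score as the budget grows, which is what makes curves for two models comparable at equal budgets.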

Ethics and NLP ⚖️

As NLP models become more powerful, we need to be conscious of their societal impact. Ethics and NLP is already starting to play a bigger role in the community, with a dedicated track at ACL 2020. In addition, initiatives such as the Partnership on AI consult on the responsible publication of papers such as the recent CTRL language model by Salesforce. As discussed on Twitter, in order to counteract future ethical crises, ethics cannot just be an afterthought but must be taken into consideration at the inception of a project. In addition, ethics needs to be integrated into the curriculum.

If you want to learn more about ethics and NLP, have a look at these papers:

Leidner & Plachouras (2017). Ethical by Design: Ethics Best Practices for Natural Language Processing. 1st Workshop on Ethics in NLP.
Hovy & Spruit (2016). The Social Impact of Natural Language Processing. ACL 2016.

Distillation vol. 2 💗

Model size of different pretrained language models (in millions of parameters; credit: HuggingFace)

In the last newsletter, we discussed techniques for compressing big models like BERT such as pruning and distillation. Shortly afterwards, multiple approaches came out to make big Transformer models more efficient:

Distilling BERT Models with spaCy: Yves Peirsman distills multilingual BERT fine-tuned on a sentiment analysis dataset into spaCy's convolutional neural networks, similar to Tang et al. (2019).
DistilBERT: Victor Sanh distills BERT-base into a smaller language model that performs similarly on downstream tasks while being faster. The model, however, still requires a lot of compute for pretraining.
Multilingual MiniBERT: Tsai et al. (EMNLP 2019) similarly propose to train a smaller (3-layer) BERT model by distilling multilingual BERT.
Adaptive attention span: Facebook researchers propose an adaptive attention span that makes it more efficient to scale Transformers to long sequences.
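
For readers new to distillation, the core ingredient shared by such approaches can be sketched as a soft-target loss: the student is trained to match the teacher's temperature-softened output distribution. This is a generic textbook-style sketch in plain Python, not the exact objective of any of the works above; the function names, the temperature value and the toy logits are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

teacher = [2.0, 0.5, -1.0]
matching_student = [2.0, 0.5, -1.0]
mismatched_student = [-1.0, 0.5, 2.0]
print(distillation_loss(matching_student, teacher))
print(distillation_loss(mismatched_student, teacher))
```

A higher temperature flattens both distributions, so the student also receives signal about how the teacher ranks the non-top classes. In practice this term is usually combined with the ordinary supervised loss.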

SemEval 2020 🗣

The tasks for the International Workshop on Semantic Evaluation (SemEval) 2020 have been announced. If you are unsure what to work on or want to tackle challenging problems, then these are a great starting point, as most of them provide reliable (and often novel) datasets. They range from analyzing memes to selecting what part of a text should be emphasized and cover the following areas:

Lexical Semantics (Semantic Change Detection, Cross-lingual Lexical Entailment, Word Similarity in Context)
Common Sense (Common Sense Explanation, Counterfactual Detection, Extracting Textbook Definitions)
Humour (Humour in News Headlines, Analysis of Memes, Sentiment Analysis on Code-Mixed Data, Emphasis Selection for Written Text in Visual Media)
Societal Applications (Propaganda Detection, Offensive Language Detection)

Slides 🖼

How do we get to general purpose NLU? 🤖 Emily Bender argues that models that are trained only on the form of language (e.g. via language modelling) won't learn meaning. Instead, we need to pay attention to linguistic structure and how language is used.

Deep Learning Indaba 2019 NLP session 🌍 Slides of all talks of the NLP session at this year's Deep Learning Indaba, organized by Herman Kamper and me.

Resources 📚

Naki 🌎 Naki is a list of corpora, resources, and scientific papers for NLP for American Native / indigenous languages created by Manuel Mager.

How to get up to speed on Machine Learning and AI 🤖 A great list of high-quality technical and non-technical resources for learning about ML curated by the AI2 team.

ML Retrospectives 📄 A collection of blog posts by authors that honestly discuss a past paper, including its flaws, limitations, and perspectives. Only two retrospectives have been published so far (one on an automatic Turing test, the other on conditional computation), but we'll hopefully see more in the future.

Super Machine Learning Revision Notes 📒 A huge list that contains clear and concise summaries of basic ML concepts, algorithms and popular models, as well as practical tips and examples.

Causality chapter 📖 The Causality chapter of the Fairness and ML book is now freely available. If you're interested in causality, this is arguably one of the most extensive and didactic overviews of the topic.

The academic job search for computer scientists in 10 questions 👩‍🏫 This article by Nicolas Papernot and Elissa M. Redmiles is an in-depth guide to the academic job search, from preparing a job talk, to scheduling on-site interviews and negotiating a job package.

Tools ⚒

CausalML 🔧 This Python library by Uber provides a suite of causal inference methods using machine learning algorithms based on recent research. Typical use cases include campaign targeting optimization or personalized engagement.

FARM 👩‍🌾 FARM by Deepset allows you to easily adapt pretrained language models to downstream tasks. Standardized interfaces allow flexible extension, while experiment tracking and visualizations support debugging. In addition, FARM enables running inference either via an API or in a nice UI using docker containers. You can see below how it compares to two other popular transfer learning libraries, PyTorch-Transformers and spaCy-PyTorch-Transformers.

Articles and blog posts 📰

A Complete List of Important NLP Frameworks you should Know 🏛 This infographic shows many recent advances in NLP, from the Transformer to multilingual BERT. While some of the dates are inaccurate and ELMo is conspicuously missing, it still nicely highlights the recent progress in NLP.

Planning paper writing ✍️ Devi Parikh, who has produced a host of amazing papers with her team, shares tips on iterating on a paper with your collaborators:

Iterate on the paper in a hierarchical fashion (coarse to fine).
Iterate on small chunks at a time.
Plan for multiple iterations on every section.
Schedule each iteration.

How to Train Your ResNet 8: Bag of Tricks 👝 This post by David Page discusses a number of standard and not-so-standard tricks to reduce the training time of a ResNet:

Preprocessing on the GPU.
Applying max-pooling before batch norm and ReLU.
Label smoothing.
Using CELU activations.
'Ghost' batch norm (batch norm applied to a subset of a larger batch).
Frozen batch norm scales.
Input patch whitening.
Exponential moving averaging of parameters.
Test-time augmentation.
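
Two of the tricks in the list above, label smoothing and exponential moving averaging of parameters, are simple enough to sketch in plain Python. This is an illustrative sketch only: the function names, the epsilon and decay values, and the use of flat lists for parameters are assumptions, and the post's own implementation and hyperparameters may differ.

```python
def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Mix a one-hot target with the uniform distribution over classes."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if i == target_index else uniform
        for i in range(num_classes)
    ]

def ema_update(averaged, current, decay=0.99):
    """One step of an exponential moving average over model parameters."""
    return [decay * a + (1.0 - decay) * c for a, c in zip(averaged, current)]

# Smoothed target for class 1 out of 4 classes.
target = smooth_labels(1, 4, epsilon=0.2)

# Track an averaged copy of (toy) parameters across training steps.
averaged = [1.0, 1.0]
for step_params in ([0.0, 2.0], [0.0, 2.0]):
    averaged = ema_update(averaged, step_params, decay=0.9)
print(target, averaged)
```

The averaged parameters, rather than the latest ones, would then be used at evaluation time.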

How the limits of the mind shape human language 🤯 This article nicely explores how our inherent biases make learning certain languages easier than others.

A Rare Universal Pattern in Human Languages 🗣 This article discusses a recent paper demonstrating that, even though some languages are spoken more quickly than others, the rate at which information is conveyed (in bits per second, i.e. information per syllable multiplied by syllables per second) is roughly the same across languages.
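
The trade-off behind this finding is simple arithmetic: a language's information rate is its information per syllable multiplied by its speech rate. The sketch below uses made-up numbers for two hypothetical languages purely to illustrate the calculation; they are not taken from the paper.

```python
def information_rate(bits_per_syllable, syllables_per_second):
    """Information rate in bits per second."""
    return bits_per_syllable * syllables_per_second

# Hypothetical (bits per syllable, syllables per second) pairs, not real data.
hypothetical_languages = {
    "fast_but_sparse": (5.0, 8.0),
    "slow_but_dense": (8.0, 5.0),
}
rates = {
    name: information_rate(bits, speed)
    for name, (bits, speed) in hypothetical_languages.items()
}
print(rates)
```

A language that packs less information into each syllable can reach the same rate by being spoken faster, and vice versa.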

AI Is Coming for Your Favorite Menial Tasks 👩‍💻 This article focuses on an underappreciated aspect of the discussion around job loss and transformation due to AI: As decision making is cognitively draining, certain menial tasks may provide a sense of accomplishment. If all of these are done by AI, then only tasks that require very taxing novel decision making will remain.

Humanity + AI: Better Together 👩🤖 A written version of a presentation by Andreessen Horowitz's Frank Chen. He focuses on five aspects of how humans and AI can effectively work together:

Automating routines enables us to be more creative.
ML gives us superpowers in the physical world.
ML helps us make better decisions.
Automating dangerous jobs makes us safer.
ML will help us understand each other better.

Evolution Strategies 🐒🚶‍♂️ Lilian Weng reviews classic evolution strategies methods—black-box optimization algorithms that are part of the family of evolutionary algorithms—and discusses applications in deep RL.
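
As a flavour of what a black-box evolution strategy looks like, here is a minimal sketch of a simple Gaussian ES with a fixed step size: sample candidates around the current mean, keep the best few, and move the mean to their average. It is not one of the specific algorithms from the post; the function name, the hyperparameters and the toy objective are assumptions.

```python
import random

def simple_es(objective, dim, population=20, elite=5, sigma=0.5, steps=50, seed=0):
    """Maximise `objective` using only function evaluations (no gradients)."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    for _ in range(steps):
        # Sample candidates from an isotropic Gaussian around the current mean.
        candidates = [
            [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
            for _ in range(population)
        ]
        # Keep the best-scoring candidates and recentre on their average.
        candidates.sort(key=objective, reverse=True)
        best = candidates[:elite]
        mean = [sum(c[i] for c in best) / elite for i in range(dim)]
    return mean

def objective(x):
    """Toy black-box function, maximised at (3, -2)."""
    return -((x[0] - 3.0) ** 2 + (x[1] + 2.0) ** 2)

solution = simple_es(objective, dim=2)
print(solution, objective(solution))
```

More sophisticated methods additionally adapt the step size or the full covariance of the sampling distribution instead of keeping sigma fixed.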

On Creativity in Academia 👩‍🏫 This post by Tim Dettmers highlights the dilemma of creativity in academia, which is about finding strange ideas that are still valid. Rather than coming up with valid ideas straight away, one needs to keep hammering on and reassessing ideas until they work.

Dialogue State Tracking 🗣 This post by Wluper gives an overview of dialogue state tracking, in particular how to leverage persona information and dialogue history.

The #BenderRule: On Naming the Languages We Study and Why It Matters 🇬🇧🇩🇪🇿🇦 A great article by Emily Bender on the history of the #BenderRule and why English is neither synonymous with nor representative of natural language.

Papers + blog posts 📑

Universal Adversarial Triggers for Attacking and Analyzing NLP (blog post, paper) A new attack that concatenates a short phrase to the front or end of an input. It is universal in that the exact same phrase can be appended to any input from a dataset to cause a specific target prediction.
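
What "universal" means here can be sketched by applying one fixed phrase to every input and measuring how often the model's prediction flips to the target label. The sketch below covers only this application and evaluation step, not the search that finds a trigger; the keyword-based toy classifier, the trigger phrase and the helper names are all hypothetical stand-ins.

```python
def apply_trigger(text, trigger, position="front"):
    """Concatenate the same trigger phrase to the front or end of an input."""
    if position == "front":
        return f"{trigger} {text}"
    if position == "end":
        return f"{text} {trigger}"
    raise ValueError("position must be 'front' or 'end'")

def attack_success_rate(classify, inputs, trigger, target_label):
    """Fraction of inputs classified as `target_label` once the trigger is added."""
    hits = sum(classify(apply_trigger(t, trigger)) == target_label for t in inputs)
    return hits / len(inputs)

def toy_classifier(text):
    """Hypothetical keyword-based stand-in for a real sentiment model."""
    negative_words = {"awful", "terrible", "boring"}
    tokens = text.lower().split()
    return "negative" if any(tok in negative_words for tok in tokens) else "positive"

inputs = ["a wonderful film", "great acting and story", "I loved it"]
clean_rate = sum(toy_classifier(t) == "negative" for t in inputs) / len(inputs)
rate = attack_success_rate(toy_classifier, inputs, "awful terrible", "negative")
print(clean_rate, rate)
```

Against a real model the trigger would be found by an optimisation procedure rather than chosen by hand, but the evaluation loop has the same shape.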

Paper picks 📄

A critique of pure learning and what artificial neural networks can learn from animal brains (2019) This paper critiques pure learning and argues that most animal behaviour does not result from learning algorithms, but is encoded in the genome. Animals are born with highly structured brain connectivity, which must be compressed through a “genomic bottleneck”.

NLP News