Highlights of EMNLP 2019, Ethics in NLP vol. 2, AI and Journalism

NLP News
Hi all,
The themes of this newsletter are highlights from EMNLP 2019, ethical uses of NLP, and AI and journalism. At EMNLP 2019, BERT and multilingual models showed a strong presence, while there were also some dubious use cases for NLP. One particularly challenging domain for AI is news.
In addition, we have a treasure trove of high-quality talks from three recent ML and NLP summer schools (Deep Learning Indaba, Khipu, AthensNLP), a lot of resources including paper summaries and lecture slides, a new delicious French BERT model, and articles ranging from the state of the NLP literature to overviews of self-supervised learning and dialogue policies.
Contributions 💪 If you have written or have come across something that would be relevant to the community, hit reply on the issue so that it can be shared more widely.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.
If you were referred by a friend, click here to subscribe. If you enjoyed this issue, give it a tweet 🐦.

The compute needed for training state-of-the-art ML models has increased tremendously in recent years. To get a feeling for exactly how dramatic this increase has been, it helps to contrast it with the historical growth of compute in AI (see below). In light of the sharp slope of this increase, it is even more important to work on resource-efficient ML methods.
From 1959-2012, compute in AI roughly tracked Moore's law. From 2012, compute doubles at a substantially faster rate (Credit: OpenAI)
EMNLP 2019 highlights
I didn’t make it to EMNLP this time. Thankfully, there were a lot of live tweeters. Here are my highlights from spectating from afar: Barbara Plank gave a talk on cross-lingual and cross-domain learning at the DeepLo 2019 workshop. The talk covers annotation projection, the importance of (a small amount of) labelled target data, and data selection.
In his keynote, Kyunghyun Cho shares his refreshingly SOTA-less journey into neural sequence models (including a lot of almost SOTAs)—from the beginnings of NMT, to multi-way and unsupervised MT.
For a broader overview, have a look at Naver Labs Europe’s highlights post, which focuses on machine translation and multilingual models, more efficient models, and analyses. They note the dominance of BERT and a rising number of cross-lingual papers. For an extensive overview of knowledge graph-related papers at EMNLP, have a look at Michael Galkin’s two-part blog post (Part I and Part II).
Ethics in NLP vol. 2
We already talked about ethics in NLP in a past edition, but as the number of submissions to conferences grows, holding all papers to an ethical standard becomes ever more crucial.
This EMNLP had two papers that proposed dubious applications of NLP: charge-based prison term prediction and automatic news comment generation. Both acknowledge ethical concerns only in passing and caused indignation online:
Mark Neumann
Struggling to understand how #emnlp2019 has managed to accept a paper titled “Charge Based Prison Term Prediction with Deep Gating Network”, this is real brain dead stupid stuff
Matthias Gallé
Just attended this talk at #emnlp2019. Pushed on the issues raised in this thread, the author said "there are no ethical issues in this work" https://t.co/DENYJH33dt
As researchers, we need to be sensitive to ethical concerns and raise them. With the release of more powerful models, NLP will be applied in more settings, and many of these will require grappling with ethical questions.
If you care about ethics, consider submitting a paper to the Ethics and NLP track at ACL 2020. For more on ethical use of AI and fairness, have a look at the first article in Articles and blog posts 📰, which invites you to play a courtroom algorithm game.
AI and Journalism 🗞
Another ethically challenging domain for AI is news. Its potential for generating fake news was the primary reason OpenAI did not release their large GPT-2 model. That has now changed: after finding that its output is only marginally more credible than that of the previously released versions, OpenAI finally released the 1.5B-parameter version of their GPT-2 language model.
Will we now see a sudden proliferation of GPT-2-generated text online? Highly unlikely. Nevertheless, if you want to check whether a website is using GPT-2 or another language model to generate its text, check out the GPTrue or False browser extension, which aims to detect whether a given portion of text is machine-generated or real.
However, we should not only talk about how AI can be used for news, but also how news organisations view AI. This extensive report surveys 71 news organisations regarding their understanding of AI and how it is used in their newsrooms. TL;DR: AI is already a significant part of journalism, but it is unevenly distributed. It is giving journalists more power, but brings editorial and ethical responsibilities.
A big part of this responsibility is responsible reporting about AI. News articles about AI are often full of inaccuracies or—worse—Terminator pictures. SkyNet Today shares a list of best practices to ensure high-quality articles about AI.
Talks 🗣
Deep Learning Indaba 2019 Playlist 🌍 Featuring 65 videos, the talks from the four parallel sessions of this year’s Deep Learning Indaba cover a plethora of introductory and advanced topics, which will keep you busy for weeks.
Khipu 2019 Playlist 🌎 Khipu.ai just took place last week in Montevideo, Uruguay, but the talks are already available on the Khipu website. For context, Khipu is an initiative that brings together the South American ML community (khipu are cotton strings used by the Incas to store numeric values as knots).
Athens NLP 2019 Playlist 🏛 All talks of the 1st Athens NLP Summer School are now available. They feature in-depth lectures and tutorials from a diverse array of NLP experts, including Vivian Chen, Ryan McDonald, Angeliki Lazaridou, and Sebastian Riedel.
Improving Search with Natural Language Processing and Deep Learning 🚗 This is a nice case study on how Transformer models are used to translate natural language to structured search queries for semantic search in a specific domain (cars).
Resources 📚
What is tf.keras really? ⚙️ A tutorial that digs deep into what exactly each piece of tf.keras is doing (inspired by What is torch.nn really?).
ERASER ✏ The Evaluating Rationales And Simple English Reasoning (ERASER) benchmark provides a diverse set of NLP datasets to evaluate how interpretable your models are. To be evaluated, a model needs to provide a rationale (often keywords in the input that support its prediction).
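Rationale benchmarks like ERASER typically compare a model’s predicted rationale tokens against human-annotated ones using overlap metrics. As a minimal sketch (the function name and the example tokens are illustrative, not ERASER’s official scorer), token-level F1 can be computed like this:

```python
def rationale_f1(predicted_tokens, gold_tokens):
    """Token-level F1 between predicted and gold rationale token sets."""
    predicted, gold = set(predicted_tokens), set(gold_tokens)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# A rationale that recovers half of the gold tokens with half precision
# scores F1 = 0.5.
score = rationale_f1(["not", "movie"], ["not", "good"])
```

ERASER additionally reports metrics such as exact match and IOU-based scores, but the intuition is the same: reward rationales that overlap with what human annotators marked as evidence.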
74 Summaries of Machine Learning and NLP Research 📚 Marek Rei shares concise overviews of the core contributions of several impactful NLP and ML papers. The list is a great way to get up to speed quickly and see if you’ve missed any interesting recent papers.
CS330: Deep Multi-Task and Meta Learning 🤖 Slides of Stanford’s CS330 course taught by Chelsea Finn. The slides cover few-shot learning, Bayesian meta-learning, multi-task RL, lifelong learning, and much more.
CS236: Deep Generative Models 🖼 Course notes of Stanford’s course on generative models, which cover autoregressive models, VAEs, normalising flows, and GANs.
Tools ⚒
The delicious CamemBERT
CamemBERT 🧀 A RoBERTa model that was trained on the French portion of the multilingual OSCAR corpus (which is based on Common Crawl data and covers 166 languages). The pretrained model is available for download. The OSCAR corpus should be useful for training other (multilingual) models.
Joey NMT 🐨 Joey NMT is a minimalist NMT toolkit for novices that matches the quality of standard toolkits with one fifth of the code. Have a look at this blog post for a great overview and check out the paper for technical details.
Articles and blog posts 📰
Can you make AI fairer than a judge? 👩‍⚖️ If you’re reading one article about algorithmic fairness this month, then read this one. This interactive MIT Technology Review article shows that making the judicial process less biased is not just about maximising accuracy.
Benchmarking Transformers: PyTorch and TensorFlow 📊 HuggingFace benchmarks inference performance of Transformers with PyTorch and TensorFlow. TL;DR: Similar results. TensorFlow is a bit slower on CPU but faster on GPU. PyTorch runs out of memory earlier (batch size of 8, sequence length of 1024).
Student Perspectives on Applying to NLP PhD Programs 👩‍🎓 This post offers some great pieces of advice on the NLP PhD application process and provides perspectives from twelve recently successful NLP PhD applicants, with a focus on programs in the US. If you’re considering applying for a PhD in NLP or ML more broadly, then definitely read this.
Uncertainty Quantification in Deep Learning ❓ This is a nice overview of how to estimate uncertainty with neural networks, including Monte-Carlo dropout, deep and dropout ensembles, quantile regression, and a comparison to Gaussian Processes. TL;DR: While all Deep Learning approaches can interpolate, they lack the ability to extrapolate.
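Of the approaches the overview covers, Monte-Carlo dropout is the simplest to try: keep dropout active at test time, run several stochastic forward passes, and treat the spread of the predictions as an uncertainty estimate. A minimal sketch with a toy numpy network (weights and sizes are arbitrary, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer regression network with fixed random weights.
W1 = rng.normal(size=(1, 32))
W2 = rng.normal(size=(32, 1))

def forward(x, dropout_p=0.5, mc_dropout=True):
    """One forward pass; with mc_dropout=True, dropout stays on at test time."""
    h = np.maximum(0.0, x @ W1)               # ReLU hidden layer
    if mc_dropout:
        mask = rng.random(h.shape) > dropout_p
        h = h * mask / (1.0 - dropout_p)      # inverted dropout scaling
    return h @ W2

x = np.array([[0.3]])
# T stochastic passes; their mean is the prediction, their spread the uncertainty.
samples = np.array([forward(x) for _ in range(100)])
prediction, uncertainty = samples.mean(), samples.std()
```

The same caveat from the article applies: this only captures uncertainty within the region the model has seen, so it interpolates rather than extrapolates.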
The State of NLP Literature 📚 This is a great series of blog posts that surveys the NLP literature landscape including its size and demographics (Part I), areas of research (Part II), most cited papers (Part IIIa), and citations by area (Part IIIb).
Evaluation Metrics for Language Modeling ⚖️ Pretrained language models excel at downstream tasks, so it’s useful to remind ourselves what they are actually optimising. This is a great overview of evaluation metrics for language modeling by Chip Huyen, with insights on classical bounds and human predictions.
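As a quick reminder of the metrics the post discusses: perplexity is just the exponentiated average negative log-likelihood the model assigns to the held-out tokens, and dividing that cross-entropy by ln 2 gives bits per token. A small sketch with made-up per-token probabilities:

```python
import math

# Hypothetical probabilities a language model assigns to the ground-truth
# tokens of a held-out sentence (illustrative values only).
token_probs = [0.2, 0.5, 0.05, 0.4]

# Average negative log-likelihood (cross-entropy) in nats.
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

perplexity = math.exp(nll)          # e^{cross-entropy}
bits_per_token = nll / math.log(2)  # same quantity, measured in bits
```

A perplexity of k roughly means the model is as uncertain as if it were choosing uniformly among k tokens at each step.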
Shortening papers to fit page limits 📄 Don’t you hate it when a perfectly written paper needs to be cut down to fit a pesky page limit? Devi Parikh shares tips on how to make this process more effective. TL;DR: Tighten the writing before you consider what content to remove.
Self-Supervised Representation Learning 🤖 Lilian Weng gives an overview of self-supervised learning (language modelling is an example). While the overview is focused on computer vision and control, it can provide inspiration for designing self-supervised objectives for other domains such as natural language.
How do Dialogue Systems decide what to say or which actions to take? 💬 This blog post by wluper gives a nice overview of dialogue policies. It covers rule-based, retrieval-based, generative, and knowledge graph-based policies. It illustrates their strengths and weaknesses by providing examples from Alexa Prize teams, HuggingFace, DeepPavlov, and others.
Papers + blog posts 📑
Learning to Predict Without Looking Ahead: World Models Without Forward Prediction (Blog post, paper) In current RL research, an agent’s model of the world, or world model, is often a forward model that predicts future states. This paper questions whether forward prediction is required to learn a world model. Instead, the authors constrain the agent so that at each timestep it observes its environment only with a certain probability, and augment it with an internal model that generates a new observation otherwise. They show that such a model can be useful for learning important skills and sometimes behaves like an imperfect forward model.
Answering Complex Open-domain Questions at Scale (Blog post, paper) Open-domain question answering is challenging because questions often require multiple steps of reasoning to find the correct answer. This paper proposes a model that iteratively generates natural language queries based on the currently retrieved context and retrieves more information if needed before answering the question. The method is based on the intuition that at any given time in the process of finding all supporting facts, there is strong semantic overlap between what we already know (the question text, plus what we have found so far) and what we want to find (the remaining supporting facts).
CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text (Blog post, paper) CLUTRR is a benchmark suite designed to test inductive reasoning, i.e. reasoning that is not purely extractive. The dataset consists of a large set of semi-synthetic stories about hypothetical family members whose relationships are not explicitly stated and must be inferred. The authors observe that systematic generalisation is a hard problem: performance decreases for all models (both pretrained language models and models that directly operate on the symbolic graph) as the relation length increases.
Sebastian Ruder @seb_ruder

Regular analyses of advances in natural language processing and machine learning.

Created with Revue by Twitter.