Deep Learning Indaba, EurNLP, ML echo chamber, Pretrained LMs, Reproducibility papers
Hi all,
This month features updates about recent events (ICLR 2020 submissions, Deep Learning Indaba, EurNLP 2019), reflections on the ML echo chamber, a ton of resources and tools (many of them about Transformers and pretrained language models), many superb posts—from entertaining comics to advice on doing a PhD and writing papers to musings on the incentives to use poor-quality datasets—and compelling papers on reproducibility.
Contributions 💪 If you have written or have come across something that would be relevant to the community, hit reply on the issue so that it can be shared more widely.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.
If you were referred by a friend, click here to subscribe. If you enjoyed this issue, give it a tweet 🐦.
ICLR 2020 submissions 📑 In contrast to many other ML and NLP venues, ICLR 2020 submissions are available from the date they are submitted. In light of this onslaught of papers (2,600 this year), don't forget to keep a cool head and continue to pursue your own research direction. Above all, don't let yourself get intimidated by the abstract of a submission:
Deep Learning Indaba 2019 🌍
The Deep Learning Indaba took place at the end of August in Nairobi, Kenya. The event brought the African machine learning community together for the third time—and for the first time outside South Africa. The Indaba is unlike most academic conferences in ML: It is wild, diverse, and empowering. For impressions of the event and opinions, read these inspiring articles by Dave Gershgorn, Jade Abbott, and Vukosi Marivate.
It was particularly nice to see what great NLP research is being done on the continent. The poster session featured much work on African languages and unsupervised MT. At the NLP session, common themes were low-resource settings and social impact (all slides are available here). New initiatives such as Masakhane for MT are also helping bring people together.
On the whole, geographic diversity is a big issue in the NLP community. Andrew Caines and Marek Rei analyzed the distribution of countries based on author affiliations at major NLP conferences in 2018. The resulting numbers below confirm the geographic imbalance, with only a tiny fraction of authors coming from South America and Africa. This is one of the biggest problems of our time, and together we should try to balance the scales.
EurNLP 2019 🇪🇺
The first European NLP Summit (EurNLP 2019) took place last week. You can view recordings of all the event's talks here (the panel discussion in the last session, pictured above, is especially worth watching). We had a diverse set of speakers and participants hailing from all over Europe. It was particularly great to chat with both junior and senior researchers in the field in a setting that felt more intimate and personal, and much less formal, than a regular conference.
We're currently looking at locations for next year. If you're interested in getting involved as an organizer or in sponsoring next year's event, let us know via the email address on the website.
The ML echo chamber 🗣
The machine learning community can sometimes feel like an echo chamber, one that incentivizes bad research practices such as overclaiming, setting up weak baselines, or making poor comparisons in order to get published.
A facet of this bad behaviour is taking shortcuts. A recent example is Siraj Raval, who has been cutting corners in order to keep up with his aggressive publishing schedule. Everyone makes mistakes, but some things, such as plagiarizing or not acknowledging others' work, are simply not ok.
Vicki Boykis gives a very thoughtful take on how it might have come to this and importantly draws several lessons that we can all take to heart:
Don’t plagiarize code.
Don’t overpromise and underdeliver.
Don’t wade in over your head. Ask for help. Form a network and find mentors.
In light of this, it is important to remind ourselves: most of what we see publicized every day—whether a paper, a blog post, or a research project or demo—is the result of weeks or months of work by (usually) a group of people. Seeing only the highlights might make the work appear effortless, but it is not. Good work and good research take time.
Resources 📚
PLMpapers 📑 This repo contains a collection of many relevant pretrained language models. The above diagram illustrates how they relate to each other. Several things can be noted:
BERT has catalyzed research in pretrained language models and has sparked many extensions.
In particular, work on multimodal settings has been growing.
We have not yet tired of Muppet references (sadly).
Transformers 🤖 Staying with BERT-like models, you can go down the rabbit hole with this collection of resources related to Transformers. It includes resources that cover most of what you ever wanted (and didn't realize you wanted) to know about Transformers including paper reviews, blog posts, lectures, walkthroughs, and follow-up papers.
Topical-Chat 💬 One area that is still challenging for current methods is dialogue. Topical-Chat is a dataset of more than 235,000 utterances released by Amazon, making it the largest publicly available social-conversation and knowledge dataset. Each conversation refers to a group of three related entities, and every turn of conversation is supported by an extract from a collection of unstructured or loosely structured text resources (see the blog post for more information).
Document Embedding Techniques 📃 With all these fancy recent approaches, it is important not to forget methods that were introduced before 2018. This is a nice overview of document embedding methods, from BOW and LDA to more recent unsupervised (doc2vec, word mover's embedding, sentence-BERT, etc.) and supervised methods.
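If you want to try one of the classic unsupervised methods yourself, a minimal doc2vec sketch with gensim might look like the following (the corpus and hyper-parameters are illustrative placeholders, not recommendations):

```python
# A minimal doc2vec sketch using gensim (corpus and hyper-parameters are toy examples).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "neural networks learn distributed representations",
    "bag of words models ignore word order",
    "document embeddings summarize whole texts",
]
# gensim expects tokenized documents, each with a unique tag.
documents = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)]

model = Doc2Vec(documents, vector_size=50, window=2, min_count=1, epochs=40)

# Infer an embedding for an unseen document.
vector = model.infer_vector("word order does not matter for bag of words".split())
print(vector.shape)  # (50,)
```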
CodeSearchNet challenge 👩‍💻 Another interesting area of research is NLP applied to code. This challenge by GitHub aims to test models on information retrieval over code. It includes a large corpus covering 6M methods, 2M of which have documentation, as well as baseline models for code search. This should be a good starting point for anyone wanting to get their feet wet with ML on code.
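To make the retrieval setup concrete, here is a hedged sketch of the simplest possible baseline: ranking code snippets against a natural-language query using TF-IDF over their docstrings. The snippets and query below are invented for illustration; the actual challenge baselines use learned joint embeddings of code and queries.

```python
# Toy code-search baseline: TF-IDF over docstrings, ranked by cosine similarity.
# Docstrings and query are invented examples, not CodeSearchNet data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docstrings = [
    "read a csv file into a dataframe",
    "sort a list of integers in place",
    "send an http get request and return the response body",
]
query = ["download the contents of a url"]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docstrings)
query_vec = vectorizer.transform(query)

# Rank snippets by similarity to the query; higher means a better match.
scores = cosine_similarity(query_vec, doc_matrix)[0]
best = scores.argmax()
print(best, docstrings[best])
```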
Notes on interpretability 📚 A collection of explainable AI papers curated by Ana Marasovic, including overviews, perspectives, papers on the benefits and pitfalls of explainability methods, and more.
Tools ⚒
To exactly no-one's surprise, many tools that came out over the last month feature (you guessed it) pretrained language models:
Grover-Mega 🦍 Grover-Mega, a large GPT-like language model for defending against fake news, is now publicly available and can be downloaded.
CTRL 🕹 This code makes it easy to install and run the large controllable language model by Salesforce. Alternatively, it is also available in 🤗Transformers.
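As a rough sketch of what running CTRL through 🤗Transformers looks like (assuming the "ctrl" checkpoint identifier; the full model has roughly 1.6B parameters, so expect a very large download):

```python
# Rough sketch of conditional generation with CTRL via 🤗 Transformers.
# Checkpoint name "ctrl" is assumed; the model is very large.
from transformers import CTRLTokenizer, CTRLLMHeadModel

tokenizer = CTRLTokenizer.from_pretrained("ctrl")
model = CTRLLMHeadModel.from_pretrained("ctrl")

# CTRL conditions generation on a control code (here "Links") followed by a prompt.
prompt = "Links Scientists discover a new species of deep-sea fish"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = model.generate(input_ids, max_length=50, repetition_penalty=1.2)
print(tokenizer.decode(output[0]))
```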
fast-bert ⏩ fast-bert is designed to train and deploy BERT- and XLNet-based models. It is built on 🤗Transformers, inspired by fast.ai, and strives to make cutting-edge NLP architectures as accessible as possible.
NLP Architect 🏛 Intel's NLP Architect library now integrates Transformer-based models including BERT, XLNet, and XLM. Compared to other libraries, it also includes a quantized version of BERT that reduces the model size by 75%, as well as a distillation method.
AllenNLP Interpret 🌈 This is a toolkit for interactive model interpretations using gradient-based saliency maps and adversarial attacks. It features multiple demos, such as for masked language modelling and textual entailment.
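The core idea behind gradient-based saliency is simple enough to sketch in a few lines of generic PyTorch. The snippet below is not the AllenNLP Interpret API, just an illustration of the technique on a toy classifier: each input token is scored by the gradient of the predicted class score with respect to that token's embedding.

```python
# Toy illustration of gradient-based saliency on a made-up classifier.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, emb_dim, num_classes, seq_len = 100, 16, 2, 5

embedding = nn.Embedding(vocab_size, emb_dim)
classifier = nn.Linear(emb_dim, num_classes)

token_ids = torch.randint(0, vocab_size, (seq_len,))
embedded = embedding(token_ids)            # (seq_len, emb_dim)
embedded.retain_grad()                     # keep gradients for this non-leaf tensor

logits = classifier(embedded.mean(dim=0))  # mean-pool tokens, then classify
logits[logits.argmax()].backward()         # gradient of the top class score

saliency = embedded.grad.norm(dim=1)       # one importance score per token
print(saliency)
```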
NeMo 💬 NeMo is a toolkit for conversational AI by NVIDIA, built on PyTorch. At this point, it supports NMT, pretraining BERT, NER, intent and slot filling, and improving speech recognition with BERT post-processing (see here).
sotabench 🏅 sotabench is a tool by Papers with Code that allows you to benchmark your own and others' open-source models in order to make algorithms more reproducible. On the website, you can then see, for common tasks, which implementations actually replicate the papers' results.
Articles and blog posts 📰
Learning Machine Learning 💥 This is a great comic by Google AI that makes ML more accessible. It goes surprisingly deep and discusses issues such as overfitting, data quality, and more in an entertaining and humorous way.
Current Issues with Transfer Learning in NLP 👩🏫 Muhammad Khalifa summarizes some of the main challenges of transfer learning in NLP: computational intensity, reproducibility, task leaderboards, similarity to human learning, shallow language understanding, and a high carbon footprint.
Slice-based Learning ✂️ The authors of Snorkel, the data programming toolkit that we've covered in the past, discuss slice-based learning, which can be used to improve performance on certain subsets of the data, or slices. Slice-based learning has been used to achieve state-of-the-art performance on the SuperGLUE benchmark.
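The underlying idea is easy to illustrate: define a slice as a predicate over examples and track metrics on that slice separately (and, in the full approach, add slice-specific capacity to the model). Below is a toy sketch of the evaluation side only, not Snorkel's actual API; the slice predicate, examples, and labels are invented.

```python
# Toy illustration of slicing: evaluate accuracy on a subset ("slice") of the data
# defined by a simple predicate. This mimics the idea, not Snorkel's API.
def short_question_slice(example):
    # Hypothetical slice: questions with fewer than six tokens.
    return len(example["text"].split()) < 6

examples = [
    {"text": "what is the capital of france", "label": 1, "pred": 1},
    {"text": "who wrote hamlet", "label": 1, "pred": 0},
    {"text": "is water wet", "label": 0, "pred": 0},
]

def accuracy(subset):
    return sum(e["label"] == e["pred"] for e in subset) / len(subset)

slice_examples = [e for e in examples if short_question_slice(e)]
print("overall accuracy:", accuracy(examples))
print("slice accuracy:  ", accuracy(slice_examples))
```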
PhD 101 👩🎓 A collection of advice from Volkan Cirik on doing a PhD:
Dealing with failure.
Learn to learn.
You are not your work or ideas.
Research is not linear.
Your relationship to your adviser is important.
Avoid tunnel vision.
Research is hard. You need support.
Gaussian Process, not quite for dummies 👩🏫 Gaussian Processes are a powerful tool, but can be hard to grasp if they are tackled directly. Yuge Shi introduces Gaussian Processes from first principles, starting from non-linear logistic regression and 2D Gaussians in this lucid blog post.
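If you prefer code to equations, here is a minimal numpy sketch of GP regression with an RBF kernel: computing the posterior mean and variance at test points given a handful of noisy observations. The kernel length scale and noise level are arbitrary choices for illustration.

```python
# Minimal GP regression sketch: posterior mean/std at test points under an RBF kernel.
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel k(x, x') = exp(-|x - x'|^2 / (2 * l^2)).
    sq_dists = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dists / length_scale ** 2)

# Noisy training observations of an unknown function.
x_train = np.array([-2.0, -1.0, 0.5, 2.0])
y_train = np.sin(x_train) + 0.1 * np.random.randn(len(x_train))
x_test = np.linspace(-3, 3, 50)

noise = 0.1 ** 2
K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
K_s = rbf_kernel(x_train, x_test)
K_ss = rbf_kernel(x_test, x_test)

# Posterior mean and covariance of the GP at the test points.
K_inv = np.linalg.inv(K)
mean = K_s.T @ K_inv @ y_train
cov = K_ss - K_s.T @ K_inv @ K_s
std = np.sqrt(np.clip(np.diag(cov), 0, None))
print(mean[:5], std[:5])
```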
Natural Language Processing With spaCy in Python 🐍 This is a nice tutorial that covers many of the things you can do with spaCy on one page, from sentence detection to part-of-speech tagging, dependency parsing, and NER.
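As a taste of what the tutorial covers, the basic spaCy workflow fits in a few lines (this assumes you have installed spaCy and downloaded the small English model, e.g. via `python -m spacy download en_core_web_sm`):

```python
# Basic spaCy pipeline: sentence segmentation, POS tags, dependencies, and entities.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ada Lovelace wrote the first program. She worked with Charles Babbage in London.")

for sent in doc.sents:                       # sentence detection
    print(sent.text)

for token in doc:                            # POS tags and dependency labels
    print(token.text, token.pos_, token.dep_, token.head.text)

for ent in doc.ents:                         # named entities
    print(ent.text, ent.label_)
```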
The Duolingo CEFR Checker: An AI Tool for Adapting Learning Content 🦉 This is a nice post by Duolingo on how they built a model that estimates the CEFR (Common European Framework of Reference) level, i.e. an author's language proficiency ranging from A1 to C2, across different languages using cross-lingual word embeddings.
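The high-level recipe is easy to sketch: represent a text by averaging (cross-lingual) word vectors and feed that representation to a standard classifier over CEFR levels. The code below is a toy illustration of that idea only, not Duolingo's actual model; the random vectors stand in for real cross-lingual embeddings and the labels are invented.

```python
# Toy sketch of CEFR-style classification: average word vectors per text, then classify.
# Random "embeddings" stand in for real cross-lingual word vectors; labels are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=50) for w in "the cat sat on a mat notwithstanding epistemology".split()}

def embed(text):
    vectors = [vocab[w] for w in text.split() if w in vocab]
    return np.mean(vectors, axis=0)

texts = ["the cat sat on a mat", "notwithstanding epistemology the cat sat"]
levels = ["A1", "C1"]  # invented CEFR labels for illustration

X = np.stack([embed(t) for t in texts])
clf = LogisticRegression().fit(X, levels)
print(clf.predict([embed("a cat sat on the mat")]))
```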
Do We Encourage Researchers to Use Inappropriate Data Sets? 🤯 Ehud Reiter argues that NLP as a field incentivizes using poor-quality datasets. In particular, it seems to be accepted that if a dataset has been used before, it is fine to use in the future—despite any issues it might have. This is particularly concerning given the increasing number of papers that find issues with our current datasets, from question answering (CNN/DailyMail) and natural language inference (SNLI/MNLI) to bilingual lexicon induction (MUSE).
Why we switched from Spacy to Flair to anonymize French case law 👩⚖️ A nice case study of NER for a particular domain in a non-English language (legal text in French) that gives some insights into common challenges (anonymization, large datasets) and trade-offs (speed vs. accuracy) for applying NLP in this domain.
Tips on how to write a great science paper ✍️ Pulitzer prize winner Cormac McCarthy—who helped edit the work of many scientists—provides many useful pieces of advice, including:
Use minimalism to achieve clarity.
Decide on two or three points you want every reader to remember.
Limit each paragraph to a single message.
Keep sentences short, simply constructed and direct.
Don’t slow the reader down.
Don’t over-elaborate.
Papers + blog posts 📑
Evolution of Representations in the Transformer (blog post, EMNLP 2019 paper) This paper looks at how the representations of individual tokens change in Transformers pretrained with different objectives. They find differences in what the representations capture that are tied to the objective, e.g. that representations obtained via language modelling tend to gradually forget about the past. This is a great example of a blog post that presents the core insights and analyses of a paper in an easily accessible form with compelling visuals.
What causes bias in word embedding associations? (blog post, ACL 2019 paper) This paper looks at undesirable associations such as gender stereotypes captured by word embeddings and studies what the main source of bias is and how embeddings can be debiased effectively. One interesting finding is that skip-gram does not make most words more gendered, but it does amplify bias for gender-stereotyped (e.g. 'nurse') or gender-specific (e.g. 'queen') words.
Discovering Neural Wirings (blog post, NeurIPS 2019 paper) The connectivity patterns of neural networks are typically manually defined. This paper relaxes this notion and allows for the wiring to be learned during training. They manage to train a model that is small during inference but still overparameterized during training.
Paper picks 📄
Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches (RecSys 2019) This RecSys 2019 best paper is in line with recent systematic comparisons of existing approaches and calls for improved scholarship and reproducibility. It studies 18 algorithms for top-n recommendation and finds that only 7 can be reproduced with reasonable effort. Six of those don't beat simple baselines, and the seventh does not consistently outperform a linear ranking method. Overall, not only the NLP but also the recommender systems community should invest more effort into strong baselines and reproducible comparisons.
A Step Toward Quantifying Independently Reproducible Machine Learning Research (NeurIPS 2019) For this paper, the (single!) author attempted to implement and replicate 255 ML papers published from 1984 until 2017 (64% were replicated successfully), recorded features of each paper, and performed a statistical analysis of the results to determine "what makes a paper reproducible by an independent researcher?". The main take-aways:
Reproducibility rates have not changed over time.
The fewer read-throughs a paper required, the more likely it was to be reproduced.
Highly detailed "code-like" descriptions and no pseudo-code are more reproducible.
More empirical papers are more reproducible, while more theory-oriented papers are less reproducible.
Bayesian or fairness-related papers were less reproducible; DL and search/retrieval-related papers were more reproducible.
Having more tables and hyper-parameters specified made a paper more reproducible.
The number of equations per page was negatively correlated with reproducibility.
Papers that required a cluster were harder to reproduce, while papers requiring a GPU were easier to reproduce.
If authors of a paper reply to questions, the paper is much more likely to be reproducible.
Neither toy problems nor conceptualization figures were correlated with reproducibility.