
ML and NLP starter toolkit, Low-resource NLP toolkit, "Can an LM understand natural language?", The next generation of NLP benchmarks

NLP News
Hi all,
It has been a while… I hope you’re doing well in these crazy and strange times.
COVID-19 has affected each of us in unpredictable ways 😷. I’m fortunate to be healthy but my energy has been lower overall ⚡️. Coupled with life and work keeping me busy, this has left less time for writing a newsletter 😩.
So many things have happened over the last months. For this edition, I’ve focused on a) resources that help you get started with NLP in general and low-resource NLP in particular; and b) musings on two important recent topics: what a language model can learn and what the next generation of NLP benchmarks will look like.
The next newsletter (which should come out in a more timely fashion once I’ve caught up on my reading list backlog 😅) will cover the most interesting papers from ACL 2020 and EMNLP 2020, as well as submissions to ICLR 2021.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.
If you were referred by a friend, click here to subscribe. If you enjoyed this issue, give it a tweet 🐦.

Who are you in the NLP quarantine? (Credit: Erick Fonseca)
1. The ML and NLP starter toolkit 🚀
I am amazed every day at how dramatically the field has changed over the last years and how many expertly created resources are now out there to help you get started in the field. Here are my favourite resources to kick-start your learning journey:
The NLP Pandect by Ivan Bilan is a fantastically detailed, curated collection of NLP resources on everything NLP—from general information resources to frameworks and YouTube channels. I even found a couple of cool NLP podcasts!
nlp-tutorial by Tae-Hwan Jung is a GitHub repo that—with 7.2k ⭐️—might not be a secret tip anymore but is well worth checking out. It includes lots of minimal walk-throughs of NLP models, each implemented in fewer than 100 lines of code.
Embeddings in Natural Language Processing is a new book by Mohammad Taher Pilehvar and Jose Camacho-Collados that gives a comprehensive overview of embeddings, the bread and butter of NLP. The authors kindly openly released the first draft.
The Super Duper NLP Repo by Quantum Stat has grandiose ambitions. It is the largest collection of NLP demos in Colab notebooks (262 as of today) that I’m aware of—covering everything from table parsing to text-to-speech. A great way to get started on a new application! If you want more NLP goodness, check out their NLP Model Forge and The Big Bad NLP Database for datasets.
Lena Voita’s NLP Course | For You is a tour de force of how to bring a physical course to an online audience. Clear, well-structured explanations coupled with beautiful illustrations make going through the extensive materials a joy.
Break into NLP is a virtual event that covers all things NLP. Lukasz Kaiser and Andrew Ng review the progress in MT and in ML in general. Andrew recounts how deep learning first transformed speech, then computer vision, and now NLP. The panel with NLP luminaries Marti Hearst and Ken Church offers a lot of interesting perspectives and thoughts on getting started in NLP.
In Qingqing Cao’s ACL 2020 Adventure, he gives some great tips on what to do when attending virtual conferences such as hosting a watch party and attending mentoring and Q&A sessions as well as some useful advice from mentoring sessions. On that note, if you are at the virtual EMNLP 2020 and would like to chat, don’t hesitate to message me.
2. The low-resource NLP toolkit 🌍
Another thing I’m quite excited about is the increased support for low-resource languages for many core NLP tools, which is more important than ever in the COVID-19 era. Some of my favourite projects are the following:
Stanza is an NLP toolkit by Stanford that provides neural network models for the entire NLP pipeline, including tokenisation, lemmatisation, part-of-speech tagging, and dependency parsing in 66 languages. It’s easy to use and a great starting point for getting annotations in more languages.
The Low Resource NLP Bootcamp held by CMU’s Language Technologies Institute is a one-stop-shop for learning more about NLP fundamentals and areas related to low-resource scenarios such as machine translation and multilingual NLP. The talks are available online.
Masakhane is a grassroots NLP community for Africans, by Africans. It brings people together to work on challenging research problems for African languages. Their recent EMNLP 2020 findings paper demonstrates the impact grassroots efforts can have. I hope to see more such initiatives focused on languages of Asia and the Americas.
Finally, here are several cool resources for different languages:
  • Arabic: arbml is a GitHub repo that is all about Arabic NLP. It contains Keras models for different tasks, datasets, and Colab demos, from poem generation to sentiment classification.
  • Turkish: Zemberek-NLP provides a similar array of tools for Turkish. The tools focus more on core NLP tasks, from morphology to tokenisation, and are written in Java.
  • Indonesian: IndoNLU is a comprehensive resource to help you do cutting-edge work on Indonesian NLP. It not only provides a GLUE-for-Indonesian benchmark that covers 12 diverse downstream tasks but also provides an Indonesian pre-training dataset and pre-trained models. Let the fine-tuning begin!
3. Can an LM ever truly understand natural language? 🐙
This summer, two independent developments reignited the debate about the fundamental capabilities and limitations of language models (LMs) that are trained only on raw text data:
Bender and Koller propose the octopus test, a thought experiment intended to show that a system exposed only to linguistic form cannot learn meaning. In brief, two people A and B live on remote islands and can only communicate via text messages through a trans-oceanic cable. A hyper-intelligent octopus O listens in on their conversations. Eventually, it cuts off B and poses as B. Can O fool A or will A become suspicious? Bender and Koller argue that O will fail when exposed to real-world situations such as a bear attack (see below), as O has no grounding in the world.
Bender and Koller's Octopus test
Julian Michael provides a great reflection on the theme paper and the test in this blog post. According to Bender:
The point isn’t really whether O could fool A under what circumstances, but rather to use that thought experiment to show what is missing in O’s (and thus modern LM’s) input.
Julian furthermore outlines problems with subjective judgements of intent, as humans have a tendency to attribute meaning to language, which is known as the ELIZA effect.
Christopher Potts critically reviews the current evidence and arrives at the conclusion that “we currently don’t have compelling reasons to think language models can’t achieve language understanding”.
Overall, I think we have seen that large language models such as GPT-3, trained on huge amounts of data, may be able to uncover connections between intent and pure form that go beyond what we expected. These connections, however, are tenuous: the prompts for such models must be carefully engineered to elicit the expected behaviour. Incorporating more explicit signal from semantics and grounding into such models should make them not only more powerful but also more data-efficient.
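To illustrate how sensitive such models are to prompt design, here is a minimal sketch of few-shot prompt construction in plain Python. The "Review:/Sentiment:" format and the examples are my own illustrative choices, not taken from any particular paper or API:

```python
# A minimal sketch of few-shot prompting for a GPT-3-style model: the task
# is conveyed purely through solved examples in the prompt text, and small
# formatting choices can change the model's behaviour.
examples = [
    ("I loved every minute of it.", "positive"),
    ("The plot made no sense at all.", "negative"),
]

def build_prompt(examples, query):
    """Concatenate solved examples and leave the final label for the LM to fill in."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

prompt = build_prompt(examples, "A tedious, overlong mess.")
print(prompt)
```

In practice, seemingly minor decisions here—the separator, the label words, even trailing whitespace after "Sentiment:"—can noticeably shift a model’s outputs, which is exactly why prompts need careful engineering.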
4. The next generation of NLP benchmarks 📊
One of the appeals of natural language processing to me is its diversity of tasks that cover various aspects of natural language, from text simplification to sentiment analysis. Historically, most tasks have standard datasets on which approaches are benchmarked. Recently, we have seen approaches increasingly compete on public leaderboards, claiming new state-of-the-art results but remaining brittle and perhaps losing track of the ultimate goal: doing well on real-world downstream applications of natural language processing.
There are four attributes that I think are crucial for benchmarks to better evaluate how well models generalise:
  • Multi-task: Benchmarks cover multiple general NLP tasks, as GLUE does, or multiple datasets for a single task such as summarisation (see e.g. PEGASUS) or task-oriented dialogue (DialoGLUE). This helps prevent designing models that only work for a specific domain or a narrow set of tasks.
  • Multi-lingual: Benchmarks cover multiple languages, as XTREME does, or provide data in languages beyond English, as IndoNLU does. This enables evaluating how well the inductive biases of models generalise beyond English.
  • Adaptive: Benchmarks improve and grow with the model rather than remaining static. Recent examples are Dynabench (for diverse tasks) and LIGHT WILD (for dialogue in a fantasy RPG world). This makes models less likely to overfit to specific biases in the data and more likely to generalise to a broader set of phenomena.
  • Fine-grained: Benchmarks enable fine-grained judgements about where a model generalises and where it fails, for instance via SuperGLUE’s diagnostic dataset or CheckList. This allows us to better identify a model’s weaknesses and develop ways to mitigate them.
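To make the fine-grained evaluation idea concrete, here is a minimal sketch of a CheckList-style invariance test in plain Python. The model and the name list are toy placeholders of my own, not part of the actual CheckList library:

```python
# A CheckList-style invariance (INV) test: a sentiment prediction should
# not change when we only swap the person's name in the input.
# `toy_model` is a trivial stand-in for a real sentiment classifier.

def toy_model(text):
    """Trivial keyword matcher standing in for a real sentiment model."""
    return "positive" if "great" in text else "negative"

def invariance_test(template, fillers, model):
    """Pass if the model's prediction is identical for every filler."""
    predictions = {model(template.format(name=name)) for name in fillers}
    return len(predictions) == 1

names = ["Anna", "Mohammed", "Wei", "Lucia"]
print(invariance_test("{name} thought the movie was great.", names, toy_model))
```

The same pattern scales up: real behavioural test suites run thousands of such templated perturbations to pinpoint exactly which phenomena a model fails on, rather than reporting a single aggregate score.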
I think we will see all four of these ingredients appearing more frequently in future benchmarks, enabling us to bridge the gap to real-world applications. To better keep track of results on such diverse datasets, being able to automatically extract and aggregate results from papers is key. Lastly, it is important to keep in mind the ecological validity of our benchmarks, i.e. that they are reasonable proxies for the scenarios where we would ultimately like to apply our models to make a difference.
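As an illustration of how results on such multi-task benchmarks are typically aggregated, here is a minimal GLUE-style macro-averaging sketch. All numbers are made up for illustration, not actual leaderboard results:

```python
# GLUE-style aggregation: average the metrics within each task first,
# then macro-average across tasks so each task counts equally.
# All scores below are illustrative placeholders.
task_scores = {
    "CoLA": [52.1],         # Matthews correlation
    "SST-2": [93.5],        # accuracy
    "MRPC": [88.9, 84.8],   # F1 and accuracy
    "STS-B": [87.1, 85.8],  # Pearson and Spearman correlation
}

per_task = {task: sum(ms) / len(ms) for task, ms in task_scores.items()}
benchmark_score = sum(per_task.values()) / len(per_task)
print(benchmark_score)
```

A single macro-averaged number is convenient for a leaderboard, but it is exactly what the fine-grained benchmarks above push back against: it hides which tasks and phenomena a model actually fails on.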
Sebastian Ruder @seb_ruder

Regular analyses of advances in natural language processing and machine learning.

In order to unsubscribe, click here.
If you were forwarded this newsletter and you like it, you can subscribe here.
Created with Revue by Twitter.