IE—how did we get here?, Large LMs, The human side of ML
Hi all,
I hope you've had a good start to the new year. This newsletter is a bit delayed due to a confluence of conference deadlines. Whether you're struggling with conference deadlines or cabin fever, I hope this newsletter offers some respite.
On another note, Revue, the platform that I've been using for this newsletter has been acquired by Twitter. In case this newsletter will also be restricted to 280 characters in the future, I will provide you with a language model pre-trained on past editions that will auto-complete the rest. Of course, such a language model—if it existed—has not been used to write this or previous editions or previous editions or previous editions or previous editions </SEP>
This newsletter also starts with a new segment "How did we get here?" where I try to piece together, as best I can, the progress from the inception of an NLP task to today, mostly by following the guidance of people much more knowledgeable in the respective area than me. Let me know if this is something you'd like to see for other NLP tasks in the future and—if so—what tasks you'd be particularly interested in.
In other news, the NLP chapter of ELLIS (the European Laboratory for Learning and Intelligent Systems) will host a workshop on Open Challenges and Future Directions of NLP on 24–25 February. The five keynotes will be live-streamed. As far as I'm aware, the remaining sessions are restricted to a smaller audience. I will aim to share some highlights later on.
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.
If you were referred by a friend, click here to subscribe. If you enjoyed this issue, give it a tweet 🐦.
This recent xkcd comic struck a chord with me. So much more can be done with language beyond treating it as data for training ML models.
Information extraction—how did we get here?
This section is inspired by and mostly follows Claire Cardie's excellent EMNLP 2021 keynote where she reviews the history of information extraction (IE). IE is concerned with extracting structured information from unstructured data. Notably, because the unconstrained setting of IE is challenging and difficult to evaluate, it has typically been studied with regard to a specific domain. Open IE, in contrast, has been studied only since 2007 and has led to tools such as TextRunner and Reverb.
The first milestones in the area of IE, the DARPA-funded Message Understanding Conferences (MUCs) from 1987 to 1998, focused on domains such as terrorism incidents in Latin America, management changes, and satellite launches. For each input document, systems were generally required to fill in slots in a template: for MUC-3, whose focus was terrorism, systems needed to fill in 18 slots (see below for an example).
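To make the template-filling setup a bit more concrete, here is a minimal sketch of how such a template could be represented as a data structure. The slot names are illustrative stand-ins, not the official 18 MUC-3 slots.

```python
# A simplified, illustrative MUC-3-style template for a terrorism event.
# The slot names are hypothetical stand-ins, not the official 18 MUC-3 slots.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TerrorismTemplate:
    incident_type: Optional[str] = None   # e.g. "BOMBING", "KIDNAPPING"
    date: Optional[str] = None
    location: Optional[str] = None
    perpetrator: Optional[str] = None
    physical_target: Optional[str] = None
    human_target: Optional[str] = None
    instrument: Optional[str] = None      # e.g. the weapon used
    effects: List[str] = field(default_factory=list)


def extract_templates(document: str) -> List[TerrorismTemplate]:
    """An IE system fills one template per relevant event in the document;
    irrelevant documents yield no templates at all."""
    raise NotImplementedError
```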
In contrast to many current datasets, the setting generally reflected the intended use case: only ≈ 50% of articles were relevant to the task and ≈ 50% of articles contained more than one relevant event.
The original task as laid out in the MUC challenges required document-level natural language understanding including event and noun phrase coreference resolution. Given the challenging nature of this setting, subsequent years saw IE divided into simpler subtasks such as named entity recognition (NER), coreference resolution, entity linking, relation extraction, event extraction, etc.
Fast-forward to today, where state-of-the-art approaches to IE based on neural networks have generally improved performance for NER (Akbik et al., 2019), relation extraction (Soares et al., 2019), and other tasks. However, relations in common datasets such as ACE do not span sentence boundaries (see below for an example), so the task is still fairly local compared to its MUC predecessors. Even in recent work (Lin et al., 2020) that jointly identifies entities and events, the arguments of an event still need to occur in the same sentence as the event trigger.
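To illustrate how accessible sentence-level IE components have become, here is a minimal sketch of tagging named entities with a pre-trained model via the Flair library (Akbik et al., 2019); the model identifier and API details are assumptions that may differ across library versions.

```python
# Minimal NER sketch with Flair (Akbik et al., 2019). The model identifier
# and API details are assumptions and may differ across library versions.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")  # pre-trained English NER model
sentence = Sentence("George Washington went to Washington.")
tagger.predict(sentence)

for entity in sentence.get_spans("ner"):
    print(entity)  # e.g. a span "George Washington" labelled PER
```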
Going forward, Claire and other IE researchers such as Ellen Riloff, Jian Su, and Heng Ji stress the importance of document-level event and relation extraction, as well as of developing IE approaches that work in low-resource settings. I also think that, given how much knowledge large pre-trained language models have been shown to capture implicitly (Xiong et al., 2020), it's worth revisiting the Open IE setting with recent models.
Here's some further reading material if you want to delve into this area:
Twenty-five years of information extraction by Ralph Grishman
Last Words: What Can Be Accomplished with the State of the Art in Information Extraction? A Personal View by Ralph Weischedel and Elizabeth Boschee
Let's talk about large LMs... 🦜
Large pre-trained language models (LMs) are the de facto standard for achieving state-of-the-art performance on tasks in natural language processing. However, with great power comes great responsibility. Bender and Gebru et al. (2021) provide an overview of issues with large LMs, which covers the following topics, among others.
Access 👩‍💻 Due to the reliance on huge amounts of compute, pre-training of the largest models has mostly been restricted to well-funded corporations, with a few exceptions such as Grover. While checkpoints of models such as BERT and RoBERTa are widely available, the largest recent models (Fedus et al., 2021) go well beyond the capacity of off-the-shelf GPUs. These models cannot easily be fine-tuned by practitioners or are gated behind an API. Given the prominent role that such models will likely play in the future of NLP, it is crucial that the community is involved in their design. Community-led initiatives such as EleutherAI thus seek to replicate large-scale modelling efforts such as GPT-3 and to make them more widely available. Other collaborative projects such as BIG-bench focus on making the benchmarking of such models more accessible.
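To put "widely available" into perspective, here is a minimal sketch of loading a publicly released BERT checkpoint with the Hugging Face transformers library; running or fine-tuning the largest models mentioned above is nowhere near this simple.

```python
# Minimal sketch: loading a publicly released BERT checkpoint with the
# Hugging Face transformers library. The much larger models discussed above
# do not fit on a single off-the-shelf GPU this easily.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Large LMs are hard to fine-tune at home.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch size, sequence length, hidden size)
```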
Energy ⚡️ Another downside is that the compute required to train such models incurs a large financial and environmental cost. It is thus key to focus on the development of more efficient methods that lower these costs, such as more sample-efficient pre-training methods (Clark et al., 2020; see this post for an overview). In addition, we should benchmark our methods not only in terms of absolute performance but also in terms of energy efficiency (Henderson et al., 2020). While we have managed to make downstream training more sample-efficient via fine-tuning, pre-training is generally still done from scratch. I'd like to see more work that seeks to lower the cost of pre-training, for instance by warm-starting from the representations of previous iterations or by distilling from similar pre-trained models.
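As a very rough sketch of what reporting energy alongside accuracy could look like, the snippet below estimates energy from wall-clock time and an assumed average power draw; dedicated tools such as the experiment-impact-tracker of Henderson et al. (2020) measure this far more carefully.

```python
import time

# Back-of-the-envelope sketch: estimate energy from wall-clock time and an
# *assumed* average power draw. Dedicated tools (Henderson et al., 2020)
# measure hardware counters and carbon intensity far more carefully.
ASSUMED_AVG_POWER_WATTS = 250.0  # hypothetical average GPU power draw


def run_training_step():
    time.sleep(0.01)  # stand-in for a real training step


start = time.time()
for _ in range(100):
    run_training_step()
elapsed_s = time.time() - start

energy_kwh = ASSUMED_AVG_POWER_WATTS * elapsed_s / 3_600_000  # W·s -> kWh
print(f"accuracy: ...  |  estimated energy: {energy_kwh:.6f} kWh")
```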
Bias ⚖️ There have been many studies of the biases that such models inherit from their pre-training data (e.g. Basta et al., 2019). Some recent discussions online (see e.g. this short essay) have focused on whether we should prescribe how a language model should behave, among other things. While large pre-trained language models (LMs) have been likened to many things, from puppet characters 🐒 to uncertain winged animals 🦜, it is up to us to ensure that they do not become yet another metaphorical bird—the canaries in the coal mine of algorithmic bias. To ensure that such models have a positive impact on the largest number of people, the same care and deliberation that goes into their design must be taken when choosing the data used for training. In particular, we should revisit tacit assumptions such as the use of lists of banned words.
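To make this tangible, here is a minimal bias-probing sketch that queries a masked LM with a fill-in-the-blank prompt via the transformers fill-mask pipeline; the prompt and model are purely illustrative, and systematic bias evaluation (e.g. Basta et al., 2019) is far more involved.

```python
# Minimal bias-probing sketch using the transformers fill-mask pipeline.
# The prompt and model choice are purely illustrative; systematic bias
# evaluation (e.g. Basta et al., 2019) uses carefully constructed test sets.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The nurse said that [MASK] would be back soon."):
    print(prediction["token_str"], round(prediction["score"], 3))
```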
The human side of ML 🧑‍🏫🧑🏾‍💻👩‍🔬
With a constant focus on performance and research output, the need for human connection and the importance of growing as a researcher or ML practitioner can often take a back seat. When we cannot meet face to face, it is even harder to form such connections and to get to know like-minded people in the field.
In Humans of AI, Devi Parikh interviews 18 leading AI researchers not about their work but about what their lives are like, what they struggle with, their habits, their aspirations, etc. I particularly enjoyed the different answers to some of the questions, such as what traits they like to see in collaborators: clarity of thought and expression, taking time before answering questions, and high bandwidth (Dhruv Batra), or humility and passion (Meg Mitchell).
A crucial way to connect in our field is to write cold emails. Writing cold emails is a skill that, when honed, is a win-win: it can lead to collaborations, opportunities, or research advice while adding value to the recipient (as I mentioned in this post). Eugene Vinitsky gives concrete advice on how to do this in a research setting in A Guide to Cold Emailing.
It is inspiring and helpful to read about individuals' reflections on their ML journey. Maithra Raghu shares some candid thoughts on her PhD experience. She talks about expectations and challenges that often go unsaid, such as feeling completely stuck, and the strategies she developed to cope with them. One neglected aspect is how to know that you have grown as a researcher. Maithra highlights that one part of such growth is developing a research vision, a rich, articulable view of the directions in an area. A related notion is 'research taste', which roughly means the ability to choose 'good' problems to work on. Chris Olah provides some exercises for developing research taste. I found the collection of advice at the end of his post particularly helpful.
Finally, a great initiative that brings people together is NLP with Friends, which hosts weekly or biweekly talks by students for students in a supportive environment. If you lack a rich research environment or just want to interact with like-minded people, this is a great place to present, get feedback on your work, and establish connections.