Reviewing, Taking stock, Theme papers, Poisoning and stealing models, Multimodal generation

Jun 05, 2020

Hi all,

This newsletter took somewhat longer than usual. This is both due to lower energy ⚡️on my side and because I've been struggling to do justice to all the awesome blog posts that many of you are publishing every month 💪.

Overall, I've realized that trying to provide a comprehensive mix of everything that has been going on is not sustainable for me ♻️. So I'll try to refocus 🧘‍♂️ with this newsletter and to prioritize covering fewer things in-depth 🕳.

Going forward, I'll only highlight a few themes together with resources, articles, and papers that had a lasting impact on me 🤯 . This way, I hope to still be able to provide you with important takeaways and updates every month, while also having more time to write long-form blog posts for deeper discussions ✍️.

I'll still be happy to receive pointers to interesting content and will share them as best as I can. If you want your articles read by a wide audience, consider posting them to Goku Mohandas' Made with ML and dair.ai's NLP Newsletter. Both are excellent resources for staying up-to-date with ML and NLP 👏.

Made with ML is also hosting a Data Science Incubator, which looks awesome; kind of like a Coursera course–meets–Google Summer of Code. If you're a student and COVID-19 disrupted your summer plans, definitely check it out!

Lastly, in light of current events, if you are #BlackInStem please send me an email if there's anything I can do to help. ✊🏿

I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.

If you were referred by a friend, click here to subscribe. If you enjoyed this issue, give it a tweet 🐦.

1. Reviewing

With the EMNLP and NeurIPS deadlines almost behind us, the topic of reviewing is of course being discussed again. In particular, the EMNLP Organization team published some great advice on the EMNLP blog. The post gives clear advice on what are often invalid bases for rejecting a paper (emojis mine):

Work on a language other than English (👏); no SOTA results; no use of deep learning; a simple method; a narrow topic; a resource paper.

While we cannot expect a blog post to influence the reviewing landscape dramatically, the post sets the right tone and hopefully leads to less focus on pushing up numbers and more on a paper's actual contribution.

On the topic of resource papers, Anna Rogers published a great post that dispels some myths around the review of resource papers. Resource papers are important contributions and are generally more impactful than an incremental improvement to a model. Creating a resource with a high inter-annotator agreement that tackles an impactful problem takes diligence, patience, and scientific rigour—traits that should be rewarded.

After the reviews generally comes a rebuttal period. Devi Parikh, Dhruv Batra, and Stefan Lee recently published a guide to writing rebuttals. It is rare that leading researchers deliver such didactic insights into their thought processes. I wish this guide existed when I wrote my first rebuttal.

Another new development is the introduction of Findings of EMNLP, a new sister publication to EMNLP for high-quality papers that did not find space in the main conference. Specifically, the proposal highlights four instances of papers that would be appropriate for Findings, including "(1) Papers that make a specific contribution to a narrow subfield, and while not of widespread interest, will have an impact on a small community;" and "(4) Papers that don’t quite fit in EMNLP, but make contributions that are potentially of interest to specific sub-communities." Yoav Artzi highlights potential problems with these criteria and how they might lead to additional biases in reviewing. I personally think that papers that have an outsized impact on a narrow subfield are exactly those that we want to highlight in the main conference. For instance, without advances in the narrow subfield of language modelling, we would still be using just word embeddings.

Finally, while peer review is an integral part of today's scientific process, an instance of peer review is also very common in the tech industry—the code review process. Code review is a core and highly valuable part of software development that can raise small but significant issues and produce profound insights. Shay Palachy advocates for using a process that performs a similar function also for data science projects. The main goal of the process is to catch costly errors early and involves presenting the research process, going through a review and a detailed checklist. A similar, more high-level checklist for data projects has also been put forth by Jeremy Howard.

2. Taking stock

Visualization of trends in ACL 2020 papers by Wanxiang Che (click for the animated version)

With a new decade ahead and the general slowdown at the moment, it is a good time to take stock of the developments of the past years. The above visualization shows at a high level the distribution of trends in ACL papers. Compared to previous years, the current ACL landscape seems to be less dominated by a few areas. Dialogue, generation, QA, and resource papers are all areas that have recently become more popular.

Saif Mohammad has conducted a more in-depth review of trends in NLP based on citation analysis. He finds that only 56% of NLP papers are cited more than 10 times, long papers get about 3x as many citations as short papers (something I didn't expect given influential short papers such as this one), and papers on topics such as sentiment classification, anaphora resolution, and entity recognition have received the highest median number of citations.

In the beginning of the paper, Saif mentions that "there are systematic biases that prevent certain kind of papers from accruing citations". While there are biases against less popular topics, a recent paper on the Diversity-Innovation Paradox in Science shows that there are also systematic biases against underrepresented groups. In particular, novel contributions by minority researchers are taken up by other scholars at lower rates than novel contributions by researchers from a majority. Going forward, we should be mindful not only of the paper we cite but also of the person behind that paper.

Saif is known for another resource, the NRC Word-Emotion Association Lexicon, which just celebrated its 10 year anniversary. Saif highlights some of its interesting applications such as tracking emotions in novels, generating emotive poems, and more. I remember experimenting with it when I briefly worked on emotion recognition. In contrast to its prominent cousin sentiment analysis, emotion detection seems to have faded somewhat. At SemEval-2020, the most similar task asks to judge emotions from memes ('memotion analysis').

3. Theme papers

Another nice development this year has been the theme track at ACL 2020, which explicitly invited papers that "take stock of where we've been and where we're going". The few papers that I have seen so far from this track were among the most refreshing papers I read in a while. My favourite is:

The State and Fate of Linguistic Diversity and Inclusion in the NLP World by researchers from Microsoft Research India.

Language Resource Distribution of The State and Fate paper.

They divide languages in 6 groups based on the amount of labeled and unlabelled data available for them (see above). While unsupervised pre-training may help languages with sufficient unlabelled data (Groups 3-5), they paint a bleak picture for many of the other languages. In particular, they predict that unsupervised pre-training will make the 'poor poorer', so that languages in Group 0, which make up >88% of the world's languages and more than 1B speakers will be left behind. It is up to us to develop more sample-efficient methods and leverage alternative data sources to serve this long tail of languages.

Here are some other theme papers that are well worth reading:

The Unstoppable Rise of Computational Linguistics in Deep Learning (Henderson): James Henderson makes the point that attention-based models induce structure and implicitly do variable binding—in contrast to sequential LSTMs. They are thus more similar to classic models in Computational Linguistics and can generalise with regard to structured representations. What James does not address is that the self-attention mechanism in Transformers has important theoretical limitations. For instance, it cannot model periodic finite-state languages (Hahn, 2020).
A Call for More Rigor in Unsupervised Cross-lingual Learning (Artetxe, me, et al.): In our theme paper, we give a thorough overview of current practices including methodological issues in unsupervised cross-lingual learning. Among other things, we argue that the scenario that is most often used to motivate methods for this setting (no parallel data and abundant monolingual data) is unrealistic in practice.

You can find more theme papers in this Twitter thread.

4. Poisoning and stealing models

The model extraction setup of Krishna et al. (2020) for question answering

These are two recent papers that caught my eye. Both operate in the emerging area of security in NLP and both focus on attacking a pre-trained language model that is fine-tuned on data of a target task.

Thieves on Sesame Street! Model Extraction of BERT-based APIs (Krishna et al., ICLR 2020) This paper focuses on stealing a pre-trained language model such as BERT from a public API (see above; also see the blog post). Many ML models are available via an API where you have to pay for access; if you could successfully extract a model with the same quality as the API model, you would avoid paying. From the distillation literature (Hinton et al., 2014 and more recent papers), we know that we can train an accurate student model from a teacher model given access to the teacher's output probability distribution and training data. In the model extraction setup, we typically don't have access to the training data and only obtain the label the teacher predicted. Krishna et al. show that model extraction works quite well even if we only have access to the label and—more importantly—if we use random sequences. Particularly the latter point—that we can transfer useful knowledge even if we use completely unnatural out-of-distribution text as input—is quite surprising to me. Defending against such attacks while keeping the API usable is still an open problem.

Weight Poisoning Attacks on Pre-trained Models (Kurita et al., ACL 2020) This paper tackles a different security problem: Instead of stealing a public model, we can provide a "poisoned" model for others to download that we can then later exploit. In particular, the authors propose "weight poisoning" attacks that inject vulnerabilities into a model that expose "backdoors" after the model is fine-tuned on a target task. Trigger words are rare words (arbitrary nouns also work) that are inserted in a sequence and enable switching the model's prediction to a target label. For both methods, knowledge of the target task is required but the attack also works if the model is trained on data of a different domain. While there is no real downside to using well-performing publicly available models at the moment, we might want to pay more attention in the future that the public models we use come from reputable sources.

5. Multimodal generation

While there are generally a lot of new development in generation (particularly in language modelling), the last month saw highlights in different modalities.

Language (text)

HuggingFace's Patrick von Platen gave a nice practical overview of different decoding strategies—greedy search, beam search, top-k sampling, top-p sampling—and how you can use them with state-of-the-art Transformers. I found it particularly helpful that he includes tips and recent insights on why greedy search and beam search don't work that well.

The last weeks also saw the release of two large models—Facebook's BlenderBot and OpenAI's GPT-3—for dialogue and language modelling respectively. Both are much larger than previous models. Curiously, OpenAI focuses exclusively on zero-shot evaluation (i.e. evaluation without fine-tuning and at most conditioning on a few examples in the past) of their model, a setting that is not too common at the moment. In this setting, however, their model exhibits quite strong gains compared to smaller models. For a broader overview of GPT-3, see also this post by Yoel Zeldes. Using the 175B-parameter GPT-3 will be a challenge in practice. I expect that we will soon see many compressed versions of this model that trade some accuracy for usability.

Table-to-text

One of the most interesting applications of natural language generation is arguably when the input contains some structure that the model should take into consideration. One area where this is useful is generating reports of sport events such as in basketball or baseball (Puduppully et al., 2019) where large tables of statistics are available. ToTTo (Parikh et al., 2020) is a new large-scale dataset for table-to-text generation based on Wikipedia. An interesting finding of the paper is that state-of-the-art models are able to generate fluid sentences but often hallucinate phrases that are not supported by the table. More grounding is thus necessary!

Audio (music)

OpenAI also created Jukebox, a VQ-VAE-based model that can generate songs (including lyrics, etc) that sound realistic to the untrained ear. Generation is still pretty slow (it takes around 9 hours to generate one minute of audio). This can be sped up but it will probably take a while until your weekly Spotify recommendations will include songs that have been generated exactly to suit your tastes. In the meantime, you can listen to some example generations here. Another example of music generation is Shimon, a robot by Georgia Tech. Researchers have previously improvised with the robot. The neural network used for music synthesis by Shimon was trained on a large database of song lyrics (although it is not clear what architecture the model is based on).

NLP News