Another thing to note is that we now have
Findings of EMNLP 2020, a
collection of 447 long and short papers that in many cases focus on more niche areas. Personally, I enjoyed reading many Findings papers on such under-studied topics and am excited about this new venue. Here are the EMNLP papers that I enjoyed most so far:
What Can We Do to Improve Peer Review in NLP? 👩🏫 This meta-research
Findings paper discusses the pros and cons of peer review. It lucidly highlights many of the points raised in the ongoing debate. I found their characterisation of peer review as an “apples-to-oranges comparison”, where reviewers are forced to weigh papers with completely different merits against each other, particularly compelling. They also highlight lessons from psychology: for instance, the
proclivity of reviewers to resort to heuristics can be explained by the human tendency to “unconsciously substitute [a difficult question] with an easier one (Kahnemann, 2013)”.
MultiCQA: Zero-Shot Transfer of Self-Supervised Text Matching Models on a Massive Scale 📚 One thing that I generally appreciate is that as a community, we have increasingly moved to evaluating methods on multiple tasks or domains. This paper takes this to the extreme by considering
transfer from 140 (!) English StackExchange domains using adapters. The models are evaluated on answer selection and question similarity datasets and largely outperform IR baselines. What I found particularly interesting is that neither domain similarity nor training data size consistently predicted the best models—instead,
having access to diverse source models is important. Indeed, combining information from all source domains performs best. The pre-trained adapters are available at
AdapterHub and can be easily downloaded in a plug-and-play fashion.
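If you want to try this yourself, a minimal sketch with the adapter-transformers library (AdapterHub's extension of Hugging Face transformers) might look like the following; the adapter identifier below is just a placeholder, so check AdapterHub for the actual ids of the MultiCQA adapters:

```python
# Minimal sketch assuming the adapter-transformers package is installed
# in place of vanilla transformers. The adapter id is a placeholder;
# browse https://adapterhub.ml for the real identifiers.
from transformers import AutoModelWithHeads, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelWithHeads.from_pretrained("bert-base-uncased")

# Download a pre-trained adapter and activate it for inference.
adapter_name = model.load_adapter("qa/stackexchange@ukp")  # placeholder id
model.set_active_adapters(adapter_name)

inputs = tokenizer("How do I normalise a vector?", return_tensors="pt")
outputs = model(**inputs)
```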
Which *BERT? A Survey Organizing Contextualized Encoders 🦍 This survey is a great starting point to catch up on what has been going on in Transformer land. It synthesises many important points and takeaways from the recent literature and makes a number of recommendations. Specifically, I second their suggestion to
publish and publicise negative results. Venues such as the
Workshop on Insights from Negative Results in NLP or even
Findings should be a good fit. Alternatively, if you have a method that
works, consider including a section in the appendix describing
what did not work. I also really like the idea of leaderboard owners periodically publishing surveys of their received submissions (stay tuned for an update on
XTREME). Overall, choosing which BERT to use requires
trading off task performance and efficiency for your application, deciding whether
leaderboard performance reflects that of your downstream task, opting for
monolingual or multilingual models, etc.
On Losses for Modern Language Models 📄 Proposing a new pre-training objective has been one of the most common modelling contributions in papers that seek to advance self-supervised pre-training. To date, however,
it has not been clear how these different losses interact and if they provide any substantial benefit over the now standard masked language modelling (MLM). This paper conducts a thorough study of both existing and new pre-training objectives (including next-sentence prediction). They find that next sentence prediction does not help as it is too easy but
identify several auxiliary tasks that outperform MLM alone, including predicting sentence order or adjacency, predicting tf-idf statistics, and efficiently predicting sentence continuations. Combining them makes for more data-efficient pre-training. Overall, besides better pre-training objectives,
future pre-trained models may thus rely on many objectives in order to be more sample-efficient.
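To make the idea of combining objectives concrete, here is a toy sketch of my own (not the paper's code) that trains a shared encoder on a weighted sum of MLM and an auxiliary sentence-order loss; the heads, the weighting, and the label conventions are all illustrative choices:

```python
import torch
import torch.nn as nn

class MultiObjectivePretrainer(nn.Module):
    """Toy illustration: a shared encoder trained on a weighted sum of masked
    language modelling and an auxiliary sentence-order objective."""

    def __init__(self, encoder, hidden_size, vocab_size, aux_weight=0.5):
        super().__init__()
        self.encoder = encoder                       # any module mapping ids -> (batch, seq, hidden)
        self.mlm_head = nn.Linear(hidden_size, vocab_size)
        self.order_head = nn.Linear(hidden_size, 2)  # in-order vs. swapped sentence pair
        self.aux_weight = aux_weight
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, input_ids, mlm_labels, order_labels):
        hidden = self.encoder(input_ids)             # (batch, seq, hidden)
        mlm_loss = self.ce(self.mlm_head(hidden).flatten(0, 1), mlm_labels.flatten())
        order_loss = self.ce(self.order_head(hidden[:, 0]), order_labels)  # first token as [CLS]
        return mlm_loss + self.aux_weight * order_loss

# Dummy usage with a toy embedding "encoder", just to show the shapes.
vocab, hidden = 100, 32
model = MultiObjectivePretrainer(nn.Embedding(vocab, hidden), hidden, vocab)
ids = torch.randint(0, vocab, (2, 8))
mlm_labels = torch.full((2, 8), -100)
mlm_labels[:, 3] = ids[:, 3]                         # only masked positions carry MLM labels
order_labels = torch.tensor([0, 1])
loss = model(ids, mlm_labels, order_labels)
loss.backward()
```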
Identifying Elements Essential for BERT’s Multilinguality 🌍 One of the most intriguing problems in recent NLP for me has been
the mystery of how pre-trained multilingual models can generalise effectively across languages without any explicit cross-lingual supervision (see
our recent study as
well as
others). This paper sheds more light on this problem through a controlled study in a synthetic setting: learning representations between English and Fake-English (where token indices are shifted by a constant). They find that
underparameterisation, shared special tokens, shared position embeddings, and masked language modelling with random word replacement all contribute to multilinguality. Perhaps most interestingly,
the model completely fails when the word order of English is inverted, which indicates a challenge for multilingual representation learning. Overall, while such a synthetic setting can only approximate the messiness of real-world multilingual data, any approach that fails here won’t be successful under realistic circumstances. Recent papers that employ a similar synthetic setting based on modifying English are (
Hartmann et al., 2018;
Ravfogel et al., 2019;
K et al., 2020;
Vulić et al., 2020; disclaimer: I’m a co-author on the last one).
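For intuition, here is a small sketch of how such a Fake-English corpus could be constructed from the description above (my own illustration, not the paper's code; keeping the special tokens shared mirrors one of the factors the paper examines, and a joint English/Fake-English model would need an embedding matrix covering both vocabularies):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
OFFSET = tokenizer.vocab_size            # shift content tokens past the English vocabulary
SPECIAL_IDS = set(tokenizer.all_special_ids)

def to_fake_english(text):
    """Map a sentence to 'Fake-English': identical structure, disjoint vocabulary."""
    ids = tokenizer(text, add_special_tokens=True)["input_ids"]
    # Keep special tokens ([CLS], [SEP], ...) shared between the two "languages".
    return [i if i in SPECIAL_IDS else i + OFFSET for i in ids]

print(to_fake_english("Multilingual models are mysterious."))
```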
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision 🖼 There has recently been a flurry of pre-trained multimodal approaches. These models report substantial gains on multimodal tasks such as image captioning and visual question answering. Grounded language agents have also been observed to encode spatial relations (
Ramalho et al., 2018) and to be capable of fast generalisation (
Hill et al., 2020).
So far, however, gains from multimodal models on standard text-only NLP tasks have remained elusive. This paper proposes to ground a language model via token-level supervision with token-related images (visualised tokens, or
vokens). Specifically, the authors pre-train a BERT model to additionally classify the relevant image for each token (which is retrieved via a token-image retrieval model trained on image captioning data). They report gains on GLUE, SQuAD and SWAG. In a sense, this paper also demonstrates the
usefulness of multi-view learning over learning from unrelated datasets in multiple modalities. I hope this paper will lead to
more creative uses of visual data for text, extensions to other modalities, and applications in multilingual settings.
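To illustrate the core idea, here is a rough sketch of the retrieval step that assigns every contextualised token its nearest image, i.e. its voken (my own simplification: plain cosine similarity stands in for the learned token-image matching model):

```python
import torch
import torch.nn.functional as F

def assign_vokens(token_embeddings, image_embeddings):
    """Illustrative voken assignment: map each contextualised token embedding
    to its most similar image embedding by cosine similarity.

    token_embeddings: (seq_len, dim), image_embeddings: (num_images, dim)
    returns: (seq_len,) tensor of image ("voken") indices
    """
    tok = F.normalize(token_embeddings, dim=-1)
    img = F.normalize(image_embeddings, dim=-1)
    scores = tok @ img.T           # (seq_len, num_images) similarity matrix
    return scores.argmax(dim=-1)   # nearest image per token

# Toy usage: 6 tokens, 1000 candidate images, 128-dim embeddings.
vokens = assign_vokens(torch.randn(6, 128), torch.randn(1000, 128))
```

The resulting voken ids can then serve as token-level classification targets alongside the usual MLM loss.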
With Little Power Comes Great Responsibility 💪 This paper studies a concept that is under-appreciated in the NLP literature: statistical power, the probability that a test will detect a true effect if one exists. The power depends on both the sample size (e.g. the number of examples in a test set) and the expected difference in performance. The authors find that
many standard tasks such as WNLI, MRPC, and SST-2 in GLUE are underpowered, i.e. their test sets are not large enough to conclusively detect whether a new method actually improves over an existing one. They also find that
the most common design for human rating studies (3 workers, 100 items) is underpowered. Overall, this study highlights that as our models become more powerful and improvements on tasks become slimmer, we
need to design tests with larger sample sizes in order to decisively detect advances.
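To get a feel for what "underpowered" means, here is a rough Monte Carlo sketch (mine, and much cruder than the paper's analysis): it estimates how often a McNemar-style sign test detects a two-point accuracy difference between two systems, under the simplifying and unrealistic assumption that their errors are independent; the accuracies, test-set sizes, and significance level are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import binomtest

def simulated_power(n_examples, acc_a=0.90, acc_b=0.92, alpha=0.05, n_sims=2000, seed=0):
    """Fraction of simulated test sets on which system B's improvement over
    system A comes out significant (one-sided sign test on disagreements)."""
    rng = np.random.default_rng(seed)
    significant = 0
    for _ in range(n_sims):
        correct_a = rng.random(n_examples) < acc_a    # per-example correctness of A
        correct_b = rng.random(n_examples) < acc_b    # per-example correctness of B
        b_only = int(np.sum(correct_b & ~correct_a))  # B right, A wrong
        a_only = int(np.sum(correct_a & ~correct_b))  # A right, B wrong
        if b_only + a_only == 0:
            continue  # no disagreements: counts as a non-significant run
        p = binomtest(b_only, b_only + a_only, 0.5, alternative="greater").pvalue
        significant += p < alpha
    return significant / n_sims

print(simulated_power(n_examples=500))     # small test set: power well below 0.8
print(simulated_power(n_examples=10_000))  # large test set: near-certain detection
```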