EMNLP 2023, one of the biggest NLP conferences, takes place this week from Dec 6–10 in Singapore.
In this newsletter, I’ll discuss a selection of exciting papers and workshops I’m looking forward to at the conference. Here are the main trends I observed (based on the small sample of papers I discuss here and those I came across online):
Instruction-tuned LMs and LLMs are everywhere. Similar to earlier years where BERT was ubiquitous, instruction-tuned language models (LMs) and large language models (LLMs) are used in almost every paper.
Evaluation based on LLMs is increasingly common. Some papers employ automatic evaluation based on GPT-4, while newly proposed metrics are based on LLMs in zero-shot prompted or fine-tuned settings.
Prompt usage is getting more creative. Beyond a standard prompt template, prompts are getting increasingly complex and specialized to the desired setting. Techniques such as chain-of-thought prompting are common tools.
Multilinguality is increasingly popular. I came across a substantial number of papers studying multilingual settings, which indicates that LLMs are still limited in non-English settings and that making LLMs more multilingual is an important direction.
On the other hand, I did not come across many papers that tried to analyze LLM properties (using a synthetic setup, for instance) or that used external models or tools to augment LLMs (please point me to papers that I missed). This is in contrast to NeurIPS 2023 where such papers were more common (see the below newsletter).
I’ll be attending the conference in person, so say “hi” if you’re there.
Papers with a † are presented in Findings of EMNLP (rather than the main conference). I am an author/co-organizer on papers/events indicated with a *.
Workshops
GenBench Workshop, Dec 6. Generalization is crucial to ensure robust model behavior, but what good generalization looks like and how it should be evaluated is still not well understood. The GenBench workshop on (benchmarking) generalization in NLP aims to catalyse research on generalization and how to measure it in the NLP community. Accepted papers study generalization or are BIG-bench-style collaborative benchmarking tasks (CBT). The program consists of invited talks, CBT spotlights, as well as oral presentations and posters.
Workshop for NLP Open-Source Software (NLP-OSS), Dec 6. In light of the increasing number of closed-source LLMs, it is important to continue to promote an open culture of sharing knowledge, data, and software, from which the NLP community has benefited greatly. This workshop aims to further the sharing of insights regarding the creation and development of NLP open-source software. Invited talks feature important NLP open-source projects including trlX, a framework for large-scale open-source RLHF, and SEA-LION, LLMs pre-trained for Southeast Asian languages.
The Big Picture Workshop: Crafting a Research Narrative, Dec 7*. In research, we “stand on the shoulders of giants”. However, given the number and rapid pace of published papers, it has become increasingly difficult to recognize the larger story to which a paper is connected. The Big Picture Workshop aims to explore and distill such broader research narratives. We have a diverse set of accepted papers that provide insightful syntheses of different threads of research. On the workshop day, we’ll try out a new presentation format where we have researchers from different groups working on the same topic critically reflect on and discuss their work.
Multilingual Representation Learning Workshop (MRL), Dec 7*. This workshop provides a forum to discuss work to improve NLP in low-resource and under-represented languages. The large number of accepted papers and Findings papers explore a diverse set of methods, from meta-learning to tokenization and instruction tuning. In addition, a shared task on multilingual multi-task information retrieval provided new data for NER and QA for a typologically diverse set of languages. The workshop day is jam-packed with excellent invited talks, poster, shared task, and best paper sessions.
Unanswerability and attribution in QA
In question answering (QA), a crucial challenge for current LLMs is hallucinating answers. A scenario where such hallucinations are common is when questions do not have an answer. To deal with hallucinations, a promising strategy is to train the model to attribute the answer to relevant references.
The Curious Case of Hallucinatory (Un)answerability: Finding Truths in the Hidden States of Over-Confident Large Language Models (Slobodkin et al.). This work shows that LLMs are aware of the concept of (un)answerability and that the representation of the first decoded token provides a strong indicator of whether a question is answerable (removing this information in the first token significantly decreases performance). Furthermore, mentioning that the question is unanswerable in the prompt improves performance. These results highlight that developing better decoding methods will also help make LLMs more factual.
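To make this concrete, here is a minimal sketch (not the authors’ code) of probing the hidden state from which the first answer token is decoded with a linear classifier; the model, prompts, and labels are illustrative placeholders.

```python
# Minimal sketch: probe whether the hidden state that produces the *first*
# decoded token already encodes answerability. Model and data are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def first_token_state(prompt: str) -> torch.Tensor:
    """Last-layer hidden state at the final prompt position, i.e. the state
    from which the first answer token is decoded."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]              # shape: (hidden_dim,)

# Hypothetical labelled prompts: 1 = answerable from the context, 0 = unanswerable.
prompts = [
    "Context: Paris is the capital of France. Question: What is the capital of France? Answer:",
    "Context: Paris is the capital of France. Question: Who founded Paris? Answer:",
]
labels = [1, 0]

X = torch.stack([first_token_state(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict(X))
```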
Evaluating and Modeling Attribution for Cross-Lingual Question Answering (Muller et al.)*. This paper introduces attribution for cross-lingual question answering where the document supporting the generated answer may be in a different language than the question and answer. It creates the XOR-AttriQA dataset to measure attribution of SOTA QA models across 5 languages. Surprisingly, a large portion of generated answers are not attributable to any retrieved passage (up to 47% of correctly predicted answers in Japanese are not attributable). Current QA systems are thus often right but without any evidence, making them untrustworthy. Multilingual LLMs can be used to accurately detect attribution (which can complement string-based evaluation metrics) and can be used to rerank generated answers, improving QA performance. Key research directions are a) improving retrieval of cross-lingual passages and b) designing robust LLM-based metrics for QA evaluation.
Instruction tuning
Instruction tuning is a common way to improve LLMs for downstream settings and to align them to human behavior. However, current instruction tuning datasets still have their limitations (see the below newsletters for an overview).
The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning (Kim et al.). This paper introduces the Chain-of-Thought (CoT) Collection, which augments the Flan collection covering 1,060 tasks with 1.84M chain-of-thought rationales. This amount of chain-of-thought instruction tuning data is particularly useful for smaller LMs and improves their performance on reasoning tasks including BIG-bench Hard and the multilingual MGSM benchmark.
Task Adaptation
While LLMs achieve very strong performance in a zero-shot setting, fine-tuning them on task data is still necessary to achieve the best results. Keeping models updated as the distribution changes and encoding task knowledge efficiently across many settings are key challenges in this area.
Meta-Learning Online Adaptation of Language Models (Hu et al.). Keeping LLMs up-to-date is an important challenge as it is prohibitive to re-train these models. This paper hypothesizes that when continually fine-tuning a model on a stream of documents, the learning signal of important documents may be drowned out. To ameliorate this, the authors propose to meta-train a small model to reweigh the LM loss for each token during online fine-tuning in order to maximize the model’s QA performance after a single weighted update. They find that this dynamic weighting significantly outperforms standard fine-tuning and weighting heuristics.
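As a rough illustration of the core idea (not the authors’ implementation), the sketch below reweights the per-token LM loss with weights produced by a small network; the weight network and tensors are toy stand-ins for the meta-trained model and real document batches.

```python
# Minimal sketch of per-token loss reweighting during online fine-tuning.
# The weight network is a toy stand-in for the meta-trained reweighting model.
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits, labels, hidden, weight_net):
    """Next-token cross-entropy, reweighted per token.

    logits: (batch, seq, vocab), labels: (batch, seq),
    hidden: (batch, seq, dim) features fed to the weight network.
    """
    # Shift so that position t predicts token t+1.
    logits, labels, hidden = logits[:, :-1], labels[:, 1:], hidden[:, :-1]
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1), reduction="none"
    ).view(labels.shape)
    weights = torch.sigmoid(weight_net(hidden)).squeeze(-1)   # (batch, seq-1)
    return (weights * per_token).sum() / weights.sum()

# Toy usage with random tensors and a linear weight network.
B, T, V, D = 2, 8, 100, 16
weight_net = torch.nn.Linear(D, 1)
loss = weighted_lm_loss(torch.randn(B, T, V), torch.randint(V, (B, T)),
                        torch.randn(B, T, D), weight_net)
loss.backward()
```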
Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning (Poth et al.)*. Full fine-tuning of LLMs has become prohibitive and requires parameter-efficient methods instead. This demo paper presents Adapters, a library for parameter-efficient and modular learning with LLMs and the successor to adapter-transformers. Adapters integrates 10 diverse modular methods such as prompt tuning, prefix tuning, Compacter, LoRA, and (IA)³ into 20 state-of-the-art models for NLP, vision, and multi-modal applications. It supports a range of operations on these modules such as grouping, stacking, fusing, splitting, and parallelizing, among others, which enable a variety of modeling approaches and research directions.
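To give a flavour of what such modular methods do under the hood, here is a minimal plain-PyTorch sketch of LoRA, one of the methods the library integrates. This is an illustration of the idea, not the Adapters library’s API, and the class and hyperparameters are my own placeholders.

```python
# Minimal LoRA sketch: a trainable low-rank update added to a frozen linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(r, base.out_features))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen projection plus the trainable low-rank update.
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scaling

# Toy usage: wrap a linear layer; only the LoRA parameters receive gradients.
layer = LoRALinear(nn.Linear(64, 64))
layer(torch.randn(4, 64)).sum().backward()
print([n for n, p in layer.named_parameters() if p.requires_grad])
```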
Outlier Dimensions Encode Task-Specific Knowledge (Rudman et al.). This paper shows that outlier dimensions (dimensions with a variance that is significantly higher than the average) in LLMs persist during fine-tuning. They also find that just using the embedding value of such a high-variance dimension with a linear threshold can achieve performance similar to using the full model for some tasks and models. We already know that LLMs capture task knowledge in a low-dimensional subspace (see Aghajanyan et al., 2021, for instance)—but the observation that the subspace can be 1D for some settings can motivate the development of new efficient methods.
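A minimal sketch of this 1D classifier (with random placeholder embeddings standing in for LLM features): take the highest-variance dimension and tune a single threshold on it.

```python
# Minimal sketch: pick the embedding dimension with the highest variance and
# classify with a single threshold. Embeddings and labels are placeholders.
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 768))       # stand-in for pooled LLM embeddings
y_train = rng.integers(0, 2, size=200)      # stand-in for binary task labels

outlier_dim = X_train.var(axis=0).argmax()  # dimension with the largest variance
values = X_train[:, outlier_dim]

# Pick the threshold and direction that maximize training accuracy.
candidates = [
    (np.mean((values > t) == y_train), t, ">") for t in np.unique(values)
] + [
    (np.mean((values <= t) == y_train), t, "<=") for t in np.unique(values)
]
acc, threshold, direction = max(candidates, key=lambda c: c[0])
print(f"dim {outlier_dim}: predict 1 if value {direction} {threshold:.3f} "
      f"(train accuracy {acc:.3f})")
```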
NLG Evaluation
As LLMs are increasingly applied to generate natural language text, we need better metrics to evaluate their performance. One of the most promising directions is to use LLMs themselves as part of the metric, whether in a zero-shot setting or fine-tuned on relevant data.
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al.). This paper proposes G-Eval, a framework for NLG evaluation using LLMs as reference-free metrics. Given a description of the task and the evaluation criteria, they first generate a more detailed CoT-style description of the evaluation steps using an LLM. All descriptions are then concatenated with the input example and fed to the LLM. Rather than directly predicting a score for each evaluation criterion, the authors observe that they obtain better measurements if they instead take the sum of all candidate scores weighted by their probabilities. On summarization, the framework achieves a higher correlation with human judgements than existing metrics.
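The probability-weighted scoring step can be sketched as follows; the candidate scores and probabilities below are illustrative and would in practice be read off the LLM’s output distribution over the score tokens.

```python
# Sketch of G-Eval's scoring step: weight each candidate score by the
# probability the model assigns to it and sum, instead of taking the argmax.
def expected_score(score_probs: dict[int, float]) -> float:
    """score_probs maps each candidate score (e.g. 1-5) to its probability."""
    total = sum(score_probs.values())
    return sum(s * p for s, p in score_probs.items()) / total

# Illustrative probabilities for the score tokens "1".."5".
probs = {1: 0.02, 2: 0.08, 3: 0.30, 4: 0.45, 5: 0.15}
print(expected_score(probs))   # 3.63 -- finer-grained than the argmax score (4)
```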
INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback (Xu et al.). This paper proposes InstructScore, a fine-grained reference-based explainable metric for NLG evaluation using LLMs. To fine-tune the LLM as the metric, they first collect unlabeled sentences. They then specify the number of errors, error types, and their severity labels for each sentence and ask GPT-4 to generate an incorrect sentence containing the errors matching the criteria, along with an explanation for each error. LLaMA is then fine-tuned on the generated data to identify and explain the errors in the incorrect sentence compared to the reference. LLaMA is further refined using feedback from GPT-4 regarding the correctness of the generated explanations. In practice, InstructScore achieves similar or higher correlation with human judgements than existing metrics on translation and NLG tasks.
Multilingual Models
While current LLMs excel on many tasks for English, performance is still much worse on languages with limited data. We thus require models that perform well for such languages and new methods to effectively scale models to these languages.
FinGPT: Large Generative Models for a Small Language (Luukkonen et al.). This paper is a comprehensive study of training LLMs for a small language (Finnish) including the collection of a diverse dataset, monolingual training at different model sizes (up to 13B parameters), adaptation of an existing multilingual language model (BLOOM) to the new language, and creation of a language-specific benchmark. The trained models outperform all previous models for Finnish while the language-adapted multilingual model outperforms the monolingual models. Overall, this is a nice blueprint of how LLMs can be trained for medium-resource languages.
mmT5: Modular Multilingual Pre-Training Solves Source Language Hallucinations (Pfeiffer et al.)*†. This paper proposes mmT5, the first modular multilingual generative model. The mT5-style model is pre-trained with language-specific modules and dramatically outperforms mT5 at similar parameter sizes while matching or outperforming XLM-R. Importantly, the model’s modularity enables more direct control over its outputs. While mT5 generates text in the correct language in only 7% of cases for zero-shot cross-lingual summarization, mmT5 generates text in the correct language in 99% (!) of cases.
XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models (Liang et al.). This paper proposes XLM-V, an XLM-R-style model covering 100 languages that is pre-trained with a 1M vocabulary. To create the vocabulary, vocabularies of languages are first clustered (Chung et al., 2020), clusters are allocated capacity corresponding to their average log probability (Zheng et al., 2021), and sentencepiece models are trained for each cluster and then combined. While pre-training with a 1M vocabulary is 2.5x slower than with a 250k vocabulary, the resulting model outperforms a (reimplemented) XLM-R.
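To make the vocabulary-construction recipe more concrete, here is a rough sketch (an illustration, not the paper’s code): the corpus files and cluster statistics are placeholders, and a simple proportional rule based on exponentiated average log-probability stands in for the ALP-based allocation of Zheng et al. (2021).

```python
# Rough sketch: allocate a shared vocabulary budget across language clusters,
# train a SentencePiece model per cluster, and union the resulting pieces.
# Corpus paths, cluster statistics, and the allocation rule are placeholders.
import math
import sentencepiece as spm

clusters = {                      # cluster name -> (corpus file, avg. log prob)
    "cluster_a": ("corpus_a.txt", -1.2),
    "cluster_b": ("corpus_b.txt", -2.5),
}
total_budget = 20_000             # toy stand-in for the 1M vocabulary

weights = {c: math.exp(lp) for c, (_, lp) in clusters.items()}
z = sum(weights.values())

vocab = set()
for name, (corpus, _) in clusters.items():
    size = max(1000, int(total_budget * weights[name] / z))
    spm.SentencePieceTrainer.train(
        input=corpus, model_prefix=f"spm_{name}", vocab_size=size
    )
    sp = spm.SentencePieceProcessor(model_file=f"spm_{name}.model")
    vocab.update(sp.id_to_piece(i) for i in range(sp.get_piece_size()))

print(f"combined vocabulary size: {len(vocab)}")
```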
Romanization-based Large-scale Adaptation of Multilingual Language Models (Purkayastha et al.)*†. This paper explores the potential of large-scale transliteration to enable multilingual LMs to deal with under-represented languages. In particular, the paper romanizes (i.e., maps UTF-8 characters to Latin characters) text using uroman across 14 diverse languages, which is then used to adapt multilingual LMs. Romanization is particularly useful in the most challenging setups: on languages with unseen scripts and with limited training data.
Multilingual Datasets and Evaluation
A key challenge for multilingual NLP is the lack of evaluation datasets and studies that accurately assess the performance of multilingual models. The creation of new datasets and the development of new evaluation measures and analyses are thus important research directions.
Multilingual Large Language Models Are Not (Yet) Code-Switchers (Zhang et al.). This paper evaluates LLMs on three code-switching tasks: sentiment analysis (English-{Spanish, Malayalam, Tamil}), translation (English-Hindi), and word-level language identification (English-Hindi, Standard Arabic-Egyptian Arabic). They observe that smaller fine-tuned multilingual LMs (XLM-R, mDeBERTa) still outperform zero-shot prompted LLMs on these tasks.
XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages (Ruder et al.)*†. XTREME-UP is a new benchmark focusing on user-centric tasks in under-represented languages with realistic amounts of available data. The benchmark includes impactful multi-modal tasks such as ASR and OCR, which we make accessible for text-only models by providing baseline system outputs (in addition to the original audio and image inputs). We created new data for a range of different tasks and updated standard tasks such as QA and NER to make them more practically relevant. We find that multilingual fine-tuned models still outperform few-shot prompted models on most tasks and that character-level modeling is beneficial. Overall, there is still a lot of headroom left to improve performance on under-represented languages.
The Skipped Beat: A Study of Sociopragmatic Understanding in LLMs for 64 Languages (Zhang et al.). This paper introduces SPARROW, a multilingual multi-task benchmark spanning 169 datasets from different online platforms to measure sociopragmatic understanding in LLMs (i.e., how well they perform on tasks related to social interaction such as sentiment analysis, emotion detection, etc.). They observe that fine-tuned models outperform zero-shot prompted models as well as ChatGPT. LLMs perform particularly poorly on humor and antisocial language detection, and ChatGPT performs poorly across most languages compared to the best model.
AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages (Muhammad et al.)*. This paper introduces AfriSenti, a sentiment analysis benchmark consisting of 110k+ tweets in 14 African languages. The dataset was used in the AfriSenti SemEval-2023 Shared Task. Data collection and annotation challenges included a lack of support for African languages by the Twitter API, lack of tone markings, frequent code-mixing and dialects, sarcasm and ambiguities, and a lack of annotators and a reliable Internet connection. The strongest model, AfroXLM-R, achieves 67.2 accuracy across all languages, leaving ample room for improvement.
Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU (Koto et al.). This paper introduces IndoMMLU, the first benchmark on Indonesian language and culture, consisting of 15k questions from primary school to university entrance exams. Among the 24 evaluated models, GPT-3.5 is the only one that passes primary school exams, while no LLM demonstrates familiarity with local Indonesian languages and culture. The language exams also enable assessing the level of Indonesian language proficiency: GPT-3.5 fails to pass the exams for grades 7 and above, while other models only pass grades 1–3.
TaTA: A Multilingual Table-to-Text Dataset for African Languages (Gehrmann et al.)*†. This paper proposes Table-to-Text in African languages (TaTA), the first large multilingual table-to-text dataset with a focus on African languages. TaTA was created by transcribing figures and associated text in bilingual reports by the DHS Program, which were then professionally translated to make the dataset fully parallel. We find that less than half of the outputs from an mT5-XXL-based model are understandable and attributable to the source data. We also observe that existing metrics perform poorly for multilingual table-to-text generation and introduce a new learned metric that achieves a high correlation with human judgements.
What papers did you find exciting at EMNLP 2023? Let me know in the comments.