More papers on AI are published than ever before,[1] but each paper tends to present only its part of the picture, and it becomes difficult to recognize the larger story to which a paper is connected. To encourage the community to explore broader research narratives, we[2] co-organized the Big Picture Workshop at EMNLP 2023.[3] We received a number of high-quality submissions that distill important research topics, from narrative understanding to modern generation techniques.
My favorite part of the workshop, however, was the invited talks. We had asked researchers from different labs working on the same topic to reflect on and consolidate their often disagreeing contributions. The result was a set of talks that were more nuanced and engaging than typical, one-sided scientific presentations. The talks covered topics from in-context learning to attention as explanation and morality.[4]
I recently joined Cohere to help solve real-world problems with LLMs. We are hiring!
What Does In-Context Learning Need?
In-context learning (ICL) is one of the most important emergent phenomena of LLMs, but it is still not clearly understood which factors contribute to its success. At EMNLP 2022, two papers with seemingly contradictory conclusions were published: in “Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?”, Sewon Min and others argued that random labels in the demonstrations perform similarly to ground-truth labels. Kang Min Yoo and Junyeob Kim, on the other hand, found in “Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations” that ground-truth labels are important. So, do we need ground-truth labels for ICL or not?
In their talk (slides), Sewon and Junyeob shed light on this conundrum. ICL with random labels is more sensitive than ICL with ground-truth labels, but it still works with careful prompting. However, small deteriorations in performance can be observed, which are not negligible in real-world applications. Although their impact may vary across setups, the correctness of labels thus remains one of the core components of successful ICL.
They argue that the main reason ICL works without ground-truth labels is that ICL activates priors from pre-training rather than learning new tasks on the fly. What I found very insightful is that they also put their findings in the context of more recent work, “Larger language models do in-context learning differently”, which hypothesizes that overriding such semantic priors is an emergent ability of larger models. Upon closer inspection of that paper’s results, however, they found that no model with flipped labels performs better than random. With standard ICL, even large models are thus unable to override their pre-training priors.
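To make the two settings concrete, here is a minimal sketch of how few-shot demonstrations with ground-truth versus random labels might be constructed for a toy sentiment task. The examples, prompt format, and function names are my own illustration and are not taken from either paper.

```python
import random

# Toy demonstration pool for a binary sentiment task (made-up examples).
demonstrations = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I regret wasting two hours on this film.", "negative"),
    ("A heartfelt story with wonderful acting.", "positive"),
    ("The plot made no sense and the pacing was awful.", "negative"),
]
label_space = ["positive", "negative"]


def build_prompt(demos, test_input, randomize_labels=False, seed=0):
    """Format k demonstrations followed by the test input.

    With randomize_labels=True, each demonstration keeps its input and the
    label format, but the label itself is sampled at random (the setting
    studied with random labels). With False, the gold labels are used.
    """
    rng = random.Random(seed)
    lines = []
    for text, gold_label in demos:
        label = rng.choice(label_space) if randomize_labels else gold_label
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {test_input}\nSentiment:")
    return "\n\n".join(lines)


test_example = "An unforgettable performance by the lead actress."
print(build_prompt(demonstrations, test_example, randomize_labels=False))
print("---")
print(build_prompt(demonstrations, test_example, randomize_labels=True))
# Either prompt is then sent to an LLM, and its next-token prediction
# ("positive" / "negative") is compared against the gold answer.
```

The only difference between the two prompts is whether the demonstration labels are correct; the inputs and the label format stay exactly the same.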
Is "Attention = Explanation"?
In 2019, a pair of papers memorably titled “Attention is not Explanation” and “Attention is not not Explanation” studied whether attention is useful as a faithful explanation of model predictions.[5] In their joint talk (slides), Sarthak Jain and Sarah Wiegreffe, the first authors of the two papers, reconciled their findings and contextualized them with regard to recent developments in the field.
So, is “attention = explanation”? Putting things into perspective, both authors highlighted that attention mechanisms in LSTM networks can serve as a faithful explanation under certain conditions; there is no one-size-fits-all answer. However, evaluating faithfulness is difficult due to the lack of a ground truth.
But how useful is attention for explaining model predictions today? They highlighted that attention is no longer very useful for instance-level explanations, but that it still matters for understanding the mechanisms underlying general-purpose Transformers beyond specific models, datasets, and tasks; in this broader context, understanding attention remains important.
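As a rough illustration of what using attention as an instance-level explanation looks like in practice, the sketch below extracts attention weights from a small pre-trained Transformer with the Hugging Face transformers library and averages them over heads. The model choice and the head-averaging are my own simplifying assumptions, not a recipe endorsed by either paper, and, as the talk stressed, such weights should not be read as faithful explanations by default.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Small pre-trained encoder used purely for illustration.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

sentence = "The acting was superb but the plot dragged."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
last_layer = outputs.attentions[-1][0]   # (num_heads, seq_len, seq_len)
avg_heads = last_layer.mean(dim=0)       # (seq_len, seq_len)

# Attention from the [CLS] token to every input token, a common (and
# contested) proxy for how much each token mattered to the prediction.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, weight in zip(tokens, avg_heads[0]):
    print(f"{token:>12s}  {weight.item():.3f}")
```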
Can Machines Learn Morality?
In 2021, Liwei Jiang led a team of researchers in training an ML model, Delphi, to reason about ethical judgements. To train the model, they created a new dataset, the Commonsense Norm Bank, containing 1.7M examples of descriptive judgements on everyday situations. This research program was critiqued by Zeerak Talat and others in “On the Machine Learning of Ethical Judgments from Natural Language”. In 2023, Liwei and others created Value Kaleidoscope, a new dataset for modeling potentially conflicting human values involved in decision-making.
In their joint talk (slides), both discussed their research agendas and the challenges of teaching morality and ethics to AI models. They also engaged in a higher-level meta-discussion on disagreements in science, observing that science and conflict go hand in hand and that honesty and good-faith behavior are key to resolving such disagreements.
The Vision Thing
Finally, Raymond J. Mooney gave an excellent invited talk (slides) in which he discussed the importance of finding and pursuing your research passion. He reviewed his evolving research vision over the last 40+ years, which started with explanation-based learning and took him to bridging ML and NLP, ML for semantic parsing, and more recently grounded NLP, language and code, and language and 3D animation. It’s rare to get such a personal and inspired account of the motivations behind changes in a research vision from a luminary of the field.
For any aspiring researcher, this talk is a treasure chest full of useful and practical advice. I would highly recommend watching the recording (start: 0:07:00).
Final Takeaways
In summary, this was one of my favorite workshops that I’ve attended or organized. It’s a breath of fresh air when talks are more than just an oral recapitulation of a paper. At its best, research is a collaborative and sometimes argumentative conversation. Scientific publications are a culmination of this process but for various reasons[6] do not provide the full picture.
This workshop started as an experiment. We wanted to see whether we could turn scientific talks into something more akin to a debate that sheds light on a topic from different perspectives. Overall, the experiment was a success. The speakers did a stellar job presenting and contextualizing their research. The audience was engaged, speakers fielded a flood of questions after each talk, and we received a lot of positive feedback on the overall format. The speakers deserve special thanks: they put in extra effort, holding multiple meetings to prepare and sync with their counterparts. Some of them had never spoken to each other before, so the workshop also served as an opportunity to form new connections.
The Big Picture Workshop won’t return at an NLP conference this year, but we hope to bring it back in 2025. Feel free to reach out to Yanai Elazar with ideas and feedback. If you enjoyed the format, you are welcome to organize a similar workshop at another venue; all materials (including the proposal, task tracker, email templates, etc.) are available online. While it does take more effort to present a topic from different perspectives, we hope future workshops and presenters will consider taking a more debate-style approach to their talks.
Thank you to the sponsors of the workshop, Amazon, Google, and HuggingFace! Thanks to Yanai Elazar for feedback on this post.
[1] See the arXiv monthly submission stats.
[2] Yanai Elazar, Allyson Ettinger, Nora Kassner, Noah A. Smith, and I.
[3] The workshop was inspired by other workshops that seek to add more nuance to the research conversation, such as ML Retrospectives and the Workshop on Insights from Negative Results.
[4] The recording is available here. Approximate timestamps for the talks: Raymond J. Mooney (0:07:00); Sarah & Sarthak (1:53:00); Liwei & Zeerak (4:25:00); Sewon & Junyeob (6:55:00).
[5] These were accompanied by Medium posts by Yuval Pinter and Byron Wallace and more recently revisited in “Is Attention Explanation? An Introduction to the Debate”.
[6] Due to space or time limitations, the desire to create a convincing narrative, bias or lack of awareness on the part of the authors, etc.