ICLR 2021 Outstanding Papers, Char Wars, Speech-first NLP, Virtual conference ideas
I hope you've had the chance to take some time off and relax recently. Here in the UK, things are starting to look up: the weather is getting warmer and the lockdown is easing slowly. I'm looking forward to meeting friends in person again and getting the chance to travel.
Reading some of the outstanding papers at ICLR 2021, which cover topics from query answering to physics simulations, I can't help but be amazed by the overall progress in the field. I'm excited to see where this will take us. We don't need to look to a galaxy far, far away in order to witness an important dispute, however: the Char Wars, the fight between the hegemonic Subword Order and rebelling character models.
Another topic that I'm excited about is speech. This newsletter looks at the role and importance of speech for NLP. Finally, with conferences this year generally staying virtual, I thought it'd be fun to think about how we can make them more social and interactive. Gather is nice but still doesn't come close to interacting with someone in person. So this newsletter includes some ideas for virtual social activities to engage with others around topics of common interest (focusing on ML and NLP).
I really appreciate your feedback, so let me know what you love ❤️ and hate 💔 about this edition. Simply hit reply on the issue.
ICLR 2021 Outstanding Papers🎖
Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1/n Parameters: A new way to use hypercomplex multiplication to parameterize models, which can lead to large parameter savings at similar performance. The authors use it to make LSTM and Transformer models more efficient and apply them to NLP tasks such as NLI, MT, and text style transfer.
Complex Query Answering with Neural Link Predictors: A framework to answer complex queries (involving logical operators such as conjunctions, disjunctions, etc.) on knowledge graphs. Erik, the first author, is a Master's student. Way to go!
EigenGame: PCA as a Nash Equilibrium: An interpretation of principal component analysis as a competitive game, which is highly parallelizable. This may have an impact on real-world applications where speeding up PCA is beneficial.
Learning Mesh-Based Simulation with Graph Networks: A framework for learning mesh-based simulations of various physical systems (see below) using graph networks. The simulations of even complex systems look great (see the videos).
Neural Synthesis of Binaural Speech from Mono Audio: A method to generate two-channel binaural audio (which is typically recorded with two mics and creates the illusion that the listener is in the same room with the speaker) from single-channel audio. This is important for virtual reality environments so it is no coincidence that the paper is written by researchers at Facebook Reality Labs.
Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime: An analysis of how fast averaged SGD with over-parameterized two-layer neural networks converges. This follows recent studies on the NTK and further develops our understanding of how neural networks and kernel methods relate to each other.
Rethinking Architecture Selection in Differentiable NAS: A new way to perform neural architecture search (NAS)—specifically, how to select the final architecture. NAS is a mainstay of ICLR and has become a lot more efficient since it was first proposed. This method improves the performance of state-of-the-art methods such as DARTS (Liu et al., ICLR 2019).
Score-Based Generative Modeling through Stochastic Differential Equations: A score-based generative framework built on an SDE that gradually maps the data distribution to a noise distribution. By reversing the SDE, new data can be generated from noise. The method achieves high-resolution, high-fidelity generations and is the first score-based method that is competitive with standard GAN-based approaches.
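For the curious, the "reversing the SDE" idea in the last paper can be written down in two lines. This is a sketch of the general form only; the symbols (drift f, diffusion coefficient g, Wiener processes w and w̄, learned score model s_θ) are standard notation chosen here, not taken from this newsletter.

```latex
% Forward SDE: gradually perturbs data x(0) ~ p_data into noise x(T)
\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}

% Reverse-time SDE: runs from t = T back to t = 0 and turns noise into data
\mathrm{d}\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}

% The unknown score is approximated by a neural network
\mathbf{s}_\theta(\mathbf{x}, t) \approx \nabla_{\mathbf{x}} \log p_t(\mathbf{x})
```

The only unknown in the reverse-time equation is the score ∇ log p_t(x), which is why learning a good score model is enough to generate samples.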
Char Wars 🛸
The Subword Menace (inspired by a tweet by Sasha Rush)
Not long ago on a preprint server near you...
It is a period of byte-based battles.
Pure character-level models, proposed
in a recent paper, have won
their first victory against
the biased Subword Order.
Subword tokenization. Besides the Transformer architecture, the other hegemonic feature of state-of-the-art models in NLP is subword tokenization. Subword tokenization (like any type of tokenization, really) makes assumptions that are more suitable for some types of data than others. Specifically, it splits strings into units based on how frequently those units occur in the training corpus. While this works well on standard English text, models using subword tokenization struggle with noise, both natural (typos, spelling variations in social media, etc.; Sun et al., 2020) and synthetic (adversarial examples; Pruthi et al., 2019).
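To make the noise problem concrete, here is a minimal sketch of a greedy longest-match tokenizer. It is a simplification in the spirit of subword tokenizers, not the algorithm of any particular library: the vocabulary `TOY_VOCAB` and the function name are made up for illustration, whereas real tokenizers learn their vocabulary from corpus frequencies.

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary: frequent strings plus a few single characters.
TOY_VOCAB = {"language", "lang", "age", "model", "a", "u", "g", "e", "l", "n"}

clean = greedy_subword_tokenize("language", TOY_VOCAB)
noisy = greedy_subword_tokenize("langauge", TOY_VOCAB)  # two letters swapped
print(clean)
print(noisy)
```

A single swapped pair of letters turns one familiar token into a string of fragments that the model has rarely seen in this combination.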
Non-concatenative morphology. Subword tokenization is also notoriously bad at modelling words that don't consist of morphemes strung together sequentially—which is known as non-concatenative morphology. It can be seen occasionally in English with the plural of irregular nouns, such as foot → feet, but is much more common in other languages such as Hebrew and Arabic.
Improving subword tokenization. One way to deal with these challenges is to make subword tokenization more robust. Subword regularization (Kudo, 2018) achieves this by sampling different segmentations for the input; it can be seen as dropout over segmentations. In a recent NAACL 2021 paper, we combine this with a consistency regularization objective (inspired by ideas from semi-supervised learning) to make pre-trained multilingual models more robust. The nice thing is that this multi-view subword regularization can be applied only during fine-tuning and improves performance consistently when transferring to other languages.
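As a rough illustration of sampling segmentations, the sketch below enumerates every way to split a word into vocabulary pieces and samples one in proportion to its (smoothed) probability. It is a toy under stated assumptions: the piece probabilities in `TOY_PROBS`, the function names, and the brute-force enumeration are all invented for this example and are not how production tokenizers implement it.

```python
import random


def all_segmentations(word, vocab):
    """Return every way to split `word` into pieces that are in `vocab`."""
    if not word:
        return [[]]
    result = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in all_segmentations(word[end:], vocab):
                result.append([piece] + rest)
    return result


def sample_segmentation(word, piece_probs, alpha=0.5, rng=random):
    """Sample one segmentation with weight (product of piece probabilities) ** alpha.

    A smaller `alpha` flattens the distribution, so unusual splits show up more often.
    """
    candidates = all_segmentations(word, piece_probs)
    weights = []
    for segmentation in candidates:
        score = 1.0
        for piece in segmentation:
            score *= piece_probs[piece]
        weights.append(score ** alpha)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical unigram probabilities for a handful of pieces.
TOY_PROBS = {"unrelated": 0.01, "un": 0.1, "related": 0.1,
             "rel": 0.05, "ated": 0.05, "u": 0.02, "n": 0.02}

rng = random.Random(0)
samples = [sample_segmentation("unrelated", TOY_PROBS, rng=rng) for _ in range(5)]
for segmentation in samples:
    print(segmentation)
```

During training, the model would see a different split of the same word on different passes, which is what makes the comparison to dropout apt.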
Character-based models. Pure character-based models have generally underperformed their word-level counterparts. Instead, models typically obtain a character-aware representation (Kim et al., 2016) using a CNN over the characters of a word, which has also been used in ELMo (Peters et al., 2018). While this method has been applied to BERT (El Boukkouri et al., 2020), it is generally less efficient than, and outperformed by, subword tokenization-based Transformers. Recently, character-aware and subword-based information has also been combined, improving robustness to spelling errors (Ma et al., 2020).
CANINE 🐶. CANINE (Clark et al., 2021) is a recent Transformer model that follows in the tradition of pure character-based models by being tokenization-free: it directly consumes a sequence of characters as the input. It is more efficient than other character-level models thanks to a clever combination of down- and up-sampling (see above): a Transformer with local self-attention produces contextualized character embeddings, which are then down-sampled via strided convolutions; a standard deep Transformer (as in BERT) is then applied to this shorter sequence; finally, the representations of the two Transformers are concatenated and up-sampled. For pre-training, character spans (chosen based on whitespace boundaries) are randomly masked and predicted. The model outperforms mBERT on the multilingual open-domain question answering dataset TyDi QA.
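The down- and up-sampling step is easiest to follow by tracking sequence lengths. The sketch below is not the CANINE implementation: mean pooling stands in for the strided convolution, the two Transformers are left out entirely, and the rate of 4, the toy 2-d "embeddings", and the function names are assumptions made for illustration.

```python
def downsample(char_vectors, rate):
    """Average each block of `rate` character vectors (stand-in for a strided convolution)."""
    pooled = []
    for start in range(0, len(char_vectors), rate):
        block = char_vectors[start:start + rate]
        dim = len(block[0])
        pooled.append([sum(vec[i] for vec in block) / len(block) for i in range(dim)])
    return pooled


def upsample_and_concat(char_vectors, pooled, rate):
    """Repeat each pooled vector `rate` times and concatenate it with the character vectors."""
    return [char_vec + pooled[pos // rate] for pos, char_vec in enumerate(char_vectors)]


text = "char wars!!!"                                  # 12 characters
char_vectors = [[float(ord(c)), 1.0] for c in text]    # toy 2-d character "embeddings"

pooled = downsample(char_vectors, rate=4)              # the deep Transformer would run here
restored = upsample_and_concat(char_vectors, pooled, rate=4)

print(len(char_vectors), len(pooled), len(restored), len(restored[0]))
```

The expensive deep Transformer only ever sees the short sequence, while the final output is back at one vector per character, each carrying both local and down-sampled information.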
CANINE is a step beyond subword tokenization and towards models that are more flexible and better suited to handle variations in the input data. Such models hold promise not only for other languages but may also enable models to generalize better to new words and language change (see the last newsletter). However, subword segmentation may still stay the standard due to its simplicity and ease of use; so it remains to be seen who will ultimately win this war...
Pursued by the Subword's
sinister sentence pieces,
research on new segmentation
methods races forward, in search
of the custodian of the
inductive biases that can free
our models from the burdens
of their tokenization and
restore the rightful segmentation
to the world's languages...
Speech-first NLP 🎤
NLP + Speech. Does natural language processing (NLP) include speech? By definition, natural language can take various forms. In practice, most current work in the NLP community (here: work published at *ACL conferences) focuses on processing written language rather than the sound waves of a person's speech. There are some notable exceptions such as work on spoken language understanding (focusing on processing speech transcripts) and the use of phoneme representations for cross-lingual transfer (Tsvetkov et al., 2016; Peters et al., 2017). More recently, a paper focused on vocalizing silent speech won the Best Paper award at EMNLP 2020 while models such as the Transformer that are popular in NLP are also used to process speech (Baevski et al., 2020). However, by and large, most work that focuses on speech is published in distinct conferences such as ICASSP, Interspeech, and others.
Voice as an interface. At the same time, the importance of voice as an interface is growing. As more people are going online in countries where using a keyboard is not the default way to interact with a device, voice will be the preferred mode of interaction. For ML, this means voice data will become more important and easier to obtain in real-world applications.
Unwritten languages. Furthermore, many languages do not have a strong written tradition: they may not have an orthography or not use one consistently. For such languages, systems that cut out the middleman and directly process speech without mapping it to text may be the only way to serve the needs of the speakers (Scharenborg et al., 2019). A recent line of such work focuses on end-to-end speech translation (Jia et al., 2019).
Speech-based SLU systems. As AI is applied to more challenging and more real-world settings, it will become more common to combine speech and spoken language understanding. For instance, similar to how neural models have initially been applied to textual features and then to raw text, spoken language understanding systems will become more powerful once they receive raw audio data as additional context, which will allow them to model prosody, among other things. Prosody provides information regarding the emotional state of the speaker, their use of irony or sarcasm, etc. and is thus important for affective computing systems. For a deep dive on how prosody can be incorporated into spoken language understanding models, check out Trang Tran's PhD thesis from 2020.
Multi-modal representation learning. Bridging different modalities will also become more important for learning models that can be applied to the real world, which is inherently multi-modal. For example, for analyzing videos, incorporating information from video, audio, and speech data is important (Alayrac et al., 2020).
Speech-to-text in 60 languages. An exciting recent development in this line of work is the integration of the Wav2Vec 2.0 model (Baevski et al., 2020) into the 🤗 Transformers library. In addition, Hugging Face organized a community project to fine-tune the cross-lingual speech-to-text XLSR model (Conneau et al., 2020) in 60 languages on Mozilla's open-source Common Voice dataset. Given the ubiquity of 🤗 Transformers, these models have the potential to be the backbone of work bridging text-based NLP and speech in the future.
Beyond text and speech. While this section mainly focused on bridging written and spoken language, other modalities are equally relevant. For instance, language can be expressed visually as sign language (have a look at this amazing overview by Amit Moryossef) or via extralinguistic information. Language can also be tactile and researchers have even used olfactory signals to ground semantics (Kiela et al., 2015).
Virtual conference ideas 🎭
The conference season is starting again. After last year, this will be the second cycle of purely virtual conferences. What I miss most is the social aspect of conferences: the serendipitous hallway encounters, the thrill of meeting someone whose work you are intimately familiar with for the first time, and the unplanned things that happen between the planned conference activities. In this spirit, here are some research-themed social activities that could be fun for the upcoming conferences.
PowerPoint karaoke blends the improvisational nature of karaoke and the clarity and energy of great presentations; the output is something that is—when executed well—equal parts hilarious, entertaining, and thought-provoking. In practice, a presenter typically presents a set of slides that they've never seen before, often paired with a random topic.
I did this once at a non-ML-related summer school and it was a lot of fun. It works best if slides are visual (so they can be interpreted in different ways), nonsensical when put together, and topics are creative. Here are some example slides and notes for inspiration. The main challenge, as you can imagine, is to get slides that are related to ML/NLP and are the right combination of creative and nonsensical. Luckily, producing text that is both creative and nonsensical is something large language models excel at. Here are potential steps to create an initial version of an ML/NLP-themed PowerPoint karaoke slide deck:
Use a large language model of your choice such as GPT-3 to automatically generate creative presentation titles.
Automatically download papers' slides from the ACL anthology (if possible) or otherwise obtain a large number of slides from ML presentations.
Train a model to automatically identify the most interesting or visual slides. As this is a challenging problem in its own right (e.g. see Talebi & Milanfar, 2018), alternatively download relevant images online (memes are welcome!) and pair them with generated text conditioned on the paper title (with different images for the motivation, experiments, and conclusion).
Open-source slides and paper titles.
Besides the fun aspect of listening to others improvise on a random presentation, it's a good way to practice public speaking and giving compelling presentations with little preparation—a skill that is crucial to master for any time-strapped researcher.
ML and NLP-themed games
These are classic games with an ML or NLP twist. Microsoft Research has a good track record of creating these: one of my favourite swag items is an ML-themed Top Trumps game from NeurIPS 2016 (see above); they also created an ML-themed Cards Against Humanity for NeurIPS 2019 (see below). There are versions of other popular games such as Pictionary, City Country River, Scattergories, etc. available online that could be modified. I'd love to see open-source versions of some of these games that can be easily played during a lull or break at a virtual conference.
Virtual whiteboarding sessions
On a less playful but equally interactive note, brainstorming together on a whiteboard is one of the most stimulating and productive processes for ideation. While whiteboards are hard to come by at physical conferences, that's not the case for virtual ones. At a recent ELLIS NLP workshop, we used Jamboard for brainstorming during a break-out session in a group of around 10 people. The activity highlighted useful connections between topic areas and interesting directions (you can see part of the output of the session below) and might lead to collaborations down the line. Overall, it was a fun experience, so I hope to see whiteboarding sessions, whether scheduled or impromptu, at this year's virtual conferences.