For this issue, we will look at several areas that optimize different aspects of the training data. We will then talk briefly about how ML & NLP can be used in esports. We’ll have some Deep Learning-related reports from industry and paper highlights on VQA, NMT, and bias detection. Finally, we will talk about a cool new dataset of annotated song lyrics.
NLP in-depth: Data selection
It is common wisdom that the nature of the training data is at least as important as the choice of the model. Different areas deal with particular aspects of the training data. While many of these areas have been heuristics-based, a common thread of recent work is that data selection policies are increasingly learned:
Large amounts of unlabelled data are often available, but annotating all examples is prohibitively expensive. Active learning can be used to interactively obtain annotations from human experts for the most informative unlabelled examples during training. Recent work reframes active learning as RL (Fang et al., EMNLP 2017).
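To make the selection step concrete, here is a minimal sketch of pool-based uncertainty sampling, the classic active learning baseline. The `uncertainty_sample` helper and the toy probabilities are my own illustration, not taken from the paper:

```python
import numpy as np

def uncertainty_sample(probs, k):
    """Pick the k pool examples whose predicted class probabilities
    are closest to 0.5, i.e. where the model is least certain."""
    uncertainty = 1.0 - np.abs(probs - 0.5) * 2  # 1 = maximally uncertain
    return np.argsort(-uncertainty)[:k]

# toy pool: the model's confidence on 6 unlabelled examples
pool_probs = np.array([0.95, 0.51, 0.10, 0.48, 0.88, 0.30])
to_label = uncertainty_sample(pool_probs, k=2)
print(sorted(to_label.tolist()))  # → [1, 3], the two examples nearest the boundary
```

A learned policy (as in the RL reframing above) would replace this fixed heuristic with a selection strategy trained end-to-end.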
For online algorithms, not only the choice of training data but also the order in which it is presented to the model matters. Curriculum learning (Bengio et al., ICML 2009) orders the training data to maximize the model’s performance. One application where this makes a difference is learning word embeddings (Tsvetkov et al., ACL 2016).
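The ordering idea can be sketched in a few lines, here using sentence length as a stand-in difficulty measure. The helper and toy corpus are illustrative; real curricula (including the learned ones above) use much richer difficulty signals:

```python
def curriculum_order(corpus, difficulty):
    """Order training examples from easy to hard according to
    a user-supplied difficulty function."""
    return sorted(corpus, key=difficulty)

corpus = [
    "a long and fairly involved sentence",
    "short one",
    "medium length sentence here",
]
# length in words as a crude proxy for difficulty
ordered = curriculum_order(corpus, difficulty=lambda s: len(s.split()))
```

The model would then see `ordered` in sequence, starting from the shortest sentence.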
A major highlight of the past week has been the defeat of pro gamers in the multiplayer online battle arena game Dota 2 1v1 by a bot created by OpenAI. Reddit coverage can be found here and here. Denny Britz provides some perspective on the hype here. While the milestone is impressive, many of the same techniques have already been applied to beat pros in other esports, e.g. SSBM (Firoiu et al., 2017). The main takeaway: self-play, i.e. pitting your model against itself, is a powerful catalyst. It will be interesting to see what other applications we can find for self-play; a related approach is dual learning, e.g. for NMT (He et al., NIPS 2016). What else can we do with esports? How about automatically predicting video highlights using audience chat reactions (Fu et al., EMNLP 2017)?
Visual question answering (VQA) is one of those tasks that seem almost miraculous when models perform well at it. Past work on VQA, however, has shown that even simple models (Zhou et al., 2015; Jabri et al., ECCV 2016) perform surprisingly well and can even outperform more complex ones. In line with this trend, since no single model is currently able to both ground concepts and answer questions about images, the current state of the art is a thoughtful collection of tips and tricks.
I am generally a fan of papers that explicitly encode linguistic insights into neural models. Learning the initial state has already been recommended by Hinton. Intuitively, the initial state should provide a good starting position for our model’s predictions. For NMT, this implies that the initial hidden state should already contain the information needed to predict the words of the target sentence. Weng et al. achieve this by explicitly training the hidden state to predict those words. A simple method with convincing results.
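Roughly, the idea can be sketched as an auxiliary bag-of-words loss on the initial decoder state: project the state onto the vocabulary and penalize low probability for every word that appears in the target sentence, ignoring order. All names and the tiny dimensions below are illustrative, not the paper's actual setup:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def word_prediction_loss(init_state, W_out, target_ids):
    """Cross-entropy of the initial state's vocabulary distribution
    against the bag of target-sentence words (order ignored)."""
    probs = softmax(W_out @ init_state)      # distribution over the vocab
    return -np.mean(np.log(probs[target_ids]))

rng = np.random.default_rng(0)
vocab_size, hidden = 10, 4
W_out = rng.normal(size=(vocab_size, hidden))  # hypothetical output projection
h0 = rng.normal(size=hidden)                   # hypothetical initial decoder state
loss = word_prediction_loss(h0, W_out, target_ids=[2, 5, 7])
```

In training, this loss would be added to the usual translation objective so that gradients push the initial state toward encoding the target words.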
Bias is a problem that will become more pressing as ML & NLP models in production become more ubiquitous and reach more people. Models inherit the bias of the data they were trained on. We already know that all data, even seemingly objective news articles (Bolukbasi et al., 2016), contain significant bias. In this EMNLP 2017 best paper, Zhao et al. propose injecting corpus-level constraints to reduce bias in structured prediction models.
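The bias-amplification effect they target can be sketched with toy co-occurrence counts: compare how skewed the model's predictions are against how skewed the training data already was. The numbers below loosely echo the paper's "cooking" example but are purely illustrative:

```python
def bias_score(counts, group="woman"):
    """Fraction of (activity, agent) pairs whose agent belongs
    to the given group, e.g. P(agent = woman | activity = cooking)."""
    return counts[group] / sum(counts.values())

# hypothetical counts for the activity "cooking"
train = {"woman": 66, "man": 34}  # co-occurrences in the training set
pred = {"woman": 84, "man": 16}   # model predictions on test images

# positive amplification: the model exaggerates the training bias
amplification = bias_score(pred) - bias_score(train)
```

Corpus-level constraints of the kind Zhao et al. propose would cap the predicted ratio so that it stays close to the training-set ratio, driving `amplification` toward zero.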
How often do you find yourself listening to a song and pondering the intent or meaning behind a particular set of words? Ponder no more! Automated lyric annotation (ALA) is the latest task that will soon be conquered by the machines (arguably later rather than sooner as BLEU scores are still quite low). Sterckx et al. introduce this task and provide a dataset and baselines to go along with it. Break it down!