NLP News - NLP for beginners, dialogue & sentence representations
There are many excellent newsletters out there related to ML (shout-outs in particular to Nathan Benaich's, Jack Clark's, and Denny Britz's). Natural Language Processing (NLP) is seeing increasing interest recently, but there is no resource dedicated to condensing NLP-related information -- besides the occasional Twitter conversation and your daily arXiv cs.CL digest (a quick Google search turned up that the most relevant newsletters pertain to the other NLP, ugh).
This is an experiment to gauge if there is demand for such a newsletter for NLP. Please let me know which parts you like and dislike and what you are missing.
Top NLP Resources for Beginners
It can feel daunting to try to get into NLP. Here is a list of some of the most helpful resources out there that will kick-start your learning:
Yoav Goldberg's Primer on Neural Network Models for Natural Language Processing, which provides an excellent survey of neural network methods for NLP.
Stanford CS224n: Deep Learning for Natural Language Processing, arguably the best online course to learn about state-of-the-art methods for natural language processing.
Speech and Language Processing, 2nd Edition, the authoritative book on NLP used in many college courses. A third edition is in progress.
NLP in-depth: Dialogue
Modeling dialogue is tricky. Dialogue agents are expected to strike a balance between communicating on a diverse range of topics, providing information, and accomplishing tasks in a wide range of environments.
Deep Learning for Dialogue Systems - 2017 tutorial
Yun-Nung (Vivian) Chen et al. provide a great tutorial on the state of the art in dialogue research and -- in particular -- highlight the difference between chit-chat dialogue systems and task-oriented dialogue agents.
The Problem(s) with Neural Chatbots
Ryan Lowe gives an excellent overview of the problems that plague state-of-the-art neural dialogue systems: (1) data; (2) model architecture; (3) evaluation; (4) the premise itself that learning from static datasets will allow us to learn the function of language and to ground it in observations.
Conference countdown
News from recent or upcoming conferences.
Annual Meeting of the Association for Computational Linguistics - ACL 2017, July 30-August 4, Vancouver — acl2017.org
Details of all ACL papers are out. Information about the sessions can be found here. Two picks: theory behind "man" + "royal" = "king", bilingual representations with (almost) no parallel data.
EMNLP author notification has been sent out. A list of accepted papers is not yet available, but some have already made it to arXiv (see below).
First meeting of the Society for Computation in Linguistics - SCiL, January 4-7 2018, Salt Lake City — blogs.umass.edu
Focus on computational and mathematical approaches in linguistics, with an all-star panel of invited speakers and organizers. Papers (8 pages) and abstracts (2 pages) are invited; the deadline is August 1.
Text as Data Conference, October 13-14, Princeton University
Abstracts are due on July 31.
Google Scholar Metrics revisited — medium.com
The 2017 edition of the Google Scholar Metrics rankings of NLP conferences has just been released. arXiv cs.CL tops the charts for the first time, followed by ACL and EMNLP. As Carlos shows, this changes if we normalize for venue size: then CL and TACL take the top ranks, followed by ACL.
Industry insights
Baidu acquires natural language startup Kitt.ai — techcrunch.com
Baidu makes an entrance into the chatbot market by acquiring Seattle-based startup Kitt.ai, which provides chatbot and natural language understanding (NLU) services across devices.
The state of the deep learning landscape
TensorFlow and Keras are leading the pack of deep learning toolkits, but NLP-focused DyNet sneaks into the charts. DyNet is developed by CMU among others and is particularly useful for dynamic computation graphs.
Textio raises $20 million Series B — techcrunch.com
Textio helps companies improve the language of their postings in order to attract a more qualified and diverse set of candidates.
Lingokids raises $4 million seed, partners with Oxford University Press — techcrunch.com
Madrid-based edtech startup Lingokids offers language lessons as interactive games in English and simplified Chinese for children aged 2-6.
Google funds a project to automate writing local news — www.recode.net
Google is awarding the Press Association and Urbs Media $805k to build software to automate the writing of 30k local stories a month.
Paper picks
Some of the most intriguing recent research articles.
[1705.02364] Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
FB researchers show that we can use the Stanford Natural Language Inference (SNLI) dataset to learn very good sentence representations. Related: Wieting & Gimpel (ACL, 2017) learn sentence representations from a large paraphrase database; Jernite et al. (2017) introduce new unsupervised objectives for learning sentence representations. What other tasks are helpful for inducing sentence representations?
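For intuition, here is a minimal sketch of the general recipe (the dimensions, pooling, and classifier are placeholder assumptions, not the paper's exact architecture): a shared encoder maps premise and hypothesis to fixed vectors, and a classifier predicts the NLI label from standard matching features; after training, the encoder doubles as a general-purpose sentence embedder.

```python
# Minimal sketch of NLI-based sentence encoding (hypothetical sizes,
# not the paper's exact architecture).
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, tokens):               # tokens: (batch, seq_len)
        states, _ = self.lstm(self.embed(tokens))
        return states.max(dim=1).values      # max-pool over time

class NLIClassifier(nn.Module):
    def __init__(self, encoder, hidden_dim=512, n_classes=3):
        super().__init__()
        self.encoder = encoder
        d = 2 * hidden_dim                   # bidirectional output size
        self.mlp = nn.Sequential(nn.Linear(4 * d, 512), nn.ReLU(),
                                 nn.Linear(512, n_classes))

    def forward(self, premise, hypothesis):
        # Combine premise u and hypothesis v with standard matching features.
        u, v = self.encoder(premise), self.encoder(hypothesis)
        feats = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)
        return self.mlp(feats)

# After training on SNLI, SentenceEncoder outputs can be reused as
# sentence representations for other tasks.
```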
[1707.01066] Zero-Shot Transfer Learning for Event Extraction
Computer vision has seen increasing interest in zero-shot transfer learning recently. As our training sets are finite, generalizing to unseen events, relations, entities, etc. is key. Huang et al. frame event extraction as grounding rather than classification, which allows them to generalize to new events. Related: Levy et al. (CoNLL, 2017), who generalize to unseen relations by framing relation extraction as reading comprehension.
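As a toy illustration of the grounding idea (a sketch of the general zero-shot recipe, not the authors' model): if mentions and event types live in a shared embedding space, classification reduces to nearest-neighbor search over type embeddings, so an unseen type only needs an embedding of its name or structure.

```python
# Zero-shot classification by grounding: pick the event type whose
# embedding is closest to the mention embedding. The embeddings
# themselves would come from a learned model; here they are givens.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def zero_shot_classify(mention_vec, type_vecs):
    """type_vecs: dict mapping event-type name -> embedding vector."""
    return max(type_vecs, key=lambda t: cosine(mention_vec, type_vecs[t]))
```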
[1707.01176] CharManteau: Character Embedding Models For Portmanteau Creation
One of my highlights of last year's EMNLP was the array of cool natural language generation applications. This year appears to be no different. Gangal et al. propose a noisy-channel character-level seq2seq model to generate portmanteaus, e.g. smog (smoke + fog) or Brexit (Britain + exit). Extra: a new dataset of 1,624 portmanteaus to play with.
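For intuition, here is a toy sketch of the noisy-channel setup; the scoring functions (channel_logp, lm_logp) are hypothetical placeholders, not the paper's implementation.

```python
# Enumerate prefix+suffix blends of two words, then rank them with a
# noisy-channel score: log p(inputs | blend) + log p(blend).

def candidate_blends(w1, w2):
    """All prefix(w1)+suffix(w2) combinations, e.g. 'smoke'+'fog' -> 'smog'."""
    return {w1[:i] + w2[j:] for i in range(1, len(w1))
                            for j in range(len(w2))}

def best_blend(w1, w2, channel_logp, lm_logp):
    # channel_logp(c, w1, w2): log p(w1, w2 | c), e.g. from a reversed
    # character-level seq2seq; lm_logp(c): log p(c) from a character LM.
    return max(candidate_blends(w1, w2),
               key=lambda c: channel_logp(c, w1, w2) + lm_logp(c))
```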
[1706.06551] Grounded Language Learning in a Simulated 3D World
Researchers from DeepMind propose an agent that learns to perform natural language commands (think: "pick the red object/hat/zebra next to the green object") in a simulated environment. The key to learning is a set of unsupervised auxiliary objectives for frame and language prediction. Related: a gated-attention model from CMU; other related research from OpenAI, Lazaridou et al. (ICLR, 2017), and others starts from multi-agent dialogue and shows that natural language may or may not develop naturally.
[1706.09733] Stronger Baselines for Trustable Results in Neural Machine Translation
Strong baselines are one of the most important prerequisites for conducting reliable research. Many new methods for NMT, however, only compare against vanilla implementations. Denkowski & Neubig propose three baselines that are easy to implement and yield significant gains over regular baselines: (1) using Adam with multiple restarts and learning rate annealing; (2) sub-word translation via byte pair encoding; (3) decoding with ensembles of independently trained models.
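As a rough sketch of recipe (1) -- learning rate annealing with optimizer restarts -- under assumed hyperparameters and interfaces (train_iter, dev_perplexity, and model(batch) returning a loss are all hypothetical):

```python
# Train with Adam; when dev perplexity stops improving, halve the
# learning rate and restart Adam (fresh moment estimates) from the
# best checkpoint so far.
import torch

def train(model, train_iter, dev_perplexity, lr=1e-3, max_anneals=5):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_ppl = float("inf")
    best_state = {k: v.clone() for k, v in model.state_dict().items()}
    anneals = 0
    while anneals <= max_anneals:
        for batch in train_iter():           # one pass over the training data
            optimizer.zero_grad()
            model(batch).backward()          # assumed: forward returns the loss
            optimizer.step()
        ppl = dev_perplexity(model)
        if ppl < best_ppl:
            best_ppl = ppl
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            model.load_state_dict(best_state)
            lr /= 2
            optimizer = torch.optim.Adam(model.parameters(), lr=lr)
            anneals += 1
    model.load_state_dict(best_state)
    return model
```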
Dataset spotlight
In this section, I will introduce one new/exciting dataset.
W-NUT dataset on Emerging and Rare Entity Recognition
Named Entity Recognition (NER) systems are very good at predicting frequent entities. In social media and newswire, however, new entities emerge all the time. This dataset from the shared task of the 3rd Workshop on Noisy User-generated Text (W-NUT) at EMNLP 2017 focuses on exactly this challenging scenario of predicting emerging and rare entities.
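If you want to play with the data, here is a simple reader, assuming the usual CoNLL-style two-column format (token, tab, BIO tag, with blank lines separating sentences; check the shared task page for the exact format):

```python
# Read CoNLL-style NER data into (tokens, tags) pairs per sentence.
def read_conll(path):
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                     # blank line ends a sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
            else:
                token, tag = line.split("\t")
                tokens.append(token)
                tags.append(tag)
    if tokens:                               # handle missing trailing blank line
        sentences.append((tokens, tags))
    return sentences
```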
Twitter highlight
Finally, I will highlight one informative Twitter conversation (powered by Treeverse) or inspiring tweet.
Doing an NLP PhD at a US vs. EU institution
A lively Twitter discussion about the benefits and trade-offs of pursuing a PhD at a top US vs. top European institution. Related: question on Quora.