NLP News - Flies smell word vectors, MarianNMT, Distributed Learning with keras, NLP workshops, NLP for net neutrality, Fairness measures, Opinions on Reproducibility, How to conduct ML research, FigureQA
This edition of the NLP Newsletter answers the following burning questions: Can flies smell word vectors? ✅ What is a fast NMT library? ✅ How can I train on multiple GPUs with keras? ✅ Which NLP workshops should I attend next year? ✅ How many anti-net neutrality comments have been faked? ✅ How can I measure fairness? ✅ What does reproducibility in ML mean? ✅ How should I conduct ML research? ✅ What is a novel way to deal with large output spaces with the softmax? ✅ What is a cool new dataset for visual reasoning that I can use? ✅
The authors of this Science paper show that a fly categorizes smells by k-hot encoding odour in (normalized) 50-dimensional smell space, expanding (!) this into a 2k-dimensional latent space, and then finding nearest neighbours. They also simulate a fly smelling word vectors and MNIST pixels. Credit: Tom White (@dribnet)
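The fly's strategy amounts to a locality-sensitive hash: a sparse random projection expands the normalized input into a much higher-dimensional space, and a winner-take-all step keeps only the top activations as a k-hot tag. A minimal NumPy sketch of the idea (the dimensions, sparsity, and k below are illustrative assumptions, not the paper's exact values):

```python
import numpy as np

def fly_hash(x, expansion_dim=2000, k=40, seed=0):
    """Fly-inspired LSH: sparse random expansion + winner-take-all.

    x: 1-D input vector (e.g. a 50-d odour or word vector).
    Returns a k-hot binary tag in the expanded space.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # Sparse binary projection: each expansion unit samples ~10% of the inputs.
    proj = (rng.random((expansion_dim, d)) < 0.1).astype(float)
    x = x - x.mean()  # normalization, analogous to the fly's input layer
    activations = proj @ x
    # Winner-take-all: keep only the k most strongly activated units.
    tag = np.zeros(expansion_dim)
    tag[np.argsort(activations)[-k:]] = 1.0
    return tag
```

Similar inputs tend to receive overlapping tags, so nearest neighbours can then be found cheaply by comparing sparse tags instead of dense vectors.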
Fast Neural Machine Translation in C++ with minimal dependencies, built by the creators of Moses, with support for multi-GPU training and translation and for different model types, and compatible with Nematus.
A standardized set of metrics developed by SwiftKey (acquired by Microsoft) for evaluating predictive language models that can be used with any dataset.
The schedule for the inaugural meeting of the Society for Computation in Linguistics (SCIL), a conference that seeks to emphasize the Computational Linguistics part of Natural Language Processing. Have a look at the abstracts to get a sense for some more linguistics-focused ideas.
This article explores the forces that influence the complexity of the language we speak and write and tries to answer why English often has complex sentences, while many other languages have little use for them.
The 2018 conference season is getting started; while not all conference websites for 2018 are available yet, some workshops have already been announced. If you’re interested in which conference will host your favourite workshop or want to check out a new one, I’ve linked those that I found below:
Bots and spam campaigns are used to influence online discourse. NLP can provide us with the means to identify them. In one of the most insightful and remarkable case studies, Jeff Kao uses NLP techniques to analyze net neutrality comments submitted to the FCC from April-October 2017. He finds that more than 1M pro-repeal net neutrality comments were likely faked and that it’s highly likely that more than 99% of the truly unique comments were in favor of keeping net neutrality.
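Detecting such campaigns typically starts with finding near-duplicate text at scale. As a toy first pass (deliberately much cruder than the clustering pipeline used in the article), one can group comments that become identical after aggressive normalization:

```python
import re
from collections import Counter

def near_duplicate_groups(comments):
    """Group comments that are identical after crude normalization.

    Lowercases and strips punctuation, then counts exact matches.
    Returns only groups that occur more than once (likely duplicates).
    """
    def normalize(text):
        return re.sub(r"[^a-z ]", "", text.lower()).strip()

    counts = Counter(normalize(c) for c in comments)
    return {text: n for text, n in counts.items() if n > 1}
```

Real campaigns vary their wording (e.g. by swapping synonyms), which is why the article's analysis relies on semantic clustering rather than exact matching.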
When Twitter does not rid its platform of bots, others have to do the work. This article describes how two Berkeley students created an app that identifies bots on Twitter.
For identifying bots, often only basic analysis is necessary. This article details how BuzzFeed News identified 45 suspect accounts by analyzing their retweets of and interactions with other suspect accounts in Brexit-related tweets. It shows that we need better ways to automatically identify such accounts.
MSR’s Eric Horvitz reflects on gaining a meaningful understanding of automated decision-making at the Berkeley Center for Law & Technology Privacy Law Forum.
A website that provides fairness benchmarking tools for ML. It has pointers to a series of datasets
corresponding to various fields and applications (e.g., finance, law, and human resources) and code implementing a series of measures introduced in the literature to analyze and quantify discrimination.
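One of the simplest measures in that literature is statistical (demographic) parity: do two groups receive positive outcomes at the same rate? A minimal sketch of the standard definition (the function name and interface are mine, not the site's API):

```python
import numpy as np

def statistical_parity_difference(y_pred, group):
    """Difference in positive-prediction rates between two groups.

    y_pred: binary predictions (0/1); group: binary group membership (0/1).
    A value of 0 means both groups receive positive outcomes at equal rates;
    larger absolute values indicate greater disparity.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    group = np.asarray(group)
    rate_a = y_pred[group == 0].mean()
    rate_b = y_pred[group == 1].mean()
    return rate_a - rate_b
```

Other measures in the literature condition on the true label (e.g. equalized odds), which this simple rate comparison does not capture.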
Researchers at the University of Washington develop methods to measure the sometimes subtle biases in how men and women are portrayed on the big screen – to increase our understanding of how language shapes our perception of gender roles. A demo is available here.
Hugo Larochelle’s talk at the ICML 2017 Reproducibility Workshop that provides thoughts on what reproducibility in ML should mean and what we should aim for as a community.
An opinion piece by Zachary Lipton, to be presented at the Interpretable ML Symposium at NIPS 2017, that calls for giving real-world problems and their respective stakeholders greater consideration.
Stephen Merity introduces one of the most recent innovations in language modeling and describes what major flaw in the traditional softmax it addresses, both theoretically and experimentally.
The finalists for the 2017 Alexa Prize are from the Czech Technical University in Prague, the University of Washington, and Heriot-Watt University in Edinburgh. The teams outperformed their competitors in real-life conversations with Alexa users from July 1 to August 15.
Radim Rehurek talks about the “Mummy Effect”, a source of frustration that should feel familiar to every ML/NLP practitioner: a research paper looks amazing, outperforms the state-of-the-art, and might even have a big lab behind it. You implement it and discover all the corners the authors cut and all the carefully omitted assumptions. The whole elaborate thing crumbles to dust upon touch – like an Egyptian mummy.
Rothe et al. propose an interesting, cognitive science-inspired take on question asking. Their approach treats questions as formal programs that output an answer depending on the state of the world. In particular, they find that producing questions that are both informative and complex is important in learning to ask human-like questions.
Chen proposes a novel RNN variant that, instead of using complicated gated interactions like the classic LSTM, employs a small network to embed the input into a latent space. It then requires only one gate to achieve performance comparable to other RNN variants, with improved interpretability and trainability.
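Based on that description, one step of such a single-gate RNN can be sketched as: embed the input with a small network, then use one update gate to interpolate between the previous hidden state and the embedded input. This is a sketch of the general idea, not a verified reproduction of the paper's equations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def minimal_rnn_step(h_prev, x, params):
    """One step of a single-gate RNN.

    The input is first embedded by a small network (here a single tanh
    layer), then one update gate u interpolates between the previous
    state and the embedded input.
    """
    Wz, Uh, Uz, b = params["Wz"], params["Uh"], params["Uz"], params["b"]
    z = np.tanh(Wz @ x)                    # embed input into latent space
    u = sigmoid(Uh @ h_prev + Uz @ z + b)  # the single gate
    return u * h_prev + (1.0 - u) * z      # gated interpolation
```

Because the state update is a convex combination of the old state and the embedded input, the gate values directly show how much of the past each unit retains, which is one source of the claimed interpretability.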
Fu et al. propose two methods for style transfer in text. The models are inspired by their counterparts in computer vision and try to learn separate representations for content and style using (you might have guessed) adversarial networks.
Maluuba introduces FigureQA, a new dataset composed of figure images – like bar graphs, line plots, and pie charts – and question-answer pairs about them.
The Self-dialogue Corpus: a collection of 24,165 self-dialogues spanning 23 topics – including music, movies, and sports – collected by the Edina dialogue team as part of the Alexa Prize competition.