NLP News - GAN Playground, 2 Big ML Challenges, Pytorch NLP models, Linguistics in *ACL, mixup, Feature Visualization, Fidelity-weighted Learning
The 10th edition of the NLP Newsletter contains the following highlights: Training your GAN in the browser? ✅ Solutions for the two major challenges in Machine Learning? ✅ Pytorch implementations of various NLP models? ✅ Blog posts on the role of linguistics in *ACL? ✅ Pros and cons of mixup, a recent data augmentation method? ✅ An overview of how to visualize features in neural networks? ✅ Fidelity-weighted learning, a new semi-supervised learning technique? ✅
Fun and games
Speech recognition has tremendously improved over recent years. Similarly, we expect virtual assistants like Google Home or Alexa to be able to fulfil simple requests. This reddit thread demonstrates that even the simple query of "Tell me the temperature inside" can still be hilariously misinterpreted by the current generation of assistants.
Generative Adversarial Networks (GANs) are all the rage these days, but understanding and training them remains challenging. This GAN Playground lets you play around with Generative Adversarial Networks right in your browser. It allows you to build up your discriminator and generator layer-by-layer and observe the network in real-time as the generator produces more and more realistic images or gets stuck in a failure mode.
Are you looking for the ultimate challenge in applying NLP to video games? Why not then choose not just any game genre, but the mother of all RPGs: text-based adventures? Filip Hracek shows us in this blog post what makes a text-based adventure compelling and how Natural Language Generation may be used to create a modern text-based adventure game.
Chatbots can be annoying, unintelligible, and can take up a lot of your time. New Zealand cybersecurity company Netsafe decided to use these normally unwelcome aspects to their advantage and created a chatbot that is designed to help turn the tables on email scammers.
Presentations and slides
Leon Bottou describes two big challenges in ML these days and how we can resolve them: 1) The disruption of established software engineering practices by ML; and 2) the reliance on a single experimental paradigm, i.e. fitting a training set and evaluating on a test set.
Resources, implementations, and tools
Bottery is a conversational agent prototyping platform by Kate Compton. It contains a syntax, editor, and simulator for prototyping generative contextual conversations modeled as finite state machines. It is for everyone -- from designers to writers to coders -- who wants to write simple and engaging contextual conversational agents and test them out in a realistic interactive simulation.
There are so many submissions at ICLR 2018 this year that it's hard to find papers that are relevant to your area. Stephen Merity created this useful tool that allows you to search all submissions and retrieve only those that interest you.
Stanford University's CS224n course is one of the best resources to learn about Deep Learning for NLP. This repo by Kim SungDong contains Pytorch implementations of many of the models discussed in the lectures, such as a recursive NN or a dynamic memory network.
The ImageNet model created in the most recent Neural Architecture Search paper is now open-source. While running architecture search without Google-scale resources is largely infeasible, novel architectures such as NASNet and the NASCell for RNNs can still be used by everyone.
Egal is a plugin that lets you easily draw SVG figures in Jupyter notebooks to help illustrate your code and your ideas. It is particularly useful for presentations and supports animations and free-style drawing.
Posts and articles with a linguistic focus
Emily Bender summarizes recent Twitter megathreads discussing the role of linguistics in NLP.
Ryan Cotterell discusses whether NLP is interdisciplinary and argues that work in Computational Linguistics is not really present at *ACL conferences and that NLP fails without linguistic theory.
Vered Shwartz gives an overview of different kinds of ambiguities and outlines why they are problematic for NLP systems.
This article argues that linguistic change is shaped by more than natural selection and that many changes can also be attributed to randomness, which is commonly known as drift.
Articles and posts about novel methods
Ferenc Huszár discusses mixup, a new data augmentation technique that is as baffling as it is seemingly effective: it randomly interpolates examples and their labels to form new examples.
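At its core, mixup is just a convex combination of input pairs and their labels. A minimal NumPy sketch of that interpolation step (the Beta(alpha, alpha) sampling follows the paper; the toy 4-dimensional inputs are purely illustrative):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Create a virtual training example as a convex combination of two
    examples and their one-hot labels, with the mixing coefficient drawn
    from Beta(alpha, alpha) as described in the mixup paper."""
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y

# Toy usage: mix a pair of 4-dimensional inputs with one-hot labels.
x, y = mixup(np.ones(4), np.array([1.0, 0.0]),
             np.zeros(4), np.array([0.0, 1.0]))
```

With a small alpha, lambda concentrates near 0 or 1, so most virtual examples stay close to a real one; larger alpha produces more aggressive blending.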
This Wired article takes a look at capsule networks, the newest brainchild of Hinton and collaborators.
Ankit Gupta compares two recent submissions to ICLR 2018 that both aim to perform unsupervised Neural Machine Translation.
This article gives an overview of a new paper by Salesforce Research on non-autoregressive Neural Machine Translation, which allows NMT to be parallelized more easily.
The New York Times explores the words used in thousands of 'modern love' essay submissions in this interactive article.
Understanding and interpreting our models is important. Visualizing what activates the features is one way to do so. Chris Olah and collaborators give a great overview of different optimization objectives, diversity, and regularization for feature visualization.
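The basic recipe behind feature visualization is gradient ascent on the input: start from noise and optimize the input to maximize a chosen unit's activation, with some regularization. A toy sketch of that idea, assuming a single random linear layer in place of a real network and a plain L2 penalty in place of the more sophisticated regularizers the article surveys:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))  # toy "layer": 16 inputs -> 8 feature units

def visualize_feature(unit, steps=200, lr=0.1, l2=0.01):
    """Gradient ascent on the input to maximize one unit's activation.

    Objective: W[unit] @ x - (l2 / 2) * ||x||^2, so the gradient w.r.t.
    x is W[unit] - l2 * x. Real feature visualization does the same thing
    through a deep network via backprop, with stronger regularizers.
    """
    x = rng.normal(scale=0.1, size=16)  # start from small noise
    for _ in range(steps):
        x = x + lr * (W[unit] - l2 * x)
    return x

x = visualize_feature(unit=3)
```

For this linear toy the optimum is simply a scaled copy of the unit's weight vector; in a deep network the same procedure yields the input patterns a feature responds to.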
More articles and blog posts
AI Grant (blog.aigrant.org) gives grants to people around the world working on exciting AI research projects. The new batch of 20 grant winners includes Radim Rehurek, creator of the hugely popular topic modeling library gensim, who will add support for many of the latest-and-greatest research papers.
Baselines are essential for knowing whether our models actually perform well, but they are often neglected. Stephen Merity advises adopting a baseline and giving it the care it deserves.
This article gives another example of the various ways bias can impact our models, which is a problem even for big companies.
Ani Nenkova and her students look at the characteristics that make up great scientific writing. One of the biggest surprises: Great writing is lighter on detail than lesser-known work and is able to convey complex information without going into the details of a particular scientific finding.
A Wired article about Replika, an app that creates an artificially intelligent doppelgänger offering a glimpse into the future of human-bot interaction.
Jacob Andreas gives two different perspectives on synthetic datasets that have been somewhat controversial in the NLP community (e.g. bAbI for QA) and argues why they can still be useful to both the AI and the NLP community.
Daniel Gross recommends seven questions that you should ask when you consider joining an AI/ML company as a data scientist or engineer.
The State of ML and Data Science 2017 kaggle survey paints a comprehensive picture of the data science landscape. The average survey respondent is 30 years old, has a Master’s degree, a job as a Data Scientist, and makes about $55,000 per year.
Berlin-based Ada Health raises $47M and is one of the world's fastest-growing medical apps in 2017. In a chat interface, it helps people decipher their ailments, but then also connects them with real doctors.
Dublin-based Deep Learning and NLP startup AYLIEN raises €2M new investment. AYLIEN provides a suite of NLP services as well as a platform for text analysis and develops new solutions with a focus on the analysis of news.
Disclaimer: I'm a research scientist at Aylien. We're also hiring!
Dehghani et al. propose a new approach to semi-supervised learning with weak supervision. Their framework first trains a student network on weakly annotated data. They then train a Gaussian Process-based teacher on gold data using the student's representations and use the teacher to estimate the confidence of the weakly supervised examples. Finally, the student is fine-tuned on the weakly supervised data, with each example weighted by the teacher's confidence estimate.
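The final fine-tuning step boils down to a per-example weighted loss. A minimal sketch for a linear least-squares student, assuming the teacher's confidence scores are simply given as an array in [0, 1] (the paper derives them from a Gaussian Process teacher; everything here is a toy stand-in):

```python
import numpy as np

def fidelity_weighted_update(w, X_weak, y_weak, confidence, lr=0.1):
    """One gradient step on weakly labelled data for a linear student,
    with each example's squared error scaled by the teacher's confidence.
    Low-confidence (likely mislabelled) examples barely move the model."""
    errors = X_weak @ w - y_weak
    grad = X_weak.T @ (confidence * errors) / len(y_weak)
    return w - lr * grad

# Toy usage: 10 weakly labelled examples, all fully trusted.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(10, 3)), rng.normal(size=10)
w = fidelity_weighted_update(np.zeros(3), X, y, confidence=np.ones(10))
```

Setting an example's confidence to zero removes it from the gradient entirely, which is the intuition behind fidelity weighting: trust the weak labels only as far as the teacher does.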
This is a great example of a creative use of NLP: Frermann et al. set out to solve crime investigations using NLP techniques. They frame Whodunnit, i.e. the task of identifying the perpetrator of a crime, as an inference task. They create a new dataset based on CSI: Crime Scene Investigation episodes and train an LSTM on multi-modal data (textual, visual, and acoustic) to predict who committed the crime.
Are you interested in conversational agents? This paper gives an overview of recent advances in dialogue systems from various perspectives and discusses task-oriented and non-task-oriented models. It also outlines some interesting research directions for dialogue systems research.
DeepMind open-sources a dataset for algebra question answering annotated with rationales. The dataset consists of about 100,000 algebraic word problems paired with natural language rationales.