One of the most important factors for making progress on an idea in ML is the speed of iteration, i.e. how long it takes to try a hypothesis or a set of hyper-parameters on a dataset and to obtain results. When we are still validating an idea and testing potential hypotheses, we would ideally like to work with a minimum viable dataset (MVD) for our setting, i.e. a dataset that is a) small so that models can be trained efficiently, b) diagnostic in that it can differentiate good from bad models, and c) representative of the capabilities that we’d like our models to learn in more realistic settings.
MNIST ✍️
MNIST is a classic minimum viable dataset in computer vision, but it is less used for validating recent approaches since most models achieve 99%+ accuracy on it. In addition, its inputs are 784-dimensional and thus require a non-trivial amount of computation. Popular, more recent datasets, primarily used for meta-learning, are mini-ImageNet, a down-sampled version of a subset of ImageNet classes, and a number of other datasets used in Meta-Dataset. Sam Greydanus also recently proposed the MNIST-1D dataset in this blog post as a more efficient, minimum viable alternative to MNIST.
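To make the idea of quick iteration concrete, here is a minimal sketch of a sanity-check run on MNIST. It assumes PyTorch and torchvision are installed; the linear model and hyper-parameters are purely illustrative, not a recommendation.

```python
# Minimal sanity-check loop on MNIST: a linear classifier as a weak baseline.
# PyTorch/torchvision are assumed; all settings are illustrative only.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # 784-dim inputs
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2):                # a rough signal, not a final number
    for images, labels in loader:
        loss = nn.functional.cross_entropy(model(images), labels)
        optim.zero_grad()
        loss.backward()
        optim.step()
```

The point is not the model but the turnaround time: a loop like this gives a first signal about an idea before committing to a larger-scale run.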
SQuAD and MultiNLI 🙋♀️
In this context, I think it is interesting to consider what minimum viable datasets exist for current NLP models. What dataset do you turn to when you quickly want to validate an idea? In my impression, SQuAD and MultiNLI have taken on this role for pre-trained models to some extent. Good performance on them demonstrates that a model has learned certain things about natural language, such as a broad understanding of semantics. However, both are far from efficient to train on.
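If you still want to use them for quick checks, one pragmatic option is to work with only a small slice of the data. Below is a minimal sketch using the Hugging Face datasets library (assumed to be installed); the slice sizes are arbitrary.

```python
# Pull small slices of MultiNLI for a fast sanity check rather than a full run.
# Requires the Hugging Face `datasets` library; slice sizes are arbitrary.
from datasets import load_dataset

train = load_dataset("multi_nli", split="train[:5000]")
dev = load_dataset("multi_nli", split="validation_matched[:1000]")

print(train.column_names)   # e.g. premise, hypothesis, label
print(len(train), len(dev))
```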
Beyond these, minimum viable datasets may often be task-specific. Some common datasets are not challenging or realistic enough to differentiate between classic and current methods: on MLDoc, a cross-lingual document classification dataset, word embedding-based approaches and deep models achieve similar performance (Artetxe et al., 2020), while n-gram and deep models perform similarly on public test sets for language identification (Caswell et al., 2020).
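A practical first test of whether a dataset is diagnostic is therefore to see how far a classic baseline gets. Here is a minimal sketch of an n-gram baseline with scikit-learn; 20 Newsgroups merely stands in for whatever task you care about, and the feature settings are illustrative.

```python
# A classic n-gram baseline: TF-IDF features + logistic regression.
# If this comes close to a deep model, the dataset is unlikely to be diagnostic.
# scikit-learn is assumed; 20 Newsgroups stands in for the task of interest.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # word uni- and bigrams
    LogisticRegression(max_iter=1000),
)
baseline.fit(train.data, train.target)
print("test accuracy:", baseline.score(test.data, test.target))
```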
Overall, I think it’s worthwhile to think more about how we can test for certain capabilities of natural language understanding in an efficient and minimum viable way. Subsampling existing datasets, leveraging analysis methods (see
this ACL 2020 tutorial), and evaluating sample efficiency (
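As a rough sketch of what subsampling and sample-efficiency evaluation could look like in practice, the snippet below uses scikit-learn's learning_curve on a stand-in text classification task; the model and subsample sizes are placeholders for whatever setup you actually care about.

```python
# Sample-efficiency check: accuracy as a function of training set size.
# scikit-learn is assumed; the classifier and subsample sizes are placeholders.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train")
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

sizes, train_scores, val_scores = learning_curve(
    model, data.data, data.target,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3, scoring="accuracy",
)
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{int(n)} examples -> {score:.3f} cross-validated accuracy")
```

A method that climbs the curve quickly with few examples is more likely to have learned something transferable than one that only catches up with the full dataset.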
Yogatama et al., 2019) are all viable options. Given how expensive pre-training is, being able to diagnose model performance early is particularly important. In other words,
what is the MNIST for pre-training?