There is no better data than better data.
Data is king in machine learning. The unreasonable effectiveness of data has been heralded in the past. More recently, researchers have learned the bitter lesson that the scale of data is what ultimately matters most. However, what often goes unsaid is the importance of good data.
Garbage in, garbage out is a familiar concept in computer science: flawed input data leads to flawed outputs. In current ML and NLP, the equivalent may be bias in, bias out. Some bias is useful, such as inductive bias that can be encoded via data augmentation. In general, however, bias in the input data not only leads to biased model predictions but may even be amplified by the model.
In NLP, we are painfully aware of this issue. A long line of recent papers has analysed biases in our models (Blodgett et al., 2020; Shah et al., 2020). The first step in analysing these biases is to look at the input data. Some of my favourite papers take such a data-first approach. For instance, Chen et al. (2016) thoroughly examine the CNN / Daily Mail datasets. Their findings: the datasets are easier than previously thought and can be bested with a simple feature-based model. Given the outsize importance of data, however, such detective work is rare in practice.
As our datasets grow larger, such sleuthing becomes even more arduous. Gone are the days when you could hope to inspect every training example. How do you even make sense of 750 GB of text (the amount of data used for training T5)? Recent analyses of pre-training data increasingly rely on automatic classifiers, e.g. for toxicity (Gehman et al., 2020). Identifying biases where such classifiers are unavailable or perform poorly still requires human eyes. And if such analysis is challenging in English, consider how much harder it is for data in hundreds of languages.
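For illustration, here is a minimal sketch of what such automatic screening might look like with an off-the-shelf classifier from the Hugging Face Hub. The model name and its labels are assumptions for illustration only; Gehman et al. (2020) rely on the Perspective API rather than this model.

```python
# A rough sketch of automatic toxicity screening over documents.
# The model "unitary/toxic-bert" is an assumption for illustration.
from transformers import pipeline

toxicity = pipeline("text-classification", model="unitary/toxic-bert")

documents = [
    "You are a wonderful person and I appreciate your help.",
    "This is the worst, most useless thing I have ever read.",
]

# Score each document; a real analysis would stream over millions of web pages.
for doc in documents:
    result = toxicity(doc, truncation=True)[0]
    print(f"{result['label']} ({result['score']:.3f}): {doc[:60]}")
```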
Caswell et al. (2021) recently embarked on such an ambitious endeavour. Assembling an exceptionally diverse team of 46 volunteers speaking 41 languages, they perform a manual audit of 231 language-specific subsets of large corpora that have been used to train multilingual models, including the multilingual version of C4 (Xue et al., 2020). Annotating 100 lines in each subset according to whether a sentence is an incorrect translation, in the wrong language, or non-linguistic content, they arrive at many startling observations:
- In the automatically aligned WikiMatrix (Schwenk et al., 2019), two-thirds of the audited samples were misaligned on average.
- CCAligned (El-Kishky et al., 2020), OSCAR, and WikiMatrix suffer from severe quality issues.
- 12% of the languages apparently covered by JW-300 (Agić & Vulić, 2019) are supposedly sign languages but are in fact just incorrectly labelled high-resource languages.
While some of these problematic samples can be filtered out based on length ratio, LangID, or TF-IDF wordlists (Caswell et al., 2021), there is no easy fix. The authors instead recommend documenting such issues and not releasing datasets with low percentages of in-language content.
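To make these heuristics concrete, here is a minimal sketch of a length-ratio check for sentence pairs and a LangID check using fastText's pretrained lid.176 model. The model file name, language codes, and thresholds below are illustrative assumptions, not the exact setup of Caswell et al.

```python
# Two simple corpus filters: a length-ratio check and a fastText LangID check.
# Assumes the pretrained LangID model has been downloaded from fasttext.cc
# as "lid.176.bin"; thresholds are illustrative.
import fasttext

lid_model = fasttext.load_model("lid.176.bin")

def length_ratio_ok(src: str, tgt: str, max_ratio: float = 2.5) -> bool:
    """Reject sentence pairs whose token-length ratio suggests misalignment."""
    src_len = max(len(src.split()), 1)
    tgt_len = max(len(tgt.split()), 1)
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= max_ratio

def in_language(text: str, expected_lang: str, threshold: float = 0.5) -> bool:
    """Keep a sentence only if LangID agrees with the corpus' language label."""
    labels, probs = lid_model.predict(text.replace("\n", " "))
    predicted = labels[0].replace("__label__", "")
    return predicted == expected_lang and probs[0] >= threshold

# Example: clean a (tiny) list of English-German sentence pairs.
pairs = [("The cat sat on the mat.", "Die Katze saß auf der Matte.")]
clean = [
    (src, tgt)
    for src, tgt in pairs
    if length_ratio_ok(src, tgt) and in_language(src, "en") and in_language(tgt, "de")
]
print(clean)
```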
On the monolingual side, the C4 dataset used for training T5 (Raffel et al., 2020) was recently made a lot easier to download.
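If you want to poke at the data yourself, here is a minimal sketch of streaming the corpus with Hugging Face Datasets; the dataset name, config, and field below are assumptions based on the public release.

```python
# A minimal sketch of streaming C4 via Hugging Face Datasets instead of
# downloading all ~750 GB up front. The dataset name "allenai/c4", the "en"
# config, and the "text" field are assumptions based on the public release.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Peek at a handful of documents, e.g. to eyeball length, language, and boilerplate.
for i, example in enumerate(c4):
    print(example["text"][:200].replace("\n", " "))
    if i >= 4:
        break
```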
In summary, I hope the above studies, as well as the availability of this data, will inspire new analyses that help broaden our understanding of what goes into our murky piles of linear algebra.