Large pre-trained language models (LMs) are the de facto standard for achieving state-of-the-art performance on tasks in natural language processing. However, with great power comes great responsibility.
Bender and Gebru et al. (2021) provide an overview of issues with large LMs, which covers the following topics, among others.
Access 👩💻 Due to the reliance on huge amounts of compute, pre-training of the largest models has mostly been restricted to well-funded corporations, with a few exceptions such as
Grover. While checkpoints of models such as BERT and RoBERTa are widely available, the largest recent models (
Fedus et al., 2021) go well beyond the capacity of off-the-shelf GPUs. Such models either cannot easily be fine-tuned by practitioners or are gated behind an API. Given the prominent role that such models will likely play in the future of NLP, it is crucial that the community be involved in their design. Community-led initiatives such as
EleutherAI thus seek to replicate large-scale modelling efforts such as GPT-3 and to make them more widely available. Other collaborative projects such as
BIG-bench focus on making the benchmarking of such models more accessible.
Energy ⚡️ Another downside is that the compute required to train such models incurs a large financial and environmental cost. It is thus key to develop more efficient methods that lower these costs, such as more sample-efficient pre-training methods (
Clark et al., 2020; see
this post for an overview). In addition, we should benchmark our methods not only in terms of absolute performance but also in terms of energy efficiency (
Henderson et al., 2020). While we have managed to make downstream training more sample-efficient via fine-tuning, pre-training is generally still done from scratch. I’d like to see more work that seeks to lower the cost of pre-training itself, for instance by warm-starting from the representations of earlier model iterations or by distilling from similar pre-trained models.
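To make the second idea a bit more concrete, below is a minimal sketch of the standard knowledge-distillation objective one could use to train a smaller student LM against the outputs of a similar pre-trained teacher. The temperature, the loss weighting, and the random tensors standing in for model outputs are illustrative assumptions, not a prescribed recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target distillation term with the usual hard-label loss.

    student_logits, teacher_logits: (batch, vocab_size) scores over the vocabulary.
    labels: (batch,) gold token ids for the hard-label term.
    temperature, alpha: illustrative hyperparameters, not tuned values.
    """
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Hard targets: standard cross-entropy against the gold tokens.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage; in practice the teacher logits would come from a frozen pre-trained LM
# and the loss would be computed over masked or next-token positions.
batch_size, vocab_size = 8, 30522
student_logits = torch.randn(batch_size, vocab_size, requires_grad=True)
teacher_logits = torch.randn(batch_size, vocab_size)
labels = torch.randint(0, vocab_size, (batch_size,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

Since the teacher only needs forward passes, such a setup reuses the compute already spent on the teacher's pre-training rather than starting from scratch.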
Bias ⚖️ There have been many studies focused on the biases that such models inherit from their pre-training data (e.g. Basta et al., 2019). Some recent discussions online (see e.g. this short essay) have focused, among other things, on whether we should prescribe how a language model ought to behave. While large pre-trained LMs have been likened to many things, from
puppet characters 🐒 to
uncertain winged animals 🦜, it is up to us to ensure that they do not become yet another metaphorical bird: the canary in the coal mine of algorithmic bias. For such models to have a positive impact on as many people as possible, the same care and deliberation that goes into their design must also go into choosing the data they are trained on. In particular, we should revisit tacit assumptions such as the use of
lists of banned words.
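To make that tacit assumption concrete, the sketch below shows the kind of blocklist filter that has been used when assembling pre-training corpora: any document containing a listed word is dropped wholesale, regardless of context. The blocklist contents and helper names here are purely illustrative.

```python
import re

# Purely illustrative blocklist; real pipelines have used much longer lists.
BANNED_WORDS = {"badword1", "badword2"}

# Pre-compile a pattern that matches any blocklisted word as a whole token.
_banned_pattern = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in BANNED_WORDS) + r")\b",
    flags=re.IGNORECASE,
)

def keep_document(text: str) -> bool:
    """Return False if the document contains any blocklisted word."""
    return _banned_pattern.search(text) is None

corpus = [
    "A perfectly innocuous news article.",
    "A forum post that mentions badword1 once, in passing.",
]
filtered = [doc for doc in corpus if keep_document(doc)]
print(filtered)  # the second document is discarded in its entirety
```

Because the decision is made at the document level, a single flagged term is enough to remove a text entirely, which is exactly the kind of design choice that deserves a second look.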