Multi-task learning (MTL), training a model on several tasks at once so that information is shared between them, is a general method that is fundamental to training neural networks.
Rich Caruana's 1997 paper is one of the best introductions to this topic and remains as relevant today as it was back then. For more recent overviews, you can check out
my survey from 2017 or a
survey from 2020 that I enjoyed.
Research in multi-task learning has long shown that models trained on many tasks learn representations that generalize better to new ones. A common challenge in multi-task learning, however, is minimizing negative transfer, i.e. ensuring that dissimilar tasks do not hurt each other.
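In its simplest and most common form, hard parameter sharing, the tasks share an encoder and each task gets its own small output head. Below is a minimal PyTorch sketch of this setup; the architecture, dimensions, and task names are purely illustrative.

```python
import torch
import torch.nn as nn

# A minimal sketch of hard parameter sharing: one shared encoder feeds one
# small head per task, and the per-task losses are simply summed.
class SharedEncoderMTL(nn.Module):
    def __init__(self, num_labels, vocab_size=30000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)  # shared across tasks
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, n) for t, n in num_labels.items()})

    def forward(self, token_ids, task):
        _, h = self.encoder(self.embed(token_ids))
        return self.heads[task](h[-1])  # task-specific classification head

model = SharedEncoderMTL({"sentiment": 2, "topic": 4})
loss_fn = nn.CrossEntropyLoss()
batches = {  # one toy batch of token ids and labels per task
    "sentiment": (torch.randint(0, 30000, (8, 16)), torch.randint(0, 2, (8,))),
    "topic": (torch.randint(0, 30000, (8, 16)), torch.randint(0, 4, (8,))),
}
loss = sum(loss_fn(model(x, task), y) for task, (x, y) in batches.items())
loss.backward()
```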
In recent years, despite much work on alternative training objectives, the NLP community has gravitated towards a single pre-training objective
to rule them all,
masked language modelling (MLM). Much recent work has focused on ways to adapt and improve it (e.g.,
Levine et al., 2021). Even the next-sentence prediction objective used in BERT has slowly been phased out (
Aroca-Ouellette & Rudzicz, 2020).
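As a refresher, MLM masks a random subset of input tokens and trains the model to reconstruct them. Here is a rough PyTorch sketch of the BERT-style recipe (15% masking rate, a single [MASK] id); real implementations add details such as the 80/10/10 replacement scheme that are omitted here.

```python
import torch

# Corrupt a random subset of tokens and compute the loss only on those
# positions. The 15% rate and single [MASK] id follow the common BERT recipe.
def mlm_corrupt(token_ids, mask_id=103, mask_prob=0.15):
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_prob
    labels[~mask] = -100              # ignored by the cross-entropy loss below
    corrupted = token_ids.clone()
    corrupted[mask] = mask_id         # replace selected tokens with [MASK]
    return corrupted, labels

token_ids = torch.randint(1000, 2000, (4, 32))   # a toy batch of token ids
inputs, labels = mlm_corrupt(token_ids)
# With any encoder producing per-token vocabulary logits of shape
# (batch, seq, vocab), the objective would be:
# loss = torch.nn.functional.cross_entropy(
#     logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
```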
Recently, there has been a flurry of papers showing not only that multi-task learning helps pre-trained models, but also that
gains are larger when more tasks are used. Such massive multi-task learning settings cover up to around 100 tasks, going beyond earlier work that covered around 50 tasks (
Aghajanyan et al., 2021).
A key reason for this convergence of papers is that multi-task learning is much easier with recent models, even across many tasks. This is because many recent models such as T5 and GPT-3 use a text-to-text format. Gone are the days of hand-engineered task-specific loss functions for multi-task learning. Instead, each task only needs to be expressed in a suitable text-to-text format and the model can learn from it without any changes to the underlying architecture.
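To make this concrete, here is a toy illustration (not taken from any of the papers above) of casting two rather different tasks into the same text-to-text format, so that a single sequence-to-sequence model can learn both with an ordinary cross-entropy loss over output tokens:

```python
# Two unrelated tasks expressed as (input text, target text) pairs. The task
# prefixes and label words are illustrative, in the spirit of T5's format.
def to_text_to_text(task, example):
    if task == "sentiment":
        return (f"sentiment: {example['text']}",
                "positive" if example["label"] == 1 else "negative")
    if task == "nli":
        labels = ["entailment", "neutral", "contradiction"]
        return (f"nli premise: {example['premise']} hypothesis: {example['hypothesis']}",
                labels[example["label"]])
    raise ValueError(f"unknown task: {task}")

print(to_text_to_text("sentiment", {"text": "A great movie.", "label": 1}))
# ('sentiment: A great movie.', 'positive')
print(to_text_to_text("nli", {"premise": "A dog runs.",
                              "hypothesis": "An animal moves.", "label": 0}))
# ('nli premise: A dog runs. hypothesis: An animal moves.', 'entailment')
```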
The newly proposed approaches differ in terms of
how and
when multi-task learning is applied. One choice is
fine-tuning an existing pre-trained model on a collection of multiple tasks, i.e.
behavioural fine-tuning. This is done by T0 (
Sanh et al., 2021), one of the first outcomes of the
BigScience workshop, which builds on T5, and by FLAN (
Wei et al., 2021), which uses a GPT-3-like pre-trained model. Both papers describe a unified template and instruction format into which they convert existing datasets. BigScience open-sources their collection of prompts
here. Both papers
report large improvements in terms of zero-shot and few-shot performance compared to state-of-the-art models like T5 and GPT-3.
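To give a flavour of what such templates look like, here is a hypothetical pair of prompts for an NLI example; the wording is made up for illustration rather than copied from the T0 or FLAN collections:

```python
# Two hypothetical prompt templates for the same NLI example. Each template is
# a (prompt, verbalizer) pair: the prompt turns the fields into a natural-
# language question and the verbalizer maps the label id to a target word.
templates = [
    ("Premise: {premise}\nHypothesis: {hypothesis}\n"
     "Does the premise entail the hypothesis? yes, maybe, or no?",
     {0: "yes", 1: "maybe", 2: "no"}),
    ("{premise} Based on the previous sentence, is it true that \"{hypothesis}\"?",
     {0: "yes", 1: "maybe", 2: "no"}),
]

example = {"premise": "A dog runs in the park.",
           "hypothesis": "An animal is outside.",
           "label": 0}

# Every (prompt, target) pair becomes a text-to-text training example.
for prompt, verbalizer in templates:
    print(prompt.format(**example), "->", verbalizer[example["label"]])
```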
Min et al. (2021) propose a different fine-tuning setting that optimizes for
in-context learning: instead of fine-tuning a model on examples of a task directly, they provide the concatenation of
k+1 examples to a model as input
x_1, y_1, …, x_k, y_k, x_{k+1} and train the model to predict the label of the
(k+1)-th example,
y_{k+1}. They similarly report improvements in zero-shot transfer.
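A rough sketch of how such a training instance could be assembled is shown below; the separator and the sentiment-style formatting are illustrative assumptions, not the authors' exact choices.

```python
# k labelled demonstrations plus one query are packed into a single input and
# the model is trained to generate only the final label.
def build_in_context_instance(demos, query_x, query_y, sep="\n"):
    context = sep.join(f"{x} {y}" for x, y in demos)
    return f"{context}{sep}{query_x}", query_y

demos = [("Review: Great film! Sentiment:", "positive"),
         ("Review: Dull and slow. Sentiment:", "negative")]
model_input, target = build_in_context_instance(
    demos, "Review: Loved every minute. Sentiment:", "positive")
# Train with a standard language-modelling loss computed only on the target tokens.
```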
In contrast to the previous approaches, ExT5 (
Anonymous et al., 2021)
pre-trains a model on a large collection of tasks. They observe that using multiple tasks during pre-training is better than during fine-tuning and that
multi-task pre-training combined with MLM is significantly more sample-efficient than using MLM alone.
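As a loose sketch of what such a multi-task pre-training mixture could look like (not ExT5's actual recipe: the 50/50 mixing rate, the span-corruption example, and the task examples are all illustrative assumptions):

```python
import itertools
import random

# Interleave self-supervised (span-corruption) examples with examples drawn
# from labelled tasks. The mixing rate is a hypothetical hyperparameter.
def mixture(self_supervised, supervised, supervised_rate=0.5):
    ss = itertools.cycle(self_supervised)
    sup = {task: itertools.cycle(data) for task, data in supervised.items()}
    while True:
        if random.random() < supervised_rate:
            yield next(sup[random.choice(list(sup))])
        else:
            yield next(ss)

stream = mixture(
    self_supervised=[("Thank you <X> me to your party <Y> week.",
                      "<X> for inviting <Y> last <Z>")],
    supervised={"nli": [("nli premise: A dog runs. hypothesis: An animal moves.",
                         "entailment")],
                "summarization": [("summarize: The report describes ...",
                                   "A short summary.")]},
)
for _ in range(3):
    print(next(stream))
```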