Prior work has found that pre-trained models are biased and can generate discriminatory or even toxic language. Ensuring safe responses is thus an important aspect of the development of such models. Recent models such as
LaMDA,
InstructGPT, and
Gopher, developed by Google, OpenAI, and DeepMind respectively, emphasize safety in their model evaluation and training. A common recipe is to
fine-tune pre-trained models on data labeled with safety ratings by human annotators, either by training a reward model and optimizing against it with reinforcement learning, or by training a safety detector and filtering out unsafe responses.
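As a rough illustration of the detector half of this recipe, the sketch below trains a toy safety classifier on binary annotator ratings. Everything here (the feature dimension, the small MLP head, the dummy batch) is a placeholder I made up for illustration; in a real system the classifier head would sit on top of the pre-trained model and consume its hidden states rather than random feature vectors.

```python
import torch
import torch.nn as nn

# Minimal sketch: train a safety detector on human-labeled responses.
# FEATURE_DIM and the dummy data below are placeholders; in practice the
# detector head is attached to the pre-trained model itself.
FEATURE_DIM = 256

detector = nn.Sequential(
    nn.Linear(FEATURE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 1),  # logit for "this response is safe"
)
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

# Dummy batch: response representations plus binary safety ratings from annotators.
response_features = torch.randn(32, FEATURE_DIM)
safety_labels = torch.randint(0, 2, (32,)).float()

logits = detector(response_features).squeeze(-1)
loss = loss_fn(logits, safety_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"detector loss: {loss.item():.3f}")
```

Both variants of the recipe come up again below: filtering in the LaMDA discussion, and the reward model in the InstructGPT discussion.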
For LaMDA, crowdworkers annotate model responses based on a set of safety criteria. The model is then fine-tuned both to generate dialogue responses and to predict the annotated safety labels. This multi-task setting is not only more efficient but also enables sharing information between the tasks. At test time, candidate responses for which the model predicts a low safety rating are filtered out. The authors find that this fine-tuning setup significantly improves the safety of generated responses.
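To make the filtering step concrete, here is a schematic version of this kind of candidate filtering. The functions generate_candidates and predict_safety_score are hypothetical stand-ins for the model's two heads (response generation and safety prediction), and the 0.9 threshold is an arbitrary choice for illustration, not a value from the paper.

```python
import random

# Hypothetical stand-ins for the two heads of a multi-task dialogue model:
# one generates candidate responses, the other predicts a safety score.
def generate_candidates(context, n=8):
    # In a real system: sample n responses from the dialogue model.
    return [f"candidate response {i}" for i in range(n)]

def predict_safety_score(context, candidate):
    # In a real system: the model's predicted safety rating for the candidate.
    return random.random()

def respond(context, safety_threshold=0.9):
    """Filter out candidates whose predicted safety score falls below the
    threshold, then answer with one of the remaining candidates."""
    candidates = generate_candidates(context)
    safe = [c for c in candidates
            if predict_safety_score(context, c) >= safety_threshold]
    if not safe:
        return "I'd rather not respond to that."  # fallback if everything is filtered
    return safe[0]  # in practice, rerank the safe candidates by quality

print(respond("How do I get started with rock climbing?"))
```

Note that the safety threshold only gates which candidates are eligible; choosing among the remaining safe candidates is a separate quality-ranking decision.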
For InstructGPT, GPT-3 is first fine-tuned in a supervised setting on demonstrations written by annotators following instructions. In a second step, raters rank multiple outputs of the fine-tuned model, and these rankings are used to train a reward model. Finally, the model is fine-tuned with reinforcement learning to maximize the score of the reward model. In human evaluations, InstructGPT's outputs are significantly preferred over GPT-3's, and InstructGPT has replaced GPT-3 in the API.
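The reward-modeling step (the second of the three) can be sketched with a pairwise ranking loss: for each pair of ranked outputs, the reward model is pushed to score the preferred response higher than the rejected one. The small MLP over fixed-size feature vectors below is only a stand-in for the fine-tuned language-model backbone used in practice, and the batch is dummy data; the final reinforcement-learning stage against the trained reward model is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model mapping a fixed-size response representation to a scalar score.
    A small MLP stands in for the fine-tuned language-model backbone used in practice."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def ranking_loss(score_preferred, score_rejected):
    # Pairwise ranking objective: the preferred response should score higher.
    return -F.logsigmoid(score_preferred - score_rejected).mean()

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Dummy batch of representations for (preferred, rejected) output pairs
# derived from rater rankings over outputs for the same prompt.
preferred = torch.randn(16, 128)
rejected = torch.randn(16, 128)

loss = ranking_loss(reward_model(preferred), reward_model(rejected))
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"ranking loss: {loss.item():.4f}")
```

The trained reward model then provides the learning signal for the final reinforcement-learning stage, which fine-tunes the policy to produce outputs the reward model scores highly.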
For Gopher, the authors perform an extensive analysis of the model's toxicity and bias. They find that larger models respond to toxic prompts with more toxic output, but do not amplify the toxicity of the training data when unprompted. They also observe that large models remain prone to bias against subgroups in a few-shot setting and that increasing model size does not overcome limitations in the coverage of dialects.
Overall, both prior work and these recent efforts demonstrate that we cannot just pre-train models and expect them to produce safe or harmless responses. Instead, safety and inclusion need to be treated as key design criteria throughout the development of such models. This requires clearly enumerating and defining potential safety risks, collecting and annotating relevant data, and explicitly training models to exhibit safe behaviour. For recent reviews that highlight potential risks associated with language models, have a look here. I hope to see safety considered as a design criterion and an evaluation dimension in more work going forward.