Mitigating inherent biases in language models by reinforcement learning

Miguel Couceiro
Université de Lorraine, CNRS, LORIA

Fairness in AI is currently of paramount importance, as AI systems critically impact human lives. One of the most discussed aspects is the intrinsic biases that have been reported in complex language models, and their adverse effects on fields ranging from healthcare to law enforcement. Indeed, several works have reported evidence of gender, geographical, religious, and racial biases in language models, which significantly influence these models' predictions. Most work on mitigating bias in language models focuses on data pre-processing or on embedding debiasing techniques. However, the multitude of bias sources and their specificity render such approaches ineffective, as they fail to adequately address the trade-offs between the different types of bias and their impact on model performance.

In this talk, we will present REFINE-LM, a novel architecture based on Deep Reinforcement Learning to mitigate unintended biases present in pre-trained language models. As we will see, REFINE-LM can significantly and simultaneously reduce various stereotypical biases without compromising the performance of the underlying language models. A toy sketch of the general idea is given below.
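The abstract does not detail the architecture, but the general mechanism of policy-gradient debiasing on top of a frozen model can be illustrated with a minimal sketch. The PyTorch snippet below is not the REFINE-LM implementation: the frozen "base model" (random fixed logits), the additive filter layer, and the counterfactual-pair reward are all illustrative assumptions. It only shows how a small trainable layer over a frozen model's predictions can be optimized with REINFORCE against a bias-sensitivity signal.

```python
# Minimal sketch (not the authors' implementation): a REINFORCE-style
# debiasing layer on top of a frozen "language model". The toy model,
# the additive filter, and the bias reward are illustrative assumptions.
import torch

torch.manual_seed(0)

VOCAB = 8    # toy vocabulary size
PAIRS = 16   # counterfactual prompt pairs (e.g., "he ..." vs. "she ...")

# Frozen base model: fixed next-token logits for each prompt in a pair.
base_logits_a = torch.randn(PAIRS, VOCAB)                          # attribute A
base_logits_b = base_logits_a + 0.5 * torch.randn(PAIRS, VOCAB)    # attribute swapped

# Trainable filter: a per-token additive correction to the frozen logits.
filter_bias = torch.zeros(VOCAB, requires_grad=True)
opt = torch.optim.Adam([filter_bias], lr=0.05)

def policy(logits):
    """Debiased next-token distribution: frozen logits + learned correction."""
    return torch.softmax(logits + filter_bias, dim=-1)

for step in range(200):
    pa, pb = policy(base_logits_a), policy(base_logits_b)

    # Sample a token for each prompt pair from the debiased policy.
    dist_a = torch.distributions.Categorical(pa)
    tok_a = dist_a.sample()

    # Toy reward: high when both prompts in a pair assign similar probability
    # to the sampled token, i.e. predictions are insensitive to the swap.
    reward = -(pa.gather(1, tok_a[:, None])
               - pb.gather(1, tok_a[:, None])).abs().squeeze(1)

    # REINFORCE: raise log-probs of sampled tokens in proportion to reward.
    loss = -(reward.detach() * dist_a.log_prob(tok_a)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("learned correction:", filter_bias.data)
```

In the actual REFINE-LM architecture, the trainable component and the reward are of course more elaborate; the sketch is only meant to convey the reinforcement-learning loop in which the base language model stays frozen.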

Most of the results we will discuss were obtained in an ongoing collaboration with Luis Galarraga (Inria) and Rameez Qureshi (UCD).