Google AI Introduces A Dataset for Studying Gender Bias in Machine Translation

by Pisana Ferrari – cApStAn Ambassador to the Global Village

A recent Google AI blog post on gender bias in machine translation reports an ambitious but sensible endeavour: the creation of a dataset based on translated Wikipedia biographies, built to analyze common gender errors in machine translation. Why Wikipedia? Google explains that biographies are well-written, geographically diverse, contain multiple sentences, and refer to their subjects in the third person (and so contain plenty of pronouns). This makes them a rich source of the translation errors commonly associated with gender, which often occur when an article refers to a person explicitly in the early sentences of a paragraph but makes no explicit mention of the person in later sentences. The examples provided in the post are clear, the scope of work is defined, and the claims and objectives are realistic. Google specifies that the dataset focuses on one specific problem related to gender bias and does not aim to cover the whole problem, and that the company does not aim to be prescriptive in determining the optimal approach to addressing gender bias. The contribution aims to foster progress on this challenge across the global research community.
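To make the error class concrete, here is a minimal sketch (not Google's actual methodology; the function names and pronoun lists are our own illustration) of how one might flag a translated biography paragraph in which later sentences drift to the wrong gender:

```python
import re

# Hypothetical illustration: flag paragraphs whose translated sentences mix
# masculine and feminine third-person pronouns -- the kind of error the
# Wikipedia-biography dataset is designed to surface.
MASCULINE = {"he", "him", "his"}
FEMININE = {"she", "her", "hers"}

def pronoun_genders(sentence):
    """Return the set of pronoun genders ('m'/'f') found in one sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    genders = set()
    if MASCULINE.intersection(tokens):
        genders.add("m")
    if FEMININE.intersection(tokens):
        genders.add("f")
    return genders

def flag_inconsistent(paragraph_sentences):
    """True if the paragraph refers to its subject with both genders."""
    seen = set()
    for sentence in paragraph_sentences:
        seen |= pronoun_genders(sentence)
    return seen == {"m", "f"}

# A translated paragraph where a later sentence drifts to the wrong gender:
biography = [
    "Marie Curie was a physicist; she pioneered research on radioactivity.",
    "He received the Nobel Prize in Physics in 1903.",  # gender drift
]
print(flag_inconsistent(biography))  # True: mixed genders, a likely MT error
```

A real evaluation would of course need coreference resolution rather than bare pronoun counting, but the sketch shows why multi-sentence, third-person text is the right testbed: the inconsistency only becomes visible across sentences.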

Previous efforts by Google to promote fairness and reduce bias in machine learning include providing both feminine and masculine translations for some gender-neutral words on the Google Translate website. Historically, Google Translate provided only one translation per query, even when the translation could take either a feminine or a masculine form, so the single translation the model produced inadvertently replicated existing gender biases. For example, it would skew masculine for words like “strong” or “doctor,” and feminine for others, like “nurse” or “beautiful.” In an article for our blog titled “Do the footprints of stereotyping and gender bias follow us in online environments?”, cApStAn project manager Emel Ince, who is Turkish, provided some telling examples from her language. Turkish makes no male/female distinction in the third-person pronoun “o”, which can refer to females and males alike. Yet machine translation of Turkish content yields: “O ilginç” → “He is interesting”; “O bir engineer” → “He is an engineer”; “O bir ahçı” → “She is a cook”; “O bir doktor” → “He is a doctor.” In her article, Emel explains how the bias seems to be more salient in some languages than in others, and how this is not necessarily a reflection of gender attitudes in the source language’s native culture but may simply result from the different grammar systems of different languages.
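The dual-translation idea described above can be sketched as follows. This is a hypothetical illustration, not Google Translate's API: the dictionary entries and the `translate` function are invented for the example. The point is simply that for a gender-neutral Turkish query, the system returns both a feminine and a masculine rendering rather than silently picking one:

```python
# Hypothetical sketch of dual translations for gender-neutral queries
# (invented data and function, for illustration only).
DUAL_TRANSLATIONS = {
    "o bir doktor": ("she is a doctor", "he is a doctor"),
    "o ilginç": ("she is interesting", "he is interesting"),
}

def translate(query):
    """Return (feminine, masculine) renderings for a gender-neutral query."""
    key = query.lower()
    if key not in DUAL_TRANSLATIONS:
        raise KeyError(f"no entry for {query!r}")
    return DUAL_TRANSLATIONS[key]

print(translate("O bir doktor"))  # ('she is a doctor', 'he is a doctor')
```

Surfacing both forms shifts the gender choice from the model back to the user, which is exactly the behaviour the post credits to the Google Translate website for some gender-neutral words.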

Photo credit: Shutterstock