Has Microsoft’s neural machine translation (NMT) really reached parity with human translation?

by Pisana Ferrari – cApStAn Ambassador to the Global Village

While at cApStAn we investigate how neural machine translation (NMT) can improve a human translator’s output rate while increasing consistency, the blue-chip companies driving progress in NMT still fall into the trap of comparing NMT output with human output. Microsoft has again claimed that its NMT has reached parity with human translation. It has indeed made huge progress, but it is not yet sophisticated enough to support that claim: this is the conclusion that Tommi Nieminen, a translation technology developer, has reached after a careful analysis of the Microsoft research. His analysis is referenced in the latest edition of the “Tool Box Journal”, a computer journal for translation professionals. We at cApStAn believe that parity of NMT with human translation is far less interesting than reports on how efficient a human translator can become when integrating state-of-the-art NMT into the translation workflow and using it with discernment.

The author points out, first of all, that Microsoft compared its NMT output with translations produced by non-native translators, i.e. native Chinese speakers translating into English. Nieminen says that while this is not stated explicitly in the paper, it was obvious from reading the reference translations. According to him, only exceptional translators can translate competently into a language that is not their native one. The article gives examples of a number of mistakes in the English translations; even sentences that were technically correct were “stilted and unidiomatic”.

What about the judges called in to evaluate the quality of the translation? According to Microsoft, human parity is achieved when a “bilingual” human judges the quality to be equivalent. Nieminen points out that being bilingual does not always or necessarily imply the kind of competence required to evaluate the quality of a translation. Moreover, in manual translation evaluation it is often observed that judges have trouble understanding the instructions and may consequently devise their own strategies for scoring sentences.

In its research report, Microsoft acknowledges that its NMT translation was not error-free, but adds that “machines, like humans, will continue to make mistakes”. The percentage of errors was relatively high, in particular incorrect words, grammatical mistakes, missing words and named-entity mistakes. Nieminen notes that, by contrast, expert human translators can consistently produce virtually error-free translations (if not pressed for time, he adds).

Finally, and this is a point that keeps coming up in discussions of NMT, Nieminen says that the evaluation of isolated sentences is not sufficient to demonstrate human parity: “human translations are produced within textual, cultural and other contexts”, and these are vital elements of a true understanding.