21.01.2019

The language imbalance and “lopsided geography” of Wikipedia

by Pisana Ferrari – cApStAn Ambassador to the Global Village

Thanks to a recent agreement between Google and the Wikimedia Foundation, Google will provide machine translation (MT) support for over 100 languages on Wikipedia, 15 of which were not previously supported, including Hausa, Kurdish (Kurmanji), Yoruba, and Zulu (1). The encyclopedia’s volunteer content editors will have a new MT tool to choose from among those offered by the platform. This is no doubt a positive step: Zulu, for example, is spoken by 12 million people, yet the Zulu Wikipedia has only about 1,000 articles. Many other languages are underrepresented in the encyclopedia, despite its 303 language versions. English tends to dominate written content and, while some parts of the world are heavily represented, others are largely left out. Recent research warns of Wikipedia “reproducing new, uneven, geographies of information”.

Language imbalance

Looking at the official Wikipedia statistics (2), one can see that, in addition to Zulu, there are 80 other language versions that have fewer than 2,000 articles, and 160, i.e. over half of the total 303, have fewer than 10,000 (by comparison, the English version has more than 5.7 million). According to an article in “The Conversation” (3), not only are many languages underrepresented, but while most articles about European and East Asian countries are written in their main languages, English dominates for much of Africa, the Middle East and even parts of South and Central America. The article refers to a study conducted by a group of Oxford researchers (4), which mined 44 language editions of Wikipedia, mapping more than 3 million articles that had been “geotagged”. The study found that there are more Wikipedia articles in English than in Arabic about almost every Arabic-speaking country in the Middle East, and that there are more English articles about North Korea than there are Arabic articles about either Saudi Arabia, Libya, or the United Arab Emirates. All this matters, says the author, because different languages may lead to very different narratives about places and topics.

Geographic imbalance

Another big issue has to do with disparities in “knowledge production”. Even though Wikipedia has a huge amount of information about millions of events and places around the globe, it is characterized by “uneven and clustered geographies” and “there is simply not a lot of content about much of the world”. The Oxford research highlighted that even though 60% of the world’s population is concentrated in Asia, less than 10% of Wikipedia articles relate to the region. The reverse is true for Europe, which is home to around 10% of the world’s population but accounts for nearly 60% of geotagged Wikipedia articles. The article in “The Conversation” says some parts of the world are therefore massively underrepresented not only in their own language, but also in major world languages. The risk is that Wikipedia “might not just be reflecting the world, but also reproducing new, uneven, geographies of information”.

Conclusion

How can these disparities best be addressed? Wikipedia founder Jimmy Wales explains in a (still relevant) 2016 article titled “The lopsided geography of Wikipedia” (5) that the size of the Wikipedia language versions depends on factors such as literacy rates, access to the internet and computers, and number of speakers. Even the latter is not a perfect indicator, he says, giving the example of the nearly 70 million Tamil speakers: getting them to contribute to Wikipedia in Tamil depends on how widespread internet access is and how high the demand is for information in that language rather than in English. Censorship also plays a role, e.g. in China, where Wikipedia has “flickered in and out” of the internet over the years. Part of the problem may also be the strict editorial policy Wikipedia has put in place, particularly with regard to sourcing. A vast amount of material (books, maps, photos, etc.) is required in order to generate a new article, while working on existing content is relatively easy. The Wikipedia editorial policies, which are thought by many to be one of Wikipedia’s greatest strengths, “might also be one of its greatest impediments to expansion”.

Footnotes

1) https://wikimediafoundation.org/2019/01/09/you-can-now-use-google-translate-to-translate-articles-on-wikipedia/

2) https://en.wikipedia.org/wiki/List_of_Wikipedias

3) https://theconversation.com/geotagging-reveals-wikipedia-is-not-quite-so-equal-after-all-30550

4) https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2382617 – full report at https://bit.ly/2sRYMM5

5) https://www.theatlantic.com/international/archive/2016/06/geography-wikipedia-jimmy-wales/487388/

Photo: Soner Eker @ Unsplash