
New research using AI on a corpus of 3,000 books reveals a 4:1 male-female literary imbalance

by Pisana Ferrari – Branding and Social Media Manager

Despite progress in recent decades, gender disparity still affects many aspects of economic, social and cultural life. Two researchers at the Information Sciences Institute of the University of Southern California have used AI to show that literature is no exception, uncovering a substantial 4:1 male-female imbalance. (1) Authors Akarsh Nagaraj and Mayank Kejriwal used Natural Language Processing (NLP) to obtain gender-specific cultural analytics on over 3,000 English literary texts included in Project Gutenberg. (2) NLP was used to segment sentences, extract characters (using disambiguation so that they are not overcounted) and pronouns, and assign gender to characters. The books were written by 142 authors between 1700 and 1950. They span genres from adventure and science fiction to mystery and romance, and take varied forms, including novels, short stories, and poetry.
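To make the processing steps above more concrete, here is a minimal sketch in Python of one part of such a pipeline: segmenting text into sentences and counting gendered pronouns. It is an illustration only, not the authors' code. The pronoun lists, the function names and the simple regex-based sentence splitter are all assumptions made for this example, and the study's character extraction and disambiguation steps are not reproduced.

```python
import re
from collections import Counter

# Hypothetical pronoun lists, chosen for illustration; the lexicon used
# in the study is not given in this article.
MALE_PRONOUNS = {"he", "him", "his", "himself"}
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def count_gendered_pronouns(text):
    """Count male and female pronouns across all sentences of a text."""
    counts = Counter(male=0, female=0)
    for sentence in split_sentences(text):
        for token in re.findall(r"[a-z]+", sentence.lower()):
            if token in MALE_PRONOUNS:
                counts["male"] += 1
            elif token in FEMALE_PRONOUNS:
                counts["female"] += 1
    return counts


def male_female_ratio(counts):
    """Return male/female ratio, or None when there are no female pronouns."""
    if counts["female"] == 0:
        return None
    return counts["male"] / counts["female"]


if __name__ == "__main__":
    sample = "He took his hat. She smiled at him. He left, and he did not return."
    counts = count_gendered_pronouns(sample)
    print(counts, male_female_ratio(counts))
```

A real pipeline would rely on a trained NLP library for sentence segmentation and named-entity recognition rather than regular expressions, and would need to resolve different mentions of the same character before counting.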

Lack of diversity

The authors say that with the renewed focus on diversity and equity in the current era, understanding the lack of such diversity in cultural hallmarks, such as influential literary texts, is an important first step. Interviewed by Science News, Dr. Mayank Kejriwal says “Gender bias is very real, and when we see females four times less in literature, it has a subliminal impact on people consuming the culture”.

Gender stereotypes

The authors say that words associated with women included adjectives such as “weak”, “amiable”, “pretty”, and sometimes even “stupid”. For male characters, the words describing them included “leadership”, “power”, “strength” and “politics”. The team didn’t quantify this facet of their study but say there is scope for future qualitative investigation on word associations with gender.

Importance of the research

The authors say that their study has quantitatively revealed, in an indirect way, how bias persists in culture. They expect that researchers currently studying gender disparity and gender bias in literature will benefit widely from this data. They hope that the study will underline the importance of interdisciplinary research and AI technology in highlighting pressing social issues and inequalities that can be addressed.

Limitations of the study

The authors acknowledge the possibility of bias in data processing and estimations of accuracy: e.g. the assumption that gender can be determined from the names of the book authors; the relatively small samples; the simplistic dichotomy of male-female in determining gender (there may be non-binary and transgender authors/characters in the corpus). They hope future researchers will make ethical use of the data in their research, rather than treat it as ground truth without further critical review.


1) “Dataset for studying gender disparity in English literary texts”, Akarsh Nagaraj, Mayank Kejriwal, University of Southern California, Information Sciences Institute, in Data in Brief (Elsevier), Volume 41, 2022, 107905, ISSN 2352-3409

2) Founded by Michael S. Hart (inventor of the eBook) in 1971, Project Gutenberg is an online library of over 60,000 free eBooks, in over 60 languages and dialects. It contains the world’s best literature, with a focus on older works for which U.S. copyright has expired. Thousands of volunteers contribute by digitizing and proofreading the eBooks.


“Male Characters are Four Times More Prevalent in Pre-Modern Literature than Female Characters: Study”, News Staff, Science News, April 29, 2022

“Four times more male characters in literature than female, research suggests”, Sarah Shaffi, The Guardian, April 27, 2022

“Study finds that males are represented four times more than females in literature”, University of Southern California, Phys Org, April 27, 2022

“AI study finds that males are represented four times more than females in literature”, Maya Abu-Zahra, USC Viterbi School of Engineering News, April 22, 2022

“New research uncovers substantial gender disparity in the literature”, Staff writer, Mental Daily, April 28, 2022 
