Complexity of Creating Multiple Comparable Language Versions of Assessments

By Dr Kadriye Ercikan, Educational Testing Service (ETS)

Multiple language versions of assessments serve important roles within different countries and internationally. They provide opportunities to assess individuals in the languages in which they can best demonstrate their knowledge, skills, and abilities. In this way, multiple language versions play a critical role in supporting the fairness and validity of interpretations of assessment results for individuals from different language and cultural backgrounds. They also facilitate comparisons of performance across language and cultural groups. Such comparisons are central to international assessments such as the Programme for International Student Assessment (PISA) and the Trends in International Mathematics and Science Study (TIMSS), which provide data that are useful for gaining insights into the effectiveness of educational practices and education systems. However, establishing the validity and fairness of inferences from assessments that involve multiple language versions and individuals from different cultures and countries is complex.

Score comparability and consistency of score meaning for different groups in a society are key concerns for all assessments, but these issues gain heightened importance and complexity in assessments administered in multiple languages. Multiple language versions may be used not only for assessments that are administered in different countries, but also within a single country where the test will be taken by individuals from diverse language and cultural backgrounds. Valid interpretation and use of assessments requires that scores reflect the underlying knowledge and abilities that the test is designed to measure, and that score meaning is consistent for individuals from different language and cultural backgrounds. In particular, validity is tied to whether

– the assessments tap the knowledge and skills that we are interested in assessing;

– constructs being assessed are comparable for different cultural groups; and

– assessments and scores are comparable across languages and cultures.

These requirements are central to making comparisons across language and cultural groups. They require us, as assessment developers, users, and specialists, to examine and verify which constructs the assessments are targeting and whether they are assessing the same construct with the same psychometric properties for different groups.   

In a previous publication (Ercikan & Por, 2020, pp. 217-218), we discussed a seven-step process for developing multiple language versions of assessments, starting from a source version and creating a target version; the process is intended to facilitate provision of comparable scores and consistent score meaning across languages:

1. Examine construct equivalence. Examine the construct definitions of both the source test and the target version in their respective languages and cultures. For example, a review panel may determine the extent to which the construct is appropriate in the target culture and identify aspects of the construct that may differ between the two language and cultural groups.

2. Select a test adaptation and development method. Choose which type of test development is most appropriate for the purposes of your adaptation.  For example, if test developers are able to build a test simultaneously with other language versions, it may be possible to employ parallel or simultaneous development. If a source language version of an assessment already exists, and developers wish to create a target version, it may be necessary to use successive test adaptation (see description of these development methods in Ercikan & Lyons-Thomas, 2013). 

3. Perform the adaptation of the test. Adapting a test requires not only that translators be fluent in both languages, but also that they be knowledgeable about both the source and target cultures and understand the construct being assessed and the intended use of the tests. Other suggestions for test adaptation include using short sentences, repeating nouns instead of using pronouns on second reference, and avoiding metaphors and passive voice in developing source versions of tests.

4. Evaluate the linguistic equivalence between the source and target versions. Bilingual expert reviewers should evaluate and determine differences in language, content, format, and other appearance-related aspects of items in the two language versions being compared. Test developers can use reviewer feedback to revise the adapted versions of tests. The bilingual experts should re-evaluate the tests after revision. 

5. Document changes made in the adaptation process. For the benefit of future test users, document the changes and the rationale for these changes between the two language versions of tests.

6. Conduct a field test study to examine measurement equivalence. The field test data are used to examine reliability and validity of both language versions of tests, as well as measurement equivalence using analyses based on classical test theory, factor analyses, differential item functioning (DIF) analyses, and comparisons of test characteristic curves (see description of these methods in Ercikan & Lyons-Thomas, 2013). A second round of expert reviews and cognitive analyses can provide further support for comparability of the language versions.

7. Conduct linking studies. Once measurement equivalence has been established, conduct a linking study to create measurement unit equivalence using specially designed studies.
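Step 6 names differential item functioning (DIF) analyses without prescribing a specific procedure. As an illustration only, the sketch below uses the Mantel-Haenszel procedure, one widely used DIF method for items scored right or wrong. It matches test takers from the two language versions on a total score, compares the odds of a correct response within each score level, and reports the common odds ratio together with its value on the delta scale (-2.35 times the natural log of the odds ratio). The function name, the group labels "ref" and "focal", and the small data set are hypothetical and are not taken from the publications cited here.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(responses):
    """Mantel-Haenszel DIF statistics for one right/wrong item.

    responses: iterable of (group, matching_score, correct) tuples, where
    group is "ref" (e.g. source-language version) or "focal" (e.g.
    target-language version), matching_score is the total score used to
    match test takers, and correct is 1 or 0.
    Returns (common_odds_ratio, delta_scale_value).
    """
    # One 2x2 table per score level:
    # [ref correct, ref incorrect, focal correct, focal incorrect]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in responses:
        column = (0 if group == "ref" else 2) + (0 if correct else 1)
        strata[score][column] += 1

    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n

    if numerator == 0 or denominator == 0:
        raise ValueError("too little data to estimate the odds ratio")

    odds_ratio = numerator / denominator
    # Negative delta values indicate the item is harder for the focal group
    # than for matched members of the reference group.
    return odds_ratio, -2.35 * math.log(odds_ratio)


if __name__ == "__main__":
    # Hypothetical field-test data: two score levels, two language groups.
    data = (
        [("ref", 10, 1)] * 12 + [("ref", 10, 0)] * 8
        + [("focal", 10, 1)] * 7 + [("focal", 10, 0)] * 13
        + [("ref", 20, 1)] * 18 + [("ref", 20, 0)] * 2
        + [("focal", 20, 1)] * 14 + [("focal", 20, 0)] * 6
    )
    ratio, delta = mantel_haenszel_dif(data)
    print(f"MH odds ratio: {ratio:.2f}, delta: {delta:.2f}")
```

A statistic like this only flags items that behave differently for matched test takers; deciding whether a flagged item reflects a translation problem or a genuine cultural difference still depends on the expert reviews described in steps 4 and 6.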

Comparing scores from multiple language versions of assessments, implicitly or explicitly, without first establishing comparability compromises the validity and fairness of the interpretations drawn from those assessments. The steps above highlight the complexity of the multidisciplinary effort required to create equivalent test forms across languages and cultures through test adaptation and to verify their comparability using multiple approaches.


Ercikan, K., & Lyons-Thomas, J. (2013). Adapting tests for use in other languages and cultures. In K. Geisinger (Ed.), APA Handbook of Testing and Assessment in Psychology, Volume 3 (pp. 545-569). Washington, DC: American Psychological Association.

Ercikan, K., & Por, H. H. (2020). Comparability in Multilingual and Multicultural Contexts. In A. I. Berman, E. H. Haertel, & J. W. Pellegrino (Eds.), Comparability of Large-Scale Educational Assessments: Issues and Recommendations (pp. 205-225). Washington, DC: National Academy of Education.


About Dr Kadriye Ercikan

Kadriye Ercikan is Vice President of Psychometrics, Statistics and Data Sciences at Educational Testing Service and Professor Emerita at the University of British Columbia. She leads a team of nearly 350 psychometricians, research scientists, data analysts, and measurement technology specialists who work to sustain the foundation of ETS’s educational measurement results to ensure their reliability, validity and fairness for all test takers and advance the practice and science of measurement.  Ercikan earned a PhD in research and evaluation methods in education from Stanford University. Her research focuses on designing and validating assessments of complex thinking, the assessment of linguistic minorities, and fairness and validity issues in cross-cultural and international assessments.

Ercikan is a Fellow of the International Academy of Education.  Her research has resulted in six books, four special issues of refereed journals and over 100 publications. One co-edited book, Validating Score Meaning in the Next Generation of Assessments, was selected for publication as part of the National Council on Measurement in Education (NCME) book series. She was also awarded the AERA Division D Significant Contributions to Educational Measurement and Research Methodology recognition for another co-edited volume, Generalizing from Educational Research: Beyond Qualitative and Quantitative Polarization, and received an Early Career Award from the University of British Columbia.


About ETS

At ETS, we advance quality and equity in education for people worldwide by creating assessments based on rigorous research. ETS serves individuals, educational institutions and government agencies by providing customized solutions for teacher certification, English language learning, and elementary, secondary and postsecondary education, and by conducting education research, analysis and policy studies. Founded as a nonprofit in 1947, ETS develops, administers and scores more than 50 million tests annually — including the TOEFL® and TOEIC® tests, the GRE® tests and The Praxis Series® assessments — in more than 180 countries, at over 9,000 locations worldwide.
