The Role of Linguistic Quality Assurance after Field Trialling

Re-edited by Andrea Ferrari, Co-founder at cApStAn, from a presentation made at CSDI Washington 2012 by Steve Dept, Andrea Ferrari and Juliette Mendelovits

The design of major international comparative studies such as OECD’s PISA (Programme for International Student Assessment) or IEA’s TIMSS (Trends in International Mathematics and Science Study) includes a Field Trial (FT) phase carried out on a smaller scale before the Main Survey (MS) phase, i.e. the actual data collection on full samples of the assessed populations.

As regards the development of assessment instruments, the FT data collection is followed by an analysis of results, which informs the selection and revision of the instruments that will be used in the MS. (Note: the FT phase serves other purposes as well, not covered here.)

For the translation/adaptation of these instruments, administering a large-scale survey without a sophisticated localization design is no longer an option. Linguistic quality assurance (LQA) and linguistic quality control (LQC) processes are therefore implemented at both the FT and MS phases, with a view to maximizing the comparability of the collected data.

This article focuses on the processes implemented after the FT phase, which may be less well known.

Overview of LQA and LQC processes implemented at FT phase

Before turning to the processes implemented at MS phase, here is a brief overview of typical LQA and LQC processes implemented in the run-up to the FT, taking the example of OECD/PISA:

  • Careful linguistic construction of the source materials to identify and minimize translatability issues.
  • Provision to participating countries[1] of the source version(s), general and item-specific translation/adaptation guidelines, a translation/adaptation training workshop, manuals, and helpdesk.
  • Centralized linguistic quality control follows. Specially recruited and trained verifiers check both the correspondence of the target version to the source version and the fluency/correctness of the target version, striving for an optimal balance between these two goals. They also check whether the item-specific guidelines have been followed. Verifiers implement their suggested edits in a trackable mode and document (explain and justify) them. In the next stage, the National Centres review the verified versions and may make further changes. These versions are then submitted for a final check.

Brief description of the statistical analysis of FT results

The purpose of the international statistical analysis of FT data is to evaluate the quality of the field trial items and to model the scales and subscales on which the main survey will be reported.

In the analysis, several item and scale statistics are considered. At the item level, classical item statistics (“itanals”) are produced. These analyses allow an examination of each item’s discrimination, fit, ability ordering and – in the case of multiple-choice items – point-biserial correlation.

  • The discrimination statistic indicates whether the individual item is discriminating between students in a way similar to other items in the assessment.
  • The fit statistics, similarly, show whether an item appropriately contributes to spreading the ability estimates of the students across the scale.
  • The mean ability statistic calculates, for students in each score category, their mean ability across the whole assessment. It shows whether students are responding to the item as expected.
  • For multiple-choice items, the point-biserial correlation is the correlation between a response category and the total score on the assessment. Correct responses should have positive correlations with the total score, incorrect responses negative correlations.
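To illustrate the last statistic, the point-biserial correlation for a response category can be computed from nothing more than a binary response indicator and the total scores. The sketch below uses only the Python standard library; the function name and toy data are invented for the example and do not come from any PISA toolset.

```python
from statistics import mean, pstdev

def point_biserial(flags, totals):
    """Correlation between a binary response indicator (1 = student gave
    this response category, 0 = did not) and the total assessment score."""
    in_cat = [t for f, t in zip(flags, totals) if f == 1]
    out_cat = [t for f, t in zip(flags, totals) if f == 0]
    p = len(in_cat) / len(totals)            # proportion in the category
    return (mean(in_cat) - mean(out_cat)) * (p * (1 - p)) ** 0.5 / pstdev(totals)

# Invented toy data: students choosing the correct option score higher
# overall, so the correct-response category yields a positive correlation.
flags  = [1, 1, 1, 0, 0, 1, 0, 0]   # 1 = chose the correct option
totals = [9, 8, 7, 4, 3, 6, 5, 2]   # total score on the assessment
r = point_biserial(flags, totals)
```

For an incorrect-response category (flags inverted), the same function returns a negative value, matching the expectation stated above.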

A further set of key statistics in the item selection process is generated from Differential Item Functioning (DIF) analysis, which compares the performance of subgroups of interest on each item. The DIF analysis shows when an item is harder or easier than expected for a particular group (given that group’s overall performance). In PISA, DIF analyses are conducted on the FT data for gender, country and language.
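The underlying logic ("harder or easier than expected for a group, given overall performance") can be shown with a deliberately simplified sketch: match students on a total-score band, then compare each group's observed correctness on the item with the expected correctness of all students in that band. This is an illustration only; operational DIF analyses rely on IRT-based methods, and all names and data below are invented.

```python
from collections import defaultdict

def dif_by_group(rows):
    """rows: (group, score_band, item_correct) triples, item_correct in {0, 1}.
    Returns, per group, the mean gap between the group's observed correctness
    and the expected correctness of all students in the same total-score band:
    positive = item easier than expected for the group, negative = harder."""
    by_band = defaultdict(list)
    for _, band, ok in rows:
        by_band[band].append(ok)
    expected = {b: sum(v) / len(v) for b, v in by_band.items()}

    diffs = defaultdict(list)
    for grp, band, ok in rows:
        diffs[grp].append(ok - expected[band])
    return {g: sum(d) / len(d) for g, d in diffs.items()}

# Invented toy data: group A outperforms its score-band peers on this item.
rows = [("A", "low", 1), ("A", "low", 1), ("B", "low", 0), ("B", "low", 0),
        ("A", "high", 1), ("B", "high", 1)]
dif = dif_by_group(rows)
```

A positive value for one group and a negative value for another on the same item is the kind of signal that, in a real analysis, would flag the item for review in those national versions.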

How the analysis of FT results informs the selection and revision of MS instruments

The FT data are used to investigate item performance in order to optimize the MS item pool, taking into account sometimes competing constraints. Alongside the individual item statistics, test developers also consider test-level factors, including the fit of the item set to the framework (meeting targeted percentages across its categories) and the range of difficulty of the items.

In addition, the national centres may be provided with information about how items have performed in their country. Items that have behaved anomalously on any of the criteria outlined above are flagged as “dodgy” items, and countries are asked to reflect on the data and explain, if possible, why such results may have occurred. Some anomalies, such as poor discrimination, may appear across several countries with a common language, indicating an unforeseen cultural or linguistic problem.

The process of updating national versions

Once the MS source versions are finalized, the national versions of assessment instruments must be updated, and a reliable process must include LQA and LQC. For example, in a decentralized model:

  • National Centres are provided with guidelines for updating their FT versions: they must echo changes made in the source version (unless a change does not apply to their version) and may make additional national changes in light of the FT results, but they are discouraged from making “cosmetic” or “preferential” changes (in line with the classic advice “If it isn’t broken, don’t fix it”).
  • National Centres update their FT versions, creating MS draft versions which are submitted for verification. (Or, even safer, they submit requests for updates to their FT versions.) 
  • Verifiers are provided with dodgy item reports and translation- or adaptation-related feedback from National Centres. They are asked to examine whether the solutions proposed by countries address the issues identified; if so, whether the solutions are implemented consistently and correctly; if not, to propose alternative corrective action. They are also asked to report on other potential issues: for a dodgy item, the source of the problem may be hidden in a stimulus passage relatively far from the item itself.

Technological and organisational features that allow for an efficient MS LQA and LQC process

  • CAT tools make it possible to lock segments for which the source version is unchanged relative to the FT. Neither the National Centres nor the verifiers can change these segments by accident: each segment must be deliberately unlocked before a change can be made.
  • National Centres and verifiers are informed of the FT analysis results: the reports on dodgy items are used to make informed decisions about national FT-to-MS changes.
  • With automated detection and marking of changes in files going through the workflow, verifiers do not need to check for possible undocumented changes made by participating countries. This enables a cost-efficient, “focussed” MS verification: verifiers are instructed not to re-verify the materials in full, but to focus on:

– Changes made to the source version (in principle these should be echoed across all national versions);

– National changes (in principle these should be justified in light of the “dodgy items” report for each national version);

– Selected passages e.g. in case of overall “dodginess” across a group of countries sharing a common language;

– Consistency issues that may arise elsewhere in the instruments as a result of the above changes.

  • Thanks to this same feature (automated detection and marking of changes), countries are reassured that they retain control over their versions, as they can see all changes made by verifiers.
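The segment-locking logic described above can be sketched as a comparison of FT and MS source segments. Real CAT tools operate on XLIFF-like structures rather than plain dictionaries, and the function name, segment ids, and texts below are invented for illustration.

```python
def segment_status(ft_source, ms_source):
    """Compare FT and MS source segments (dicts: segment id -> source text).
    Unchanged segments are locked (the FT translation is kept as is);
    changed or new segments are routed to focused verification."""
    status = {}
    for seg_id, ms_text in ms_source.items():
        if ft_source.get(seg_id) == ms_text:
            status[seg_id] = "locked"   # no accidental edits possible
        else:
            status[seg_id] = "verify"   # source changed or new: echo and verify
    return status

# Invented example: one unchanged, one edited, and one new segment.
ft = {"s1": "Read the text below.", "s2": "Why did Mary leave?"}
ms = {"s1": "Read the text below.",
      "s2": "Why did Mary go home?",
      "s3": "Explain your answer."}
status = segment_status(ft, ms)
```

Only the segments flagged "verify" would reach the verifier's desk, which is what makes the focused MS verification described above cost-efficient.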

Conclusions and open challenges

  • Compared with the early years of paper-and-pencil PISA, the technological environment has made real breakthroughs possible in optimizing LQA and LQC processes in the Main Survey phase.
  • An open challenge is to make more and better use of the statistical analysis of FT results (and, in this respect, of MS results as well) for the selection, training and monitoring of verifiers: DIF reports and dodgy item analyses provide a wealth of data pointing to possible weaknesses and areas for improvement in verification practice.

[1] Many international studies apply a decentralized translation/adaptation model: the participating entities (countries or regions) produce or adapt their instruments, whereas guidance and quality assurance (before, during, and after) are the remit of the contractors implementing the project.

How can cApStAn help

As an independent linguistic quality control agency, cApStAn can intervene at multiple stages of your multilingual projects: from source optimization to training translation teams, from documenting adaptations to validating test and questionnaire translations, from analyzing language-induced item bias to managing translation memories across survey cycles.

Contact us here to discuss your requirements