Everything you want (and need) to know about producing and validating tests and questionnaires in multiple languages

by Pisana Ferrari – cApStAn Ambassador to the Global Village

Open Bar – Live Q&A – Thursday, October 28, 2021

The first in our series of cApStAn Open Bar – Live Q&A webinars took place on October 28. Our new format, a very short PowerPoint (five slides sent to participants in advance) followed by a long Q&A session, appears to have been well received, judging by the good turnout and excellent questions. It was a refreshing change from previous events to have the time to reply at leisure. Many in the audience expressed concern about test items that can be sensitive in certain cultural contexts, and there was interest in where a “cultural suitability review” could fit into a translation and adaptation workflow. Other questions concerned the organization of translation and adaptation workflows, e.g., where to place human experts and where to place machine translation in the loop, and whether there should be different workflows depending on the domains being assessed. All in all, it was a rewarding experience, which we plan to repeat. Any suggestions on how we can improve the format are welcome. Stay tuned for our next webinar in the series!

From the transcript of the Live Q&A

Q.1 What recommendations might you offer for including or adapting questions that may be perceived as sensitive in some cultural contexts, e.g., age, income ranges, gender roles? Do you exclude these types of questions, or do you consider them adaptable?

Steve: The short answer is no, don’t exclude, even if a question is sensitive. Find out about adaptations that have worked before. Find a way in which you can say it without hurting sensibilities. For example, in some countries it is difficult to ask people how many children they have. In Uganda, mothers will think in terms of number of pregnancies, whether the children were actually born or not, so you have to ask in a certain way. There are countries where respondents will hesitate to respond to this question, as they worry that this might bring bad luck, that something might happen to their children. In general, there is a way around the problem, but not always. In some countries you cannot ask about income, for example; it is just not done. If you want an income bracket, it is very difficult. Age can also be a sensitive issue in some cultures. The first approach we suggest is to see if that construct has been tapped before in that culture, how it has been adapted, and whether it worked. There are also other issues related to sensitive questions. In some languages you necessarily have “he” or “she”: in Polish, for example, the polite form is in the third person. A question would have to read “Does Mr have…”? “Does Mrs have…”? So, if you ask a question like “Do you live with a partner?” you would have to ask “Does Mr live with a female partner?” or “Does Mrs live with a male partner?”, because the words will have an ending that shows the gender. If you ask these questions that way, you are missing out on an entire demographic segment, which, I assure you, exists in Poland. So, you have to think then, what do I do? It is offensive to conservatives in Poland to ask differently, but then it is a social reality that there are different gender identities. So, how do you ask the question? With slashes? Those could be considered offensive, too. You have to find a way out of the stereotypes that cultures sometimes impose, but don’t rule it out. The constructive attitude is to try. It may not be perfect, but there are creative ways that may come from the linguists or from what has been done successfully in the past. More often, however, you really have to think of the target culture and you may have to change your question.

Q.2 Would you say that test content made up of short questions, or even pairs of words or short statements, lends itself better to machine translation and light post-editing?

Steve: Actually, no. The semantic match between two words in two languages is never perfect; there is always a “distance” between words. Let’s say you have pairs of words (in a “Which word is more like you?” format) and these words somehow convey a state of mind, an emotion. When you translate such a word into a different language there can be a meaning shift: the semantic coverage is not identical. The perfect transfer of form and meaning into another language is not achievable; there is always a distance in translation. And if you have a short entity to translate, the risk is that this distance will have more impact. A longer sentence provides some context for whatever word you choose in translation. Using machine translation just because the sentences are shorter is very risky.

Q.3 Would you recommend a different translation workflow for the stimulus text, the actual questions and the reports that go to the candidates?

Steve: The answer is yes, even if it speaks a bit against what we normally do (project managers like to simplify their work: one project, one workflow!). But if you have a stimulus that is two pages long, where only two paragraphs are really important to answer the question, single them out, and go for a machine translation and post editing for the unimportant parts of the stimulus text, and have your expert only concentrate on the two paragraphs that count. That will save time. Likewise, you have your coding guides and scoring instructions, these need to be accurate and precise but they don’t need to have the same nuance, whereas in, say, a psychological test or an assessment of non-cognitive skills, every single bit of meaning that is between the lines, between the words, is important, and no machine can render that. So yes, project management that involves differential workflows depending on the content and target audience is a recommendation that would be advantageous for the client, anyway. There is more work for LSPs but clients would benefit as it would go faster and be cheaper and the quality would be where it matters most.
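To make the idea of differential workflows concrete, here is a minimal sketch in Python of routing each piece of content to a workflow. The workflow labels, the `Segment` fields and the default of sending everything else to a human expert are assumptions made for illustration; they do not describe an actual cApStAn tool.

```python
from dataclasses import dataclass

# Hypothetical workflow labels, for illustration only.
EXPERT_TRANSLATION = "expert translation and adaptation"
MT_POST_EDIT = "machine translation + post-editing"


@dataclass
class Segment:
    text: str
    content_type: str             # e.g. "stimulus", "question", "scoring_guide"
    key_for_answer: bool = False  # True if the passage is needed to answer a question


def choose_workflow(segment: Segment) -> str:
    """Route one segment, following the rule of thumb described above."""
    # Only stimulus passages that are NOT needed to answer the question
    # are sent to machine translation and post-editing.
    if segment.content_type == "stimulus" and not segment.key_for_answer:
        return MT_POST_EDIT
    # Assumption: everything else defaults to a human expert.
    return EXPERT_TRANSLATION


segments = [
    Segment("Background paragraph of the stimulus.", "stimulus"),
    Segment("Paragraph that contains the answer.", "stimulus", key_for_answer=True),
    Segment("Which statement best describes the author's view?", "question"),
]
routing = [choose_workflow(s) for s in segments]
```

In this sketch only the first segment is routed to machine translation and post-editing; the two segments that carry the meaning that matters stay with the expert.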

Q.4 Where would you recommend inserting the “cultural suitability review” in a translation and adaptation workflow?

Steve: This is a complex question, because there are organisations that have been administering, say, credentialing exams for decades, successfully, without paying too much attention to the demographic groups that could potentially be disadvantaged. In the society of 2021, with #MeToo and #BLM, a broader acceptance of non-binary gender forms, and more attention to language that can be perceived as discriminatory, it has become essential for all organisations to run their exams through this lens. With existing exams that, e.g., have robust statistics and high predictability of, say, success in the workplace, and that have been used to analyze trends across time, you may not want to make radical changes. So, perhaps you could organize an independent multicultural audit to see whether there are elements, aspects and components in there that could or should be amended, or worded differently, to make sure that no demographic group is put at an advantage or disadvantage. Completely new instruments are now being developed, with video content, scenarios, simulations, a mix of gaming and verbal reasoning, pairs of statements, etc. Our recommendation is to have the test developers submit a “mature” draft for a multicultural review, with special attention to the diversity, equity, inclusion and bias reduction (DEI-BR) filters that can be applied: see what can be tweaked, what can be reworded, or what alternative forms can be introduced, to avoid potential biases and to ensure that the test is representative of all the populations interviewed. How inclusive should the scenarios be? Should there be, for example, a trans teacher in addition to male and female among the characters of the scenario? Should you also have different ethnic groups represented, and in what proportion? The proportion of the US? Or of Europe, India, Africa? Should there be different scenarios, should they be adapted depending on the audiences they are addressed to? Isn’t that expensive? Those types of considerations would come out of the discussions between multicultural reviewers and test developers.

Q.5 One cannot expect from your regular variety of LSPs that they know all the pitfalls in questionnaire translation, but when one works with survey experts rather than language specialists, one can’t use translation technology and one misses out on the other aspect. Is there a way to blend the expertise of both groups to a sort of automated workflow where the two don’t step on each other’s toes?

Steve: That’s the quest for the Holy Grail! That’s where we want to go! With automated workflows, in technology-rich environments, it is important to determine two things. First, where should the human in the loop be integrated in the workflow? Where is the best place, the most efficient? Secondly, you need to compartmentalize instructions: i.e., have different humans with different backgrounds and expertise, with precise instructions that match the instruments they are reviewing and match their specific expertise, knowing that there are other humans at other points in the loop. So, if you have an industrial/organizational (I-O) psychologist looking at the adaptation of a scale based on the Big Five questionnaire scales, and she knows that “openness” functions a bit differently in collectivistic cultures, in the Far East, whereas the other traits are more stable, it is really important to have that knowledge for those items. It is not important to have those subject matter experts comment on register, on syntax, or on phraseology. We have technology that will harmonise forms of address, punctuation, etc., or that will check for consistency with the term base. So, the instructions should be separate. You don’t just add reviews to reviews, with different people with different skills adding their comments, and not all comments should trigger a direct intervention on the production of translated content. It should much rather be a sequential intervention where there is a zone, a field, for each intervening party, and there is one supervisor who takes into account the different interventions and says whether they should be implemented, or just left there as a note.
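As an illustration of the “one zone per intervening party, one supervisor” idea, here is a minimal data-structure sketch in Python. The class names, role names and fields are hypothetical and chosen only for this example; the source does not describe a specific implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ReviewNote:
    comment: str
    implement: Optional[bool] = None  # None = not yet decided by the supervisor


@dataclass
class TranslatedItem:
    item_id: str
    target_text: str
    # One "zone" per intervening party, keyed by role (role names are hypothetical).
    zones: Dict[str, List[ReviewNote]] = field(default_factory=dict)

    def add_note(self, role: str, comment: str) -> ReviewNote:
        """A reviewer writes only in the zone that matches their expertise."""
        note = ReviewNote(comment)
        self.zones.setdefault(role, []).append(note)
        return note

    def pending(self) -> List[ReviewNote]:
        """Notes the supervisor has not yet ruled on."""
        return [n for notes in self.zones.values() for n in notes if n.implement is None]


def supervise(note: ReviewNote, implement: bool) -> None:
    """The single supervisor decides: implement the change, or leave it as a note."""
    note.implement = implement


item = TranslatedItem("item_12", "draft target text")
n1 = item.add_note("io_psychologist", "Check how the openness item functions in this culture.")
n2 = item.add_note("linguist", "Register is slightly too formal.")
supervise(n1, True)    # marked for implementation
supervise(n2, False)   # kept as a note only
```

The point of the design is that reviewer comments never change the translation directly: they sit in their own zone until the supervisor rules on them.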

Q.6 Method: Should there be the same translation and adaptation process for a talent management assessment and a medical certification exam, or different approaches?

Steve: The process does not depend so much on the field of expertise. It depends more on the type of instrument and on the type of output you want. So, whether you are reviewing a nuclear physics assessment, as we have done in the past, or a test for a nursing certification, or for computer programming skills, I don’t think you should have a different approach in mind. You determine your steps based on the expertise that is being analysed, on the timeline and on the budget constraints, but you don’t have a different process depending on the domain. I may be wrong; I am only speaking from experience.

Q.7 Could you tell us a bit more about the glossary format functionality? Are defined terms flagged through the survey questions? Or are separate guidelines provided for separate audiences to review all the glossary terms first?

Steve: We have not implemented separate guidelines for term bases, but there may be contexts where that could be relevant. For our glossaries we use the standard TBX (TermBase eXchange) format. The first step is to “extract” from the text recurring terms, terms that may be ambiguous but are important, technical terms and terms that occur very often. This is a semi-automatic process, which can of course yield false positives. One of our in-house consultants will typically take a look and decide which of the terms go in the glossary. We then propose a translation for these terms and have it validated by the survey developer or local partner, or by one of our domain experts if no local partners are available. So, you have the lemma of each term, its various possible forms and its proposed translation (glossaries are bilingual), and you get it validated. Once you have done that, you prepare your translation project in a computer-assisted translation tool (a CAT tool), which is simply an efficiency tool and has nothing to do with machine translation. CAT tools help increase translator output and consistency by leveraging assets such as term bases, and they generate translation memories as the work progresses. The translation memory will see to it that the existing translation of any sentence or scale pops up when there is a matching segment, so that the translator can reuse, recycle or edit it. The term base is added to the translation project, which means in practical terms that as you start translating, any occurrence of a term listed in the term base will be highlighted and the translation that is in the glossary will be suggested. You can accept or reject it, depending on the context.
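The two lookups described above, term base suggestions and translation memory matches, can be sketched in a few lines of Python. This is a simplified illustration, not how any particular CAT tool works: the glossary entries, the English-to-French example translations and the function names are assumptions, and the translation memory here finds exact matches only.

```python
import re

# Hypothetical bilingual glossary (English -> French): lemma, known forms,
# and the validated translation. Entries are illustrative only.
TERM_BASE = [
    {"lemma": "respondent", "forms": ["respondent", "respondents"], "target": "répondant"},
    {"lemma": "income bracket", "forms": ["income bracket", "income brackets"],
     "target": "tranche de revenu"},
]


def find_terms(segment, term_base):
    """Return (matched form, suggested translation) pairs found in a source segment."""
    hits = []
    for entry in term_base:
        for form in entry["forms"]:
            # Whole-word, case-insensitive match of each listed form.
            pattern = r"\b" + re.escape(form) + r"\b"
            if re.search(pattern, segment, flags=re.IGNORECASE):
                hits.append((form, entry["target"]))
    return hits


def normalize(segment):
    """Lowercase and collapse whitespace so trivially different segments match."""
    return " ".join(segment.lower().split())


def tm_lookup(segment, memory):
    """Exact-match translation memory lookup; returns None when nothing matches."""
    return memory.get(normalize(segment))


# The memory grows as the translator confirms segments.
memory = {normalize("Strongly agree"): "Tout à fait d'accord"}

suggestions = find_terms("How many respondents reported an income bracket?", TERM_BASE)
reused = tm_lookup("strongly   agree", memory)
```

In both cases the tool only suggests: the translator still accepts, edits or rejects the glossary term or the recycled segment, depending on the context.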

Q.8 What guidance would you have when survey questions contain items about specific behaviours?

Steve: We were confronted with that problem when we were translating a survey about depression. Symptoms of depression differ across cultures, and they can be recognized in different ways in different countries. This makes it difficult. The same applies to behaviour. For example, there are ways of expressing friendship that could be considered ambiguous in some cultures (e.g., walking hand in hand with a friend). So, in some cultures you can’t talk about that. There are guidelines for adapting such items, and they may differ per culture. If you try to develop generic adaptation guidelines for an instrument that needs to be translated into several languages, for several cultures, at some point they will fail you. There is also the concept of “social desirability” to be taken into account, which is much stronger in some cultures than in others. There are countries, e.g., Indonesia, where nobody would select “totally disagree” on a Likert scale, and very few would pick “disagree” if, for example, the interviewer is an older, educated person or there is a sense that it could be seen as an indirect lack of respect. So, you need to build the scale differently, e.g., with a numbered scale (1-5), or word the question differently to get more meaningful results. The one-size-fits-all approach does not work well for behaviours. You need specificity. That is why we do not work with linguists who are expats, only with people who actually live in the country and would know what to expect.

About the speaker

Steve Dept is one of cApStAn’s founders. He received his education in English, Dutch, French and German. He is essentially an autodidact and a field practitioner. In 1998, Steve was sought out to organise the translation verification of PISA 2000 instruments and, since cApStAn’s creation in 2000, Steve has supervised linguistic quality assurance in PISA and in over 35 international surveys and polls. His translatability assessment methodology is applied in small and large multilingual projects in both the private and the public sector. Steve is the driving force behind cApStAn’s adaptive strategy.