Appendix C — Interpretability in Human Translation Workflows

C.1 Machine Translation Post-editing for Typologically Diverse Languages

C.1.1 Subject Information

During the setup of our experiment, one translator declined to carry out the main task after the warmup phase, and we chose to replace another. Both translators were working in the English-Italian direction and were found to make heavy use of copy-pasting during the warmup stage, suggesting an incorrect use of the platform in light of our guidelines. The two translators, whom we identify as T\(_2\) and T\(_3\) for Italian, were replaced by T\(_5\) and T\(_4\) respectively. Table C.1 reports the final selection of translators for all languages, together with the information collected by means of the pre-task questionnaire.

Lang. Subject Gender Age Degree Position En Level YoE PE YoE % PE
Arabic T\(_1\) M 35-44 BA Freelancer C2 > 15 2-5 20%-40%
T\(_2\) M 25-34 BA Employed C2 5-10 2-5 60%-80%
T\(_3\) M 25-34 MA Freelancer C1 5-10 < 2 20%-40%
Dutch T\(_1\) M 25-34 MA Freelancer C2 5-10 5-10 60%-80%
T\(_2\) F 35-44 MA Freelancer C1 10-15 5-10 40%-60%
T\(_3\) F 25-34 MA Freelancer C2 2-5 2-5 20%-40%
Italian T\(_1\) F 25-34 MA Employed C1 5-10 5-10 20%-40%
T\(_5\) F 25-34 MA Freelancer C1 2-5 2-5 40%-60%
T\(_4\) F 35-44 BA Freelancer C2 10-15 5-10 > 80%
Turkish T\(_1\) F 25-34 BA Freelancer C2 5-10 2-5 < 20%
T\(_2\) F 25-34 BA Freelancer C1 5-10 5-10 < 20%
T\(_3\) M 25-34 High school Freelancer C2 10-15 < 2 < 20%
Ukrainian T\(_1\) F 35-44 MA Employed C1 5-10 5-10 20%-40%
T\(_2\) M 35-44 MA Employed C1 10-15 10-15 20%-40%
T\(_3\) M 35-44 High school Employed B2 2-5 2-5 20%-40%
Vietnamese T\(_1\) F 25-34 MA Employed C2 10-15 5-10 40%-60%
T\(_2\) F 25-34 BA Freelancer C1 5-10 < 2 20%-40%
T\(_3\) F 25-34 MA Employed C1 2-5 < 2 < 20%
Table C.1: Subject information for DivEMT. The last three columns report, respectively, the number of years of professional experience as a translator (YoE), the number of years of experience with MT post-editing (PE YoE) and the percentage of work assignments requiring post-editing in the last 12 months (% PE) for each subject.

C.1.2 Translation Guidelines

An extract of the translation guidelines provided to the translators follows. The full guidelines are provided in the additional materials.

Fill in the pre-task questionnaire before starting the project. In this experiment, your goal is to complete the translation of multiple files in one of two possible translation settings. Please complete the tasks on your own, even if you know another translator who might be working on this project. The translation setting alternates between texts, with each text requiring a single translation in the assigned setting. The two translation settings are:

  1. Translation from scratch. Only the source sentence is provided; you are to write the translation from scratch.
  2. Post-editing. The source sentence is provided alongside a translation produced by an MT system. You are to post-edit this MT output until you are satisfied with the final translation (the required quality is publishable quality). If the MT output is too time-consuming to fix, you can delete it and start from scratch. However, please do not systematically delete the provided MT output to give your own translation.

Important: All editing MUST happen in the provided PET interface: that is, working in other editors and copy-pasting the text back to PET is NOT ALLOWED, because it invalidates the experiment. This is easy to spot in the log data, so please avoid doing this. Complete the translation of all files sequentially, i.e. in the order presented in the tool. DO NOT SKIP files at your own convenience. Make sure that ALL files are translated when you deliver the tasks.

The aim is to produce publishable, professional-quality translations in both translation settings. Thus, please translate to the best of your ability. You can return to the files and self-review as many times as you deem necessary. Important: The time invested to translate is recorded while the active unit (sentence) is in editing mode (yellow background). Therefore:

  • Only start translating when you are in editing mode (yellow background). In other words, do not start thinking about how you will translate a sentence when the active unit is not yet in editing mode (green or red background).

  • Do not leave a unit in editing mode (yellow background) while you do something else. If you need to do something unrelated in the middle of a translation, exit editing mode and return to it when you are ready to resume translating.

  • First you will translate a warmup task, and then the main task. While translating each file, you can consult the source text by looking up the URL in the Excel files that we have sent for reference.

In order to find the correct terminology for the translation you can consult any source on the Internet. Important: However, it is NOT ALLOWED to use any MT engine to find terms or alternatives to translations (such as Google Translate, DeepL, MS Translator or any MT engine available in your language). Using MT engines invalidates the experiment and will be detected in the log data. Please fill in the post-task questionnaire ONLY ONCE after completing all the translation tasks (both warmup and main tasks).

C.1.3 Details on Document Selection and Preprocessing

Document selection Table C.2 presents the distribution of selected documents from the Flores-101 devtest split based on their domain and the number of sentences that compose them. The first goal of the selection process was to preserve a rough balance between the three categories while favoring 4- and 5-sentence documents, which are faster to edit in PET (no need to frequently close and reopen an editing window). Another objective of the selection was to minimize the chance of translators finding the translated version of the Wikipedia article from which a document was taken and copying from it, despite our guidelines. We thus scraped the articles and assessed the number of available translations. Among the selected documents, only a small subset has translations in other languages (see Figure C.1 top; an article can have multiple languages), mainly Hebrew (14), Chinese (10), Spanish (7) and German (5). Considering the total number of translations for every article (Figure C.1 bottom), we see that roughly 75% of them (79 docs) have no translations. We consider this satisfactory evidence that copying is unlikely to be widespread, and we follow up on this evaluation by also ensuring that no repeated copy-paste patterns are present in the keylogs after the warmup stage.
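For reference, the translation-availability check can be reproduced through the MediaWiki API exposed by all Wikipedia-family projects. The following is a minimal sketch, not our exact scraping code: the endpoint shown is for English Wikinews, and the helper name and title list are illustrative.

```python
import requests

API_URL = "https://en.wikinews.org/w/api.php"  # same API on all Wikipedia-family wikis

def get_langlinks(title: str) -> list[str]:
    """Return the language codes in which the article `title` is also available."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "500",
        "format": "json",
    }
    pages = requests.get(API_URL, params=params, timeout=30).json()["query"]["pages"]
    page = next(iter(pages.values()))  # single title -> single page entry
    return [link["lang"] for link in page.get("langlinks", [])]

# Documents whose source article has no translations are safer against copying:
# n_translations = {title: len(get_langlinks(title)) for title in selected_titles}
```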

Type WN WV WB # Sent. # Words
3S 11 13 11 105 2168
4S 14 8 13 140 3214
5S 12 13 12 185 3826
Tot. 37 34 36 450 9626
Table C.2: Distribution of the selected DivEMT documents across sizes and Wikipedia categories. A Type value of NS denotes documents composed of N contiguous sentences; WN, WV and WB stand respectively for Wikinews, Wikivoyage and Wikibooks.


Figure C.1: Top: Distribution of the availability of documents selected for DivEMT in languages other than English. Bottom: Number of selected documents per number of available Wikipedia translations.

Filtering of Outliers For our analysis of Section 8.4, we only use sentences with an editing time lower than 45 minutes, a threshold selected heuristically as high enough to allow for extensive searching and thinking. In the following, we list the identifiers of the sentences filtered out during this process: for example, 54.1 denotes the first sentence of document 54, with item_id equal to flores101-main-541 in the dataset. Note that outlier sentences were found in only 2 of the 6 languages and were all different, indicating no systematic issues in the sample: ARA: 54.1, 100.3; VIE: 3.1, 3.2, 24.3, 28.4, 33.1, 33.2, 40.3, 41.2, 50.3, 100.1, 102.1, 106.1, 107.2, 107.4. These 17 sentences were removed for all modalities and languages in the analysis of Section 8.4 to preserve the validity of our comparison, amounting to a loss of roughly 4% of the total available data, a tolerable amount for our analysis.
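As an illustration, this filtering step can be expressed in a few lines of pandas, assuming the dataset is loaded with the edit_time and item_id fields described below (the export file name is hypothetical):

```python
import pandas as pd

MAX_EDIT_TIME = 45 * 60  # 45-minute threshold, in seconds

df = pd.read_json("divemt_main.jsonl", lines=True)  # hypothetical export path

# Collect identifiers (e.g. flores101-main-541) of sentences exceeding the
# threshold in any language/modality, then drop them everywhere to keep the
# cross-modality comparison aligned.
outlier_ids = set(df.loc[df["edit_time"] > MAX_EDIT_TIME, "item_id"])
df_filtered = df[~df["item_id"].isin(outlier_ids)]
```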

Fields Description Table C.3 presents the set of fields collected for every entry of the DivEMT dataset. The fields related to keystrokes, times, pauses, annotations and visit order were extracted from the event logs of PET .per files, while edit information and other MT quality metrics were computed at a later stage with the help of widely-used libraries.

Field name Description
unit_id, flores_id, subject_id, task_type Identifiers for the item, respective FLORES-101 sentence, translator and translation mode.
src_text The original source sentence extracted from Wikinews, Wikibooks or Wikivoyage.
mt_text MT output sentence before post-editing, present only if task_type is 'pe'.
tgt_text Final sentence produced by the translator (either from scratch or by post-editing mt_text).
aligned_edit Aligned visual representation of the machine translation and its post-edit, with edit operations.
edit_time Total editing time for the translation in seconds.
k_letter, k_digit, k_white, k_symbol, k_nav Number of keystrokes for various key types (letters, digits, whitespace, symbols/punctuation, navigation keys) during the translation.
k_erease, k_copy, k_paste, k_cut, k_do Number of keystrokes for erase (backspace, delete), copy, paste, cut and Enter actions during the translation.
k_total Total number of all keystroke categories during the translation.
n_pause_geq_N, len_pause_geq_N Number and total length of pauses longer than N milliseconds (N = 300, 1000) during the translation.
num_annotations Number of times the translator focused the target sentence textbox during the session.
n_insert, n_delete, n_substitute, n_shift, tot_shifted_words, tot_edits, hter Granular editing metrics and overall HTER computed using the Tercom library.
cer Character-level HTER score computed between the MT and post-edited outputs.
bleu, chrf Sentence-level BLEU and ChrF scores between MT and post-edited fields computed using the SacreBLEU library with default parameters.
time_per_char, key_per_char, words_per_hour, words_per_minute Edit time per source character (in seconds), keystrokes per character needed to perform the translation, and number of source words translated or post-edited per hour/minute.
subject_visit_order Id denoting the order in which the translator accessed documents in the interface.
Table C.3: Description of the main fields associated with every DivEMT data entry. An entry corresponds to a translation in a specific modality (HT, PE\(_1\) or PE\(_2\)) for one of the six target languages.
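For instance, the bleu and chrf fields can be reproduced with SacreBLEU's sentence-level helpers under default parameters. The sketch below uses an illustrative MT/post-edit pair, and shows SacreBLEU's TER as a stand-in for the Tercom-based HTER:

```python
import sacrebleu

mt = "The Internet combines elements of mass communication."  # illustrative MT output
pe = "The Internet combines elements of both mass and interpersonal communication."

# Sentence-level metrics with default parameters, treating the post-edit as reference
bleu = sacrebleu.sentence_bleu(mt, [pe]).score
chrf = sacrebleu.sentence_chrf(mt, [pe]).score
ter = sacrebleu.sentence_ter(mt, [pe]).score  # HTER analogue when the reference is a post-edit
```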

C.1.4 Other Measurements

Automatic Evaluation of NMT Systems The selection of systems used in this study was driven by a broader evaluation procedure covering more models, metrics and target languages. Table C.4 presents the overall results of our evaluation. We use HuggingFace’s transformers library (Wolf et al., 2020) for all neural models, using the default decoding settings without further fine-tuning. All metrics were computed using the default settings of SacreBLEU (Post, 2018) and COMET (Rei et al., 2020).
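A hedged sketch of the evaluation loop follows, assuming system outputs and references are available as parallel lists and using the Unbabel COMET >= 2.0 API; the checkpoint name matches the Rei et al. (2020) model but should be treated as indicative:

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

def evaluate_system(sources, hypotheses, references):
    """Corpus-level BLEU, chrF2, TER, chrF2++ and COMET, mirroring Table C.4."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    chrf2 = sacrebleu.corpus_chrf(hypotheses, [references]).score
    ter = sacrebleu.corpus_ter(hypotheses, [references]).score
    chrf2pp = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2).score

    # COMET requires source, hypothesis and reference for each segment
    comet_model = load_from_checkpoint(download_model("Unbabel/wmt20-comet-da"))
    data = [{"src": s, "mt": h, "ref": r}
            for s, h, r in zip(sources, hypotheses, references)]
    comet = comet_model.predict(data, batch_size=8, gpus=0).system_score

    return {"BLEU": bleu, "chrF2": chrf2, "TER": ter,
            "chrF2++": chrf2pp, "COMET": comet}
```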

Lang. System BLEU chrF2 TER chrF2++ COMET
Arabic M2M100 19.2 50.9 69.2 47 0.417
MarianNMT 22.7 54.2 64.7 50.4 0.483
mBART-50 17 48.5 69.1 44.8 0.452
GTrans 34.1 65.6 52.8 61.9 0.737
Dutch M2M100 21.3 52.9 66.1 49.8 0.405
MarianNMT 25 56.9 62.5 53.8 0.543
mBART-50 22.6 53.9 63.7 50.9 0.532
DeepL 28.7 59.5 59.5 56.6 0.67
GTrans 29.1 60 58.5 57.1 0.667
Indonesian M2M100 35.9 63.1 47.3 60.8 0.614
MarianNMT 38.5 65.6 46.5 63.3 0.671
mBART-50 35.9 63.3 47.7 61.1 0.706
GTrans 51.5 73.6 34.5 71.9 0.894
Italian M2M100 23.6 53.9 63.2 51 0.51
MarianNMT 27.5 57.6 58.9 54.8 0.642
mBART-50 24.4 54.7 61.2 51.8 0.648
DeepL 33 61 54 58.5 0.795
GTrans 32.8 61.4 53.6 58.8 0.781
Japanese M2M100 24.5 32.2 123.3 26 0.389
mBART 27.1 35.4 123 28.3 0.538
DeepL 41.3 46.8 108 37 0.75
GTrans 38.4 44.7 101.5 33.9 0.683
Polish M2M100 16.1 46.5 74.2 43.1 0.486
MarianNMT 19.3 49.9 70.5 46.6 0.648
mBART-50 17.4 48.2 72.4 44.9 0.603
DeepL 24 54.3 66.4 51.1 0.832
GTrans 24.4 54.6 64.6 51.4 0.804
Russian M2M100 22.5 51.1 65.6 48.1 0.427
MarianNMT 25.4 53.5 64.3 50.7 0.537
mBART 24.8 52.6 63.7 49.7 0.541
DeepL 35.9 61.8 53.3 59.3 0.79
GTrans 33 60.5 55.2 57.7 0.731
Turkish M2M100 20.3 53.9 65.2 50.1 0.686
MarianNMT 26.3 59.8 58.8 55.8 0.881
mBART-50 18.8 52.7 67.5 48.7 0.755
GTrans 35 65.5 50.4 62.2 1
Ukrainian M2M100 21.9 51.4 65.8 48.3 0.463
MarianNMT 20 48.8 69.2 45.7 0.427
mBART-50 21.9 50.7 67.9 47.7 0.587
GTrans 31.1 59.8 55.9 56.8 0.758
Vietnamese M2M100 33.3 52.3 52.4 52.1 0.43
MarianNMT 26.7 45.7 60.2 45.6 0.117
mBART-50 34.7 54 50.7 53.8 0.608
GTrans 45.1 61.9 41.8 61.9 0.724
Table C.4: Automatic MT quality of all evaluated NMT systems on all tested languages in the English-to-XX setting, using the full FLORES-101 devtest for evaluation. Besides mBART-50 and Google Translate (GTrans), we also evaluate a set of bilingual Transformer-based NMT models trained with MarianNMT (Tiedemann and Thottingal, 2020), the DeepL industrial MT system and the multilingual M2M-100 418M model (Fan et al., 2021). Best overall and best open-source performances are highlighted.

Inter-subject Variability in Translation Times Although the variability across subjects working on the same language direction is not the main concern of our investigation, we produce Figure C.2 (an expanded version of Figure 8.2) to visualize the inter-subject variability in translation times. We observe that the variability across translators is more pronounced when translating from scratch, and that the overall trend of speed improvements associated with PE is mostly preserved (with a few exceptions related to the PE\(_2\) modality).

Figure C.2: Time per processed source word across languages, subjects and translation modalities, measured in seconds. Each point represents a document containing 3–5 sentences translated by a subject in one of the languages, with higher scores representing slower editing.
English
Inland waterways can be a good theme to base a holiday around.
Arabic
HT يمكن أن تكون الممرات المائية الداخلية خياراً جيداً لتخطيط عطلة حولها.
PE1 MT: يمكن أن تكون الممرات المائية الداخلية موضوعًا جيدًا لإقامة عطلة حولها
PE: يمكن أن تكون الممرات المائية الداخلية مظهرًا جيدًا لإقامة عطلة حولها
PE2 MT: يمكن أن تكون السكك الحديدية الداخلية موضوعًا جيدًا لإقامة عطلة حول
PE: قد تكون الممرات المائية الداخلية مكانًا جيدًا لقضاء عطلة حولها
Dutch
HT Binnenlandse waterwegen kunnen een goed thema zijn voor een vakantie.
PE1 MT: De binnenwateren kunnen een goed thema zijn om een vakantie omheen te baseren.
PE: Binnenwateren kunnen een goede vakantiebestemming zijn.
PE2 MT: Binnenwaterwegen kunnen een goed thema zijn om een vakantie rond te zetten.
PE: Binnenwaterwegen kunnen een goed thema zijn om een vakantie rond te organiseren.
Italian
HT I corsi d'acqua dell'entroterra possono essere un ottimo punto di partenza da cui organizzare una vacanza.
PE1 MT: Trasporto fluviale può essere un buon tema per basare una vacanza in giro.
PE: I canali di navigazione interna possono essere un ottimo motivo per cui intraprendere una vacanza.
PE2 MT: I corsi d’acqua interni possono essere un buon tema per fondare una vacanza.
PE: I corsi d’acqua interni possono essere un buon tema su cui basare una vacanza.
Turkish
HT İç bölgelerdeki su yolları, tatil planı için iyi bir tema olabilir.
PE1 MT: İç su yolları, bir tatili temel almak için iyi bir tema olabilir.
PE: İç su yolları, bir tatil planı yapmak için iyi bir tema olabilir.
PE2 MT: İç suyolları, tatil için uygun bir tema olabilir.
PE: İç sular tatil için uygun bir tema olabilir.
Ukrainian
HT Можна спланувати вихідні, взявши за основу подорож внутрішніми водними шляхами.
PE1 MT: Внутрішні водні шляхи можуть стати гарною темою для відпочинку навколо.
PE: Внутрішні водні шляхи можуть стати гарною темою для проведення вихідних.
PE2 MT: Водні шляхи можуть бути хорошим об’єктом для базування відпочинку навколо.
PE: Місцевість навколо внутрішніх водних шляхів може бути гарним вибором для організації відпочинку.
Vietnamese
HT Du lịch trên sông có thể là một lựa chọn phù hợp cho kỳ nghỉ.
PE1 MT: Đường thủy nội địa có thể là một chủ đề hay để tạo cơ sở cho một kỳ nghỉ xung quanh.
PE: Đường thủy nội địa có thể là một ý tưởng hay để lập kế hoạch cho kỳ nghỉ.
PE2 MT: Các tuyến nước nội địa có thể là một chủ đề tốt để xây dựng một kì nghỉ.
PE: Du lịch bằng đường thủy nội địa là một ý tưởng nghỉ dưỡng không tồi.
Table C.5: An example sentence (81.1) from the DivEMT corpus, with the English source and all output modalities for all target languages, including intermediate machine translations (MT) and subsequent post-edits (PE). Colors denote insertions, deletions, substitutions and shifts computed with Tercom (Snover et al., 2006).
English
The Internet combines elements of both mass and interpersonal communication.
Arabic
HT يجمع الإنترنت بين عناصر وسائل الاتصال العامة والشخصية على حدٍ سواء
PE1 MT: تجمع الإنترنت بين عناصر الاتصال الجماهيري والشخصي
PE: يجمع الإنترنت بين عناصر الاتصال الجماهيري والشخصي
PE2 MT: إنترنت تجمع عناصر التواصل الجماعي والتواصل الشخصي
PE: تجمع شبكة الإنترنت عناصر التواصل الجماعي والتواصل الشخصي
Dutch
HT Het internet combineert elementen van zowel massa- en intermenselijke communicatie.
PE1 MT: Het internet combineert elementen van zowel massa- als interpersoonlijke communicatie.
PE: Het internet combineert elementen van zowel massa- als interpersoonlijke communicatie.
PE2 MT: Het internet combineert elementen van massa- en interpersoonlijke communicatie.
PE: Het internet combineert elementen van massa- en interpersoonlijke communicatie.
Italian
HT Internet combina elementi di comunicazione sia di massa sia interpersonale.
PE1 MT: Internet combina elementi di comunicazione di massa e interpersonali.
PE: Internet combina elementi di comunicazione di massa e interpersonale.
PE2 MT: Internet combina elementi di comunicazione di massa e interpersonale.
PE: Internet combina elementi di comunicazione di massa e interpersonale.
Turkish
HT İnternet hem kitlesel hem de bireysel iletişim öğelerini birleştiriyor.
PE1 MT: İnternet, hem kitle hem de kişiler arası iletişimin unsurlarını birleştirir.
PE: İnternet, hem kitleler hem de kişiler arası iletişimin unsurlarını birleştirir.
PE2 MT: İnternet hem kitlesel hem de kişisel iletişim unsurlarını birleştiriyor.
PE: İnternet hem kitlesel hem de kişisel iletişim unsurlarını birleştiriyor.
Ukrainian
HT В інтернеті поєднуються елементи групового спілкування та особистого спілкування.
PE1 MT: Інтернет поєднує в собі елементи як масового, так і міжособистісного спілкування.
PE: Інтернет поєднує в собі елементи як масового, так і міжособистісного спілкування.
PE2 MT: Інтернет об’єднує як масову, так і міжлюдську комунікацію.
PE: Інтернет поєднує в собі елементи як групової, так і особистої комунікації.
Vietnamese
HT Internet là nơi tổng hợp các yếu tố của cả phương tiện truyền thông đại chúng và giao tiếp liên cá nhân.
PE1 MT: Internet kết hợp các yếu tố của cả giao tiếp đại chúng và giao tiếp giữa các cá nhân.
PE: Internet kết hợp các yếu tố của cả truyền thông đại chúng và giao tiếp giữa các cá nhân.
PE2 MT: Internet kết hợp những yếu tố của sự giao tiếp quần chúng và giao tiếp giữa người với người.
PE: Internet kết hợp những yếu tố của cả việc giao tiếp đại chúng và giao tiếp cá nhân.
Table C.6: An example sentence (29.2) from the DivEMT corpus, with the English source and all output modalities for all target languages, including intermediate machine translations (MT) and subsequent post-edits (PE). Colors denote insertions, deletions, substitutions and shifts computed with Tercom (Snover et al., 2006).
Subject Coefficient
ara_t1 0.281
ara_t2 -0.384
ara_t3 -0.103
nld_t1 0.001
nld_t2 -0.459
nld_t3 0.458
ita_t1 0.086
ita_t4 0.350
ita_t5 -0.436
tur_t1 -0.381
tur_t2 0.272
tur_t3 0.109
ukr_t1 0.077
ukr_t2 0.314
ukr_t3 -0.391
vie_t1 0.012
vie_t2 0.176
vie_t3 -0.188
Table C.7: Coefficients of the random intercept related to the subject_id variable, representing the identity of the translator performing the translation.


C.1.5 Data Filtering and Feature Significance

We log-transform the dependent variable, edit time in seconds, given its long right tail. The models are built by adding one element at a time and checking whether each addition leads to a significantly better model according to AIC (i.e. whether the score is reduced by at least 2). Our random effects structure includes random intercepts for segments (nested within documents) and translators, as well as a random slope for modality over individual segments. We start from an initial model that includes only the two random intercepts (by-translator and by-segment) and proceed by (i) verifying the significance of the nested document/segment random effect; (ii) adding fixed predictors one by one; (iii) adding interactions between fixed predictors; and (iv) adding the random slopes.1 This sequential procedure yields the resulting model. When checking the homoscedasticity and normality-of-residuals assumptions (Figure C.3 and Figure C.4), we find that the latter is not fulfilled. Consequently, we remove data points whose observations deviate by more than 2.5 standard deviations from the model's predicted value (2.4% of the data) and refit the best model on this subset, to determine whether any of the effects were driven by these outliers. The resulting trends do not change significantly in this final model, in which residuals are normally distributed. As a final sanity check, in Table C.7 we measure the effect of subject identity on edit times and find no systematic patterns across languages.
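The AIC comparison step can be illustrated as follows. This is a simplified sketch using statsmodels' MixedLM, which does not express the full nested random-effects structure of the model used in the chapter; column names and the export path are hypothetical, and AIC is computed manually from ML fits:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("divemt_times.csv")      # hypothetical per-sentence export
df["log_time"] = np.log(df["edit_time"])  # long right tail -> log-transform

def fit_ml(formula):
    # ML (not REML) fits so that AICs are comparable across fixed-effect structures
    return smf.mixedlm(formula, df, groups=df["subject_id"]).fit(reml=False)

def aic(res):
    # Approximate AIC: -2 log-likelihood + 2 * number of estimated parameters
    return -2 * res.llf + 2 * len(res.params)

base = fit_ml("log_time ~ 1")
candidate = fit_ml("log_time ~ task_type")

# Keep the richer model only if it lowers the AIC by at least 2 points
if aic(base) - aic(candidate) >= 2:
    base = candidate
```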

Figure C.3: Residuals of the final LMER model, used to verify the homoscedasticity assumption.
Figure C.4: Quantile-quantile plot before and after the removal of outliers when fitting the LMER model, used to verify the normality assumption.

C.2 Word-level Quality Estimation for Machine Translation Post-editing

C.2.1 Filtering Details for QE4PE Data

  1. Documents should contain between 4 and 10 segments, each containing 10-100 words (959 docs). This ensures that all documents are roughly uniform in size and complexity, maintaining a steady editing flow (Section 9.2.5).
  2. The average segment-level QE score predicted by XCOMET-XXL is between 0.3 and 0.95, with no segment below 0.3 (429 docs). This forces segments to have a decent but still imperfect quality, excluding fully wrong translations.
  3. At least 3 and at most 20 error spans per document, with no more than 30% of words in the document being highlighted (351 docs). This avoids overwhelming the editor with excessive highlighting while still ensuring the presence of errors.

The same heuristics were applied to both translation directions, selecting only documents matching our criteria in both cases. A minimal sketch of the selection procedure is shown below.
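The sketch assumes each candidate document is a dict with per-segment word counts, XCOMET-XXL segment scores, predicted error spans and highlighted-word counts; all key names are hypothetical:

```python
def keep_document(doc: dict) -> bool:
    """Return True if a candidate document passes all three selection heuristics."""
    segs = doc["segments"]

    # 1. Size: 4-10 segments of 10-100 words each
    if not 4 <= len(segs) <= 10:
        return False
    if any(not 10 <= s["n_words"] <= 100 for s in segs):
        return False

    # 2. Quality: average XCOMET-XXL QE score in [0.3, 0.95], no segment below 0.3
    scores = [s["qe_score"] for s in segs]
    if not 0.3 <= sum(scores) / len(scores) <= 0.95 or min(scores) < 0.3:
        return False

    # 3. Highlights: 3-20 error spans, at most 30% of words highlighted
    n_spans = sum(len(s["error_spans"]) for s in segs)
    n_highlighted = sum(s["n_highlighted_words"] for s in segs)
    n_words = sum(s["n_words"] for s in segs)
    return 3 <= n_spans <= 20 and n_highlighted <= 0.3 * n_words
```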

C.2.2 Additional Details and Statistics

Identifier Job Eng. Lvl Trans. YoE Post-edit YoE Post-edit % Adv. CAT use MT good/bad for: Post-edit comment
ita-nohigh-fast FL (FT) C1 2-5 2-5 100% Often G: Productivity, quality, repetitive work. PE better than from scratch when consistency is needed.
ita-nohigh-avg FL (PT) C1 >10 <2 20% Often G: Productivity, repetitive work. B: less creative. PE produces unnatural sentences.
ita-nohigh-slow FL (PT) C2 >10 2-5 40% Sometimes G: creativity. Good for time saving.
ita-oracle-fast FL (FT) C2 5-10 2-5 60% Sometimes G: Productivity, repetitive work. B: less creative. Good for productivity, humans always needed.
ita-oracle-avg FL (FT) C2 5-10 5-10 20% Always G: productivity, terminology. Good for tech docs, not for articulated texts.
ita-oracle-slow FL (FT) C2 2-5 5-10 80% Always G: Productivity, repetitive work. Useful for consistency and productivity, unless creativity is needed.
ita-unsup-fast FL (FT) C1 <2 <2 60% Often G: Productivity, terminology. B: less creative. Humans will always be needed in translation.
ita-unsup-avg FL (FT) C1 >10 2-5 60% Often G: Productivity, repetitive work. B: less creative. An opportunity for translators.
ita-unsup-slow FL (FT) C1 5-10 5-10 80% Always G: Productivity, repetitive work. B: less creative. Good for focusing on detailed/cultural/creative aspects of translations.
ita-sup-fast FL (PT) C1 >10 2-5 40% Often G: Productivity, quality, repetitive work, terminology. Improves quality and consistency.
ita-sup-avg FL (FT) C1 >10 5-10 100% Always G: Productivity, repetitive work. B: less creative. Consistency improved, but less variance means less creativity.
ita-sup-slow FL (FT) C1 >10 2-5 20% Always G: Productivity, creativity, quality, repetitive work. Good for productivity, but does not work on creative texts.
nld-nohigh-fast FL (FT) C1 >10 >10 40% Often G: Productivity, terminology. B: creativity. Widespread but still too literal
nld-nohigh-avg FL (FT) C2 >10 2-5 40% Always G: Repetitive work. B: creativity, often wrong, worse quality. Increase in productivity to save on costs brings down quality.
nld-nohigh-slow FL (FT) C2 >10 5-10 100% Often G: Creativity, quality, repetitive work, terminology. Working with MT can be creative beyond PE.
nld-oracle-fast FL (FT) C1 5-10 5-10 80% Always G: Productivity, quality, repetitive work, terminology. Good for tech docs and repetition.
nld-oracle-avg FL (FT) C2 >10 2-5 40% Always B: less creative, less productive, often wrong Bad MT is worse than no MT for specialized domains.
nld-oracle-slow FL (FT) C2 >10 2-5 60% Often G: Productivity, repetitive work. B: cultural references. More productivity at the cost of idioms and cultural factors.
nld-unsup-fast FL (FT) C2 5-10 2-5 40% Often G: all. B: often wrong, worse quality. PE makes you less in touch with the texts and often poorly paid.
nld-unsup-avg FL (FT) C2 5-10 2-5 60% Sometimes G: Productivity, quality, repetitive work, terminology. B: wrong. Practical but less effective for longer passages.
nld-unsup-slow FL (FT) C2 >10 2-5 40% Always G: repetitive work, productivity, terminology Improves consistency and productivity if applied well.
nld-sup-fast FL (FT) C2 >10 5-10 60% Often G: repetitive work, creativity, terminology Useful, but worries about job loss
nld-sup-avg FL (FT) C2 >10 10 60% Sometimes G: terminology, creativity Useful for inspiration on better translations
nld-sup-slow FL (FT) C1 5-10 5-10 80% Always G: repetitive work, productivity Better productivity at the cost of creativity.
Table C.8: Sample of pre-task questionnaire results. YoE = years of experience. FL = Freelance, PT = Part-time, FT = Full-time. PE = Post-editing. G = Good, B = Bad.
Identifier MT good / fluent / accurate Highlights accurate / useful Interface clear Task difficult \(\uparrow\) Speed? \(\uparrow\) Quality? \(\uparrow\) Effort? \(\uparrow\) Influence? \(\uparrow\) Spot errors? \(\uparrow\) Enjoy?
ita-nohigh-fast 4 / 0.8 / 0.8 - / - 5 1 - - - - - -
ita-nohigh-avg 3 / 0.6 / 0.4 - / - 2 4 - - - - - -
ita-nohigh-slow 3 / 0.8 / 0.8 - / - 1 5 - - - - - -
ita-oracle-fast 5 / 0.4 / 0.8 4 / 4 4 5 5 2 1 1 1 4
ita-oracle-avg 3 / 0.4 / 0.6 2 / 1 2 3 1 1 4 1 1 1
ita-oracle-slow 3 / 0.6 / 0.6 2 / 2 2 5 1 1 1 1 4 1
ita-unsup-fast 3 / 0.8 / 0.6 3 / 2 4 5 3 3 3 2 2 2
ita-unsup-avg 3 / 0.6 / 0.6 3 / 3 3 5 2 3 2 1 1 3
ita-unsup-slow 3 / 0.4 / 0.6 2 / 2 3 4 2 2 3 3 4 4
ita-sup-fast 3 / 0.4 / 0.4 2 / 1 2 2 1 1 3 1 2 2
ita-sup-avg 3 / 0.4 / 0.4 2 / 2 3 5 3 2 4 3 3 4
ita-sup-slow 3 / 0.6 / 0.6 2 / 2 1 2 2 1 1 4 4 1
nld-nohigh-fast 3 / 0.2 / 0.4 - / - 4 4 - - - - - -
nld-nohigh-avg 2 / 0.4 / 0.6 - / - 4 5 - - - - - -
nld-nohigh-slow 2 / 0.2 / 0.4 - / - 3 5 - - - - - -
nld-oracle-fast 3 / 0.6 / 0.6 2 / 1 3 2 2 2 2 1 1 1
nld-oracle-avg 3 / 0.8 / 0.6 4 / 3 3 4 3 3 3 3 2 3
nld-oracle-slow 3 / 0.6 / 0.4 3 / 1 3 4 1 1 1 1 1 3
nld-unsup-fast 3 / 0.6 / 0.8 3 / 2 4 4 1 3 1 1 2 1
nld-unsup-avg 3 / 0.6 / 0.6 4 / 3 2 4 3 3 4 3 2 3
nld-unsup-slow 1 / 0.4 / 0.4 2 / 4 1 4 4 4 3 2 2 3
nld-sup-fast 3 / 0.6 / 0.4 2 / 2 3 5 1 1 5 3 1 1
nld-sup-avg 3 / 0.4 / 0.6 2 / 2 2 4 1 1 1 1 2 3
nld-sup-slow 5 / 0.8 / 1 4 / 3 2 5 3 3 2 2 2 4
Table C.9: Sample of post-task questionnaire results. Statements use a 1–Strongly disagree to 5–Strongly agree scale. The \(\uparrow\) columns report agreement with statements on the effects of highlights.
Target: Seg. Edit Time, 5s bins from 0 to 600s
Feature Coeff. Sig.
(Intercept) 1.67 ***
MT Num. Chars 2.42 ***
Highlight Ratio % 1.59 ***
Target Lang.: ITA -0.34 ***
Text Domain: Social 0.31 ***
Oracle Highlight -0.79 .
Sup. Highlight 0.02
Unsup. Highlight -0.07
MT XCOMET QE Score 0.01 ***
ITA:Oracle 0.91 ***
ITA:Sup. 1.18 ***
ITA:Unsup. 0.48 ***
Social:Oracle -0.19 **
Social:Sup. -0.34 ***
Social:Unsup. -0.22 ***
Highlight Ratio:Oracle -0.83 *
Highlight Ratio:Sup. -1.33 ***
Random Factors
Edit Order
Translator ID
Segment ID
Table C.10: Details for the negative binomial mixed-effect model used for the productivity analysis of Section 9.3.1.


Target: % of edited characters in a segment (0-100).
Feature Coeff. Sig.
(Intercept) 21.0 ***
MT Num. Chars 10.3 ***
Highlight Ratio % 7.1 ***
Target Lang.: ITA -9.9 ***
Text Domain: Social 10.9 ***
Oracle Highlight -5.2
Sup. Highlight -4.7
Unsup. Highlight -0.9
ITA:Oracle 12.2 ***
ITA:Sup. 15.9 ***
ITA:Unsup. 13.4 ***
Social:Oracle 3.5 ***
Social:Sup. -0.4
Social:Unsup. 2.1 **
Highlight Ratio:Oracle -0.18
Highlight Ratio:Sup. -1.78 ***
Random Factors
Edit Order
Translator ID
Segment ID
Zero-Inflation Factors
MT Num. Chars
Target Lang.
Text Domain
Translator ID
Table C.11: Details for the zero-inflated negative binomial mixed-effect model used for the editing analysis of Section 9.3.2. The model achieves an RMSE of 0.11 and an \(R^2\) of 0.98.
Modalities en→it en→nl Both
Bio Social Both Bio Social Both Bio Social Both
Oracle & Sup. 0.17 0.32 0.25 0.38 0.29 0.34 0.26 0.29 0.29
Oracle & Unsup. 0.14 0.30 0.20 0.31 0.27 0.28 0.22 0.29 0.24
Sup. & Oracle 0.19 0.31 0.26 0.30 0.26 0.29 0.24 0.29 0.28
Sup. & Unsup. 0.19 0.33 0.25 0.28 0.24 0.25 0.24 0.29 0.25
Unsup. & Oracle 0.22 0.32 0.27 0.35 0.30 0.33 0.28 0.31 0.30
Unsup. & Sup. 0.22 0.37 0.30 0.39 0.27 0.33 0.30 0.31 0.32
Table C.12: Average highlight agreement proportion between different modalities across language pairs and domains (Section 9.3.2). Scores are normalized to account for the relative frequency of highlight modalities compared to the mean highlight frequency for the current language and domain combination.
Domain Speed \(P(H)\) \(P(E)\) \(P(E|H)\) \(P(E|\neg H)\) \(\Lambda_H(E)\) \(P(H|E)\) \(P(H|\neg E)\) \(\Lambda_E(H)\)
en→it
Biomed. Fast .09 .04 / .01 .12 / .02 .03 / .01 4.0 / 2.0 .30 / .27 .08 / .11 3.7 / 2.4
Avg. .10 / .05 .27 / .12 .09 / .04 3.0 / 3.0 .22 / .30 .07 / .11 3.1 / 2.7
Slow .09 / .02 .21 / .04 .08 / .01 2.6 / 4.0 .19 / .26 .07 / .11 2.7 / 2.3
Social Fast .14 .11 / .07 .30 / .20 .07 / .04 4.2 / 5.0 .40 / .52 .11 / .16 3.6 / 3.2
Avg. .23 / .14 .48 / .32 .18 / .10 2.6 / 3.2 .30 / .42 .09 / .15 3.3 / 2.8
Slow .17 / .05 .39 / .14 .14 / .03 2.7 / 4.6 .31 / .54 .11 / .17 2.8 / 3.1
en→nl
Biomed. Fast .14 .03 / .02 .11 / .05 .02 / .01 5.5 / 5.0 .48 / .61 .13 / .18 3.6 / 3.3
Avg. .11 / .19 .20 / .30 .10 / .17 2.0 / 1.7 .25 / .29 .13 / .16 1.9 / 1.8
Slow .12 / .10 .26 / .23 .10 / .07 2.6 / 3.2 .29 / .42 .12 / .16 2.4 / 2.6
Social Fast .12 .06 / .07 .19 / .21 .04 / .04 4.7 / 5.2 .37 / .47 .10 / .13 3.7 / 3.6
Avg. .17 / .32 .32 / .48 .15 / .29 2.1 / 1.6 .22 / .23 .10 / .12 2.2 / 1.9
Slow .18 / .18 .38 / .40 .15 / .14 2.5 / 2.8 .25 / .34 .09 / .11 2.7 / 3.0
Table C.13: Highlighting (\(H\)) and editing (\(E\)) statistics for each domain and translation direction, across translator speeds (\(n = 4\) post-editors per combination, regardless of highlight modality). Values after slashes are adjusted by projecting highlights of the specified modality over edits from No Highlight translators to estimate highlight-induced editing biases (Section 9.3.2).
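The conditional quantities in Tables C.13 and C.14 can be derived from aligned word-level boolean masks. A minimal sketch follows, where the ratios follow the definition implied by the reported values, i.e. \(\Lambda_H(E) = P(E|H)/P(E|\neg H)\) and symmetrically for \(\Lambda_E(H)\):

```python
import numpy as np

def highlight_edit_stats(h: np.ndarray, e: np.ndarray) -> dict:
    """Association statistics between highlights (h) and edits (e).

    h, e: aligned boolean arrays with one entry per MT word over the pooled
    documents, marking whether the word was highlighted / edited.
    """
    p_e_h, p_e_nh = e[h].mean(), e[~h].mean()
    p_h_e, p_h_ne = h[e].mean(), h[~e].mean()
    return {
        "P(H)": h.mean(), "P(E)": e.mean(),
        "P(E|H)": p_e_h, "P(E|~H)": p_e_nh,
        "Lambda_H(E)": p_e_h / p_e_nh,  # relative increase in editing probability on highlighted words
        "P(H|E)": p_h_e, "P(H|~E)": p_h_ne,
        "Lambda_E(H)": p_h_e / p_h_ne,  # relative increase in highlighting probability on edited words
    }
```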
Domain Modality \(P(H)\) \(P(E)\) \(P(E|H)\) \(P(E|\neg H)\) \(\Lambda_H(E)\) \(P(H|E)\) \(P(H|\neg E)\) \(\Lambda_E(H)\)
en→it
Biomed. Random .12 - - / .02 - / .02 - / 1.0 - / .11 - / .13 - / 0.8
No High. - .02 - - - - - -
Oracle .08 .07 .26 / .08 .05 / .02 5.2 / 4.0 .30 / .26 .06 / .08 5.0 / 3.2
Unsup. .16 .10 .18 / .06 .08 / .02 2.2 / 3.0 .29 / .36 .14 / .15 2.0 / 2.4
Sup. .11 .12 .18 / .05 .11 / .02 1.6 / 2.5 .16 / .23 .10 / .10 1.6 / 2.3
Social Random .20 - - / .09 - / .09 - / 1.0 - / .21 - / .20 - / 1.0
No High. - .09 - - - - - -
Oracle .25 .20 .42 / .23 .13 / .04 3.2 / 5.7 .52 / .66 .18 / .21 2.8 / 3.1
Unsup. .17 .18 .35 / .19 .14 / .07 2.5 / 2.7 .33 / .37 .14 / .15 2.3 / 2.4
Sup. .15 .21 .38 / .23 .18 / .06 2.1 / 3.8 .27 / .39 .11 / .12 2.4 / 3.2
en→nl
Biomed. Random .17 - - / .12 - / .10 - / 1.2 - / .19 - / .17 - / 1.1
No High. - .10 - - - - - -
Oracle .21 .08 .21 / .20 .05 / .08 4.2 / 2.5 .52 / .41 .18 / .18 2.8 / 2.2
Unsup. .23 .09 .17 / .17 .07 / .08 2.4 / 2.1 .43 / .38 .21 / .21 2.0 / 1.8
Sup. .12 .08 .20 / .21 .06 / .09 3.3 / 2.3 .30 / .25 .11 / .11 2.7 / 2.2
Social Random .16 - - / .22 - / .19 - / 1.1 - / .19 - / .16 - / 1.1
No High. - .19 - - - - - -
Oracle .19 .12 .33 / .39 .07 / .15 4.7 / 2.6 .54 / .39 .15 / .15 3.6 / 2.6
Unsup. .15 .13 .25 / .33 .11 / .17 2.2 / 1.9 .30 / .26 .13 / .12 2.3 / 2.1
Sup. .12 .10 .30 / .36 .08 / .17 3.7 / 2.1 .36 / .23 .10 / .10 3.6 / 2.3
Table C.14: Highlighting (\(H\)) and editing (\(E\)) statistics for each domain, modality and translation direction combination (\(n = 3\) post-editors per combination). Values after slashes are adjusted by projecting highlights of the specified modality over edits from No Highlight translators to estimate highlight-induced editing biases (Section 9.3.2). A Random baseline is added by projecting random highlights matching the average frequency over all modalities for specific domain and translation direction settings.
ID Source text Target text Proposed correction Error Cat. Severity Score
9-1 Specifying peri- and postnatal factors in children born very preterm (VPT) that affect later outcome helps to improve long-term treatment. Specificare i fattori peri- e postnatali nei bambini nati molto pretermine (VPT) che influenzano il risultato successivo aiuta a migliorare il trattamento a lungo termine. Specificare i fattori peri- e postnatali nei bambini nati molto pretermine (VPT, Very Preterm) che influenzano il risultato successivo aiuta a migliorare il trattamento a lungo termine. Readability Minor 90
9-2 To enhance the predictability of 5-year cognitive outcome by perinatal, 2-year developmental and socio-economic data. Migliorare la prevedibilità del risultato cognitivo a 5 anni mediante dati perinatali, di sviluppo e socioeconomici a 2 anni. 100
9-3 5-year infants born VPT were compared to 34 term controls. I neonati di 5 anni nati VPT sono stati confrontati con 34 nati a termine come controllo. I neonati di 5 anni nati VPT sono stati confrontati con 34 controlli a termine. Mistranslation Minor 70
9-4 The IQ of 5-year infants born VPT was 10 points lower than that of term controls and influenced independently by preterm birth and SES. Il QI dei bambini di 5 anni nati VPT era di 10 punti inferiore a quello dei nati a termine di controllo, e influenzato indipendentemente dalla nascita pretermine e dai dati SES. Il QI dei bambini di 5 anni nati VPT era di 10 punti inferiore a quello dei nati a termine e influenzato indipendentemente dalla nascita pretermine e dallo stato socioeconomico (SES). Mistranslation Minor 70
Il QI dei bambini di 5 anni nati VPT era di 10 punti inferiore a quello dei nati a termine di controllo, e influenzato indipendentemente dalla nascita pretermine e dai dati SES. Il QI dei bambini di 5 anni nati VPT era di 10 punti inferiore a quello dei nati a termine e influenzato indipendentemente dalla nascita pretermine e dallo stato socioeconomico (SES). Untranslated Minor
52-1 But with less than 3 months to go for that, I feel I'm not ready yet, but having never taken it, I have nothing to compare it to besides colleagues' advice. Ma con meno di 3 mesi per farlo, sento di non essere ancora pronto, ma non l’ho mai preso, non ho nulla con cui confrontarlo oltre ai consigli dei colleghi. Ma con meno di 3 mesi per farlo, sento di non essere ancora pronto, e non avendolo mai fatto, non ho nulla con cui confrontarlo oltre ai consigli dei colleghi. Mistranslation Major 30
52-2 Without knowing what I know, they can't know if I'm actually ready yet, but many of them are pushing me to sign up for it. Senza sapere quello che so, non possono sapere se sono ancora pronta, ma molti di loro mi stanno spingendo a iscrivermi. Se non hanno idea di quanto sappia, non possono sapere se sono davvero pronta, ma molti di loro mi stanno spingendo a iscrivermi. Readability Minor 60
Senza sapere quello che so, non possono sapere se sono ancora pronta, ma molti di loro mi stanno spingendo a iscrivermi. Se non hanno idea di quanto sappia, non possono sapere se sono davvero pronta, ma molti di loro mi stanno spingendo a iscrivermi. Mistranslation Minor
52-3 I'm close... I just don't know if I'm 2 months study close. Ci sono quasi… solo che non so se ce la farò in soli 2 mesi, ma penso di potercela fare. Ci sono quasi... solo che non so se ce la farò in soli 2 mesi. Addition Major 20
Table C.15: Cropped examples of biomedical and social media texts with error annotations from the QA interface (Biomedical: post-edited segments from the No Highlight modality; Social media: MT outputs).
Figure C.5: Top: Post-editing rate across highlight modalities, domains and directions. Bottom: Proportion of edits in highlighted spans across highlight modalities. *** \(=p<0.001\), ** \(=p<0.01\), * \(=p<0.05\), ns \(=\) not significant with Bonferroni correction.
Figure C.6: Post-editing agreement across various modalities (Section 9.3.2). Results are averaged across all translator pairs for the two modalities (\(n = 3\) intra-modality, \(n=9\) inter-modality for every language) and all segments.
Figure C.7: ESA ratings for MT outputs and post-edits across domains and translation directions.
Figure C.8: Distribution of MQM error categories for MT and post-edits across highlight modalities for the two translation directions and domains of QE4PE.
Figure C.9: Editing proportion, measured by word error rate between MT and post-edited texts, with respect to post-editor progression. Values are medians across all post-editors.
Figure C.10: Segment-level post-editing time with respect to post-editor progression. Values are medians across all annotators. The light gray area spans min-max values; the dark gray area spans the 25%-75% quantiles. Annotators do not become considerably faster as the task progresses, likely due to the simplicity of the task and the high post-editing proficiency of professional post-editors. The high variability in editing times motivates the careful group assignment performed using the pre-task edit logs.

C.3 Unsupervised MT Error Detection and Human Disagreement

C.3.1 Full Results

Method QE4PE\(_{\mathbf{t1}}\) QE4PE\(_{\mathbf{t2}}\) QE4PE\(_{\mathbf{t3}}\) QE4PE\(_{\mathbf{t4}}\) QE4PE\(_{\mathbf{t5}}\) QE4PE\(_{\mathbf{t6}}\) QE4PE\(_{\mathbf{avg}}\)
AP F1* AP F1* AP F1* AP F1* AP F1* AP F1* AP F1*
Random Baseline .08 .14 .15 .26 .06 .12 .11 .19 .22 .36 .18 .30 .13 .23
Surprisal .11 .20 .21 .31 .11 .17 .16 .25 .30 .40 .25 .35 .19 .28
Out. Entropy .12 .18 .22 .30 .10 .16 .17 .24 .30 .39 .26 .34 .19 .27
Surprisal MCD\(_{\text{avg}}\) .12 .20 .22 .32 .11 .17 .16 .26 .30 .41 .26 .36 .19 .29
Surprisal MCD\(_{\text{var}}\) .13 .21 .26 .33 .12 .20 .19 .27 .31 .40 .29 .36 .22 .30
LL Surprisal\(_{\text{best}}\) .11 .19 .21 .32 .11 .16 .16 .25 .29 .40 .26 .35 .19 .28
LL KL-Div\(_{\text{best}}\) .09 .16 .19 .28 .08 .14 .13 .21 .25 .37 .22 .31 .16 .25
LL Pred. Depth .09 .16 .18 .28 .07 .13 .14 .21 .25 .37 .21 .31 .16 .24
Attn. Entropy\(_{\text{avg}}\) .11 .16 .17 .27 .12 .17 .11 .19 .23 .36 .19 .31 .15 .24
Attn. Entropy\(_{\text{max}}\) .09 .14 .15 .26 .10 .18 .09 .19 .20 .36 .16 .30 .13 .24
Blood\(_{\text{best}}\) .08 .14 .16 .26 .06 .12 .11 .19 .23 .36 .18 .30 .14 .23
xcomet-xl .11 .24 .22 .35 .10 .20 .16 .30 .27 .35 .23 .34 .18 .30
xcomet-xl\(_{\text{confw}}\) .20 .25 .30 .36 .14 .21 .25 .31 .37 .40 .31 .36 .26 .32
xcomet-xxl .13 .27 .22 .32 .10 .24 .17 .31 .28 .32 .23 .31 .19 .30
xcomet-xxl\(_{\text{confw}}\) .19 .27 .31 .36 .17 .24 .26 .32 .37 .41 .33 .39 .27 .33
Human Editors\(_{\text{min}}\) .17 .33 .26 .38 .10 .21 .16 .26 .25 .36 .23 .30 .19 .31
Human Editors\(_{\text{avg}}\) .20 .38 .29 .43 .14 .30 .22 .39 .32 .38 .30 .40 .25 .39
Human Editors\(_{\text{max}}\) .24 .43 .31 .47 .20 .41 .24 .43 .37 .50 .33 .50 .28 .46
Table C.16: WQE metrics’ performance for predicting error spans from the six edit sets over NLLB 3.3B translations in the En\(\rightarrow\)It QE4PE dataset. Best unsupervised and overall best metric results are highlighted.
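For clarity, the two columns can be computed from token-level error labels and metric scores as sketched below, reading F1* as the best F1 over all decision thresholds (an assumption on the starred notation):

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

def ap_and_best_f1(labels: np.ndarray, scores: np.ndarray) -> tuple[float, float]:
    """labels: 1 if a token falls in a human-edited span; scores: WQE metric values."""
    ap = average_precision_score(labels, scores)
    precision, recall, _ = precision_recall_curve(labels, scores)
    # Best F1 across the precision-recall curve; clip avoids division by zero
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return ap, float(f1.max())
```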
Method QE4PE\(_{\mathbf{t1}}\) QE4PE\(_{\mathbf{t2}}\) QE4PE\(_{\mathbf{t3}}\) QE4PE\(_{\mathbf{t4}}\) QE4PE\(_{\mathbf{t5}}\) QE4PE\(_{\mathbf{t6}}\) QE4PE\(_{\mathbf{avg}}\)
AP F1* AP F1* AP F1* AP F1* AP F1* AP F1* AP F1*
Random Baseline .07 .14 .34 .51 .22 .36 .19 .32 .13 .24 .22 .36 .20 .32
Surprisal .12 .19 .41 .51 .30 .39 .29 .37 .21 .30 .31 .41 .27 .36
Out. Entropy .11 .18 .41 .51 .31 .37 .29 .36 .20 .27 .31 .39 .27 .35
Surprisal MCD\(_{\text{avg}}\) .12 .19 .42 .52 .31 .40 .30 .40 .21 .30 .31 .42 .28 .37
Surprisal MCD\(_{\text{var}}\) .13 .21 .45 .53 .36 .41 .34 .40 .24 .32 .36 .42 .31 .38
LL Surprisal\(_{\text{best}}\) .12 .19 .42 .53 .30 .40 .29 .38 .21 .30 .31 .41 .27 .37
LL KL-Div\(_{\text{best}}\) .09 .15 .39 .52 .28 .37 .25 .34 .17 .26 .29 .38 .25 .34
LL Pred. Depth .09 .16 .37 .52 .26 .37 .24 .33 .17 .25 .27 .38 .23 .33
Attn. Entropy\(_{\text{avg}}\) .09 .15 .37 .51 .22 .36 .20 .32 .13 .24 .23 .37 .21 .32
Attn. Entropy\(_{\text{max}}\) .09 .15 .35 .51 .22 .36 .18 .32 .12 .24 .21 .37 .19 .32
Blood\(_{\text{best}}\) .07 .13 .35 .51 .22 .36 .19 .32 .14 .24 .23 .36 .20 .32
xcomet-xl .13 .27 .39 .39 .31 .44 .28 .32 .20 .35 .31 .44 .27 .38
xcomet-xl\(_{\text{confw}}\) .24 .31 .47 .53 .43 .45 .40 .43 .29 .36 .43 .46 .38 .42
xcomet-xxl .13 .28 .39 .29 .30 .35 .26 .35 .19 .31 .30 .35 .26 .32
xcomet-xxl\(_{\text{confw}}\) .24 .30 .48 .53 .43 .45 .40 .42 .31 .35 .43 .45 .38 .42
Human Editors\(_{\text{min}}\) .16 .29 .43 .51 .34 .45 .33 .47 .26 .42 .36 .46 .32 .43
Human Editors\(_{\text{avg}}\) .17 .33 .44 .51 .34 .45 .33 .47 .26 .42 .36 .46 .32 .43
Human Editors\(_{\text{max}}\) .19 .36 .46 .51 .36 .51 .37 .53 .32 .51 .40 .53 .35 .49
Table C.17: WQE metrics’ performance for predicting error spans from the six edit sets over NLLB 3.3B translations in the En\(\rightarrow\)Nl QE4PE dataset. Best unsupervised and overall best metric results are highlighted.
Method Italian Dutch Arabic Turkish Vietnamese Ukrainian Average
AP F1* AP F1* AP F1* AP F1* AP F1* AP F1* AP F1*
Random Baseline .25 .40 .28 .43 .33 .49 .34 .50 .35 .52 .48 .65 .34 .50
Surprisal .34 .45 .36 .46 .42 .51 .43 .54 .46 .55 .55 .65 .43 .53
Out. Entropy .37 .43 .39 .45 .45 .50 .49 .52 .48 .54 .58 .65 .46 .51
Surprisal MCD\(_{\text{avg}}\) .34 .45 .37 .47 .43 .52 .44 .54 .46 .55 .56 .65 .43 .53
Surprisal MCD\(_{\text{var}}\) .39 .46 .41 .47 .47 .53 .49 .55 .48 .55 .61 .67 .48 .54
LL Surprisal\(_{\text{best}}\) .33 .44 .36 .45 .41 .51 .44 .54 .44 .55 .55 .66 .42 .53
LL KL-Div\(_{\text{best}}\) .34 .42 .37 .45 .41 .51 .44 .52 .44 .52 .56 .65 .43 .51
LL Pred. Depth .30 .42 .32 .44 .39 .50 .40 .52 .39 .53 .54 .66 .39 .51
Attn. Entropy\(_{\text{avg}}\) .28 .41 .30 .43 .35 .49 .37 .51 .40 .52 .50 .65 .37 .50
Attn. Entropy\(_{\text{max}}\) .25 .41 .26 .43 .34 .49 .34 .50 .35 .52 .47 .65 .34 .50
Blood\(_{\text{best}}\) .26 .40 .28 .43 .35 .52 .35 .50 .36 .52 .49 .65 .35 .51
xcomet-xl .34 .39 .37 .44 .41 .47 .44 .50 .42 .44 .56 .44 .42 .45
xcomet-xl\(_{\text{confw}}\) .46 .47 .49 .50 .51 .53 .58 .56 .53 .55 .68 .67 .54 .55
xcomet-xxl .34 .36 .35 .35 .43 .47 .45 .48 .43 .42 .57 .41 .43 .42
xcomet-xxl\(_{\text{confw}}\) .48 .49 .50 .50 .55 .54 .58 .56 .56 .57 .70 .67 .56 .55
Table C.18: WQE metrics’ performance for predicting error spans from multiple edit sets (one per language) over mBART-50 translations across the six typologically diverse target languages of DivEMT.
Method En\(\rightarrow\) Ja En\(\rightarrow\) Zh En\(\rightarrow\) Hi Cs\(\rightarrow\) Uk En\(\rightarrow\) Cs En\(\rightarrow\) Ru Average
AP F1* AP F1* AP F1* AP F1* AP F1* AP F1* AP F1*
Random Baseline .02 .03 .03 .07 .03 .07 .05 .09 .06 .11 .08 .16 .05 .09
Surprisal .03 .07 .05 .09 .05 .09 .14 .20 .10 .16 .13 .19 .08 .13
Out. Entropy .03 .08 .06 .11 .06 .10 .20 .27 .12 .18 .14 .20 .10 .16
LL Surprisal\(_{\text{best}}\) .03 .07 .05 .09 .05 .09 .14 .20 .10 .16 .13 .19 .08 .13
LL KL-Div\(_{\text{best}}\) .02 .05 .04 .07 .04 .08 .10 .17 .09 .15 .12 .19 .07 .12
LL Pred. Depth .02 .05 .04 .08 .04 .09 .09 .18 .08 .14 .11 .18 .06 .12
Attn. Entropy\(_{\text{avg}}\) .02 .03 .03 .07 .03 .07 .03 .09 .05 .11 .07 .16 .04 .09
Attn. Entropy\(_{\text{max}}\) .01 .03 .03 .07 .03 .07 .03 .09 .05 .11 .08 .16 .04 .09
xcomet-xl .04 .09 .05 .11 .06 .12 .13 .28 .11 .24 .16 .32 .09 .19
xcomet-xl\(_{\text{confw}}\) .08 .14 .10 .16 .10 .19 .18 .30 .19 .29 .24 .32 .15 .23
xcomet-xxl .04 .11 .06 .13 .05 .11 .13 .28 .11 .24 .16 .33 .09 .20
xcomet-xxl\(_{\text{confw}}\) .07 .15 .09 .19 .09 .17 .19 .29 .22 .30 .28 .33 .16 .24
Table C.19: WQE metrics’ performance for predicting error spans from the ESA annotations (one set per language) over Aya23-35B outputs for the WMT24 dataset.
Figure C.11: Precision-recall curves for xcomet metrics and Surprisal MCDvar for all annotators of QE4PE En\(\rightarrow\)It.
Figure C.12: Precision-recall curves for xcomet metrics and Surprisal MCDvar for all annotators of QE4PE En\(\rightarrow\)Nl.
Figure C.13: Precision-recall curves for xcomet metrics and Surprisal MCDvar on all DivEMT languages.
Figure C.14: Precision-recall curves for xcomet metrics and Out. Entropy on all WMT24 languages.

  1. The document processing order was originally included to identify possible longitudinal effects but was removed due to a lack of significant improvements.↩︎