10 Unsupervised MT Error Detection and Human Disagreement
So, you see, translators do not so much deliver a message as they rewrite the original. And herein lies the difficulty—rewriting is still writing, and writing always reflects the author's ideology and biases.
– Rebecca F. Kuang, Babel (2022)
10.1 Introduction
Word-level error spans are widely used in machine translation evaluation to obtain robust and fine-grained estimates of translation quality (Lommel et al., 2013; Freitag et al., 2021a; Freitag et al., 2021b; Kocmi et al., 2024b). Due to the cost of manual annotation, word-level quality estimation (WQE) was proposed to assist the annotation of error spans over MT outputs (Zouhar et al., 2025). Modern WQE approaches generally rely on costly inference with large language models or ad-hoc training with large amounts of human-annotated texts (Fernandes et al., 2023; Kocmi and Federmann, 2023; Guerreiro et al., 2024), making them impractical for less resourced settings (Zouhar et al., 2024).
To improve the efficiency of MT quality assessment, several works have explored the use of signals derived from the internals of neural MT systems (Fomicheva et al., 2020; Fomicheva et al., 2021; Leiter et al., 2024) to identify problems in MT outputs, such as hallucinations (Guerreiro et al., 2023a; Guerreiro et al., 2023b; Dale et al., 2023a; Dale et al., 2023b; Himmi et al., 2024). However, previous works have focused on sentence-level metrics for overall translation quality and do not evaluate performance on multiple label sets due to high annotation costs (Fomicheva et al., 2022; Zerva et al., 2024).
In this chapter, we conduct a more comprehensive evaluation spanning 10 unsupervised metrics derived from models’ inner representations and predictive distributions to identify translation errors at the word level. We test three open-source multilingual MT models and LLMs of varying sizes across 12 translation directions, including typologically diverse languages and challenging textual domains. Importantly, we focus on texts with multiple human annotations to measure the impact of individual annotator preferences on metric performance, setting a “human-level” baseline for the WQE task.
We address the following research questions:
How accurate are unsupervised WQE metrics in detecting MT errors compared to trained metrics and human annotators?
Are popular supervised WQE metrics well-calibrated?
Are the relative performances of WQE metrics affected by the variability in human error annotations?
We conclude with recommendations for improving the evaluation and usage of future WQE systems.
10.3 Models and Datasets
We use datasets containing error annotations or post-edits on the outputs of open-source models to extract unsupervised WQE metrics using real model outputs, thereby avoiding potential confounders. We select the following datasets, summarized in Table 10.1:
DivEMT We reuse the DivEMT dataset, introduced in Chapter 8, including out-of-English machine translations towards six typologically diverse target languages (English\(\rightarrow\){Arabic, Italian, Dutch, Turkish, Ukrainian, Vietnamese}) produced by Google Translate and mBART-50 1-to-many for a subset of Wiki texts from the FLORES dataset (Goyal et al., 2022), with edits made by professional translators. In this study, we evaluate unsupervised metrics on the mBART-50 1-to-many model, converting the human post-edits into token-level labels to perform a cross-lingual comparison over a fixed set of examples.
WMT24 The WMT24 dataset is taken from the General Machine Translation Shared Task at WMT 2024 (Kocmi et al., 2024a). It contains evaluations of several machine translation systems across English\(\rightarrow\){Czech, Hindi, Japanese, Chinese, Russian} (634 segments per language) and Czech\(\rightarrow\)Ukrainian (1954 segments). The human evaluation was conducted using the Error Span Annotation protocol (ESA, Kocmi et al. (2024b)), which involves human annotators highlighting erroneous spans in the translation and marking them as either minor or major errors. This dataset covers the news, social, and speech (with automatic speech recognition) domains. We adopt the official prompting setup from the WMT24 campaign, using the Aya23 model alongside the provided prompt and three in-context translation examples per language to ensure uniformity with previous results.2 Aya23 is a large language model introduced by Aryabumi et al. (2024) to improve the multilingual capabilities of the original Aya model (Üstün et al., 2024) on a selected set of 23 languages. The model was included in the WMT24 evaluation by Kocmi et al. (2024a), where it achieved the best translation performance among the tested open-source models. The model is a decoder-only transformer with 40 layers, a model dimension of 8192, and 64 attention heads per layer. Using WMT24 allows us to extend our evaluation to a state-of-the-art LLM, given the popularity of such systems in MT (Kocmi et al., 2023).
QE4PE The QE4PE dataset, introduced in Chapter 9, was created to measure the effect of word-level error highlights when included in real-world human post-editing workflows. The QE4PE data provides granular behavioral metrics to evaluate the speed and quality of post-editing of 12 annotators for En\(\rightarrow\)It and En\(\rightarrow\)Nl across two challenging textual domains (social posts and biomedical abstracts) and four error span highlighting modalities, including the unsupervised Surprisal MCD\(_{\text{var}}\) method and the supervised xcomet-xxl that we also test in this study. Since the presence of error span highlights was found to influence the editing choices of human editors, we limit our evaluation to the six human annotators per language who post-edited sentences without any highlights (3 for the Oracle Post-edit task to produce initial human-based highlights, and 3 for the No Highlight modality in the main task). This prevents us from biasing our evaluation of WQE metrics in favor of the metrics that influenced editing choices. As for DivEMT, we use the post-edits over translations—in this case, those of the NLLB 3.3B model (NLLB Team et al., 2024)—to produce token-level error spans, enabling an evaluation of WQE metrics across multiple annotation sets.
| | DivEMT | WMT24 | QE4PE |
|---|---|---|---|
| Languages | en→ar,it,nl,tr,uk,vi | en→ja,zh,hi,cs,ru; cs→uk | en→it,nl |
| Error type | Post-edit | Annotation | Post-edit |
| Label sets | 1 | 1 | 6 |
| Domains | Wiki | Multiple | Social, Biomed |
| MT Model | mBART-50 | Aya23 | NLLB |
| # Segments | 2580 | 5124 | 3888 |
10.4 Evaluated Metrics
The following metrics were evaluated using the Inseq library introduced in Chapter 3.
Predictive Distribution Metrics We use the Surprisal of the predicted token \(t^{*}\), computed as its negative log-probability \(-\log p(t^{*}_i|t_{<i})\), and the Entropy \(H\) of the output distribution \(P_N\) over the vocabulary \(\mathcal{V}\), \(-\sum_{v \in \mathcal{V}} p(v|t_{<i}) \log_2 p(v|t_{<i})\), as simple metrics to quantify pointwise and full prediction uncertainty (Fomicheva et al., 2020). For surprisal, we also compute its expectation (MCD\(_\text{avg}\)) and variance (MCD\(_\text{var}\)) with \(T=10\) stochastic forward passes of Monte Carlo Dropout (MCD, Gal and Ghahramani, 2016) to obtain a robust estimate and a measure of epistemic uncertainty in predictions, respectively. Intuitively, epistemic uncertainty reflects models’ lack of knowledge rather than data ambiguity.3 We employ the mean of the negative log probabilities as a robust estimate of surprisal:
\[\text{Surprisal MCD}_{\text{avg}} = \hat y_{\text{MCD}} = \frac{1}{T} \sum_{t=1}^{T} - \log p(x | \Theta_t)\]
Moreover, we estimate epistemic uncertainty by calculating the variance of these surprisal values under the same setup:
\[\text{Surprisal MCD}_{\text{var}} = \frac{1}{T} \sum_{t=1}^{T} \big(- \log p(x | \Theta_t) - \hat y_{\text{MCD}} \big)^2\]
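As an illustration, the sketch below shows how these predictive-distribution metrics can be extracted with Hugging Face transformers while teacher-forcing the annotated translation, as in our setup (Section 10.5.1). The model name is an example, and MCD is approximated by simply keeping dropout active at inference time; this is a minimal sketch, not the exact implementation used in the study.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative model; any seq2seq MT model with dropout layers supports the MCD variants.
name = "facebook/mbart-large-50-one-to-many-mmt"
model = AutoModelForSeq2SeqLM.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="en_XX", tgt_lang="it_IT")

def predictive_metrics(src: str, mt: str, n_mcd: int = 10) -> dict[str, torch.Tensor]:
    """Per-token surprisal, output entropy, and MCD mean/variance of surprisal,
    computed while force-decoding the annotated translation `mt`."""
    enc = tokenizer(src, return_tensors="pt")
    labels = tokenizer(text_target=mt, return_tensors="pt").input_ids

    def forward_surprisal_entropy() -> tuple[torch.Tensor, torch.Tensor]:
        logp = F.log_softmax(model(**enc, labels=labels).logits, dim=-1)  # [1, T, |V|]
        surprisal = -logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)    # -log p(t*_i | t_<i)
        entropy = -(logp.exp() * logp).sum(-1)                            # nats; divide by ln 2 for bits
        return surprisal.squeeze(0), entropy.squeeze(0)

    model.eval()
    with torch.no_grad():
        surprisal, entropy = forward_surprisal_entropy()
        model.train()  # Monte Carlo Dropout: keep dropout active at inference
        mcd = torch.stack([forward_surprisal_entropy()[0] for _ in range(n_mcd)])
        model.eval()

    return {
        "surprisal": surprisal,
        "entropy": entropy,
        "surprisal_mcd_avg": mcd.mean(dim=0),
        "surprisal_mcd_var": mcd.var(dim=0),  # variance of surprisal across stochastic passes
    }
```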
Vocabulary Projections We use the Logit Lens method (LL, nostalgebraist, 2020), introduced in Section 2.1.3, to extract probability distributions \(P_0, \dots, P_{N-1}\) over \(\mathcal{V}\) from intermediate activations at every layer \(l_0, \dots, l_{N-1}\) of the decoder. We use the surprisal for the final prediction at every layer (LL-Surprisal) to assess the presence of layers with high sensitivity to incorrect predictions. For the NLLB and mBART-50 models, we also apply a final layer normalization before the projection, following the model architecture. For the Aya model, we instead scale logits by \(0.0625\) (the default logit_scale defined in the model configuration). Following the residual stream view of the transformer model (Elhage et al., 2021), the resulting logits offer insight into the model’s predictive confidence at that specific depth of processing. Then, we compute the KL divergence between every layer distribution and the final distribution \(P_N\), e.g. \(\text{KL}(P_{N-1}\|P_N)\), to highlight trends in the shift in predictive probability produced by the application of the remaining layers (LL KL-Div). Finally, we adapt the approach of Baldock et al. (2021) and use the index of the first layer for which the final prediction corresponds to the top logit, \(l \;\text{s.t.}\;\arg \max P_l = t^{*}\) and \(\arg \max P_i \neq t^{*} \;\forall i<l\), as a metric of model confidence (LL Pred. Depth).
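The sketch below illustrates these layer-wise quantities under the assumption that intermediate decoder states can be normalized with the decoder's final LayerNorm and projected through the model's `lm_head`, as done for mBART-50 and NLLB above; attribute names and normalization details vary across architectures, so this is a simplified approximation rather than the exact procedure.

```python
import torch
import torch.nn.functional as F

def logit_lens_metrics(model, enc, labels):
    """Layer-wise logit-lens surprisal, KL(P_l || P_N), and prediction depth for a
    force-decoded target (sketch; assumes an mBART/NLLB-style encoder-decoder)."""
    with torch.no_grad():
        out = model(**enc, labels=labels, output_hidden_states=True)
        # decoder_hidden_states = (embeddings, layer_1, ..., layer_N), each [1, T, d]
        states = torch.stack(out.decoder_hidden_states[1:])        # [N, 1, T, d]
        states = model.get_decoder().layer_norm(states)            # final LN before projection (assumed)
        logp = F.log_softmax(model.lm_head(states), dim=-1)        # [N, 1, T, |V|]

    tgt = labels.unsqueeze(0).unsqueeze(-1).expand(logp.size(0), -1, -1, -1)
    ll_surprisal = -logp.gather(-1, tgt).squeeze(-1)               # [N, 1, T]
    kl_div = (logp.exp() * (logp - logp[-1])).sum(-1)              # KL(P_l || P_N) per layer and token

    # Prediction depth: index of the first layer whose top-1 token matches the forced target t*.
    matches = logp.argmax(-1) == labels.unsqueeze(0)               # [N, 1, T]
    pred_depth = matches.int().argmax(dim=0)                       # 0 if no layer matches (edge case)
    return ll_surprisal, kl_div, pred_depth
```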
Context mixing We employ simple estimates of context relevance using the attention weights produced during the transformer attention operation. More specifically, for every attention head at every layer of the decoder module, we extract a score for every token in the preceding context. We then use the entropy of the distribution of attention weights4 over the previous context as a simple measure of information locality during inference (Ferrando et al., 2022; Mohebbi et al., 2023). Following Fomicheva et al. (2020), we experiment with using the mean and the maximum entropy across all attention heads of all layers as separate metrics (Attn. Entropy\(_{\text{avg/max}}\)). Finally, we evaluate the Between Layer OOD method by Jelenić et al. (2024), employing gradients to estimate layer transformation smoothness for OOD detection (BLOOD).
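A minimal sketch of the attention-entropy metrics, assuming a Hugging Face encoder-decoder model that returns per-layer `decoder_attentions` and `cross_attentions` when called with `output_attentions=True`; self- and cross-attention weights are concatenated and renormalized as described in the footnote.

```python
import torch

def attention_entropy(model, enc, labels):
    """Per-token entropy of attention over the preceding context, aggregated as the
    mean and maximum across all heads of all layers (sketch)."""
    with torch.no_grad():
        out = model(**enc, labels=labels, output_attentions=True)
    entropies = []
    for self_attn, cross_attn in zip(out.decoder_attentions, out.cross_attentions):
        # self_attn: [1, heads, T_tgt, T_tgt]; cross_attn: [1, heads, T_tgt, T_src]
        weights = torch.cat([self_attn, cross_attn], dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)          # renormalize over all attended positions
        ent = -(weights * torch.log(weights + 1e-9)).sum(-1)       # [1, heads, T_tgt]
        entropies.append(ent)
    ent = torch.cat(entropies, dim=1)                              # stack heads from all layers
    return ent.mean(dim=1).squeeze(0), ent.max(dim=1).values.squeeze(0)  # avg / max per token
```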
Supervised baselines We also test the state-of-the-art supervised xcomet WQE models (Guerreiro et al., 2024), introduced in Section 2.6. In this chapter, we focus on their word-level error span prediction capabilities in a quality estimation setup, where the model classifies every input token according to MQM severity levels {ok, minor, major, critical} with a learned linear layer.5 Contrary to the continuous metrics described above, the binary labels produced by xcomet cannot be easily calibrated to match subjective annotation propensity. Hence, we propose to adapt the xcomet metric to use the sum of probabilities for all error types as a token-level continuous confidence metric, \(s(t^{*}) = p(\text{minor}) + p(\text{major}) + p(\text{critical})\), which we dub xcomet\(_{\text{conf}}\).
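The confidence-weighted variant reduces to a simple transformation of per-token class probabilities. The sketch below assumes `token_probs` holds the [T, 4] probabilities over {ok, minor, major, critical} exposed by the model's word-level classification head; obtaining this matrix requires going below the standard unbabel-comet `predict` interface, which is omitted here.

```python
import torch

SEVERITIES = ["ok", "minor", "major", "critical"]

def xcomet_conf(token_probs: torch.Tensor) -> torch.Tensor:
    """Continuous error confidence per token: s(t*) = p(minor) + p(major) + p(critical)."""
    return token_probs[:, 1:].sum(dim=-1)

def xcomet_default(token_probs: torch.Tensor) -> list[str]:
    """Default discrete behavior: argmax over severity classes per token."""
    return [SEVERITIES[i] for i in token_probs.argmax(dim=-1)]
```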
Human Editors For QE4PE, we report the min/mean/max agreement between each annotator’s edited spans and those of the other five editors as a less subjective “human-level” quality measure.
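One way such a human baseline can be computed is via leave-one-out scoring of each annotator's binary token labels against those of the other editors; the aggregation below is an illustrative sketch under this assumption, not necessarily the exact procedure used in the study.

```python
import numpy as np
from sklearn.metrics import f1_score

def human_agreement(label_sets: list[np.ndarray]) -> tuple[float, float, float]:
    """Min/mean/max leave-one-out agreement: each annotator's binary token labels are
    scored (here with F1) against every other annotator's labels and averaged."""
    scores = []
    for i, pred in enumerate(label_sets):
        others = [f1_score(gold, pred) for j, gold in enumerate(label_sets) if j != i]
        scores.append(float(np.mean(others)))
    return min(scores), float(np.mean(scores)), max(scores)
```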
10.5 Experiments
10.5.1 Setup
Token-level Evaluation Error spans used as labels in our evaluation are defined at the character level, while metric scores depend on the tokenization employed by either the MT model (for unsupervised metrics) or xcomet (for supervised metrics). To facilitate comparison, we label tokens as part of an error span if at least one character contained within them was marked as an error or edited by an annotator. Table 10.2 and Table 10.3 provide examples of various segmentations for the same MT output.
| Source\(_{\text{en}}\) | So why is it that people jump through extra hoops to install Google Maps? |
| MT\(_{\text{it}}\) (NLLB) | Quindi perché le persone devono fare un salto in più per installare Google Maps? |
| Annotator \(t1\) | Quindi perché le persone devono fare un passaggio in più per installare Google Maps? |
| Annotator \(t2\) | Quindi perché le persone fanno i salti mortali per installare Google Maps? |
| Annotator \(t3\) | Quindi perché le persone effettuano dei passaggi ulteriori e superflui per installare Google Maps? |
| Annotator \(t4\) | Allora perché le persone fanno un passaggio in più per installare Google Maps? |
| Annotator \(t5\) | E allora mi chiedo: perché gli utenti iPhone si affannano tanto per installare Google Maps? |
| Annotator \(t6\) | Quindi perché le persone fanno di tutto per installare Google Maps? |
| Edit Counts (Figure 10.3) | Quindi perché le persone devono fare un salto in più per installare Google Maps? |
| xcomet-xl | Quindi perché le persone devono fare un salto in più per installare Google Maps? |
| xcomet-xxl | Quindi perché le persone devono fare un salto in più per installare Google Maps? |
| xcomet-xl\(_{\text{conf}}\) | Quindi perché le persone devono fare un salto in più per install are Google Maps ? |
| xcomet-xxl\(_{\text{conf}}\) | Quindi perché le persone devono fare un salto in più per install are Google Maps ? |
| Surprisal MCD\(_{\text{var}}\) | Quindi perché le persone devono fare un sal to in più per installare Google Maps ? |
| Source\(_{\text{en}}\) | So the challenges in this are already showing themselves. I'm likely going to have a VERY difficult time getting a medical clearance due to the FAA's stance on certain medications. |
| MT\(_{\text{cs}}\) (Aya23) | Takže problémy s tím se již projevují. Pravděpodobně budu mít PŘESNĚ obtížný čas dostat lékařské potvrzení kvůli postoji FAA k některým lékům. |
| Annotator | Takže problémy s tím se již projevují. Pravděpodobně budu mít PŘESNĚ obtížný čas dostat lékařské potvrzení kvůli postoji FAA k některým lékům. |
| xcomet-xl | Takže problémy s tím se již projevují. Pravděpodobně budu mít PŘESNĚ obtížný čas dostat lékařské potvrzení kvůli postoji FAA k některým lékům |
| xcomet-xxl | Takže problémy s tím se již projevují. Pravděpodobně budu mít PŘESNĚ obtížný čas dostat lékařské potvrzení kvůli postoji FAA k některým lékům. |
| xcomet-xl\(_{\text{conf}}\) | Takže problémy s tím se již projevují . Pravděpodobně budu mít PŘESNĚ obtížný čas dostat lékařské potvrzení kvůli postoji FAA k některým lékům . |
| xcomet-xxl\(_{\text{conf}}\) | Takže problémy s tím se již projevují . Pravděpodobně budu PŘESNĚ obtížný čas dostat lékařské potvrzení kvůli postoji FAA k některým lékům . |
| Out. Entropy | Takže problémy s tím se již projevují . Pravděpodobně budu mít PŘESNĚ obtížný čas dostat lékařské potvrzení kvůli postoji FAA k některým lékům . |
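A minimal sketch of the character-to-token alignment described above, using (start, end) character offsets for tokens (e.g. from a fast tokenizer with `return_offsets_mapping=True`); the offsets and error span in the example are hypothetical.

```python
def char_to_token_labels(token_offsets: list[tuple[int, int]],
                         error_spans: list[tuple[int, int]]) -> list[int]:
    """A token is labeled as an error (1) if at least one of its characters falls
    inside an annotated error span; otherwise it is labeled 0."""
    labels = []
    for tok_start, tok_end in token_offsets:
        overlaps = any(tok_start < span_end and span_start < tok_end
                       for span_start, span_end in error_spans)
        labels.append(int(overlaps))
    return labels

# Example with hypothetical word-level offsets over "Quindi perché le persone devono fare un salto":
offsets = [(0, 6), (7, 13), (14, 16), (17, 24), (25, 31), (32, 36), (37, 39), (40, 45)]
print(char_to_token_labels(offsets, error_spans=[(40, 45)]))  # -> [0, 0, 0, 0, 0, 0, 0, 1]
```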
Constraining generation Evaluating metrics at the word level can be challenging due to the need for exact correspondence between model generations and annotated spans. For this reason, we extract unsupervised metrics during generation while force-decoding the annotated outputs from the MT model to ensure perfect adherence to the annotated error spans. In general, such an approach could introduce a problematic confounder in the evaluation, as observed results may be the product of constraining a model towards an unnatural generation, rather than reflecting the underlying phenomena. However, in this study, we carefully ensure that the generation setup exactly matches the one of the previous works in which the annotated translations were produced, using the same MT model and the same inputs.6 Hence, the constraining process serves as a simple assurance of conformity in light of potential discrepancies introduced by different decoding strategies, and does not affect the soundness of our method.
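In practice, this conformity can be guarded with a simple round-trip check before extracting metrics; the sketch below illustrates the idea and is not the exact check used in the study.

```python
def forced_target_ids(tokenizer, mt_text: str) -> list[int]:
    """Re-tokenize the annotated MT output for force-decoding and verify that
    detokenization reproduces the annotated string, so that token-level scores
    can be aligned with annotated character spans without drift."""
    ids = tokenizer(text_target=mt_text).input_ids
    roundtrip = tokenizer.decode(ids, skip_special_tokens=True)
    if roundtrip.strip() != mt_text.strip():
        raise ValueError(f"Forced target drifted from annotated output:\n{roundtrip}\n{mt_text}")
    return ids
```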
10.5.2 Results
How Accurate are Unsupervised WQE Metrics? Table 10.4 reports the average metric performance across all translation directions for the tested datasets.7 We report Average Precision (AP) as a general measure of metric quality across the full score range, and we estimate calibrated metric performance as the best F1 score (F1*) across all thresholds for binarizing continuous metric scores into positive/negative labels matching the human annotation.8 Our results show that, despite high variability in error span prevalence across different models, languages, and annotators, metric rankings remain generally consistent, suggesting the presence of robust relations between various signals sourced from models’ inner workings and translation errors.
| | Method | DivEMT | | WMT24 | | QE4PE | |
|---|---|---|---|---|---|---|---|
| | | AP | F1\(^{*}\) | AP | F1\(^{*}\) | AP | F1\(^{*}\) |
| | Random | .34 | .50 | .05 | .09 | .17 | .27 |
| unsupervised | Surprisal | .43 | .53 | .08 | .13 | .23 | .32 |
| | Out. Entropy | .46 | .51 | .10 | .16 | .23 | .31 |
| | Surprisal MCD\(_{\text{avg}}\) | .43 | .53 | - | - | .24 | .33 |
| | Surprisal MCD\(_{\text{var}}\) | .47 | .54 | - | - | .26 | .34 |
| | LL Surprisal\(_{\text{best}}\) | .42 | .53 | .09 | .15 | .23 | .32 |
| | LL KL-Div\(_{\text{best}}\) | .43 | .51 | .07 | .12 | .20 | .29 |
| | LL Pred. Depth | .39 | .51 | .06 | .12 | .20 | .29 |
| | Attn. Entropy\(_{\text{avg}}\) | .37 | .50 | .05 | .09 | .18 | .28 |
| | Attn. Entropy\(_{\text{max}}\) | .34 | .50 | .05 | .09 | .16 | .28 |
| | BLOOD\(_{\text{best}}\) | .34 | .50 | - | - | .17 | .28 |
| supervised | xcomet-xl | .42 | .45 | .09 | .19 | .23 | .34 |
| | xcomet-xl\(_{\text{conf}}\) | .54 | .55 | .15 | .23 | .32 | .37 |
| | xcomet-xxl | .43 | .41 | .09 | .20 | .22 | .31 |
| | xcomet-xxl\(_{\text{conf}}\) | .56 | .55 | .16 | .24 | .33 | .37 |
| human | Hum. Editors\(_{\text{min}}\) | - | - | - | - | .24 | .34 |
| | Hum. Editors\(_{\text{avg}}\) | - | - | - | - | .28 | .41 |
| | Hum. Editors\(_{\text{max}}\) | - | - | - | - | .32 | .47 |
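For reference, both evaluation measures can be computed from token-level binary labels and continuous metric scores with scikit-learn; a minimal sketch:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

def ap_and_best_f1(labels: np.ndarray, scores: np.ndarray) -> tuple[float, float]:
    """Average Precision over the full score range, and F1* as the best F1
    across all binarization thresholds of the continuous metric scores."""
    ap = average_precision_score(labels, scores)
    precision, recall, _ = precision_recall_curve(labels, scores)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-9, None)
    return float(ap), float(f1.max())
```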
Among unsupervised metrics, we find those based on the output distribution to be most effective at identifying error spans, in line with previous segment-level QE results (Fomicheva et al., 2020). Notably, Surprisal MCD\(_{\text{var}}\) shows strong performance, in line with the default xcomet models. For the multi-label QE4PE dataset, we find that the best supervised metrics score on par with the average human annotator consensus (Hum. Editors\(_{\text{avg}}\)), while unsupervised metrics generally obtain lower performance.
Confidence Weighting Enables xcomet Calibration The results in Table 10.4 show that default xcomet metrics underperform the best unsupervised techniques, a surprising result given their ad-hoc tuning. By contrast, simple continuous scores derived from xcomet (xcomet\(_{\text{conf}}\)) consistently reach better results across all tested sets. Figure 10.2 shows the precision-recall tradeoff for these metrics on the EN\(\rightarrow\)IT subset of the DivEMT dataset.9 In their default form, commonly used for evaluation via the unbabel-comet library, xcomet metrics consistently outperform Surprisal MCD\(_{\text{var}}\) in terms of precision (51-60%, compared to 34% optimal precision for MCD\(_{\text{var}}\)), but identify only 26-32% of tokens annotated as errors, resulting in lower AP.


The low recall of these metrics may be problematic in WQE applications, where omitting an error could result in oversights by human post-editors, who may trust the comprehensiveness of WQE predictions. By contrast, the confidence-weighted xcomet\(_{\text{conf}}\) shows strong performance across the whole recall range, resulting in consistent improvements in both F1* and AP (Table 10.4). Concretely, these results confirm that default xcomet performance does not reflect the full capacity of the metric, and that operating with granular confidence scores can be beneficial when calibration is possible.
Metrics Performance for Multiple Annotations While our evaluation so far employed human error span annotations as binary labels, we set out to assess how more granular labeling schemes impact metrics’ performance. Given \(L\) sets of binary labels (up to 6 per language for QE4PE), we assign a score \(s \in \{1,\dots,L\}\) to every MT token using the number of annotators that marked it as an error, resulting in edit counts reflecting human agreement rate, as shown in Table 10.2.
Figure 10.3 presents the correlation of various metrics as the number of available annotators increases, with median values and confidence bounds obtained from edit counts across all combinations of \(L\) label sets.10 The increasing trend in correlations across all reported metrics indicates that these methods effectively reflect the aleatoric uncertainty in error span labels, i.e., the disagreement between various annotators. In particular, the Surprisal MCD\(_{\text{var}}\) metric sees a steeper correlation increase than other well-performing metrics, surpassing the default supervised xcomet approaches for higher correlation bins. This suggests that the epistemic uncertainty derived from noisy model predictions might be a promising way to anticipate the aleatoric uncertainty across human annotators for WQE. We observe that 95% confidence intervals for high-scoring metrics largely overlap when a single set of labels is used, indicating that rankings of metric performance are subject to change depending on the subjective choices of the annotator. While this poses a problem when attempting a robust evaluation of WQE metrics, we remark that including multiple annotations largely mitigates this issue. As a result, we recommend explicitly accounting for human label variation by including multiple error annotations in future WQE evaluations to ensure generalizable findings.
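A sketch of this analysis is shown below: token-level edit counts are computed for every subset of annotators and correlated with a metric's continuous scores. Spearman correlation and percentile confidence bounds are assumptions made here for illustration; the exact statistics used in the study may differ.

```python
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def edit_count_correlations(label_sets: list[np.ndarray], metric_scores: np.ndarray):
    """For every subset size L, correlate per-token edit counts (number of annotators
    marking each token as an error) with the metric's continuous scores, and report
    the median and a 95% percentile interval across all subsets of that size."""
    results = {}
    n = len(label_sets)
    for size in range(1, n + 1):
        corrs = [
            spearmanr(np.sum([label_sets[i] for i in subset], axis=0), metric_scores).correlation
            for subset in combinations(range(n), size)
        ]
        results[size] = (np.median(corrs), np.percentile(corrs, 2.5), np.percentile(corrs, 97.5))
    return results
```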
10.6 Limitations
Our findings are accompanied by several limitations. Firstly, our choice of tested datasets was limited by the availability of annotated outputs generated by open-source MT models. While several other datasets matching these criteria exist (Fomicheva et al., 2022; Yang et al., 2023; Dale et al., 2023b), we restricted our assessment to a subset sufficient to ensure diversity across languages and tested models in support of our findings. To facilitate comparison with other datasets, our evaluation for WMT24 treats available error spans as binary labels and does not directly account for error severity in human-annotated spans. Our choice of unsupervised metrics was primarily driven by previous work on uncertainty quantification in MT and by the ease of implementation of popular methods from the mechanistic interpretability literature (Ferrando et al., 2024). However, our choices in the latter category were limited, as most methods are nowadays developed and tested specifically for decoder-only transformer models. Finally, despite their strong performance, we found that unsupervised methods based on MCD require substantial computational resources, and as such, we were unable to evaluate them on Aya23 35B. While our primary focus was to establish baseline performances across various popular methods, future work should leverage the latest insights from more advanced techniques, such as those requiring the tuning of vocabulary projections (Belrose et al., 2023; Yom Din et al., 2024) or the identification of “confidence neurons” that modulate predictive entropy (Stolfo et al., 2024).
10.7 Conclusion
We conducted a comprehensive evaluation of supervised and unsupervised WQE metrics across multiple languages and annotation sets. Our results show that, while unsupervised metrics generally lag behind state-of-the-art supervised systems, some uncertainty quantification methods based on the predictive distribution show promising correlation with human label variation. Moreover, we find that popular supervised WQE metrics generally have low levels of recall and can benefit from confidence weighting when calibration is possible. Finally, individual annotator preferences are key confounders in WQE evaluations and can be mitigated by using multiple annotation sets.
We offer the following practical recommendations for evaluating WQE systems:
Use agreement between multiple human annotations to control the effect of subjective preferences and rank WQE metrics robustly.
Employ an in-distribution calibration set of error spans before testing to ensure fair metric comparisons, and favor evaluations accounting for precision-recall tradeoffs to ensure their usability across various confidence levels.
Consider using continuous WQE metrics in real-world applications such as WQE-augmented post-editing to convey fine-grained confidence variations: previous work showed the effectiveness of visualizations reflecting prediction confidence (Vasconcelos et al., 2025), such as highlights for various error severity levels (Sarti et al., 2025a).
This final assessment concludes our investigation into the potential of model processing signals for enhancing the downstream verification of machine-translated content, converting interpretability methods commonly used for model analysis into practical tools for improving decision-making in real-world human-AI interaction settings.
https://github.com/wmt-conference/wmt-collect-translations↩︎
MCD is tested only on encoder-decoder models since Aya layers do not include dropout. The MCD\(_\text{var}\) setting corresponds to the Unsupervised setting from Chapter 9.↩︎
For encoder-decoder models, self-attention and cross-attention weights are concatenated and renormalized.↩︎
The default xcomet metric was used with the unbabel-comet library (v2.2.6).↩︎
Generation parameters such as sampling temperature are not relevant in this setting, since they only alter the selection of the next output token, which we fix via force-decoding.↩︎
Full breakdown available in Table C.16, Table C.17, Table C.18, Table C.19.↩︎
Random baseline AP values match the proportion of tokens marked as errors, which can vary greatly.↩︎
Results for all datasets in Figure C.11, Figure C.12, Figure C.13, Figure C.14.↩︎
\(x\)=1 corresponds to binary labels from previous sections.↩︎