2 Background
Distress not yourself if you cannot at first understand the deeper mysteries of Spaceland. By degrees they will dawn upon you.
– Edwin A. Abbott, Flatland: A Romance of Many Dimensions, 1884
This chapter provides a succinct introduction to the topics discussed in the experimental chapters of this dissertation. Rather than offering a comprehensive review of the relevant literature, it aims to provide key background for the research presented in this manuscript.
In particular, Section 2.1 and Section 2.4 discuss the basic functioning of neural network-based language models and machine translation (MT) systems, and introduce the machine translation task representing the core focus of this work. Section 2.2 and Section 2.3 provide an introduction to methods for attributing inputs and conditioning generation in language models, corresponding to the topics discussed in Part I and Part II. Finally, Section 2.5 and Section 2.6 delve deeper into the translation domain, providing an overview of how MT models are employed in the translation industry by human post-editors, and discussing techniques for automatically evaluating machine translation quality. These notions provide valuable background for Part III, which focuses on the impact of interpretability insights on human translation workflows.
Beyond this background section, each experimental chapter briefly summarizes relevant literature to contextualize the research questions and findings.
2.1 From Neural Networks to Neural Language Models
Neural networks are computational models which integrate principles from statistical learning theory (Vapnik, 1995), consisting of interconnected nodes (neurons) organized in layers, where each connection has an associated weight. Formally, a neural network is a function \(\mathbf{f}: \mathcal{X} \to \mathcal{Y}\) that maps inputs \(\mathbf{x} \in \mathcal{X}\) to outputs \(\mathbf{y} \in \mathcal{Y}\), where \(\mathcal{X}\) and \(\mathcal{Y}\) are the input and output feature spaces, respectively. The function \(\mathbf{f}\) is parameterized by weights \(\mathbf{\theta} \in \Theta\), which are typically learned from data through the training process described in Section 2.1.1. Individual neurons are functions parameterized by weights \(\mathbf{w} \in \mathbb{R}^d\) and biases \(b \in \mathbb{R}\), which are combined to produce an output \(\sigma(\mathbf{w}^T\mathbf{x} + b)\), where \(\sigma\) is a non-linear activation function. Thanks to these non-linearities, sequences of neurons can learn to represent complex relations in the input vector \(\mathbf{x}\).1
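To make this notation concrete, the following minimal sketch (using PyTorch, with purely illustrative dimensions and random weights) composes two layers of neurons into a small network:

```python
import torch

# A minimal two-layer neural network: each layer applies sigma(Wx + b),
# with a ReLU non-linearity between layers. Sizes are illustrative.
x = torch.randn(4)                              # input vector x
W1, b1 = torch.randn(8, 4), torch.randn(8)      # weights and biases of 8 neurons
W2, b2 = torch.randn(1, 8), torch.randn(1)      # a single output neuron

h = torch.relu(W1 @ x + b1)    # first layer: 8 neuron outputs sigma(w^T x + b)
y = W2 @ h + b2                # network output f(x; theta)
```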
2.1.1 Supervised Learning for Neural Networks
In the supervised learning paradigm, given a training dataset \(\mathcal{D}\) containing paired instances:
\[\mathcal{D} = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_N, \mathbf{y}_N)\} \in (\mathcal{X} \times \mathcal{Y})^N \tag{2.1}\]
where \(\mathbf{x}_i\) is a vector of input features and \(\mathbf{y}_i\) is the expected output, a neural network is trained to learn a functional mapping \(\mathbf{f}\) from inputs \(\mathbf{x}\) to labels \(\mathbf{y}\) by minimizing the average value of a loss function \(\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}\), such that \(\ell(\mathbf{f}(\mathbf{x}), \mathbf{y})\) quantifies the gap between predicted outcomes \(\mathbf{f}(\mathbf{x})\) and ground truth \(\mathbf{y}\) over examples in \(\mathcal{D}\). The function \(\mathbf{f}\) is parameterized by weights \(\mathbf{\theta} \in \Theta\), which are optimized during training so as to minimize the loss function. This optimization is typically performed using some variant of stochastic gradient descent (SGD), in which iterative steps \(1, \dots, t, \dots, T\) are taken to update \(\mathbf{\theta}\) in the direction of the negative gradient of the loss function with respect to the weights:
\[\mathbf{\theta}_{t+1} \leftarrow \mathbf{\theta}_t - \eta \;\nabla_{\mathbf{\theta}} \,\ell(\mathbf{f}(\mathbf{x}_j; \mathbf{\theta}_t), \mathbf{y}_j) \tag{2.2}\]
where \(\eta\) is a chosen learning rate, and \(\mathbf{x}_j\) and \(\mathbf{y}_j\) are a subset of randomly sampled input-output pairs from the training set \(\mathcal{D}\), typically referred to as a mini-batch. This iterative refinement of model parameters is repeated until convergence, i.e. until model performance on a left-out validation set no longer improves significantly, allowing for robust convergence to a local minimum of the loss function, even for non-convex problems and high-dimensional parameter spaces.
We commonly refer to the inference process going from input \(\mathbf{x}\) to output \(\mathbf{y}\) as the forward pass, and to the process of computing gradients and updating model parameters as the backward pass.
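The following sketch illustrates this training loop in PyTorch on a hypothetical regression problem; the synthetic data, model sizes, learning rate and number of steps are arbitrary choices made only for illustration:

```python
import torch
from torch import nn

# Minimal supervised training sketch (Eq. 2.2) on synthetic regression data.
X = torch.randn(256, 4)                      # N = 256 inputs with 4 features
y = X.sum(dim=1, keepdim=True)               # synthetic ground-truth targets

f = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()                       # loss l(f(x), y)
opt = torch.optim.SGD(f.parameters(), lr=0.01)   # learning rate eta

for step in range(100):                      # iterative steps t = 1..T
    idx = torch.randint(0, 256, (32,))       # sample a mini-batch of 32 pairs
    loss = loss_fn(f(X[idx]), y[idx])        # forward pass
    opt.zero_grad()
    loss.backward()                          # backward pass: gradients w.r.t. theta
    opt.step()                               # theta <- theta - eta * grad
```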
2.1.2 Transformer Neural Networks
Transformers (Vaswani et al., 2017) are a class of neural networks that have become the de-facto standard for most natural language processing tasks, constituting the core neural network architecture employed throughout this thesis’ experiments. In essence, a transformer consists of a sequence of identical macro-layers, dubbed blocks, progressively contextualizing a sequence of input features \(\mathbf{Z} \in \mathbb{R}^{S \times d}\), where \(S\) is the sequence length and \(d\) is the size of each feature vector. Figure 2.1 illustrates the structure of a single transformer module, constituting the core of decoder-only language models such as GPT-3 (Brown et al., 2020) presented later in Section 2.1.3. Notably, the transformer architecture is characterized by its ability to process input sequences in parallel, as opposed to recurrent models (Rumelhart and McClelland, 1987; Hochreiter and Schmidhuber, 1997), making it highly efficient for training on large datasets.
We now describe the main components of a transformer block in order of execution during the forward pass, using \(\mathbf{z}_i \in \mathbb{R}^d\) to denote the input representations at each component for sequence element \(i\). This will be useful for explaining steering intervention and vocabulary projection methods used in Chapter 7 and Chapter 10, respectively.
Layer normalization (LN). The layer normalization operation, also known as LayerNorm (Ba et al., 2016), is a common approach for stabilizing the training process of deep neural networks. In practice, layer normalization applies the transformation:
\[\text{LN}(\mathbf{z}_i) = \frac{\mathbf{z}_i - \mu(\mathbf{z}_i)}{\sigma(\mathbf{z}_i)} \odot \gamma + \beta \tag{2.3}\]
where \(\mu, \sigma\) are the mean and the standard deviation of \(\mathbf{z}_i\), and \(\gamma, \beta\) are learnable scale and bias parameters for the normalization. This operation helps to mitigate issues related to internal covariate shift, improving convergence during training. In many recent models, LayerNorm has been replaced by RMSNorm (Zhang and Sennrich, 2019), which removes the mean centering step and scales the input using the root mean square (RMS) statistic.
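As an illustration, a minimal implementation of both normalization variants could look as follows (PyTorch, with an added epsilon for numerical stability):

```python
import torch

def layer_norm(z, gamma, beta, eps=1e-5):
    # Eq. 2.3: normalize each representation to zero mean and unit variance,
    # then rescale and shift with learnable parameters gamma and beta.
    mu = z.mean(dim=-1, keepdim=True)
    sigma = z.std(dim=-1, keepdim=True, unbiased=False)
    return (z - mu) / (sigma + eps) * gamma + beta

def rms_norm(z, gamma, eps=1e-5):
    # RMSNorm: no mean centering, scale by the root mean square statistic.
    rms = z.pow(2).mean(dim=-1, keepdim=True).sqrt()
    return z / (rms + eps) * gamma

d = 16                                        # illustrative feature size
z = torch.randn(3, d)                         # three token representations
out = layer_norm(z, torch.ones(d), torch.zeros(d))
```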
Multi-head self-attention (MHSA). The self-attention mechanism is the core component of the transformer architecture, allowing the model to contextualize its representations at each layer by combining information across the input sequence. While the original formulation of multi-head self-attention by Vaswani et al. (2017) involves a concatenation of attention head outputs before the final output projection, we follow here the more recent formulation by Kobayashi et al. (2021) and Elhage et al. (2021), which reformulates the attention output computation using the sum of individual attention heads, emphasizing the linear reading and writing operations within the attention computation.
Concretely, the self-attention module is composed of a series of \(H\) attention heads \(\text{Attn}_1, \ldots, \text{Attn}_H\), each computing the following weighted sum:
\[\text{Attn}_h(\mathbf{z}_i) = \sum_{j} \alpha^h_{ij} \mathbf{z}_j \mathbf{W}_V \mathbf{W}_O \tag{2.4}\]
Intuitively, the sharding of the attention mechanism into separate computations can be beneficial when processing the complex relations between different elements of the input sequence, for example, the lexical, syntactic and semantic dimensions of words in a text. The learnable weight matrices \(\mathbf{W}_V \in \mathbb{R}^{d \times d_h}\) and \(\mathbf{W}_O \in \mathbb{R}^{d_h \times d}\), where \(d_h\) represents the dimension of each head, can be combined into the so-called output-value (OV) circuit as \({\mathbf{W}_V \mathbf{W}_O = \mathbf{W}_{OV} \in \mathbb{R}^{d \times d}}\). For the current query position \(i\), the vector of attention weights \(\alpha^h_{i}\) over all key positions \(j\) is computed as:
\[\alpha^h_{i} = \text{softmax}\left(\frac{\mathbf{z}_i \mathbf{W}_Q (\mathbf{Z}\mathbf{W}_K)^T}{\sqrt{d_h}}\right) \tag{2.5}\]
Once again, the learnable weight matrices \(\mathbf{W}_Q \in \mathbb{R}^{d \times d_h}\) and \(\mathbf{W}_K \in \mathbb{R}^{d \times d_h}\) can be combined as the query-key (QK) circuit \({\mathbf{W}_Q \mathbf{W}_K^T = \mathbf{W}_{QK} \in \mathbb{R}^{d \times d}}\). This decomposition enables a view of QK and OV circuits as the units responsible for reading from (QK) and writing to (OV) the residual stream. Finally, the attention block output is the sum of individual attention heads: \[\text{Attn}(\mathbf{z}_i) = \sum_{h=1}^{H} \text{Attn}_h(\mathbf{z}_i) \tag{2.6}\]
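A single attention head following Equations 2.4-2.6 can be sketched in a few lines; the dimensions and random matrices below are purely illustrative, and a full block would sum the outputs of \(H\) such heads:

```python
import torch

S, d, d_h = 5, 16, 4           # sequence length, model size, head size (illustrative)
Z = torch.randn(S, d)          # residual stream states z_1..z_S
W_Q, W_K, W_V = (torch.randn(d, d_h) for _ in range(3))
W_O = torch.randn(d_h, d)

def attention_head(Z):
    # Eq. 2.5: attention weights from the QK circuit.
    scores = (Z @ W_Q) @ (Z @ W_K).T / d_h**0.5      # (S, S)
    alpha = scores.softmax(dim=-1)
    # Eq. 2.4: weighted sum of value vectors through the OV circuit.
    return alpha @ (Z @ W_V @ W_O)                   # (S, d)

out = attention_head(Z)   # one head; the block output sums H such heads (Eq. 2.6)
```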
Residual connection. The introduction of residual connections (He et al., 2016) in the transformer architecture allows the model to learn identity mappings more easily, facilitating the training of deeper networks and avoiding the vanishing gradient problem (Hochreiter, 1998). A residual connection is commonly applied to the output of the self-attention module, resulting in:
\[\text{ResAttn}(\mathbf{z}_i) = \text{Attn}\big(\text{LN}(\mathbf{z}_i)\big) + \mathbf{z}_i \tag{2.7}\]
Feedforward network (FFN). The feedforward network (FFN) in the transformer block is composed of two learnable weight matrices2: \(\mathbf{W}_{\text{in}} \in \mathbb{R}^{d \times d_{\text{ffn}}}\) and \(\mathbf{W}_{\text{out}} \in \mathbb{R}^{d_{\text{ffn}} \times d}\). \(\mathbf{W}_{\text{in}}\) reads from the residual stream state \(\mathbf{z}\), and its result is passed through an element-wise non-linear activation function \(\sigma\), producing a set of neuron activations. These get transformed by \(\mathbf{W}_{\text{out}}\) to produce the output \(\text{FFN}(\mathbf{z})\):
\[\text{FFN}(\mathbf{z}_i) = \sigma(\mathbf{z}_i \mathbf{W}_{\text{in}}) \mathbf{W}_{\text{out}} \tag{2.8}\]
The FFN operation has been compared to a retrieval step from a key-value memory (Geva et al., 2021), with keys stored in the columns of \(\mathbf{W}_{\text{in}}\) acting as pattern detectors over the input sequence, and values in the rows of \(\mathbf{W}_{\text{out}}\) being upweighted by the respective neuron activations. The overall block structure from Figure 2.1 can then be summarized as:
\[\text{Block}(\mathbf{z}_i) = \text{FFN}\Big(\text{LN}\big(\text{ResAttn}(\mathbf{z}_i)\big)\Big) + \text{ResAttn}(\mathbf{z}_i) \tag{2.9}\]
We will henceforth use \(\mathbf{z}_i^l\) to denote the output of the \(l\)-th block for the \(i\)-th element of the input sequence for transformer models.
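For reference, the following sketch assembles these components into a pre-LN block using PyTorch modules; it uses a single attention head and illustrative sizes, and omits details such as dropout and causal masking found in real implementations:

```python
import torch
from torch import nn

class Block(nn.Module):
    # Pre-LN transformer block following Eq. 2.7-2.9 (single attention head,
    # illustrative sizes; real models use H heads and much larger dimensions).
    def __init__(self, d=16, d_ffn=64):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.w_in, self.w_out = nn.Linear(d, d_ffn), nn.Linear(d_ffn, d)

    def forward(self, z):
        h = self.ln1(z)
        res_attn = self.attn(h, h, h)[0] + z           # Eq. 2.7: ResAttn(z)
        ffn = self.w_out(torch.relu(self.w_in(self.ln2(res_attn))))
        return ffn + res_attn                          # Eq. 2.9: Block(z)

z = torch.randn(1, 5, 16)      # batch with one 5-token sequence
out = Block()(z)
```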
2.1.3 Transformer Language Models
A language model is a probabilistic model that can assign probabilities to sequences of tokens. Formally, given an input sequence \(\mathbf{X} = \langle t_1, \dots, t_S \rangle\) of \(S\) tokens, which in the case of natural language are typically words or subword units (Sennrich et al., 2016) from a vocabulary \(\mathcal{V}\), a language model \(f\) computes the probability of the sequence as the product of token-level conditional probabilities:
\[P(\mathbf{X}) = P(t_1, \dots, t_S) = \prod_{i=1}^{S} P(t_i|t_1, \dots, t_{i-1}) \tag{2.10}\]
Language models operating under such formulation are typically referred to as auto-regressive or causal language models (CLMs, or simply LMs), to differentiate them from masked language models (MLMs) trained to fill the blanks in a sequence (Devlin et al., 2019). While MLMs were the main object of analysis of early interpretability research on transformer models (Tenney et al., 2019; Clark et al., 2019; Rogers et al., 2020), this dissertation focuses solely on CLMs, which after the advent of ChatGPT3 in 2022 became the dominant paradigm in the NLP and interpretability community. CLMs are typically decoder-only models, following the structure introduced in Section 2.1.2, or encoder-decoder models, such as the MT systems later discussed in Section 2.4.
Importantly, LMs can be used for generating text by iteratively sampling from the probability distribution over the next token \(t_{i}\) given the previous tokens \(t_1, \dots, t_{i-1}\), e.g. using the greedy decoding sampling method:
\[t_i^* = \underset{t \in \mathcal{V}}{\arg\,\max}\;P(t\,|\,t_1, \ldots, t_{i-1}) \tag{2.11}\]
This sampling process can be repeated autoregressively, i.e. by adding the selected token \(t_i^*\) to the input sequence, until a special end-of-sequence token is generated, or until a maximum sequence length is reached.
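A minimal greedy decoding loop can be written directly on top of a pretrained checkpoint; the sketch below assumes the Hugging Face transformers library and uses gpt2 only as a readily available example model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Greedy decoding loop (Eq. 2.11): pick the argmax token and append it.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt").input_ids
for _ in range(10):                                    # at most 10 new tokens
    logits = model(ids).logits[0, -1]                  # next-token logits
    next_id = logits.argmax()                          # t* = argmax_t P(t | prefix)
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # append and repeat
    if next_id.item() == tok.eos_token_id:             # stop at end-of-sequence
        break
print(tok.decode(ids[0]))
```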
We now turn to the additional components required to convert the generic transformer model presented in the previous section into a language model able to process and generate sequences of tokens. Figure 2.2 shows a stylized view of a transformer LM.
Embedding layer. The first component of a transformer language model is the embedding layer, which maps input tokens to continuous vector representations, known as embeddings. Word embeddings such as Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) revolutionized the field of natural language processing by exploiting distributional semantics, i.e. the fact that words which frequently appear in similar contexts should have similar meaning (Harris, 1954), to learn word representations end-to-end using gradient descent. In transformers, the token embedding matrix \(\mathbf{E} \in \mathbb{R}^{|\mathcal{V}| \times d}\), where \(d\) is the size of the embedding vectors, and \(|\mathcal{V}|\) is the vocabulary size, is learned jointly with the rest of model parameters during training. The embedding layer maps each token \(t_i\) in the input sequence to its corresponding vector \(\mathbf{z}_i = \mathbf{E}[t_i]\). The resulting sequence of embeddings \(\mathbf{Z} \in \mathbb{R}^{S \times d}\) corresponds to the input to the first transformer block. It is important to note that representations produced by \(\mathbf{E}\) are not contextualized, i.e. the same token \(t_i\) will always be mapped to the same vector \(\mathbf{z}_i\), regardless of its meaning in the given sequence. For example, the word ring will always be mapped to the same vector, regardless of whether it is used as a noun or a verb. The transformer blocks are used to contextualize these representations, i.e. produce different vectors for the same token depending on the remainder of the sequence.
Positional encodings. While the sequential nature of language is an important factor in how we produce and process linguistic information, transformer models do not explicitly account for ordering across elements of the input sequence. For this reason, positional encodings injecting information about the position of each token in the sequence are commonly used in transformer-based language models. The most basic positional encoding is a fixed sinusoidal encoding (Vaswani et al., 2017), which is added directly to the input embeddings. Recent models, however, employ rotary position embeddings, allowing for the encoding of both absolute and relative positions between tokens, and allowing the model to generalize to longer contexts beyond those seen during training (Su et al., 2024).
Causal self-attention. The self-attention mechanism in transformer language models is causal, meaning that the attention weights for each token \(t_i\) are computed only over tokens at positions \(j \leq i\) in the sequence. This ensures that the model can only attend to the current and preceding tokens when predicting the next token, preserving the auto-regressive nature of the model. The causal self-attention mechanism is implemented by masking out future tokens in the attention computation, ensuring that \(\alpha^h_{ij}\) is computed only for \(j \leq i\) in Equation 2.4, and that only representations \(\mathbf{Z}_{\leq i}\) are used to compute key vectors in Equation 2.5.
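In practice, the causal constraint is typically enforced with a lower-triangular mask applied to the attention scores before the softmax, as in the following sketch:

```python
import torch

S = 5
scores = torch.randn(S, S)                            # unmasked attention scores
causal_mask = torch.tril(torch.ones(S, S)).bool()     # True where j <= i
scores = scores.masked_fill(~causal_mask, float("-inf"))
alpha = scores.softmax(dim=-1)   # each row i places zero weight on future positions j > i
```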
Prediction head. The prediction head of a transformer language model consists of a so-called unembedding matrix \(\mathbf{W}_{U} \in \mathbb{R}^{d \times |\mathcal{V}|}\) mirroring the initial embedding operation, sometimes accompanied by a bias. The last residual stream state \(\mathbf{z}_S^L\), where \(L\) is the number of transformer blocks and \(S\) is the sequence length, gets transformed by this linear map into a vector of next-token logits, which is turned into a probability distribution via the softmax function:
\[P(\,\cdot\,|t_1, \ldots, t_{i-1}) = \text{softmax}(\mathbf{z}_i^L \mathbf{W}_{U}) \tag{2.12}\]
In light of the residual stream view presented in Section 2.1.2, showing that different model components read from and write to the residual stream, it is natural to believe that the predictions derived by applying the unembedding matrix to the final residual stream state \(\mathbf{z}_S^L\) are the product of an iterative refinement across model components (Jastrzebski et al., 2018). The logit lens method (nostalgebraist, 2020), which we study for error detection in Chapter 10, exploits this intuition to analyze how the model refines the prediction throughout the forward pass, by projecting intermediate residual stream states \(\mathbf{z}_S^l\), with \(l < L\), to the vocabulary space using \(\mathbf{W}_{U}\).
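A bare-bones version of the logit lens can be obtained by projecting each intermediate hidden state to the vocabulary with the unembedding matrix; the sketch below assumes the Hugging Face transformers library and a GPT-2 checkpoint, whose final LayerNorm is applied before unembedding:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Logit lens sketch: project intermediate residual stream states with W_U.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt").input_ids
hidden = model(ids, output_hidden_states=True).hidden_states  # L+1 tensors (1, S, d)

for l, z in enumerate(hidden):
    z_last = model.transformer.ln_f(z[0, -1])          # last position, final LN
    logits = z_last @ model.lm_head.weight.T           # project with the unembedding
    print(l, tok.decode(logits.argmax()))               # top prediction at layer l
```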
Language model pre-training. Modern language models such as those employed in this thesis are typically pre-trained on large web corpora spanning billions or trillions of tokens using the next-token prediction objective, i.e. minimizing the cross-entropy loss between the next-token distribution predicted by the model and the next observed token. This frames the language model training problem as an instance of supervised learning, which we presented in Section 2.1.1. Formally, given a mini-batch \(D_t\) of a corpus \(\mathcal{D}\) composed of sequences of tokens \(\mathbf{X_k} = \langle t_1, \ldots, t_{S_k} \rangle\), the loss for a single training step is computed as:
\[\mathcal{L}_{\text{step}} = -\frac{1}{|D_t|} \sum_{\mathbf{X_k} \in D_t} \sum_{i=1}^{S_k} \log P(t_i|t_1, \ldots, t_{i-1}) \tag{2.13}\]
Concretely, this corresponds to maximizing the likelihood of the observed tokens given the context provided by the preceding tokens, while implicitly minimizing the likelihood assigned to all other vocabulary entries through the softmax normalization.
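In code, this loss is simply a cross-entropy between shifted logits and tokens; the sketch below uses random tensors in place of actual model outputs:

```python
import torch
import torch.nn.functional as F

# Next-token prediction loss (Eq. 2.13) for a single sequence of token ids,
# assuming `logits` of shape (S, |V|) produced by a language model.
S, V = 6, 100
logits = torch.randn(S, V)                 # placeholder model outputs
tokens = torch.randint(0, V, (S,))         # placeholder observed tokens

# Positions 1..S-1 are predicted from the logits at positions 0..S-2.
loss = F.cross_entropy(logits[:-1], tokens[1:])   # mean of -log P(t_i | t_<i)
```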
Language model post-training. After pre-training, language models can be used for generating text given some context, but mostly lack the ability to perform specific tasks without being provided explicit examples, or to respond to queries as conversational agents. For this reason, all language models used for our experiments underwent additional supervised fine-tuning (SFT, also known as instruction tuning), allowing them to learn input-output mappings for realistic user queries beyond natural text occurrences in the pre-training corpus (Howard and Ruder, 2018; Sanh et al., 2022). The fine-tuning process still involves the same \(\mathcal{L}_{\text{step}}\) loss function over a smaller, curated set of demonstrations. Some of the models we study—such as the Gemma 2 models from Chapter 7 or the Zephyr model from Chapter 5—underwent an additional reinforcement learning from human feedback (RLHF) step, in which the model is fine-tuned to maximize the likelihood of human preferences over pairs of model generations, using a reward model trained on human preferences. This process is typically performed using Proximal Policy Optimization [PPO; Schulman et al. (2017)] or similar reinforcement learning algorithms. Unless otherwise specified, we use the term language model to refer to transformer language models that were first pre-trained and then fine-tuned, representing the main focus of this thesis.
2.2 Explaining Predictions with Input Attribution
Contrary to linear models, where learned coefficients directly correspond to the influence of their respective features towards predictions, neural networks’ outcomes cannot be directly interpreted due to the presence of multiple nonlinearities across layers, rendering the attribution of model predictions to individual input features non-trivial. Input attribution methods, also known as feature attribution, were introduced to address this issue by providing a principled way to assign importance scores to input features, clarifying the rationales behind model decisions (Zeiler et al., 2011).
Formally, for a model \(\mathbf{f} \in \mathcal{F}: \mathcal{X} \to \mathcal{Y}\), given an input \(\mathbf{x} \in \mathcal{X}\), we can define the attribution method \(\gamma\) as a functional:
\[\gamma: \mathcal{X} \times \mathcal{F} \to \mathbb{R}^{|\mathcal{X}|}\]
so that \(\mathbf{a}_{\mathbf{f}(\mathbf{x})} = \gamma(\mathbf{x}, \mathbf{f})\) is a vector of attribution scores quantifying the influence of each element of \(\mathbf{x}\) on the model predictive distribution \(\mathbf{f}(\mathbf{x})\), with higher scores representing greater importance (Fel, 2024). It is worth noting that attribution methods can rely on one or more specific outcomes \(\mathbf{y} \in \mathcal{Y}\) from the predictive distribution \(\mathbf{f}(\mathbf{x})\), as in perturbation-based approaches (Covert et al., 2021), or simply rely on the flow of information within the model to identify important input elements (Abnar and Zuidema, 2020). We call the former methods target-dependent, and we discuss them further in Chapter 4.
2.2.1 Attribution Method Categories
We now briefly summarize common families of input attribution methods, which are employed throughout the first part of this thesis. An in-depth overview of input attribution techniques for natural language processing can be found in Madsen et al. (2022).
Gradient-based attribution For neural network models like transformer LMs, gradients are a natural source of input saliency which can be exploited for attribution purposes (Simonyan et al., 2014; Li et al., 2016). A simple gradient-based attribution corresponds to a first-order Taylor expansion of the model at a point \(\mathbf{x}\), expressed as \(\nabla \mathbf{f}(\mathbf{x}) \cdot \mathbf{x} + \mathbf{b}\). The resulting gradient \(\nabla_\mathbf{x}^c {\mathbf{f}}\) intuitively captures the sensitivity of the model prediction \(c\) to each element in the input. In the case of transformer LMs, \(\nabla_\mathbf{x}^{t^*} {\mathbf{f}} \in \mathbb{R}^{S \times d}\), i.e. every dimension of the input embedding is associated with an attribution score, and the logit of the top predicted token \(t^*\) is used as the differentiation target for gradient computation.4 These scores are generally aggregated at the token level to obtain a more intuitive overview of the influence of individual tokens. This is commonly done by taking the \(L^p\) norm of the gradient vector:
\[\text{Grad}_{\,\mathbf{f}(\mathbf{x}) \leftarrow t^*} = \|\nabla_{\mathbf{x}}^{t^*} \mathbf{f}\|_p \in \mathbb{R}^{S} \tag{2.14}\]
Figure 2.3 shows an example of gradient attribution on a language model. By taking the dot product between the gradient vector and the input embedding \({\nabla_{\mathbf{x}}^{t^*} \mathbf{f}\cdot \mathbf{x}}\), known as the gradient \(\times\) input method, this sensitivity information can be converted to an importance estimate. More elaborate gradient-based attribution methods employ perturbations of the input embedding (Sundararajan et al., 2017; Smilkov et al., 2017) or ad-hoc gradient propagation rules (Bach, 2015; Achtibat et al., 2024) to filter noisy gradient information.
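The following sketch illustrates gradient \(\times\) input attribution for a single generation step, assuming the Hugging Face transformers library and using gpt2 as an illustrative checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Gradient x input attribution sketch (Eq. 2.14 and its dot-product variant).
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt").input_ids
embeds = model.transformer.wte(ids).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds).logits[0, -1]

target = logits.argmax()                     # top predicted token t*
logits[target].backward()                    # gradient of the t* logit w.r.t. inputs

grad_l2 = embeds.grad[0].norm(dim=-1)                 # L2-norm aggregation per token
grad_x_input = (embeds.grad[0] * embeds[0]).sum(-1)   # gradient x input scores
for t, s in zip(tok.convert_ids_to_tokens(ids[0].tolist()), grad_x_input):
    print(t, round(s.item(), 3))
```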
Gradient-based attribution methods are heavily used in the investigations of Chapter 3, Chapter 4 and Chapter 5, representing the majority of methods supported by the Inseq toolkit and the most effective approaches for contextual cues imputation in the PECoRe framework. Notably, gradient attribution can be exploited in a similar way to identify the importance of intermediate states \(\mathbf{z}\) in the model, as opposed to input representations \(\mathbf{x}\), i.e. using \(\nabla_{\mathbf{z}}^{t^*} \mathbf{f}\). The CAT method proposed in the Chapter 3 case study adopts this attribution-based approach to locate factual knowledge across LM layers.
Perturbation-based attribution Another popular family of approaches estimates input importance by adding noise or ablating input elements and measuring the resulting impact on model predictions. For instance, the input token \(w_j\) at position \(j\) can be removed, and the resulting probability difference \(p(t^*|t_{<i}) - p(t_{\setminus w_j}^*|t_{<i})\), where \(t^*\) is the predicted token for current sequence position \(i\) and \(j < i\), can be used as an estimate for its importance. If the logit or probability assigned to \(t^*\) does not change, we conclude that the \(j\)-th token has no influence. A multitude of perturbation-based attribution methods exist in the literature, such as those based on local surrogate models like LIME (Ribeiro et al., 2016), or those derived from game theory like SHAP (Lundberg and Lee, 2017). Notably, some architecture-specific methods such as Value Zeroing (Mohebbi et al., 2023) have been proposed to mitigate the disruptive impact of perturbations on model behaviors. A comprehensive framework unifying various perturbation-based approaches is presented by Covert et al. (2021).
Context mixing for attribution Model internals such as the attention weights \(\alpha\) presented in Section 2.1.2 were initially proposed as possible explanations for model behavior (Bahdanau et al., 2015), but were found unfaithful in reflecting the actual predictive behavior of language models (Jain and Wallace, 2019; Bastings and Filippova, 2020). This is because, contrary to other approaches, they only accounted for the importance of specific model components, rather than a more general notion of saliency across the full model. However, recent methods have proposed more refined estimates of token contributions exploiting internals to quantify the information flow within LMs. Some of these alternatives include the use of the norm of value-weighted vectors and output-value-weighted vectors (Kobayashi et al., 2020; Kobayashi et al., 2021), or the use of vectors’ distances to estimate token contributions (Ferrando et al., 2022). These methods result in a set of attribution scores \(\mathbf{a}_{\mathbf{f}(\mathbf{x})} \in \mathbb{R}^{S \times L}\), marking the contribution of position-specific representation across all layers \(1, \ldots, L\) of the model. These per-layer attributions reflecting context mixing patterns are often aggregated using techniques such as rollout (Abnar and Zuidema, 2020), resulting in one score per input token participating in the attention operation. Such context mixing approaches have shown competitive faithfulness compared to best gradient and perturbation-based methods, despite employing only a single forward pass to estimate contributions.
Contrastive input attribution An important limitation of input attribution methods for interpreting language models is that attributed output tokens belong to a large vocabulary space, often having semantically equivalent tokens competing for probability mass in next-word prediction (Holtzman et al., 2021). In this context, attribution scores are likely to misrepresent several overlapping factors such as grammatical correctness and semantic appropriateness driving the model prediction. Recent work addresses this issue by proposing a contrastive formulation of such methods, producing counterfactual explanations for why the model predicts token \(t^*\) instead of an alternative token \(t^\sim\). Yin and Neubig (2022) extend the vanilla gradient method of Equation 2.14 to the contrastive setting as:
\[\text{ContGrad}_{\,\mathbf{f}(\mathbf{x}) \leftarrow t^*, t^\sim} = \nabla_{\mathbf{x}}^{t^* - t^\sim} \mathbf{f} \tag{2.15}\]
We employ this formulation in the PECoRe framework in Chapter 4 and its extension of Chapter 5 to identify salient context cues for generated tokens that were highly influenced by context.
2.2.2 Evaluating and Using Attribution Methods
Plausibility and Faithfulness The evaluation of input attribution methods can be operationalized in terms of various desiderata. Plausibility, also referred to as “human-interpretability” (Lage et al., 2019), is a measure of “how convincing the interpretation is to humans” (Jacovi and Goldberg, 2020), i.e. how well the salient tokens identified by an attribution method are in agreement with those selected by human annotators. It is important to note that plausibility does not imply faithfulness, i.e. how accurately the rationale reflects the true reasoning process of the model (Wiegreffe and Pinter, 2019), since a good explanation of model behavior might not align with human intuition. Consider the following sentence from the BLiMP corpus (Warstadt et al., 2020).
\(\mathbf{x}\) = A report about the Impressionists has/\(*\)have won the competition.
For the sentence to be grammatically correct, the verb to have must be correctly inflected as has to agree with the preceding noun report. Hence, to evaluate the plausibility of a language model for this example, the model is provided with the prefix \(\mathbf{x}'\) = “A report about the Impressionists”. Then, attribution scores are computed for every input token towards the prediction of has as the next token. Finally, we verify whether these scores identify the token report as the most important to predict has. We note that the selection of the pair report-has in the canonical procedure described above is entirely based on grammatical correctness, and other potential pairs not matching these constraints are not considered (e.g. the usage of report to predict writing instead of has as a likely continuation). This common procedure might also cause reasonable behaviors to be labeled as implausible. For example, the indefinite article A might be identified as the most important token to predict has since it is necessarily followed by a singular noun and can co-occur with has more frequently than report in the model’s training data. These limitations in the standard hypothesis-driven approach to plausibility evaluation motivate our proposal for PECoRe as a data-driven alternative in Chapter 4.
Limitations of input attribution methods While input attribution methods are commonly used to debug failure cases and identify biases in models’ predictions (McCoy et al., 2019), popular approaches were shown to be insensitive to variations in the model and data generating process (Adebayo et al., 2018; Sixt et al., 2020), to disagree with each other’s attributions (Atanasova et al., 2020; Crabbé and Schaar, 2023; Krishna et al., 2024) and to show limited capacity in detecting unseen spurious correlations (Adebayo et al., 2020; Adebayo et al., 2022). Importantly, popular methods were found provably unreliable at predicting counterfactual model behavior in realistic settings (Bilodeau et al., 2024). Apart from these theoretical limitations, perturbation-based approaches also suffer from out-of-distribution predictions induced by unrealistic noised or ablated inputs, and from the high computational cost of targeted ablations for granular input elements.
Tools for input attribution The captum library (Kokhlikyan et al., 2020) is part of the PyTorch ecosystem, providing access to several gradient and perturbation-based input attribution methods for any PyTorch-based model, with the recent addition of utilities for simplifying attribution analyses of generative LMs (Miglani et al., 2023). Several captum-based tools provide convenient APIs for input attribution of transformer-based models, notably Transformers Interpret (Pierse, 2021), ferret (Attanasio et al., 2023) and Ecco (Alammar, 2021), which are mainly centered around language classification tasks. SHAP (Lundberg and Lee, 2017) is a popular toolkit mainly centered on perturbation-based input attribution methods and model-agnostic explanations for various data modalities. The saliency library5 provides framework-agnostic implementations of mainly gradient-based input attribution methods, while LIT (Tenney et al., 2020) is a framework-agnostic tool providing a convenient set of utilities and an intuitive interface for interpretability studies spanning input attribution, concept-based explanations and counterfactual behavior evaluation. It notably includes a visual tool for debugging complex LLM prompts (Tenney et al., 2024). More recent low-level interpretability tools such as nnsight (Fiotto-Kaufman et al., 2025) also support attribution, without explicitly providing abstractions to facilitate its usage. inseq, which we introduce in Chapter 3 as part of this thesis’ contributions, is one of the most popular tools for input attribution of generative LMs, supporting advanced approaches for contrastive context attribution (Sarti et al., 2024) and context mixing evaluation.
2.3 Conditioning Language Model Generations
This section describes the two main families of approaches for conditioning the behavior of language models during text generation. First, we present methods for modifying the input context by providing relevant information retrieved from external sources, or demonstrations of desired behavior, which we use in Chapter 5, Chapter 6, and Chapter 7. Then, we discuss approaches for modifying the model’s internal representations to achieve targeted interventions in the generation process, which we compare to prompting methods in Chapter 7.
2.3.1 Controlling Input Context
Large language models have become widely popular due to their ability to adjust their predictions in light of few examples or relevant information provided in an input context (prompt), without requiring additional training (Brown et al., 2020). Prompting LLMs to exploit their in-context learning skills has become pervasive in the NLP community, with much effort devoted to designing effective prompts for various tasks (Dong et al., 2024).
Few-shot prompting is an effective approach to adapt LLMs to new tasks by providing a few demonstrations of the desired behavior in the input context. For example, to perform a translation, a few source language examples can be provided in the prompt with their respective target language translations, and the model is expected to translate new source entries used as queries (Figure 2.4, left). Zero-shot prompting is a more challenging task, where the model is expected to perform well on a new task without any demonstrations, relying solely on its pre-trained knowledge. While effective, several studies highlighted the brittleness of prompting to unexpected factors such as the order of provided examples (Lu et al., 2022). In this thesis, we use few-shot prompting in our attribute-controlled translation experiments of Chapter 6 and our literary translation experiments of Chapter 7.
Retrieval-augmented generation (RAG) is a different approach for conditioning generation where the model is provided with relevant context paragraphs retrieved on-the-fly from an external dataset, such as Wikipedia or a domain-specific corpus. This context is then used to inform the model’s predictions, allowing it to generate more accurate and relevant responses without relying solely on its potentially faulty pre-training knowledge (Figure 2.4, right). RAG has been shown to be effective in improving the factual accuracy of model outputs and reducing hallucinations (Lewis et al., 2020; Petroni et al., 2020). However, it is not directly obvious which retrieved paragraphs are motivating the model’s predictions, a challenge we address via input attribution in Chapter 5. Chapter 6 also employs a similarity retrieval component to control the examples selected for few-shot prompting, showing that example selection leads to better performance in machine translation with LLMs.
2.3.2 Controlling Model Representations
Techniques for conditioning model behavior by modifying the model’s internal representations are commonly referred to as steering methods, and often exploit the linear structure of model activations to achieve simple targeted interventions. Indeed, the linear representation hypothesis states that latent properties of interest—for example, the tone of a response—are encoded as linear subspaces of the representation space in language model activations (Park et al., 2023). This property was already observed in early work on word embeddings (Mikolov et al., 2013), where the direction of the vector between two words was shown to encode their semantic relationship, e.g. \(\mathbf{z}_{\text{king}} - \mathbf{z}_{\text{man}} + \mathbf{z}_{\text{woman}} \approx \mathbf{z}_{\text{queen}}\).
Recent work highlighted the effectiveness of linear interventions on language model representations using directions identified by a probing classifier, i.e. a model \(\mathbf{p}: \mathbb{R}^{d} \to \mathcal{C}\) trained to predict a specific property of interest \(c \in \mathcal{C}\) from the intermediate representations of a trained transformer LM (Köhn, 2015; Gupta et al., 2015; see Belinkov, 2022 for a review). For instance, adding negative multiples of the sentiment direction (\(\mathbf{c}_\text{sent}\)) to the residual stream, i.e. modifying the activation \(\mathbf{z}^l\) as \({\tilde{\mathbf{z}}^l \leftarrow \mathbf{z}^l - \alpha \mathbf{c}_\text{sent}}\), where \(\alpha\) is a pre-selected steering coefficient controlling the intensity of the intervention, is sufficient to generate a text exhibiting the opposite sentiment label (Tigges et al., 2024). This simple procedure, known as activation addition, has become popular for conditioning desired attributes in model generations, including multiple properties at once (Scalena et al., 2024). Some of its variants omit probing classifiers and employ other unsupervised methods for computing feature directions, such as K-Means clustering of representations for examples showing a desired property (Zou et al., 2024), or the mean difference between representations for positive and negative sets of demonstrations (Marks and Tegmark, 2024; Arditi et al., 2024).
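A minimal activation addition sketch using a forward hook is shown below; the checkpoint, layer, coefficient and (random) steering direction are placeholders, since in practice the direction would be derived from a probe or from mean activation differences:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Activation addition sketch: subtract a steering direction from the residual
# stream at one layer during generation. gpt2 and the random direction are
# purely illustrative.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

layer, alpha = 6, 4.0
c = torch.randn(model.config.n_embd)          # placeholder steering direction
c = c / c.norm()

def steer(module, inputs, output):
    hidden = output[0] - alpha * c            # z~ <- z - alpha * c at every position
    return (hidden,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(steer)
ids = tok("The movie was", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False,
                     pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0]))
```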
Wu et al. (2024) describe a broader framework for representation steering, proposing the use of learnable interventions for conditioning generation at specific steps with variable intensity. Formally, an intervention \(I\) can be defined as a tuple composed of an intervention function \(\xi: \mathbb{R}^d \to \mathbb{R}^d\) with learnable parameters, a set of input positions \(P \subseteq \{1, \dots, S\}\) to which the intervention is applied, and the layer \(l\) at which the intervention is applied. This framework, dubbed representation fine-tuning (ReFT), allows learning interventions that override \(\mathbf{z}^l\) as:
\[\mathbf{z}^l_i \leftarrow \begin{cases} \xi(\mathbf{z}^l_i), & \text{if}\; i \in P \\ \mathbf{z}^l_i, & \text{otherwise} \end{cases} \tag{2.16}\]
The intervention function can be learned by minimizing the standard cross-entropy loss with a next-token prediction objective, optimizing only the parameters of the intervention function. Activation addition (ActAdd) can then be described as a special case of this broader framework, in which the intervention function \(\xi\) is constant and applied at all generation steps. In the experiments of Chapter 7, we use ActAdd and ReFT as baselines for our proposed steering method.
The final steering approach we discuss in this section involves the use of sparse autoencoders [SAEs; Huben et al. (2024)] for conditioning model behavior. SAEs have become widely adopted for analyzing the representations learned by transformer LMs thanks to their ability to address polysemanticity, i.e. the entanglement of multiple concepts within learned model representations. Indeed, neurons in transformer LMs were observed to activate on diverse and semantically distinct contexts, with concepts being encoded in a distributed manner across multiple units (Smolensky, 1986; Olah, 2023). In light of this, and given the disparity between the relatively low-dimensional representations learned by transformer LMs and the vast array of abilities they acquire during training, latent concept representations were speculated to be encoded in superposition across various model units (Arora et al., 2018), i.e. with multiple neurons jointly encoding the presence of a single concept (Figure 2.5, left). A concrete example of this phenomenon is given by Elhage et al. (2022), where superposition is observed in the presence of a long tail of sparse concepts in the training dataset.
A possible strategy to disentangle concepts in superposition involves finding an overcomplete feature basis via dictionary learning (Olshausen and Field, 1997; Donoho and Elad, 2003). SAEs are simple autoencoder neural networks, i.e. models trained to reconstruct their input, that can be trained to reconstruct internal representations \(\mathbf{z} \in \mathbb{R}^{d}\) of a neural network exhibiting superposition. Their training objective encourages the model to learn a sparse coding of the input representation through an ad-hoc loss term, resulting in a sparse dictionary of learned concepts. Huben et al. (2024) and Bricken et al. (2023) propose training SAEs on transformer LM representations using the form:
\[ \begin{aligned} \text{SAE}(\mathbf{z}) &= h(\mathbf{z})\,\mathbf{W}_{\text{dec}} + \mathbf{b}_{\text{dec}} \\ \text{with}\; h(\mathbf{z}) &= \sigma\big((\mathbf{z} - \mathbf{b}_{\text{dec}})\mathbf{W}_{\text{enc}} + \mathbf{b}_{\text{enc}}\big) \\ \end{aligned} \tag{2.17}\]
using the loss function:
\[\mathcal{L}(\mathbf{z}) = \|\mathbf{z} - \text{SAE}(\mathbf{z})\|_2^2 + \alpha \|h(\mathbf{z})\|_1 \tag{2.18}\]
where \(\sigma\) is a non-linear activation function, \(\mathbf{W}_{\text{enc}}\) and \(\mathbf{W}_{\text{dec}}\) are the learned encoder and decoder weight matrices, respectively, and \(\alpha\) is a hyperparameter controlling the sparsity of the learned representation. The first term in Equation 2.18 accounts for the quality of the reconstruction, while the second term is a sparsity penalty promoting sparse activations. The SAE architecture is illustrated in Figure 2.5 (right).
If \(h(\mathbf{z}) \in \mathbb{R}^{m}\) and \(m \gg d\), \(\mathbf{z}\) can be approximated as a sparse linear combination of the learned rows in the dictionary \({\mathbf{W}_{\text{dec}} \in \mathbb{R}^{m \times d}}\), ideally representing monosemantic concepts. Similarly to activation addition, these concepts can be used to steer model behavior by scaling them using a steering coefficient before reconstruction, resulting in a modified representation \(\tilde{\mathbf{z}}\). We use a similar approach in our SAE-based steering method we present in Chapter 7.
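The following sketch implements the SAE of Equations 2.17-2.18 and a naive feature-scaling intervention; all sizes, the sparsity coefficient, and the chosen feature index are illustrative:

```python
import torch
from torch import nn

class SparseAutoencoder(nn.Module):
    # Minimal SAE following Eq. 2.17; d and m are illustrative sizes with m >> d.
    def __init__(self, d=16, m=256):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d, m) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(m, d) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(m))
        self.b_dec = nn.Parameter(torch.zeros(d))

    def encode(self, z):
        return torch.relu((z - self.b_dec) @ self.W_enc + self.b_enc)  # h(z)

    def forward(self, z):
        return self.encode(z) @ self.W_dec + self.b_dec                # SAE(z)

sae, alpha = SparseAutoencoder(), 1e-3
z = torch.randn(32, 16)                              # a batch of residual stream states
recon, h = sae(z), sae.encode(z)
# Eq. 2.18: reconstruction term plus L1 sparsity penalty on the activations.
loss = ((z - recon) ** 2).sum(-1).mean() + alpha * h.abs().sum(-1).mean()

# Steering sketch: upweight one learned feature before reconstruction.
h_steered = h.clone()
h_steered[:, 42] += 10.0                             # scale a chosen concept feature
z_tilde = h_steered @ sae.W_dec + sae.b_dec
```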
2.4 Machine Translation
Machine translation is a long-standing task in natural language processing, with the goal of automatically translating text from a source language into a target language. In this section, we provide a brief overview of the evolution of machine translation approaches, describe how transformer LM architectures are commonly used for machine translation, and discuss how such models can handle multiple languages and contextual information.
The history of machine translation can be summarized in three main phases. Between the 1960s and the 1980s, the first successes of machine translation were attained by rule-based systems exploiting various techniques, ranging from direct translation using dictionaries with a set of reordering rules to ambitious methods aiming to exploit an interlingua acting as a bridge when mapping meaning across languages (Hutchins, 2001). As for most rule-based methods, however, these approaches were limited by the need for ad-hoc rules, which could hardly account for less frequent and challenging settings. From the 1990s onwards, the statistical paradigm took hold by exploiting large bilingual corpora made available by the birth of the World Wide Web to train statistical language models parametrized as tables of co-occurrence probabilities (Och et al., 1999), with popular approaches aiming to segment challenging sentences into simpler phrases for ease of translation via co-occurrences (Koehn et al., 2003) or syntactic analysis (Hadiwinoto, 2017). In 2013, the advent of word embeddings coincided with the first MT systems based on continuous language representations parametrized by neural networks (Kalchbrenner and Blunsom, 2013), marking the advent of the neural MT (NMT) paradigm that remains the current state-of-the-art for machine translation. While the architecture of NMT systems has barely changed since the introduction of the transformer, as for most NLP tasks the introduction of large pre-trained language models has led to general-purpose models able to handle various translation-related tasks via light tuning and ad-hoc prompting (Alves et al., 2024).
Since machine translation involves the generation of a sequence of translated target tokens, such a task fits naturally into the sequence-to-sequence framework adopted by neural language models. Given a sequence of tokens \(\mathbf{x} = (x_1, x_2, \ldots, x_{S_s})\) in the source language \(s\), a language model can be trained to generate a sequence of target tokens \(\mathbf{y} = (y_1, y_2, \ldots, y_{S_t})\) in the target language \(t\) using the classic cross-entropy loss function. The transformer module we presented in Section 2.1.3 corresponds to the decoder-only architecture currently preferred for language modeling, involving a single stack of blocks. However, the original model proposed by Vaswani et al. (2017) followed the traditional encoder-decoder structure adopted in MT, with an additional dedicated component for encoding source information and influencing the generation of target tokens.
The encoder-decoder transformer architecture for machine translation is illustrated in Figure 2.6. The encoder processes the source sentence \(\mathbf{x}\) and produces a sequence of contextualized representations \(\mathbf{Z}^{L_{\text{enc}}}_{\text{enc}} \in \mathbb{R}^{S_s \times d_\text{enc}}\) capturing the meaning of the source sentence. When generating the \(i\)-th token in the target sentence, every block of the decoder then attends to the target prefix \(\mathbf{y}_{<i}\) using the self-attention module (MHSA) presented in Section 2.1.2, and complements this with a multi-head cross-attention (MHCA) mechanism integrating information from encoder representations \(\mathbf{Z}^{L_{\text{enc}}}_{\text{enc}}\). Functionally, the cross-attention module is identical to self-attention, but employs encoder representations to generate key and value vectors, while the query vectors are generated from the decoder representations.
While encoder-decoder transformers were traditionally trained from scratch on the machine translation task, the current state-of-the-art adapts pre-trained decoder-only LLMs with ad-hoc supervised tuning (Cui et al., 2025; Rei et al., 2024; Xu et al., 2024). Our experiments reflect this paradigm shift: initial MT experiments in Chapter 4, Chapter 8 and Chapter 9 employ traditional encoder-decoder, single-purpose translation models, while in Chapter 6 and Chapter 7 we generate translations by prompting general-purpose LLMs. Finally, Chapter 10 evaluates methods on both model types.
Multilingual machine translation Even before the advent of LLM-based translation systems, an important trend in MT research involved the training of massively multilingual MT (MMT) models capable of producing direct translations across hundreds of translation directions (Aharoni et al., 2019). This approach was shown to bring improvements over previous methods requiring an intermediate translation step into a high-resource pivot language when two less-resourced languages were used as source and target (Kim et al., 2019). MMT models are typically trained on large multilingual web corpora with similarity-matched sentence pairs in different languages (Schwenk et al., 2021), using special language tags such as <eng_Latn> as prefixes to mark source and target languages. After training, a translation into a specific language can be produced by prepending the respective language tag to the target sequence, biasing model generation towards tokens matching that language, as illustrated in the sketch below. This thesis makes ample use of encoder-decoder MMT models, such as mBART-50 (Tang et al., 2021), trained to translate from English to 50 languages (one-to-many MMT), M2M-100 (Fan et al., 2021), supporting many-to-many translation between 100 languages, and finally No Language Left Behind [NLLB; NLLB Team et al. (2024)], covering 200 languages in all directions. Decoder-only LLMs are generally trained on variable amounts of multilingual data6, and hence exhibit some degree of multilingual ability without additional MT tuning.
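As an illustration of tag-based conditioning, the sketch below translates a sentence with an NLLB checkpoint by forcing the target language tag as the first decoded token; it assumes the Hugging Face transformers library, and the exact tag-handling API may vary across tokenizer versions:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Minimal tag-based MMT sketch with an NLLB checkpoint (illustrative example).
name = "facebook/nllb-200-distilled-600M"
tok = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")   # source language tag
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tok("Interpretability matters for translation.", return_tensors="pt")
out = model.generate(
    **inputs,
    forced_bos_token_id=tok.convert_tokens_to_ids("ita_Latn"),   # target language tag
    max_new_tokens=50,
)
print(tok.decode(out[0], skip_special_tokens=True))
```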
Context-aware machine translation Inter-sentential context is often fundamental for resolving discourse-level ambiguities during translation (Müller et al., 2018; Bawden et al., 2018; Voita et al., 2019; Fernandes et al., 2023b). Traditional MT systems were trained at the segment level due to their limited ability to handle long contexts, potentially losing important contextual information that spans beyond sentence boundaries and resulting in lower performance in realistic settings (Läubli et al., 2018; Toral et al., 2018a). Context-aware MT approaches aimed to address this limitation by incorporating document-level information to improve translation quality and consistency, leading to improved performance when translating cohesive discourse phenomena such as anaphora resolution, lexical cohesion, and maintaining consistent terminology within a document (Voita et al., 2018; Maruf and Haffari, 2018). Initial context-aware approaches for NMT employed methods ranging from concatenating multiple source sentences to employing hierarchical attention mechanisms that explicitly model document structure (Miculicich et al., 2018; Zhang et al., 2018). We use one such method, namely concatenating context and current source text using a special <brk> tag, for the NMT models we analyze in Chapter 4. Recent LLM-based translation systems can naturally process longer contexts and maintain better consistency across document boundaries (Wang et al., 2023; Briakou et al., 2024).
2.5 MT Post-Editing and Evaluation
The landscape of machine translation has undergone a fundamental transformation in recent decades, shifting from a tool primarily designed for professional translators to a technology accessed by millions of lay users worldwide (Savoldi et al., 2025). In this section, we review MT post-editing tools and practices, and discuss how MT outputs are evaluated by means of automatic metrics and human annotators.
2.5.1 Post-editing MT
Since the inception of MT technologies in professional translation workflows, human post-editing has been a crucial step to ensure quality and mitigate potential critical errors, especially for low-resource settings (Wagner, 1983; Church and Hovy, 1993). The industry distinguishes between two primary post-editing levels: light post-editing, which focuses on correcting only critical errors affecting comprehension while tolerating stylistic imperfections, and full post-editing, which aims to achieve human translation quality standards. The choice between these approaches involves trade-offs between effort investment and quality requirements, with light post-editing being faster while maintaining acceptable quality for many use cases (Plitt and Masselot, 2010). Seminal post-editing studies highlighted an increase in translators’ productivity following MT adoption (Guerberof, 2009; Green et al., 2013; Läubli et al., 2013; Plitt and Masselot, 2010; Parra Escartín and Arcedillo, 2015). However, they also struggled to identify generalizable findings due to confounding factors like output quality, content domains, and high variance across language pairs and human subjects. With the advent of NMT, productivity gains of the new approach were extensively compared to those of statistical MT (Castilho et al., 2017; Bentivogli et al., 2016; Toral et al., 2018b; Läubli et al., 2019). Initial results were promising for NMT due to its better fluency and overall results. Moreover, translators were shown to prefer NMT over SMT for post-editing, although a pronounced productivity increase was not always present. In more recent times, various works explored the usage of adaptive MT systems that learn from post-editing feedback in real-time (Turchi et al., 2017; Karimova et al., 2018), with the goal of progressively reducing repetitive corrections and adapting to translator preferences. Notably, recent estimates confirm that human-machine collaboration can match or even exceed the quality of human-only translations, with potential cost reductions estimated at around 60% of the price of full human post-editing (Liu et al., 2024).
The main metric of evaluation for post-editing in the industry is productivity, often operationalized as the number of source characters or words revised per minute. On the other hand, post-editing research often complements productivity measurements with editing effort alongside its temporal, technical and cognitive components (Krings, 2001), corresponding to editing time, number of keystrokes, and pauses between keystrokes during the editing process, respectively. Importantly, the cognitive and temporal demands of post-editing were found to vary significantly depending on various factors, such as error types and user expertise. For example, Daems et al. (2017) found that certain error categories have disproportionate impacts on post-editing effort, with adequacy errors often requiring more cognitive resources than fluency errors, even though the latter may be more immediately apparent to users (Martindale and Carpuat, 2018). Domain-specific considerations further complicate this landscape, as technical domains may tolerate certain stylistic variations while requiring precise terminology, whereas literary translation may prioritize creative renditions of meaning (Guerberof-Arenas and Toral, 2022).
Professional translators typically post-edit texts through computer-assisted translation (CAT) tools, which are interfaces designed to enhance human translators’ productivity by providing access to keyboard shortcuts, quality estimation (which we discuss in Section 2.6) and other assistive technologies (Bowker, 2002). A common functionality of CATs is the integration of translation memories (TMs), which are bilingual databases storing previously translated content that can be retrieved and reused for similar segments, mimicking the functioning of early example-based MT systems (Garcia, 2009). Additional features often include terminology management systems (termbases) for maintaining consistency in technical terms and brand names, automatic text segmentation, and quality assurance modules such as spellcheckers for detecting errors and inconsistencies. Modern CAT tools have evolved from standalone desktop software to cloud-based platforms accessible via web browsers (Moran et al., 2014; Federico et al., 2014), with recent surveys indicating that 88% of professional translators use at least one CAT tool for their work.7 While many CAT tools nowadays offer multiple advanced features, including LLM-based AI assistants, in our user studies of Chapter 8 and Chapter 9, we employ simple research-oriented interfaces with minimal text editing functionalities to ensure equal proficiency across subjects. In Chapter 8 we employ PET (Aziz et al., 2012), a simple desktop-based post-editing tool supporting various languages, while in Chapter 9 we use a custom-built web interface supporting editing over highlighted error spans.
2.5.2 MT Evaluation
The industrial context has historically had an important influence on MT evaluation practices, encouraging researchers to focus on evaluation efficiency, to combine automatic metrics with human assessment, and to favor metrics that could provide concrete benefits when employed in professional translation workflows.
Automatic MT Metrics. Automatic evaluation metrics for machine translation have been widely adopted since the early 2000s, with the most popular being BLEU (Papineni et al., 2002). BLEU is a simple and inexpensive metric measuring lexical similarity between a candidate translation \(\hat y\) and its given reference \(y\) as the number of \(n\)-grams \(G_n = \{(\hat y_1, \dots, \hat y_n), (\hat y_2, \dots, \hat y_{n+1}), \dots\}\) shared between them, normalized by the total n-gram count:
\[p_n(y, \hat y) = \frac{\sum_{s \in G_n} \min(C(s,\hat y), C(s,y))}{\sum_{s \in G_n} C(s,\hat y)}\]
where \(C(s, y)\) is the count of \(n\)-gram \(s\) in sequence \(y\). The complete BLEU score also incorporates a brevity penalty to discourage overly short translations. BLEU is typically computed at the corpus level, aggregating \(n\)-gram statistics over an entire corpus of candidate and reference translations to obtain a single score. Multiple variants of BLEU have been proposed to account for length bias and multiple references, with other metrics such as chrF (Popović, 2015) adopting similar lexicon-based approaches at the character level, or aligning \(n\)-grams across the two sequences (Banerjee and Lavie, 2005). Other lexical metrics such as the Translation Edit Rate (TER; Snover et al., 2006) or the Word Error Rate (WER) have been used to connect the quality of the candidate sequence to the number of edits required to convert it into the reference, grounding the evaluation in post-editing technical effort. While these metrics provide a rapid assessment of translation quality with minimal computational overhead, they suffer from several limitations: sensitivity to lexical variations that may not reflect genuine quality differences, poor correlation with human judgments for high-quality neural MT outputs, and limited generalization across different writing systems (Bugliarello et al., 2020).
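To make the quantities above concrete, the following minimal Python sketch computes the modified \(n\)-gram precision and the brevity penalty for a single whitespace-tokenized candidate-reference pair. It illustrates the formula above under simplifying assumptions (single reference, no smoothing) and is not a replacement for standard implementations such as sacreBLEU.

```python
import math
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """p_n: clipped n-gram matches divided by the total candidate n-gram count."""
    cand_counts, ref_counts = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total > 0 else 0.0

def bleu(candidate: list[str], reference: list[str], max_n: int = 4) -> float:
    """Sentence-level BLEU with uniform n-gram weights and a brevity penalty (no smoothing)."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    brevity = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_avg)

hyp = "the cat sat on the mat".split()
ref = "the cat sat on the red mat".split()
print(f"BLEU = {bleu(hyp, ref):.3f}")  # ≈ 0.673
```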
The limitations of lexical metrics, together with calls from the MT research community (Freitag et al., 2022), led to the widespread adoption of learned metrics trained to predict translation quality from large amounts of annotated examples. Most widely used learned MT metrics employ transformer-based encoder-only pretrained LMs such as BERT (Devlin et al., 2019) or the cross-lingual model XLM (Conneau and Lample, 2019). Among the most notable metrics, Bleurt (Sellam et al., 2020) is a BERT-based model using a multi-task loss on synthetic data to regress human quality judgments, while comet (Rei et al., 2020) feeds source, candidate, and reference translation triples to a dual cross-lingual encoder structure that jointly learns to estimate quality and rank multiple candidate translations. In most of our MT evaluations we employ the comet metric due to its excellent performance across hundreds of languages, which resulted in top-scoring submissions at multiple WMT metrics shared tasks (Rei et al., 2020; Rei et al., 2021; Rei et al., 2022a).8 However, learned metrics introduce their own challenges, including non-trivial computational requirements, potential biases inherited from training data, and questions about generalization to out-of-domain content (Amrhein and Sennrich, 2022).
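To illustrate how such learned metrics are typically assembled, the PyTorch sketch below builds a comet-style estimator: a shared multilingual encoder embeds source, candidate, and reference sentences, and a small feed-forward head regresses a scalar quality score from a combination of the pooled embeddings. The encoder checkpoint, the feature combination, and the head dimensions are illustrative assumptions rather than the actual comet configuration, and the head would need to be trained on human quality judgments before producing meaningful scores.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class CometStyleEstimator(nn.Module):
    """Illustrative comet-style regressor: pooled sentence embeddings -> scalar quality score."""

    def __init__(self, encoder_name: str = "xlm-roberta-base"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Regression head over six concatenated feature vectors (see forward()).
        self.head = nn.Sequential(nn.Linear(hidden * 6, 512), nn.Tanh(), nn.Linear(512, 1))

    def embed(self, sentences: list[str]) -> torch.Tensor:
        batch = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        hidden_states = self.encoder(**batch).last_hidden_state
        mask = batch["attention_mask"].unsqueeze(-1)
        # Mean pooling over non-padding tokens.
        return (hidden_states * mask).sum(1) / mask.sum(1)

    def forward(self, src: list[str], mt: list[str], ref: list[str]) -> torch.Tensor:
        h_src, h_mt, h_ref = self.embed(src), self.embed(mt), self.embed(ref)
        # One plausible feature combination: embeddings, elementwise products, absolute differences.
        features = torch.cat(
            [h_mt, h_ref, h_mt * h_src, h_mt * h_ref, (h_mt - h_src).abs(), (h_mt - h_ref).abs()],
            dim=-1,
        )
        return self.head(features).squeeze(-1)  # one score per (src, mt, ref) triple
```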
Human evaluation of MT. Despite its challenges due to inconsistencies across annotators, cultural and linguistic biases, and high costs, human evaluation remains the gold standard for assessing machine translation quality, providing crucial insights that automatic metrics may fail to capture (Freitag et al., 2021). Historically, human assessment of MT was centered around the notions of adequacy (also accuracy or fidelity), comprehensibility and fluency (or grammaticality) (White et al., 1994; Callison-Burch et al., 2007), with adequacy measuring how well the original meaning is conveyed, comprehensibility reflecting how understandable the MT output is without access to the source, and fluency judging whether appropriate target grammar is employed (Popović, 2020). Since 2017, MT evaluation campaigns have adopted continuous direct assessment (DA) of translation quality using scalar ratings (for example, a 0-100 scale as in Graham et al., 2013) or comparative ranking of multiple system outputs (Bojar et al., 2017).
More recently, the introduction of Multidimensional Quality Metrics (MQM) (Lommel et al., 2013) has provided more structured evaluation protocols. MQM is an established framework allowing annotators to identify specific spans in a translated text, categorize them as accuracy, fluency, or style issues, and assign them a level of severity (typically a three-way classification into minor/major/critical). Freitag et al. (2021) experiment with various scoring configurations, resulting in the scoring formula:
\[\text{MQM} = (\text{\# Major Err.} \times 5) + (\text{\# Minor Err.} \times 1) + (\text{\# Punct. Err.} \times 0.1)\]
with higher scores corresponding to worse translations, which was found to correlate highly with judgments from expert raters. However, such a scheme has been criticized for its potential length bias, with recent proposals introducing calibrated and non-linear scoring models to account for such issues (Lommel et al., 2024). An example description of the MQM error categories and severity levels we employed for our study in Chapter 9 is presented in Table 9.1.
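The weighting above translates directly into a short scoring routine. The Python sketch below, assuming error annotations are already reduced to a list of severity labels (a hypothetical format), computes the segment-level MQM score with the weights reported in the formula.

```python
from collections import Counter

# Severity weights taken from the scoring formula above; label names are hypothetical.
MQM_WEIGHTS = {"major": 5.0, "minor": 1.0, "minor-punctuation": 0.1}

def mqm_score(error_severities: list[str]) -> float:
    """Sum severity weights over annotated errors; higher scores mean worse translations."""
    counts = Counter(error_severities)
    return sum(MQM_WEIGHTS.get(severity, 0.0) * count for severity, count in counts.items())

# Example: two major errors, one minor error, one minor punctuation error.
print(mqm_score(["major", "major", "minor", "minor-punctuation"]))  # 11.1
```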
Recent evaluation campaigns such as WMT 2024 (Kocmi et al., 2024a) have increasingly adopted the MQM protocol for their evaluations, emphasizing in particular the distinction between expert and non-expert annotators, with studies showing that translation professionals provide more consistent and reliable judgments than crowd-sourced annotators (Freitag et al., 2021). The advent of large language models has introduced new challenges for human evaluation, as the quality gap between human and machine translation continues to narrow, requiring more fine-grained assessment criteria and larger annotator pools to achieve reliable results (Kocmi et al., 2024a). The main factor limiting the adoption of the MQM evaluation protocol is its cost, since it involves a thorough annotation of error spans. Recently, the Error Span Annotation (ESA) protocol (Kocmi et al., 2024b) was introduced as a compromise between DA and MQM ratings, asking annotators to provide a 0-100 quality rating only after a light pass of error span identification, without requiring a full MQM error type categorization. The error annotation is intended to prime annotators to ground their quality judgments in empirical evidence, and ESA scores were observed to correlate strongly with MQM scores while being 32% cheaper to obtain (Kocmi et al., 2024b). For this reason, we adopt a variant of the ESA protocol for the quality assessment phase of our QE4PE study in Chapter 9. Zouhar et al. (2025) propose using a language model to assist in the error span identification process, potentially further reducing the cost and effort involved in the ESA protocol.
2.6 Quality Estimation for MT
The automatic MT metrics presented in Section 2.5 require a reference translation to measure the quality of a given candidate. While effective, these metrics cannot be employed to evaluate translation candidates on the fly, for example before presenting them to human post-editors, or as a ranking procedure in advanced decoding strategies (Rei et al., 2022b). Moreover, low-quality references can bias the evaluation, tying the measured quality to a specific gold standard rather than reflecting the actual quality of the translation (Freitag et al., 2023). Quality estimation (QE) metrics, also known as reference-free MT metrics, are an alternative category of techniques designed to address these limitations by predicting translation quality without requiring reference translations (Specia et al., 2018). Contrary to traditional MT evaluation, QE can be performed at various levels of granularity. On the one hand, when operating at the segment or document level, QE methods typically return a score between 0 and 1 reflecting the overall quality of the translation, which can then be used to guide post-editors towards problematic segments (Tamchyna, 2021). On the other hand, word-level QE metrics can provide more granular information about translation issues, and typically operate by marking individual words with binary OK/BAD labels or, more recently, following the severity scheme introduced by the MQM framework.
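For concreteness, the two granularities can be pictured as different output formats: a segment-level QE system returns one score per translation, while a word-level system attaches a label to each target token. The dataclasses and label values in the Python sketch below are hypothetical and serve only to illustrate this distinction.

```python
from dataclasses import dataclass

@dataclass
class SegmentQE:
    """Segment-level QE: one overall quality score (e.g., in [0, 1]) per translation."""
    source: str
    translation: str
    score: float

@dataclass
class WordQE:
    """Word-level QE: one label per target token (binary OK/BAD or MQM-style severities)."""
    source: str
    translation: str
    token_labels: list[tuple[str, str]]

segment = SegmentQE("Il gatto dorme.", "The cat is sleeping.", score=0.93)
words = WordQE(
    "Il gatto dorme.",
    "The dog is sleeping.",
    token_labels=[("The", "OK"), ("dog", "BAD"), ("is", "OK"), ("sleeping.", "OK")],
)
```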
Initial approaches to QE were mostly based on uncertainty estimates extracted from MT models (Blatz et al., 2004; Specia et al., 2009), but the field progressively shifted towards supervised approaches involving ad-hoc model training (Turchi et al., 2013; Turchi et al., 2014; Kepler et al., 2019; Thompson and Post, 2020, inter alia). Advances in segment- and word-level QE research are regularly assessed in annual WMT campaigns (Fomicheva et al., 2021; Zerva et al., 2022; Blain et al., 2023; Zerva et al., 2024), where the best-performing QE systems have recently employed transformer-based language models trained to predict quality scores, in a fashion similar to reference-based metrics. In particular, reference-less counterparts to the comet models were introduced for QE applications, including a smaller model for efficient inference (Rei et al., 2022b).
More recently, the widespread adoption of the MQM paradigm and advances in LLM capabilities led to new QE metrics predicting quality at various granularity levels. Notably, Kocmi and Federmann (2023) prompt GPT-4 with an annotation scheme mimicking MQM to produce fine-grained quality assessments from which a segment-level score is derived, while Fernandes et al. (2023a) develop a similar AutoMQM framework using the PaLM-2 LLM. While these approaches usually employ proprietary models, Guerreiro et al. (2024) propose xcomet, a state-of-the-art open-source QE model extending comet to jointly predict word-level error spans and sentence-level quality scores, improving the explainability of its results. xcomet metrics come in a 3.5B (XL) and a 10.7B (XXL) size and support both reference-based and reference-less usage, hence enabling their use for quality estimation purposes. Concretely, xcomet models are transformer encoders fine-tuned from pre-trained XLM-R encoders (Goyal et al., 2021) using a mix of sentence-level direct assessment scores and word-level MQM error spans. We use the resulting systems for our user study of Chapter 9 and our metric comparison in Chapter 10.
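The joint word- and sentence-level design described above can be sketched as a single encoder with two prediction heads, one classifying each token of the source-candidate pair into an error severity and one regressing a segment score from the pooled representation. The following PyTorch sketch is a simplified approximation under assumed label sets, pooling strategy, and head sizes, not the actual xcomet implementation.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

SEVERITIES = ["ok", "minor", "major", "critical"]  # assumed word-level label set

class JointQEModel(nn.Module):
    """Sketch of a reference-free QE model with word-level and sentence-level heads."""

    def __init__(self, encoder_name: str = "xlm-roberta-base"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.span_head = nn.Linear(hidden, len(SEVERITIES))  # per-token severity logits
        self.score_head = nn.Sequential(nn.Linear(hidden, 256), nn.Tanh(), nn.Linear(256, 1))

    def forward(self, sources: list[str], translations: list[str]):
        # Encode each source and candidate jointly as a sentence pair.
        batch = self.tokenizer(sources, translations, padding=True, truncation=True, return_tensors="pt")
        hidden_states = self.encoder(**batch).last_hidden_state
        token_logits = self.span_head(hidden_states)           # word-level error span prediction
        mask = batch["attention_mask"].unsqueeze(-1)
        pooled = (hidden_states * mask).sum(1) / mask.sum(1)   # mean pooling over non-padding tokens
        segment_score = self.score_head(pooled).squeeze(-1)    # sentence-level quality score
        return token_logits, segment_score
```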
Aside from supervised models, Fomicheva et al. (2020) brought on a return to unsupervised methods exploiting model uncertainty and internal mechanisms, showing that such approaches can rival state-of-the-art supervised QE models in predicting translation quality at the segment level. These methods typically rely on the model's confidence in its predictions, often using quantities such as the predictive probability or the entropy of the predictive distribution to mark low-confidence tokens as potential errors. The appeal of such methods lies in their efficiency, exploiting the knowledge of the MT model for error detection without requiring additional training on expensive human annotations. While such methods have been the subject of multiple studies (Dale et al., 2023; Xu et al., 2023; Himmi et al., 2024; surveyed by Leiter et al., 2024), including a shared task dedicated to explainable QE metrics (Fomicheva et al., 2021), their evaluation has typically focused on segment-level quality prediction, with word-level error spans generally obtained by attributing the predictions of supervised segment-level metrics (Rubino et al., 2021; Rei et al., 2023). By contrast, recent work on LLMs evaluates various metrics to detect errors from the generator model itself, without additional systems involved, both at the sentence (Fadeeva et al., 2023) and at the token level (Fadeeva et al., 2024). Our evaluation of Chapter 10 involves various unsupervised metrics at the word level, employing the edits from our user studies of previous chapters as sources of word-level error spans to evaluate unsupervised word-level QE methods across multiple label sets.

A notable technique for unsupervised QE is Monte Carlo Dropout (MCD) (Gal and Ghahramani, 2016). The dropout mechanism (Srivastava et al., 2014), commonly used for regularization during training, is employed at inference time by MCD to produce a set of noisy predictions from a single model, approximating Bayesian inference. For a given input \(\mathbf{x}\), \(T\) forward passes are performed through the network. In each pass \(t \in \{1, \dots, T\}\), a different random dropout mask \(\Theta_t\) is applied to the model parameters, resulting in slightly different output probabilities \(p(\mathbf{x} \mid \Theta_t)\). The set of \(T\) predictions \(\{p(\mathbf{x} \mid \Theta_1), \dots, p(\mathbf{x} \mid \Theta_T)\}\) can be seen as samples from an approximate posterior distribution, and can be used, for example, to quantify model uncertainty as the variance of the probabilities assigned to a specific token. We employ this method, which shows promising performance in our evaluation of Chapter 10, to produce unsupervised error highlights for our QE4PE user study in Chapter 9.
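A minimal sketch of this procedure for a sequence-to-sequence MT model from the transformers library is shown below: dropout layers are re-activated at inference time, \(T\) stochastic forward passes score a fixed source-translation pair, and the variance of per-token probabilities across passes serves as an uncertainty signal. The checkpoint name and the use of variance as the uncertainty score are illustrative assumptions.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def enable_mc_dropout(model: torch.nn.Module) -> None:
    """Keep the model in eval mode but re-activate all dropout layers."""
    model.eval()
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()

@torch.no_grad()
def mc_dropout_token_uncertainty(model, tokenizer, source: str, translation: str, t: int = 10):
    """Variance of per-token probabilities of `translation` across T stochastic passes."""
    enable_mc_dropout(model)
    inputs = tokenizer(source, return_tensors="pt")
    # `text_target` requires a recent transformers version; older versions use as_target_tokenizer().
    labels = tokenizer(text_target=translation, return_tensors="pt").input_ids
    per_pass_probs = []
    for _ in range(t):
        logits = model(**inputs, labels=labels).logits                     # (1, target_len, vocab)
        probs = torch.softmax(logits, dim=-1)
        token_probs = probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # prob of each target token
        per_pass_probs.append(token_probs.squeeze(0))
    stacked = torch.stack(per_pass_probs)                                  # (T, target_len)
    return stacked.var(dim=0)                                              # high variance = low confidence

# Hypothetical usage with a public MT checkpoint:
# tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-it")
# mt = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-it")
# print(mc_dropout_token_uncertainty(mt, tok, "The cat sleeps.", "Il gatto dorme.", t=10))
```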
From a practical standpoint, QE methods are widely used in the translation industry for triaging automatic translations, with integrations in popular CAT tools presenting users with segment-level quality scores (Tamchyna, 2021). While QE has been found helpful in increasing the confidence and speed of human assessment (Mehandru et al., 2023; Zouhar et al., 2025), incautious use of these techniques can lead to misplaced over-reliance on model predictions (Zouhar et al., 2021). Moreover, the effectiveness of QE-assisted post-editing depends critically on the accuracy of quality predictions, with inaccurate highlights potentially misleading translators and reducing overall productivity (Shenoy et al., 2021). Interfaces supporting word-level error highlights were developed for studying MT post-editing (Coppers et al., 2018; Herbig et al., 2020) and code reviewing (Sun et al., 2022; Vasconcelos et al., 2025), with results suggesting that striking the right balance of user-provided information is fundamental to improving the editing experience and preventing cognitive overload. Our user study of Chapter 9 is one of the few works going beyond accuracy evaluations to measure the actual impact of word-level QE systems when integrated in human post-editing workflows.
More details on neural networks can be found in Goodfellow et al. (2016).
Bias terms can be omitted, following the practice of recent models such as Llama (Touvron et al., 2023).
Probability scores are commonly used as differentiation targets, see discussion in Bastings et al. (2022).
Since the push towards proprietary model serving, details about the distribution of training data across languages in tech reports are often scarce.
https://go.proz.com/blog/cat-tool-use-by-translators-who-is-using
A comprehensive overview of MT metrics was released by Lee et al. (2023).