4 Quantifying Context Usage in Neural Machine Translation

Chapter Summary

This chapter investigates how context-aware machine translation models leverage contextual information. For this purpose, we introduce Plausibility Evaluation of Context Reliance (PECoRe), an end-to-end interpretability framework designed to quantify context usage in language models’ generations. Our approach leverages model internals to contrastively identify context-sensitive target tokens in generated texts and link them to contextual cues justifying their prediction. We demonstrate the framework’s effectiveness by assessing the plausibility of context-aware machine translation models, comparing model rationales with human annotations across several discourse-level phenomena. We integrate PECoRe in the Inseq toolkit API and apply it to unannotated model outputs to identify context-mediated predictions and highlight instances of (im)plausible context usage throughout generation.

This chapter is adapted from the paper Quantifying the Plausibility of Context Reliance in Neural Machine Translation (Sarti et al., 2024a). Section 4.6 is adapted from the case study in Democratizing Advanced Attribution Analyses of Generative Language Models with the Inseq Toolkit (Sarti et al., 2024b).

An interpretation will be meaningful to the extent that it accurately reflects some isomorphism to the real world.

– Douglas R. Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid (1979)

4.1 Introduction

Research in NLP interpretability defines various desiderata for rationales of model behaviors, i.e. the contributions of input tokens toward model predictions computed using input attribution (Madsen et al., 2022). One such property is plausibility, corresponding to the alignment between model rationales and salient input words identified by human annotators (Jacovi and Goldberg, 2020). Low-plausibility rationales typically occur alongside generalization failures or biased predictions and can be helpful in identifying cases where models are “right for the wrong reasons” (McCoy et al., 2019).

However, while plausibility has an intuitive interpretation for classification tasks involving a single prediction, extending this methodology to generative language models presents several challenges. First, LMs have a large output space in which semantically equivalent tokens (e.g. “PC” and “computer”) are competing candidates for next-word prediction (Holtzman et al., 2021). Moreover, LMs’ generations are the product of optimization pressures to ensure independent properties such as semantic relatedness, topical coherence and grammatical correctness, which can hardly be captured by a single attribution score (Yin and Neubig, 2022). Finally, since autoregressive generation involves an iterative prediction process, model rationales could be extracted for every generated token. This raises the issue of which generated tokens can have plausible contextual explanations.

Recent attribution techniques for explaining language models incorporate contrastive alternatives to disentangle different aspects of model predictions (e.g. the choice of “meowing” over “screaming” for “The cat is ___” is motivated by semantic appropriateness, but not by grammaticality) (Ferrando et al., 2023; Sarti et al., 2023). However, these studies circumvent the issues above by focusing their evaluation on a single generation step matching a phenomenon of interest. For example, given the sentence “The pictures of the cat ___”, a plausible rationale for the prediction of the word “are” should reflect the role of “pictures” in subject-verb agreement. While this approach can be helpful to validate model rationales, it confines plausibility assessment to a small set of handcrafted benchmarks where tokens with plausible explanations are known in advance. Moreover, it risks overlooking important patterns of context usage, including those that do not immediately match linguistic intuitions. In light of this, we suggest that identifying which generated tokens were most affected by contextual input information should be an integral part of plausibility evaluation for language generation tasks.

To achieve this goal, we propose a novel interpretability framework, which we dub Plausibility Evaluation of Context Reliance (PECoRe). PECoRe enables the end-to-end extraction of cue-target token pairs consisting of context-sensitive generated tokens and their respective influential contextual cues from language model generations, as shown in Figure 4.1. These pairs can uncover context dependence in naturally occurring generations and, for cases where human annotations are available, help quantify the plausibility of context usage in language models. Importantly, our approach is compatible with modern attribution methods using contrastive targets (Yin and Neubig, 2022), does not rely on reference translations, thereby avoiding problematic distributional shifts (Vamvas and Sennrich, 2021b), and can be applied to unannotated inputs to identify context usage in model generations.

Figure 4.1: Examples of sentence-level English$\rightarrow$Italian translation with lack-of-context errors and their correct contextual counterparts. In the contextual case, context-sensitive source tokens are disambiguated using source-based (ⓢ) or target-based (ⓣ) contextual cues to produce correct context-sensitive target tokens. PECoRe enables the end-to-end extraction of cue-target pairs (e.g. she-alla pastorella, le pecore-le).

After formalizing our proposed approach in Section 4.3, we apply PECoRe to contextual machine translation to study the plausibility of context reliance in bilingual and multilingual MT models. While PECoRe can easily be used alongside encoder-decoder and decoder-only language models for interpreting context usage in any text generation task, we focus our evaluation on MT because of its constrained output space facilitating automatic assessment and the availability of MT datasets annotated with human rationales of context usage. We thoroughly test PECoRe on well-known discourse phenomena, benchmarking several context sensitivity metrics and attribution methods to identify cue-target pairs. We conclude by applying PECoRe to unannotated examples and showcasing some reasonable and questionable cases of context reliance in MT model translations.¹

In sum, we make the following contributions:

We introduce PECoRe, an interpretability framework to detect and attribute context reliance in language models. PECoRe enables a quantitative evaluation of plausibility for language generation beyond the limited artificial settings explored in previous literature.
We compare the effectiveness of context sensitivity metrics and input attribution methods for context-aware MT, showing the limitations of metrics currently in use.
We apply PECoRe to naturally-occurring translations to identify interesting discourse-level phenomena and discuss issues in the context usage abilities of context-aware MT models.

4.2 Related Work

Context Usage in Language Generation An appropriate² usage of input information is fundamental in tasks such as summarization (Maynez et al., 2020) to ensure the soundness of generated texts. While appropriateness is traditionally verified post-hoc using trained models (Durmus et al., 2020; Kryscinski et al., 2020; Goyal and Durrett, 2021), recent interpretability works aim to gauge input influence on model predictions using internal properties of language models, such as the mixing of contextual information across model layers (Kobayashi et al., 2020; Ferrando et al., 2022b; Mohebbi et al., 2023) or the layer-by-layer refinement of next token predictions (Geva et al., 2022; Belrose et al., 2023). Recent attribution methods can disentangle factors influencing generation in language models (Yin and Neubig, 2022) and were successfully used to detect and mitigate hallucinatory behaviors (Tang et al., 2022; Dale et al., 2023a; Dale et al., 2023b). Our proposed method adopts this intrinsic perspective to identify context reliance without ad hoc trained components.

Context Usage in Neural Machine Translation Despite advances in context-aware MT (Voita et al., 2018; Voita et al., 2019; Lopes et al., 2020; Majumder et al., 2022; Jin et al., 2023; inter alia, surveyed by Maruf et al., 2021), only a few works explored whether context usage in MT models aligns with human intuition. Notably, some studies focused on which parts of context inform model predictions, finding that supposedly context-aware MT models are often incapable of using contextual information (Kim et al., 2019; Fernandes et al., 2021) and tend to pay attention to irrelevant words (Voita et al., 2018), with an overall poor agreement between human annotations and model rationales (Yin et al., 2021). Other works instead investigated which parts of generated texts are influenced by context, proposing various contrastive methods to detect gender biases, over- and under-translations (Vamvas and Sennrich, 2021a; Vamvas and Sennrich, 2022), and to identify various discourse-level phenomena in MT corpora (Fernandes et al., 2023). While these two directions have generally been investigated separately, our work proposes a unified framework to enable an end-to-end evaluation of context-reliance plausibility in language models.

Plausibility evaluation in NLP Plausibility evaluation for NLP models has primarily focused on classification models (DeYoung et al., 2020; Atanasova et al., 2020; Attanasio et al., 2023). While a few works investigate plausibility in language generation (Vafa et al., 2021; Ferrando et al., 2023), such evaluations typically involve a single generation step to complete a target sentence with a token connected to preceding information (e.g. subject/verb agreement, as in “The pictures of the cat [is/are]”), effectively biasing the evaluation by using a pre-selected token of interest. On the contrary, our framework proposes a more comprehensive evaluation of generation plausibility that includes the identification of context-sensitive generated tokens as an important prerequisite.

4.3 The PECoRe Framework

PECoRe is a two-step framework for identifying context dependence in generative language models. First, context-sensitive tokens identification (CTI) selects which tokens among those generated by the model were influenced by the presence of the preceding context (e.g. the feminine options “alla pastorella, le” in Figure 4.1). Then, contextual cues imputation (CCI) attributes the prediction of context-sensitive tokens to specific cues in the provided context (e.g. the feminine cues “she, Le pecore” in Figure 4.1). Cue-target pairs formed by influenced target tokens and their respective influential context cues can then be compared to human rationales to assess the models’ plausibility of context reliance for contextual phenomena of interest. Figure 4.2 provides an overview of the two steps applied to the context-aware MT setting discussed by this work. A more general formalization of the framework for language generation is proposed in the following sections.

Figure 4.2: The PECoRe framework applied to an encoder-decoder MT model. **Left:** Context-sensitive token identification (CTI). ⓵: A context-aware MT model translates source context ($C_x$) and current ($x$) sentences into target context ($C_{\hat y}$) and current ($\hat y$) outputs. ⓶: $\hat y$ is force-decoded in the non-contextual setting instead of natural output $\tilde y$. ⓷: Contrastive metrics are collected throughout the model for every $\hat y$ token to compare the two settings. ⓸: Selector $s_\text{cti}$ maps metrics to binary context-sensitive labels for every $\hat y_i$. **Right:** Contextual cues imputation (CCI). ⓵: Non-contextual target $\tilde y^*$ is generated from contextual prefix $\hat y_{<t}$. ⓶: Function $f_\text{tgt}$ is selected to contrast model predictions with ($\hat y_t$) and without ($\tilde y_t^*$) input context. ⓷: Attribution method $f_\text{att}$ using $f_\text{tgt}$ as target scores contextual cues driving $\hat y_t$ prediction. ⓸: Selector $s_\text{cci}$ selects relevant cues, and cue-target pairs are assembled.

4.3.1 Notation

Let $X_\text{ctx}^{i}$ be the sequence of contextual inputs containing $N$ tokens from vocabulary $\mathcal{V}$, composed of current input $x$, generation prefix $y_{<i}$ and context $C$. Let $X_\text{no-ctx}^{i}$ be the non-contextual input in which $C$ tokens are excluded.³ $P_\text{ctx}^{i} = P\left(x,\, y_{<i},\,C,\,\theta\right)$ is the discrete probability distribution over $\mathcal{V}$ at generation step $i$ of a language model with $\theta$ parameters receiving contextual inputs $X_\text{ctx}^{i}$. Similarly, $P_\text{no-ctx}^{i} = P\left(x,\, y_{<i},\,\theta\right)$ is the distribution obtained from the same model for non-contextual input $X_\text{no-ctx}^{i}$. Both distributions are equivalent to vectors in the probability simplex in $\mathbb{R}^{|\mathcal{V}|}$, and we use $P_\text{ctx}(y_i)$ to denote the probability of next token $y_i$ in $P_\text{ctx}^{i}$, i.e. $P(y_i\,|\,x,\,y_{<i},\,C)$.

4.3.2 Context-sensitive Token Identification (CTI)

CTI adapts the contrastive conditioning paradigm proposed by Vamvas and Sennrich (2021a) to detect input context influence on model predictions using the contrastive pair $P_\text{ctx}^{i}, P_\text{no-ctx}^{i}$. Both distributions are relative to the contextual target sentence $\hat y = \{\hat y_1 \dots \hat y_n\}$, corresponding to the sequence produced by a decoding strategy of choice in the presence of input context. In Figure 4.2, the contextual target sentence $\hat y=$ “Sont-elles à l’hôtel?” is generated when $x$ and contexts $C_x, C_{\hat y}$ are provided as inputs, while non-contextual target sentence $\tilde y =$ “Ils sont à l’hôtel?” would be produced when only $x$ is provided. In the latter case, $\hat y$ is instead force-decoded from the non-contextual setting to enable a direct comparison of matching outputs. We define a set of contrastive metrics $\mathcal{M} = \{m_1, \dots, m_M\}$, where each $m: \displaystyle \Delta_{|\mathcal{V}|} \times \Delta_{|\mathcal{V}|} \mapsto \mathbb{R}$ maps a contrastive pair of probability vectors to a continuous score. For example, the difference in next token probabilities for contextual and non-contextual settings, i.e. $P_\text{diff}(\hat y_i) = P_\text{ctx}(\hat y_i) - P_\text{no-ctx}(\hat y_i)$, might be used for this purpose.⁴ Target tokens with high contrastive metric scores can be identified as context-sensitive, provided $C$ is the only added parameter in the contextual setting. Finally, a selector function $s_\text{cti}: \displaystyle \mathbb{R}^{| \mathcal{M} |} \mapsto \{0,1\}$ (e.g. a statistical threshold selecting salient scores) is used to classify every $\hat y_i$ as context-sensitive or not.
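The CTI step described above can be illustrated with a minimal sketch that computes $P_\text{diff}$ for every force-decoded token and applies a simple statistical threshold as selector. The function names `prob_diff` and `std_threshold_selector`, the toy probabilities, and the one-standard-deviation cutoff are illustrative assumptions rather than the configuration used in this chapter.

```python
from statistics import mean, pstdev

def prob_diff(p_ctx, p_noctx):
    """P_diff for each generated token: P_ctx(y_i) - P_no-ctx(y_i)."""
    return [c - n for c, n in zip(p_ctx, p_noctx)]

def std_threshold_selector(scores, n_std=1.0):
    """Hypothetical s_cti: label 1 when a score exceeds mean + n_std * std."""
    threshold = mean(scores) + n_std * pstdev(scores)
    return [int(s > threshold) for s in scores]

# Made-up probabilities of the force-decoded tokens of the contextual output,
# computed with and without input context.
p_ctx = [0.90, 0.80, 0.85, 0.95]
p_noctx = [0.10, 0.78, 0.84, 0.95]

scores = prob_diff(p_ctx, p_noctx)
labels = std_threshold_selector(scores)  # 1 marks a context-sensitive token
```

In this toy case only the first token, whose probability drops sharply when context is removed, is labeled as context-sensitive.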

4.3.3 Contextual Cues Imputation (CCI)

CCI applies the contrastive attribution paradigm (Yin and Neubig, 2022) to trace the generation of every context-sensitive token in $\hat y$ back to the context $C$, identifying the cues that drive model predictions.

Definition 4.1 Let $s, s'$ be the resulting scores of two attribution target functions $f_\text{tgt}, f'_\text{tgt}$. An attribution method $f_\text{att}$ is target-dependent if importance scores $A$ are computed in relation to the outcome of its attribution target function, i.e. whenever the following condition is verified:

\[f_\text{att}(x, y_{<t}, C, \theta, s) \neq f_\text{att}(x, y_{<t}, C, \theta, s') \;\; \forall s \neq s'\]

In practice, common gradient-based attribution approaches (Simonyan et al., 2014; Sundararajan et al., 2017) are target-dependent as they rely on the outcome predicted by the model (typically the logit or the probability of the predicted class) as the differentiation target to backpropagate importance to model input features. Similarly, perturbation-based approaches (Zeiler and Fergus, 2014) use the variation in prediction probability for the predicted class when noise is added to some of the model inputs to quantify the importance of the noised features.

On the contrary, recent approaches that rely solely on model internals to define input importance are generally target-insensitive. For example, attention weights used as model rationales, either in their raw form or after a rollout procedure to obtain a unified score (Abnar and Zuidema, 2020), are independent of the predicted outcome. Similarly, value zeroing scores (Mohebbi et al., 2023) reflect only the representational dissimilarity across model layers before and after zeroing value vectors, and as such do not explicitly account for model predictions.

Definition 4.2 Let $\mathcal{T}$ be the set of indices corresponding to context-sensitive tokens identified by the CTI step, such that $t \in \hat y$ and $\forall t \in \mathcal{T}, s_\text{cti}(m_1^{t}, \dots, m_M^{t}) = 1$. Let also $f_\text{tgt}: \Delta_{|\mathcal{V}|} \times \dots \mapsto \mathbb{R}$ be a contrastive attribution target function representing an attribution target of interest, for example, the difference in next-token probabilities between the contextual option $\hat y_t$ and the non-contextual option $\tilde y^*_t$ from the same contextual distribution $P_\text{ctx}^{t}$, plus any additional required parameter. The contrastive attribution method $f_\text{att}$ is a composite function quantifying the importance of contextual inputs to determine the output of $f_\text{tgt}$ for a given model with $\theta$ parameters.

\[f_\text{att}(\hat y_{t}) = f_\text{att}(x, \hat y_{<t}, C, \theta, f_\text{tgt}) = f_\text{att}\big(x, \hat y_{<t}, C, \theta, f_\text{tgt}(P_\text{ctx}^t, \dots)\big)\]
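As a small illustration of the example attribution target mentioned in Definition 4.2, the sketch below computes the difference in next-token probabilities between the contextual and non-contextual options under the same contextual distribution. The dictionary-based distribution, the token strings, and the name `contrastive_prob_diff` are assumptions made for readability; in practice this scalar would be the quantity that a target-dependent attribution method differentiates or perturbs against.

```python
def contrastive_prob_diff(p_ctx_dist, contextual_token, noncontextual_token):
    """f_tgt sketch: P_ctx(contextual option) - P_ctx(non-contextual option)."""
    return p_ctx_dist[contextual_token] - p_ctx_dist[noncontextual_token]

# Made-up contextual next-token distribution P_ctx^t over a three-word vocabulary.
p_ctx_t = {"elles": 0.7, "ils": 0.2, "on": 0.1}

target_score = contrastive_prob_diff(p_ctx_t, "elles", "ils")
```

A large positive score indicates that, given the context, the model strongly prefers the contextual option over the token it would have produced without context.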

Remark 4.1. The non-contextual next token $\tilde y^*_t$ can be computed using the contextual prefix $\hat y_{<t} = \{ \hat y_1, \dots, \hat y_{t - 1}\}$ (e.g. $\hat y_{<t} =$“Sont-” in Figure 4.2) and non-contextual inputs $X_\text{no-ctx}^{t}$. This is conceptually equivalent to predicting the next token of a new non-contextual sequence $\tilde y^*$ which, contrary to the original $\tilde y$, starts from a forced contextual prefix $\hat y_{<t}$ (e.g. “ils” in $\tilde y^* =$ “Sont-ils à l’hôtel?” in Figure 4.2).

Remark 4.2. A $f_\text{tgt}$ making use of both $P_\text{ctx}^{t}$ and $P_\text{no-ctx}^{t}$, e.g. the KL divergence between the contextual and non-contextual probability distributions (Kullback and Leibler, 1951), can ultimately result in non-zero $f_\text{att}(\hat y_t)$ scores, even when $\hat y_t = \tilde y^*_t$, i.e. even when the next predicted token is the same, since probabilities $P_\text{ctx}(\hat y_t), P_\text{no-ctx}(\tilde y^*_t)$ are likely to differ beyond top-1 predictions. This is a desirable property of $f_\text{att}$, as it allows the attribution method to capture the influence of context on the model’s decision-making process, even in the case where the predicted token remains unchanged.

Remark 4.3. Our formalization of $f_\text{att}$ generalizes the method proposed by Yin and Neubig (2022) to support any target-dependent attribution method, such as popular gradient-based approaches (Simonyan et al., 2014; Sundararajan et al., 2017), and any contrastive attribution target $f_\text{tgt}$.

$f_\text{att}$ produces a sequence of attribution scores $A_t = \{a_1, \dots, a_N\}$ matching contextual input length $N$. From those, only the subset $A_{t\,\text{ctx}}$ of scores corresponding to context input sequence $C$ are passed to selector function $s_\text{cci}: \displaystyle \mathbb{R} \mapsto \{0,1\}$, which predicts a set $\mathcal{C}_{t}$ of indices corresponding to contextual cues identified by CCI, such that $\forall c \in \mathcal{C}_t, \forall a \in A_{t\,\text{ctx}}, s_\text{cci}(a_{c}) = 1$.

Having collected all context-sensitive generated token indices $\mathcal{T}$ using CTI and their contextual cues through CCI ($\mathcal{C}_t$), PECoRe ultimately returns a sequence $S_\text{ct}$ of all identified cue-target pairs:

\[
\begin{aligned}
\mathcal{T} &= \text{CTI}(C, x, \hat y, \theta, \mathcal{M}, s_\text{cti}) = \{t \;|\; s_\text{cti}(m_1^t, \dots, m_M^t) = 1 \} \\
\mathcal{C} &= \text{CCI}(\mathcal{T}, C, x, \hat y, \theta, f_\text{att}, f_\text{tgt}, s_\text{cci}) = \{ c \;|\; s_\text{cci}(a_c) = 1 \,\forall a_c \in A_{t\,\text{ctx}}, \forall t \in \mathcal{T}\} \\
S_\text{ct} &= \texttt{PECoRe}(C, x, \theta, s_\text{cti}, s_\text{cci}, \mathcal{M}, f_\text{att}, f_\text{tgt}) = \{ (C_c, \hat y_t) \;|\; \forall t \in \mathcal{T}, \forall c \in \mathcal{C}_t, \forall \mathcal{C}_t \in \mathcal{C} \}
\end{aligned}
\]

A pseudocode implementation for the PECoRe algorithm is provided in Algorithm 1.

\begin{algorithm}
\caption{PECoRe cue-target extraction process}
\begin{algorithmic}
\Require $C, x$ (Input context and current sequences), $\theta$ (Model parameters), $s_{\text{cti}}, s_{\text{cci}}$ (Selector functions), $\mathcal{M}$ (Contrastive metrics), $f_\text{att}$ (Contrastive attribution method), $f_\text{tgt}$ (Contrastive attribution target function)
\Procedure{PECoRe}{$C, x, \theta, s_\text{cti}, s_\text{cci}, \mathcal{M}, f_\text{att}, f_\text{tgt}$}
    \State $\hat y = \textnormal{generate(}C, x, \theta\textnormal{)}$ using any decoding strategy and parameters
    \State $\mathcal{T} = \textnormal{CTI(}C, x, \hat y, \theta, \mathcal{M}, s_\text{cti}\textnormal{)}$
    \ForAll{$t \in \mathcal{T}$}
        \State $\mathcal{C}_t = \textnormal{CCI(}t, C, x, \hat y, \theta, f_\text{att}, f_\text{tgt}, s_\text{cci}\textnormal{)}$
        \ForAll{$c \in \mathcal{C}_t$}
            \State Store $(C_t^c, \hat y_t)$ in $S_\text{ct}$
        \EndFor
    \EndFor
    \State \textbf{return} $S_\text{ct}$ // Set of cue-target pairs
\EndProcedure
\Procedure{CTI}{$C, x, \hat y, \theta, \mathcal{M}, s_\text{cti}$}
    \State $\mathcal{T} = \emptyset$ // Empty set for context-sensitive indices of $\hat y$ tokens
    \ForAll{$\hat{y}_i \in \hat{y}$}
        \ForAll{$m \in \mathcal{M}$}
            \State $m^i = m \big(P_{\text{ctx}}(\hat{y}_i), P_{\text{no-ctx}}(\hat{y}_i) \big)$
        \EndFor
        \If{$s_{\text{cti}}(m_1^i, \dots, m_M^i) = 1$}
            \State Store $i$ in set $\mathcal{T}$
        \EndIf
    \EndFor
    \State \textbf{return} $\mathcal{T}$
\EndProcedure
\Procedure{CCI}{$t, C, x, \hat y, \theta, f_\text{att}, f_\text{tgt}, s_\text{cci}$}
    \State $\mathcal{C}_t = \emptyset$ // Empty set for contextual cues for target token $t$
    \State Generate constrained non-contextual target current sequence $\tilde y^*$ from $\hat y_{<t}$
    \State Use attribution method $f_\text{att}$ with target $f_\text{tgt}$ to get importance scores $A_t$
    \State Identify the subset $A_{t\,\text{ctx}}$ corresponding to tokens of context $C = \{ C_1, \dots, C_K\}$
    \ForAll{$a_i \in A_{t\,\text{ctx}} = \{a_1, \dots, a_K\}$}
        \If{$s_\text{cci}(a_i) = 1$}
            \State Store $C_i$ in $\mathcal{C}_t$
        \EndIf
    \EndFor
    \State \textbf{return} $\mathcal{C}_t$
\EndProcedure
\end{algorithmic}
\end{algorithm}
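The control flow of Algorithm 1 can also be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Inseq implementation: contrastive metric scores are assumed to be precomputed, the callable `attribute` is a hypothetical stand-in for $f_\text{att}$ with target $f_\text{tgt}$ that returns one score per context token, and all tokens, scores, and thresholds are made up.

```python
from typing import Callable, List, Sequence, Tuple

def cti(metric_scores: Sequence[Sequence[float]],
        s_cti: Callable[[Sequence[float]], int]) -> List[int]:
    """Return indices t of generated tokens selected as context-sensitive."""
    return [t for t, scores in enumerate(metric_scores) if s_cti(scores) == 1]

def cci(context_scores: Sequence[float],
        s_cci: Callable[[float], int]) -> List[int]:
    """Return indices c of context tokens selected as contextual cues."""
    return [c for c, a in enumerate(context_scores) if s_cci(a) == 1]

def pecore(context_tokens: Sequence[str],
           target_tokens: Sequence[str],
           metric_scores: Sequence[Sequence[float]],
           attribute: Callable[[int], Sequence[float]],
           s_cti: Callable[[Sequence[float]], int],
           s_cci: Callable[[float], int]) -> List[Tuple[str, str]]:
    """Assemble (cue, target) pairs for every context-sensitive target token."""
    pairs = []
    for t in cti(metric_scores, s_cti):
        for c in cci(attribute(t), s_cci):
            pairs.append((context_tokens[c], target_tokens[t]))
    return pairs

# Toy inputs (all values are illustrative assumptions).
context_tokens = ["les", "filles"]
target_tokens = ["Sont", "-elles", "à", "l'hôtel", "?"]
metric_scores = [[0.01], [0.70], [0.00], [0.02], [0.00]]  # one metric per token

pairs = pecore(
    context_tokens,
    target_tokens,
    metric_scores,
    attribute=lambda t: [0.1, 0.9],        # stub for f_att with f_tgt
    s_cti=lambda scores: int(scores[0] > 0.5),
    s_cci=lambda a: int(a > 0.5),
)
```

Keeping the selectors as plain callables mirrors the formalization above, where $s_\text{cti}$ and $s_\text{cci}$ can be swapped independently of the metrics and the attribution method.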

4.4 Context Reliance Plausibility in Context-aware MT

This section describes our evaluation of PECoRe in a controlled setup. We experiment with several contrastive metrics and attribution methods for CTI and CCI (Section 4.4.2, Section 4.4.5), evaluating them in isolation to quantify the performance of individual components. An end-to-end evaluation is also performed in Section 4.4.5 to establish the applicability of PECoRe in a naturalistic setting.

4.4.1 Experimental Setup

Evaluation Datasets Evaluating generation plausibility requires human annotations for context-sensitive tokens in target sentences and disambiguating cues in their preceding context. To our knowledge, the SCAT dataset (Yin et al., 2021) is the only resource matching these requirements. SCAT is an English$\rightarrow$French corpus with human annotations of anaphoric pronouns and disambiguating context on OpenSubtitles2018 dialogue translations (Lison et al., 2018; Lopes et al., 2020). SCAT examples were extracted automatically using lexical heuristics and thus contain only a limited set of anaphoric pronouns (it, they $\rightarrow$ il/elle, ils/elles), with no guarantees of contextual cues being found in preceding context.

The original SCAT test set contains 1000 examples with automatically identified context-sensitive pronouns it/they (marked by <p>...</p>) and human-annotated contextual cues aiding their disambiguation (marked by <hon>...</hoff>). Of these, we find 38 examples containing malformed tags and several more examples where an unrelated word containing it or they was wrongly marked as context-sensitive (e.g. the soccer ball h<p>it</p> your chest). Moreover, due to the original extraction process adopted for SCAT, there is no guarantee that contextual cues will be contained in the preceding context, as they could also appear in the same sentence, defeating the purpose of our context usage evaluation. Thus, we prefilter the entire corpus to retain only sentences with well-formed tags and inter-sentential contextual cues identified by the original annotators. Moreover, a manual inspection procedure is carried out to validate the original cue tags and discard problematic sentences, obtaining a final set of 250 examples with inter-sentential pronoun coreference, which we name SCAT+⁵.

Additionally, we manually annotate contextual cues in DiscEval-MT (Bawden et al., 2018), another English$\rightarrow$French corpus containing handcrafted examples for anaphora resolution (ana) and lexical choice (lex). In the case of DiscEval-MT, we use minimal pairs in the original dataset to automatically mark differing tokens as context-sensitive. Then, contextual cues are manually labeled separately by two annotators with good familiarity with both English and French. Cue annotations are compared across the two splits, resulting in very high agreement due to the simplicity of the corpus ($97\%$ overlap for ana, $90\%$ for lex).⁶

Our final evaluation set contains 250 SCAT+ and 400 DiscEval-MT translations across two discourse phenomena. Table 4.1 provides some examples for the three data splits.

SCAT+

$C_x$: I loathe that song. But why did you bite poor Birdie’s head off? Because I’ve heard it more times than I care to. It haunts me. Just stop, for a moment.

$C_y$: Je hais cette chanson (song, feminine). Mais pourquoi avoir parlé ainsi à la pauvre Birdie ? Parce que j’ai entendu ce chant plus que de fois que je ne le peux. Elle (she) me hante. Arrêtez-vous un moment.

$x$: How does it haunt you?

$y$: Comment peut-elle (she) vous hanter?

$C_x$: - Ah! Sven! It’s been so long. - Riley, it’s good to see you. - You, too. How’s the boat? Uh, it creaks, it groans.

$C_y$: Sven ! - Riley, contente de te voir. - Content aussi. Comment va le bateau (boat, masculine)? Il (he) craque de partout.

$x$: Not as fast as it used to be.

$y$: Il (he) n’est pas aussi rapide qu’avant.

DiscEval-MT ana

$C_x$: But how do you know the woman isn’t going to turn out like all the others?

$C_y$: Mais comment tu sais que la femme (woman, feminine) ne finira pas comme toutes les autres?

$x$: This one’s different.

$y$: Celle-ci (This one, feminine) est différente.

$C_x$: Can you authenticate these signatures, please?

$C_y$: Pourriez-vous authentifier ces signatures (feminine), s’il vous plaît?

$x$: Yes, they’re mines.

$y$: Oui, ce sont les miennes (mines, feminine).

DiscEval-MT lex

$C_x$: Do you think you can shoot it from here?

$C_y$: Tu penses que tu peux le tirer (shoot) dessus à partir d’ici?

$x$: Hand me that bow.

$y$: Passe-moi cet arc (bow, weapon).

$C_x$: Can I help you with the wrapping?

$C_y$: Est-ce que je peux t’aider pour l’emballage (wrapping)?

$x$: Hand me that bow.

$y$: Passe-moi ce ruban (bow, gift wrap).

Table 4.1: Examples from the SCAT+ and DiscEval-MT datasets used in our analysis with highlighted context-sensitive tokens and contextual cues used for plausibility evaluation using PECoRe. Glosses are added for French words of interest to facilitate understanding.

Models We evaluate two bilingual Opus models (Small and Large; Tiedemann and Thottingal, 2020) using the Transformer base architecture (Vaswani et al., 2017), and mBART-50 1-to-many (Tang et al., 2021), a larger multilingual MT model supporting 50 target languages. All models are used through the 🤗 transformers library (Wolf et al., 2020). We fine-tune models using extended translation units (Tiedemann and Scherrer, 2017) with contextual inputs marked by break tags such as source context <brk> source current to produce translations in the format target context <brk> target current, where both the context and current target sentences are generated. We perform context-aware fine-tuning on 242k IWSLT 2017 English$\rightarrow$French examples (Cettolo et al., 2017), using a dynamic context size of 0-4 preceding sentences to ensure robustness to different context lengths and allow contextless usage. To further improve models’ context sensitivity, we continue fine-tuning on the SCAT training split, containing 11k examples with inter- and intra-sentential pronoun anaphora.
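The extended translation unit format described above can be sketched as follows. The helper names `build_contextual_input` and `split_contextual_output` are hypothetical, and the choice to omit the break tag entirely when no context is given is an assumption, since the text only states that contextless usage is supported.

```python
def build_contextual_input(context_sentences, current, max_context=4, brk="<brk>"):
    """Join up to `max_context` preceding sentences and the current one with a break tag."""
    # Guard against max_context == 0: a slice of [-0:] would keep the whole list.
    context = context_sentences[-max_context:] if max_context > 0 else []
    if not context:
        return current  # assumed contextless format: no break tag
    return f"{' '.join(context)} {brk} {current}"

def split_contextual_output(output, brk="<brk>"):
    """Split a generated 'target context <brk> target current' string."""
    context, sep, current = output.rpartition(brk)
    if not sep:
        return "", output.strip()
    return context.strip(), current.strip()

source = build_contextual_input(
    ["I loathe that song.", "It haunts me."], "How does it haunt you?"
)
```

Splitting on the last break tag keeps the current sentence intact even if the generated context happens to contain the tag more than once.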

Model	SCAT+			DiscEval-MT (ana)			DiscEval-MT (lex)
Model	bleu	ok	ok-cs	bleu	ok	ok-cs	bleu	ok	ok-cs
Opus Small (def.)	29.1	0.14	-	43.9	0.40	-	30.5	0.29	-
Opus Small S+T$_{\text{ctx}}$	39.1	0.81	0.59	48.1	0.60	0.24	33.5	0.36	0.07
Opus Large (def.)	29.0	0.16	-	39.2	0.41	-	31.2	0.31	-
Opus Large S+T$_{\text{ctx}}$	40.3	0.83	0.58	48.9	0.68	0.31	34.8	0.38	0.10
mBART-50 (def.)	23.8	0.26	-	33.4	0.42	-	24.5	0.25	-
mBART-50 S+T$_{\text{ctx}}$	37.6	0.82	0.55	49.0	0.62	0.32	29.3	0.30	0.07

Table 4.2: Translation quality of English$\rightarrow$French MT models before (def.) and after (S+T$_\text{ctx}$) context-aware MT fine-tuning. ok: proportion of translations with correct disambiguation for discourse phenomena. ok-cs: proportion of translations where the correct disambiguation is achieved only when context is provided.

Model Disambiguation Accuracy We estimate contextual disambiguation accuracy by verifying whether annotated (gold) context-sensitive words are found in model outputs. Results before and after context-aware fine-tuning are shown in Table 4.2. We find that fine-tuning improves translation quality and disambiguation accuracy across all tested models, with larger gains for anaphora resolution datasets that closely match the fine-tuning data. To gain further insight into these results, we use context-aware models to translate examples with and without context and identify a subset of context-sensitive translations (ok-cs) for which the correct target word is generated only when input context is provided to the model. Interestingly, we find a non-negligible number of translations that are correctly disambiguated even in the absence of input context (corresponding to ok minus ok-cs in Table 4.2). For these examples, the correct prediction of ambiguous words aligns with model biases, such as defaulting to masculine gender for anaphoric pronouns (Stanovsky et al., 2019) or using the most frequent sense for word sense disambiguation. Given that such examples are unlikely to exhibit context reliance, we focus particularly on the ok-cs subset results in our following evaluation.
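A rough sketch of how the ok and ok-cs rates described above could be computed is given below. Whitespace tokenization, exact string matching of the gold word, and the function name `disambiguation_rates` are simplifying assumptions, and the toy outputs are invented for illustration.

```python
def disambiguation_rates(gold_words, ctx_outputs, noctx_outputs):
    """Return (ok, ok_cs) as fractions of examples.

    ok:    gold word found in the translation produced with context.
    ok_cs: gold word found with context but not without it.
    """
    ok = ok_cs = 0
    for gold, with_ctx, without_ctx in zip(gold_words, ctx_outputs, noctx_outputs):
        found_ctx = gold in with_ctx.split()
        found_noctx = gold in without_ctx.split()
        ok += found_ctx
        ok_cs += found_ctx and not found_noctx
    n = len(gold_words)
    return ok / n, ok_cs / n

# Toy examples (made up): only the first one needs context to be correct.
gold = ["elles", "arc", "ruban"]
with_context = ["elles sont à l'hôtel ?", "passe-moi cet arc .", "passe-moi cet arc ."]
without_context = ["ils sont à l'hôtel ?", "passe-moi cet arc .", "passe-moi cet arc ."]

ok, ok_cs = disambiguation_rates(gold, with_context, without_context)
```

Here the second example counts toward ok but not ok-cs, since the gold word is produced even without context.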

4.4.2 Metrics for Context-sensitive Target Identification

The following contrastive metrics are evaluated for detecting context-sensitive tokens in the CTI step.

Relative Context Saliency We use contrastive gradient norm attribution (Yin and Neubig, 2022) to compute input importance towards predicting the next token $\hat y_i$ with and without input context. Positive importance scores are obtained for every input token using the L2 norm of its gradient vector (Bastings et al., 2022), and relative context saliency is obtained as the ratio between the summed importance of context tokens $c \in C_x, C_y$ and the overall input importance, following previous work quantifying MT input contributions (Voita et al., 2021; Ferrando et al., 2022a; Edman et al., 2024).

\[\nabla_\text{ctx} (P_\text{ctx}^{i}, P_\text{no-ctx}^{i}) = \frac{\sum_{c \in C_x, C_y} \big\| \nabla_c \big( P_\text{ctx}(\hat y_i) - P_\text{no-ctx}(\hat y_i) \big) \big\|}{\sum_{t \in X_\text{ctx}^{i}} \big\| \nabla_t \big( P_\text{ctx}(\hat y_i) - P_\text{no-ctx}(\hat y_i) \big) \big\|}\]
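The ratio above can be sketched once per-token gradient vectors are available. In the snippet below the gradients of the contrastive target with respect to each input token are assumed to be precomputed (a real implementation would obtain them through automatic differentiation), and the toy vectors and the name `relative_context_saliency` are illustrative assumptions.

```python
import math

def l2_norm(vector):
    """L2 norm of one token's gradient vector."""
    return math.sqrt(sum(v * v for v in vector))

def relative_context_saliency(gradients, is_context):
    """Share of total gradient-norm importance assigned to context tokens."""
    norms = [l2_norm(g) for g in gradients]
    total = sum(norms)
    if total == 0:
        return 0.0
    context_total = sum(n for n, flag in zip(norms, is_context) if flag)
    return context_total / total

# Toy gradients for four input tokens; the first two belong to the context.
gradients = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0], [0.0, 5.0]]
is_context = [True, True, False, False]

saliency = relative_context_saliency(gradients, is_context)
```

The per-token norms are 5, 0, 10 and 5, so context tokens account for 5 out of 20 units of importance.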

Likelihood Ratio (LR) and Pointwise Contextual Cross-mutual Information (P-CXMI) Proposed by Vamvas and Sennrich (2021a) and Fernandes et al. (2023), respectively, both metrics frame context dependence as a ratio of contextual and non-contextual probabilities.

\[\text{LR}(P_\text{ctx}^{i}, P_\text{no-ctx}^{i}) = \frac{P_\text{ctx}(\hat{y}_i)}{P_\text{ctx}(\hat{y}_i) + P_\text{no-ctx}(\hat{y}_i)}\]

\[\text{P-CXMI}(P_\text{ctx}^{i}, P_\text{no-ctx}^{i}) = - \log \frac{P_\text{ctx}(\hat{y}_i)}{P_\text{no-ctx}(\hat{y}_i)}\]

KL-Divergence (Kullback and Leibler, 1951) between $P_\text{ctx}^{i}$ and $P_\text{no-ctx}^{i}$ is the only metric we evaluate that considers the full distribution rather than the probability of the predicted token. We include it to test the intuition that the impact of context inclusion might extend beyond top-1 token probabilities.

\[D_\text{KL}(P_\text{ctx}^{i} \| P_\text{no-ctx}^{i}) = \sum_{\hat{y}_i \in \mathcal{V}} P_\text{ctx}(\hat{y}_i) \log \frac{P_\text{ctx}(\hat{y}_i)}{P_\text{no-ctx}(\hat{y}_i)}\]
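The three probability-based metrics can be illustrated on a toy vocabulary. The sketch below follows the formulas as written above; the distributions, the function names, and the `eps` guard against zero probabilities are assumptions made for illustration only.

```python
import math


def likelihood_ratio(p_ctx, p_no_ctx):
    """LR: contextual probability over the sum of both probabilities."""
    return p_ctx / (p_ctx + p_no_ctx)


def p_cxmi(p_ctx, p_no_ctx):
    """P-CXMI with the sign convention of the formula above."""
    return -math.log(p_ctx / p_no_ctx)


def kl_divergence(dist_ctx, dist_no_ctx, eps=1e-12):
    """KL(P_ctx || P_no-ctx) over the full vocabulary.

    eps is an illustrative guard against division by zero, not part of the definition.
    """
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(dist_ctx, dist_no_ctx)
        if p > 0  # terms with p = 0 contribute nothing
    )


# Toy next-token distributions over a 3-word vocabulary (hypothetical values).
dist_ctx = [0.7, 0.2, 0.1]
dist_no_ctx = [0.2, 0.7, 0.1]
predicted = 0  # index of the token generated with context

lr = likelihood_ratio(dist_ctx[predicted], dist_no_ctx[predicted])
cxmi = p_cxmi(dist_ctx[predicted], dist_no_ctx[predicted])
kl = kl_divergence(dist_ctx, dist_no_ctx)
```

LR and P-CXMI only read the probability of the predicted token, while KL-Divergence also reflects the probability mass that moved away from the second vocabulary item.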

4.4.3 Plausibility Evaluation Metrics

In practice, the CTI and CCI steps in PECoRe produce a sequence of continuous scores that are later binarized using selectors $s_\text{cti}, s_\text{cci}$, introduced in Section 4.3. To evaluate their validity, those are compared to a sequence $I_h$ of the same length containing binary values, where 1s correspond to the cues identified by human annotators, while the rest are set to 0. In our experiments, we use two standard plausibility metrics introduced by DeYoung et al. (2020):

Token-level Macro F1 is the harmonic mean of precision and recall at the token level, using $I_h$ as the ground truth and the post-selector binarized scores as predictions. Macro-averaging is used to account for the sparsity of cues in $I_h$. We use this metric in our primary analysis, as the discretization step is more likely to reflect realistic plausibility performance, since it matches more closely the annotation process used to derive $I_h$. We note that Macro F1 can be considered a lower bound for plausibility, as the results depend heavily on the choice of the selector used for discretization.

Area Under Precision-Recall Curve (AUPRC) is computed as the area under the curve obtained by varying a threshold over token importance scores and computing the precision and recall for resulting discretized $I_m$ predictions while keeping $I_h$ as the ground truth. Contrary to Macro F1, AUPRC is selector-independent and accounts for tokens’ relative ranking and degree of importance. Consequently, it acts as an upper bound for plausibility, as if the optimal selector was used. Results using AUPRC are presented in Section A.2.2 for completeness, but we focus on Macro F1 in the primary analysis.
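The two metrics can be sketched in plain Python for a single example. Two assumptions are made here and should be read as such: macro-averaging is interpreted as the unweighted mean of the F1 scores for the positive and negative token classes, and the area under the precision-recall curve is estimated with average precision (a step-wise sum that ignores score ties). Neither choice is confirmed by the text as the exact evaluation procedure.

```python
def f1_for_class(gold, pred, cls):
    """F1 for one label value, treating cls as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Assumption: unweighted mean of F1 over the two token classes (0 and 1)."""
    return (f1_for_class(gold, pred, 0) + f1_for_class(gold, pred, 1)) / 2


def auprc(gold, scores):
    """Average-precision estimate of the area under the precision-recall curve."""
    n_pos = sum(gold)
    if n_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, gold), key=lambda pair: pair[0], reverse=True)
    tp, area = 0, 0.0
    for rank, (_, g) in enumerate(ranked, start=1):
        if g == 1:
            tp += 1
            area += tp / rank  # precision at each newly recalled positive
    return area / n_pos


# Hypothetical example: I_h, binarized predictions, and continuous scores.
i_h = [0, 1, 0, 0, 1]
binarized = [0, 1, 1, 0, 0]
raw_scores = [0.1, 0.9, 0.8, 0.2, 0.3]

f1 = macro_f1(i_h, binarized)
ap = auprc(i_h, raw_scores)
```

The first function depends on the binarized predictions and therefore on the selector, whereas the second only depends on how the continuous scores rank the tokens.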

4.4.4 CTI Plausibility Results

Figure 4.3 presents our metrics evaluation for CTI, with results for the full test sets and the subsets of context-sensitive sentences (ok-cs) highlighted in Table 4.2. To keep our evaluation simple, we use a naive $s_\text{cti}$ selector tagging all tokens with metric scores one standard deviation above the per-example mean as context-sensitive. We also include a stratified random baseline matching the frequency of occurrence of context-sensitive tokens in each dataset. Datapoints in Figure 4.3 are sentence-level macro F1 scores computed for every dataset example.
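A selector of this kind can be sketched as follows. The function name and the `n_std` parameter are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption, since the text does not specify it.

```python
import statistics


def std_selector(scores, n_std=1.0):
    """Tag scores lying more than n_std standard deviations above the mean.

    Returns a list of 0/1 tags with the same length as scores.
    """
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # assumption: population standard deviation
    threshold = mean + n_std * std
    return [1 if s > threshold else 0 for s in scores]


# Hypothetical per-token metric scores for one generated sentence.
token_scores = [0.1, 0.1, 0.1, 0.1, 2.0]
tags = std_selector(token_scores)  # only the last token exceeds the threshold
```

Because the threshold is computed per example, every sentence is judged relative to its own score distribution; a larger `n_std` keeps only the most salient tokens.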

Figure 4.3: Macro F1 of contrastive metrics for context-sensitive target token identification (CTI) using Opus Large on the full datasets (left) or on ok-cs context-sensitive subsets (right).

Pointwise metrics (LR, P-CXMI) show high plausibility for the context-sensitive subsets ok-cs across all datasets and models, but achieve lower performances on the full test set, especially for lexical choice phenomena less present in MT models’ training. KL-Divergence performs on par with or better than pointwise metrics, suggesting that distributional shifts beyond top prediction candidates can provide helpful information for detecting context sensitivity. On the contrary, the poor performance of context saliency suggests that aggregate context reliance cannot reliably predict context sensitivity. A manual examination of misclassified examples reveals several context-sensitive tokens that were not annotated as such, as they did not match the dataset’s phenomena of interest, but were still identified by CTI metrics. Table 4.3 presents several examples illustrating the contextual influence of French pronoun formality, whereas SCAT+ examples focus solely on gender disambiguation for anaphoric pronouns. This suggests that our evaluation of CTI metrics’ plausibility can be considered a lower bound for actual method accuracy, as it is restricted to the two phenomena available in the datasets we used (anaphora resolution and lexical choice), rather than the broad set of contextual dependence phenomena. These results further underscore the importance of data-driven, end-to-end approaches like PECoRe in limiting the influence of selection bias during evaluation.

Pronoun Grammatical Formality, SCAT+

$C_x$: […] That demon that was in you, it wants you. But not like before. I think it loves you.

$C_y$: […] Ce démon qui était en vous, il vous veut. Mais pas comme avant. Je pense qu’il vous aime.

$x$: And it’s powerless without you.

$y$: Et il est impuissant sans vous (you, 2nd p. plur., formal).

$C_x$: You threaten my father again, I’ll kill you myself… on this road. You hear me?

$C_y$: Tu menaces encore mon père, je te tuerai moi-même… sur cette route. Tu m’entends?

$x$: Now it is with you as well.

$y$: Maintenant elle est aussi avec toi (you, 2nd p. sing., informal).

$C_x$: She went back to Delhi. What do you think? […] Girls, I tell you.

$C_y$: Elle est revenue à Delhi. Qu’en penses-tu? […] Les filles, je te le dis.

$x$: I wish they were all like you.

$y$: J’aimerais qu’elles soient toutes comme toi (you, 2nd p. sing., informal).

Table 4.3: Examples of SCAT+ sentences with context-sensitive target tokens identified by CTI but not originally labeled as context-dependent in the dataset, since they do not match the rule-based gendered pronoun matching used to create SCAT+. Relevant formality contextual cues are highlighted, and glosses are added for French words of interest to facilitate understanding.

4.4.5 Methods for Contextual Cues Imputation

The following attribution methods are evaluated for detecting contextual cues in the CCI step.

Contrastive Gradient Norm (Yin and Neubig, 2022) estimates the input tokens’ contributions towards predicting a target token, rather than a contrastive alternative. We use this method to explain the generation of context-sensitive tokens in the presence and absence of context.

\[A_{t\,\text{ctx}} = \{\,\| \nabla_c \big(f_\text{tgt}(P_\text{ctx}^{i}, \dots) \big)\|\,|\, \forall c \in C\}\]

For the choice of $f_\text{tgt}$, we evaluate both probability difference $P_\text{ctx}(\hat y_i) - P_\text{no-ctx}(\hat y_i)$, conceptually similar to the original formulation, and the KL-Divergence of contextual and non-contextual distributions $D_\text{KL}(P_\text{ctx}^{i} \| P_\text{no-ctx}^{i})$. We use $\nabla_\text{diff}$ and $\nabla_\text{KL}$ to identify gradient norm attribution in the two settings. $\nabla_\text{KL}$ scores can be seen as the contribution of input tokens towards the shift in probability distribution caused by the presence of input context.⁷

Attention Weights Following previous work, we use the mean attention weight across all heads and layers (Attention Mean, Kim et al. (2019)) and the weight for the head obtaining the highest plausibility per-dataset (Attention Best, Yin et al. (2021)) as importance measures for CCI. Attention Best can be seen as a best-case estimate of attention performance but is not a viable metric in real settings, since the best attention head for capturing a phenomenon of interest is unknown beforehand. Since attention weights are model byproducts unaffected by predicted outputs, we use only attention scores for the contextual setting $P_\text{ctx}^{i}$ and ignore the contextless alternative when using these metrics.
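The Attention Mean aggregation can be sketched on nested lists. The `attn[layer][head][target][source]` layout, the function name, and the toy weights are assumptions for illustration; which attention type is extracted from the model (for instance cross-attention in an encoder-decoder) is deliberately left abstract.

```python
def attention_mean(attn, target_pos):
    """Average attention from one target position over all layers and heads.

    attn: nested lists indexed as attn[layer][head][target][source].
    Returns one mean weight per source token.
    """
    rows = [head[target_pos] for layer in attn for head in layer]
    n_sources = len(rows[0])
    return [sum(row[j] for row in rows) / len(rows) for j in range(n_sources)]


# Toy weights: 2 layers x 2 heads, 1 target position, 3 source tokens.
toy_attn = [
    [[[0.2, 0.3, 0.5]], [[0.6, 0.2, 0.2]]],
    [[[0.1, 0.1, 0.8]], [[0.3, 0.4, 0.3]]],
]
mean_weights = attention_mean(toy_attn, target_pos=0)
```

Attention Best would instead keep the row of a single layer and head, which requires knowing in advance which head to select.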

4.4.6 CCI Plausibility Results

Figure 4.4: Macro F1 of CCI methods over full datasets using Opus Large models trained with only source context (left) or with source+target context (right). Boxes and red median lines show CCI results based on gold context-sensitive tokens. Dotted bars show median CCI scores obtained from context-sensitive tokens identified by KL-Divergence during CTI (E2E settings).

We conduct a controlled CCI evaluation using gold context-sensitive tokens as the starting point to attribute contextual cues. Since gold context-sensitive tokens are only available in annotated reference translations, a simple option when applying CCI to those would involve using references as model generations. However, this was shown to be problematic by previous research, as it would induce a distributional discrepancy in model predictions (Vamvas and Sennrich, 2021b). For this reason, we let the model generate a natural translation and instead try to align tags to this new sentence using the awesome-align word aligner (Dou and Neubig, 2021) with LaBSE multilingual embeddings (Feng et al., 2022). While this process is not guaranteed to always result in accurate tags, it provides a good approximation of gold CTI annotations for model generation, which is suitable for our assessment. This corresponds to the baseline plausibility evaluation described in Section 2.2.2, allowing us to evaluate attribution methods in isolation, assuming perfect identification of context-sensitive tokens. Figure 4.4 presents our results. Scores in the right plot are relative to the context-aware Opus Large model of Section 4.4.4 using both source and target context. Instead, the left plot presents results for an alternative version of the same model that was fine-tuned using only the source context (i.e., translating $C_x, x \rightarrow y$ without producing the target context $C_y$). Source-only context was used in previous context-aware MT studies (Fernandes et al., 2022), and we include it in our analysis to assess how the presence of target context impacts model plausibility. We finally validate the end-to-end plausibility of PECoRe-detected pairs using context-sensitive tokens identified by the best CTI metric from Section 4.4.4 (KL-Divergence) as the starting point for CCI, and using a simple statistical selector equivalent to the one used for CTI evaluation.

First, contextual cues are more easily detected for the source-only model using all evaluated methods. This finding corroborates previous evidence highlighting how context usage issues might emerge when lengthy context is provided (Fernandes et al., 2021; Shi et al., 2023). When moving from gold CTI tags to the end-to-end setting (E2E), we observe a larger drop in plausibility for the SCAT+ and DiscEval-MT ana datasets that more closely match the fine-tuning data of analyzed MT models. This suggests that standard evaluation practices may overestimate model plausibility for in-domain settings and that our proposed framework can effectively mitigate this issue. Interestingly, the Attention Best method suffers the most from end-to-end CCI application, while other approaches are more mildly affected. This can result from attention heads failing to generalize to other discourse-level phenomena at test time, providing further evidence of the limitations of attention as an explanatory metric (Jain and Wallace, 2019; Bastings and Filippova, 2020). While $\nabla_\text{diff}$ and $\nabla_\text{KL}$ appear as the most robust choices across the two datasets, per-example variability remains high across the board, leaving room for more plausible attribution methods in future work.

4.5 Detecting Context Reliance in the Wild

We continue our analysis by applying the PECoRe method to the popular Flores-101 MT benchmark (Goyal et al., 2022), containing groups of 3-5 contiguous sentences from English Wikipedia. While previous sections used labeled examples to evaluate the effectiveness of PECoRe components, here we apply our framework end-to-end to unannotated MT outputs and inspect the resulting cue-target pairs to identify the successes and failures of context-aware MT models.

Specifically, we apply PECoRe to the context-aware Opus Large and mBART-50 models of Section 4.4.1, using KL-Divergence as CTI metric and $\nabla_\text{KL}$ as CCI attribution method. We set $s_\text{cti}$ and $s_\text{cci}$ to two standard deviations above the per-example average score to focus our analysis on very salient tokens.

1. Acronym Translation (English → French, correct but more generic)

$C_x$: Across the United States of America, there are approximately 400,000 known cases of Multiple Sclerosis (MS) […]

$C_y$: Aux États-Unis, il y a environ 400 000 cas connus de sclérose en plaques […]

$x$: MS affects the central nervous system, which is made up of the brain, the spinal cord and the optic nerve.

$\tilde y$: La SEP affecte le système nerveux central, composé du cerveau, de la moelle épinière et du nerf optique.

$\hat y$: La maladie affecte le système nerveux central, composé du cerveau, de la moelle épinière et du nerf optique.

2. Anaphora Resolution (English → French, incorrect)

$C_x$: The terrified King and Madam Elizabeth were forced back to Paris by a mob of market women.

$C_y$: Le roi et Madame Elizabeth ont été forcés à revenir à Paris par une foule de femmes du marché.

$x$: In a carriage, they traveled back to Paris surrounded by a mob of people screaming and shouting threats […]

$\tilde y$: Dans une carriole, ils sont retournés à Paris entourés d’une foule de gens hurlant et criant des menaces […]

$\hat y$: Dans une carriole, elles sont retournées à Paris entourées d’une foule de gens hurlant et criant des menaces […]

Table 4.4: Flores-101 examples with cue-target pairs identified by PECoRe in Opus Large contextual translations. Context-sensitive tokens generated instead of their non-contextual counterparts are identified by CTI, and contextual cues justifying their predictions are retrieved by CCI. Other changes in $\hat y$ are not considered context-sensitive by PECoRe.

3. Numeric format cohesion (English → French, incorrect)

$C_x$: The game kicked off at 10:00am with great weather apart from mid morning drizzle […]

$C_y$: Le match a commencé à 10:00 du matin avec un beau temps à part la nuée du matin […]

$x$: South Africa started on the right note when they had a comfortable 26-00 win against Zambia.

$\tilde y$: L’Afrique du Sud a commencé sur la bonne note quand ils ont eu une confortable victoire de 26 contre le Zambia.

$\hat y$: L’Afrique du Sud a commencé sur la bonne note quand ils ont eu une confortable victoire de 26:00 contre le Zambia.

4. Lexical cohesion (English → Turkish, correct)

$C_x$: The activity of all stars in the system was found to be driven by their luminosity, their rotation, and nothing else.

$C_y$: Sistemdeki bütün ulduzların faaliyetlerinin, parlaklıkları, rotasyonları ve başka hiçbir şeyin etkisi altında olduğunu ortaya çıkardılar.

$x$: The luminosity and rotation are used together to determine a star’s Rossby number, which is related to plasma flow.

$\tilde y$: Parlaklık ve döngü, bir akışıyla ilgili Rossby sayısını belirlemek için birlikte kullanılıyor.

$\hat y$: Parlaklık ve rotasyon, bir akışıyla ilgili Rossby sayısını belirlemek için birlikte kullanılıyor.

Table 4.5: Flores-101 examples with cue-target pairs identified by PECoRe in mBART-50 contextual translations. Context-sensitive tokens generated instead of their non-contextual counterparts are identified by CTI, and contextual cues justifying their predictions are retrieved by CCI. Other changes in $\hat y$ are not considered context-sensitive by PECoRe.

Table 4.4 and Table 4.5 show some examples annotated with PECoRe outputs. In the first example, the acronym MS, standing for Multiple Sclerosis, is translated generically as la maladie (the illness) in the contextual output, but as SEP (the French acronym for MS, i.e. sclérose en plaques) when context is not provided. PECoRe shows how this choice is mostly driven by the MS mention in source context $C_x$ while the term sclérose en plaques in target context $C_y$ is not identified as influential, possibly motivating the choice for the more generic option.

In the second example, the prediction of pronoun elles (they, feminine) depends on the context noun phrase mob of market women (foule de femmes du marché in French). However, the correct pronoun referent is Le roi et Madame Elizabeth (the king and Madam Elizabeth), so the pronoun should be the masculine default ils, commonly used for mixed-gender groups in French. PECoRe identifies this as a context-dependent failure due to an issue with the MT model’s anaphora resolution.

The third example presents an interesting case of erroneous numeric format cohesion that would typically go undetected when relying on pre-defined linguistic hypotheses. In this sentence, the score 26-00 is translated as 26 in the contextless output and as 26:00 in the context-aware translation. The 10:00 time indications found by PECoRe in the contexts suggest this is a case of problematic lexical cohesion.

Finally, we include an example of context usage for English$\rightarrow$Turkish translation to test the contextual capabilities of the default mBART-50 model without context-aware fine-tuning. Again, PECoRe shows how the word rotasyon (rotation) is selected over döngü (loop) as the correct translation in the contextual case due to the presence of the lexically similar word rotasyonları in the previous context.

4.6 Integrating PECoRe in Inseq

To facilitate the use of PECoRe in future research, a flexible implementation of the framework was incorporated into the Inseq toolkit presented in Chapter 3. Since v0.6.0, Inseq offers the CLI command attribute-context, which supports all contrastive step functions and attribution methods in the library and is compatible with any decoder-only or encoder-decoder generative language model. Figure 4.5 provides an example employing the Inseq API to attribute a language model answer to input context paragraphs, similarly to the retrieval-augmented generation task we discuss in Chapter 5.⁸ In the example, the StableLM 2 Zephyr 1.6B language model⁹ is prompted with contexts retrieved from Wikipedia to provide a long-form answer to a query about population in the Hawaiian islands. When referring to “the information provided” in ⓵, PECoRe identifies the indices of the two documents containing relevant information as salient. The name of Ni’ihau, a small island with barely any population, is also found important when the model produces an additional remark on its population in ⓶. However, we observe that the answer in the context is not identified as salient by PECoRe during generation, suggesting that the model might be relying on memorization. We test the hypothesis by prompting the model in a closed-book setting without context paragraphs, finding that the model can indeed respond correctly without context. Moreover, as expected, the island of Ni’ihau is never mentioned in the contextless response. Additional examples of PECoRe usage for other generation tasks are provided in Section A.2.3.

Figure 4.5: Example of context attribution for open-book QA using the Inseq-powered PECoRe demo. Context-sensitive tokens and contextual cues are highlighted.

4.7 Conclusion

We introduced PECoRe, a novel interpretability framework for detecting and attributing context usage in language models’ generations. PECoRe extends the standard plausibility evaluation procedure adopted in interpretability research by proposing a two-step procedure to identify context-sensitive generated tokens and match them to contextual cues contributing to their prediction. We applied PECoRe to context-aware MT, finding that context-sensitive tokens and their disambiguating rationales can be detected consistently and with reasonable accuracy across several datasets, models and discourse phenomena. Moreover, an end-to-end application of our framework without human annotations revealed incorrect context usage, leading to problematic MT model outputs.

While our evaluation is mainly focused on the machine translation domain, PECoRe can easily be applied to other context-dependent language generation tasks such as question answering and summarization thanks to its generality and its integration in the Inseq framework, as also demonstrated in the previous section. Future applications of our methodology could investigate the usage of in-context demonstrations and chain-of-thought reasoning in large language models (Brown et al., 2020; Wei et al., 2022), and explore PECoRe usage for different model architectures and input modalities. In the next chapter, we extend PECoRe for attributing context usage in retrieval-augmented generation tasks, where the model is expected to rely on external knowledge sources to produce answers to user queries.

Samira Abnar and Willem Zuidema. 2020. Quantifying attention flow in transformers. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th annual meeting of the association for computational linguistics, pages 4190–4197, Online. Association for Computational Linguistics.

Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, and Isabelle Augenstein. 2020. A diagnostic study of explainability techniques for text classification. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 3256–3274, Online. Association for Computational Linguistics.

Giuseppe Attanasio, Eliana Pastor, Chiara Di Bonaventura, and Debora Nozza. 2023. Ferret: A framework for benchmarking explainers on transformers. In Danilo Croce and Luca Soldaini, editors, Proceedings of the 17th conference of the european chapter of the association for computational linguistics: System demonstrations, pages 256–266, Dubrovnik, Croatia. Association for Computational Linguistics.

Jasmijn Bastings, Sebastian Ebert, Polina Zablotskaia, Anders Sandholm, and Katja Filippova. 2022. “Will you find these shortcuts?” A protocol for evaluating the faithfulness of input salience methods for text classification. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 conference on empirical methods in natural language processing, pages 976–991, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Jasmijn Bastings and Katja Filippova. 2020. The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? In Afra Alishahi, Yonatan Belinkov, Grzegorz Chrupała, Dieuwke Hupkes, Yuval Pinter, and Hassan Sajjad, editors, Proceedings of the third BlackboxNLP workshop on analyzing and interpreting neural networks for NLP, pages 149–155, Online. Association for Computational Linguistics.

Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. 2018. Evaluating discourse phenomena in neural machine translation. In Marilyn Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long papers), pages 1304–1313, New Orleans, Louisiana. Association for Computational Linguistics.

Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. 2023. Eliciting latent predictions from transformers with the tuned lens. ArXiv, abs/2303.08112.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, et al. 2020. Language models are few-shot learners. In Proceedings of the 34th international conference on neural information processing systems, Red Hook, NY, USA. Curran Associates Inc.

Mauro Cettolo, Marcello Federico, Luisa Bentivogli, Jan Niehues, Sebastian Stüker, Katsuhito Sudoh, Koichiro Yoshino, and Christian Federmann. 2017. Overview of the IWSLT 2017 evaluation campaign. In Sakriani Sakti and Masao Utiyama, editors, Proceedings of the 14th international conference on spoken language translation, pages 2–14, Tokyo, Japan. International Workshop on Spoken Language Translation.

David Dale, Elena Voita, Loic Barrault, and Marta R. Costa-jussà. 2023a. Detecting and mitigating hallucinations in machine translation: Model internal workings alone do well, sentence similarity even better. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers), pages 36–50, Toronto, Canada. Association for Computational Linguistics.

David Dale, Elena Voita, Janice Lam, Prangthip Hansanti, Christophe Ropers, Elahe Kalbassi, Cynthia Gao, Loic Barrault, and Marta Costa-jussà. 2023b. HalOmi: A manually annotated benchmark for multilingual hallucination and omission detection in machine translation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 conference on empirical methods in natural language processing, pages 638–653, Singapore. Association for Computational Linguistics.

Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2020. ERASER: A benchmark to evaluate rationalized NLP models. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th annual meeting of the association for computational linguistics, pages 4443–4458, Online. Association for Computational Linguistics.

Zi-Yi Dou and Graham Neubig. 2021. Word alignment by fine-tuning embeddings on parallel corpora. In Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty, editors, Proceedings of the 16th conference of the european chapter of the association for computational linguistics: Main volume, pages 2112–2128, Online. Association for Computational Linguistics.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th annual meeting of the association for computational linguistics, pages 5055–5070, Online. Association for Computational Linguistics.

Lukas Edman, Gabriele Sarti, Antonio Toral, Gertjan van Noord, and Arianna Bisazza. 2024. Are character-level translations worth the wait? Comparing ByT5 and mT5 for machine translation. Transactions of the Association for Computational Linguistics, 12:392–410.

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 878–891, Dublin, Ireland. Association for Computational Linguistics.

Patrick Fernandes, António Farinhas, Ricardo Rei, José G. C. de Souza, Perez Ogayo, Graham Neubig, and Andre Martins. 2022. Quality-aware decoding for neural machine translation. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors, Proceedings of the 2022 conference of the north american chapter of the association for computational linguistics: Human language technologies, pages 1396–1412, Seattle, United States. Association for Computational Linguistics.

Patrick Fernandes, Kayo Yin, Emmy Liu, André Martins, and Graham Neubig. 2023. When does translation require context? A data-driven, multilingual exploration. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers), pages 606–626, Toronto, Canada. Association for Computational Linguistics.

Patrick Fernandes, Kayo Yin, Graham Neubig, and André F. T. Martins. 2021. Measuring and increasing context usage in context-aware machine translation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers), pages 6467–6478, Online. Association for Computational Linguistics.

Javier Ferrando, Gerard I. Gállego, Belen Alastruey, Carlos Escolano, and Marta R. Costa-jussà. 2022a. Towards opening the black box of neural machine translation: Source and target interpretations of the transformer. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 conference on empirical methods in natural language processing, pages 8756–8769, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Javier Ferrando, Gerard I. Gállego, and Marta R. Costa-jussà. 2022b. Measuring the mixing of contextual information in the transformer. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 conference on empirical methods in natural language processing, pages 8698–8714, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Javier Ferrando, Gerard I. Gállego, Ioannis Tsiamas, and Marta R. Costa-jussà. 2023. Explaining how transformers use context to build predictions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers), pages 5486–5513, Toronto, Canada. Association for Computational Linguistics.

Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 conference on empirical methods in natural language processing, pages 30–45, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.

Tanya Goyal and Greg Durrett. 2021. Annotating and modeling fine-grained factuality in summarization. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: Human language technologies, pages 1449–1462, Online. Association for Computational Linguistics.

Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. 2021. Surface form competition: Why the highest probability answer isn’t always right. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 conference on empirical methods in natural language processing, pages 7038–7051, Online; Punta Cana, Dominican Republic. Association for Computational Linguistics.

Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th annual meeting of the association for computational linguistics, pages 4198–4205, Online. Association for Computational Linguistics.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), pages 3543–3556, Minneapolis, Minnesota. Association for Computational Linguistics.

Linghao Jin, Jacqueline He, Jonathan May, and Xuezhe Ma. 2023. Challenges in context-aware neural machine translation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 conference on empirical methods in natural language processing, pages 15246–15263, Singapore. Association for Computational Linguistics.

Yunsu Kim, Duc Thanh Tran, and Hermann Ney. 2019. When and why is document-level context useful in neural machine translation? In Andrei Popescu-Belis, Sharid Loáiciga, Christian Hardmeier, and Deyi Xiong, editors, Proceedings of the fourth workshop on discourse in machine translation (DiscoMT 2019), pages 24–34, Hong Kong, China. Association for Computational Linguistics.

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2020. Attention is not only a weight: Analyzing transformers with vector norms. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 7057–7075, Online. Association for Computational Linguistics.

Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 9332–9346, Online. Association for Computational Linguistics.

Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86.

Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors, Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

António Lopes, M. Amin Farajian, Rachel Bawden, Michael Zhang, and André F. T. Martins. 2020. Document-level neural MT: A systematic comparison. In André Martins, Helena Moniz, Sara Fumega, Bruno Martins, Fernando Batista, Luisa Coheur, Carla Parra, Isabel Trancoso, Marco Turchi, Arianna Bisazza, Joss Moorkens, Ana Guerberof, Mary Nurminen, Lena Marg, and Mikel L. Forcada, editors, Proceedings of the 22nd annual conference of the european association for machine translation, pages 225–234, Lisboa, Portugal. European Association for Machine Translation.

Andreas Madsen, Siva Reddy, and Sarath Chandar. 2022. Post-hoc interpretability for neural NLP: A survey. ACM Comput. Surv., 55(8).

Suvodeep Majumder, Stanislas Lauly, Maria Nadejde, Marcello Federico, and Georgiana Dinu. 2022. A baseline revisited: Pushing the limits of multi-segment models for context-aware translation. ArXiv, abs/2210.10906.

Sameen Maruf, Fahimeh Saleh, and Gholamreza Haffari. 2021. A survey on document-level neural machine translation: Methods and evaluation. ACM Comput. Surv., 54(2).

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th annual meeting of the association for computational linguistics, pages 1906–1919, Online. Association for Computational Linguistics.

R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th annual meeting of the association for computational linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.

Hosein Mohebbi, Willem Zuidema, Grzegorz Chrupała, and Afra Alishahi. 2023. Quantifying context mixing in transformers. In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th conference of the european chapter of the association for computational linguistics, pages 3378–3400, Dubrovnik, Croatia. Association for Computational Linguistics.

Gabriele Sarti, Grzegorz Chrupała, Malvina Nissim, and Arianna Bisazza. 2024a. Quantifying the plausibility of context reliance in neural machine translation. In The twelfth international conference on learning representations (ICLR 2024), Vienna, Austria. OpenReview.

Gabriele Sarti, Nils Feldhus, Jirui Qi, Malvina Nissim, and Arianna Bisazza. 2024b. Democratizing advanced attribution analyses of generative language models with the Inseq toolkit. In xAI-2024 late-breaking work, demos and doctoral consortium joint proceedings, pages 289–296, Valletta, Malta. CEUR.org.

Gabriele Sarti, Nils Feldhus, Ludwig Sickert, Oskar van der Wal, Malvina Nissim, and Arianna Bisazza. 2023. Inseq: An interpretability toolkit for sequence generation models. In Danushka Bollegala, Ruihong Huang, and Alan Ritter, editors, Proceedings of the 61st annual meeting of the association for computational linguistics (volume 3: System demonstrations), pages 421–435, Toronto, Canada. Association for Computational Linguistics.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th international conference on machine learning, Honolulu, Hawaii, USA. JMLR.org.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Yoshua Bengio and Yann LeCun, editors, 2nd international conference on learning representations (ICLR), Banff, AB, Canada.

Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer. 2019. Evaluating gender bias in machine translation. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th annual meeting of the association for computational linguistics, pages 1679–1684, Florence, Italy. Association for Computational Linguistics.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th international conference on machine learning (ICML), volume 70, pages 3319–3328, Sydney, Australia. Journal of Machine Learning Research (JMLR).

Joel Tang, Marina Fomicheva, and Lucia Specia. 2022. Reducing hallucinations in neural machine translation with feature attribution. ArXiv.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2021. Multilingual translation from denoising pre-training. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Findings of the association for computational linguistics: ACL-IJCNLP 2021, pages 3450–3466, Online. Association for Computational Linguistics.

Jörg Tiedemann and Yves Scherrer. 2017. Neural machine translation with extended context. In Bonnie Webber, Andrei Popescu-Belis, and Jörg Tiedemann, editors, Proceedings of the third workshop on discourse in machine translation, pages 82–92, Copenhagen, Denmark. Association for Computational Linguistics.

Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT – building open translation services for the world. In André Martins, Helena Moniz, Sara Fumega, Bruno Martins, Fernando Batista, Luisa Coheur, Carla Parra, Isabel Trancoso, Marco Turchi, Arianna Bisazza, Joss Moorkens, Ana Guerberof, Mary Nurminen, Lena Marg, and Mikel L. Forcada, editors, Proceedings of the 22nd annual conference of the european association for machine translation, pages 479–480, Lisboa, Portugal. European Association for Machine Translation.

Keyon Vafa, Yuntian Deng, David Blei, and Alexander Rush. 2021. Rationales for sequential predictions. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 conference on empirical methods in natural language processing, pages 10314–10332, Online; Punta Cana, Dominican Republic. Association for Computational Linguistics.

Jannis Vamvas and Rico Sennrich. 2021a. Contrastive conditioning for assessing disambiguation in MT: A case study of distilled bias. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 conference on empirical methods in natural language processing, pages 10246–10265, Online; Punta Cana, Dominican Republic. Association for Computational Linguistics.

Jannis Vamvas and Rico Sennrich. 2021b. On the limits of minimal pairs in contrastive evaluation. In Jasmijn Bastings, Yonatan Belinkov, Emmanuel Dupoux, Mario Giulianelli, Dieuwke Hupkes, Yuval Pinter, and Hassan Sajjad, editors, Proceedings of the fourth BlackboxNLP workshop on analyzing and interpreting neural networks for NLP, pages 58–68, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Jannis Vamvas and Rico Sennrich. 2022. As little as possible, as much as necessary: Detecting over- and undertranslations with contrastive conditioning. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th annual meeting of the association for computational linguistics (volume 2: Short papers), pages 490–500, Dublin, Ireland. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in neural information processing systems, volume 30. Curran Associates, Inc.

Elena Voita, Rico Sennrich, and Ivan Titov. 2019. Context-aware monolingual repair for neural machine translation. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 877–886, Hong Kong, China. Association for Computational Linguistics.

Elena Voita, Rico Sennrich, and Ivan Titov. 2021. Analyzing the source and target contributions to predictions in neural machine translation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers), pages 1126–1140, Online. Association for Computational Linguistics.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 1264–1274, Melbourne, Australia. Association for Computational Linguistics.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in neural information processing systems, volume 35, pages 24824–24837. Curran Associates, Inc.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, et al. 2020. Transformers: State-of-the-art natural language processing. In Qun Liu and David Schlangen, editors, Proceedings of the 2020 conference on empirical methods in natural language processing: System demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Kayo Yin, Patrick Fernandes, Danish Pruthi, Aditi Chaudhary, André F. T. Martins, and Graham Neubig. 2021. Do context-aware translation models pay the right attention? In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers), pages 788–801, Online. Association for Computational Linguistics.

Kayo Yin and Graham Neubig. 2022. Interpreting language models with contrastive explanations. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 conference on empirical methods in natural language processing, pages 184–198, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, 13th european conference on computer vision (ECCV), pages 818–833, Switzerland. Springer International Publishing.

Code: https://github.com/gsarti/pecore ↩︎
We avoid using the term faithfulness due to its ambiguous usage in interpretability research.↩︎
In the contextual MT example of Figure 4.2, $C$ includes source context $C_x$ and target context $C_y$.↩︎
We use $m^i$ to denote the result of $m(P_\text{ctx}^{i}, P_\text{no-ctx}^{i})$. Several metrics are presented in Section 4.4.2.↩︎
SCAT+ is available on the Hugging Face Hub: inseq/scat.↩︎
Our modified version of DiscEval-MT is available on the Hugging Face Hub: inseq/disc_eval_mt.↩︎
Provided that $P_\text{no-ctx}(\hat{y}_i)$ does not depend on context, the $\nabla_\text{KL}$ gradient is functionally equivalent to the gradient for the cross-entropy function $H(P_\text{ctx}, P_\text{no-ctx}) = - \sum_{\hat{y}_i \in \mathcal{V}} P_\text{ctx}(\hat{y}_i) \log P_\text{no-ctx}(\hat{y}_i)$.↩︎
The interface is available at: https://huggingface.co/spaces/gsarti/pecore.↩︎
stabilityai/stablelm-2-zephyr-1_6b↩︎