3 Attributing Language Model Generations with the Inseq Toolkit

Chapter Summary

This first experimental chapter presents the Inseq interpretability toolkit, which is employed for multiple analyses throughout this thesis. Inseq is a Python library that democratizes access to interpretability analyses of language models by enabling intuitive extraction of models’ internal information and saliency scores throughout the generation process. After introducing Inseq’s design and features, we demonstrate its capabilities through applications that highlight gender biases in machine translation models and factual knowledge localization within the GPT-2 language model. Thanks to its extensible interface, which supports cutting-edge techniques, Inseq can drive future advances in explainable natural language generation, centralizing best practices and enabling reproducible model evaluations.

This chapter is adapted from the papers Inseq: An Interpretability Toolkit for Sequence Generation Models (Sarti et al., 2023) and Democratizing Advanced Attribution Analyses of Generative Language Models with the Inseq Toolkit (Sarti et al., 2024).

As in manufacture so in science, retooling is an extravagance to be reserved for the occasion that demands it. The significance of crises is the indication they provide that an occasion for retooling has arrived.

– Thomas S. Kuhn, The Structure of Scientific Revolutions (1970)

Recent years saw an increase in studies and tools aimed at improving our behavioral or mechanistic understanding of neural language models (Belinkov and Glass, 2019).

Many studies applied such techniques to modern deep learning architectures, including transformers (Vaswani et al., 2017), leveraging gradients (Baehrens et al., 2010; Sundararajan et al., 2017), attention patterns (Xu et al., 2015; Clark et al., 2019) and input perturbations (Zeiler and Fergus, 2014; Feng et al., 2018) to quantify input importance, often leading to controversial outcomes in terms of faithfulness, plausibility and overall usefulness of such explanations (Adebayo et al., 2018; Jain and Wallace, 2019; Jacovi and Goldberg, 2020; Zafar et al., 2021).

However, input attribution techniques have mainly been applied to classification settings (Atanasova et al., 2020; Wallace et al., 2020; Madsen et al., 2022; Chrysostomou and Aletras, 2022), with relatively little interest in the more convoluted mechanisms underlying generation. Classification attribution is a single-step process resulting in one importance score per input token, often allowing for intuitive interpretations in relation to the predicted class. Sequential attribution¹ instead involves a computationally expensive multi-step iteration producing a matrix \(A_{ij}\) representing the importance of every input \(i\) in the prediction of every generation outcome \(j\) (Figure 3.1).

Figure 3.1: Example of Inseq usage with a 🤗 `transformers` causal language model. Given a prompt, attribution scores and next-step probabilities are extracted from the model at every generation step, with a final visualization aggregating values at the token level. Output attribution scores indicate that the model relies on the keyword “innovate” to initiate the idiomatic expression “think outside the box” with relatively low confidence (\(p = 0.5\)). However, importance shifts to previous tokens in the idiom and confidence progressively grows throughout the generation.

Moreover, since previous generation steps causally influence following predictions, they must be dynamically incorporated into the set of attributed inputs throughout the process. Lastly, while classification typically involves a limited set of classes and simple output selection (e.g., argmax after softmax), generation often operates with large vocabularies and non-trivial decoding strategies (Eikema and Aziz, 2020). These differences limited the use of input attribution methods for generation settings, with relatively few works improving attribution efficiency (Vafa et al., 2021; Ferrando et al., 2022) and the informativeness of explanations (Yin and Neubig, 2022).
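As a minimal sketch of how sequential attribution differs from the single-step classification case, the following toy example builds the score matrix one generation step at a time, appending each previously generated token to the attributed context. The scorer `toy_score` and the occlusion-based importance are hypothetical stand-ins chosen for illustration; they are neither Inseq functions nor the methods used in this chapter.

```python
from typing import Callable, List, Sequence


def toy_score(context: Sequence[str], target: str) -> float:
    """Hypothetical stand-in for a model score: character overlap with the target."""
    return float(sum(len(set(token) & set(target)) for token in context))


def sequential_attribution(
    prompt: Sequence[str],
    generated: Sequence[str],
    score_fn: Callable[[Sequence[str], str], float] = toy_score,
    mask: str = "_",
) -> List[List[float]]:
    """Return one row per generation step j, with one score per context token i."""
    rows = []
    for j, target in enumerate(generated):
        # The attributed context grows: prompt plus all previously generated tokens.
        context = list(prompt) + list(generated[:j])
        base = score_fn(context, target)
        row = []
        for i in range(len(context)):
            # Occlusion: replace token i with a mask and measure the score drop.
            occluded = context[:i] + [mask] + context[i + 1:]
            row.append(base - score_fn(occluded, target))
        rows.append(row)
    return rows


scores = sequential_attribution(["to", "innovate"], ["think", "outside"])
for step, row in enumerate(scores):
    print(step, row)
```

Each row is one entry longer than the previous one, which illustrates why the cost of attributing a generation grows with its length.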

Having established a theoretical background on input attribution methods in Section 2.2, we introduce Inseq, a Python library that democratizes access to interpretability analyses of generative language models. Inseq centralizes access to a broad set of input attribution methods, sourced in part from the Captum (Kokhlikyan et al., 2020) framework, enabling a fair comparison of different techniques for all sequence-to-sequence and decoder-only models in the popular 🤗 transformers library (Wolf et al., 2020). Thanks to its intuitive interface, users can easily integrate interpretability analyses into sequence generation experiments with just 3 lines of code (Figure 3.2). Nevertheless, Inseq is also highly flexible, including cutting-edge attribution methods with built-in post-processing features (Section 3.2.2), supporting customizable attribution targets and enabling constrained decoding of arbitrary sequences (Section 3.2.3).

Figure 3.2: Computing and visualizing attributions for Flan-T5 (Chung et al., 2024).
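Since Figure 3.2 is referenced as a three-line usage example, a sketch consistent with its caption is given here. The checkpoint name `google/flan-t5-base`, the method identifier `integrated_gradients`, and the prompt are assumptions made for illustration and may differ from the figure. The calls are wrapped in a function with a lazy import so that no model weights are downloaded until the function is invoked.

```python
def attribute_flan_t5(prompt: str = "Does 3 + 3 equal 6?"):
    """Attribute a Flan-T5 generation with Inseq (requires `pip install inseq`)."""
    import inseq  # imported lazily: loading the model downloads its weights

    # Assumed checkpoint and attribution method identifiers.
    model = inseq.load_model("google/flan-t5-base", "integrated_gradients")
    # attribute_target=True also attributes the generated target prefix.
    out = model.attribute(prompt, attribute_target=True)
    # Visualize source-side and target-side attribution matrices.
    out.show()
    return out
```

Calling `attribute_flan_t5()` in a notebook or console should display the attribution matrices for the generated answer.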

In terms of usability, Inseq greatly simplifies access to local and global explanations, offering built-in support for a command-line interface (CLI), optimized batching that enables dataset-wide attribution, and various methods for visualizing, serializing, and reloading attribution outcomes and generated sequences (Section 3.2.4). Ultimately, Inseq aims to make sequence models first-class citizens in interpretability research and drive future advances in interpretability for generative applications.

3.2 Design

Inseq combines sequence models sourced from 🤗 transformers (Wolf et al., 2020) and attribution methods mainly sourced from Captum (Kokhlikyan et al., 2020). While only text-based tasks are currently supported, the library’s modular design would enable the inclusion of other modeling frameworks, e.g. fairseq (Ott et al., 2019), and modalities (e.g. speech) without requiring substantial redesign. Optional dependencies include 🤗 datasets (Lhoest et al., 2021) and Rich.² Figure 3.3 presents the Inseq hierarchy of models and attribution methods. The model-method connection enables out-of-the-box attribution using the selected method. Framework-specific and architecture-specific classes enable the extension of Inseq to new modeling architectures and frameworks.

Figure 3.3: Inseq models and attribution methods. Concrete classes combine abstract framework and architecture attribution model classes, and are derived from abstract attribution methods’ categories.

3.2.1 Guiding Principles

Research and Generation-oriented: Inseq should support interpretability analyses of a broad set of sequence generation models without focusing narrowly on specific architectures or tasks. Moreover, the inclusion of new, cutting-edge methods should be prioritized to enable fair comparisons with well-established ones.
Scalable: The library should provide an optimized interface to a wide range of use cases, models and setups, ranging from interactive attributions of individual examples using toy models to compiling statistics of large language models’ predictions for entire datasets.
Beginner-friendly: Inseq should provide built-in access to popular frameworks for sequence generation modeling and be fully usable by non-experts at a high level of abstraction, providing sensible defaults for supported attribution methods.
Extensible: Inseq should support a high degree of customization for experienced users, with out-of-the-box support for user-defined solutions to enable future investigations into models’ behaviors.

3.2.2 Input Attribution and Post-processing

|  | Method | Source | \(f(l)\) |
| --- | --- | --- | --- |
| G | (Input ×) Gradient | Simonyan et al. (2014) | ✅ |
| G | DeepLIFT | Shrikumar et al. (2017) | ✅ |
| G | GradientSHAP | Lundberg and Lee (2017) | ❌ |
| G | Integrated Gradients | Sundararajan et al. (2017) | ✅ |
| G | Discretized IG | Sanyal and Ren (2021) | ❌ |
| G | Sequential IG | Enguehard (2023) | ❌ |
| I | Attention Weights | Bahdanau et al. (2015) | ✅ |
| P | Occlusion (Blank-out) | Zeiler and Fergus (2014) | ❌ |
| P | LIME | Ribeiro et al. (2016) | ❌ |
| P | Value Zeroing | Mohebbi et al. (2023) | ✅ |
| P | ReAGent | Zhao and Shan (2024) | ❌ |
| S | (Log) Probability | - |  |
| S | Softmax Entropy | - |  |
| S | Target Cross-entropy | - |  |
| S | Perplexity | - |  |
| S | KL Divergence | - |  |
| S | Contrastive Logits/Prob. \(\Delta\) | Yin and Neubig (2022) |  |
| S | \(\mu\) MC Dropout Prob. | Gal and Ghahramani (2016) |  |
| S | PCXMI | Fernandes et al. (2023) |  |
| S | In-context PVI | Lu et al. (2023) |  |

Table 3.1: Overview of gradient-based (G), internals-based (I) and perturbation-based (P) attribution methods and built-in step functions (S) available in Inseq. \(f(l)\) marks methods allowing for attribution of arbitrary intermediate layers. Bolded methods were introduced with Inseq v0.6.

At its core, Inseq provides a simple interface for applying input attribution techniques to sequence generation tasks. We categorize methods into three groups, gradient-based, internals-based and perturbation-based, depending on their underlying approach to importance quantification.³ Table 3.1 presents the complete list of supported methods. Aside from popular model-agnostic methods, Inseq notably provides built-in support for attention weight attribution and a range of cutting-edge methods not supported in any other toolkit, such as Discretized Integrated Gradients (Sanyal and Ren, 2021), Sequential Integrated Gradients (Enguehard, 2023), Value Zeroing (Mohebbi et al., 2023), and ReAGent (Zhao and Shan, 2024). Moreover, multiple methods support the importance attribution of custom intermediate model layers, simplifying studies on representational structures and information mixing in sequential models, as seen in our case study of Section 3.3.2.

**Source and target-side attribution** When using encoder-decoder architectures, users can set the attribute_target parameter to include or exclude the generated prefix in the attributed inputs. In most cases, this should be desirable to account for recently generated tokens when explaining model behaviors, such as when to terminate the generation (e.g. relying on the presence of _yes in the target prefix to predict </s> in Figure 3.2, right matrix). However, attributing the source side separately could be helpful, for example, to derive word alignments from importance scores.

**Post-processing of attribution outputs** Aggregation is a fundamental but often overlooked step in attribution-based analyses, since most methods produce neuron-level or subword-level importance scores that would otherwise be difficult to interpret. Inseq includes several Aggregator classes to perform attribution aggregation across various dimensions. For example, the input word Explanation could be tokenized into two subword tokens, Expl and anation, and each token would receive \(N\) importance scores, where \(N\) is the model embedding dimension. In this case, aggregators could first merge subword-level scores into word-level scores, and then merge granular embedding-level scores to obtain a single token-level score that is easier to interpret. Moreover, aggregation could prove especially helpful for long-form generation tasks such as summarization, where word-level importance scores could be aggregated to obtain a measure of sentence-level relevance. Notably, Inseq allows chaining multiple aggregators, as in the example above, using the AggregatorPipeline class, and provides a PairAggregator to aggregate different attribution maps, simplifying contrastive analyses such as the one in Section 3.3.1.⁴
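The two-stage aggregation described above can be sketched in plain Python as follows. This is an illustration only and does not use Inseq's Aggregator classes: the function names, the choice of summing subword vectors, and the use of an L2 norm over the embedding dimension are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence


def merge_subwords(
    vectors: Sequence[Sequence[float]], word_ids: Sequence[int]
) -> List[List[float]]:
    """Sum the score vectors of subword tokens that belong to the same word."""
    merged: Dict[int, List[float]] = {}
    for vec, wid in zip(vectors, word_ids):
        if wid in merged:
            merged[wid] = [a + b for a, b in zip(merged[wid], vec)]
        else:
            merged[wid] = list(vec)
    return [merged[wid] for wid in sorted(merged)]


def collapse_embedding_dim(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Reduce the N per-dimension scores of each word to one scalar (L2 norm)."""
    return [math.sqrt(sum(v * v for v in vec)) for vec in vectors]


# Toy scores with embedding dimension N = 2 for the tokens Expl, anation, matters.
subword_scores = [[3.0, 0.0], [0.0, 4.0], [1.0, 0.0]]
word_ids = [0, 0, 1]  # Expl and anation both belong to word 0

word_scores = collapse_embedding_dim(merge_subwords(subword_scores, word_ids))
print(word_scores)
```

Chaining the two functions mirrors the idea of a pipeline of aggregators: each stage removes one dimension of granularity from the attribution tensor.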

3.2.3 Customizing generation and attribution

During attribution, Inseq first generates target tokens using 🤗 transformers and then attributes them step-by-step. If a custom target string is specified alongside model inputs, the generation step is instead skipped, and the provided text is attributed by constraining the decoding of its tokens.⁵ Constrained attribution can be used, among other things, for contrastive comparisons of minimal pairs and to obtain model justifications for desired outputs.

**Custom step functions** At every attribution step, Inseq can extract scores of interest (e.g. probabilities, entropy) that can be useful, among other things, to quantify model uncertainty (e.g. how likely the generated _yes token was given the context in Figure 3.2). We collectively refer to functions computing these scores as step functions. Inseq provides access to multiple built-in step functions (Table 3.1, S), enabling the computation of these scores, and allows users to create and register new custom ones. Step scores are computed together with the attribution, returned as separate sequences in the output, and visualized alongside importance scores (e.g. the \(p(y_t|y_{<t})\) row in Figure 3.1).
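To illustrate what a step function computes, the sketch below derives two of the scores listed in Table 3.1 (target probability and softmax entropy) from per-step logits. The registry `STEP_FUNCTIONS` and the function signatures are hypothetical and only mimic the idea of registering named step functions; they are not the Inseq API.

```python
import math
from typing import Callable, Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probability_fn(logits: Sequence[float], target_id: int) -> float:
    """Probability assigned to the generated token at this step."""
    return softmax(logits)[target_id]


def entropy_fn(logits: Sequence[float], target_id: int) -> float:
    """Entropy of the next-token distribution (target_id is unused)."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)


# Hypothetical registry mapping identifiers to step functions.
STEP_FUNCTIONS: Dict[str, Callable[[Sequence[float], int], float]] = {
    "probability": probability_fn,
    "entropy": entropy_fn,
}


def compute_step_scores(
    step_logits: Sequence[Sequence[float]],
    target_ids: Sequence[int],
    names: Sequence[str],
) -> Dict[str, List[float]]:
    """Return one score sequence per requested step function."""
    return {
        name: [STEP_FUNCTIONS[name](l, t) for l, t in zip(step_logits, target_ids)]
        for name in names
    }


step_scores = compute_step_scores(
    [[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]], [1, 0], ["probability", "entropy"]
)
print(step_scores)
```

Each requested function yields a sequence with one value per generation step, matching the way step scores are returned alongside attributions.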

**Step functions as attribution targets** For methods relying on model outputs to predict input importance (gradient and perturbation-based), input attributions are commonly obtained from the model’s output logits or class probabilities (Bastings et al., 2022). However, recent work has shown the effectiveness of using alternative targets, such as the probability difference of a contrastive output pair, to answer questions like “What inputs drive the prediction of \(y\) rather than \(\hat{y}\)?” (Yin and Neubig, 2022). For example, the gradient \(\nabla(p(\text{barking}) - p(\text{crying}))\) given the prompt *“Can you stop the dog from ___”* will highlight the role of the entity dog in selecting barking, disentangling the semantic component from grammatical correctness by providing crying as a grammatically valid alternative. Figure 3.4 provides an example of such an approach for gender bias detection in machine translation. Inseq users can leverage any built-in or custom-defined step function as an attribution target, enabling advanced use cases like contrastive comparisons.
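The contrastive gradient can be worked out explicitly for a toy softmax model with linear logits \(z = Wx\), where \(\partial p_k / \partial x = p_k (W_k - \sum_j p_j W_j)\). The sketch below computes \(\nabla_x (p_\text{target} - p_\text{foil})\) under that assumption; the weights and feature names are invented for illustration and do not come from any real model.

```python
import math
from typing import List, Sequence


def output_probs(weights: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Softmax over linear logits z_k = W_k . x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_gradient(
    weights: Sequence[Sequence[float]], x: Sequence[float], target: int, foil: int
) -> List[float]:
    """Analytic gradient of p[target] - p[foil] with respect to the inputs x."""
    probs = output_probs(weights, x)
    # Probability-weighted average of the weight rows.
    mean_w = [sum(p * row[i] for p, row in zip(probs, weights)) for i in range(len(x))]
    return [
        probs[target] * (weights[target][i] - mean_w[i])
        - probs[foil] * (weights[foil][i] - mean_w[i])
        for i in range(len(x))
    ]


# Hypothetical input features ("dog", "stop") and outputs (barking, crying, running).
W = [[2.0, 0.5], [0.0, 0.5], [1.0, 0.0]]
x = [1.0, 1.0]
grad = contrastive_gradient(W, x, target=0, foil=1)
print(grad)
```

In this toy setup both barking and crying depend equally on the second feature, so the contrastive gradient concentrates on the first one, which is the only feature that separates the two options.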

Figure 3.4: Source-to-target attributions aggregated at the token level, indicating the importance of the stereotypical noun “manager” to generate the Italian masculine pronoun “il” (original) over the feminine “la” (contrastive case).

3.2.4 Usability Features

**Batched and span-focused attributions** The library provides built-in batching capabilities, enabling users to go beyond single sentences and attribute even entire datasets in a single function call. When the attribution of a specific span of interest is needed, Inseq also allows specifying a start and end position for the attribution process. This functionality greatly accelerates the attribution process for studies on localized phenomena (e.g. pronoun coreference in MT models).

**Alignment of contrastive options** Inseq supports customizable word alignments, i.e. indices aligning tokens in the original and contrastive generated texts, to enable contrastive comparisons between texts of different lengths. Alignments can also be computed automatically using the multilingual LaBSE encoder (Feng et al., 2022) to streamline their application.

**CLI, serialization and visualization** The Inseq library offers an API to attribute single examples or entire 🤗 Datasets from the command line and save resulting outputs and visualizations to a file. Attribution outputs can be saved and loaded in JSON format, along with their respective metadata, to easily identify the provenance of the contents. Attributions can be visualized in the console or IPython notebooks and exported as HTML files.

**Quantized and distributed attribution** Supporting the attribution of large models is critical given recent scaling tendencies (Kaplan et al., 2020). All models that allow for quantization using bitsandbytes (Dettmers et al., 2022) can be loaded directly in 4-bit and 8-bit formats from 🤗 transformers, and their attributions can be computed normally using Inseq at a fraction of the original computational cost.⁶ Relatedly, Inseq is also compatible with the Petals framework (Borzunov et al., 2023), which supports gradient-based attribution across language models whose computation is distributed across multiple machines. This can alleviate the need for high-end GPUs to run LLMs, enabling the distributed computation of attribution scores.⁷

3.3 Case Studies

3.3.1 Gender Bias in Machine Translation

In the first case study, we use Inseq to investigate gender bias in MT models. Studying the social biases embedded in these models is crucial to understanding and mitigating the representational and allocative harms they may engender (Blodgett et al., 2020). Savoldi et al. (2021) note that the study of bias in MT could benefit from explainability techniques to identify spurious cues exploited by the model and the interaction of different features that can lead to intersectional bias.

**Synthetic Setup: Turkish to English** Turkish uses the gender-neutral pronoun o, which can be translated into English as he, she, or it. This makes Turkish-to-English translation a useful testbed for gender bias in MT, since models translating into English tend to choose a gendered pronoun form. Previous works have leveraged translations from gender-neutral languages to demonstrate the presence of gender bias in translation systems (Cho et al., 2019; Prates et al., 2020; Farkas and Németh, 2022). We repeat this simple setup using a Turkish-to-English MarianMT model (Tiedemann, 2020) and compute different metrics to quantify gender bias using Inseq.

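|  | Base: \(x_\text{pron}\) | Base: \(x_\text{occ}\) | ♀ \(\rightarrow\) ♂: \(x_\text{pron}\) | ♀ \(\rightarrow\) ♂: \(x_\text{occ}\) |
| --- | --- | --- | --- | --- |
| \(p(y_\text{pron})\) | 0.01 |  | -0.44* |  |
| \(\nabla\) | -0.16 | 0.25* | 0.23* | -0.00 |
| IG | -0.08 | 0.09 | 0.11 | 0.17 |
| I×G | -0.11 | 0.22* | 0.22* | -0.01 |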

Table 3.2: Gender Bias in Turkish-to-English MT: Kendall’s \(\tau\) correlation of MT model metrics with U.S. labor statistics. * = Significant correlation (\(p<.05\)).

We select 49 Turkish occupation terms verified by a native speaker (see Section A.1.1) and use them to infill the template sentence O bir ____ (He/She is a(n) ____). For each translation, we compute attribution scores for source Turkish pronoun (\(x_\text{pron}\)) and occupation (\(x_\text{occ}\)) tokens⁸ when generating the target English pronoun (\(y_\text{pron}\)) using Integrated Gradients (IG), Gradients (\(\nabla\)), and Input \(\times\) Gradient (I\(\times\)G).⁹ We also collect target pronoun probabilities (\(p(y_\text{pron})\)), rank the 49 occupation terms using these metrics, and finally compute Kendall’s \(\tau\) correlation with the percentage of women working in the respective fields, using U.S. labor statistics as in previous works (e.g., Caliskan et al., 2017; Rudinger et al., 2018). Table 3.2 presents our results.

In the base case, we correlate the different metrics with how much the gender distribution deviates from an equal distribution (\(50-50\%\)) for each occupation (i.e., the gender bias irrespective of the direction). We observe a strong gender bias, with she being chosen only for 5 out of 49 translations and gender-neutral variants never being produced by the MT model. We find a low correlation between pronoun probability and the degree of gender stereotype associated with the occupation. Moreover, we note a weaker correlation for IG compared to the other two methods. For those, attribution scores for \(x_\text{occ}\) show significant correlations with labor statistics, supporting the intuition that the MT model will accord higher importance to source occupation terms associated with gender-stereotypical occupations when predicting the gendered target pronoun.
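For reference, the rank correlation used in this analysis can be sketched as follows. The implementation computes the tie-aware \(\tau_b\) variant, which is an assumption since the text does not state which variant was used, and the numbers are invented placeholders rather than the labor statistics or attribution scores from the study.

```python
import math
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b: (concordant - discordant) pairs, corrected for ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = discordant = tied_x = tied_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denom


# Placeholder values for five occupations (NOT the data used in the study).
pct_women = [10.0, 35.0, 50.0, 72.0, 85.0]
occ_importance = [0.30, 0.12, 0.05, 0.20, 0.18]

# Base case: correlate with the deviation from an equal 50-50 distribution.
deviation = [abs(p - 50.0) for p in pct_women]
print(kendall_tau_b(deviation, occ_importance))
```

The gender-swap case would instead correlate the contrastive scores with the raw percentage of women, so that the sign of the correlation reflects the direction of the bias.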

In the gender-swap case (♀️ \(\rightarrow\) ♂️), we use the PairAggregator class to contrastively compare attribution scores and probabilities when translating the pronoun as She or He.¹⁰ We correlate the resulting scores with the percentage of women working in the respective occupation and find strong correlations for \(p(y_\text{pron})\), which supports the validity of contrastive approaches in uncovering gender bias.

**Qualitative Example: English to Dutch** We also qualitatively analyze biased MT outputs, showing how attributions can help develop hypotheses about models’ behavior. Table 3.3 (top) shows the I \(\times\) G attributions for English-to-Dutch translation using M2M-100 (Fan et al., 2021).

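| Source | De | leraar | verliest | zijn | baan |
| --- | --- | --- | --- | --- | --- |
| The | 0.10 | 0.08 | 0.04 | 0.03 | 0.02 |
| teacher | 0.11 | 0.20 | 0.06 | 0.03 | 0.05 |
| loses | 0.11 | 0.09 | 0.25 | 0.07 | 0.07 |
| her | 0.15 | 0.09 | 0.10 | 0.21 | 0.07 |
| job | 0.10 | 0.08 | 0.08 | 0.10 | 0.24 |
| Target | De | leraar | verliest | zijn | baan |
| De |  | 0.23 | 0.05 | 0.06 | 0.04 |
| leraar |  |  | 0.17 | 0.13 | 0.03 |
| verliest |  |  |  | 0.18 | 0.08 |
| zijn |  |  |  |  | 0.26 |
| \(p(y_t)\) | 0.69 | 0.28 | 0.35 | 0.65 | 0.29 |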

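| Source | De | ♂ → Ø | verliest | haar | baan |
| --- | --- | --- | --- | --- | --- |
| The | 0.00 | -0.02 | 0.00 | 0.00 | 0.00 |
| teacher | 0.00 | -0.05 | -0.01 | -0.01 | -0.01 |
| loses | 0.00 | -0.02 | -0.01 | -0.02 | -0.01 |
| her | 0.00 | -0.01 | -0.01 | -0.10 | 0.01 |
| job | 0.00 | -0.02 | -0.01 | -0.02 | -0.02 |
| Target | De | ♂ → Ø | verliest | haar | baan |
| De |  | -0.07 | -0.01 | 0.01 | -0.01 |
| ♂ → Ø |  |  | 0.09 | 0.18 | 0.02 |
| verliest |  |  |  | -0.03 | 0.00 |
| haar |  |  |  |  | 0.00 |
| \(\Delta p(y_t)\) | 0.00 | -0.23 | 0.13 | 0.20 | 0.00 |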

Table 3.3: Top: Attribution of pronoun gender mistranslation using M2M-100. Bottom: Target attribution difference when swapping the target noun gender (♂️ \(\to\) Ø) from leraar (male) to leerkracht (gender-neutral).

The model mistranslates the pronoun her into the masculine form zijn (his). We find that the wrongly translated pronoun is predicted with high probability, but little importance is assigned to the source occupation term teacher. Instead, relatively high importance is assigned to the preceding word and to leraar (male teacher). This suggests a strong prior bias for masculine variants, as shown by the pronoun zijn and the noun leraar, which may be a possible cause for this mistranslation. When considering the contrastive example obtained by swapping leraar with its gender-neutral variant leerkracht (Table 3.3, bottom), we find increased importance of the target occupation in determining the correctly-gendered target pronoun haar (her). Our results highlight the tendency of MT models to attend to inputs sequentially rather than relying on context, hinting at the known benefits of context-aware models for pronoun translation (Voita et al., 2018).

3.3.2 Locating Factual Knowledge inside GPT-2

For our second case study, we experiment with a novel attribution-based technique to locate factual knowledge encoded in the layers of GPT-2 1.5B (Radford et al., 2019). Specifically, we aim to reproduce the results of Meng et al. (2022), showing the influence of intermediate layers in mediating the recall of factual statements such as The Eiffel Tower is located in the city of \(\rightarrow\) Paris. Meng et al. (2022) estimated the effect of network components in the prediction of factual statements as the difference in probability of a correct target (e.g. Paris), given a corrupted subject embedding (e.g. for Eiffel Tower), before and after restoring clean activations for some input tokens at different layers of the network. Apart from the obvious importance of final token states in terminal layers, their results highlight the presence of an early site associated with the last subject token playing an important role in recalling the network’s factual knowledge (Figure 3.5, top).

Figure 3.5: **Top:** Estimated causal importance of GPT-2 XL layers for predicting factual associations, as reported by Meng et al. (2022). **Bottom:** Average GPT-2 XL Gradient \(\times\) Layer Activation scores obtained with Inseq using contrastive factual pairs as attribution targets.

To verify such results, we propose a novel knowledge location method, which we name Contrastive Attribution Tracing (CAT), adopting the contrastive attribution paradigm of Yin and Neubig (2022) to locate relevant network components by attributing minimal pairs of correct and wrong factual targets (e.g. Paris vs. Rome for the example above). To perform contrastive attribution, we use the Layer Gradient \(\times\) Activation method, a layer-specific variant of Input \(\times\) Gradient, to propagate gradients up to intermediate network activations rather than reaching input tokens. The resulting attribution scores hence answer the question “How important are layer \(L\) activations for prefix token \(t\) in predicting the correct factual target over a wrong one?”. We compute attribution scores for 1000 statements taken from the Counterfact Statement dataset (Meng et al., 2022) and present averaged results in Figure 3.5 (bottom).¹¹ Our results closely align with those of the original authors, providing additional evidence that attribution methods can be used to identify salient network components and guide model editing, as demonstrated by Dai et al. (2022).
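The core computation can be illustrated on a toy network with a single hidden layer. The sketch assumes that the contrastive target is the logit difference between the correct and the wrong answer and that the output layer is linear; the weights are arbitrary, and the real analysis operates per token and per layer of GPT-2 XL rather than on this simplified model.

```python
import math
from typing import List, Sequence


def hidden_activations(x: Sequence[float], w_in: Sequence[Sequence[float]]) -> List[float]:
    """Intermediate layer: h = tanh(W_in x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_in]


def output_logits(hidden: Sequence[float], w_out: Sequence[Sequence[float]]) -> List[float]:
    """Linear output layer: z = W_out h."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def contrastive_layer_grad_x_activation(
    x: Sequence[float],
    w_in: Sequence[Sequence[float]],
    w_out: Sequence[Sequence[float]],
    correct: int,
    wrong: int,
) -> List[float]:
    """Gradient x activation at the hidden layer for z[correct] - z[wrong]."""
    hidden = hidden_activations(x, w_in)
    # d(z_correct - z_wrong) / d h_k = W_out[correct][k] - W_out[wrong][k]
    grads = [w_out[correct][k] - w_out[wrong][k] for k in range(len(hidden))]
    return [h * g for h, g in zip(hidden, grads)]


# Arbitrary toy weights; outputs 0 and 1 stand for a correct and a wrong target.
x = [1.0, -0.5]
w_in = [[0.5, 1.0], [1.5, 0.2], [-1.0, 0.3]]
w_out = [[1.0, 0.2, -0.5], [0.3, 0.8, 0.1]]

scores = contrastive_layer_grad_x_activation(x, w_in, w_out, correct=0, wrong=1)
print(scores)
```

Because the toy output layer is linear, the scores sum exactly to the logit difference between the two targets. In a deep transformer this equality would only hold approximately, which is one reason why gradient-based scores are treated as estimates of importance rather than causal effects.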

We introduced the proposed CAT method shortly before the attribution patching technique by Nanda (2023). Together, these two methods represent the most efficient knowledge location techniques based on gradient propagation, with our approach requiring only a single forward and backward pass of the attributed model. Patching-based approaches, such as causal mediation (Meng et al., 2022), on the other hand, provide causal guarantees of feature importance at the price of being more computationally intensive. Despite lacking the causal guarantees of such methods, CAT can provide an approximation of feature importance and greatly simplify the study of knowledge encoded in large language model representations, thanks to its efficiency.

3.4 Conclusion

We introduced Inseq, a versatile and easy-to-use toolkit for interpreting sequence generation models. With many libraries focused on the study of classification models, Inseq is the first tool explicitly designed to analyze systems for tasks such as machine translation, code generation, and conversational applications. Researchers can easily add interpretability evaluations to their studies using our library to identify unwanted biases and interesting phenomena in their models’ predictions.

With the Inseq toolkit providing the foundational infrastructure for interpretability analysis, the following chapters will leverage the supported input attribution techniques to investigate context usage in context-aware machine translation systems (Chapter 4) and multilingual language models for retrieval-augmented generation (Chapter 5).

Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. 2018. Sanity checks for saliency maps. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in neural information processing systems, volume 31, pages 9505–9515, Montréal, Canada. Curran Associates, Inc.

J. Alammar. 2021. Ecco: An open source library for the explainability of transformer language models. In Heng Ji, Jong C. Park, and Rui Xia, editors, Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing: System demonstrations, pages 249–257, Online. Association for Computational Linguistics.

Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, and Isabelle Augenstein. 2020. A diagnostic study of explainability techniques for text classification. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 3256–3274, Online. Association for Computational Linguistics.

Giuseppe Attanasio, Eliana Pastor, Chiara Di Bonaventura, and Debora Nozza. 2023. Ferret: A framework for benchmarking explainers on transformers. In Danilo Croce and Luca Soldaini, editors, Proceedings of the 17th conference of the european chapter of the association for computational linguistics: System demonstrations, pages 256–266, Dubrovnik, Croatia. Association for Computational Linguistics.

David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. 2010. How to explain individual classification decisions. J. Mach. Learn. Res., 11:1803–1831.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Yoshua Bengio and Yann LeCun, editors, Proceedings of the 3rd international conference on learning representations (ICLR), San Diego, CA, USA.

Jasmijn Bastings, Sebastian Ebert, Polina Zablotskaia, Anders Sandholm, and Katja Filippova. 2022. “Will you find these shortcuts?” A protocol for evaluating the faithfulness of input salience methods for text classification. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 conference on empirical methods in natural language processing, pages 976–991, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of “bias” in NLP. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th annual meeting of the association for computational linguistics, pages 5454–5476, Online. Association for Computational Linguistics.

Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Maksim Riabinin, Younes Belkada, Artem Chumachenko, Pavel Samygin, and Colin Raffel. 2023. Petals: Collaborative inference and fine-tuning of large models. In Danushka Bollegala, Ruihong Huang, and Alan Ritter, editors, Proceedings of the 61st annual meeting of the association for computational linguistics (volume 3: System demonstrations), pages 558–568, Toronto, Canada. Association for Computational Linguistics.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Won Ik Cho, Ji Won Kim, Seok Min Kim, and Nam Soo Kim. 2019. On measuring gender bias in translation of gender-neutral pronouns. In Marta R. Costa-jussà, Christian Hardmeier, Will Radford, and Kellie Webster, editors, Proceedings of the first workshop on gender bias in natural language processing, pages 173–181, Florence, Italy. Association for Computational Linguistics.

George Chrysostomou and Nikolaos Aletras. 2022. An empirical study on explanations in out-of-domain settings. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 6920–6938, Dublin, Ireland. Association for Computational Linguistics.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT’s attention. In Tal Linzen, Grzegorz Chrupała, Yonatan Belinkov, and Dieuwke Hupkes, editors, Proceedings of the 2019 ACL workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge neurons in pretrained transformers. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 8493–8502, Dublin, Ireland. Association for Computational Linguistics.

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in neural information processing systems, volume 35, pages 30318–30332. Curran Associates, Inc.

Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2020. ERASER: A benchmark to evaluate rationalized NLP models. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th annual meeting of the association for computational linguistics, pages 4443–4458, Online. Association for Computational Linguistics.

Bryan Eikema and Wilker Aziz. 2020. Is MAP decoding all you need? The inadequacy of the mode in neural machine translation. In Donia Scott, Nuria Bel, and Chengqing Zong, editors, Proceedings of the 28th international conference on computational linguistics, pages 4506–4520, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Joseph Enguehard. 2023. Sequential integrated gradients: A simple but effective method for explaining language models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the association for computational linguistics: ACL 2023, pages 7555–7565, Toronto, Canada. Association for Computational Linguistics.

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Çelebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin. 2021. Beyond English-centric multilingual machine translation. Journal of Machine Learning Research, 22(107):1–48.

Anna Farkas and Renáta Németh. 2022. How to measure gender bias in machine translation: Real-world oriented machine translators, multiple reference points. Social Sciences & Humanities Open, 5(1):100239.

Nils Feldhus, Robert Schwarzenberg, and Sebastian Möller. 2021. Thermostat: A large collection of NLP model explanations and analysis tools. In Heike Adel and Shuming Shi, editors, Proceedings of the 2021 conference on empirical methods in natural language processing: System demonstrations, pages 87–95, Online; Punta Cana, Dominican Republic. Association for Computational Linguistics.

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 878–891, Dublin, Ireland. Association for Computational Linguistics.

Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 conference on empirical methods in natural language processing, pages 3719–3728, Brussels, Belgium. Association for Computational Linguistics.

Patrick Fernandes, Kayo Yin, Emmy Liu, André Martins, and Graham Neubig. 2023. When does translation require context? A data-driven, multilingual exploration. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers), pages 606–626, Toronto, Canada. Association for Computational Linguistics.

Javier Ferrando, Gerard I. Gállego, Belen Alastruey, Carlos Escolano, and Marta R. Costa-jussà. 2022. Towards opening the black box of neural machine translation: Source and target interpretations of the transformer. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 conference on empirical methods in natural language processing, pages 8756–8769, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33rd international conference on machine learning, volume 48, pages 1050–1059, New York, NY, USA. Proceedings of Machine Learning Research (PMLR).

Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th annual meeting of the association for computational linguistics, pages 4198–4205, Online. Association for Computational Linguistics.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), pages 3543–3556, Minneapolis, Minnesota. Association for Computational Linguistics.

Jared Kaplan, Sam McCandlish, T. J. Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeff Wu, and Dario Amodei. 2020. Scaling laws for neural language models. ArXiv.

Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, and Orion Reblitz-Richardson. 2020. Captum: A unified and generic model interpretability library for PyTorch. ArXiv.

Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, et al. 2021. Datasets: A community library for natural language processing. In Heike Adel and Shuming Shi, editors, Proceedings of the 2021 conference on empirical methods in natural language processing: System demonstrations, pages 175–184, Online; Punta Cana, Dominican Republic. Association for Computational Linguistics.

Sheng Lu, Shan Chen, Yingya Li, Danielle Bitterman, Guergana Savova, and Iryna Gurevych. 2023. Measuring pointwise \(\mathcal{V}\)-usable information in-context-ly. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the association for computational linguistics: EMNLP 2023, pages 15739–15756, Singapore. Association for Computational Linguistics.

Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Proceedings of the 31st international conference on neural information processing systems, volume 30, pages 4768–4777, Long Beach, California, USA. Curran Associates Inc.

Andreas Madsen, Nicholas Meade, Vaibhav Adlakha, and Siva Reddy. 2022. Evaluating the faithfulness of importance measures in NLP by recursively masking allegedly important tokens and retraining. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the association for computational linguistics: EMNLP 2022, pages 1731–1751, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in neural information processing systems, volume 35, pages 17359–17372. Curran Associates, Inc.

Hosein Mohebbi, Willem Zuidema, Grzegorz Chrupała, and Afra Alishahi. 2023. Quantifying context mixing in transformers. In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th conference of the European chapter of the association for computational linguistics, pages 3378–3400, Dubrovnik, Croatia. Association for Computational Linguistics.

Neel Nanda. 2023. Attribution patching: Activation patching at industrial scale.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. Fairseq: A fast, extensible toolkit for sequence modeling. In Waleed Ammar, Annie Louis, and Nasrin Mostafazadeh, editors, Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics (demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Charles Pierse. 2021. Transformers interpret.

Marcelo OR Prates, Pedro H Avelar, and Luís C Lamb. 2020. Assessing gender bias in machine translation: A case study with Google Translate. Neural Computing and Applications, 32:6363–6381.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, New York, NY, USA. Association for Computing Machinery.

Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Marilyn Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: Human language technologies, volume 2 (short papers), pages 8–14, New Orleans, Louisiana. Association for Computational Linguistics.

Soumya Sanyal and Xiang Ren. 2021. Discretized integrated gradients for explaining language models. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 conference on empirical methods in natural language processing, pages 10285–10299, Online; Punta Cana, Dominican Republic. Association for Computational Linguistics.

Gabriele Sarti, Nils Feldhus, Jirui Qi, Malvina Nissim, and Arianna Bisazza. 2024. Democratizing advanced attribution analyses of generative language models with the Inseq toolkit. In xAI-2024 late-breaking work, demos and doctoral consortium joint proceedings, pages 289–296, Valletta, Malta. CEUR.org.

Gabriele Sarti, Nils Feldhus, Ludwig Sickert, Oskar van der Wal, Malvina Nissim, and Arianna Bisazza. 2023. Inseq: An interpretability toolkit for sequence generation models. In Danushka Bollegala, Ruihong Huang, and Alan Ritter, editors, Proceedings of the 61st annual meeting of the association for computational linguistics (volume 3: System demonstrations), pages 421–435, Toronto, Canada. Association for Computational Linguistics.

Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. Gender bias in machine translation. Transactions of the Association for Computational Linguistics, 9:845–874.

Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning important features through propagating activation differences. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th international conference on machine learning, volume 70, pages 3145–3153. PMLR.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Yoshua Bengio and Yann LeCun, editors, 2nd international conference on learning representations, (ICLR), Banff, AB, Canada.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th international conference on machine learning (ICML), volume 70, pages 3319–3328, Sydney, Australia. Journal of Machine Learning Research (JMLR).

Ian Tenney, James Wexler, Jasmijn Bastings, Tolga Bolukbasi, Andy Coenen, Sebastian Gehrmann, Ellen Jiang, Mahima Pushkarna, Carey Radebaugh, Emily Reif, and Ann Yuan. 2020. The language interpretability tool: Extensible, interactive visualizations and analysis for NLP models. In Qun Liu and David Schlangen, editors, Proceedings of the 2020 conference on empirical methods in natural language processing: System demonstrations, pages 107–118, Online. Association for Computational Linguistics.

Jörg Tiedemann. 2020. The Tatoeba translation challenge – realistic data sets for low resource and multilingual MT. In Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Yvette Graham, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, et al., editors, Proceedings of the fifth conference on machine translation, pages 1174–1182, Online. Association for Computational Linguistics.

Keyon Vafa, Yuntian Deng, David Blei, and Alexander Rush. 2021. Rationales for sequential predictions. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 conference on empirical methods in natural language processing, pages 10314–10332, Online; Punta Cana, Dominican Republic. Association for Computational Linguistics.

Jannis Vamvas and Rico Sennrich. 2021. On the limits of minimal pairs in contrastive evaluation. In Jasmijn Bastings, Yonatan Belinkov, Emmanuel Dupoux, Mario Giulianelli, Dieuwke Hupkes, Yuval Pinter, and Hassan Sajjad, editors, Proceedings of the fourth BlackboxNLP workshop on analyzing and interpreting neural networks for NLP, pages 58–68, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in neural information processing systems, volume 30. Curran Associates, Inc.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 1264–1274, Melbourne, Australia. Association for Computational Linguistics.

Eric Wallace, Matt Gardner, and Sameer Singh. 2020. Interpreting predictions of NLP models. In Aline Villavicencio and Benjamin Van Durme, editors, Proceedings of the 2020 conference on empirical methods in natural language processing: Tutorial abstracts, pages 20–23, Online. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, et al. 2020. Transformers: State-of-the-art natural language processing. In Qun Liu and David Schlangen, editors, Proceedings of the 2020 conference on empirical methods in natural language processing: System demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Francis Bach and David Blei, editors, Proceedings of the 32nd international conference on machine learning, volume 37, pages 2048–2057, Lille, France. PMLR.

Kayo Yin and Graham Neubig. 2022. Interpreting language models with contrastive explanations. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 conference on empirical methods in natural language processing, pages 184–198, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Muhammad Bilal Zafar, Michele Donini, Dylan Slack, Cedric Archambeau, Sanjiv Das, and Krishnaram Kenthapadi. 2021. On the lack of robust interpretability of neural text classifiers. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Findings of the association for computational linguistics: ACL-IJCNLP 2021, pages 3730–3740, Online. Association for Computational Linguistics.

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, 13th European conference on computer vision (ECCV), pages 818–833, Switzerland. Springer International Publishing.

Zhixue Zhao and Boxuan Shan. 2024. ReAGent: A model-agnostic feature attribution method for generative language models. AAAI Workshop on Responsible Language Models (ReLM).

We use sequence generation to refer to all iterative tasks, including (but not limited to) natural language generation.
https://github.com/Textualize/rich
We distinguish between gradient- and internals-based methods to account for the difference in granularity of their scores.
See Section A.1.2 for an example.
Users employing constrained decoding should be aware of its limitations in the presence of a high distributional discrepancy with natural model outputs (Vamvas and Sennrich, 2021).
bitsandbytes 0.37.0 is required for the backward method; see Section A.1.3 for an example.
Tutorial: https://inseq.org/en/latest/examples/petals.html
For multi-token occupation terms, e.g., bilim insanı (scientist), the score of the first token was used.
We set \(\Delta < 0.05\) for IG to ensure convergence. Token-level aggregation is performed using the L2 norm.
An example is provided in Section A.1.2.
Figure A.3 of Section A.1.3 presents some examples.

3.1 Related Work