From Insights to Impact

Actionable Interpretability for Neural Machine Translation

Ph.D. Thesis, Center for Language and Cognition (CLCG), University of Groningen
Author

Gabriele Sarti

Published

Dec 11th, 2025

Abstract

Neural language models have revolutionized the field of natural language processing, quickly becoming essential tools for a wide range of practical applications. Recent advances in interpretability research have offered valuable insights into the inner workings of these systems, but these insights often fail to translate into downstream improvements for users in real-world settings. This dissertation investigates the end-to-end development of interpretability methods to improve the trustworthiness and controllability of neural machine translation systems, from conception to experimentation with end users. Its findings address fundamental questions about how language models leverage contextual information, how their generation processes can be steered for personalization, and how interpretability insights can enhance professional translation practices.

The thesis work is organized into three interconnected parts. Part I develops foundational tools and methods for understanding how language models use contextual information during generation. We begin by introducing Inseq, an open-source toolkit for interactive analysis of language model behavior, showcasing its use for gender bias detection in machine translation and activation attribution using gradient-based methods. We then design PECoRe, a framework using contrastive input attribution to quantify how language models exploit contextual information, and demonstrate its effectiveness in detecting context influence in context-aware machine translation systems. Finally, we extend PECoRe to retrieval-augmented generation, using model internals to produce faithful, efficient and high-quality citations for open-book question answering.

Part II shifts the focus of our investigation from analysis to intervention, exploring methods for controlling translation outputs through prompting-based and steering-based approaches. We first present Ramp, a retrieval-augmented prompting technique exploiting relevant examples and style labels for attribute-controlled translation. We then move to the more challenging domain of literary translation, highlighting the effectiveness of steering interventions in conditioning models’ generation by surgically altering their internal representations. In particular, we show that interpretable concepts extracted by trained sparse autoencoders can be used to mimic personal translation styles from human professional translators, and that successful prompting and steering approaches converge on similar mechanistic solutions.

Finally, Part III explores how insights from model internals can inform human editors in professional translation workflows. We begin by conducting a post-editing user study spanning six typologically diverse languages (DivEMT), showing that translation productivity gains vary dramatically across language pairs, with typological similarity being more influential than traditional quality metrics. Our second study, QE4PE, investigates how word-level error highlights, obtained from both supervised and interpretability-based approaches, impact the productivity of professional post-editors and the quality of their translations. We conclude with a broad evaluation of unsupervised quality estimation methods, showing that error detection approaches based on model internals can outperform supervised baselines, and highlighting the importance of calibration and multiple annotations to account for human label variation.

Overall, this dissertation advances the field of machine translation interpretability by developing accessible tools and methods for understanding context usage, enabling fine-grained control over translation outputs, and establishing empirical evidence for the use of model internals in professional translation workflows. These contributions, taken together, lay the groundwork for the next generation of trustworthy, controllable, and user-centered translation systems.

1 Introduction

In recent years, language models have undergone a significant transformation, evolving from research prototypes producing barely coherent text into a cornerstone of modern technological infrastructure. This success stems in large part from the remarkable ability of large neural networks such as the transformer (Vaswani et al., 2017) to learn rich representations of language—and by extension, our world and society—from staggering amounts of text. Yet, the complex and deeply intertwined structure that renders these systems so powerful is also the main culprit behind their opacity. The inner workings of neural networks remain notoriously difficult to interpret, and the lack of transparency in their decision-making processes has raised serious concerns about their reliability and fairness in high-stakes applications (Rudin, 2019).

These circumstances have led to a growing interest in interpretability—a field closely aligned with the broader area of explainable artificial intelligence (XAI), which seeks to develop methods and tools to understand how neural networks work and provide insights into their decision-making processes (Doshi-Velez and Kim, 2017; Li et al., 2022). In natural language processing (NLP), interpretability research has made significant strides by uncovering how language models encode and process factual knowledge and linguistic information (Tenney et al., 2019; Belinkov, 2022; Meng et al., 2022), revealing their use of context during generation (Clark et al., 2019; Ferrando et al., 2022), and identifying the learned mechanisms underlying their capabilities (Elhage et al., 2021; Saphra and Wiegreffe, 2024).

While interpretability insights have earned broad recognition and influence within the NLP research community (Mosbach et al., 2024), critics have often pointed out that these findings rarely translate into actionable improvements for real-world systems (Räuker et al., 2023; Rai et al., 2024; Hendrycks and Hiscott, 2025). Most interpretability work today focuses on identifying subnetworks and mechanisms responsible for specific tasks inside language models (Ferrando et al., 2024; Sharkey et al., 2025), yet few studies have connected interpretability insights to end-users’ needs and desires (Ehsan et al., 2021), despite the crucial role of users in determining the practical usefulness of interpretability findings (Ehsan et al., 2024). This disconnect stems from a fundamental divide between research communities: most AI interpretability researchers pursue theoretical understanding of complex systems, while human-computer interaction (HCI) researchers prioritize actionable insights and practical applications.

A prime example of this disconnect can be found in the field of machine translation (MT), a long-standing area of research within NLP. MT researchers pioneered the use of neural language models for sequence generation tasks (Sutskever et al., 2014; Bahdanau et al., 2015), and were among the first to analyze their inner workings (Belinkov et al., 2017; Voita et al., 2019; Rogers et al., 2020). Yet, despite the significant progress in the performance of MT systems across hundreds of languages over the past decade, the field has been remarkably slow to bring interpretability insights to the users of these systems, especially professional translators who work with them on a daily basis. Users of “classic” translation tools such as Google Translate are, to this day, simply presented with translations, without the possibility of personalizing their tone or other properties, quantifying the model’s uncertainty about its output, or identifying potential errors and alternative formulations. At the other extreme, large language models like GPT-4 (OpenAI, 2023) eagerly offer eloquent justifications alongside their translations, but these explanations often fail to reflect the model’s actual processing and context usage, resulting in plausible yet unfaithful rationalizations (Turpin et al., 2023).

This dissertation aims to bridge the gap between method-centric interpretability research and outcome-centric real-world machine translation applications. We develop novel methods to understand and control language model generation, then study how to integrate these advances effectively into human translation workflows. Our research spans three interconnected macro-themes: (1) understanding how language models exploit contextual information during generation, (2) controlling model generation for personalized translation outputs, and (3) integrating interpretability insights into human translation workflows. Our methodological contributions, empirical evaluations, and user studies demonstrate how insights from interpretability research can lead to meaningful impact in the way machine translation systems are used in real-world translation workflows.

1.1 Outline and Contributions

The experimental chapters of this dissertation are organized into three parts, each addressing one of the research directions outlined above. Each part is composed of multiple chapters, each presenting a self-contained contribution or study related to the overarching theme. Figure 1.1 provides a visual overview of parts and chapters, highlighting for each chapter the topics introduced in detail in Chapter 2. Below, we summarize the contents, research questions and contributions of each part.

Figure 1.1: Chapter guide for the three parts of this dissertation.

Part I: Attributing Context Usage in Multilingual NLP

Part I establishes the foundational infrastructure and methodological frameworks for understanding how neural language models and machine translation systems process contextual information during generation. We begin with Inseq (Chapter 3), a toolkit that democratizes access to interpretability analyses of generative language models, providing the foundation for our investigations into context usage. Then, Chapter 4 introduces PECoRe, a data-driven framework for quantifying the plausibility of context usage in language models through the contrastive identification of context-sensitive tokens and contextual cues that influence their prediction. PECoRe is used to study context usage in context-aware machine translation systems, identifying failure cases stemming from an incorrect usage of context. Chapter 5 extends this analysis to modern large language models and retrieval-augmented generation settings with Mirage, adapting the PECoRe framework to demonstrate how model internals enable faithful answer attribution in question answering. This part addresses two fundamental research questions:

❓ Research Question 1 (RQ1)

What are the conceptual and technical requirements for interpretability software tools enabling scalable and reproducible analyses of the inner workings of generative language models?

❓ Research Question 2 (RQ2)

How do language models and machine translation systems exploit contextual information during generation, and how can we quantify this usage in a faithful manner?

Part I’s primary contributions include: (1) two open-source releases of the Inseq interpretability library; (2) the contrastive attribution tracing (CAT) method, a gradient-based alternative to causal intervention for efficiently identifying salient model components; (3) the PECoRe framework for context reliance attribution in language models, enabling data-driven exploration of context usage patterns in context-aware MT systems; and (4) an extended evaluation of context attribution for retrieval-augmented generation using Mirage, producing high-quality citations of retrieved documents while ensuring greater faithfulness to the model’s reasoning process.
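
To make the core idea behind PECoRe concrete, the snippet below is a minimal conceptual sketch of its context-sensitive token identification step, not the actual PECoRe implementation: each generated token is scored by how much its probability changes when the preceding context is removed. The model checkpoint, example sentences, and threshold are illustrative assumptions.

```python
# Conceptual sketch of contrastive context-sensitivity detection (assumptions:
# gpt2 as a stand-in model, toy sentences, an illustrative threshold of 1.0).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def target_token_logprobs(prompt: str, target: str) -> torch.Tensor:
    """Log-probability of each target token given the prompt (teacher forcing)."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at positions prompt_len-1 ... L-2 predict the target tokens
    logprobs = logits.log_softmax(-1)[0, prompt_ids.size(1) - 1 : -1]
    return logprobs.gather(1, target_ids[0].unsqueeze(1)).squeeze(1)

context = "Sarah had been waiting outside the office for an hour. "
source = "When the door finally opened,"
target = " she rushed inside."

with_context = target_token_logprobs(context + source, target)
without_context = target_token_logprobs(source, target)

# Tokens whose likelihood shifts sharply when the context is removed are
# flagged as context-sensitive (e.g., the pronoun " she" above).
for token, delta in zip(tok.tokenize(target), (with_context - without_context).tolist()):
    marker = "*" if abs(delta) > 1.0 else " "
    print(f"{marker} {token!r:>12}  Δ log-prob = {delta:+.2f}")
```

The full framework additionally traces each flagged token back to the specific contextual cues responsible for its prediction, rather than stopping at detection.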

Part II: Conditioning Generation for Personalized Machine Translation

Part II moves from understanding context usage to actively controlling model generation for customized translation outputs. Across two chapters, we explore two paradigms to condition machine translation outputs—prompting-based methods and direct interventions in model processing—addressing the question:

❓ Research Question 3 (RQ3)

Are interpretability-based steering methods viable approaches for controllable machine translation? How do they compare with prompting-based methods in terms of their performance and their impact on models’ internal mechanisms?

Chapter 6 pioneers the usage of prompting-based strategies for attribute-controlled translation, while Chapter 7 connects generation conditioning to interpretability techniques, expanding the scope of our analysis from simple attributes in common domains to sophisticated personal styles in the challenging literary translation domain.

The core contributions of Part II include: (1) Ramp, a novel prompting methodology achieving strong performance in attribute-controlled translation across multiple languages and attributes without model fine-tuning; (2) the first comprehensive comparison of prompting versus interpretability-based steering for machine translation personalization; (3) a novel contrastive steering method using sparse autoencoder latents to achieve personalization accuracy comparable to prompting while preserving quality in literary translation; and (4) evidence that prompting and steering methods converge to similar mechanistic solutions, revealing fundamental principles of generation conditioning.
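
As an illustration of the kind of intervention involved, the following is a minimal activation steering sketch, not the SAE-based method developed in Chapter 7: a fixed direction is added to the hidden states of one transformer layer through a forward hook during generation. The model checkpoint, layer index, scaling factor, and random direction are placeholder assumptions; in the actual approach the direction is derived from sparse autoencoder latents contrasting translation styles.

```python
# Minimal activation steering sketch (assumptions: gpt2 as stand-in model,
# layer 6, scale 4.0, and a random unit direction instead of an SAE latent).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

layer_idx, scale = 6, 4.0
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()  # unit-norm steering direction

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # shift them along the steering direction and keep the rest untouched.
    hidden = output[0] + scale * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)
ids = tok("The translation reads:", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore the unsteered model
```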

Part III: Interpretability in Human Translation Workflows

Part III evaluates how interpretability insights can benefit, in practice, the human professionals who edit machine-translated content. We begin with DivEMT (Chapter 8), a study investigating the effectiveness of professional MT post-editing across a diverse set of mid-resourced languages, going beyond the one-size-fits-all analysis of high-resourced translation directions. This allows us to establish our human evaluation setup, providing valuable insights into the question:

❓ Research Question 4 (RQ4)

Does MT contribute positively to the productivity of professional translators across different languages? Which factors influence its effectiveness?

Building upon these insights, our second large-scale study QE4PE (Chapter 9) investigates how word-level error span highlights—including those derived from MT systems’ uncertainty during generation—impact the productivity of professional translators and the quality of post-edited contents:

❓ Research Question 5 (RQ5)

How do word-level error highlights impact the productivity and editing choices of professional translators and the quality of resulting translations?

Chapter 10 concludes our human-centered investigation with a deeper analysis of multiple uncertainty- and interpretability-based word-level quality estimation methods. This analysis allows us to assess how the performance of these techniques varies across models, languages, and human annotators:

❓ Research Question 6 (RQ6)

Can unsupervised error span detection methods reliably identify problems in machine-translated outputs? How does human label variation affect their performance, compared to traditional supervised approaches?

Part III’s contributions include: (1) DivEMT, a cross-lingual post-editing dataset enabling controlled comparison of translator productivity across editing modalities and typologically diverse languages; (2) evidence that MT quality metrics fail to correlate with human post-editing productivity across languages, with productivity being heavily influenced by source-target language relatedness; (3) QE4PE, a comprehensive post-editing dataset containing error spans, behavioral editing metrics, and quality annotations from 42 professional post-editors for two translation directions; (4) evidence that error span highlights may reduce productivity but improve critical error detection; and (5) evidence that unsupervised quality estimation methods based on model internals can match state-of-the-art supervised approaches in both accuracy and downstream usability, revealing how subjective editing choices impact the evaluation of error span detection methods.
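
As a simple illustration of quality estimation from model internals, the sketch below flags output tokens to which the MT model itself assigns low probability, one of the unsupervised signals of the kind examined in Chapter 10. The model checkpoint, sentence pair, and confidence threshold are illustrative assumptions, and the snippet presumes a recent transformers version supporting target-side tokenization via text_target.

```python
# Illustrative unsupervised word-level QE sketch (assumptions: opus-mt-en-de as
# stand-in model, a toy sentence pair, and a log-probability threshold of -3.0).
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "Helsinki-NLP/opus-mt-en-de"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).eval()

source = "The agreement was signed behind closed doors."
hypothesis = "Das Abkommen wurde hinter verschlossenen Türen unterschrieben."

# Tokenize source and target; passing labels makes the model score the hypothesis
batch = tok(source, text_target=hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits

labels = batch["labels"][0]
token_logprobs = logits.log_softmax(-1)[0].gather(1, labels.unsqueeze(1)).squeeze(1)

# Tokens the model itself finds surprising are candidate error spans
for token, lp in zip(tok.convert_ids_to_tokens(labels.tolist()), token_logprobs.tolist()):
    flag = "!" if lp < -3.0 else " "
    print(f"{flag} {token:>15}  log-prob = {lp:+.2f}")
```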

1.2 Scientific Output

This dissertation is the product of several research articles and open-source projects, which are categorized in the following sections.

1.2.1 Main Publications

The following articles represent the main contributions reflected in this thesis’ experimental chapters, organized in their respective parts:1

Introduction and Background

  • Ferrando, J., Sarti, G., Bisazza, A. and Costa-jussà, M. R. (2024). A Primer on the Inner Workings of Transformer-based Language Models. arXiv preprint (Chapter 2)

Part I: Attributing Context Usage in Multilingual NLP

  • Sarti, G., Feldhus, N., Sickert, L., van der Wal, O., Nissim, M. and Bisazza, A. (2023a). Inseq: An Interpretability Toolkit for Sequence Generation Models. In Proc. of the 61st Annual Meeting of the Association for Computational Linguistics (ACL Demo) (Chapter 3)

  • Sarti, G., Feldhus, N., Qi, J., Nissim, M. and Bisazza, A. (2024d). Democratizing Advanced Attribution Analyses of Generative Language Models with the Inseq Toolkit. In Proc. of the 2nd World Conference on eXplainable Artificial Intelligence: Late-breaking works and demos (xAI Demo) (Chapter 3 and Chapter 4)

  • Sarti, G., Chrupała, G., Nissim, M. and Bisazza, A. (2024c). Quantifying the Plausibility of Context Reliance in Neural Machine Translation. In Proc. of the 12th International Conference on Learning Representations (ICLR) (Chapter 4)

  • Qi, J.†, Sarti, G.†, Fernández, R. and Bisazza, A. (2024). Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation. In Proc. of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Chapter 5)

Part II: Conditioning Generation for Personalized Machine Translation

  • Sarti, G., Htut, P. M., Niu, X., Hsu, B., Currey, A., Dinu, G. and Nadejde, M. (2023b). RAMP: Retrieval and Attribute-Marking Enhanced Prompting for Attribute-Controlled Translation. In Proc. of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) (Chapter 6)

  • Scalena, D.†, Sarti, G.†, Bisazza, A., Fersini, E. and Nissim, M. (2025). Steering Large Language Models for Machine Translation Personalization. arXiv preprint (Chapter 7)

Part III: Interpretability in Human Translation Workflows

  • Sarti, G., Bisazza, A., Guerberof-Arenas, A. and Toral, A. (2022). DivEMT: Neural Machine Translation Post-Editing Effort Across Typologically Diverse Languages. In Proc. of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Chapter 8)

  • Sarti, G., Zouhar, V., Chrupała, G., Guerberof-Arenas, A., Nissim, M. and Bisazza, A. (2025a). QE4PE: Word-level Quality Estimation for Human Post-Editing. Transactions of the Association for Computational Linguistics (TACL) (Chapter 9)

  • Sarti, G., Zouhar, V., Nissim, M. and Bisazza, A. (2025b). Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement. In Proc. of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Chapter 10)

I led the conceptualization, implementation, experimental evaluation, and manuscript writing for each article for which I am the sole first author. For articles with shared first authorship, I co-led the conceptualization, experimental design, and manuscript writing. In Qi† et al. (2024), I also implemented the API for experimental evaluation. The background in Chapter 2 adapts parts of our primer on transformer interpretability (Ferrando et al., 2024), to which I contributed by surveying the literature and writing the content on the transformer architecture, input attribution methods, steering approaches, and interpretability tools.

1.2.2 Open-source Contributions

Open-source software proved fundamental to this thesis, providing a solid foundation for conducting reproducible experimental work. Notably, all investigations we conducted employed solely open-source tools, models and datasets, despite the current popularity of proprietary language models. Each chapter provides links to all datasets, models, code, and demos to encourage scrutiny and foster further research.

My most notable contribution to the open-source research ecosystem is the Inseq toolkit, presented in Chapter 3, for which I serve as development lead. The library now counts 430+ GitHub stars and 80+ citations across international venues.
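
For reference, a minimal usage example of the toolkit is shown below, in the spirit of the gender bias analysis discussed in Chapter 3; the specific model checkpoint, attribution method, and input sentence are illustrative choices.

```python
# Minimal Inseq usage sketch: attribute an English-to-Italian translation
# with Integrated Gradients (model, method and input are illustrative choices).
import inseq

model = inseq.load_model("Helsinki-NLP/opus-mt-en-it", "integrated_gradients")
out = model.attribute(
    "The developer argued with the designer because she did not like the design."
)
out.show()  # display token-level attribution scores
```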

I also contributed to the development of the following open-source projects:

  • The Groningen Translation Environment (GroTE), a Gradio-based UI for machine translation post-editing supporting the live recording of behavioral logs using the Hugging Face datasets hub and spaces ecosystem, developed with the help of Vilém Zouhar for the QE4PE study (Chapter 9). Available at https://github.com/gsarti/grote or via pip install grote.

  • gradio-highlightedtextbox, a Svelte component for Gradio supporting text editing with highlighted spans, developed for collecting behavioral edit data in GroTE. Available at https://huggingface.co/spaces/gsarti/gradio_highlightedtextbox or via pip install gradio-highlightedtextbox.

  • labl, a toolkit to facilitate token-level analyses of annotated texts with multiple edits and tokenization schemes, developed with the help of Vilém Zouhar for Chapter 10 analyses. Available at https://github.com/gsarti/labl or via pip install labl.

  • Interpreto, a Python toolbox for concept-based interpretability analyses of language models maintained by the FOR/DEEL teams, which I helped design and develop as part of my visit to the IRT Saint Exupéry research institute in Toulouse, France. Interpreto is available at https://github.com/FOR-sight-ai/interpreto or via pip install interpreto.

The full set of open-source contributions, including demos, models, and datasets, is available on GitHub and 🤗 Hugging Face.

1.2.3 Other Research Contributions

Beyond the scope of this dissertation, my research output includes projects organized around two main themes:

Advancing Italian natural language processing:

  • Miaschi, A., Sarti, G., Brunato, D., Dell’Orletta, F. and Venturi, G. (2022). Probing Linguistic Knowledge in Italian Neural Language Models across Language Varieties. Italian Journal of Computational Linguistics (IJCoL)

  • Bianchi, F., Attanasio, G., Pisoni, R., Terragni, S., Sarti, G. and Balestri, D. (2023). Contrastive Language-Image Pre-training for the Italian Language. In Proc. of the 9th Italian Conference on Computational Linguistics (CLiC-it)

  • Sarti, G. and Nissim, M. (2024). IT5: Text-to-text Pretraining for Italian Language Understanding and Generation. In Proc. of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)

  • Sarti, G., Caselli, T., Nissim, M. and Bisazza, A. (2024b). Non Verbis, Sed Rebus: Large Language Models Are Weak Solvers of Italian Rebuses. In Proc. of the 10th Italian Conference on Computational Linguistics (CLiC-it)

  • Sarti, G., Caselli, T., Bisazza, A. and Nissim, M. (2024a). EurekaRebus - Verbalized Rebus Solving with LLMs: A CALAMITA Challenge. In Proc. of the 10th Italian Conference on Computational Linguistics (CLiC-it)

  • Ciaccio, C., Sarti, G., Miaschi, A. and Dell’Orletta, F. (2025). Crossword Space: Latent Manifold Learning for Italian Crosswords and Beyond. In Proc. of the 11th Italian Conference on Computational Linguistics (CLiC-it)

Interpreting the inner workings of generative language models:

  • Langedijk, A., Mohebbi, H., Sarti, G., Zuidema, W. and Jumelet, J. (2024). DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers. In Findings of the North American Chapter of the Association for Computational Linguistics (NAACL Findings)

  • Edman, L., Sarti, G., Toral, A., van Noord, G. and Bisazza, A. (2024). Are Character-level Translations Worth the Wait? Comparing ByT5 and mT5 for Machine Translation. Trans. of the Association for Computational Linguistics (TACL)

  • Scalena, D., Sarti, G. and Nissim, M. (2024). Multi-property Steering of Large Language Models with Dynamic Activation Composition. In Proc. of the 7th Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP)

  • Ghasemi Madani, M. R., Gema, A. P., Sarti, G., Zhao, Y., Minervini, P. and Passerini, A. (2025). Noiser: Bounded Input Perturbations for Attributing Large Language Models. In Proc. of the 2nd Conference on Language Modeling (CoLM)

  • Candussio, S., Saveri, G., Sarti, G. and Bortolussi, L. (2025). Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers. In Proc. of the European Conference on Machine Learning and Principles of Knowledge Discovery in Databases (ECML-PKDD)

  • Islam, K. I. and Sarti, G. (2025). Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation. In Proc. of the 2nd Workshop on Bangla Language Processing (BLP).

Finally, I also had the privilege of co-organizing the BlackboxNLP 2025 workshop2—the leading venue for NLP interpretability work—and contributing to its shared task on benchmarking mechanistic interpretability methods for circuit localization and causal variable identification in large language models:

  • Belinkov, Y., Mueller, A., Kim, N., Mohebbi, H., Chen, H., Arad, D., Sarti, G. (2025). Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP.

  • Arad, D., Belinkov, Y., Chen, H., Kim, N., Mohebbi, H., Mueller, A., Sarti, G., Tutek, M. (2025). Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models. In Proc. of the 8th Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP)


  1. Shared first authorship is indicated by †.

  2. https://blackboxnlp.github.io/2025