This presentation focuses on applying post-hoc interpretability techniques to analyze how language models (LMs) use input information throughout the generation process. We briefly introduce Inseq, our open-source toolkit designed to simplify advanced feature attribution analyses for LMs. We then introduce our Plausibility Evaluation of Context Reliance (PECoRe) interpretability framework for conducting data-driven analyses of context usage in LMs. Finally, we showcase how PECoRe can easily be adapted to retrieval-augmented generation (RAG) settings to produce internals-based citations for model answers. Our proposed Model Internals for RAG Explanations (MIRAGE) method achieves citation quality comparable to supervised answer validators with no additional training, producing citations that are faithful to actual context usage during generation.
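As a flavor of what such an analysis looks like in practice, the minimal sketch below loads a Hugging Face model through Inseq and attributes its generation with a gradient-based method. The model and attribution method are illustrative choices, and exact argument names may vary across Inseq versions.

```python
# Minimal sketch of an Inseq feature attribution workflow
# (model and method are illustrative; arguments may differ across versions).
import inseq

# Wrap a Hugging Face causal LM with a gradient-based attribution method
model = inseq.load_model("gpt2", "integrated_gradients")

# Generate a continuation and attribute it to the input tokens
out = model.attribute("The developer argued with the designer because")

# Render token-level importance scores (HTML in notebooks, text in terminals)
out.show()
```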
This presentation summarizes the main contributions of my PhD thesis, advocating for a user-centric perspective on interpretability research that aims to translate theoretical advances in model understanding into practical gains in trustworthiness and transparency for end users of these systems.
This dissertation bridges the gap between scientific insights into how language models work and practical benefits for users of these systems, paving the way for better human-AI interaction practices for professional translators and everyday users worldwide.
This masterclass will feature a series of insightful presentations and a hands-on tutorial focused on explainability techniques for Large Language Models (LLMs) and other deep learning architectures. Participants will gain both conceptual insights and practical experience in interpreting and understanding the inner workings of modern AI systems. My presentation provides a general introduction to popular interpretability approaches for studying large language models. In particular, we will focus on attribution methods to identify the influence of context on model predictions and on mechanistic techniques to locate and intervene on model knowledge and behaviors.
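To make the "locate and intervene" idea concrete, here is a toy activation-patching sketch using PyTorch forward hooks: the hidden state of one layer from a "clean" run is spliced into a "corrupted" run to test whether that layer carries the relevant information. The model, prompts, and layer index are arbitrary assumptions for illustration, not material from the masterclass itself.

```python
# Toy activation patching with PyTorch hooks (illustrative model/prompts/layer).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = 6  # which GPT-2 block to intervene on (arbitrary choice)
clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")

# 1) Run the clean prompt and cache the chosen block's hidden states
cache = {}
def save_hook(module, inputs, output):
    cache["h"] = output[0].detach()

handle = model.transformer.h[LAYER].register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

# 2) Re-run the corrupted prompt, splicing the clean last-position state back in
def patch_hook(module, inputs, output):
    h = output[0].clone()
    h[:, -1, :] = cache["h"][:, -1, :]
    return (h,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    patched = model(**corrupt).logits[0, -1]
handle.remove()

with torch.no_grad():
    baseline = model(**corrupt).logits[0, -1]

print("unpatched next token:", tok.decode(baseline.argmax().item()))
print("patched next token:  ", tok.decode(patched.argmax().item()))
```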
In this presentation, I will provide an overview of the interpretability research landscape and describe various promising methods for exploring and controlling the inner mechanisms of generative language models. I will start by discussing post-hoc attribution techniques and their use to identify prediction-relevant inputs, showcasing their application within our PECoRe framework for context usage attribution and its adaptation to produce internals-based citations in retrieval-augmented generation settings (MIRAGE). The final part will present core insights from recent mechanistic interpretability literature, focusing on the construction of replacement models to build concept attribution graphs and on their practical use for monitoring LLM behaviors.
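The sketch below illustrates the contrastive intuition behind identifying context-sensitive generated tokens, which PECoRe formalizes: each answer token is scored by how much the model's predictive distribution shifts when the retrieved context is removed. The model, prompts, divergence metric, and threshold are illustrative assumptions; the actual PECoRe implementation (available through Inseq) differs in its details.

```python
# Simplified illustration of contrastive context-sensitivity scoring
# (not the actual PECoRe implementation; prompts and threshold are toy choices).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

context = "Context: The 2024 Nobel Prize in Physics was awarded to Hopfield and Hinton.\n"
question = "Question: Who won the 2024 Nobel Prize in Physics?\nAnswer:"
answer = " Hopfield and Hinton."

def answer_logprobs(prompt, answer):
    """Per-token log-probability distributions predicting each answer token."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    answer_ids = tok(answer, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    start = prompt_ids.shape[1] - 1  # position predicting the first answer token
    return F.log_softmax(logits[0, start : start + answer_ids.shape[1]], dim=-1), answer_ids[0]

with_ctx, answer_ids = answer_logprobs(context + question, answer)
without_ctx, _ = answer_logprobs(question, answer)

# KL divergence between with-context and context-free predictions, per answer token
kl = F.kl_div(without_ctx, with_ctx, log_target=True, reduction="none").sum(-1)
for tok_id, score in zip(answer_ids, kl):
    flag = "*" if score > 1.0 else " "  # arbitrary threshold for illustration
    print(f"{flag} {tok.decode(int(tok_id)):>10}  KL={score.item():.2f}")
```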
This presentation discusses interpreting latent features in large language models (LLMs). After an introduction to mechanistic interpretability fundamentals, including feature superposition and sparse autoencoders, I discuss recent work by the Anthropic interpretability team (Ameisen et al., 2025; Lindsey et al., 2025) on extracting circuits of interpretable features from trained LLMs. Real-world investigations of Claude's mechanisms, such as multi-step reasoning and multilinguality, are also analyzed.
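For readers unfamiliar with sparse autoencoders, the minimal sketch below shows the basic recipe: model activations are reconstructed through an overcomplete, sparsity-penalized hidden layer so that each latent ideally captures a single interpretable feature. Dimensions and coefficients are placeholder assumptions, not the configurations used in the cited work.

```python
# Minimal sparse autoencoder (SAE) sketch; sizes and coefficients are placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # overcomplete feature dictionary
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()
acts = torch.randn(32, 768)      # stand-in for residual-stream activations
recon, feats = sae(acts)
l1_coeff = 1e-3                  # sparsity strength (illustrative)
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().sum(-1).mean()
loss.backward()                  # an optimizer step would follow during training
```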