Feature Attribution | Gabriele Sarti

From Insights to Impact: Actionable Interpretability for Neural Machine Translation

This presentation summarizes the main contributions of my PhD thesis. It advocates for a user-centric perspective on interpretability research, aiming to translate theoretical advances in model understanding into practical gains in trustworthiness and transparency for end users of these systems.

Scaling Interpretability for LLM Agents

Evaluations and interpretability offer complementary but disconnected views on understanding large language models. This talk presents a research program aimed at bridging this gap across three threads. First, I present a framework combining behavioral evaluation with representational analysis to assess goal-directedness in LLM agents. Studying an LLM navigating grid worlds, we decode cognitive maps from model activations and show that many apparent behavioral failures are rational under the agent's imperfect internal beliefs. Finally, I outline an updated view of the NDIF ecosystem and highlight our vision for open-source infrastructure that merges evals and interpretability workflows.

Interpretability for Language Models: Current Trends and Applications

In this presentation, I will present core insights from recent mechanistic interpretability literature, focusing on the construction of replacement models to build concept attribution graphs and their practical use for monitoring LLM behaviors. I will survey several applications of LRMs for studying model behavior and conclude with an overview of our efforts at NDIF to build scalable tooling for interpretability research.

Scaling Interpretability for LLM Agents

Evaluations and interpretability offer complementary but disconnected views on understanding large language models. This talk presents a research program aimed at bridging this gap across three threads. First, I describe the PECoRe and MIRAGE frameworks for scalable context usage analyses in LLM generations, with applications to answer attribution in RAG settings. Second, I present a framework combining behavioral evaluation with representational analysis to assess goal-directedness in LLM agents. Studying an LLM navigating grid worlds, we decode cognitive maps from model activations and show that many apparent behavioral failures are rational under the agent's imperfect internal beliefs. Finally, I outline an updated view of the NDIF ecosystem and highlight our vision for open-source infrastructure that merges evals and interpretability workflows.

Interpretability for Language Models: Current Trends and Applications

In this presentation, I will provide an overview of the interpretability research landscape and describe various promising methods for exploring and controlling the inner mechanisms of generative language models. I will start by discussing post-hoc attribution techniques and their use in identifying prediction-relevant inputs, showcasing their application within our PECoRe framework for context usage attribution and its adaptation to produce internals-based citations in retrieval-augmented generation settings (MIRAGE). The final part will present core insights from recent mechanistic interpretability literature, focusing on the construction of replacement models to build concept attribution graphs and their practical use for monitoring LLM behaviors.

Interpreting Context Usage in Generative Language Models

This presentation focuses on applying post-hoc interpretability techniques to analyze how language models (LMs) use input information throughout the generation process. We briefly introduce Inseq, our open-source toolkit designed to simplify advanced feature attribution analyses for LMs. We then introduce our Plausibility Evaluation of Context Reliance (PECoRe) interpretability framework for conducting data-driven analyses of context usage in LMs. Finally, we show how PECoRe can easily be adapted to retrieval-augmented generation (RAG) settings to produce internals-based citations for model answers. Our proposed Model Internals for RAG Explanations (MIRAGE) method achieves citation quality comparable to supervised answer validators with no additional training, producing citations that are faithful to actual context usage during generation.

Attribution: Tracing Influence to Inputs and Model Components

Attribution methods are a family of techniques for tracing the influence of inputs and model components on a model's predictions. In this lecture, I will provide an overview of attribution methods, focusing in particular on the shortcomings and practical applications of input attribution techniques and on their use in analyzing context usage in language models.
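
As a rough illustration of what input attribution means in practice, the sketch below implements leave-one-out occlusion, one simple member of this family: each input token is masked in turn, and the resulting drop in the model's score is taken as that token's attribution. This is not taken from the lecture materials. The scoring function `toy_score`, its `WEIGHTS` table, and the `<mask>` placeholder are hypothetical stand-ins for a real language model, chosen only so the example runs on its own.

```python
from typing import Callable, Sequence


def occlusion_attribution(
    score_fn: Callable[[Sequence[str]], float],
    tokens: Sequence[str],
    mask_token: str = "<mask>",
) -> list[float]:
    """Leave-one-out attribution: the score drop observed when each token is masked."""
    baseline = score_fn(tokens)
    attributions = []
    for i in range(len(tokens)):
        occluded = list(tokens)
        occluded[i] = mask_token  # hide one token, keep the rest intact
        attributions.append(baseline - score_fn(occluded))
    return attributions


# Hypothetical stand-in for a model: a bag-of-words score with fixed weights.
WEIGHTS = {"not": -2.0, "good": 1.5, "movie": 0.25}


def toy_score(tokens: Sequence[str]) -> float:
    # Unknown tokens (including the mask placeholder) contribute nothing.
    return sum(WEIGHTS.get(token, 0.0) for token in tokens)


tokens = ["not", "a", "good", "movie"]
scores = occlusion_attribution(toy_score, tokens)
for token, score in zip(tokens, scores):
    print(f"{token:>6}: {score:+.2f}")
```

For this additive toy model, the attributions simply recover each token's weight. With a real language model, tokens interact, so occlusion scores depend on the choice of mask and can disagree with gradient-based methods, which is one of the shortcomings such overviews typically discuss.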