Mechanistic Interpretability | Gabriele Sarti

Mechanistic Interpretability

Scaling Interpretability for LLM Agents

Evaluations and interpretability offer complementary but disconnected views of large language model understanding. This talk presents a research program aimed at bridging this gap across three threads. First, I describe the PECoRe and MIRAGE frameworks for scalable context usage analysis in LLM generations, with applications to answer attribution in RAG settings. Second, I present a framework combining behavioral evaluation with representational analysis to assess goal-directedness in LLM agents. Studying an LLM navigating grid worlds, we decode cognitive maps from model activations and show that many apparent behavioral failures are rational under the agent's imperfect internal beliefs. Finally, I outline an updated view of the NDIF ecosystem and highlight our vision for open-source infrastructure merging evals and interpretability workflows.

Interpretability for Language Models: Current Trends and Applications

In this presentation, I will provide an overview of the interpretability research landscape and describe various promising methods for exploring and controlling the inner mechanisms of generative language models. I will start by discussing post-hoc attribution techniques and their usage to identify prediction-relevant inputs, showcasing their application within our PECoRe framework for context usage attribution and its adaptation to produce internals-based citations in retrieval-augmented generation settings (MIRAGE). The final part will present core insights from recent mechanistic interpretability literature, focusing on the construction of replacement models to build concept attribution graphs and their practical use for monitoring LLM behaviors.

Attribution: Tracing Influence to Inputs and Model Components

Attribution methods are a family of techniques for tracing the influence of inputs and model components on a model's predictions. In this lecture, I will provide an overview of attribution methods, focusing in particular on the shortcomings and practical applications of input attribution techniques and their use in analyzing context usage in language models.

A Primer on the Inner Workings of Transformer-based Language Models

This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture.