Evaluations and interpretability offer complementary but disconnected views of large language model understanding. This talk presents a research program aimed at bridging this gap across three threads. First, I describe the PECoRe and MIRAGE frameworks for scalable analysis of context usage in LLM generations, with applications to answer attribution in RAG settings. Second, I present a framework combining behavioral evaluation with representational analysis to assess goal-directedness in LLM agents. Studying an LLM navigating grid worlds, we decode cognitive maps from model activations and show that many apparent behavioral failures are rational under the agent's imperfect internal beliefs. Finally, I outline an updated view of the NDIF ecosystem and highlight our vision for open-source infrastructure for merging evals and interpretability workflows.
In this presentation, I will provide an overview of the interpretability research landscape and describe several promising methods for exploring and controlling the inner mechanisms of generative language models. I will start by discussing post-hoc attribution techniques and their use in identifying prediction-relevant inputs, showcasing their application within our PECoRe framework for context usage attribution and its adaptation to produce internals-based citations in retrieval-augmented generation settings (MIRAGE). The final part will present core insights from recent mechanistic interpretability literature, focusing on the construction of replacement models to build concept attribution graphs and their practical usage for monitoring LLM behaviors.
Attribution methods are a family of techniques for tracing the influence of inputs and model components on a model's predictions. In this lecture, I will provide an overview of attribution methods, focusing in particular on the shortcomings and practical applications of input attribution techniques, and their use for analyzing context usage in language models.
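To make the idea of input attribution concrete, here is a minimal occlusion-style sketch: each input token's attribution score is the drop in the model's output when that token is masked. The `toy_model` scoring function below is a hypothetical stand-in (not part of any framework mentioned above) for a real language model's score for a target prediction; any callable mapping a token list to a scalar would work.

```python
def toy_model(tokens):
    # Hypothetical scoring function standing in for a real LM's
    # probability of a target prediction (purely illustrative).
    weights = {"Paris": 0.6, "France": 0.3, "capital": 0.1}
    return sum(weights.get(t, 0.0) for t in tokens)

def occlusion_attribution(model, tokens, mask="[MASK]"):
    """Score each token as: full-input score minus the score
    obtained when that token is replaced by a mask symbol."""
    full = model(tokens)
    scores = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask] + tokens[i + 1:]
        scores.append(full - model(occluded))
    return scores

tokens = ["The", "capital", "of", "France", "is", "Paris"]
print(occlusion_attribution(toy_model, tokens))
```

Tokens that contribute nothing to the score receive an attribution of zero, while influential tokens receive scores proportional to their contribution. Gradient-based methods replace the expensive per-token forward passes with a single backward pass, at the cost of a local linearity assumption.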
This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture.