Mechanistic Interpretability | Gabriele Sarti

Mechanistic Interpretability

Scaling Interpretability for LLM Agents

Evaluations and interpretability offer complementary but disconnected views of large language model understanding. This talk presents a research program aimed at bridging this gap across three threads. First, I describe the PECoRe and MIRAGE frameworks for scalable context usage analysis in LLM generations, with applications to answer attribution in RAG settings. Second, I present a framework combining behavioral evaluation with representational analysis to assess goal-directedness in LLM agents. Studying an LLM navigating grid worlds, we decode cognitive maps from model activations and show that many apparent behavioral failures are rational under the agent's imperfect internal beliefs. Finally, I outline an updated view of the NDIF ecosystem and highlight our vision for open-source infrastructure merging evals and interpretability workflows.

Interpretability for Language Models: Current Trends and Applications

In this presentation, I will provide an overview of the interpretability research landscape and describe various promising methods for exploring and controlling the inner mechanisms of generative language models. I will start by discussing post-hoc attribution techniques and their usage to identify prediction-relevant inputs, showcasing their application within our PECoRe framework for context usage attribution and its adaptation to produce internals-based citations in retrieval-augmented generation settings (MIRAGE). The final part will present core insights from recent mechanistic interpretability literature, focusing on the construction of replacement models to build concept attribution graphs and their practical use for monitoring LLM behaviors.

Attribution: Tracing Influence to Inputs and Model Components

Attribution methods are a family of techniques for tracing the influence of inputs and model components on a model's predictions. In this lecture, I will provide an overview of attribution methods, focusing in particular on the shortcomings and practical applications of input attribution techniques and their use in analyzing context usage in language models.

A Primer on the Inner Workings of Transformer-based Language Models

This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture.