NDIF | Gabriele Sarti

Scaling Interpretability for LLM Agents

Evaluations and interpretability offer complementary but disconnected views of large language model understanding. This talk presents a research program aimed at bridging this gap across three threads. First, I describe the PECoRe and MIRAGE frameworks for scalable analysis of context usage in LLM generations, with applications to answer attribution in RAG settings. Second, I present a framework combining behavioral evaluation with representational analysis to assess goal-directedness in LLM agents. Studying an LLM navigating grid worlds, we decode cognitive maps from model activations and show that many apparent behavioral failures are rational under the agent's imperfect internal beliefs. Finally, I outline an updated view of the NDIF ecosystem and highlight our vision for open-source infrastructure for merging evals and interpretability workflows.