Feature Circuits | Gabriele Sarti


Interpreting and Understanding LLMs and Other Deep Learning Models

This masterclass features a series of presentations and a hands-on tutorial on explainability techniques for Large Language Models (LLMs) and other deep learning architectures. Participants will gain both conceptual grounding and practical experience in interpreting the inner workings of modern AI systems. My presentation provides a general introduction to popular interpretability approaches for studying large language models, focusing on attribution methods that quantify the influence of context on model predictions, and on mechanistic techniques that locate and intervene on model knowledge and behaviors.
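To make the attribution idea concrete, here is a minimal, hedged sketch of gradient-times-input saliency on a toy linear "model" (all weights and dimensions below are made-up placeholders, not part of the tutorial materials): each context token receives a score measuring how much its embedding contributed to the predicted class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a language model: a linear layer + softmax over a
# "context" of token embeddings (all values here are illustrative only).
n_tokens, d_model, n_classes = 4, 8, 3
W = rng.normal(size=(n_tokens * d_model, n_classes))
x = rng.normal(size=(n_tokens, d_model))  # context token embeddings

def predict(inputs: np.ndarray) -> np.ndarray:
    """Flatten the context, apply the linear layer, return softmax probs."""
    logits = inputs.reshape(-1) @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

probs = predict(x)
target = int(np.argmax(probs))

# Gradient-times-input attribution: for a linear model, the gradient of
# the target logit w.r.t. the inputs is exactly the target's weight column.
grad = W[:, target].reshape(n_tokens, d_model)
saliency = (grad * x).sum(axis=1)  # one attribution score per context token

print("predicted class:", target)
print("per-token attribution:", np.round(saliency, 3))
```

Real attribution libraries (e.g. Captum or Inseq) apply the same principle to full transformer models via automatic differentiation; the toy version above just makes the gradient analytic and easy to inspect.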