This presentation focuses on applying post-hoc interpretability techniques to analyze how language models (LMs) use input information throughout the generation process. We briefly introduce Inseq, our open-source toolkit designed to simplify advanced feature attribution analyses for LMs. We then introduce our Plausibility Evaluation of Context Reliance (PECoRe) interpretability framework for conducting data-driven analyses of context usage in LMs. Finally, we showcase how PECoRe can easily be adapted to retrieval-augmented generation (RAG) settings to produce internals-based citations for model answers. Our proposed Model Internals for RAG Explanations (MIRAGE) method achieves citation quality comparable to supervised answer validators with no additional training, producing citations that are faithful to actual context usage during generation.
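As a flavor of what such an analysis looks like in practice, the minimal sketch below loads a Hugging Face model through Inseq and attributes its generation with a gradient-based method. The model and attribution method are illustrative choices, and exact argument names may vary across Inseq versions.

```python
# Minimal sketch of an Inseq feature attribution workflow
# (model and method are illustrative; arguments may differ across versions).
import inseq

# Wrap a Hugging Face causal LM with a gradient-based attribution method
model = inseq.load_model("gpt2", "integrated_gradients")

# Generate a continuation and attribute it to the input tokens
out = model.attribute("The developer argued with the designer because")

# Render token-level importance scores (HTML in notebooks, text in terminals)
out.show()
```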
This presentation summarizes the main contributions of my PhD thesis, advocating for a user-centric perspective on interpretability research that aims to translate theoretical advances in model understanding into practical gains in trustworthiness and transparency for end users of these systems.
This dissertation bridges the gap between scientific insights into how language models work and practical benefits for users of these systems, paving the way for better human-AI interaction practices for professional translators and everyday users worldwide.
This masterclass will feature a series of insightful presentations and a hands-on tutorial focused on explainability techniques for Large Language Models (LLMs) and other deep learning architectures. Participants will gain both conceptual insights and practical experience in interpreting and understanding the inner workings of modern AI systems. My presentation provides a general introduction to popular interpretability approaches for studying large language models. In particular, we will focus on attribution methods to identify the influence of context on model predictions and on mechanistic techniques to locate and intervene on model knowledge and behaviors.
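To make the "locate and intervene" idea concrete, here is a toy activation-patching sketch using PyTorch forward hooks: the hidden state of one layer from a "clean" run is spliced into a "corrupted" run to test whether that layer carries the relevant information. The model, prompts, and layer index are arbitrary assumptions for illustration, not material from the masterclass itself.

```python
# Toy activation patching with PyTorch hooks (illustrative model/prompts/layer).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = 6  # which GPT-2 block to intervene on (arbitrary choice)
clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")

# 1) Run the clean prompt and cache the chosen block's hidden states
cache = {}
def save_hook(module, inputs, output):
    cache["h"] = output[0].detach()

handle = model.transformer.h[LAYER].register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

# 2) Re-run the corrupted prompt, splicing the clean last-position state back in
def patch_hook(module, inputs, output):
    h = output[0].clone()
    h[:, -1, :] = cache["h"][:, -1, :]
    return (h,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    patched = model(**corrupt).logits[0, -1]
handle.remove()

with torch.no_grad():
    baseline = model(**corrupt).logits[0, -1]

print("unpatched next token:", tok.decode(baseline.argmax().item()))
print("patched next token:  ", tok.decode(patched.argmax().item()))
```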
In this presentation, I will provide an overview of the interpretability research landscape and describe various promising methods for exploring and controlling the inner mechanisms of generative language models. I will start by discussing post-hoc attribution techniques and their use to identify prediction-relevant inputs, showcasing their application within our PECoRe framework for context usage attribution and its adaptation to produce internals-based citations in retrieval-augmented generation settings (MIRAGE). The final part will present core insights from recent mechanistic interpretability literature, focusing on the construction of replacement models to build concept attribution graphs and on their practical use for monitoring LLM behaviors.
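The sketch below illustrates the contrastive intuition behind identifying context-sensitive generated tokens, which PECoRe formalizes: each answer token is scored by how much the model's predictive distribution shifts when the retrieved context is removed. The model, prompts, divergence metric, and threshold are illustrative assumptions; the actual PECoRe implementation (available through Inseq) differs in its details.

```python
# Simplified illustration of contrastive context-sensitivity scoring
# (not the actual PECoRe implementation; prompts and threshold are toy choices).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

context = "Context: The 2024 Nobel Prize in Physics was awarded to Hopfield and Hinton.\n"
question = "Question: Who won the 2024 Nobel Prize in Physics?\nAnswer:"
answer = " Hopfield and Hinton."

def answer_logprobs(prompt, answer):
    """Per-token log-probability distributions predicting each answer token."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    answer_ids = tok(answer, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    start = prompt_ids.shape[1] - 1  # position predicting the first answer token
    return F.log_softmax(logits[0, start : start + answer_ids.shape[1]], dim=-1), answer_ids[0]

with_ctx, answer_ids = answer_logprobs(context + question, answer)
without_ctx, _ = answer_logprobs(question, answer)

# KL divergence between with-context and context-free predictions, per answer token
kl = F.kl_div(without_ctx, with_ctx, log_target=True, reduction="none").sum(-1)
for tok_id, score in zip(answer_ids, kl):
    flag = "*" if score > 1.0 else " "  # arbitrary threshold for illustration
    print(f"{flag} {tok.decode(int(tok_id)):>10}  KL={score.item():.2f}")
```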
This presentation discusses interpreting latent features in large language models (LLMs). After an introduction to mechanistic interpretability fundamentals, including feature superposition and sparse autoencoders, I discuss recent work by the Anthropic interpretability team (Ameisen et al., 2025; Lindsey et al., 2025) on extracting circuits of interpretable features from trained LLMs. Real-world investigations of Claude's mechanisms, such as multi-step reasoning and multilinguality, are also analyzed.
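For readers unfamiliar with sparse autoencoders, the minimal sketch below shows the basic recipe: model activations are reconstructed through an overcomplete, sparsity-penalized hidden layer so that each latent ideally captures a single interpretable feature. Dimensions and coefficients are placeholder assumptions, not the configurations used in the cited work.

```python
# Minimal sparse autoencoder (SAE) sketch; sizes and coefficients are placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # overcomplete feature dictionary
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()
acts = torch.randn(32, 768)      # stand-in for residual-stream activations
recon, feats = sae(acts)
l1_coeff = 1e-3                  # sparsity strength (illustrative)
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().sum(-1).mean()
loss.backward()                  # an optimizer step would follow during training
```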