Natural Language Processing

From Insights to Impact: Actionable Interpretability for Neural Machine Translation

This dissertation bridges the gap between scientific insights into how language models work and practical benefits for users of these systems, paving the way for better human-AI interaction practices for professional translators and everyday users worldwide.

Interpreting and Understanding LLMs and Other Deep Learning Models

This masterclass will feature a series of insightful presentations and a hands-on tutorial focused on explainability techniques for Large Language Models (LLMs) and other deep learning architectures. Participants will gain both insights and practical experience in interpreting and understanding the inner workings of modern AI systems. In particular, my presentation provides a general introduction to popular interpretability approaches for studying large language models. Particularly, we will focus on attributional methods to identify the influence of context on model predictions and mechanistic techniques to locate and intervene in model knowledge and behaviors.

Interpretability for Language Models: Current Trends and Applications

In this presentation, I will provide an overview of the interpretability research landscape and describe various promising methods for exploring and controlling the inner mechanisms of generative language models. I will start discussing post-hoc attribution technique and their usage to identify prediction-relevant inputs, showcasing their usage within our PECoRe framework for context usage attribution, and its adaptation to produce internals-based citations in retrieval-augmented generation settings (MIRAGE). The final part will present core insight from recent mechanistic interpretability literature, focusing on the construction of replacement models to build concept attribution graphs and their practical usage for monitoring LLM behaviors.

Interpreting Context Usage in Generative Language Models

This presentation focuses on applying post-hoc interpretability techniques to analyze how language models (LMs) use input information throughout the generation process. We briefly introduce Inseq, our open-source toolkit designed to simplify advanced feature attribution analyses for LMs. Then, our Plausibility Evaluation of Context Reliance (PECoRe) interpretability framework is introduced to conduct data-driven analyses of context usage in LMs. In conclusion, we showcase how PECoRe can easily be adapted to retrieval-augmented generation (RAG) settings to produce internals-based citations for model answers. Our proposed Model Internals for RAG Explanations (MIRAGE) method achieves citation quality comparable to supervised answer validators with no additional training, producing citations that are faithful to actual context usage during generation.

Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement

We evaluate unsupervised word-level quality estimation (WQE) methods for machine translation, focusing on their robustness to human label variation.

Steering Large Language Models for Machine Translation Personalization

We evaluate prompting and steering based methods for machine translation personalization in the literary domain.

Interpreting Latent Features in Large Language Models

The presentation discusses interpreting latent features in large language models (LLMs). After an introduction on mechanistic interpretability fundamentals, including feature superposition and sparse autoencoders, I discuss recent work by the Anthropic interpretability team (Ameisen et al. 2025, Lindsey et al. 2025) for extracting circuits of interpretable features from trained LLMs. Real-world investigations of Claude mechanisms, such as multi-step reasoning and multilinguality, are also analyzed.

QE4PE: Word-level Quality Estimation for Human Post-Editing

Word-level quality estimation (QE) detects erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. Our QE4PE study investigates the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated by behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors' speed are critical factors in determining highlights' effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.

Interpretability for Language Models: Current Trends and Applications

In this presentation, I will provide an overview of the interpretability research landscape and describe various promising methods for exploring and controlling the inner mechanisms of generative language models. I will focus specifically on post-hoc attribution technique and their usage to identify relevant input and model components, showcasing their usage with our Inseq open-source toolkit. A practical application of attribution techniques will be presented with the PECoRe data-driven framework for context usage attribution and its adaptation to produce internals-based citations for model answers in retrieval-augmented generation settings (MIRAGE).