Gabriele Sarti

Welcome to my website! 👋 I am a postdoc at the BauLab at Northeastern University, working on interpretability interfaces and white-box methods for the evaluations ecosystem as part of the National Deep Inference Fabric (NDIF).

Previously, I was a PhD student at the University of Groningen, where I completed my thesis on actionable interpretability for machine translation as a member of the InCLoW team, the GroNLP group and the Dutch InDeep consortium. Before that, I was an applied scientist intern at Amazon Translate NYC and a research scientist at Aindo.

My current research interests include LLM reasoning, interpretability, user modeling and monitoring of agentic systems. I’m especially interested in making white-box auditing a practical part of how we evaluate frontier AI, since behavioral tests fail to surface unverbalized behaviors and are increasingly inadequate as models get more capable. I work on ways to surface and steer the beliefs, goals, and plans behind what an agent does, and on the open infrastructure that links interpretability tools to the evaluation ecosystem. If you’re excited about these topics, shoot me a message!

Your (anonymous) constructive feedback is always welcome! 🙂

Interests

  • Reasoning Language Models
  • Mechanistic Interpretability
  • User Modeling and Personalization
  • Alignment Auditing for Agents

Education

Experience

🗞️ News

 

PhD Thesis

From Insights to Impact: Actionable Interpretability for Neural Machine Translation

PhD Thesis at the University of Groningen

This dissertation aims to bridge the gap between method-centric interpretability research and outcome-centric real-world machine translation applications. We develop novel methods to understand and control language model generation, then study how to integrate these advances effectively into human translation workflows. Our research spans three interconnected macro-themes: understanding how language models exploit contextual information during generation, controlling model generation for personalized translation outputs, and integrating interpretability insights into human translation workflows.

Web Book, PDF, RUG Page, Quarto template

Selected Publications

 

Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement

We evaluate unsupervised word-level quality estimation (WQE) methods for machine translation, focusing on their robustness to human …

Steering Large Language Models for Machine Translation Personalization

We evaluate prompting and steering-based methods for machine translation personalization in the literary domain.

QE4PE: Word-level Quality Estimation for Human Post-Editing

We investigate the impact of word-level quality estimation on MT post-editing with 42 professional post-editors.

Multi-property Steering of Large Language Models with Dynamic Activation Composition

We propose Dynamic Activation Composition, an adaptive approach for multi-property activation steering of LLMs.

Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation

MIRAGE uses model internals for faithful answer attribution in retrieval-augmented generation applications.

Blog posts

 

ICLR 2020 Trends: Better & Faster Transformers for Natural Language Processing

A summary of promising directions from ICLR 2020 for better and faster pretrained transformer language models.

Recent & Upcoming Talks

Scaling Interpretability for LLM Agents
Interpretability for Language Models: Current Trends and Applications

Projects

 

Attributing Context Usage in Language Models

An interpretability framework to detect and attribute context usage in language models’ generations.

Inseq: An Interpretability Toolkit for Sequence Generation Models

An open-source library to democratize access to model interpretability for sequence generation models.

Contrastive Image-Text Pretraining for Italian

The first CLIP model pretrained on the Italian language.

Covid-19 Semantic Browser

A semantic browser for SARS-CoV-2 and COVID-19 powered by neural language models.

AItalo Svevo: Letters from an Artificial Intelligence

Generating letters with a neural language model in the style of Italo Svevo, a famous Italian writer of the 20th century.