Language Modeling | Gabriele Sarti

Language Modeling

Interpretability for Language Models: Current Trends and Applications

In this presentation, I will provide an overview of the interpretability research landscape and describe various promising methods for exploring and controlling the inner mechanisms of generative language models. I will focus specifically on post-hoc attribution technique and their usage to identify relevant input and model components, showcasing their usage with our Inseq open-source toolkit. A practical application of attribution techniques will be presented with the PECoRe data-driven framework for context usage attribution and its adaptation to produce internals-based citations for model answers in retrieval-augmented generation settings (MIRAGE).

Interpreting Context Usage in Generative Language Models with Inseq, PECoRe and MIRAGE

This presentation focuses on applying post-hoc interpretability techniques to analyze how language models (LMs) use input information throughout the generation process. We briefly introduce Inseq, our open-source toolkit designed to simplify advanced feature attribution analyses for LMs. Then, our Plausibility Evaluation of Context Reliance (PECoRe) interpretability framework is introduced to conduct data-driven analyses of context usage in LMs. In conclusion, we showcase how PECoRe can easily be adapted to retrieval-augmented generation (RAG) settings to produce internals-based citations for model answers. Our proposed Model Internals for RAG Explanations (MIRAGE) method achieves citation quality comparable to supervised answer validators with no additional training, producing citations that are faithful to actual context usage during generation.

Interpreting Context Usage in Generative Language Models with Inseq and PECoRe

This talk discusses the challenges and opportunities in conducting interpretability analyses of generative language models. We begin by presenting Inseq, an open-source toolkit for advanced feature attribution analyses of language models. The usage of Inseq is illustrated through examples of state-of-the-art approaches contrastive attribution, input dependence and locating factual knowledge in intermediate model representations. Then, we introduce Plausibility Evaluation of Context Reliance (PECoRe), an end-to-end interpretability framework using model internals to detect context-dependent spans in model generations and trace their prediction back to salient tokens in the available context. The usage of PECoRe is showcased on various generative tasks, including machine translation, story generation and retrieval-augmented question answering.

A Primer on the Inner Workings of Transformer-based Language Models

This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture.

Post-hoc Interpretability for Generative Language Models: Explaining Context Usage in Transformers

This talk discusses the challenges of interpreting generative language models and presents Inseq, a toolkit for interpreting sequence generation models. The usage of Inseq is illustrated with examples introducing state-of-the-art approaches for interpreting language models such as contrastive attribution. Finally, the PECoRe framework is presented as a mean to evaluate the plausibility of context usage in language models.

Explaining Language Models with Inseq

In recent years, Transformer-based language models have achieved remarkable progress in most language generation and understanding tasks. However, the internal computations of these models are hardly interpretable due to their highly nonlinear structure, hindering their usage for mission-critical applications requiring trustworthiness and transparency guarantees. This presentation will introduce interpretability methods used for tracing the predictions of language models back to their inputs and discuss how these can be used to gain insights into model biases and behaviors. Several concrete examples of language model attributions will be presented throughout the presentation using the Inseq interpretability library.

Post-hoc Interpretability for Language Models

This talk discusses the challenges of interpreting generative language models and presents Inseq, a toolkit for interpreting sequence generation models. The usage of Inseq is illustrated with examples introducing state-of-the-art approaches for interpreting language models such as contrastive attribution. Finally, the PECoRe framework is presented as a mean to evaluate the plausibility of context usage in language models.

Post-hoc Interpretability for NLG & Inseq: an Interpretability Toolkit for Sequence Generation Models

In recent years, Transformer-based language models have achieved remarkable progress in most language generation and understanding tasks. However, the internal computations of these models are hardly interpretable due to their highly nonlinear structure, hindering their usage for mission-critical applications requiring trustworthiness and transparency guarantees. This presentation will introduce interpretability methods used for tracing the predictions of language models back to their inputs and discuss how these can be used to gain insights into model biases and behaviors. Several concrete examples of language model attributions will be presented throughout the presentation using the Inseq interpretability library.

Post-hoc Interpretability for Neural Language Models

In recent years, Transformer-based language models have achieved remarkable progress in most language generation and understanding tasks. However, the internal computations of these models are hardly interpretable due to their highly nonlinear structure, hindering their usage for mission-critical applications requiring trustworthiness and transparency guarantees. This presentation will introduce interpretability methods used for tracing the predictions of language models back to their inputs and discuss how these can be used to gain insights into model biases and behaviors. Several concrete examples of language model attributions will be presented throughout the presentation using the Inseq interpretability library.

Explaining Neural Language Models from Internal Representations to Model Predictions

As language models become increasingly complex and sophisticated, the processes leading to their predictions are growing increasingly difficult to understand. Research in NLP interpretability focuses on explaining the rationales driving model predictions and is crucial for building trust and transparency in the usage of these systems in real-world scenarios. In this laboratory, we will explore various techniques for analyzing Neural Language Models, such as feature attribution methods and diagnostic classifiers. Besides common approaches to inspect models’ internal representations, we will also introduce prompting techniques to elicit model responses and motivate their usage as alternative methods for the behavioral study of model generations.