In this presentation, I will provide an overview of the interpretability research landscape and describe promising methods for exploring and controlling the inner mechanisms of generative language models. I will focus on post-hoc attribution techniques and their use in identifying relevant input and model components, demonstrating them with our open-source Inseq toolkit. As a practical application of attribution techniques, I will present the PECoRe data-driven framework for context usage attribution and its adaptation to produce internals-based citations for model answers in retrieval-augmented generation settings (MIRAGE).
This presentation focuses on applying post-hoc interpretability techniques to analyze how language models (LMs) use input information throughout the generation process. We briefly introduce Inseq, our open-source toolkit designed to simplify advanced feature attribution analyses for LMs. We then introduce our Plausibility Evaluation of Context Reliance (PECoRe) interpretability framework for conducting data-driven analyses of context usage in LMs. Finally, we show how PECoRe can easily be adapted to retrieval-augmented generation (RAG) settings to produce internals-based citations for model answers. Our proposed Model Internals for RAG Explanations (MIRAGE) method achieves citation quality comparable to supervised answer validators with no additional training, producing citations that are faithful to actual context usage during generation.
This talk aims to demystify how language models (Large Language Models) work, and to highlight how studying these systems as 'cognitive artifacts' can contribute to a better understanding of reasoning mechanisms (human and otherwise) and of the biases in the society around us.