11 Conclusion
Building a language to communicate with AI isn’t going to be easy, but quite frankly, it’s the only way to gain control of the way we want to live. Languages shape the way we think. We have an opportunity to shape our own thinking and future machines.
– Been Kim, Beyond Interpretability ICLR Keynote (2022)
Language models have evolved from narrow, task-specific tools to general-purpose architectures that convert knowledge into actionable insights across hundreds of languages. Interpretability research has shed light on how these systems process language, pioneering novel analysis methods to investigate their predictive behaviors and inner mechanisms. Today’s challenge is to translate these insights into practical tools and techniques that help debug models, control their behaviors, and ultimately improve their trustworthiness and usability in the eyes of users. This dissertation has sought to tackle this challenge, developing frameworks that serve the users of language models and machine translation systems at various levels: from everyday users who need factual answers from chatbots, to developers customizing model outputs, to professional editors refining machine translations.
In this final chapter, we begin by revisiting the research questions posed in Chapter 1 and answering them in relation to our findings. We then conclude by charting a path forward, discussing how actionable interpretability research can shape the next generation of transparent, controllable AI systems.
11.1 Research Questions Revisited
The development and deployment of the Inseq toolkit (Chapter 3) and its subsequent integration with the PECoRe framework (Chapter 4) have provided important insights into this question. From a conceptual standpoint, the main principle facilitating their widespread adoption is the progressive disclosure of complexity, which is necessary to benefit users at all levels of expertise. This human-computer interaction concept proved essential for bridging the gap between two distinct user groups: interpretability researchers with deep technical expertise, and domain experts who understand the practical implications but may lack programming skills. We achieved this balance through three key strategies. First, we unified access to popular models and methods through interfaces compatible with mainstream frameworks. Second, we provided both cutting-edge techniques and extensible baselines with sensible defaults. Third, we created compelling visualizations and post-processing functions that surface key insights without overwhelming users.
On the technical front, supporting model quantization, efficient batching, and distributed inference proved to be challenging yet essential. As language models become increasingly computationally demanding, these optimizations ensure that our tools remain accessible across diverse domains and computational budgets. Our Inseq toolkit successfully innovates across these dimensions, providing simple interfaces for common use cases while maintaining access to advanced features. Its widespread adoption across machine translation, summarization, question answering, and conversational AI validates these design choices and demonstrates their broad applicability.
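To illustrate the simple-interface principle in concrete terms, the following is a minimal sketch of the kind of workflow Inseq exposes, following its documented load_model/attribute pattern; the model identifier and attribution method shown here are illustrative choices, and the library's current documentation remains the authoritative reference.

```python
# Minimal sketch of Inseq's high-level workflow: load a Hugging Face model
# with an attribution method attached, attribute a generation, and
# visualize the result. Model and method names below are illustrative.
import inseq

# Attach a gradient-based attribution method to a small translation model.
model = inseq.load_model("Helsinki-NLP/opus-mt-en-it", "input_x_gradient")

# Attribute the model's own generation for a source sentence.
out = model.attribute("The manager praised her team for the launch.")

# Rich visualization with sensible defaults, no extra configuration needed.
out.show()
```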
Our PECoRe framework (Chapter 4) demonstrated that we can faithfully quantify context usage in language models and machine translation systems through a two-step process: first, by identifying context-sensitive tokens using contrastive information-theoretic metrics, and then attributing their generation to specific contextual cues through contrastive input attribution. This data-driven verification process replaces traditional heuristic-based analyses, enabling model debugging at scale.
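As a concrete illustration of the first step, the sketch below flags context-sensitive tokens by comparing the next-token distributions obtained with and without context through a KL-divergence metric. It is a simplified, self-contained stand-in rather than the reference PECoRe implementation: the threshold is an arbitrary placeholder, and the second step (attributing flagged tokens to contextual cues) is only indicated in the comments.

```python
# Illustrative sketch of the two-step idea behind context-sensitivity
# detection (not the reference PECoRe implementation): step 1 compares the
# model's next-token distributions with and without context using a
# contrastive metric (KL divergence here); step 2 would attribute the
# flagged tokens back to context spans, e.g. with contrastive gradients.
import torch
import torch.nn.functional as F

def context_sensitive_tokens(logits_ctx: torch.Tensor,
                             logits_noctx: torch.Tensor,
                             threshold: float = 0.5) -> torch.Tensor:
    """Flag generated positions whose prediction shifts when context is added.

    Both tensors have shape (num_generated_tokens, vocab_size) and contain
    the logits of the same generated sequence, force-decoded with and
    without the preceding context. The threshold is a tunable placeholder.
    """
    p_ctx = F.log_softmax(logits_ctx, dim=-1)
    p_noctx = F.log_softmax(logits_noctx, dim=-1)
    # Per-token KL(P_ctx || P_noctx): how much context reshapes the prediction.
    kl = (p_ctx.exp() * (p_ctx - p_noctx)).sum(dim=-1)
    return kl > threshold

# Toy example with random logits for a 6-token generation over a 100-word vocab.
torch.manual_seed(0)
with_ctx, without_ctx = torch.randn(6, 100), torch.randn(6, 100)
print(context_sensitive_tokens(with_ctx, without_ctx))
```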
Our findings exposed critical weaknesses in context-aware MT systems. In particular, we traced gender agreement failures back to incorrect anaphora resolution and found formatting anomalies triggered by spurious examples in context. When we extended this analysis to retrieval-augmented generation with Mirage (Chapter 5), we found that attribution based on model internals could accurately cite relevant retrieved passages. Our proposed procedure avoids the pitfalls of post-hoc rationalization based on surface-level similarity between generated and retrieved content, instead grounding the citation process in actual context usage to improve trustworthiness.
The comparative analysis of Chapter 7 established interpretability-based steering as a viable alternative to prompting for controllable machine translation. Our contrastive SAE steering framework matched prompting’s personalization accuracy—which already outperformed traditional fine-tuned MT systems in Chapter 6—while offering distinct advantages in terms of efficiency and transparency. Remarkably, our framework captured individual translators’ stylistic signatures using only learned sparse latent representations, even in the challenging domain of literary translation.
Moreover, our probing analyses revealed that steering and prompting converge on similar mechanistic solutions, resulting in comparable underlying representations. However, steering methods offer crucial advantages: while in-context demonstrations can fail unpredictably based on prompting choices such as example ordering, steering provides direct control through an interpretable concept space with tunable steering intensity.
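The sketch below illustrates the general mechanics of such latent-space steering, assuming an already-trained sparse autoencoder: a scaled decoder direction for a chosen latent is added to a layer's activations through a forward hook, with the scaling factor acting as the steering intensity. The layer, latent index, and intensity are illustrative placeholders rather than the configuration used in Chapter 7.

```python
# Minimal sketch of activation steering with a (pre-trained) sparse
# autoencoder: add a scaled decoder direction for one interpretable latent
# to the hidden states of a chosen layer via a forward hook.
import torch
from torch import nn

hidden_dim, n_latents = 512, 4096
# Stand-in for a trained SAE decoder: each row is one latent's direction.
decoder = torch.randn(n_latents, hidden_dim)

def make_steering_hook(latent_idx: int, alpha: float):
    direction = decoder[latent_idx] / decoder[latent_idx].norm()
    def hook(module, inputs, output):
        # Shift every position's activation along the latent direction;
        # alpha acts as a tunable steering intensity.
        return output + alpha * direction
    return hook

# Toy "layer" standing in for a transformer block's output projection.
layer = nn.Linear(hidden_dim, hidden_dim)
handle = layer.register_forward_hook(make_steering_hook(latent_idx=123, alpha=4.0))

steered = layer(torch.randn(1, 8, hidden_dim))  # (batch, seq_len, hidden)
handle.remove()
```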
The DivEMT study in Chapter 8 provided a nuanced answer to this question. While access to MT generally improved translator productivity, its contribution varied dramatically by language pair. In our results, typological similarity emerged as a significant factor: languages closely related to the source language, English, such as Dutch and Italian, exhibited substantial post-editing productivity gains, whereas distant pairs, like English-Arabic and English-Vietnamese, showed minimal improvement even after controlling for the amount of training data available for each language.
Notably, we found that traditional MT quality metrics correlate poorly with actual productivity benefits across languages. This disconnect challenges the fundamental assumption that the output of better-scoring systems should require less editing, underscoring the need for user-centered assessment that goes beyond technical quality measures.
Our QE4PE study revealed a multifaceted impact of error highlights on the workflow of professional translators. We found that highlighting potential errors influences translators’ productivity and editing behavior in different ways, with effects that depend heavily on the textual domain and translation direction. Interestingly, Italian translators responded to highlights by editing more broadly across entire texts, whereas Dutch translators focused their edits primarily on highlighted spans. These results suggest different approaches to the post-editing task, hinting at cultural factors at play.
In our error assessment, highlights led to a 15-20% reduction in critical errors compared to standard post-editing, as translators caught mistakes they might otherwise have missed. However, overall quality metrics showed no improvement, indicating that the coarse-grained quality metrics employed in MT evaluation might fail to capture these targeted benefits. Perhaps most surprisingly, we found no meaningful differences in speed or quality between editors working with highlights produced by human annotators, supervised quality estimation models, or unsupervised uncertainty metrics. This suggests that the technical accuracy of quality estimation, which is typically the focus of evaluation campaigns, matters less than understanding how to integrate these tools effectively into translators’ workflows.
Our systematic evaluation in Chapter 10 demonstrated that unsupervised methods employing model internals can match supervised approaches in detecting translation errors across multiple models, datasets, and languages. The variance of token log-probabilities estimated with Monte Carlo Dropout (MCD) proved particularly robust for predicting error spans, outperforming methods based on vocabulary projections, attention weights, and other internal signals.
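The following self-contained sketch illustrates the underlying idea: dropout is kept active at inference time, the same translation is scored across several stochastic passes, and tokens whose log-probability varies most are flagged as likely error spans. The toy model, data, and cutoff are placeholders rather than the chapter's experimental setup.

```python
# Sketch of Monte Carlo Dropout uncertainty for error-span detection: keep
# dropout active at inference, score the same (forced) output sequence
# several times, and flag tokens whose log-probability varies most across
# stochastic passes. Model and data are toy stand-ins for an MT system
# scoring its own translation.
import torch
from torch import nn

torch.manual_seed(0)
vocab, hidden, seq_len, n_passes = 50, 32, 10, 20

model = nn.Sequential(nn.Linear(hidden, hidden), nn.Dropout(p=0.1),
                      nn.ReLU(), nn.Linear(hidden, vocab))
states = torch.randn(seq_len, hidden)          # decoder states per target token
target_ids = torch.randint(vocab, (seq_len,))  # the translation being scored

model.train()  # keep dropout stochastic at inference time (MCD)
with torch.no_grad():
    logps = torch.stack([
        torch.log_softmax(model(states), dim=-1)[torch.arange(seq_len), target_ids]
        for _ in range(n_passes)
    ])  # (n_passes, seq_len)

variance = logps.var(dim=0)
flagged = variance > variance.mean() + variance.std()  # illustrative cutoff
print(flagged.nonzero().flatten().tolist())
```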
We found that the limitations of supervised metrics stem from their low recall, with predictions often missing the actual error distribution in test sets. Proper calibration of these metrics’ confidence dramatically improved their performance, bringing them close to inter-annotator agreement levels among professional translators. Crucially, we found that metric rankings can shift substantially when few annotations are present, depending on individual annotators’ subjective judgments. This underscores the necessity of multiple annotation sets and careful calibration for fair quality estimation assessment.
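As a minimal illustration of what such confidence calibration can look like, the sketch below applies Platt scaling to synthetic metric scores on a held-out development set; it is a generic stand-in for the calibration discussed above, not the exact procedure used in Chapter 10.

```python
# Illustrative calibration of a quality-estimation metric's token-level
# scores via Platt scaling on a held-out dev set (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Raw metric scores and gold error labels for dev-set tokens (synthetic).
dev_scores = rng.normal(size=(2000, 1))
dev_labels = (dev_scores[:, 0] + rng.normal(scale=1.5, size=2000) > 1.0).astype(int)

# Map raw scores to calibrated error probabilities.
calibrator = LogisticRegression().fit(dev_scores, dev_labels)

test_scores = rng.normal(size=(5, 1))
print(calibrator.predict_proba(test_scores)[:, 1])  # calibrated P(error)
```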
11.2 Outlook and Future Directions
The themes and findings of this dissertation open several promising avenues for making interpretability insights actionable in trustworthy NLP systems.
A core premise of this thesis—that downstream applications serve as invaluable testbeds for interpretability methods—resonates with current debates within the interpretability research community. As Marks (2025) argues, if interpretability methods enable use cases unattainable by other approaches, they provide evidence of genuine, significant insights. Our work validates this perspective: interpretability-based methods excel at answer attribution, controlled generation, and error detection, providing more faithful and auditable results than the supervised models typically employed for these tasks. Our PECoRe framework, for example, can expose issues in context usage that would be hard to detect through simple behavioral evaluations.
The final experimental chapters of this thesis take this paradigm a step further, evaluating interpretability techniques not only by their accuracy on realistic tasks but also by their downstream impact on user decision-making, productivity, and satisfaction. While the focus of the NLP interpretability community in recent years has gravitated towards the low-level technical depths of mechanistic interpretability (Saphra and Wiegreffe, 2024), the emerging field of human-centered explainable AI (HCXAI)—which has so far mainly engaged the human-computer interaction community¹—is taking the lead in developing sociotechnical frameworks for model explanations centered around users’ needs and experiences. The intersection between these areas remains frustratingly small: few mechanistic studies conduct downstream human evaluations, and most human-centered work fails to integrate state-of-the-art interpretability methods due to a lack of expertise or resources. Work aimed at bridging this gap will be essential to ensure that interpretability advances are both technically sound and practically relevant.
Despite its success, modern interpretability research faces a serious threat: the growing inaccessibility of frontier systems, which play a key role as prime “subjects” of interpretability studies. A survey of 184 recent interpretability works reveals a widening disparity between the capabilities of state-of-the-art systems and those of the systems typically evaluated in interpretability studies (Fiotto-Kaufman et al., 2025). This gap, driven by engineering barriers and proprietary API restrictions, threatens the validity of insights derived from simpler, less capable models. Addressing this issue will require robust shared infrastructure for interpretability research, simplifying access to state-of-the-art systems and fostering a more inclusive research environment. Our Inseq library was developed with this in mind, supporting features such as quantized, batched, and distributed inference to reduce the computational load of interpretability analyses. More recently, the NNSight library (Fiotto-Kaufman et al., 2025) represents the most significant step in this direction, providing researchers with fine-grained access to model internals through remote execution and abstracting away the complexity and costs associated with local hardware setups. Beyond tooling, the computational cost of current interpretability methods remains a barrier to their widespread adoption, particularly in production environments where faster predictions might be favored over more precise or trustworthy results. Future technical research should prioritize the development of more efficient techniques, exploring approximation methods, caching strategies, or ad-hoc kernels, while preserving faithfulness to the model’s inner workings. The CAT method from Chapter 3, which approximates patching with contrastive gradient attribution, exemplifies one of the many possibilities in this direction.
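As an illustration of this class of approximations, the sketch below computes a contrastive gradient-times-input attribution on a toy model, replacing the repeated forward passes that a patching-style intervention would require with a single backward pass; the toy model and token indices are purely illustrative and do not reproduce the CAT implementation.

```python
# Sketch of contrastive gradient attribution on a toy model: attribute the
# difference between the logits of a target and a contrastive token to the
# input embeddings via gradient x input. One backward pass replaces the
# many forward passes a patching-style intervention would require.
import torch
from torch import nn

torch.manual_seed(0)
vocab, dim, seq_len = 100, 32, 6
embed = nn.Embedding(vocab, dim)
lm_head = nn.Linear(dim, vocab)

input_ids = torch.randint(vocab, (seq_len,))
target_id, contrast_id = 7, 42  # e.g. two competing continuations

inputs = embed(input_ids).detach().requires_grad_(True)
logits = lm_head(inputs.mean(dim=0))       # toy "model": mean-pool + unembed
contrastive_diff = logits[target_id] - logits[contrast_id]
contrastive_diff.backward()

# Per-token attribution scores: gradient x input, summed over the hidden dim.
scores = (inputs.grad * inputs).sum(dim=-1)
print(scores.tolist())
```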
Perhaps most importantly, interpretability research can pave the way for more effective human-AI collaboration. Our translator studies show that model insights have the potential to transform professional workflows, but also that presentation matters as much as, if not more than, accuracy. Despite this, the presentation of interpretability insights is often overlooked in current work. The translation domain presents unique challenges in this area, with human professionals operating in similar settings but across entirely different languages and cultural contexts, requiring tailored approaches. User-centric interfaces that let domain experts explore model behaviors quickly and intuitively will be essential for addressing these challenges.
As language model adoption accelerates, the demand for transparency and usability tools will only intensify, and interpretability researchers are in a crucial position to meet it. The methods, insights, and perspectives presented in this dissertation demonstrate the potential of interpretability in machine translation, while highlighting the critical importance of continued research at the intersection of interpretability, multilingual NLP, and human-computer interaction. By making these systems more transparent, controllable, and aligned with human needs, we move toward a future where language technologies do not operate as opaque oracles, but rather as trusted partners helping us tackle the complex challenges ahead.
¹ The main workshop in this area is organized by the ACM SIGCHI interest group.