Anthropic: Circuit Tracing – Revealing Computational Graphs in Language Models
The paper introduces "circuit tracing," a method for uncovering how language models process information by mapping their computational steps via attribution graphs. This approach uses replacement models and Cross-Layer Transcoders to connect low-level features with high-level behaviors, demonstrated in tasks like acronym generation and addition, while also noting limitations such as fixed attention patterns and reconstruction errors.
Summary of https://transformer-circuits.pub/2025/attribution-graphs/methods.html
The paper introduces a novel methodology called "circuit tracing" to understand the inner workings of language models. The authors developed a technique using "replacement models" with interpretable components to map the computational steps of a language model as "attribution graphs." These graphs visually represent how different computational units, or "features," interact to process information and generate output for specific prompts.
The research details the construction, visualization, and validation of these graphs using an 18-layer model and offers a preview of their application to a more advanced model, Claude 3.5 Haiku. The study explores the interpretability and sufficiency of this method through various evaluations, including case studies on acronym generation and addition.
While acknowledging limitations like missing attention circuits and reconstruction errors, the authors propose circuit tracing as a significant step towards achieving mechanistic interpretability in large language models.
-
This paper introduces a methodology for revealing computational graphs in language models. Cross-Layer Transcoders (CLTs) extract interpretable features from the model, and attribution graphs depict how those features interact to produce the model's output for a specific prompt. The approach aims to bridge the gap between raw neurons and high-level model behaviors by identifying meaningful building blocks and their interactions.
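To make the architecture concrete, here is a minimal PyTorch sketch of a CLT under stated assumptions: the class name, dimensions, and the plain ReLU nonlinearity are illustrative (the paper uses a JumpReLU-style activation), and training code is omitted.

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """Minimal sketch of a cross-layer transcoder (CLT).

    Features at layer `src` read from that layer's residual stream and
    write to the reconstructed MLP outputs of layer `src` and every
    later layer. Training (not shown) minimizes reconstruction error
    against the true MLP outputs plus a sparsity penalty on the
    feature activations.
    """

    def __init__(self, n_layers: int, d_model: int, n_features: int):
        super().__init__()
        self.n_layers = n_layers
        # One encoder per layer: residual stream -> feature activations.
        self.encoders = nn.ModuleList(
            nn.Linear(d_model, n_features) for _ in range(n_layers)
        )
        # One decoder per (source layer, target layer >= source) pair.
        self.decoders = nn.ModuleDict({
            f"{src}_{tgt}": nn.Linear(n_features, d_model, bias=False)
            for src in range(n_layers)
            for tgt in range(src, n_layers)
        })

    def forward(self, residual_streams: list[torch.Tensor]) -> list[torch.Tensor]:
        # Sparse, non-negative feature activations (plain ReLU here;
        # the paper uses a JumpReLU-style nonlinearity).
        acts = [torch.relu(enc(x)) for enc, x in zip(self.encoders, residual_streams)]
        # Each layer's MLP output is reconstructed from the features of
        # that layer and all earlier layers.
        return [
            sum(self.decoders[f"{src}_{tgt}"](acts[src]) for src in range(tgt + 1))
            for tgt in range(self.n_layers)
        ]
```

The cross-layer structure is the distinguishing design choice: because a feature's decoder writes to many downstream layers at once, a single feature can account for computation that the underlying model spreads across several MLPs.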
-
The methodology involves several key steps: training CLTs to reconstruct MLP outputs, building attribution graphs with nodes representing active features, tokens, errors, and logits, and edges representing linear effects between these nodes. A crucial aspect is achieving linearity in feature interactions by freezing attention patterns and normalization denominators. Attribution graphs allow for the study of how information flows from the input prompt through intermediate features to the final output token.
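The linearity that frozen attention and normalization buy can be illustrated with a small sketch: once every route between a source feature's write and a target feature's read is a fixed linear map, a direct edge weight reduces to a bilinear form scaled by the source activation. The function below is a hedged illustration (the name and shapes are mine, not the paper's; the actual graph construction also sums contributions across token positions and paths).

```python
import torch

def direct_edge_weight(
    a_src: torch.Tensor,      # source feature activation (scalar)
    w_dec_src: torch.Tensor,  # vector the source feature writes, shape (d_model,)
    w_enc_tgt: torch.Tensor,  # vector the target feature reads, shape (d_model,)
    path: torch.Tensor,       # frozen linear map between the two sites, (d_model, d_model)
) -> torch.Tensor:
    """Direct-effect edge weight between two features in an attribution graph.

    With attention patterns and normalization denominators frozen, the
    route from the source feature's write to the target feature's read
    is a fixed linear map `path`, so the edge weight collapses to a
    bilinear form scaled by the source activation.
    """
    return a_src * (w_enc_tgt @ (path @ w_dec_src))
```

For a same-position interaction carried by the residual stream alone, `path` is simply the identity matrix; attention-mediated hops contribute fixed OV maps weighted by the frozen attention pattern.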
-
The paper demonstrates the application of this methodology through several case studies, including acronym generation, factual recall, and small number addition. These case studies illustrate how attribution graphs can reveal the specific features and pathways involved in different cognitive tasks performed by language models. For instance, in the addition case study, the method uncovers a hierarchy of heuristic features that collaboratively solve the task.
-
Despite these advances, the methodology has significant limitations. Chief among them, it does not explain how attention patterns are formed or how they mediate feature interactions (QK circuits), since the analysis is conducted with attention patterns held fixed. Other limitations include reconstruction errors (model computation that goes unexplained), the unexamined role of inactive features and inhibitory circuits, the complexity of the resulting graphs, and the difficulty of understanding global circuits that generalize across many prompts.
-
The paper also explores the concept of global weights between features, which are prompt-independent and aim to capture general algorithms used by the replacement model. Interpreting these global weights is challenging, however, because of interference (spurious connections between features) and because they do not account for attention-mediated interactions. While attribution graphs provide insight into specific prompts, future work aims to improve the understanding of global mechanisms and address the current limitations, potentially through advances in dictionary learning and better handling of attention.
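As a rough illustration of the idea (a sketch assuming features interact only through the residual stream; the function name and shapes are mine, not the paper's), the prompt-independent weight from a source feature to a target feature is the dot product of the source's decoder vector with the target's encoder vector.

```python
import torch

def global_weights(w_dec_src: torch.Tensor, w_enc_tgt: torch.Tensor) -> torch.Tensor:
    """Prompt-independent ("virtual") weights between two feature sets.

    Entry (j, i) is the dot product of source feature i's decoder
    vector with target feature j's encoder vector. This captures only
    direct residual-stream interactions: attention-mediated paths are
    ignored, and the dense dot products create interference, i.e.
    spurious connections between features that rarely co-activate.

    Shapes: w_dec_src (n_src, d_model), w_enc_tgt (n_tgt, d_model).
    """
    return w_enc_tgt @ w_dec_src.T  # (n_tgt, n_src)
```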