Anthropic: Circuit Tracing – Revealing Computational Graphs in Language Models
The paper introduces "circuit tracing," a method for uncovering how language models process information by mapping their computational steps via attribution graphs. This approach uses replacement models and Cross-Layer Transcoders to connect low-level features with high-level behaviors, demonstrated in tasks like acronym generation and addition, while also noting limitations such as fixed attention patterns and reconstruction errors.
Anthropic: Circuit Tracing – Revealing Computational Graphs in Language Models
Summary of Read" class="text-blue-600 hover:text-blue-800" target="_blank" rel="noopener noreferrer">https://transformer-circuits.pub/2025/attribution-graphs/methods.html'>Read Full Report
Introduces a novel methodology called "circuit tracing" to understand the inner workings of language models. The authors developed a technique using "replacement models" with interpretable components to map the computational steps of a language model as "attribution graphs." These graphs visually represent how different computational units, or "features," interact to process information and generate output for specific prompts.
The research details the construction, visualization, and validation of these graphs using an 18-layer model and offers a preview of their application to a more advanced model, Claude 3.5 Haiku. The study explores the interpretability and sufficiency of this method through various evaluations, including case studies on acronym generation and addition.
While acknowledging limitations like missing attention circuits and reconstruction errors, the authors propose circuit tracing as a significant step towards achieving mechanistic interpretability in large language models.
-
This paper introduces a methodology for revealing computational graphs in language models using Cross-Layer Transcoders (CLTs) to extract interpretable features and construct attribution graphs that depict how these features interact to produce model outputs for specific prompts. This approach aims to bridge the gap between raw neurons and high-level model behaviors by identifying meaningful building blocks and their interactions.
-
The methodology involves several key steps: training CLTs to reconstruct MLP outputs, building attribution graphs with nodes representing active features, tokens, errors, and logits, and edges representing linear effects between these nodes. A crucial aspect is achieving linearity in feature interactions by freezing attention patterns and normalization denominators. Attribution graphs allow for the study of how information flows from the input prompt through intermediate features to the final output token.
-
The paper demonstrates the application of this methodology through several case studies, including acronym generation, factual recall, and small number addition. These case studies illustrate how attribution graphs can reveal the specific features and pathways involved in different cognitive tasks performed by language models. For instance, in the addition case study, the method uncovers a hierarchy of heuristic features that collaboratively solve the task.
-
Despite the advancements, the methodology has several significant limitations. A key limitation is the missing explanation of how attention patterns are formed and how they mediate feature interactions (QK-circuits), as the analysis is conducted with fixed attention patterns. Other limitations include reconstruction errors (unexplained model computation), the role of inactive features and inhibitory circuits, the complexity of the resulting graphs, and the difficulty of understanding global circuits that generalize across many prompts.
-
The paper also explores the concept of global weights between features, which are prompt-independent and aim to capture general algorithms used by the replacement model. However, interpreting these global weights is challenging due to issues like interference (spurious connections) and the lack of accounting for attention-mediated interactions. While attribution graphs provide insights on specific prompts, future work aims to enhance the understanding of global mechanisms and address current limitations, potentially through advancements in dictionary learning and handling of attention mechanisms.
Related Articles
Amazon's AI Agent Outage Is a Warning: Why Organizations Need Governed AI Infrastructure
Amazon's AI coding agent Kiro caused a 13-hour AWS outage by deleting and recreating a production environment. The incident reveals why organizations deploying AI agents need architectural governance — not just more human approvals.
An AI Agent Hacked McKinsey in 2 Hours — What It Means for Enterprise AI Security
An autonomous AI agent breached McKinsey's internal AI platform in under 2 hours — exposing 46.5 million chat messages and 57,000 employee accounts. Here's what every organization deploying AI needs to learn from it.
Amazon Now Requires Senior Sign-Off for AI-Generated Code — Here's Why Every Organization Should Take Note
Amazon's new policy requiring senior engineers to approve all AI-assisted code changes signals a turning point: organizations deploying AI agents need governance infrastructure, not just AI capabilities. Here's what it means for the future of agentic systems.
The Pentagon Blacklisted an AI Company. Here's What It Teaches Every Organization About AI Infrastructure.
When the Pentagon designated Anthropic a 'supply chain risk,' defense contractors scrambled to abandon Claude overnight. The lesson for every organization: if you don't own your AI stack, someone else controls your future.
See the ibl.ai AI Operating System in Action
Discover how leading universities and organizations are transforming education with the ibl.ai AI Operating System. Explore real-world implementations from Harvard, MIT, Stanford, and users from 400+ institutions worldwide.
View Case StudiesGet Started with ibl.ai
Choose the plan that fits your needs and start transforming your educational experience today.