---
title: "Anthropic: Circuit Tracing – Revealing Computational Graphs in Language Models"
slug: "anthropic-circuit-tracing-revealing-computational-graphs-in-language-models"
author: "Jeremy Weaver"
date: "2025-04-03 17:19:23"
category: "Premium"
topics: "Circuit Tracing Methodology, Attribution Graphs in Language Models, Cross-Layer Transcoders (CLTs) for Feature Extraction, Case Studies on Model Reasoning, Limitations and Future Directions in Mechanistic Interpretability"
summary: "The paper introduces \"circuit tracing,\" a method for uncovering how language models process information by mapping their computational steps via attribution graphs. This approach uses replacement models and Cross-Layer Transcoders to connect low-level features with high-level behaviors, demonstrated in tasks like acronym generation and addition, while also noting limitations such as fixed attention patterns and reconstruction errors."
banner: ""
thumbnail: ""
---

Anthropic: Circuit Tracing – Revealing Computational Graphs in Language Models



Summary

The paper introduces a novel methodology called "circuit tracing" for understanding the inner workings of language models. The authors developed a technique that uses "replacement models" built from interpretable components to map a language model's computational steps as "attribution graphs." These graphs visually represent how different computational units, or "features," interact to process information and generate output for specific prompts.
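To make the idea concrete, an attribution graph can be pictured as a weighted directed graph whose nodes are interpretable features and whose edge weights measure how much one feature's activation contributes to another. The sketch below is a toy illustration only, with invented feature names and weights; it is not Anthropic's implementation.

```python
from collections import defaultdict


class AttributionGraph:
    """Toy attribution graph: nodes are features, weighted edges are
    attribution scores (how much a source contributes to a target).
    Illustrative only; names and values are invented."""

    def __init__(self):
        # Map each target node to a list of (source, attribution weight).
        self.edges = defaultdict(list)

    def add_attribution(self, source, target, weight):
        self.edges[target].append((source, weight))

    def top_contributors(self, target, k=2):
        """Return the k incoming edges with the largest absolute weight."""
        return sorted(self.edges[target], key=lambda sw: abs(sw[1]), reverse=True)[:k]


# Hypothetical fragment of a graph for an acronym-generation prompt.
graph = AttributionGraph()
graph.add_attribution("token: 'The'", "feature: acronym-start", 0.12)
graph.add_attribution("feature: capitalized-words", "feature: acronym-start", 0.85)
graph.add_attribution("feature: acronym-start", "logit: 'T'", 0.67)

print(graph.top_contributors("feature: acronym-start", k=1))
# → [('feature: capitalized-words', 0.85)]
```

Circuit tracing then amounts to reading such graphs end to end, following the strongest attribution paths from input tokens through intermediate features to output logits.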

The research details the construction, visualization, and validation of these graphs using an 18-layer model and offers a preview of their application to a more advanced model, Claude 3.5 Haiku. The study explores the interpretability and sufficiency of this method through various evaluations, including case studies on acronym generation and addition.

While acknowledging limitations such as missing attention circuits and reconstruction errors, the authors propose circuit tracing as a significant step toward mechanistic interpretability in large language models.