Sparse dictionary learning has rapidly become a key technique in mechanistic interpretability for attacking superposition and extracting more human-understandable features from model activations. Given these more monosemantic features, we ask a further question: how do we recognize circuits connecting the enormous number of dictionary features? We propose a circuit discovery framework as an alternative to activation patching. Our framework suffers less from out-of-distribution issues and is more efficient in terms of asymptotic complexity. The basic unit in our framework is the dictionary feature, decomposed from every module writing to the residual stream: the embedding, attention outputs, and MLP outputs. Starting from any logit, dictionary feature, or attention score, we can trace down to lower-level dictionary features at all token positions and compute their contributions to these more interpretable, local model behaviors. We apply the framework to a small transformer trained on the synthetic game Othello and find a number of human-understandable fine-grained circuits inside it.
This paper presents a novel approach to discovering interpretable circuits in transformer models using dictionary learning, focusing on a case study of Othello-GPT. The work advances mechanistic interpretability by providing a patch-free alternative to traditional circuit discovery methods.

Key contributions:
- A new framework for circuit discovery that avoids activation patching and offers better computational efficiency
- A comprehensive analysis of where to apply dictionary learning in transformer architectures
- Detailed case studies showing how the method reveals interpretable circuits in Othello-GPT

The authors propose decomposing three key components of the residual stream (a code sketch follows the list):

- Word embeddings
- Attention layer outputs
- MLP layer outputs
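A minimal sketch of the kind of sparse autoencoder commonly used for such decompositions (the layer sizes, module names, and L1 coefficient here are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes activations into sparse, overcomplete dictionary features.

    One such autoencoder would be trained per decomposed site
    (embedding, attention output, MLP output); sizes are illustrative.
    """

    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(f)           # reconstruction from dictionary features
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coef: float = 1e-3):
    # Reconstruction error plus an L1 penalty that enforces sparsity.
    mse = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return mse + l1_coef * sparsity
```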
[See Figure 2 in the original paper]

The paper demonstrates how this approach can trace information flow through the model’s components. The framework allows for both end-to-end and local circuit analysis, starting from any logit or dictionary feature. A significant case study examines Othello-GPT, a transformer trained to predict legal moves in the game Othello. The authors discover several types of interpretable features:

- Current move position features
- Board state representations
- Empty cell detectors
- Legal move indicators
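Because every decomposed component writes linearly into the residual stream, the direct contribution of a dictionary feature to a target logit can be read off without any patching. A hypothetical sketch (the tensor names and shapes are assumptions for illustration, and LayerNorm rescaling is ignored):

```python
import torch

def feature_logit_contributions(f_acts, decoder_weight, W_U, logit_idx):
    """Direct contribution of each dictionary feature to one output logit.

    f_acts:         (n_features,) feature activations at one position
    decoder_weight: (d_model, n_features) dictionary feature directions
    W_U:            (d_model, vocab_size) unembedding matrix
    logit_idx:      index of the logit being explained

    Each feature writes f_i * d_i into the residual stream, so its
    direct effect on the logit is f_i * (d_i . W_U[:, logit_idx]).
    """
    effects = f_acts * (decoder_weight.T @ W_U[:, logit_idx])
    return effects  # rank these to find the top contributing features
```

Ranking these contributions, rather than re-running the model with patched activations, is what makes the analysis local and cheap.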
[See Figure 6 in the original paper]

The paper provides detailed examples of circuit discovery, including:
- A local OV circuit computing board state (sketched in code below)
- QK circuits implementing attention patterns
- Early MLP circuits identifying flipped pieces
[See Figure 8 in the original paper]
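As an illustration of the first item, the OV path is linear, so the strength of a connection from a source-token feature to a downstream feature through one attention head reduces to a bilinear form. A hypothetical sketch (W_V and W_O are the head's standard value and output projections; the feature directions are assumed to come from the trained dictionaries):

```python
import torch

def ov_feature_connection(d_low, W_V, W_O, e_high, attn_weight=1.0):
    """How strongly a lower-level dictionary feature excites a
    higher-level feature through one attention head's OV circuit.

    d_low:  (d_model,) decoder direction of the lower-level feature
    W_V:    (d_model, d_head) value projection of the head
    W_O:    (d_head, d_model) output projection of the head
    e_high: (d_model,) encoder direction of the higher-level feature
    attn_weight: attention score from the destination to the source token
    """
    return attn_weight * (d_low @ W_V @ W_O @ e_high)
```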
Compared to traditional patch-based methods, this approach offers several advantages:

- Linear computational complexity, versus quadratic for patching (illustrated below)
- No out-of-distribution issues
- Direct interpretation of feature interactions
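The complexity gap comes from how many model evaluations each method needs. A toy illustration of the scaling (not the paper's measurements):

```python
def patching_passes(n_nodes: int) -> int:
    """Patch-based edge discovery runs a separate corrupted forward pass
    for each candidate (source, target) pair: quadratic in node count."""
    return n_nodes * (n_nodes - 1)

def patch_free_passes(n_nodes: int) -> int:
    """The dictionary-feature framework reads every node's contribution
    off a single decomposed forward pass: linear in node count."""
    return n_nodes

for n in (10, 100, 1000):
    print(f"n={n}: patching={patching_passes(n)}, patch-free={patch_free_passes(n)}")
```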
The authors acknowledge some limitations in their dictionary training:

[See Figure 10 in the original paper]
An interesting observation is that Othello-GPT and natural-language models show different density patterns in feature activation.
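One way to quantify such a pattern, as a hedged sketch (the threshold and data layout are assumptions), is to measure how often each dictionary feature fires over a dataset:

```python
import torch

def activation_density(f_acts: torch.Tensor, threshold: float = 0.0):
    """Fraction of samples on which each dictionary feature activates.

    f_acts: (n_samples, n_features) feature activations over a dataset.
    A histogram of the returned densities exposes the density pattern
    discussed above.
    """
    return (f_acts > threshold).float().mean(dim=0)
```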
[See Figure 13 in the original paper]

The paper makes a compelling case for dictionary learning as a powerful tool for mechanistic interpretability, while honestly addressing current limitations and areas for future work. The approach shows promise for scaling to larger language models, though the authors note that Othello-GPT’s simpler domain makes it an ideal test case for developing these methods. The methodology presented could significantly impact how we understand neural networks, offering a more systematic and computationally efficient approach to circuit discovery. The authors’ careful attention to both theoretical foundations and practical implementation details makes this work particularly valuable for future research in mechanistic interpretability.