Summary

This paper presents an important theoretical framework for mechanistic interpretability of neural networks, introducing formal axioms to characterize what constitutes a valid mechanistic interpretation. The authors demonstrate their approach through a detailed case study of a Transformer model trained to solve 2-SAT problems.

Key Contributions:

  1. A formal axiomatic framework for evaluating mechanistic interpretations, comprising four formal axioms (plus two additional informal axioms related to compilability):
  • Prefix Equivalence (Axiom 1)
  • Component Equivalence (Axiom 2)
  • Prefix Replaceability (Axiom 3)
  • Component Replaceability (Axiom 4)
  2. A comprehensive case study applying these axioms to interpret a Transformer solving 2-SAT, revealing:
  • The first layer acts as a parser, converting input formulas into clause-level representations
  • The second layer implements an exhaustive evaluation strategy to determine satisfiability (sketched in code below)
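
To make the second point concrete, the exhaustive evaluation strategy attributed to the second layer amounts to brute-force satisfiability checking: enumerate every assignment and test all clauses. The following is a minimal illustrative sketch; the integer literal encoding and the `is_satisfiable` helper are assumptions for exposition, not the paper's token format or code.

```python
from itertools import product

def is_satisfiable(clauses, n_vars):
    """Brute-force 2-SAT check: try every assignment, test every clause.

    Each clause is a pair of literals; a literal is a non-zero integer,
    where +k means variable k is true and -k means variable k is false.
    (Illustrative encoding only, not the paper's input format.)
    """
    for assignment in product([False, True], repeat=n_vars):
        def value(lit):
            v = assignment[abs(lit) - 1]
            return v if lit > 0 else not v
        if all(value(a) or value(b) for a, b in clauses):
            return True  # found a satisfying assignment
    return False

# (x1 or x2) and (not x1 or x2) is satisfied by setting x2 = True
print(is_satisfiable([(1, 2), (-1, 2)], n_vars=2))  # True
```

With a small number of variables (as an exhaustive strategy implies), this enumeration is cheap, which is presumably what makes it feasible for a two-layer model to implement.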

The analysis is particularly noteworthy for its rigorous methodology. The authors use multiple complementary techniques:

For the first layer:

  • Attention pattern analysis (Figure 1 and Figure 2 show clear periodic patterns; Figure 1 plots attention scores for all heads on the first test sample, with source/key positions on the x-axis and destination/query positions on the y-axis, while Figure 2 shows average attention scores grouped by source token position over the test set, with first-layer destinations restricted to positions $4i+2$, where $0\leq i<10$ is the clause index, and only the readout token as the destination for the second layer)
  • Distributional attention analysis (Figure 3 demonstrates expected attention scores by position for the second token of each clause; see the positional sketch after this list)
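
The periodic structure in Figures 1 through 3 can be made concrete with a small positional sketch. Assuming each clause occupies four consecutive token positions (consistent with the $4i+2$ destination positions in Figure 2's caption), the clause-level representation for clause $i$ is read off at position $4i+2$; exactly which slots within a clause hold the literal and delimiter tokens is an assumption here, not something stated in the summary above.

```python
def clause_positions(n_clauses=10, tokens_per_clause=4):
    """Map each clause index to its token span and its aggregation position.

    The 4-token clause width and the 4i+2 aggregation position follow
    Figure 2's caption; the internal layout of literals and delimiters
    within a clause is assumed for illustration.
    """
    for i in range(n_clauses):
        start = tokens_per_clause * i
        span = list(range(start, start + tokens_per_clause))
        yield i, span, start + 2  # clause index, token span, 4i+2 position

for i, span, dest in clause_positions(n_clauses=3):
    print(f"clause {i}: tokens {span}, clause-level representation at position {dest}")
```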

For the second layer:

  • Analysis of neuron sparsity (Figure 4 shows only 34 relevant neurons)
  • Detailed study of neuron activation patterns (Figure 5 reveals assignment-specific behaviors; it plots the output coefficients of hidden neurons, computed by composing the MLP output weight matrix with the unembedding matrix projected onto the SAT logit, a computation sketched below)
  • Decision tree analysis (Figure 12, Figure 13, Figure 14, Figure 15 show increasingly complex interpretations)
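
The output coefficients behind Figure 5 can be sketched directly from its caption: compose the MLP's output weight matrix with the unembedding matrix and project onto the SAT logit, giving one scalar per hidden neuron. The matrix names, shapes, and the sparsity threshold below are assumptions for illustration, not the paper's actual variables.

```python
import numpy as np

# Assumed shapes (illustrative): W_out maps MLP hidden units to the residual
# stream, W_U maps the residual stream to vocabulary logits.
d_mlp, d_model, n_vocab, sat_token = 512, 128, 32, 0
W_out = np.random.randn(d_mlp, d_model)   # MLP output weight matrix
W_U = np.random.randn(d_model, n_vocab)   # unembedding matrix

# One scalar per hidden neuron: how strongly it pushes the SAT logit.
output_coefficients = W_out @ W_U[:, sat_token]

# A crude sparsity read-out in the spirit of Figure 4; the paper's actual
# criterion for identifying the 34 relevant neurons may differ.
threshold = 0.1 * np.abs(output_coefficients).max()
print(int((np.abs(output_coefficients) > threshold).sum()), "neurons above threshold")
```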
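The decision-tree analysis (Figures 12 through 15, and the decision-tree-based neuron interpretations reported in Table 1) can be illustrated by fitting a shallow tree that predicts when a single neuron fires from features of the input formula. The feature construction here (which literals occur in the formula) and the binarised activation target are assumptions; the paper's trees may condition on different features.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Toy data: each row marks which of 20 hypothetical literals occur in a formula.
X = rng.integers(0, 2, size=(500, 20))
# Toy target: a neuron that "fires" for one specific literal pattern.
neuron_fires = ((X[:, 3] == 1) & (X[:, 7] == 0) & (X[:, 12] == 1)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, neuron_fires)
print(export_text(tree, feature_names=[f"lit_{i}" for i in range(20)]))
```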

The interpretations are validated using the proposed axioms, with detailed results showing strong performance (low ε values) for most components. The complete neuron interpretations are provided in Table 1 (decision tree based) and Table 2 (disjunction-only), showing how different neurons recognize specific satisfying assignments.
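
The "low ε values" refer to how well an interpretation satisfies the quantified axioms. The paper's exact definitions are not reproduced in this summary, but an equivalence-style axiom can be read, roughly, as an ε-bounded agreement condition between the model (or one of its components) and its interpretation over the input distribution, of the form

$$\Pr_{x \sim \mathcal{D}}\bigl[\, I_{\le k}(x) \ne M_{\le k}(x) \,\bigr] \le \varepsilon,$$

or, for real-valued activations, a bounded expected distance

$$\mathbb{E}_{x \sim \mathcal{D}}\bigl[\, \lVert I_{\le k}(x) - M_{\le k}(x) \rVert \,\bigr] \le \varepsilon,$$

where $I_{\le k}$ denotes the interpretation of the first $k$ layers of the model $M$ (this notation is a hedged paraphrase, not the paper's precise statement). The replaceability axioms presumably strengthen this by requiring that substituting the interpretation for the corresponding prefix or component leaves downstream behaviour intact up to ε.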

Figure 3 would make an excellent thumbnail as it clearly shows the structured attention patterns that form the basis for understanding the model’s parsing behavior, while being visually striking and representative of the paper’s technical depth.

The paper’s methodology is particularly valuable as it provides a template for future mechanistic interpretability work, combining theoretical rigor (through the axioms) with practical analysis techniques. The authors acknowledge limitations, particularly regarding the generalizability of their approach to other tasks and the current reliance on human effort for analysis.

This work represents a significant step forward in making mechanistic interpretability more rigorous and systematic, while demonstrating these principles on a concrete, non-trivial example.