Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach
Authors
Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina S. Păsăreanu, Somesh Jha
Abstract
Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad-hoc. Inspired by the notion of abstract interpretation from the program analysis literature that aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We use these axioms to guide the mechanistic interpretability analysis of a Transformer-based model trained to solve the well-known 2-SAT problem. We are able to reverse engineer the algorithm learned by the model -- the model first parses the input formulas and then evaluates their satisfiability via enumeration of different possible valuations of the Boolean input variables. We also present evidence to support that the mechanistic interpretation of the analyzed model indeed satisfies the stated axioms.
This paper presents an important theoretical framework for mechanistic interpretability of neural networks, introducing formal axioms to characterize what constitutes a valid mechanistic interpretation. The authors demonstrate their approach through a detailed case study of a Transformer model trained to solve 2-SAT problems.
Key Contributions:
A formal axiomatic framework for evaluating mechanistic interpretations, including:
Prefix Equivalence (Axiom 1)
Component Equivalence (Axiom 2)
Prefix Replaceability (Axiom 3)
Component Replaceability (Axiom 4)
Plus two additional informal axioms related to compilability.
A comprehensive case study applying these axioms to interpret a Transformer solving 2-SAT, revealing:
The first layer acts as a parser, converting input formulas into clause-level representations
The second layer implements an exhaustive evaluation strategy to determine satisfiability
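To make the reverse-engineered two-stage behavior concrete, here is a minimal Python sketch of "parse into clauses, then enumerate valuations." The clause encoding and function name are illustrative assumptions, not the model's tokenization or the authors' code; they only show what exhaustive evaluation of a 2-SAT formula looks like.

```python
from itertools import product

def is_satisfiable(clauses, num_vars):
    """Brute-force 2-SAT check: try every valuation of the variables.

    `clauses` is a list of 2-literal clauses; a literal is (var_index, polarity),
    so (0, False) means "NOT x0". This encoding is illustrative only.
    """
    for valuation in product([False, True], repeat=num_vars):
        # A clause is satisfied if at least one of its literals matches the valuation.
        if all(any(valuation[v] == polarity for v, polarity in clause)
               for clause in clauses):
            return True
    return False

# (x0 OR x1) AND (NOT x0 OR x1) is satisfiable, e.g. with x1 = True.
print(is_satisfiable([[(0, True), (1, True)], [(0, False), (1, True)]], num_vars=2))  # True
```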
The analysis is particularly noteworthy for its rigorous methodology. The authors use multiple complementary techniques:
For the first layer:
Attention pattern analysis (Figure 1 and Figure 2 show clear periodic patterns)
Analysis of neuron sparsity (Figure 4 shows only 34 relevant neurons)
Detailed study of neuron activation patterns (Figure 5 reveals assignment-specific behaviors)
Decision tree analysis (Figure 12, Figure 13, Figure 14, Figure 15 show increasingly complex interpretations)
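The decision-tree step can be pictured with a short scikit-learn sketch: fit a shallow tree that predicts when a given neuron fires from binary features of the input formula, then read off the tree as a candidate interpretation. The data, feature names, and tree depth below are placeholders, not the paper's actual setup.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: binary features of each input formula (e.g. "clause (x0 OR x1) present")
# and a binary target indicating whether the neuron under study fires on that formula.
rng = np.random.default_rng(0)
formula_features = rng.integers(0, 2, size=(1000, 8))                          # placeholder features
neuron_fires = (formula_features[:, 0] & formula_features[:, 3]).astype(int)   # placeholder target

tree = DecisionTreeClassifier(max_depth=3)
tree.fit(formula_features, neuron_fires)

# The printed rules serve as a human-readable hypothesis about what the neuron detects.
print(export_text(tree, feature_names=[f"feat_{i}" for i in range(8)]))
```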
The interpretations are validated using the proposed axioms, with detailed results showing strong performance (low ε values) for most components. The complete neuron interpretations are provided in Table 1 (decision tree based) and Table 2 (disjunction-only), showing how different neurons recognize specific satisfying assignments.
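One way to picture this validation is as an ε-bounded disagreement check between a network component (or prefix) and the interpretation that claims to explain it. The sketch below assumes both are callables over a batch of inputs with discrete outputs; the paper's precise metric and interfaces may differ.

```python
import numpy as np

def epsilon_disagreement(component, interpretation, inputs):
    """Fraction of inputs on which the interpretation's predicted output
    disagrees with the actual component output (assumed interfaces)."""
    actual = component(inputs)
    predicted = interpretation(inputs)
    return float(np.mean(actual != predicted))

# An interpretation would be accepted for a chosen tolerance epsilon if
# epsilon_disagreement(component, interpretation, dataset) <= epsilon,
# applied compositionally to prefixes and components as in Axioms 1-4.
```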
Figure 3 clearly shows the structured attention patterns that form the basis for understanding the model's parsing behavior.
The paper’s methodology is particularly valuable as it provides a template for future mechanistic interpretability work, combining theoretical rigor (through the axioms) with practical analysis techniques. The authors acknowledge limitations, particularly regarding the generalizability of their approach to other tasks and the current reliance on human effort for analysis.
This work represents a significant step forward in making mechanistic interpretability more rigorous and systematic, while demonstrating these principles on a concrete, non-trivial example.