Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features. Modifications to a state-of-the-art SAE variant, Top-K SAEs, are evaluated across multiple dimensions. In particular, we assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models. Additionally, we analyze the geometry of learned SAE latents, confirming that \emph{feature splitting} enables the discovery of new features. The Llama Scope SAE checkpoints are publicly available at~\url{https://huggingface.co/fnlp/Llama-Scope}, alongside our scalable training, interpretation, and visualization tools at \url{https://github.com/OpenMOSS/Language-Model-SAEs}. These contributions aim to advance the open-source Sparse Autoencoder ecosystem and support mechanistic interpretability research by reducing the need for redundant SAE training.
This paper introduces Llama Scope, a comprehensive suite of 256 Sparse Autoencoders (SAEs) trained on the Llama-3.1-8B language model. The work represents a significant advancement in mechanistic interpretability research by providing ready-to-use, open-source SAE models that can help researchers understand the internal representations of large language models. Key aspects of the work include:
Architecture and Training Positions
The authors train SAEs at four positions within each transformer block (Figure 1); a sketch of one way to capture these activations follows the list:
- Post-MLP Residual Stream (R)
- Attention Output (A)
- MLP Output (M)
- Transcoder (TC)
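As an illustration of where these activations live, the sketch below captures them with PyTorch forward hooks on the HuggingFace `transformers` Llama implementation. The module names (`model.model.layers[i]`, `.self_attn`, `.mlp`) follow the standard `LlamaForCausalLM` layout; the hook placement is our own reading of the four positions, not the authors' training pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id; Llama Scope targets the Llama-3.1-8B base model.
MODEL = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

layer_idx = 8
block = model.model.layers[layer_idx]
captured = {}

# Attention output (A): output of the self-attention sublayer, before the residual add.
block.self_attn.register_forward_hook(
    lambda mod, inp, out: captured.update(
        attn_out=(out[0] if isinstance(out, tuple) else out).detach()))

# MLP output (M): output of the MLP sublayer. Its *input* (the normalized
# post-attention residual) is what a transcoder (TC) maps to the MLP output.
block.mlp.register_forward_hook(
    lambda mod, inp, out: captured.update(mlp_in=inp[0].detach(), mlp_out=out.detach()))

# Post-MLP residual stream (R): the hidden state the block passes to the next layer.
block.register_forward_hook(
    lambda mod, inp, out: captured.update(
        resid_post=(out[0] if isinstance(out, tuple) else out).detach()))

with torch.no_grad():
    model(**tokenizer("Sparse autoencoders decompose activations.", return_tensors="pt"))

print({name: tuple(t.shape) for name, t in captured.items()})
```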
Technical Improvements
The authors introduce several modifications to the TopK SAE architecture (the first and third are sketched in code after this list):
- Incorporation of the decoder column 2-norm into the TopK computation
- Post-processing into JumpReLU variants
- A K-annealing training schedule
- Mixed parallelism for efficient training
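A minimal PyTorch sketch of the first and third modifications, under our own assumptions about hyperparameters and schedule shape (the paper's exact recipe may differ): latent pre-activations are ranked by activation times the 2-norm of the corresponding decoder column before the TopK selection, and the number of active latents K is annealed from a large initial value down to the target early in training.

```python
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    """TopK SAE where latents are selected by activation * decoder column 2-norm."""

    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.k = k
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.02)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor, k: int | None = None):
        k = self.k if k is None else k
        pre = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Rank latents by their contribution to the reconstruction, i.e. the
        # activation scaled by the 2-norm of the matching decoder column.
        scores = pre * self.W_dec.norm(dim=1)
        idx = scores.topk(k, dim=-1).indices
        acts = torch.zeros_like(pre).scatter_(-1, idx, pre.gather(-1, idx))
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts


def annealed_k(step: int, anneal_steps: int, k_start: int, k_target: int) -> int:
    """K-annealing: linearly decay the number of active latents toward the target."""
    if step >= anneal_steps:
        return k_target
    return round(k_start + (step / anneal_steps) * (k_target - k_start))


# Toy usage (dimensions far smaller than the real 4096-dim, 32K-128K-latent SAEs).
sae = TopKSAE(d_model=256, d_sae=256 * 32, k=50)
x = torch.randn(16, 256)
recon, acts = sae(x, k=annealed_k(step=100, anneal_steps=1000, k_start=512, k_target=50))
mse = (recon - x).pow(2).mean()
```

As we understand it, the JumpReLU post-processing replaces the global TopK at inference time with per-latent activation thresholds, so deployment does not depend on a batch-wide top-k operation; it is omitted from the sketch.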
Evaluation Results
The results (Figure 2) show that:
- TopK SAEs consistently outperform vanilla SAEs in the sparsity-fidelity trade-off
- Wider SAEs (32x) achieve better reconstruction while maintaining similar sparsity
- The approach generalizes well to longer contexts and instruction-tuned models
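The two axes of such a trade-off are typically L0 sparsity (mean number of active latents per token) and a fidelity measure such as explained variance. A minimal sketch, assuming an `sae` module like the one above that returns `(recon, acts)`:

```python
import torch


@torch.no_grad()
def sparsity_fidelity(sae, activations: torch.Tensor):
    """Compute mean L0 (active latents per token) and explained variance.

    activations: (n_tokens, d_model) tensor taken at the SAE's hook point.
    """
    recon, acts = sae(activations)
    l0 = (acts != 0).float().sum(dim=-1).mean().item()
    # Explained variance: 1 - Var(residual) / Var(original), averaged over tokens.
    resid_var = (activations - recon).pow(2).sum(dim=-1)
    total_var = (activations - activations.mean(dim=0)).pow(2).sum(dim=-1)
    explained_variance = (1 - resid_var / total_var).mean().item()
    return l0, explained_variance


# Example with random data and the TopKSAE sketch above (illustrative only).
# l0, ev = sparsity_fidelity(sae, torch.randn(1024, 256))
```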
Feature Analysis
The authors conduct an in-depth analysis of feature geometry (Figure 7), demonstrating how different SAEs learn related but distinct features. They show that wider SAEs do not merely learn combinations of existing features but discover entirely new ones. The visualization of the “Threats-to-Humanity” cluster provides a compelling example of how features organize themselves semantically.
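One common way to probe this kind of geometry, and to distinguish genuinely new features from recombinations of old ones, is to compare decoder directions across SAEs of different widths via cosine similarity. The sketch below (our own illustration, not the paper's exact procedure) computes, for each wide-SAE latent, its maximum cosine similarity to any narrow-SAE latent; low values indicate directions the narrow SAE never learned.

```python
import torch


@torch.no_grad()
def max_decoder_cosine(wide_W_dec: torch.Tensor, narrow_W_dec: torch.Tensor) -> torch.Tensor:
    """For each wide-SAE latent, the max cosine similarity to any narrow-SAE latent.

    Both arguments are decoder matrices of shape (d_sae, d_model); rows are feature directions.
    """
    wide = torch.nn.functional.normalize(wide_W_dec, dim=-1)
    narrow = torch.nn.functional.normalize(narrow_W_dec, dim=-1)
    cos = wide @ narrow.T              # (d_wide, d_narrow)
    return cos.max(dim=-1).values      # (d_wide,)


# Example with random directions (real decoder weights would come from trained SAEs).
scores = max_decoder_cosine(torch.randn(8192, 256), torch.randn(1024, 256))
print(f"fraction of wide latents with max cos < 0.7: {(scores < 0.7).float().mean():.2f}")
```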
Comprehensive Evaluation
The paper includes an extensive evaluation across all 256 SAEs (Figure 10), measuring L0 sparsity, explained variance, and Delta LM loss across layers and positions (a sketch of the Delta LM loss measurement appears at the end of this summary). The work makes several significant contributions to the field:
- Provides the first comprehensive suite of SAEs for a large language model
- Introduces practical improvements to SAE training and architecture
- Demonstrates scalability and generalization capabilities
- Offers insights into feature organization and discovery

The authors make all models and tools publicly available, which should significantly accelerate research in mechanistic interpretability by reducing the need for redundant SAE training. The paper's technical depth, combined with its practical utility, makes it a valuable contribution to both the theoretical understanding of language models and the practical tools available to researchers in the field.
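For reference, Delta LM loss is the increase in language-modeling loss when a hook point's activation is replaced by its SAE reconstruction during the forward pass. The sketch below splices a reconstruction into the post-MLP residual stream of one Llama block; the hook placement, layer index, and the `sae` object (a trained SAE whose input dimension matches the model's 4096-dim hidden size) are illustrative assumptions, not the paper's evaluation code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL)


def splice_sae(layer_idx: int, sae):
    """Replace one block's post-MLP residual stream with its SAE reconstruction."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        recon, _ = sae(hidden.float())          # sae input dim must match d_model (4096)
        recon = recon.to(hidden.dtype)
        return (recon,) + output[1:] if isinstance(output, tuple) else recon
    return model.model.layers[layer_idx].register_forward_hook(hook)


@torch.no_grad()
def lm_loss(text: str) -> float:
    batch = tokenizer(text, return_tensors="pt")
    return model(**batch, labels=batch["input_ids"]).loss.item()


text = "Mechanistic interpretability studies the internal computations of neural networks."
clean_loss = lm_loss(text)
handle = splice_sae(layer_idx=8, sae=sae)        # `sae`: a trained residual-stream SAE
delta_lm_loss = lm_loss(text) - clean_loss       # Delta LM loss for this hook point
handle.remove()
print(f"Delta LM loss: {delta_lm_loss:.4f}")
```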