Summary

This paper introduces Llama Scope, a comprehensive suite of 256 Sparse Autoencoders (SAEs) trained on the Llama-3.1-8B language model. The work represents a significant advancement in mechanistic interpretability research by providing ready-to-use, open-source SAE models that can help researchers understand the internal representations of large language models. Key aspects of the work include:

  1. Architecture and Training Position
Figure 1: Four potential training positions in one Transformer Block.

As shown in Figure 1, the authors train SAEs at four positions within each transformer block (see the hook-point sketch after this list):
- Post-MLP Residual Stream (R)
- Attention Output (A)
- MLP Output (M)
- Transcoder (TC)
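
Concretely, these four positions correspond to activation hook points inside each block. Below is a minimal sketch using TransformerLens-style hook names; these names are assumptions for illustration and may not match the paper's exact tooling:

```python
layer = 15  # example block index

# Hypothetical mapping from the four SAE training positions to
# TransformerLens-style hook names (illustrative, not the paper's code).
SAE_POSITIONS = {
    "R": f"blocks.{layer}.hook_resid_post",  # post-MLP residual stream
    "A": f"blocks.{layer}.hook_attn_out",    # attention output
    "M": f"blocks.{layer}.hook_mlp_out",     # MLP output
    # Transcoder: trained to map the MLP's input to the MLP's output,
    # rather than to reconstruct a single activation.
    "TC": (f"blocks.{layer}.ln2.hook_normalized",  # input side
           f"blocks.{layer}.hook_mlp_out"),        # target side
}
```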

  2. Technical Improvements

The authors introduce several modifications to the TopK SAE architecture (a minimal sketch follows this list):
- Incorporation of decoder column 2-norms into the TopK computation
- Post-training conversion to JumpReLU variants
- A K-annealing training schedule
- Mixed parallelism for efficient training
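
The following is a minimal PyTorch sketch of a TopK SAE with the decoder-norm modification described above; the dimensions, initialization, and K-annealing comment are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder sketch (not the paper's exact code)."""

    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        # K-annealing would start k large and decay it toward the target
        # sparsity over training (schedule details are an assumption here).
        self.k = k
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Encoder: pre-activations relative to the decoder bias.
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Rank latents by activation times decoder column 2-norm, so TopK
        # selects by contribution to the reconstruction, not raw magnitude.
        scores = acts * self.W_dec.norm(dim=1)
        topk = torch.topk(scores, self.k, dim=-1)
        mask = torch.zeros_like(acts).scatter_(-1, topk.indices, 1.0)
        sparse_acts = acts * mask
        # Decoder: linear reconstruction of the original activation.
        return sparse_acts @ self.W_dec + self.b_dec
```

Roughly, the JumpReLU post-processing mentioned above replaces the per-token TopK selection with a fixed activation threshold at inference time, which removes the hard per-token sparsity constraint.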

  3. Evaluation Results

Figure 2: Explained Variance (upper) and Delta LM loss (lower) over L0 sparsity for SAEs trained on L7R, L15R and L23R.

As Figure 2 shows (metric sketches follow this list):
- TopK SAEs consistently outperform vanilla SAEs in the sparsity-fidelity trade-off
- Wider (32x) SAEs achieve better reconstruction while maintaining similar sparsity
- The approach generalizes well to longer contexts and to instruction-tuned models
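
For reference, here are minimal sketches of the metrics used in this evaluation; these are the standard formulas, and the paper's exact normalization choices may differ:

```python
import torch

def l0_sparsity(sparse_acts: torch.Tensor) -> torch.Tensor:
    """Average number of nonzero SAE latents per token (the L0 metric)."""
    return (sparse_acts != 0).float().sum(dim=-1).mean()

def explained_variance(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    """Fraction of activation variance captured by the reconstruction x_hat."""
    residual = (x - x_hat).pow(2).sum()
    total = (x - x.mean(dim=0)).pow(2).sum()
    return 1.0 - residual / total

# Delta LM loss: run the language model twice on the same tokens -- once
# cleanly and once with the hooked activation replaced by its SAE
# reconstruction -- and report the increase in cross-entropy loss.
```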

  4. Feature Analysis
Figure 7: Feature geometry of L15R-8x-Vanilla, L15R-8x-TopK and L15R-32x-TopK SAEs.

As Figure 7 illustrates, the authors conduct an in-depth analysis of feature geometry, demonstrating how different SAEs learn related but distinct features. They show that wider SAEs don’t just learn combinations of existing features but discover entirely new ones. The visualization of the “Threats-to-Humanity” cluster provides a compelling example of how features organize themselves semantically. A comparison sketch follows.
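
One simple way to probe whether a wider SAE's features are genuinely new, rather than recombinations of a narrower SAE's features, is to compare decoder directions across the two SAEs. This is a minimal sketch under that assumption, not the paper's exact methodology:

```python
import torch

def decoder_cosine_sim(W_dec_a: torch.Tensor, W_dec_b: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between decoder directions of two SAEs.
    Rows are features (shape: n_features x d_model)."""
    a = W_dec_a / W_dec_a.norm(dim=1, keepdim=True)
    b = W_dec_b / W_dec_b.norm(dim=1, keepdim=True)
    return a @ b.T  # (n_features_a, n_features_b)

# For each feature in the wider SAE, find its closest match in the narrower
# one; features with a low best-match similarity are candidates for being
# "entirely new" rather than combinations of existing features.
# sims = decoder_cosine_sim(W_dec_32x, W_dec_8x)
# novelty = 1.0 - sims.max(dim=1).values
```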

  5. Comprehensive Evaluation
Figure 10: All 256 SAEs are evaluated on L0 sparsity (upper), explained variance (middle) and Delta LM loss (lower).

As Figure 10 shows, the paper includes an extensive evaluation of all 256 SAEs, measuring L0 sparsity, explained variance, and Delta LM loss across layers and positions.

The work makes several significant contributions to the field:
- Provides the first comprehensive suite of SAEs for a large language model
- Introduces practical improvements to SAE training and architecture
- Demonstrates scalability and generalization capabilities
- Offers insights into feature organization and discovery

The authors make all models and tools publicly available, which should significantly accelerate research in mechanistic interpretability by reducing the need for redundant SAE training. The paper’s technical depth, combined with its practical utility, makes it a valuable contribution both to the theoretical understanding of language models and to the practical tools available to researchers in the field.