2024
- AI Safety: A Climb To Armageddon? 11/13
- Interpreting Attention Layer Outputs with Sparse Autoencoders 11/13
- Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach 11/13
- Safety Cases: How to Justify the Safety of Advanced AI Systems 11/13
- Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? 11/13
- The Elephant in the Room - Why AI Safety Demands Diverse Teams 11/13
- The Role of AI Safety Institutes in Contributing to International Standards for Frontier AI Safety 11/13
- Bilinear MLPs enable weight-based mechanistic interpretability 11/11
- Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT 11/11
- Extracting Finite State Machines from Transformers 11/11
- How to use and interpret activation patching 11/11
- Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning 11/11
- Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders 11/11
- Mechanistic Interpretability for AI Safety - A Review 11/11
- Mechanistic Interpretability of Reinforcement Learning Agents 11/11
- Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations 11/11
- Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization 11/11
- Scalable Mechanistic Neural Networks 11/11
- Transcoders Find Interpretable LLM Feature Circuits 11/11
- Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models 11/11
- Using Degeneracy in the Loss Landscape for Mechanistic Interpretability 11/11