Michael T. Pearce, Thomas Dooms, Alice Rigg, José Oramas, Lee Sharkey
Abstract
A mechanistic understanding of how MLPs do computation in deep neural networks remains elusive. Current interpretability work can extract features from hidden activations over an input dataset but generally cannot explain how MLP weights construct features. One challenge is that element-wise nonlinearities introduce higher-order interactions and make it difficult to trace computations through the MLP layer. In this paper, we analyze bilinear MLPs, a type of Gated Linear Unit (GLU) without any element-wise nonlinearity that nevertheless achieves competitive performance. Bilinear MLPs can be fully expressed in terms of linear operations using a third-order tensor, allowing flexible analysis of the weights. Analyzing the spectra of bilinear MLP weights using eigendecomposition reveals interpretable low-rank structure across toy tasks, image classification, and language modeling. We use this understanding to craft adversarial examples, uncover overfitting, and identify small language model circuits directly from the weights alone. Our results demonstrate that bilinear layers serve as an interpretable drop-in replacement for current activation functions and that weight-based interpretability is viable for understanding deep-learning models.
This paper introduces an important advancement in neural network interpretability by analyzing bilinear MLPs, which are a variant of Gated Linear Units (GLUs) without element-wise nonlinearity. The authors demonstrate that these bilinear MLPs can achieve competitive performance while being significantly more interpretable than traditional MLPs. Key contributions:
Theoretical Framework
The authors show that bilinear MLPs can be fully expressed using linear operations with a third-order tensor, making them amenable to mathematical analysis.
Figure 1 illustrates how bilinear layers can be represented either as an elementwise product of two linear projections or as a contraction with the bilinear tensor.
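To make the two representations concrete, here is a minimal NumPy sketch; the dimensions and random weights are purely illustrative and the tensor form is written as a direct contraction rather than following the authors' code. It checks that the elementwise-product form and the third-order-tensor form compute the same output.

```python
import numpy as np

# Minimal sketch of a bilinear layer (dimensions and weights are arbitrary).
rng = np.random.default_rng(0)
d_in, d_out = 8, 4
W = rng.normal(size=(d_out, d_in))   # first linear projection
V = rng.normal(size=(d_out, d_in))   # second linear projection
x = rng.normal(size=d_in)

# Form 1: elementwise product of two linear maps (no elementwise nonlinearity).
h_elementwise = (W @ x) * (V @ x)

# Form 2: contraction with the third-order tensor B[a, i, j] = W[a, i] * V[a, j],
# so that h[a] = x^T B[a] x.
B = np.einsum('ai,aj->aij', W, V)
h_tensor = np.einsum('aij,i,j->a', B, x, x)

assert np.allclose(h_elementwise, h_tensor)
```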
Eigendecomposition Analysis
A major contribution is the introduction of eigendecomposition techniques for analyzing bilinear MLP weights. An accompanying figure demonstrates how eigenvector activations work and shows examples of interpretable patterns learned for image classification; the eigenvectors often correspond to meaningful features such as edge detectors or class-specific patterns.
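The basic recipe, sketched below with random weights standing in for a trained model, is to fold a chosen output direction into the third-order tensor to obtain a single symmetric interaction matrix over inputs, then eigendecompose it. The readout direction u here is a hypothetical placeholder, not one taken from the paper.

```python
import numpy as np

# Sketch: interaction matrix and eigendecomposition for one output direction.
# Random weights stand in for a trained bilinear layer; u is a hypothetical
# readout direction (e.g. a class logit).
rng = np.random.default_rng(1)
d_in, d_out = 784, 512
W = rng.normal(size=(d_out, d_in))
V = rng.normal(size=(d_out, d_in))
u = rng.normal(size=d_out)

# Fold u into the third-order tensor: Q[i, j] = sum_a u[a] * W[a, i] * V[a, j].
Q = np.einsum('a,ai,aj->ij', u, W, V)
# Only the symmetric part matters for the quadratic form x^T Q x.
Q_sym = 0.5 * (Q + Q.T)

# Eigenvectors are input-space patterns (e.g. 28x28 images for MNIST);
# eigenvalues give the sign and scale of each pattern's contribution.
eigvals, eigvecs = np.linalg.eigh(Q_sym)
```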
Image Classification Insights
The authors apply their methods to MNIST and Fashion-MNIST classification tasks, revealing that:
- Models learn interpretable low-rank structure
- Regularization improves feature interpretability
- Top eigenvectors capture meaningful patterns
Figure 3 shows how different eigenvalues contribute to digit classification.
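Because the layer is exactly quadratic in its input, the eigendecomposition yields an exact additive breakdown of a logit into per-eigenvector terms λ_i (v_i · x)². The toy check below uses random weights only to verify that algebra; it is not a reproduction of the paper's Figure 3.

```python
import numpy as np

# Toy check (random weights, not a trained model): the spectral decomposition
# gives an exact additive breakdown of an output logit,
#   u . h(x) = sum_i eigval_i * (eigvec_i . x)^2.
rng = np.random.default_rng(2)
d_in, d_out = 8, 4
W, V = rng.normal(size=(d_out, d_in)), rng.normal(size=(d_out, d_in))
u, x = rng.normal(size=d_out), rng.normal(size=d_in)

Q = np.einsum('a,ai,aj->ij', u, W, V)
eigvals, eigvecs = np.linalg.eigh(0.5 * (Q + Q.T))

contributions = eigvals * (eigvecs.T @ x) ** 2   # one signed term per eigenvector
assert np.isclose(contributions.sum(), u @ ((W @ x) * (V @ x)))
```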
Language Model Analysis
The researchers analyze a 6-layer transformer with bilinear MLPs trained on TinyStories.
Figure 8 demonstrates a discovered sentiment negation circuit, showing how the model learns to flip sentiment based on negation words.
Figure 9 shows that many output features can be well-approximated by low-rank matrices.
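A rough sketch of that low-rank idea: keep only the k eigenvectors with the largest-magnitude eigenvalues and reconstruct the interaction matrix from them. With the random weights used here the spectrum is flat, so little is captured; the paper's observation is that trained output features concentrate onto a few eigenvalues.

```python
import numpy as np

# Sketch of a rank-k approximation of one output feature's interaction matrix,
# keeping the k eigenvectors with largest-magnitude eigenvalues.
rng = np.random.default_rng(3)
d_in, d_out, k = 64, 32, 4
W, V = rng.normal(size=(d_out, d_in)), rng.normal(size=(d_out, d_in))
u = rng.normal(size=d_out)           # hypothetical output-feature direction

Q = np.einsum('a,ai,aj->ij', u, W, V)
eigvals, eigvecs = np.linalg.eigh(0.5 * (Q + Q.T))

top = np.argsort(-np.abs(eigvals))[:k]                       # top-k by |eigenvalue|
Q_lowrank = (eigvecs[:, top] * eigvals[top]) @ eigvecs[:, top].T

captured = np.abs(eigvals[top]).sum() / np.abs(eigvals).sum()
print(f"rank-{k} approximation captures {captured:.1%} of total |eigenvalue| mass")
```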
Key findings:
- Bilinear MLPs offer similar performance to traditional MLPs while being more interpretable
- The eigendecomposition reveals interpretable low-rank structure across different tasks
- The method enables direct analysis of model weights without requiring input data
- The approach can identify specific computational circuits in language models

The paper demonstrates that bilinear MLPs could serve as a drop-in replacement for traditional MLPs in many applications, offering improved interpretability without significant performance trade-offs. The authors also provide practical guidance for implementing and analyzing bilinear layers. The work opens new possibilities for mechanistic interpretability by showing how model weights can be directly analyzed without relying on activation patterns from specific inputs. This could lead to more robust and generalizable interpretability methods for deep learning models.