Summary

This paper examines the relationship between AI safety benchmarks and general model capabilities, introducing the concept of “safetywashing”: misrepresenting capability improvements as safety advancements. The authors conduct a meta-analysis across multiple safety domains and dozens of models to determine whether common safety benchmarks actually measure properties distinct from general capabilities.

Key Findings:

  1. Methodology: The authors develop a methodology for measuring correlations between safety benchmarks and general capabilities. They derive a “capabilities score” for each model by applying PCA to scores on standard benchmarks such as MMLU and GSM8K; the first principal component explains about 75% of the variance in model performance (see the sketch after this list).

  2. Safety Domains Analyzed:

  • Alignment: High correlation with capabilities (MT-Bench: 78.7%, LMSYS Arena: 62.1%)
  • Machine Ethics: Mixed results; ETHICS shows a high correlation (82.2%), while propensity measures such as MACHIAVELLI show a negative correlation (-49.9%)
  • Bias: Generally low correlation with capabilities (BBQ Ambiguous: -37.3%)
  • Calibration: RMS calibration error shows low correlation (20.1%) while Brier Score shows high correlation (95.5%)
  • Adversarial Robustness: Varies by attack type - traditional benchmarks show high correlation while jailbreak resistance shows low correlation
  • Weaponization: Strong negative correlation with capabilities (-87.5%)

  3. Key Insights: The paper demonstrates that many safety benchmarks (roughly half of those analyzed) inadvertently measure general capabilities rather than distinct safety properties. This enables “safetywashing”, where basic capability improvements can be misrepresented as safety progress.
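
To make the methodology concrete, here is a minimal sketch of this style of analysis (not the authors’ code): it standardizes a toy models-by-benchmarks score matrix, takes the first principal component as a single “capabilities score” per model, and reports the Spearman correlation of a hypothetical safety benchmark with that score. The array shapes, random data, and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
scores = rng.uniform(0.2, 0.9, size=(20, 5))  # toy data: 20 models x 5 capability benchmarks
safety = rng.uniform(0.0, 1.0, size=20)       # toy scores on one safety benchmark

# Standardize each benchmark column, then take the projection onto the
# first principal component as a per-model "capabilities score".
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
_, _, vt = np.linalg.svd(z, full_matrices=False)
capabilities_score = z @ vt[0]

# The sign of a principal component is arbitrary; flip it so that higher
# capabilities scores go with higher average benchmark performance.
if np.corrcoef(capabilities_score, scores.mean(axis=1))[0, 1] < 0:
    capabilities_score = -capabilities_score

# A high correlation suggests the safety benchmark largely tracks
# general capabilities rather than a distinct safety property.
rho, _ = spearmanr(capabilities_score, safety)
print(f"capabilities correlation: {rho:.1%}")
```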

Important Figures:

  • Figure 1 provides an excellent overview of the paper’s concept, showing how safety benchmarks can be correlated with capabilities
  • Figure 2 illustrates the safetywashing problem through a leaderboard example
  • Figure 11 demonstrates the important distinction between RMS calibration error and Brier Score correlations (see the metric sketch after this list)
  • Figure 15 shows the strong relationship between compute and capabilities scores
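
The distinction behind Figure 11 is that the Brier score is the mean squared error between a model’s confidence and the 0/1 outcome, so it rewards raw accuracy as well as calibration, whereas RMS calibration error compares average confidence with empirical accuracy within confidence bins and thus isolates calibration. Below is a minimal sketch of the two metrics on toy data; the binning scheme and function names are illustrative, not taken from the paper.

```python
import numpy as np

def brier_score(p, y):
    # Mean squared error between predicted probability and 0/1 outcome;
    # sensitive to accuracy as well as calibration.
    return np.mean((p - y) ** 2)

def rms_calibration_error(p, y, n_bins=10):
    # Bin predictions by confidence and compare mean confidence with
    # empirical accuracy in each bin; insensitive to raw accuracy.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    weighted_sq_gap, total = 0.0, 0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & ((p < hi) if hi < 1.0 else (p <= hi))
        if mask.any():
            gap = p[mask].mean() - y[mask].mean()
            weighted_sq_gap += mask.sum() * gap ** 2
            total += mask.sum()
    return np.sqrt(weighted_sq_gap / total)

rng = np.random.default_rng(0)
p = rng.uniform(size=1000)                      # toy confidences
y = (rng.uniform(size=1000) < p).astype(float)  # outcomes consistent with p
print(f"Brier: {brier_score(p, y):.3f}  RMSCE: {rms_calibration_error(p, y):.3f}")
```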

The authors make several recommendations:

  1. Report capabilities correlations for new safety evaluations

  2. Design benchmarks that are decorrelated from capabilities

  3. Avoid making safety claims without demonstrating differential progress

Table 9 provides valuable context by showing the relative capabilities scores across different models, helping readers understand the landscape of current AI systems.

The paper concludes that empirical measurement, rather than intuitive arguments, should guide AI safety research. It suggests that the field needs to focus on developing safety metrics that truly measure properties independent of general capabilities.

This work provides a crucial framework for evaluating AI safety progress and helps prevent the mischaracterization of general capability improvements as safety advancements. Figure 1 would make an excellent thumbnail as it clearly illustrates the paper’s core concept of safety benchmark correlation with capabilities.