This paper presents an argument that certain AI safety measures, rather than mitigating existential risk, may instead exacerbate it. Under certain key assumptions - the inevitability of AI failure, the expected correlation between an AI system's power at the point of failure and the severity of the resulting harm, and the tendency of safety measures to enable AI systems to become more powerful before failing - safety efforts have negative expected utility. The paper examines three response strategies: Optimism, Mitigation, and Holism. Each faces challenges stemming from intrinsic features of the AI safety landscape that we term Bottlenecking, the Perfection Barrier, and Equilibrium Fluctuation. The surprising robustness of the argument forces a re-examination of core assumptions around AI safety and points to several avenues for further research.
This fascinating paper presents a counterintuitive argument against AI safety measures, suggesting that such measures might actually increase rather than decrease existential risks from AI. The authors (Cappelen, Dever, and Hawthorne) develop what they call the “non-deterministic argument” through an analogy with rock climbing.
Key Points:
The Rock Climber Analogy:
Consider a climber who will inevitably fall
Providing safety equipment (e.g., chalk that improves grip) allows the climber to get higher before falling
A fall from greater height is more catastrophic
Therefore, providing safety measures leads to worse outcomes
The AI Safety Parallel:
AI systems will eventually fail/malfunction
Safety measures allow AI to become more powerful before failing
Failures of more powerful AI systems are more catastrophic
Therefore, safety measures may increase overall expected harm (the toy model sketched below makes this concrete)
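To see why the argument has bite, it helps to make the expected-harm comparison concrete. The sketch below is not from the paper; the per-step failure probability, the harm function, and the effect attributed to safety measures are illustrative assumptions, chosen only to show how "failing later, from a higher level of capability" can dominate the expected outcome.

```python
import random

def expected_harm(per_step_failure_prob, harm_at_level, trials=100_000, horizon=500):
    """Monte Carlo estimate of expected harm for a system that gains one
    capability level per step until it fails.

    Illustrative assumptions (not from the paper):
    - failure is effectively inevitable (per-step failure chance > 0),
    - harm at the moment of failure grows with the capability level reached.
    The horizon only guarantees termination; it is essentially never hit
    with the parameters used below.
    """
    total = 0.0
    for _ in range(trials):
        level = 0
        while level < horizon and random.random() > per_step_failure_prob:
            level += 1
        total += harm_at_level(level)
    return total / trials

# Assumed harm function: damage grows superlinearly with power at failure.
def harm(level):
    return level ** 2

# Without safety measures: the system fails early, at low capability.
baseline = expected_harm(per_step_failure_prob=0.20, harm_at_level=harm)

# With safety measures: each step is individually much safer, so the system
# climbs far higher before the (still inevitable) failure.
with_safety = expected_harm(per_step_failure_prob=0.02, harm_at_level=harm)

print(f"expected harm without safety measures: {baseline:.0f}")    # roughly 36
print(f"expected harm with safety measures:    {with_safety:.0f}")  # roughly 4850
```

Under these assumed numbers the "safer" system accumulates far more expected harm than the unsafe one, precisely because it reliably survives long enough to become powerful.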
Three Main Response Strategies:
Optimism: Believing we can stay ahead of AI dangers
Mitigation: Focusing on reducing damage rather than preventing failure
Holism: Evaluating safety efforts as part of the overall equilibrium between safety and capability, rather than in isolation
Key Challenges to These Responses:
Bottlenecking: Safety measures must route through fallible human systems
Perfection Barrier: Safety must succeed nearly perfectly, every time, whereas catastrophic damage requires only a single failure (illustrated in the sketch after this list)
Equilibrium Fluctuation: Even balanced systems will have dangerous fluctuations
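The Perfection Barrier is, at bottom, a compounding-probability point. The numbers below are assumed for illustration, not taken from the paper: a per-deployment safety record that sounds excellent still yields near-certain failure once enough deployments accumulate, and one failure is all the catastrophic branch needs.

```python
# Illustrative arithmetic for the Perfection Barrier (assumed numbers, not from the paper):
# safety has to hold on every deployment, while catastrophe needs only one lapse.
per_deployment_success = 0.999  # a 99.9% per-deployment safety record sounds excellent

for deployments in (100, 1_000, 10_000):
    p_never_fails = per_deployment_success ** deployments
    p_at_least_one_failure = 1 - p_never_fails
    print(f"{deployments:>6} deployments: "
          f"P(at least one safety failure) = {p_at_least_one_failure:.3f}")

# Prints roughly 0.095, 0.632, and 1.000 -- near-certain failure at scale.
```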
The paper’s argument is particularly compelling because it doesn’t deny the existential risk posed by AI, but rather suggests that our attempts to mitigate this risk through safety measures may be counterproductive. The authors acknowledge this is a counterintuitive conclusion but demonstrate its robustness against various objections.
The implications are significant for AI governance and policy. If the argument holds, we may need to fundamentally rethink our approach to AI safety, possibly favoring restrictions on AI development itself over investment in safety measures.
The paper’s strength lies in its careful philosophical analysis and the way it builds from a simple analogy to a sophisticated argument about AI risk. While the conclusions may be uncomfortable for many in the AI safety community, the logic is difficult to dismiss.
Future Research Directions:
Developing additional responses to the argument
Challenging the empirical assumptions
Connecting these theoretical insights to practical AI safety work
Exploring implications for AI governance
This paper represents an important contribution to the AI safety discussion by forcing us to confront uncomfortable possibilities about the relationship between safety measures and risk. While it doesn’t definitively solve the problem, it raises crucial questions that deserve serious consideration from both researchers and policymakers.