Frontier reasoning models struggle to control their chains of thought, helping CoT monitorability
Key Points
- CoT controllability is very low (≤15.4%) across tested frontier models
- Larger models slightly better, but longer reasoning and post-training reduce control
- CoT-Control released: 13k+ tasks for measuring CoT controllability
Summary
The paper evaluates whether modern reasoning models can deliberately control or obscure their chain-of-thought (CoT) traces, which would weaken CoT-based monitoring. Using CoT-Control, an open-source suite of 13k+ tasks derived from standard benchmarks, the authors measure models' ability to follow explicit instructions about their CoTs (e.g., avoid keywords, use only lowercase, append tokens). Across 13 frontier models, controllability is very low (0.1%–15.4%), suggesting current models are unlikely to reliably obfuscate reasoning under these proxies. Controllability increases modestly with model scale and with explicit monitor/eval awareness, but decreases with longer reasoning chains and additional post-training (e.g., RL).
Key Points
- CoT-Control: open-source evaluation with >13,000 instructioned CoT tasks built from GPQA, MMLU-Pro, HLE, BFCL, SWE-Bench.
- Measured controllability is low across 13 models (0.1%–15.4%); failures often occur despite models recognizing noncompliance in their own CoT.
- Controllability trends: increases with model size, decreases with longer chain length and more post-training; small boost when told they're monitored; prompt optimization transfers poorly.
- Limitations: tasks are a proxy for real-world adversarial incentives; some controllability types may not be covered and results may under- or over-estimate true capability.
- Recommendation: treat low CoT controllability as a useful canary metric—continue monitoring as models evolve and avoid direct optimization pressure on CoTs.
Implications for engineers
- Add CoT controllability checks (e.g., CoT-Control) into evaluation pipelines alongside existing monitorability metrics.
- Stress-test models with longer multi-step reasoning and post-training variants (RL, fine-tuning) to detect degradations in monitorability.
- Keep defense-in-depth: combine CoT monitoring with behavioral testing, safety training, and code review rather than relying on CoT alone.
- Avoid designs that directly optimize or expose incentives to manipulate CoTs; report CoT controllability in system cards for frontier models.