Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity
Abstract
Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model's external 'working memory', a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.
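The combination of faithfulness and verbosity into one score can be sketched as follows. This is an illustrative toy, not the paper's exact formula: the per-example judgments (`mentions_cue`, factor counts) and the simple-average aggregation are assumptions made for the sketch.

```python
# Hypothetical sketch: combining per-example faithfulness and verbosity
# judgments into a single monitorability score. The equal-weight average
# below is an illustrative assumption, not the paper's exact aggregation.

from dataclasses import dataclass

@dataclass
class CoTJudgment:
    mentions_cue: bool       # faithfulness proxy: does the CoT acknowledge the inserted cue?
    factors_mentioned: int   # verbosity: task-relevant factors that appear in the CoT
    factors_required: int    # total factors needed to solve the task

def monitorability(judgments: list[CoTJudgment]) -> float:
    """Average of a faithfulness rate and a verbosity rate, each in [0, 1]."""
    n = len(judgments)
    faithfulness = sum(j.mentions_cue for j in judgments) / n
    verbosity = sum(j.factors_mentioned / j.factors_required for j in judgments) / n
    return 0.5 * (faithfulness + verbosity)

sample = [
    CoTJudgment(mentions_cue=True, factors_mentioned=3, factors_required=4),
    CoTJudgment(mentions_cue=False, factors_mentioned=2, factors_required=4),
]
print(monitorability(sample))  # 0.5625
```

In practice the paper scores these judgments with model-based graders over the evaluation datasets; the sketch only shows how two separate rates can be folded into one number.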
Key Findings
1. Reasoning Models Show Superior Monitorability
Reasoning models consistently achieve higher monitorability scores than instruction-tuned models. DeepSeek-R1 leads with 78.3% average monitorability, followed by Claude 3.7 Sonnet with extended thinking at 68.8%. Within the Qwen family, the reasoning model QwQ 32B achieves 67.3%, compared to 42.2% for the instruction-tuned Qwen 2.5 72B.
2. Higher Verbosity Tracks with Faithfulness
Reasoning models exhibit both higher faithfulness and higher verbosity scores. Longer reasoning traces contain more information and improve performance, but only up to a point: beyond certain lengths, performance deteriorates following an inverse scaling law. Models therefore need a careful balance to be both highly monitorable and highly performant.
3. Large Differences on Sensitive Cues
The largest monitorability gaps between reasoning and instruction-tuned models occur on the "unethically sourced information" cue, where Claude 3.7 Sonnet outpaced Claude 3.5 Sonnet by 43.8 percentage points and DeepSeek-R1 outpaced DeepSeek-V3 by 32.8 percentage points. The "Stanford professor" cue showed a 37.8 point difference between QwQ 32B (81.7%) and Qwen 2.5 72B (43.9%).
4. Non-Switch Cases Reveal Hidden Unfaithfulness
Cases where models maintain their correct answer despite misleading cues comprise the vast majority (67-76%) of instances. Unlike prior work that treats these as automatically faithful, we find that many models fail to mention the cue even in these cases. Non-switch cases have monitorability scores similar to those where models do switch, revealing information that switch-based metrics miss.
5. Dataset Difficulty Impacts Monitorability
Monitorability scores vary significantly across datasets. The hardest dataset, GPQA-Diamond, has the lowest average monitorability scores, while BBH has the highest. This suggests that task difficulty influences the extent to which models externalize their reasoning.
6. Model Family Differences Persist
Monitorability differs sharply across model families, with DeepSeek and Claude models consistently outperforming Qwen models. Among non-reasoning models, Qwen 3 (59.2%) shows noticeable improvement over Qwen 2.5 (42.2%), indicating that newer generations may improve transparency.
Resources
Citation
@misc{meek2025measuringchainofthoughtmonitorabilityfaithfulness,
  title={Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity},
  author={Austin Meek and Eitan Sprejer and Iván Arcuschin and Austin J. Brockmeier and Steven Basart},
  year={2025},
  eprint={2510.27378},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.27378},
}