Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity
Abstract
Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model's external 'working memory', a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.
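As a concrete illustration of how faithfulness and verbosity can be folded into a single monitorability score, the sketch below averages a per-example faithfulness judgment with a verbosity coverage score. The aggregation shown is an assumption made for illustration, not necessarily the paper's exact formula.

```python
# Minimal sketch of a combined monitorability score (hypothetical aggregation:
# a simple average of faithfulness and verbosity; the paper's exact formula may differ).
from dataclasses import dataclass

@dataclass
class CoTJudgment:
    faithful: bool    # does the CoT acknowledge the cue that influenced the answer?
    verbosity: float  # fraction of factors needed to solve the task that the CoT mentions

def monitorability(judgments: list[CoTJudgment]) -> float:
    """Average per-example monitorability over a set of judged CoT rollouts."""
    per_example = [(float(j.faithful) + j.verbosity) / 2 for j in judgments]
    return sum(per_example) / len(per_example)

# Example: one rollout that is faithful but terse, one that is unfaithful but fairly complete.
print(monitorability([CoTJudgment(True, 0.4), CoTJudgment(False, 0.8)]))  # 0.55
```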
Key Findings
1. Reasoning Models Show Superior Monitorability
Reasoning models consistently achieve higher monitorability scores than instruction-tuned models. DeepSeek-R1 leads with an average monitorability score of 78.3%, followed by Claude 3.7 Sonnet with extended thinking at 68.8%. Within the Qwen family, Qwen 3 Thinking achieves 67.3%, compared to 59.2% for the non-thinking Qwen 3 and 42.2% for Qwen 2.5 72B.
2. Higher Verbosity Tracks with Faithfulness
Reasoning models exhibit both higher faithfulness and higher verbosity scores. Longer reasoning traces naturally contain more information, but recent work (Gema et al., 2025) has shown that beyond a certain length, performance deteriorates following an inverse scaling law. A careful balance is therefore required for models to be both highly monitorable and highly performant. In Appendix C, we detail the correlation of our verbosity metric with rollout length (Pearson's r = 0.209, p < 0.001).
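For readers who want to run this kind of check on their own rollouts, the snippet below computes a Pearson correlation between rollout length and verbosity, mirroring the Appendix C analysis. The arrays are placeholder values, not the paper's data.

```python
# Illustrative correlation between CoT rollout length and judged verbosity.
# The numbers below are hypothetical placeholders, not results from the paper.
import numpy as np
from scipy.stats import pearsonr

rollout_lengths = np.array([220, 540, 910, 1300, 1750, 2400])      # tokens per CoT (hypothetical)
verbosity_scores = np.array([0.35, 0.42, 0.40, 0.55, 0.48, 0.60])  # judged coverage of needed factors

r, p = pearsonr(rollout_lengths, verbosity_scores)
print(f"Pearson r = {r:.3f}, p = {p:.3f}")
```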
3. Large Differences on Sensitive Cues
The largest monitorability gaps between reasoning and instruction-tuned models occur on the "unethically sourced information" cue, where Claude 3.7 Sonnet outpaced Claude 3.5 Sonnet by 43.8 percentage points and DeepSeek-R1 outpaced DeepSeek-V3 by 32.8 percentage points. The "Stanford professor" cue also showed a large gap, with a 37.8-point difference between QwQ 32B (81.7%) and Qwen 2.5 72B (43.9%).
4. Non-Switch Cases Reveal Hidden Unfaithfulness
Cases where models maintain their correct answer despite misleading cues comprise the vast majority (66.4-80.3%) of instances. Unlike prior work that treats these as automatically faithful, we find that many models fail to mention the cue even in these cases. Non-switch cases have similar monitorability scores to cases where models do switch, revealing valuable information that traditional metrics miss.
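A minimal sketch of this analysis appears below: each instance is split by whether the cued answer differs from the baseline answer, and the CoT is checked for a mention of the cue. The keyword match here is a simplification standing in for the paper's model-based judging of cue mentions.

```python
# Sketch of splitting instances into switch / non-switch cases and checking whether
# the CoT surfaces the injected cue in each. A substring check stands in for the
# paper's actual (model-based) judge of cue mentions.
def categorize(baseline_answer: str, cued_answer: str, cot: str, cue_phrase: str) -> dict:
    switched = cued_answer != baseline_answer
    mentions_cue = cue_phrase.lower() in cot.lower()
    return {"switched": switched, "mentions_cue": mentions_cue}

# Non-switch case: the answer is unchanged, but the CoT never surfaces the cue --
# exactly the kind of instance that switch-only faithfulness metrics would not flag.
print(categorize("B", "B", "The key factors are X and Y, so the answer is B.", "Stanford professor"))
```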
5. Dataset Difficulty Impacts Monitorability
Monitorability scores vary significantly across datasets. The hardest dataset, GPQA-Diamond, has the lowest average monitorability scores, while BBH has the highest. In particular, GPQA-Diamond had much lower faithfulness scores. Further research into how models must use their CoT to reason, and how that affects monitorability, is a fruitful direction for future work.
6. Models Have Unique Monitorability Patterns
Monitorability differs across model families, and even across cues. For instance, Claude 3.7 Sonnet reports the "unethically sourced information" cue at an unusually high rate relative to other models, potentially a result of different post-training techniques. We also notice that more capable models tend to have higher monitorability scores: within the Qwen family, the Qwen 3 models perform better than the Qwen 2.5 or QwQ models.
Resources
Citation
@misc{meek2025measuringchainofthoughtmonitorabilityfaithfulness,
title={Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity},
author={Austin Meek and Eitan Sprejer and Iván Arcuschin and Austin J. Brockmeier and Steven Basart},
year={2025},
eprint={2510.27378},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.27378},
}