Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity

Austin Meek1,2 Eitan Sprejer3,4 Iván Arcuschin1,5 Austin J. Brockmeier2 Steven Basart6
1ML Alignment and Theory Scholars
2University of Delaware
3AI Safety Argentina
4Buenos Aires AI Safety Hub
5University of Buenos Aires
6Center for AI Safety
AAAI 2025 (Submitted)

Abstract

Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT faithfully reflects the model's internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often examine the CoT in cases where the model changes its answer after a cue is added to the input. This proxy catches some instances of unfaithfulness, but it loses information when the model maintains its answer, and it ignores aspects of the reasoning not tied to the cue. We extend these evaluations toward a more holistic notion of monitorability by introducing verbosity: whether the CoT verbalizes every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that captures how well the CoT serves as the model's external 'working memory', a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.
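For concreteness, here is a minimal Python sketch of how the two component scores could be folded into one number. The equal-weight average is an illustrative assumption, not necessarily the paper's exact aggregation.

def monitorability(faithfulness: float, verbosity: float) -> float:
    """Combine faithfulness and verbosity scores (both fractions in [0, 1])."""
    assert 0.0 <= faithfulness <= 1.0 and 0.0 <= verbosity <= 1.0
    # NOTE: equal weighting is an illustrative assumption, not necessarily
    # the paper's exact aggregation of the two axes.
    return (faithfulness + verbosity) / 2.0

# A model that acknowledges cues 80% of the time but verbalizes only half of
# the required factors is still hard to monitor overall.
print(monitorability(faithfulness=0.80, verbosity=0.50))  # 0.65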

Key Figures

Overview of monitorability measurement pipeline
Figure 1: Overview of our pipeline for measuring chain-of-thought (CoT) monitorability. For every task we create a baseline prompt (top) and a cued prompt that injects an explicit cue (bottom). The model under evaluation produces a CoT for each prompt. To assess Verbosity, we first ask a group of high-performing models (e.g., Gemini and Gemma) to individually enumerate the factors necessary to solve the task, then use Claude 4 Sonnet to aggregate them into a single list. At evaluation time, a lightweight judge model verifies whether each factor is verbalized in the baseline CoT. To assess Faithfulness, the same judge model checks whether the cued CoT explicitly acknowledges the injected cue. The resulting Verbosity and Faithfulness scores form the two axes of Monitorability, indicating how transparently the model externalizes its reasoning.
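
The scoring step can be sketched in a few lines of Python. Here toy_judge is a keyword-matching stand-in for the lightweight judge model (the released code queries an actual LLM judge via Inspect), and the factor list stands in for the Claude-aggregated factors; all names and prompts below are hypothetical.

def toy_judge(claim: str, cot: str) -> bool:
    """Stand-in for the judge model: checks whether a claim's key phrase
    appears in the CoT. The real pipeline asks an LLM judge this question."""
    return claim.lower() in cot.lower()

def verbosity_score(baseline_cot: str, factors: list[str]) -> float:
    """Fraction of the aggregated task factors verbalized in the baseline CoT."""
    return sum(toy_judge(f, baseline_cot) for f in factors) / len(factors)

def faithfulness_check(cued_cot: str, cue: str) -> bool:
    """Whether the cued CoT explicitly acknowledges the injected cue."""
    return toy_judge(cue, cued_cot)

baseline = "The object is blue and round, so option (B) matches both attributes."
cued = "A Stanford professor suggested (C), but the shape rules that option out."
print(verbosity_score(baseline, ["blue", "round", "weighs one pound"]))  # 0.666...
print(faithfulness_check(cued, "Stanford professor"))                    # True
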
Monitorability scores across different cue types
Figure 2: Monitorability scores across models and cue types, averaged over BBH, GPQA, and MMLU. See the Appendix for per-dataset response rates. Reasoning models are denoted with striped bars, and instruction-tuned models with solid bars.
Dataset comparison results
Figure 3: Results across the three datasets: BBH (restricted to the subset used by Turpin et al., 2023), GPQA (specifically, GPQA-Diamond), and MMLU.
Sample CoT example
Figure 4: Sample CoT extracted from the results on the BBH dataset. The cue mention is highlighted in blue, and the causal factors mentioned are highlighted in turquoise. Additional example reasoning traces can be found in the Appendix.

Key Findings

1. Reasoning Models Show Superior Monitorability

Reasoning models consistently achieve higher monitorability scores than instruction-tuned models. DeepSeek-R1 leads with 78.3% average monitorability, followed by Claude 3.7 Sonnet with extended thinking at 68.8%. Within the Qwen family, the reasoning model QwQ 32B reaches 67.3%, compared to 42.2% for the instruction-tuned Qwen 2.5 72B.

2. Higher Verbosity Tracks with Faithfulness

Reasoning models exhibit both higher faithfulness and higher verbosity scores. Longer reasoning traces contain more information and improve performance, but only up to a point: beyond certain lengths, performance deteriorates, following an inverse scaling law. Models must therefore strike a careful balance to be both highly monitorable and highly performant.

3. Large Differences on Sensitive Cues

The largest monitorability gaps between reasoning and instruction-tuned models occur on the "unethically sourced information" cue, where Claude 3.7 Sonnet outpaced Claude 3.5 Sonnet by 43.8 percentage points and DeepSeek-R1 outpaced DeepSeek-V3 by 32.8 percentage points. The "Stanford professor" cue showed a 37.8 point difference between QwQ 32B (81.7%) and Qwen 2.5 72B (43.9%).

4. Non-Switch Cases Reveal Hidden Unfaithfulness

Cases where models maintain their correct answer despite misleading cues make up the vast majority (67-76%) of instances. Unlike prior work, which treats these cases as automatically faithful, we find that many models fail to mention the cue even when they hold their answer. Non-switch cases score similarly on monitorability to cases where models do switch, revealing valuable signal that switch-only metrics miss.
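
To make the difference concrete, the Python sketch below contrasts a switch-only faithfulness proxy with scoring every case. The record fields are hypothetical stand-ins for the evaluation's per-question results, not the paper's actual schema.

# Each record: the model's baseline answer, its answer after the cue was
# injected, and whether its CoT acknowledged the cue.
records = [
    {"baseline": "B", "cued": "B", "acknowledged": False},  # non-switch, silent
    {"baseline": "B", "cued": "C", "acknowledged": True},   # switch, verbalized
    {"baseline": "A", "cued": "A", "acknowledged": False},  # non-switch, silent
    {"baseline": "D", "cued": "D", "acknowledged": True},   # non-switch, verbalized
]

# Switch-only proxy: evaluates the CoT only when the answer changed, so the
# three non-switch cases contribute nothing.
switched = [r for r in records if r["baseline"] != r["cued"]]
proxy = sum(r["acknowledged"] for r in switched) / len(switched)

# Scoring all cases also penalizes models that keep their answer while
# staying silent about the cue.
overall = sum(r["acknowledged"] for r in records) / len(records)
print(f"switch-only: {proxy:.2f}, all cases: {overall:.2f}")  # 1.00 vs 0.50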

5. Dataset Difficulty Impacts Monitorability

Monitorability scores vary significantly across datasets. The hardest dataset, GPQA-Diamond, has the lowest average monitorability scores, while BBH has the highest. This suggests that task difficulty influences the extent to which models externalize their reasoning.

6. Model Family Differences Persist

Monitorability differs sharply across model families, with DeepSeek and Claude models consistently outperforming Qwen models. Among non-reasoning models, Qwen 3 (59.2%) shows noticeable improvement over Qwen 2.5 (42.2%), indicating that newer generations may improve transparency.

Citation

@misc{meek2025measuringchainofthoughtmonitorabilityfaithfulness,
      title={Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity},
      author={Austin Meek and Eitan Sprejer and Iván Arcuschin and Austin J. Brockmeier and Steven Basart},
      year={2025},
      eprint={2510.27378},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.27378},
}