
Benchmark Study: Does Reasoning Structure Shape Failure, Not Just Accuracy?


Published March 2026
8 min read

We ran 1,500 benchmark tests to find out whether DODAR-based prompting improves LLM reasoning. It doesn't improve accuracy. But it measurably changes how models fail — and that finding is statistically significant (p = 0.0003).

What we tested

Phase-Gated Reasoning (PGR) encodes DODAR's five phases — Diagnose, Options, Decide, Action, Review — directly into an LLM system prompt. We tested two PGR variants against three controls on GPT-4.1-mini across 100 benchmark tasks, each run three times:

  • A — Baseline: No system prompt
  • B — Zero-Shot CoT: "Think step by step" (token-matched to C)
  • C — PGR (Late Commitment): Five phases, decision deferred to REVIEW
  • C_prev — PGR (Early Commitment): Five phases, decision at DECIDE (original design)
  • G — Few-Shot CoT: Three worked examples showing step-by-step reasoning
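The exact system prompts are in the benchmark report, not reproduced here. As a minimal sketch of what a phase-gated prompt could look like, the following assembles one from the five DODAR phases; the phase names and the late/early commitment distinction come from the study, but the wording and the `build_pgr_prompt` helper are illustrative, not the prompt actually benchmarked:

```python
# Hypothetical sketch of a phase-gated (PGR) system prompt.
# Phase names follow DODAR; the instruction wording is illustrative.
PGR_PHASES = ["DIAGNOSE", "OPTIONS", "DECIDE", "ACTION", "REVIEW"]

def build_pgr_prompt(late_commitment: bool = True) -> str:
    """Assemble a system prompt that walks the model through the five
    DODAR phases, committing either at REVIEW (late) or DECIDE (early)."""
    commit_phase = "REVIEW" if late_commitment else "DECIDE"
    instructions = {
        "DIAGNOSE": "Restate the problem and identify what is being asked.",
        "OPTIONS": "List at least two candidate answers or approaches.",
        "DECIDE": "Rank the candidates and justify the ranking.",
        "ACTION": "Work the leading candidates through to concrete results.",
        "REVIEW": "Check the work and argue against the leading candidate.",
    }
    lines = ["Reason through the following phases in order, labeling each:"]
    for phase in PGR_PHASES:
        lines.append(f"{phase}: {instructions[phase]}")
    lines.append(f"Only commit to a final answer in the {commit_phase} phase.")
    return "\n".join(lines)
```

Condition C corresponds to `build_pgr_prompt(late_commitment=True)` and C_prev to the early-commitment variant.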

What we found

Accuracy is flat

A: 88.7% | B: 85.7% | C: 88.3% | C_prev: 87.3% | G: 91.0%

All conditions fall within a 5.3 percentage point range. No PGR comparison reaches statistical significance. The model either knows the answer or it doesn't — how you ask it to think about it doesn't change that.

Error distributions are significantly different

This is the real finding. When PGR fails, 39% of its errors are anchoring and failure-to-revise — nearly double the baseline rate of 21%. The structured phases create commitment points that become cognitive traps. The model commits to an answer, generates evidence against it, and then stands pat.

Error Type                       Baseline   PGR (Late)   Few-Shot CoT
Anchoring + Failure to Revise    21%        39%          24%
Comprehension Failure            33%        21%          50%
Execution Error                  41%        36%          19%
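The p = 0.0003 result comes from a chi-squared test over error-type counts per condition. A stdlib sketch of that computation, using hypothetical stand-in counts (the real counts are in the benchmark report; only the percentages above are from the study):

```python
# Chi-squared test of homogeneity over a contingency table of
# error counts (rows = error types, cols = prompting conditions).
def chi_squared_stat(table: list[list[int]]) -> tuple[float, int]:
    """Return the chi-squared statistic and degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: rows = anchoring, comprehension, execution;
# columns = Baseline, PGR (Late), Few-Shot CoT.
counts = [[7, 14, 6], [11, 8, 13], [14, 13, 5]]
stat, dof = chi_squared_stat(counts)
```

The statistic is then compared against the chi-squared distribution with `dof` degrees of freedom (e.g. via `scipy.stats.chi2_contingency`, which does all of the above in one call).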

Few-Shot CoT wins

The only significant accuracy result: Few-Shot CoT (91%) significantly outperforms Zero-Shot CoT (85.7%) on Wilcoxon signed-rank (p = 0.033). Showing beats telling. Models are better at imitating demonstrated reasoning than following abstract process descriptions.

Self-review doesn't work in LLMs

DODAR's REVIEW phase works in a cockpit because the first officer is a different human challenging the captain. In an LLM, REVIEW is the same model reviewing its own output. We tested an anti-anchoring variant that explicitly instructed the model to argue against its own answer. It made things worse — the model performed the review ritual perfectly and changed nothing.

The exception

One task (BBH-CJ-005) shows late-commitment PGR working exactly as designed. It's a causal judgement question where every other condition goes 0/3 and PGR goes 3/3. By deferring commitment to REVIEW and testing both causal framings equally, the model avoids a trap that every other condition falls into. But this is the exception, not the rule.

What this means for DODAR

DODAR remains a powerful framework for human decision-making. The REVIEW phase works because it involves an independent perspective — a different person challenging the decision-maker's assumptions.

The insight from this study is that DODAR may be better suited as an architecture pattern than a prompting pattern for AI systems. Instead of one model following all five phases, route DIAGNOSE through DECIDE to one model, then hand the reasoning trace to a separate model (or fresh context) for REVIEW. Give the reviewer genuine independence — no memory of the reasoning that produced the anchor.
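That routing can be sketched in a few lines. Here `call_model` stands in for any chat-completion API, and the function names and prompt wording are hypothetical; the key design point from the study is that the reviewer's context contains the task and the proposed answer but not the reasoning trace that produced it:

```python
# Sketch of DODAR as an architecture pattern rather than a prompt.
from typing import Callable

ModelCall = Callable[[str, str], str]  # (system_prompt, user_prompt) -> reply

def solve(task: str, call_model: ModelCall) -> tuple[str, str]:
    """Run DIAGNOSE through ACTION in one model; return (answer, trace)."""
    system = ("Work through DIAGNOSE, OPTIONS, DECIDE, and ACTION. "
              "End with a line 'ANSWER: <answer>'.")
    trace = call_model(system, task)
    answer = trace.rsplit("ANSWER:", 1)[-1].strip()
    return answer, trace

def review(task: str, answer: str, call_model: ModelCall) -> str:
    """REVIEW in a fresh context: the reviewer never sees the trace,
    so it has no memory of the reasoning that produced the anchor."""
    system = ("You are an independent reviewer. Challenge the proposed "
              "answer and reply with a final line 'ANSWER: <answer>'.")
    user = f"Task: {task}\nProposed answer: {answer}"
    verdict = call_model(system, user)
    return verdict.rsplit("ANSWER:", 1)[-1].strip()
```

Deliberately dropping the trace between `solve` and `review` is what restores the first-officer dynamic: the reviewer cannot inherit the solver's anchor because it never saw the reasoning behind it.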

Multi-agent DODAR — where the reviewer has no memory of the original reasoning — is the natural next step.

Protocol deviations

This study evolved during execution. The original protocol specified seven prompting conditions, seven models across three stages, and human error classification. What was actually executed: five conditions, one model, LLM-only classification.

The PGR prompt was redesigned mid-study from early commitment (decision at DECIDE) to late commitment (decision at REVIEW) after failure analysis showed the model consistently found errors in REVIEW but refused to change its answer. The original prompt was retained as a comparison condition.

Conditions D (ReAct), E (Step-Back), and F (Shuffled PGR) were tested in preliminary runs but dropped from the final triplicate. Multi-model stages 2 and 3 were not executed — accuracy was flat, so cross-model validation was not pursued.

Human error classification was not performed. The blinding script (blind_responses.py) was written but never run. LLM raters (Claude Opus 4.6 + GPT-5.4) classified all errors, with inter-rater kappa of 0.456 — below the protocol's 0.7 threshold for accepting LLM labels as primary data. The chi-squared result (p = 0.0003) is robust enough to survive noisy labels, but specific error counts are directional rather than precise.
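The 0.456 figure is an agreement statistic between the two LLM raters. Assuming it is Cohen's kappa over per-error labels (the report specifies the exact metric), it can be computed from the two label lists with the stdlib alone:

```python
# Cohen's kappa: agreement between two raters, corrected for the
# agreement expected by chance given each rater's label frequencies.
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Kappa for two raters labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

A kappa of 0.456 sits in the "moderate agreement" band, which is why the protocol's 0.7 threshold treats the error labels as directional rather than primary data.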

The full protocol deviations table is in the benchmark report PDF.

Full report

The complete methodology, data, and analysis are available in the benchmark report PDF. Contact [email protected] for the raw data and code.

About this research

Study conducted March 2026 by Adam Field / Crox. 1,500 runs on GPT-4.1-mini. Error classification by Claude Opus 4.6 and GPT-5.4. Total API cost: $1.45.
