Coverage for mcp/mission/decide.py: 97%

87 statements  

« prev     ^ index     » next       coverage.py v7.14.1, created at 2026-06-15 15:07 +0000

1"""Pure deterministic verdict cascade for the Mission Decide_Phase. 

2 

3The cascade is the **control-path** output of the loop: given the current 

4``SessionState`` (before the in-progress iteration is appended), the 

5in-progress :class:`IterationRecord` (with ``strategy``, ``observation``, 

6and ``criteria_evaluation`` already populated but ``verdict`` / 

7``verdict_reason`` not yet set), and the wall-clock value the caller has 

8already measured, :func:`decide_verdict` returns a 

9``(VerdictLabel, VerdictReason)`` tuple. The function is pure: no logger 

10calls, no I/O, no random sources, no clock reads. The wall-clock value is 

11passed in on the call signature so tests can pin it. 

12 

13The cascade order is fixed: 

14 

151. **Budget terminations** — checked in a fixed sub-order so the 

16 verdict_reason is deterministic when more than one cap is breached: 

17 

18 * ``max_iterations`` — the in-progress iteration would be the 

19 ``budget["max_iterations"]``-th or later. Computed as 

20 ``len(session["iterations"]) + 1 >= max_iterations``. 

21 * ``max_wall_clock`` — ``now - session["started_at"] >= max_wall_clock_seconds``. 

22 Returns False when ``started_at`` is missing (the session has 

23 not yet transitioned out of ``pending``). 

24 * ``no_progress`` — ``no_progress_counter >= stagnation_threshold``. 

25 When the session has ``use_sampling=true``, the heuristic (step 

26 4) gets priority so the sampler can revise the strategy before 

27 the loop terminates. Without sampling, ``no_progress`` 

28 terminates immediately. If the heuristic doesn't fire (e.g., 

29 the tool sequence changed after a prior revision), the deferred 

30 stagnation check (step 4b) terminates. 

31 

322. **Completion** — every ``required=True`` Criterion has status 

33 ``met`` in the in-progress iteration's ``criteria_evaluation``, AND 

34 no Criterion (required or not) has status ``inconclusive``. 

35 

363. **Cadence-skip** — when :func:`should_evaluate_now` says "this is 

37 not a checkpoint", emit a synthetic ``("continue", "cadence_skip")`` 

38 without consulting the Strategy_Revision_Heuristic. The heuristic 

39 only fires on real checkpoints so off-cadence iterations cannot 

40 advance the no-progress counter or trigger an ``adjust``. 

41 

424. **Strategy_Revision_Heuristic** — :func:`_strategy_unproductive`: 

43 the same ``tool_calls[*].tool_name`` sequence 

44 for the last 3 iterations AND ``no_progress_counter`` at or above 

45 half the stagnation threshold, OR new errors in the latest 

46 Observation that didn't appear in the prior Observation. Returns 

47 ``("adjust", "heuristic_unproductive")`` when either clause fires. 

48 

494b. **Deferred stagnation** — if step 1c deferred the ``no_progress`` 

50 check (because sampling is enabled) and the heuristic didn't fire, 

51 terminate now. 

52 

535. **Default** — ``("continue", "in_progress")``. 

54 

55The ``iteration`` argument is *not* yet present in 

56``session["iterations"]`` — the engine appends it after the verdict is 

57decided. Anything that needs to look at "the last N iterations 

58including the current one" composes the current ``iteration`` with 

59``session["iterations"][-(N-1):]``. 

60 

61Determinism: same ``(session, iteration, now)`` triples produce the same 

62``(VerdictLabel, VerdictReason)`` tuples. This is enforced by a property 

63test in ``tests/test_mission_decide_determinism.py``. 

64 

65Cost guardrails are intentionally absent from this cascade. Real-time 

66workload cost tracking is structurally inaccurate (Spot vs on-demand 

67drift, EBS / EFA / egress not in the Pricing API, Cost Explorer 24h 

68latency). Operators who need a cost cap should configure AWS Budgets 

69and Cost Anomaly Detection at the account level — Mission caps only 

70the controls the loop has direct visibility into. 

71""" 

72 

73from __future__ import annotations 

74 

75import math 

76from datetime import datetime, timedelta 

77 

78from .checkpoints import should_evaluate_now 

79from .types import IterationRecord, SessionState, VerdictLabel, VerdictReason 

80 

81# <pyflowchart-code-diagram> BEGIN - auto-inserted, do not edit 

82# Flowchart(s) generated from this file: 

83# * ``decide_verdict`` -> ``diagrams/code_diagrams/mcp/mission/decide.decide_verdict.html`` 

84# (PNG: ``diagrams/code_diagrams/mcp/mission/decide.decide_verdict.png``) 

85# Regenerate with ``python diagrams/code_diagrams/generate.py``. 

86# <pyflowchart-code-diagram> END 

87 

88 

89__all__ = [ 

90 "build_revision_rationale_template", 

91 "decide_verdict", 

92] 

93 

94 

95def decide_verdict( 

96 session: SessionState, 

97 iteration: IterationRecord, 

98 now: datetime, 

99) -> tuple[VerdictLabel, VerdictReason]: 

100 """Return the deterministic Verdict for the in-progress iteration. 

101 

102 The cascade order is fixed (see module docstring). The first matching 

103 branch wins: a session that has both run out of iterations and has 

104 every Criterion met returns ``("terminate", "max_iterations")``, not 

105 ``("complete", "criteria_met")`` — budget caps are evaluated before 

106 completion so the operator can tell the loop ended because it ran 

107 out of budget rather than because the goal was reached on the 

108 closing iteration. 

109 Note: When the prior Execute_Phase ran a scripted Strategy and the 

110 sandbox cap fired, ``_execute_script`` writes 

111 ``iteration["sandbox_terminated_reason"]`` and the cascade returns 

112 that reason verbatim before anything else is consulted. The 

113 sandbox limit is a true budget cap — the script ran out of wall 

114 clock during execution — so it routes to a ``terminate`` verdict 

115 on the same path as the ``BudgetControls``-driven caps below. 

116 """ 

117 # 0. Sandbox-cap propagation. ``_execute_script`` stashes the 

118 # reason on the in-progress iteration when the sandbox runner 

119 # raised :class:`SandboxTerminated`. Reading the sentinel here 

120 # means the engine's Execute_Phase can complete cleanly (no phase 

121 # failure) while still routing the verdict to the budget-cap path. 

122 sandbox_reason = iteration.get("sandbox_terminated_reason") 

123 if sandbox_reason is not None: 

124 return ("terminate", sandbox_reason) 

125 

126 # 1a. max_iterations — the +1 captures "this in-progress iteration 

127 # would be the Nth one to land", so a session with budget=N and N-1 

128 # already-recorded iterations terminates on the Nth's Decide_Phase. 

129 # ``-1`` is the explicit "uncapped" sentinel; the validator 

130 # already enforced that any other non-positive value is rejected. 

131 max_iter = session["budget"]["max_iterations"] 

132 if max_iter != -1 and len(session["iterations"]) + 1 >= max_iter: 

133 return ("terminate", "max_iterations") 

134 # 1b. max_wall_clock — pure time arithmetic; missing started_at 

135 # means the session has not yet recorded its first iteration's 

136 # start, so no wall-clock can be measured. 

137 if _wall_clock_exceeded(session, now): 

138 return ("terminate", "max_wall_clock") 

139 # 1c. no_progress — the counter is incremented by the engine only 

140 # on evaluated iterations, so a session with all-skipped checkpoints 

141 # cannot terminate for stagnation. When the session has sampling 

142 # enabled, the heuristic gets priority (step 4 below) so the 

143 # sampler can revise the strategy before the loop terminates. 

144 # Without sampling, ``adjust`` is purely informational and 

145 # ``no_progress`` terminates immediately. 

146 if session["no_progress_counter"] >= session["stagnation_threshold"]: 

147 if not session.get("use_sampling"): 

148 return ("terminate", "no_progress") 

149 # With sampling enabled, fall through to the heuristic check 

150 # below. If the heuristic fires, the sampler gets one more 

151 # chance. If it doesn't fire (e.g., the tool sequence changed 

152 # after a prior revision), terminate for stagnation. 

153 _stagnation_pending = True 

154 else: 

155 _stagnation_pending = False 

156 

157 # 2. Completion — every required Criterion met AND nothing inconclusive. 

158 if _completion_satisfied(session, iteration): 

159 return ("complete", "criteria_met") 

160 

161 # 3. Cadence-skip — bail before the heuristic fires so off-cadence 

162 # iterations don't ever produce ``adjust``. The iteration_index 

163 # passed to ``should_evaluate_now`` is the 0-indexed position of 

164 # the in-progress iteration (which equals the count of already- 

165 # persisted iterations). 

166 if not should_evaluate_now(session, len(session["iterations"]), now): 

167 return ("continue", "cadence_skip") 

168 

169 # 4. Strategy_Revision_Heuristic. 

170 unproductive, _heuristic_reason = _strategy_unproductive(session, iteration) 

171 if unproductive: 

172 return ("adjust", "heuristic_unproductive") 

173 

174 # 4b. Deferred stagnation — the counter hit the threshold but the 

175 # heuristic didn't fire (e.g., the tool sequence changed after a 

176 # prior sampled revision). Terminate now. 

177 if _stagnation_pending: 

178 return ("terminate", "no_progress") 

179 

180 # 5. Default. 

181 return ("continue", "in_progress") 

182 

183 

184# --------------------------------------------------------------------------- 

185# Budget helpers — pure 

186# --------------------------------------------------------------------------- 

187 

188 

189def _wall_clock_exceeded(session: SessionState, now: datetime) -> bool: 

190 """True iff ``now - session["started_at"] >= max_wall_clock_seconds``. 

191 

192 Returns False when ``started_at`` is absent — a session that has 

193 never been transitioned out of ``pending`` cannot have exceeded any 

194 wall-clock budget. The engine writes ``started_at`` on the first 

195 iteration entry, so this guard only matters for the synthetic 

196 "decide called before run_iteration" path used in unit tests. 

197 

198 Returns False when ``max_wall_clock_seconds`` is the explicit 

199 ``-1`` "uncapped" sentinel — the operator opted out of the wall- 

200 clock cap and the cascade should fall through to the next branch 

201 rather than terminate spuriously. 

202 """ 

203 started_iso = session.get("started_at") 

204 if not started_iso: 204 ↛ 205line 204 didn't jump to line 205 because the condition on line 204 was never true

205 return False 

206 max_seconds = session["budget"]["max_wall_clock_seconds"] 

207 if max_seconds == -1: 

208 return False 

209 started = datetime.fromisoformat(started_iso) 

210 return now - started >= timedelta(seconds=max_seconds) 

211 

212 

213# --------------------------------------------------------------------------- 

214# Completion check 

215# --------------------------------------------------------------------------- 

216 

217 

218def _completion_satisfied( 

219 session: SessionState, 

220 iteration: IterationRecord, 

221) -> bool: 

222 """True iff every required Criterion is met and none are inconclusive. 

223 

224 A session completes when all Criteria with ``required=True`` have 

225 status ``met`` AND no Criterion (required or not) has status 

226 ``inconclusive``. The ``required`` flag lives on the Criterion 

227 declaration in ``session["criteria"]``; the per-iteration status 

228 lives on ``iteration["criteria_evaluation"]``. The two are joined 

229 by ``criterion_id``. 

230 

231 A session with zero declared Criteria can never complete on its own 

232 — there are no required Criteria for the cascade to satisfy. The 

233 operator drives such a session to terminal via ``mission_complete`` 

234 or a budget cap. We mirror that semantic here by returning False 

235 when the criteria list is empty. 

236 """ 

237 if not session["criteria"]: 

238 return False 

239 required_by_id = {c["criterion_id"]: c.get("required", True) for c in session["criteria"]} 

240 for result in iteration["criteria_evaluation"]: 

241 status = result["status"] 

242 if status == "inconclusive": 

243 return False 

244 if required_by_id.get(result["criterion_id"], True) and status != "met": 

245 return False 

246 return True 

247 

248 

249# --------------------------------------------------------------------------- 

250# Strategy_Revision_Heuristic 

251# --------------------------------------------------------------------------- 

252 

253 

254def _strategy_unproductive( 

255 session: SessionState, 

256 iteration: IterationRecord, 

257) -> tuple[bool, str]: 

258 """Pure heuristic for the Strategy_Revision check. 

259 

260 Two clauses, evaluated in declaration order. The first match wins 

261 so the returned reason is deterministic when both clauses fire. 

262 

263 * **Clause (a)** — the same ``tool_calls[*].tool_name`` sequence 

264 has been used for the last 3 iterations (counting the in-progress 

265 one) AND ``no_progress_counter >= ceil(stagnation_threshold / 2)``. 

266 Needs at least 2 prior iterations to evaluate (3 total when the 

267 current iteration is included). A scripted strategy contributes 

268 an empty sequence so two scripts with the same body register as 

269 "same sequence" — that's intentional: the heuristic flags repeats, 

270 and an empty-sequence repeat across three iterations is a repeat. 

271 * **Clause (b)** — the in-progress Observation contains at least 

272 one ``errors`` entry that did not appear in the immediately 

273 prior Iteration's Observation. Needs at least 1 prior iteration 

274 to evaluate. Without a prior to compare to, "new" is undefined 

275 and we return False. 

276 

277 Returns ``(False, "")`` when neither clause fires. When clause (a) 

278 fires, the reason is ``"tool_sequence_repeating"``; when clause (b) 

279 fires, ``"new_observation_errors"``. The reason string is 

280 informational only — the Verdict's ``verdict_reason`` is always 

281 ``"heuristic_unproductive"`` regardless of which clause matched. 

282 """ 

283 # Clause (a): no_progress threshold AND tool-sequence repeat. 

284 threshold = session["stagnation_threshold"] 

285 half = math.ceil(threshold / 2) 

286 if session["no_progress_counter"] >= half: 

287 # Need at least 2 prior + the current = 3 total iterations. 

288 prior = session["iterations"] 

289 if len(prior) >= 2: 289 ↛ 296line 289 didn't jump to line 296 because the condition on line 289 was always true

290 recent_three = [prior[-2], prior[-1], iteration] 

291 sequences = [_tool_name_sequence(it) for it in recent_three] 

292 if sequences[0] == sequences[1] == sequences[2]: 

293 return (True, "tool_sequence_repeating") 

294 

295 # Clause (b): new errors in the latest Observation vs the prior one. 

296 if session["iterations"]: 

297 prior_observation = session["iterations"][-1].get("observation") or {} 

298 prior_errors = list(prior_observation.get("errors") or []) 

299 current_errors = list(iteration["observation"].get("errors") or []) 

300 for err in current_errors: 

301 if err not in prior_errors: 301 ↛ 300line 301 didn't jump to line 300 because the condition on line 301 was always true

302 return (True, "new_observation_errors") 

303 

304 return (False, "") 

305 

306 

307def _tool_name_sequence(iteration: IterationRecord) -> tuple[str, ...]: 

308 """Extract the ordered tuple of ``tool_name``s from an iteration's strategy. 

309 

310 Returns an empty tuple when the strategy is a script (no 

311 ``tool_calls``) or when ``tool_calls`` is missing. Two scripted 

312 strategies therefore both produce ``()`` and compare equal — clause 

313 (a) treats that as "same sequence", which matches the operator's 

314 intent of flagging mechanical repetition regardless of mode. 

315 """ 

316 strategy = iteration.get("strategy") or {} 

317 tool_calls = strategy.get("tool_calls") or [] 

318 return tuple(str(call.get("tool_name", "")) for call in tool_calls if isinstance(call, dict)) 

319 

320 

321# --------------------------------------------------------------------------- 

322# Revision rationale template 

323# --------------------------------------------------------------------------- 

324 

325 

326def build_revision_rationale_template( 

327 session: SessionState, 

328 iteration: IterationRecord, 

329) -> str: 

330 """Build the deterministic ``revision_rationale`` text for an ``adjust`` verdict. 

331 

332 Used both as the rationale on sessions with ``use_sampling=false`` 

333 and as the fallback rationale when sampling is rejected on a 

334 ``use_sampling=true`` session. 

335 Pure: depends only on persisted Session/Iteration fields, never 

336 calls into the sampler or any other non-deterministic component. 

337 

338 The rendered text names the iteration index (1-indexed for 

339 operator-friendliness), the heuristic reason, the unmet Criterion 

340 ids (so the rationale points at the goal that's still moving), and 

341 a one-line summary of the in-progress strategy (tool-name sequence 

342 or ``"scripted strategy"``). The format is intentionally short and 

343 machine-parseable — operators can grep it; no LLM is involved. 

344 """ 

345 # Resolve the iteration index — the in-progress iteration has not 

346 # been appended to session["iterations"] yet, so its 0-indexed 

347 # position equals len(iterations) and the 1-indexed position is +1. 

348 iteration_index_one_based = len(session["iterations"]) + 1 

349 

350 # Match the heuristic again so the rationale text matches whichever 

351 # clause actually fired. Both calls are pure and cheap. 

352 _, heuristic_reason = _strategy_unproductive(session, iteration) 

353 if not heuristic_reason: 

354 # decide_verdict only emits ``adjust`` when the heuristic fires, 

355 # but the caller may invoke this template independently (e.g. 

356 # the sampling-fallback path on a non-heuristic adjust) — fall 

357 # back to a generic reason so the template stays usable. 

358 heuristic_reason = "strategy_review_requested" 

359 

360 unmet_ids = [ 

361 result["criterion_id"] 

362 for result in iteration["criteria_evaluation"] 

363 if result["status"] == "unmet" 

364 ] 

365 unmet_summary = ", ".join(unmet_ids) if unmet_ids else "none" 

366 

367 strategy = iteration.get("strategy") or {} 

368 if "script" in strategy: 

369 strategy_summary = "scripted strategy" 

370 else: 

371 names = _tool_name_sequence(iteration) 

372 strategy_summary = ", ".join(names) if names else "no tool calls" 

373 

374 no_progress = session["no_progress_counter"] 

375 threshold = session["stagnation_threshold"] 

376 

377 return ( 

378 f"Strategy revised on iteration {iteration_index_one_based}: " 

379 f"{heuristic_reason}. Unmet criteria: {unmet_summary}. " 

380 f"Last strategy: {strategy_summary}. " 

381 f"No-progress counter: {no_progress}/{threshold}. " 

382 f"Adjusting approach for next iteration." 

383 )