Coverage for mcp/mission/decide.py: 97%

"""Pure deterministic verdict cascade for the Mission Decide_Phase.

The cascade is the **control-path** output of the loop: given the current
``SessionState`` (before the in-progress iteration is appended), the
in-progress :class:`IterationRecord` (with ``strategy``, ``observation``,
and ``criteria_evaluation`` already populated but ``verdict`` /
``verdict_reason`` not yet set), and the wall-clock value the caller has
already measured, :func:`decide_verdict` returns a
``(VerdictLabel, VerdictReason)`` tuple. The function is pure: no logger
calls, no I/O, no random sources, no clock reads. The wall-clock value is
passed in on the call signature so tests can pin it.

The cascade order is fixed. Before step 1, :func:`decide_verdict`
propagates any ``sandbox_terminated_reason`` set on the in-progress
iteration as a ``terminate`` verdict (step 0 in the function body). The
remaining steps are:

1. **Budget terminations** — checked in a fixed sub-order so the
   verdict_reason is deterministic when more than one cap is breached:

   * ``max_iterations`` — the in-progress iteration would be the
     ``budget["max_iterations"]``-th or later. Computed as
     ``len(session["iterations"]) + 1 >= max_iterations``. A value of
     ``-1`` means "uncapped" and never terminates.
   * ``max_wall_clock`` — ``now - session["started_at"] >= max_wall_clock_seconds``.
     Evaluates to False when ``started_at`` is missing (the session has
     not yet transitioned out of ``pending``) or when the cap is the
     ``-1`` "uncapped" sentinel.
   * ``no_progress`` — ``no_progress_counter >= stagnation_threshold``.
     When the session has ``use_sampling=true``, the heuristic (step
     4) gets priority so the sampler can revise the strategy before
     the loop terminates. Without sampling, ``no_progress``
     terminates immediately. If the heuristic doesn't fire (e.g.,
     the tool sequence changed after a prior revision), the deferred
     stagnation check (step 4b) terminates.

2. **Completion** — every ``required=True`` Criterion has status
   ``met`` in the in-progress iteration's ``criteria_evaluation``, AND
   no Criterion (required or not) has status ``inconclusive``.

3. **Cadence-skip** — when :func:`should_evaluate_now` says "this is
   not a checkpoint", emit a synthetic ``("continue", "cadence_skip")``
   without consulting the Strategy_Revision_Heuristic. The heuristic
   only fires on real checkpoints so off-cadence iterations cannot
   advance the no-progress counter or trigger an ``adjust``.

4. **Strategy_Revision_Heuristic** — :func:`_strategy_unproductive`:
   the same ``tool_calls[*].tool_name`` sequence for the last 3
   iterations (counting the in-progress one) AND
   ``no_progress_counter >= ceil(stagnation_threshold / 2)``, OR new
   errors in the latest Observation that didn't appear in the prior
   Observation. Returns ``("adjust", "heuristic_unproductive")`` when
   either clause fires.

4b. **Deferred stagnation** — if step 1 deferred the ``no_progress``
    check (because sampling is enabled) and the heuristic didn't fire,
    terminate now with ``("terminate", "no_progress")``.

5. **Default** — ``("continue", "in_progress")``.

The ``iteration`` argument is *not* yet present in
``session["iterations"]`` — the engine appends it after the verdict is
decided. Anything that needs to look at "the last N iterations
including the current one" composes the current ``iteration`` with
``session["iterations"][-(N-1):]``.

Determinism: same ``(session, iteration, now)`` triples produce the same
``(VerdictLabel, VerdictReason)`` tuples. This is enforced by a property
test in ``tests/test_mission_decide_determinism.py``.

Cost guardrails are intentionally absent from this cascade. Real-time
workload cost tracking is structurally inaccurate (Spot vs on-demand
drift, EBS / EFA / egress not in the Pricing API, Cost Explorer 24h
latency). Operators who need a cost cap should configure AWS Budgets
and Cost Anomaly Detection at the account level — Mission caps only
the controls the loop has direct visibility into.
"""

from __future__ import annotations

import math
from datetime import datetime, timedelta

from .checkpoints import should_evaluate_now
from .types import IterationRecord, SessionState, VerdictLabel, VerdictReason

# <pyflowchart-code-diagram> BEGIN - auto-inserted, do not edit
# Flowchart(s) generated from this file:
#   * ``decide_verdict`` -> ``diagrams/code_diagrams/mcp/mission/decide.decide_verdict.html``
#     (PNG: ``diagrams/code_diagrams/mcp/mission/decide.decide_verdict.png``)
# Regenerate with ``python diagrams/code_diagrams/generate.py``.
# <pyflowchart-code-diagram> END


__all__ = [
    "build_revision_rationale_template",
    "decide_verdict",
]
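
# Illustration only, not part of this module: a minimal sketch of the
# "last N iterations including the current one" composition described in
# the module docstring, using plain lists in place of
# ``session["iterations"]``. The helper name ``last_n_including_current``
# is hypothetical. It also shows why N == 1 probably needs its own branch:
# ``recorded[-0:]`` is the whole list, not an empty one.

```python
from typing import List


def last_n_including_current(recorded: List[str], current: str, n: int) -> List[str]:
    """Return up to ``n`` items: the tail of ``recorded`` plus ``current``."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # For n == 1 the slice would be recorded[-0:], which Python treats as
    # recorded[0:] (every element), so that case is handled explicitly.
    prior = recorded[-(n - 1):] if n > 1 else []
    return prior + [current]


recorded = ["it0", "it1", "it2"]  # stand-ins for already-persisted records
window_of_three = last_n_including_current(recorded, "it3", 3)
window_of_one = last_n_including_current(recorded, "it3", 1)
window_oversized = last_n_including_current(recorded, "it3", 10)
```

# With fewer than N - 1 recorded iterations the slice simply returns what
# exists, so callers that need exactly N items still have to check the
# length, as clause (a) of the heuristic does with ``len(prior) >= 2``.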

def decide_verdict(
    session: SessionState,
    iteration: IterationRecord,
    now: datetime,
) -> tuple[VerdictLabel, VerdictReason]:
    """Return the deterministic Verdict for the in-progress iteration.

    The cascade order is fixed (see module docstring). The first matching
    branch wins: a session that has both run out of iterations and has
    every Criterion met returns ``("terminate", "max_iterations")``, not
    ``("complete", "criteria_met")`` — budget caps are evaluated before
    completion so the operator can tell the loop ended because it ran
    out of budget rather than because the goal was reached on the
    closing iteration.

    Note: When the prior Execute_Phase ran a scripted Strategy and the
    sandbox cap fired, ``_execute_script`` writes
    ``iteration["sandbox_terminated_reason"]`` and the cascade returns
    that reason verbatim before anything else is consulted. The
    sandbox limit is a true budget cap — the script ran out of wall
    clock during execution — so it routes to a ``terminate`` verdict
    on the same path as the ``BudgetControls``-driven caps below.
    """
    # 0. Sandbox-cap propagation. ``_execute_script`` stashes the
    # reason on the in-progress iteration when the sandbox runner
    # raised :class:`SandboxTerminated`. Reading the sentinel here
    # means the engine's Execute_Phase can complete cleanly (no phase
    # failure) while still routing the verdict to the budget-cap path.
    sandbox_reason = iteration.get("sandbox_terminated_reason")
    if sandbox_reason is not None:
        return ("terminate", sandbox_reason)

    # 1a. max_iterations — the +1 captures "this in-progress iteration
    # would be the Nth one to land", so a session with budget=N and N-1
    # already-recorded iterations terminates on the Nth's Decide_Phase.
    # ``-1`` is the explicit "uncapped" sentinel; the validator
    # already enforced that any other non-positive value is rejected.
    max_iter = session["budget"]["max_iterations"]
    if max_iter != -1 and len(session["iterations"]) + 1 >= max_iter:
        return ("terminate", "max_iterations")
    # 1b. max_wall_clock — pure time arithmetic; missing started_at
    # means the session has not yet recorded its first iteration's
    # start, so no wall-clock can be measured.
    if _wall_clock_exceeded(session, now):
        return ("terminate", "max_wall_clock")
    # 1c. no_progress — the counter is incremented by the engine only
    # on evaluated iterations, so a session with all-skipped checkpoints
    # cannot terminate for stagnation. When the session has sampling
    # enabled, the heuristic gets priority (step 4 below) so the
    # sampler can revise the strategy before the loop terminates.
    # Without sampling, ``adjust`` is purely informational and
    # ``no_progress`` terminates immediately.
    if session["no_progress_counter"] >= session["stagnation_threshold"]:
        if not session.get("use_sampling"):
            return ("terminate", "no_progress")
        # With sampling enabled, fall through to the heuristic check
        # below. If the heuristic fires, the sampler gets one more
        # chance. If it doesn't fire (e.g., the tool sequence changed
        # after a prior revision), terminate for stagnation.
        _stagnation_pending = True
    else:
        _stagnation_pending = False

    # 2. Completion — every required Criterion met AND nothing inconclusive.
    if _completion_satisfied(session, iteration):
        return ("complete", "criteria_met")

    # 3. Cadence-skip — bail before the heuristic fires so off-cadence
    # iterations don't ever produce ``adjust``. The iteration_index
    # passed to ``should_evaluate_now`` is the 0-indexed position of
    # the in-progress iteration (which equals the count of already-
    # persisted iterations).
    if not should_evaluate_now(session, len(session["iterations"]), now):
        return ("continue", "cadence_skip")

    # 4. Strategy_Revision_Heuristic.
    unproductive, _heuristic_reason = _strategy_unproductive(session, iteration)
    if unproductive:
        return ("adjust", "heuristic_unproductive")

    # 4b. Deferred stagnation — the counter hit the threshold but the
    # heuristic didn't fire (e.g., the tool sequence changed after a
    # prior sampled revision). Terminate now.
    if _stagnation_pending:
        return ("terminate", "no_progress")

    # 5. Default.
    return ("continue", "in_progress")
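

# Illustration only, not part of this module: a reduced, standalone sketch
# of the first-match-wins ordering above, with plain booleans and integers
# standing in for ``SessionState`` and the helper calls. The function name
# ``reduced_cascade`` and all of its parameters are hypothetical, and the
# sandbox (step 0) and wall-clock (step 1b) branches are left out for
# brevity.

```python
from typing import Tuple


def reduced_cascade(
    recorded: int,
    max_iter: int,
    counter: int,
    threshold: int,
    use_sampling: bool,
    completed: bool,
    is_checkpoint: bool,
    heuristic_fires: bool,
) -> Tuple[str, str]:
    """Walk the cascade in order and return the first matching verdict."""
    # 1a. Iteration cap; -1 means "uncapped".
    if max_iter != -1 and recorded + 1 >= max_iter:
        return ("terminate", "max_iterations")
    # 1c. Stagnation terminates at once unless sampling defers it.
    stagnation_pending = counter >= threshold
    if stagnation_pending and not use_sampling:
        return ("terminate", "no_progress")
    # 2. Completion is only reached when no budget cap matched.
    if completed:
        return ("complete", "criteria_met")
    # 3. Off-cadence iterations return before the heuristic runs.
    if not is_checkpoint:
        return ("continue", "cadence_skip")
    # 4. Heuristic, then 4b. the deferred stagnation check.
    if heuristic_fires:
        return ("adjust", "heuristic_unproductive")
    if stagnation_pending:
        return ("terminate", "no_progress")
    return ("continue", "in_progress")


base = dict(recorded=1, max_iter=-1, counter=0, threshold=3, use_sampling=False,
            completed=False, is_checkpoint=True, heuristic_fires=False)

budget_beats_completion = reduced_cascade(**{**base, "recorded": 4, "max_iter": 5, "completed": True})
stalled_no_sampling = reduced_cascade(**{**base, "counter": 3})
stalled_sampling_adjusts = reduced_cascade(**{**base, "counter": 3, "use_sampling": True, "heuristic_fires": True})
stalled_sampling_deferred = reduced_cascade(**{**base, "counter": 3, "use_sampling": True})
stalled_sampling_off_cadence = reduced_cascade(**{**base, "counter": 3, "use_sampling": True, "is_checkpoint": False})
```

# The last case shows a consequence of the ordering: with sampling enabled,
# a pending stagnation is not acted on during an off-cadence iteration,
# because the cadence-skip branch returns before step 4b is reached.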


# ---------------------------------------------------------------------------
# Budget helpers — pure
# ---------------------------------------------------------------------------


def _wall_clock_exceeded(session: SessionState, now: datetime) -> bool:
    """True iff ``now - session["started_at"] >= max_wall_clock_seconds``.

    Returns False when ``started_at`` is absent — a session that has
    never been transitioned out of ``pending`` cannot have exceeded any
    wall-clock budget. The engine writes ``started_at`` on the first
    iteration entry, so this guard only matters for the synthetic
    "decide called before run_iteration" path used in unit tests.

    Returns False when ``max_wall_clock_seconds`` is the explicit
    ``-1`` "uncapped" sentinel — the operator opted out of the wall-
    clock cap and the cascade should fall through to the next branch
    rather than terminate spuriously.
    """
    started_iso = session.get("started_at")
    if not started_iso:
        return False
    max_seconds = session["budget"]["max_wall_clock_seconds"]
    if max_seconds == -1:
        return False
    started = datetime.fromisoformat(started_iso)
    return now - started >= timedelta(seconds=max_seconds)
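

# Illustration only, not part of this module: a standalone sketch of the
# wall-clock arithmetic with a pinned ``now``, which is how a test might
# exercise the inclusive ``>=`` boundary. The function name
# ``wall_clock_exceeded`` and the timestamps are hypothetical. The sketch
# assumes ``started_at`` is stored as a timezone-aware ISO 8601 string; the
# storage format is not shown in this file.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def wall_clock_exceeded(started_iso: Optional[str], max_seconds: int, now: datetime) -> bool:
    """Mirror the guard order above: missing start, then -1, then arithmetic."""
    if not started_iso:
        return False
    if max_seconds == -1:
        return False
    started = datetime.fromisoformat(started_iso)
    return now - started >= timedelta(seconds=max_seconds)


started_at = "2024-01-01T00:00:00+00:00"
pinned_now = datetime(2024, 1, 1, 0, 10, 0, tzinfo=timezone.utc)  # 600 s later

at_cap = wall_clock_exceeded(started_at, 600, pinned_now)
under_cap = wall_clock_exceeded(started_at, 601, pinned_now)
uncapped = wall_clock_exceeded(started_at, -1, pinned_now)
not_started = wall_clock_exceeded(None, 600, pinned_now)
```

# Subtracting a naive ``datetime`` from an aware one raises ``TypeError``,
# so the caller's ``now`` has to match the awareness of the stored
# ``started_at`` string.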


# ---------------------------------------------------------------------------
# Completion check
# ---------------------------------------------------------------------------


def _completion_satisfied(
    session: SessionState,
    iteration: IterationRecord,
) -> bool:
    """True iff every required Criterion is met and none are inconclusive.

    A session completes when all Criteria with ``required=True`` have
    status ``met`` AND no Criterion (required or not) has status
    ``inconclusive``. The ``required`` flag lives on the Criterion
    declaration in ``session["criteria"]``; the per-iteration status
    lives on ``iteration["criteria_evaluation"]``. The two are joined
    by ``criterion_id``.

    A session with zero declared Criteria can never complete on its own
    — there are no required Criteria for the cascade to satisfy. The
    operator drives such a session to terminal via ``mission_complete``
    or a budget cap. We mirror that semantic here by returning False
    when the criteria list is empty.
    """
    if not session["criteria"]:
        return False
    required_by_id = {c["criterion_id"]: c.get("required", True) for c in session["criteria"]}
    for result in iteration["criteria_evaluation"]:
        status = result["status"]
        if status == "inconclusive":
            return False
        if required_by_id.get(result["criterion_id"], True) and status != "met":
            return False
    return True
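

# Illustration only, not part of this module: a standalone sketch of the
# ``criterion_id`` join described above, run on small hand-written inputs.
# The function name ``completion_satisfied`` and the ids ``c1`` / ``c2`` are
# hypothetical. The last case rests on an assumption worth checking against
# the engine: the loop walks result rows only, so it appears to rely on
# ``criteria_evaluation`` carrying one row per declared Criterion.

```python
from typing import Any, Dict, List


def completion_satisfied(criteria: List[Dict[str, Any]], evaluation: List[Dict[str, str]]) -> bool:
    """Same logic as the helper above, on plain lists of dicts."""
    if not criteria:
        return False
    required_by_id = {c["criterion_id"]: c.get("required", True) for c in criteria}
    for result in evaluation:
        if result["status"] == "inconclusive":
            return False
        if required_by_id.get(result["criterion_id"], True) and result["status"] != "met":
            return False
    return True


criteria = [
    {"criterion_id": "c1", "required": True},
    {"criterion_id": "c2", "required": False},
]

# Required met, optional unmet: completes.
done = completion_satisfied(criteria, [
    {"criterion_id": "c1", "status": "met"},
    {"criterion_id": "c2", "status": "unmet"},
])
# An inconclusive optional Criterion still blocks completion.
blocked = completion_satisfied(criteria, [
    {"criterion_id": "c1", "status": "met"},
    {"criterion_id": "c2", "status": "inconclusive"},
])
# No declared Criteria: never completes on its own.
empty = completion_satisfied([], [])
# Required c1 has no result row at all: the loop never sees it.
missing_row = completion_satisfied(criteria, [{"criterion_id": "c2", "status": "met"}])
```

# ``missing_row`` evaluates to True in this sketch, which is only the
# intended outcome if the engine guarantees a result row for every
# required Criterion.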


# ---------------------------------------------------------------------------
# Strategy_Revision_Heuristic
# ---------------------------------------------------------------------------


def _strategy_unproductive(
    session: SessionState,
    iteration: IterationRecord,
) -> tuple[bool, str]:
    """Pure heuristic for the Strategy_Revision check.

    Two clauses, evaluated in declaration order. The first match wins
    so the returned reason is deterministic when both clauses fire.

    * **Clause (a)** — the same ``tool_calls[*].tool_name`` sequence
      has been used for the last 3 iterations (counting the in-progress
      one) AND ``no_progress_counter >= ceil(stagnation_threshold / 2)``.
      Needs at least 2 prior iterations to evaluate (3 total when the
      current iteration is included). A scripted strategy contributes
      an empty sequence so two scripts with the same body register as
      "same sequence" — that's intentional: the heuristic flags repeats,
      and an empty-sequence repeat across three iterations is a repeat.
    * **Clause (b)** — the in-progress Observation contains at least
      one ``errors`` entry that did not appear in the immediately
      prior Iteration's Observation. Needs at least 1 prior iteration
      to evaluate. Without a prior to compare to, "new" is undefined
      and we return False.

    Returns ``(False, "")`` when neither clause fires. When clause (a)
    fires, the reason is ``"tool_sequence_repeating"``; when clause (b)
    fires, ``"new_observation_errors"``. The reason string is
    informational only — the Verdict's ``verdict_reason`` is always
    ``"heuristic_unproductive"`` regardless of which clause matched.
    """
    # Clause (a): no_progress threshold AND tool-sequence repeat.
    threshold = session["stagnation_threshold"]
    half = math.ceil(threshold / 2)
    if session["no_progress_counter"] >= half:
        # Need at least 2 prior + the current = 3 total iterations.
        prior = session["iterations"]
        if len(prior) >= 2:
            recent_three = [prior[-2], prior[-1], iteration]
            sequences = [_tool_name_sequence(it) for it in recent_three]
            if sequences[0] == sequences[1] == sequences[2]:
                return (True, "tool_sequence_repeating")

    # Clause (b): new errors in the latest Observation vs the prior one.
    if session["iterations"]:
        prior_observation = session["iterations"][-1].get("observation") or {}
        prior_errors = list(prior_observation.get("errors") or [])
        current_errors = list(iteration["observation"].get("errors") or [])
        for err in current_errors:
            if err not in prior_errors:
                return (True, "new_observation_errors")

    return (False, "")
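

# Illustration only, not part of this module: a standalone sketch of the
# two clauses on plain tuples and lists, mainly to make the
# ``ceil(stagnation_threshold / 2)`` gate concrete for odd thresholds. The
# helper names ``sequence_repeating`` and ``has_new_errors``, the tool names
# and the error strings are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

ToolSeq = Tuple[str, ...]


def sequence_repeating(prior: List[ToolSeq], current: ToolSeq, counter: int, threshold: int) -> bool:
    """Clause (a): counter gate, then three identical sequences in a row."""
    if counter < math.ceil(threshold / 2):
        return False
    if len(prior) < 2:
        return False
    return prior[-2] == prior[-1] == current


def has_new_errors(prior_errors: Sequence[str], current_errors: Sequence[str]) -> bool:
    """Clause (b): any current error that the prior observation lacked."""
    return any(err not in prior_errors for err in current_errors)


# The gate rounds up, so threshold 5 needs a counter of 3, not 2.
gates = {t: math.ceil(t / 2) for t in (1, 2, 3, 4, 5)}

same = ("describe", "scale")
fires = sequence_repeating([same, same], same, counter=2, threshold=4)
below_gate = sequence_repeating([same, same], same, counter=1, threshold=4)
sequence_changed = sequence_repeating([same, ("describe",)], same, counter=2, threshold=4)
scripted_repeat = sequence_repeating([(), ()], (), counter=3, threshold=5)

new_error = has_new_errors(["timeout"], ["timeout", "access denied"])
same_errors = has_new_errors(["timeout"], ["timeout"])
```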



def _tool_name_sequence(iteration: IterationRecord) -> tuple[str, ...]:
    """Extract the ordered tuple of ``tool_name``s from an iteration's strategy.

    Returns an empty tuple when the strategy is a script (no
    ``tool_calls``) or when ``tool_calls`` is missing. Two scripted
    strategies therefore both produce ``()`` and compare equal — clause
    (a) treats that as "same sequence", which matches the operator's
    intent of flagging mechanical repetition regardless of mode.
    """
    strategy = iteration.get("strategy") or {}
    tool_calls = strategy.get("tool_calls") or []
    return tuple(str(call.get("tool_name", "")) for call in tool_calls if isinstance(call, dict))
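

# Illustration only, not part of this module: the same extraction applied
# to a few hand-written iteration dicts, to show what each shape yields.
# The function name ``tool_name_sequence``, the tool names and the script
# body are hypothetical.

```python
from typing import Any, Dict, Tuple


def tool_name_sequence(iteration: Dict[str, Any]) -> Tuple[str, ...]:
    """Same expression as the helper above, on a plain dict."""
    strategy = iteration.get("strategy") or {}
    tool_calls = strategy.get("tool_calls") or []
    return tuple(str(call.get("tool_name", "")) for call in tool_calls if isinstance(call, dict))


# Non-dict entries are skipped; a dict without "tool_name" contributes "".
planned = {"strategy": {"tool_calls": [{"tool_name": "describe"}, "junk", {"args": {}}]}}
scripted = {"strategy": {"script": "print('hello')"}}
no_strategy = {"strategy": None}

planned_seq = tool_name_sequence(planned)
scripted_seq = tool_name_sequence(scripted)
missing_seq = tool_name_sequence(no_strategy)
```

# Because a scripted strategy and a missing strategy both map to ``()``,
# clause (a) cannot tell them apart; that matches the docstring's stated
# intent of flagging repetition regardless of mode.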



# ---------------------------------------------------------------------------
# Revision rationale template
# ---------------------------------------------------------------------------


def build_revision_rationale_template(
    session: SessionState,
    iteration: IterationRecord,
) -> str:
    """Build the deterministic ``revision_rationale`` text for an ``adjust`` verdict.

    Used both as the rationale on sessions with ``use_sampling=false``
    and as the fallback rationale when sampling is rejected on a
    ``use_sampling=true`` session.
    Pure: depends only on persisted Session/Iteration fields, never
    calls into the sampler or any other non-deterministic component.

    The rendered text names the iteration index (1-indexed for
    operator-friendliness), the heuristic reason, the unmet Criterion
    ids (so the rationale points at the goal that's still moving), and
    a one-line summary of the in-progress strategy (tool-name sequence
    or ``"scripted strategy"``). The format is intentionally short and
    machine-parseable — operators can grep it; no LLM is involved.
    """
    # Resolve the iteration index — the in-progress iteration has not
    # been appended to session["iterations"] yet, so its 0-indexed
    # position equals len(iterations) and the 1-indexed position is +1.
    iteration_index_one_based = len(session["iterations"]) + 1

    # Re-run the heuristic so the rationale text names whichever
    # clause actually fired. Both calls are pure and cheap.
    _, heuristic_reason = _strategy_unproductive(session, iteration)
    if not heuristic_reason:
        # decide_verdict only emits ``adjust`` when the heuristic fires,
        # but the caller may invoke this template independently (e.g.
        # the sampling-fallback path on a non-heuristic adjust) — fall
        # back to a generic reason so the template stays usable.
        heuristic_reason = "strategy_review_requested"

    unmet_ids = [
        result["criterion_id"]
        for result in iteration["criteria_evaluation"]
        if result["status"] == "unmet"
    ]
    unmet_summary = ", ".join(unmet_ids) if unmet_ids else "none"

    strategy = iteration.get("strategy") or {}
    if "script" in strategy:
        strategy_summary = "scripted strategy"
    else:
        names = _tool_name_sequence(iteration)
        strategy_summary = ", ".join(names) if names else "no tool calls"

    no_progress = session["no_progress_counter"]
    threshold = session["stagnation_threshold"]

    return (
        f"Strategy revised on iteration {iteration_index_one_based}: "
        f"{heuristic_reason}. Unmet criteria: {unmet_summary}. "
        f"Last strategy: {strategy_summary}. "
        f"No-progress counter: {no_progress}/{threshold}. "
        "Adjusting approach for next iteration."
    )
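

# Illustration only, not part of this module: a standalone rendering of the
# same sentence layout with hand-picked values, followed by one way an
# operator-side script might pull the counter back out of the text. The
# function name ``render_rationale``, the sample values and the regular
# expression are hypothetical; only the sentence layout is taken from the
# template above.

```python
import re
from typing import Sequence


def render_rationale(
    index: int,
    reason: str,
    unmet_ids: Sequence[str],
    tool_names: Sequence[str],
    counter: int,
    threshold: int,
) -> str:
    """Render the rationale sentence from already-resolved values."""
    unmet_summary = ", ".join(unmet_ids) if unmet_ids else "none"
    strategy_summary = ", ".join(tool_names) if tool_names else "no tool calls"
    return (
        f"Strategy revised on iteration {index}: "
        f"{reason}. Unmet criteria: {unmet_summary}. "
        f"Last strategy: {strategy_summary}. "
        f"No-progress counter: {counter}/{threshold}. "
        "Adjusting approach for next iteration."
    )


text = render_rationale(4, "tool_sequence_repeating", ["c1", "c3"], ("describe", "scale"), 2, 4)

# Recover the counter and threshold from the rendered sentence.
match = re.search(r"No-progress counter: (\d+)/(\d+)\.", text)
parsed_counter, parsed_threshold = (int(group) for group in match.groups())
```

# Criterion ids or tool names that themselves contain ``". "`` or ``", "``
# would make the sentence ambiguous to split, so anchoring on the fixed
# labels, as the regular expression does, is the safer way to parse it.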