Coverage for mcp/mission/decide.py: 97%
87 statements
coverage.py v7.14.1, created at 2026-06-15 15:07 +0000
1"""Pure deterministic verdict cascade for the Mission Decide_Phase.
3The cascade is the **control-path** output of the loop: given the current
4``SessionState`` (before the in-progress iteration is appended), the
5in-progress :class:`IterationRecord` (with ``strategy``, ``observation``,
6and ``criteria_evaluation`` already populated but ``verdict`` /
7``verdict_reason`` not yet set), and the wall-clock value the caller has
8already measured, :func:`decide_verdict` returns a
9``(VerdictLabel, VerdictReason)`` tuple. The function is pure: no logger
10calls, no I/O, no random sources, no clock reads. The wall-clock value is
11passed in on the call signature so tests can pin it.
13The cascade order is fixed:
15 1. **Budget terminations** — checked in a fixed sub-order so the
16 verdict_reason is deterministic when more than one cap is breached:
18 * ``max_iterations`` — the in-progress iteration would be the
19 ``budget["max_iterations"]``-th or later. Computed as
20 ``len(session["iterations"]) + 1 >= max_iterations``.
21 * ``max_wall_clock`` — ``now - session["started_at"] >= max_wall_clock_seconds``.
22 Evaluates to False when ``started_at`` is missing (the session has
23 not yet transitioned out of ``pending``) or when the cap is the ``-1`` "uncapped" sentinel.
24 * ``no_progress`` — ``no_progress_counter >= stagnation_threshold``.
25 When the session has ``use_sampling=true``, the heuristic (step
26 4) gets priority so the sampler can revise the strategy before
27 the loop terminates. Without sampling, ``no_progress``
28 terminates immediately. If the heuristic doesn't fire (e.g.,
29 the tool sequence changed after a prior revision), the deferred
30 stagnation check (step 4b) terminates.
32 2. **Completion** — every ``required=True`` Criterion has status
33 ``met`` in the in-progress iteration's ``criteria_evaluation``, AND
34 no Criterion (required or not) has status ``inconclusive``.
36 3. **Cadence-skip** — when :func:`should_evaluate_now` says "this is
37 not a checkpoint", emit a synthetic ``("continue", "cadence_skip")``
38 without consulting the Strategy_Revision_Heuristic. The heuristic
39 only fires on real checkpoints so off-cadence iterations cannot
40 advance the no-progress counter or trigger an ``adjust``.
42 4. **Strategy_Revision_Heuristic** — :func:`_strategy_unproductive`:
43 the same ``tool_calls[*].tool_name`` sequence
44 for the last 3 iterations AND ``no_progress_counter`` at or above
45 half the stagnation threshold, OR new errors in the latest
46 Observation that didn't appear in the prior Observation. Returns
47 ``("adjust", "heuristic_unproductive")`` when either clause fires.
49 4b. **Deferred stagnation** — if step 1c deferred the ``no_progress``
50 check (because sampling is enabled) and the heuristic didn't fire,
51 terminate now.
53 5. **Default** — ``("continue", "in_progress")``.
55The ``iteration`` argument is *not* yet present in
56``session["iterations"]`` — the engine appends it after the verdict is
57decided. Anything that needs to look at "the last N iterations
58including the current one" composes the current ``iteration`` with
59``session["iterations"][-(N-1):]``.
61Determinism: the same ``(session, iteration, now)`` triple always produces the same
62``(VerdictLabel, VerdictReason)`` tuple. This is enforced by a property
63test in ``tests/test_mission_decide_determinism.py``.
65Cost guardrails are intentionally absent from this cascade. Real-time
66workload cost tracking is structurally inaccurate (Spot vs on-demand
67drift, EBS / EFA / egress not in the Pricing API, Cost Explorer 24h
68latency). Operators who need a cost cap should configure AWS Budgets
69and Cost Anomaly Detection at the account level — Mission caps only
70the controls the loop has direct visibility into.
71"""
73from __future__ import annotations
75import math
76from datetime import datetime, timedelta
78from .checkpoints import should_evaluate_now
79from .types import IterationRecord, SessionState, VerdictLabel, VerdictReason
81# <pyflowchart-code-diagram> BEGIN - auto-inserted, do not edit
82# Flowchart(s) generated from this file:
83# * ``decide_verdict`` -> ``diagrams/code_diagrams/mcp/mission/decide.decide_verdict.html``
84# (PNG: ``diagrams/code_diagrams/mcp/mission/decide.decide_verdict.png``)
85# Regenerate with ``python diagrams/code_diagrams/generate.py``.
86# <pyflowchart-code-diagram> END
89__all__ = [
90 "build_revision_rationale_template",
91 "decide_verdict",
92]
95def decide_verdict(
96 session: SessionState,
97 iteration: IterationRecord,
98 now: datetime,
99) -> tuple[VerdictLabel, VerdictReason]:
100 """Return the deterministic Verdict for the in-progress iteration.
102 The cascade order is fixed (see module docstring). The first matching
103 branch wins: a session that has both run out of iterations and has
104 every Criterion met returns ``("terminate", "max_iterations")``, not
105 ``("complete", "criteria_met")`` — budget caps are evaluated before
106 completion so the operator can tell the loop ended because it ran
107 out of budget rather than because the goal was reached on the
108 closing iteration.
109 Note: When the prior Execute_Phase ran a scripted Strategy and the
110 sandbox cap fired, ``_execute_script`` writes
111 ``iteration["sandbox_terminated_reason"]`` and the cascade returns
112 that reason verbatim before anything else is consulted. The
113 sandbox limit is a true budget cap — the script ran out of wall
114 clock during execution — so it routes to a ``terminate`` verdict
115 on the same path as the ``BudgetControls``-driven caps below.
116 """
117 # 0. Sandbox-cap propagation. ``_execute_script`` stashes the
118 # reason on the in-progress iteration when the sandbox runner
119 # raised :class:`SandboxTerminated`. Reading the sentinel here
120 # means the engine's Execute_Phase can complete cleanly (no phase
121 # failure) while still routing the verdict to the budget-cap path.
122 sandbox_reason = iteration.get("sandbox_terminated_reason")
123 if sandbox_reason is not None:
124 return ("terminate", sandbox_reason)
126 # 1a. max_iterations — the +1 captures "this in-progress iteration
127 # would be the Nth one to land", so a session with budget=N and N-1
128 # already-recorded iterations terminates on the Nth's Decide_Phase.
129 # ``-1`` is the explicit "uncapped" sentinel; the validator
130 # already enforced that any other non-positive value is rejected.
131 max_iter = session["budget"]["max_iterations"]
132 if max_iter != -1 and len(session["iterations"]) + 1 >= max_iter:
133 return ("terminate", "max_iterations")
134 # 1b. max_wall_clock — pure time arithmetic; missing started_at
135 # means the session has not yet recorded its first iteration's
136 # start, so no wall-clock can be measured.
137 if _wall_clock_exceeded(session, now):
138 return ("terminate", "max_wall_clock")
139 # 1c. no_progress — the counter is incremented by the engine only
140 # on evaluated iterations, so a session with all-skipped checkpoints
141 # cannot terminate for stagnation. When the session has sampling
142 # enabled, the heuristic gets priority (step 4 below) so the
143 # sampler can revise the strategy before the loop terminates.
144 # Without sampling, ``adjust`` is purely informational and
145 # ``no_progress`` terminates immediately.
146 if session["no_progress_counter"] >= session["stagnation_threshold"]:
147 if not session.get("use_sampling"):
148 return ("terminate", "no_progress")
149 # With sampling enabled, fall through to the heuristic check
150 # below. If the heuristic fires, the sampler gets one more
151 # chance. If it doesn't fire (e.g., the tool sequence changed
152 # after a prior revision), terminate for stagnation.
153 _stagnation_pending = True
154 else:
155 _stagnation_pending = False
157 # 2. Completion — every required Criterion met AND nothing inconclusive.
158 if _completion_satisfied(session, iteration):
159 return ("complete", "criteria_met")
161 # 3. Cadence-skip — bail before the heuristic fires so off-cadence
162 # iterations don't ever produce ``adjust``. The iteration_index
163 # passed to ``should_evaluate_now`` is the 0-indexed position of
164 # the in-progress iteration (which equals the count of already-
165 # persisted iterations).
166 if not should_evaluate_now(session, len(session["iterations"]), now):
167 return ("continue", "cadence_skip")
169 # 4. Strategy_Revision_Heuristic.
170 unproductive, _heuristic_reason = _strategy_unproductive(session, iteration)
171 if unproductive:
172 return ("adjust", "heuristic_unproductive")
174 # 4b. Deferred stagnation — the counter hit the threshold but the
175 # heuristic didn't fire (e.g., the tool sequence changed after a
176 # prior sampled revision). Terminate now.
177 if _stagnation_pending:
178 return ("terminate", "no_progress")
180 # 5. Default.
181 return ("continue", "in_progress")
184# ---------------------------------------------------------------------------
185# Budget helpers — pure
186# ---------------------------------------------------------------------------
189def _wall_clock_exceeded(session: SessionState, now: datetime) -> bool:
190 """True iff ``now - session["started_at"] >= max_wall_clock_seconds``.
192 Returns False when ``started_at`` is absent — a session that has
193 never been transitioned out of ``pending`` cannot have exceeded any
194 wall-clock budget. The engine writes ``started_at`` on the first
195 iteration entry, so this guard only matters for the synthetic
196 "decide called before run_iteration" path used in unit tests.
198 Returns False when ``max_wall_clock_seconds`` is the explicit
199 ``-1`` "uncapped" sentinel — the operator opted out of the wall-
200 clock cap and the cascade should fall through to the next branch
201 rather than terminate spuriously.
202 """
203 started_iso = session.get("started_at")
204 if not started_iso:  # coverage: 204 ↛ 205, line 204 didn't jump to line 205 because the condition on line 204 was never true
205 return False
206 max_seconds = session["budget"]["max_wall_clock_seconds"]
207 if max_seconds == -1:
208 return False
209 started = datetime.fromisoformat(started_iso)
210 return now - started >= timedelta(seconds=max_seconds)
213# ---------------------------------------------------------------------------
214# Completion check
215# ---------------------------------------------------------------------------
218def _completion_satisfied(
219 session: SessionState,
220 iteration: IterationRecord,
221) -> bool:
222 """True iff every required Criterion is met and none are inconclusive.
224 A session completes when all Criteria with ``required=True`` have
225 status ``met`` AND no Criterion (required or not) has status
226 ``inconclusive``. The ``required`` flag lives on the Criterion
227 declaration in ``session["criteria"]``; the per-iteration status
228 lives on ``iteration["criteria_evaluation"]``. The two are joined
229 by ``criterion_id``.
231 A session with zero declared Criteria can never complete on its own
232 — there are no required Criteria for the cascade to satisfy. The
233 operator drives such a session to terminal via ``mission_complete``
234 or a budget cap. We mirror that semantic here by returning False
235 when the criteria list is empty.
236 """
237 if not session["criteria"]:
238 return False
239 required_by_id = {c["criterion_id"]: c.get("required", True) for c in session["criteria"]}
240 for result in iteration["criteria_evaluation"]:
241 status = result["status"]
242 if status == "inconclusive":
243 return False
244 if required_by_id.get(result["criterion_id"], True) and status != "met":
245 return False
246 return True
249# ---------------------------------------------------------------------------
250# Strategy_Revision_Heuristic
251# ---------------------------------------------------------------------------
254def _strategy_unproductive(
255 session: SessionState,
256 iteration: IterationRecord,
257) -> tuple[bool, str]:
258 """Pure heuristic for the Strategy_Revision check.
260 Two clauses, evaluated in declaration order. The first match wins
261 so the returned reason is deterministic when both clauses fire.
263 * **Clause (a)** — the same ``tool_calls[*].tool_name`` sequence
264 has been used for the last 3 iterations (counting the in-progress
265 one) AND ``no_progress_counter >= ceil(stagnation_threshold / 2)``.
266 Needs at least 2 prior iterations to evaluate (3 total when the
267 current iteration is included). A scripted strategy contributes
268 an empty sequence, so any scripted strategies register as
269 "same sequence" whatever their bodies. That is intentional: the heuristic
270 flags repeats, and an empty-sequence repeat across three iterations is a repeat.
271 * **Clause (b)** — the in-progress Observation contains at least
272 one ``errors`` entry that did not appear in the immediately
273 prior Iteration's Observation. Needs at least 1 prior iteration
274 to evaluate. Without a prior to compare to, "new" is undefined
275 and we return False.
277 Returns ``(False, "")`` when neither clause fires. When clause (a)
278 fires, the reason is ``"tool_sequence_repeating"``; when clause (b)
279 fires, ``"new_observation_errors"``. The reason string is
280 informational only — the Verdict's ``verdict_reason`` is always
281 ``"heuristic_unproductive"`` regardless of which clause matched.
282 """
283 # Clause (a): no_progress threshold AND tool-sequence repeat.
284 threshold = session["stagnation_threshold"]
285 half = math.ceil(threshold / 2)
286 if session["no_progress_counter"] >= half:
287 # Need at least 2 prior + the current = 3 total iterations.
288 prior = session["iterations"]
289 if len(prior) >= 2:  # coverage: 289 ↛ 296, line 289 didn't jump to line 296 because the condition on line 289 was always true
290 recent_three = [prior[-2], prior[-1], iteration]
291 sequences = [_tool_name_sequence(it) for it in recent_three]
292 if sequences[0] == sequences[1] == sequences[2]:
293 return (True, "tool_sequence_repeating")
295 # Clause (b): new errors in the latest Observation vs the prior one.
296 if session["iterations"]:
297 prior_observation = session["iterations"][-1].get("observation") or {}
298 prior_errors = list(prior_observation.get("errors") or [])
299 current_errors = list(iteration["observation"].get("errors") or [])
300 for err in current_errors:
301 if err not in prior_errors:  # coverage: 301 ↛ 300, line 301 didn't jump to line 300 because the condition on line 301 was always true
302 return (True, "new_observation_errors")
304 return (False, "")
307def _tool_name_sequence(iteration: IterationRecord) -> tuple[str, ...]:
308 """Extract the ordered tuple of ``tool_name``s from an iteration's strategy.
310 Returns an empty tuple when the strategy is a script (no
311 ``tool_calls``) or when ``tool_calls`` is missing. Two scripted
312 strategies therefore both produce ``()`` and compare equal — clause
313 (a) treats that as "same sequence", which matches the operator's
314 intent of flagging mechanical repetition regardless of mode.
315 """
316 strategy = iteration.get("strategy") or {}
317 tool_calls = strategy.get("tool_calls") or []
318 return tuple(str(call.get("tool_name", "")) for call in tool_calls if isinstance(call, dict))
321# ---------------------------------------------------------------------------
322# Revision rationale template
323# ---------------------------------------------------------------------------
326def build_revision_rationale_template(
327 session: SessionState,
328 iteration: IterationRecord,
329) -> str:
330 """Build the deterministic ``revision_rationale`` text for an ``adjust`` verdict.
332 Used both as the rationale on sessions with ``use_sampling=false``
333 and as the fallback rationale when sampling is rejected on a
334 ``use_sampling=true`` session.
335 Pure: depends only on persisted Session/Iteration fields, never
336 calls into the sampler or any other non-deterministic component.
338 The rendered text names the iteration index (1-indexed for
339 operator-friendliness), the heuristic reason, the unmet Criterion
340 ids (so the rationale points at the goal that's still moving), and
341 a one-line summary of the in-progress strategy (tool-name sequence
342 or ``"scripted strategy"``). The format is intentionally short and
343 machine-parseable — operators can grep it; no LLM is involved.
344 """
345 # Resolve the iteration index — the in-progress iteration has not
346 # been appended to session["iterations"] yet, so its 0-indexed
347 # position equals len(iterations) and the 1-indexed position is +1.
348 iteration_index_one_based = len(session["iterations"]) + 1
350 # Re-run the heuristic so the rationale text names whichever
351 # clause actually fired. Both calls are pure and cheap.
352 _, heuristic_reason = _strategy_unproductive(session, iteration)
353 if not heuristic_reason:
354 # decide_verdict only emits ``adjust`` when the heuristic fires,
355 # but the caller may invoke this template independently (e.g.
356 # the sampling-fallback path on a non-heuristic adjust) — fall
357 # back to a generic reason so the template stays usable.
358 heuristic_reason = "strategy_review_requested"
360 unmet_ids = [
361 result["criterion_id"]
362 for result in iteration["criteria_evaluation"]
363 if result["status"] == "unmet"
364 ]
365 unmet_summary = ", ".join(unmet_ids) if unmet_ids else "none"
367 strategy = iteration.get("strategy") or {}
368 if "script" in strategy:
369 strategy_summary = "scripted strategy"
370 else:
371 names = _tool_name_sequence(iteration)
372 strategy_summary = ", ".join(names) if names else "no tool calls"
374 no_progress = session["no_progress_counter"]
375 threshold = session["stagnation_threshold"]
377 return (
378 f"Strategy revised on iteration {iteration_index_one_based}: "
379 f"{heuristic_reason}. Unmet criteria: {unmet_summary}. "
380 f"Last strategy: {strategy_summary}. "
381 f"No-progress counter: {no_progress}/{threshold}. "
382 "Adjusting approach for next iteration."
383 )
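
The budget checks (steps 1a and 1b) and the clause (a) floor in the listing above reduce to a few lines of arithmetic. The sketch below is a simplified, standalone illustration rather than the module's own code: the helper names `max_iterations_reached`, `wall_clock_exceeded`, and `heuristic_floor`, and their flat arguments, are hypothetical stand-ins for the `SessionState` fields, chosen so the example runs without the `mcp.mission` package. The clock is pinned, as the module docstring suggests tests should do.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import Optional


def max_iterations_reached(recorded: int, max_iterations: int) -> bool:
    """Step 1a: -1 means uncapped; the +1 counts the in-progress iteration."""
    return max_iterations != -1 and recorded + 1 >= max_iterations


def wall_clock_exceeded(
    started_at: Optional[str], max_seconds: int, now: datetime
) -> bool:
    """Step 1b: a missing start time or the -1 sentinel never trips the cap."""
    if not started_at or max_seconds == -1:
        return False
    elapsed = now - datetime.fromisoformat(started_at)
    return elapsed >= timedelta(seconds=max_seconds)


def heuristic_floor(stagnation_threshold: int) -> int:
    """Clause (a) requires no_progress_counter >= ceil(threshold / 2)."""
    return math.ceil(stagnation_threshold / 2)


# Pin the clock: the caller supplies `now`, so the result is reproducible.
now = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
started = "2026-01-01T11:00:00+00:00"  # exactly one hour before `now`

print(max_iterations_reached(recorded=4, max_iterations=5))    # True
print(max_iterations_reached(recorded=3, max_iterations=5))    # False
print(max_iterations_reached(recorded=99, max_iterations=-1))  # False
print(wall_clock_exceeded(started, 3600, now))  # True: the cap is inclusive
print(wall_clock_exceeded(started, 3601, now))  # False
print(wall_clock_exceeded(None, 3600, now))     # False: never started
print(heuristic_floor(5))  # 3
```

Both caps use `>=`, so a session terminates on the iteration that reaches the limit, not the one after it. Because `now` is an argument and nothing reads the clock, the same inputs always give the same answer.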