Coverage for mcp/mission/sandbox.py: 88%

511 statements  

coverage.py v7.14.1, created at 2026-06-15 15:07 +0000

1"""Restricted AST validator for Mission ``Strategy.script`` source. 

2 

3Where a ``Criterion(kind="predicate")`` carries a single expression, a 

4``Strategy.script`` carries a multi-statement Python module that runs 

5inside the Mission sandbox to drive an iteration. Both surfaces accept 

6untrusted operator input, so both go through a parse-time AST allowlist 

7before any execution. This module owns the script side: it parses 

8scripts in ``exec`` mode, walks the tree with an explicit list of 

9allowed nodes, and rejects everything else with :class:`ScriptRejected`. 

10 

11The script surface is wider than the predicate surface — multi-statement 

12control flow, helper function definitions, named-exception ``try`` / 

13``except`` / ``finally`` blocks, plus calls to the operator-supplied 

14tool allowlist — so this module is its own validator rather than a 

15shared base class. The structural decisions (an :class:`ast.NodeVisitor` 

16that defines a ``visit_*`` for every accepted node and rejects in 

17``generic_visit``, an exception type carrying ``reason`` / 

18``failing_node`` / ``lineno`` / ``col_offset``, dunder filtering on 

19strings and identifiers, comprehension-target shadowing checks) mirror 

20:mod:`mcp.mission.predicate` so the two layers reject the same shapes 

21the same way. 

22 

23Two layers, same as the predicate sandbox: 

24 

251. **Parse-time validation.** :func:`validate_script_ast` parses the 

26 source in ``exec`` mode and walks the tree with 

27 :class:`_ScriptValidator`. The first disallowed construct raises 

28 :class:`ScriptRejected`; the script never runs. 

292. **Run-time isolation.** The runtime layer (the 

30 :class:`MissionSandbox` wrapper around ``MontySandboxProvider``) 

31 executes a validated script under shared duration / memory limits 

32 with an explicit namespace that withholds dangerous builtins like 

33 ``open`` / ``getattr`` / ``__import__``. Even a tree that slipped 

34 past this validator would fail at lookup. 

35 

36Allowed surface 

37--------------- 

38**Statements:** ``Module``, ``Expr``, ``Assign``, ``AugAssign``, 

39``AnnAssign``, ``If``, ``While``, ``For``, ``Pass``, ``Break``, 

40``Continue``, ``Return``, ``FunctionDef`` (no decorators), ``Try`` 

41(named-exception handlers only), ``Raise``. 

42 

43**Expressions:** constants, names from the allowlist, container 

44literals (``List`` / ``Tuple`` / ``Set`` / ``Dict``), comprehensions 

45(``ListComp`` / ``SetComp`` / ``DictComp`` / ``GeneratorExp``), 

46``BinOp`` / ``UnaryOp`` / ``BoolOp`` / ``Compare`` / ``IfExp``, 

47subscript and slice access, f-strings, lambdas, the walrus operator, 

48plus calls. 

49 

50**Names visible to a script (the *base scope*):** 

51 

52- ``mission`` — the per-iteration namespace; the only allowed 

53 attribute accesses are ``mission.observe`` and ``mission.event``. 

54- The pure stdlib callables ``len``, ``min``, ``max``, ``sum``, 

55 ``abs``, ``any``, ``all``, ``sorted``, ``range``, ``enumerate``, 

56 ``zip``, ``list``, ``dict``, ``tuple``, ``set``, ``str``, ``int``, 

57 ``float``, ``bool``. 

58- A small set of built-in exception classes so ``raise ValueError(...)`` 

59 and ``except KeyError as e:`` both work without importing. 

60- Every tool name the operator placed on the per-session allowlist. 

61 

62**Calls** may target a bare name from the base scope, a name a script 

63introduced (a function it defined or a value it bound), or one of the 

64two attribute calls ``mission.observe(...)`` / ``mission.event(...)``. 

65``exec``, ``eval``, ``compile``, and ``__import__`` are rejected by 

66name even if a script binds those identifiers locally. 

67 

68Rejected outright 

69----------------- 

70``Import`` / ``ImportFrom``, ``ClassDef``, ``AsyncFunctionDef`` / 

71``AsyncFor`` / ``AsyncWith``, ``Yield`` / ``YieldFrom``, ``Global`` / 

72``Nonlocal``, ``Match``, ``With``, ``Assert``, ``Delete``, decorators 

73(the allowlist is currently empty), bare ``except:`` clauses, 

74attribute access on anything other than ``mission``, calls on 

75attributes / subscripts / other calls, dunder strings and identifiers, 

76and any binding (``Assign``, ``AnnAssign``, ``AugAssign``, walrus, 

77function parameter, function name, comprehension target, ``for`` 

78target, ``except as`` name) whose name shadows a base-scope identifier. 

79 

80``Await`` carries a single, narrow exception: ``await <tool>(...)`` 

81where ``<tool>`` is a bare name on the per-session tool allowlist. 

82The runtime layer below exposes every allowlisted tool through the 

83underlying Monty ``external_functions`` channel as a coroutine 

84factory, so the script must ``await`` the call to receive the 

85dispatcher's return value rather than a coroutine object. Every 

86other ``Await`` shape — ``await name`` on a non-call, ``await 

87mission.observe(...)``, ``await some_other_tool()`` for a tool not on 

88the allowlist, ``await (lambda: ...)()`` — stays rejected with 

89reason ``await_not_allowed``. 

90""" 
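
The parse-then-walk pattern the docstring describes can be sketched in isolation. This is a minimal illustration, not the module's own code: `Rejected`, `TinyValidator`, and `is_accepted` are hypothetical names, and the allowed surface (integer constants joined by `+`) is deliberately far smaller than the real one. It only shows the opt-in shape: every accepted node gets a `visit_*` method, and `generic_visit` rejects the rest.

```python
import ast


class Rejected(Exception):
    """Hypothetical stand-in for ScriptRejected."""


class TinyValidator(ast.NodeVisitor):
    """Accept only expression statements made of int constants and ``+``."""

    def generic_visit(self, node: ast.AST) -> None:
        # Any node type without a dedicated visit_* method lands here.
        raise Rejected(f"{type(node).__name__} is not allowed")

    def visit_Module(self, node: ast.Module) -> None:
        for stmt in node.body:
            self.visit(stmt)

    def visit_Expr(self, node: ast.Expr) -> None:
        self.visit(node.value)

    def visit_Constant(self, node: ast.Constant) -> None:
        if not isinstance(node.value, int):
            raise Rejected("only int constants are allowed")

    def visit_BinOp(self, node: ast.BinOp) -> None:
        if not isinstance(node.op, ast.Add):
            raise Rejected(f"{type(node.op).__name__} is not allowed")
        self.visit(node.left)
        self.visit(node.right)


def is_accepted(source: str) -> bool:
    """Parse in exec mode, walk the tree, and report whether it validated."""
    try:
        TinyValidator().visit(ast.parse(source, mode="exec"))
    except Rejected:
        return False
    return True
```

Because rejection is the default, a node type added to the grammar in a later Python release would be refused until someone writes a visitor for it.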

91 

92from __future__ import annotations 

93 

94import ast 

95from collections.abc import Iterable 

96from typing import Final, NoReturn 

97 

98# <pyflowchart-code-diagram> BEGIN - auto-inserted, do not edit 

99# Flowchart(s) generated from this file: 

100# * ``validate_script_ast`` -> ``diagrams/code_diagrams/mcp/mission/sandbox.validate_script_ast.html`` 

101# (PNG: ``diagrams/code_diagrams/mcp/mission/sandbox.validate_script_ast.png``) 

102# Regenerate with ``python diagrams/code_diagrams/generate.py``. 

103# <pyflowchart-code-diagram> END 

104 

105 

106# --------------------------------------------------------------------------- 

107# Allowlists 

108# --------------------------------------------------------------------------- 

109 

110_SAFE_BUILTINS: Final[frozenset[str]] = frozenset( 

111 { 

112 "len", 

113 "min", 

114 "max", 

115 "sum", 

116 "abs", 

117 "any", 

118 "all", 

119 "sorted", 

120 "range", 

121 "enumerate", 

122 "zip", 

123 "list", 

124 "dict", 

125 "tuple", 

126 "set", 

127 "str", 

128 "int", 

129 "float", 

130 "bool", 

131 } 

132) 

133"""Pure stdlib callables a script may look up by bare name.""" 

134 

135_ALLOWED_EXCEPTION_NAMES: Final[frozenset[str]] = frozenset( 

136 { 

137 "Exception", 

138 "ValueError", 

139 "TypeError", 

140 "KeyError", 

141 "IndexError", 

142 "AttributeError", 

143 "LookupError", 

144 "RuntimeError", 

145 "ArithmeticError", 

146 "ZeroDivisionError", 

147 "OverflowError", 

148 "OSError", 

149 "FileNotFoundError", 

150 "TimeoutError", 

151 "ConnectionError", 

152 "StopIteration", 

153 "AssertionError", 

154 } 

155) 

156"""Built-in exception classes a script may name in ``raise`` and ``except``. 

157 

158Including these in the base scope is what lets a script say 

159``except ValueError as e:`` or ``raise RuntimeError("msg")`` without an 

160``import``. Constructing an exception instance is side-effect-free, so 

161exposing the class is no broader than exposing the safe builtins. 

162""" 

163 

164_MISSION_NAMESPACE_NAME: Final[str] = "mission" 

165"""Top-level identifier reserved for the per-iteration helper namespace.""" 

166 

167_MISSION_HELPER_ATTRIBUTES: Final[frozenset[str]] = frozenset({"observe", "event"}) 

168"""The only attributes the validator accepts on the ``mission`` namespace.""" 

169 

170_FORBIDDEN_CALL_TARGETS: Final[frozenset[str]] = frozenset( 

171 {"exec", "eval", "compile", "__import__"} 

172) 

173"""Names whose call form is rejected by name even if a script shadows them. 

174 

175A script could in principle write ``def exec(): ...`` and then call its 

176own local. Rejecting these names at the call site as well as via the 

177dunder filter (for ``__import__``) closes the gap. 

178""" 

179 

180_ALLOWED_DECORATORS: Final[frozenset[str]] = frozenset() 

181"""Decorator names a function definition may carry. 

182 

183Currently empty: any ``@decorator`` on a ``FunctionDef`` is rejected. 

184The hook is here so a future iteration can vet a small set of operator- 

185facing helpers (e.g. a retry decorator) by editing only this constant. 

186""" 

187 

188_ALLOWED_BIN_OPS: Final[tuple[type[ast.operator], ...]] = ( 

189 ast.Add, 

190 ast.Sub, 

191 ast.Mult, 

192 ast.Div, 

193 ast.FloorDiv, 

194 ast.Mod, 

195 ast.Pow, 

196 ast.MatMult, 

197) 

198 

199_ALLOWED_UNARY_OPS: Final[tuple[type[ast.unaryop], ...]] = ( 

200 ast.UAdd, 

201 ast.USub, 

202 ast.Not, 

203 ast.Invert, 

204) 

205 

206_ALLOWED_COMPARE_OPS: Final[tuple[type[ast.cmpop], ...]] = ( 

207 ast.Eq, 

208 ast.NotEq, 

209 ast.Lt, 

210 ast.LtE, 

211 ast.Gt, 

212 ast.GtE, 

213 ast.Is, 

214 ast.IsNot, 

215 ast.In, 

216 ast.NotIn, 

217) 

218 

219_ALLOWED_BOOL_OPS: Final[tuple[type[ast.boolop], ...]] = (ast.And, ast.Or) 
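
The operator allowlists above are tuples of AST node classes, so a membership test is a plain `isinstance` check on the operator node. A small sketch, assuming a copy of the binary-operator tuple under the hypothetical name `ALLOWED_BIN_OPS` and a hypothetical helper `binop_allowed`:

```python
import ast

# Copy of the binary-operator allowlist from the listing, under a hypothetical name.
ALLOWED_BIN_OPS = (
    ast.Add,
    ast.Sub,
    ast.Mult,
    ast.Div,
    ast.FloorDiv,
    ast.Mod,
    ast.Pow,
    ast.MatMult,
)


def binop_allowed(expression: str) -> bool:
    """Return True when the top-level binary operator is on the allowlist."""
    node = ast.parse(expression, mode="eval").body
    if not isinstance(node, ast.BinOp):
        raise ValueError("expected a binary operation")
    # isinstance accepts a tuple of classes, so no loop is needed.
    return isinstance(node.op, ALLOWED_BIN_OPS)
```

With this tuple, `a + b` and `a @ b` pass, while the shift and bitwise operators (`a << b`, `a | b`) do not, because their node classes are absent from it.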

220 

221 

222# --------------------------------------------------------------------------- 

223# Exception 

224# --------------------------------------------------------------------------- 

225 

226 

227class ScriptRejected(Exception): 

228 """Raised when a script source contains a disallowed construct. 

229 

230 Mirror of :class:`mcp.mission.predicate.PredicateRejected` so callers 

231 can render uniform structured errors regardless of which sandbox 

232 layer rejected the input. ``reason`` is a short stable token (e.g. 

233 ``"forbidden_node"``, ``"shadows_protected_name"``) suitable for 

234 machine-readable error envelopes; ``failing_node`` is the 

235 :class:`ast.AST` that triggered rejection (``None`` only when the 

236 source failed to parse at all). 

237 """ 

238 

239 def __init__( 

240 self, 

241 reason: str, 

242 *, 

243 failing_node: ast.AST | None = None, 

244 message: str | None = None, 

245 ) -> None: 

246 self.reason: str = reason 

247 self.failing_node: ast.AST | None = failing_node 

248 self.lineno: int | None = ( 

249 getattr(failing_node, "lineno", None) if failing_node is not None else None 

250 ) 

251 self.col_offset: int | None = ( 

252 getattr(failing_node, "col_offset", None) if failing_node is not None else None 

253 ) 

254 rendered = message if message is not None else reason 

255 if self.lineno is not None: 

256 rendered = f"{rendered} (line {self.lineno}, col {self.col_offset})" 

257 super().__init__(rendered) 

258 

259 

260# --------------------------------------------------------------------------- 

261# Validator 

262# --------------------------------------------------------------------------- 

263 

264 

265class _ScriptValidator(ast.NodeVisitor): 

266 """Walk a script AST and reject any construct outside the allowlist. 

267 

268 The validator tracks two things across the walk: 

269 

270 * **The base scope** — the union of the operator-supplied tool 

271 allowlist, the safe builtins, the allowed exception names, and 

272 the ``mission`` namespace. These names are *protected*: a script 

273 may read them but may not bind, rebind, or shadow them with a 

274 local of any kind (assignment, walrus, function parameter, 

275 function name, comprehension target, ``for`` target, 

276 ``except as`` name). Protecting them keeps the security model 

277 easy to audit: if you see a Name in the source whose ``id`` is 

278 ``submit_job_sqs``, you can be sure it resolves to the registered 

279 tool. 

280 * **A scope stack** — entries on the stack carry the names a 

281 script has bound at module level plus the names introduced by 

282 function parameters, comprehension targets, ``for`` loops, and 

283 ``except as`` clauses. The stack is what makes a helper function 

284 that defines a parameter ``i`` validate cleanly without ``i`` 

285 leaking into the module-level scope. 

286 """ 

287 

288 def __init__(self, allowlist: Iterable[str]) -> None: 

289 # Order does not matter; keep as a frozenset for fast membership. 

290 self._tool_allowlist: frozenset[str] = frozenset(allowlist) 

291 

292 # Names that are visible from the start of the script and that 

293 # script-introduced bindings may NOT shadow. The mission 

294 # namespace counts as protected: rebinding it would defeat the 

295 # one-allowed-attribute-base rule in :meth:`visit_Attribute`. 

296 # The forbidden call targets (``eval``, ``exec``, ``compile``, 

297 # ``__import__``) are folded into the protected set so that a 

298 # script trying to shadow them — ``(eval := 1)``, ``def exec(): 

299 # ...``, ``for compile in xs:`` — is rejected at the binding 

300 # site with ``shadows_protected_name``, in addition to the 

301 # call-site rejection in :meth:`visit_Call`. Two layers of 

302 # defense for the same risk: a reader does not have to chase 

303 # every later use to know whether the shadow is harmful. 

304 self._base_scope: frozenset[str] = ( 

305 self._tool_allowlist 

306 | _SAFE_BUILTINS 

307 | _ALLOWED_EXCEPTION_NAMES 

308 | _FORBIDDEN_CALL_TARGETS 

309 | {_MISSION_NAMESPACE_NAME} 

310 ) 

311 

312 # Stack of frozensets of script-bound names (function params, 

313 # for-loop targets, comprehension targets, assignment targets, 

314 # function definitions). The base frame is empty; each scope 

315 # push appends a new frame whose contents accumulate from the 

316 # parent frame so a nested lookup can see outer locals. 

317 self._scopes: list[frozenset[str]] = [frozenset()] 

318 

319 # ---- helpers ------------------------------------------------------- 

320 

321 def _current_locals(self) -> frozenset[str]: 

322 return self._scopes[-1] 

323 

324 def _name_is_visible(self, name: str) -> bool: 

325 return name in self._base_scope or name in self._current_locals() 

326 

327 @staticmethod 

328 def _is_dunder(name: str) -> bool: 

329 return name.startswith("__") 

330 

331 @staticmethod 

332 def _reject(reason: str, node: ast.AST, message: str | None = None) -> NoReturn: 

333 raise ScriptRejected(reason, failing_node=node, message=message) 

334 

335 def _push_scope(self, locals_: frozenset[str]) -> None: 

336 self._scopes.append(self._current_locals() | locals_) 

337 

338 def _pop_scope(self) -> None: 

339 self._scopes.pop() 

340 

341 def _bind_local(self, name: str, node: ast.AST) -> None: 

342 """Add ``name`` to the current frame, rejecting protected shadows. 

343 

344 Used by every binding form (assignment, walrus, function name, 

345 function parameter, ``for`` target, comprehension target, 

346 ``except as`` name). The shadow check is what prevents a 

347 script from rebinding ``submit_job_sqs`` or ``mission`` and 

348 thereby sneaking past later name-based validation. 

349 """ 

350 if self._is_dunder(name): 

351 self._reject( 

352 "dunder_binding", 

353 node, 

354 f"binding to '{name}' is not allowed (starts with '__')", 

355 ) 

356 if name in self._base_scope: 

357 self._reject( 

358 "shadows_protected_name", 

359 node, 

360 f"binding to '{name}' shadows a protected name", 

361 ) 

362 # The accumulated-frame model means we replace the top frame 

363 # rather than mutate it in place: every ``_push_scope`` already 

364 # captured the parent, and append-adds at the leaf are local to 

365 # this frame. 

366 self._scopes[-1] = self._scopes[-1] | {name} 
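
The accumulated-frame model used by `_push_scope`, `_bind_local`, and `_pop_scope` can be sketched on its own. `ScopeStack` and its method names are hypothetical; the sketch leaves out the dunder and protected-name checks and shows only how frames inherit from their parent and how a pop discards inner bindings.

```python
class ScopeStack:
    """Hypothetical stand-in for the validator's ``_scopes`` bookkeeping."""

    def __init__(self) -> None:
        # The base frame starts empty.
        self._frames: list[frozenset[str]] = [frozenset()]

    def push(self, locals_: frozenset[str] = frozenset()) -> None:
        # A new frame starts as its parent's names plus any new ones.
        self._frames.append(self._frames[-1] | locals_)

    def bind(self, name: str) -> None:
        # frozensets are immutable, so binding replaces the top frame.
        self._frames[-1] = self._frames[-1] | {name}

    def pop(self) -> None:
        self._frames.pop()

    def visible(self, name: str) -> bool:
        return name in self._frames[-1]


stack = ScopeStack()
stack.bind("helper")  # module-level name
stack.push()          # enter a function body
stack.bind("i")       # parameter, local to the function frame
inside = (stack.visible("helper"), stack.visible("i"))
stack.pop()           # leave the function body
outside = (stack.visible("helper"), stack.visible("i"))
```

Inside the function frame both names are visible; after the pop only `helper` remains, which is the "no leak into module scope" behaviour the class docstring describes.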

367 

368 def _collect_target_names(self, target: ast.AST) -> list[ast.Name]: 

369 """Flatten an assignment / for / comprehension target. 

370 

371 Tuples and lists nest (``for (a, b) in pairs``). ``Starred`` 

372 wraps (``a, *rest = xs``). Anything else under a target — 

373 ``Subscript``, ``Attribute`` — would be a write into a 

374 non-local namespace and is rejected by the caller via the 

375 ``invalid_target`` reason. 

376 """ 

377 if isinstance(target, ast.Name): 

378 return [target] 

379 if isinstance(target, (ast.Tuple, ast.List)): 

380 collected: list[ast.Name] = [] 

381 for elt in target.elts: 

382 collected.extend(self._collect_target_names(elt)) 

383 return collected 

384 if isinstance(target, ast.Starred): 

385 return self._collect_target_names(target.value) 

386 self._reject( 

387 "invalid_target", 

388 target, 

389 "assignment / loop target must be a plain identifier", 

390 ) 

391 return [] # unreachable; _reject raises 
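
The flattening performed by `_collect_target_names` can be tried outside the validator. The sketch below is an assumption-laden simplification: `target_names` and `first_assign_target` are hypothetical helpers, names are returned as strings rather than `ast.Name` nodes, and a plain `ValueError` stands in for the `invalid_target` rejection.

```python
import ast


def target_names(target: ast.AST) -> list[str]:
    """Flatten a binding target into the identifiers it introduces."""
    if isinstance(target, ast.Name):
        return [target.id]
    if isinstance(target, (ast.Tuple, ast.List)):
        names: list[str] = []
        for elt in target.elts:
            names.extend(target_names(elt))
        return names
    if isinstance(target, ast.Starred):
        return target_names(target.value)
    # Subscript and Attribute targets write into another namespace.
    raise ValueError(f"unsupported target: {type(target).__name__}")


def first_assign_target(source: str) -> ast.AST:
    """Return the first target of the first statement, which must be an Assign."""
    stmt = ast.parse(source, mode="exec").body[0]
    if not isinstance(stmt, ast.Assign):
        raise ValueError("expected an assignment statement")
    return stmt.targets[0]
```

For `a, (b, *rest) = xs` this yields `["a", "b", "rest"]`; a target such as `obj.attr` or `d["k"]` raises instead of being flattened.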

392 

393 def _bind_targets(self, target: ast.AST) -> None: 

394 for name_node in self._collect_target_names(target): 

395 self._bind_local(name_node.id, name_node) 

396 

397 # ---- top-level entry ---------------------------------------------- 

398 

399 def visit_Module(self, node: ast.Module) -> None: 

400 # ``ast.parse(..., mode="exec")`` produces a Module whose body 

401 # is a list of statements. Walk each in order so any forward 

402 # binding (e.g. a function definition followed by a call) 

403 # validates with the binding visible in the same module scope. 

404 for stmt in node.body: 

405 self.visit(stmt) 

406 

407 # ---- catch-all ----------------------------------------------------- 

408 

409 def generic_visit(self, node: ast.AST) -> None: 

410 # Default rejection: the validator opts in to every supported 

411 # node via a dedicated ``visit_*`` method. Anything reaching 

412 # ``generic_visit`` is something the operator wrote that the 

413 # script surface deliberately does not support — ``Import``, 

414 # ``ClassDef``, ``Global``, ``Nonlocal``, ``Match``, ``With``, 

415 # ``Assert``, ``Delete``, ``Yield``, ``AsyncFunctionDef`` / 

416 # ``AsyncFor`` / ``AsyncWith`` (``Await`` is handled by its 

417 # own narrow visitor), etc. 

418 self._reject( 

419 "forbidden_node", 

420 node, 

421 f"{type(node).__name__} is not allowed in a script", 

422 ) 

423 

424 # ---- statements ---------------------------------------------------- 

425 

426 def visit_Expr(self, node: ast.Expr) -> None: 

427 self.visit(node.value) 

428 

429 def visit_Pass(self, node: ast.Pass) -> None: 

430 # No children; the visitor still has to opt in to keep 

431 # generic_visit from rejecting it. 

432 pass 

433 

434 def visit_Break(self, node: ast.Break) -> None: 

435 pass 

436 

437 def visit_Continue(self, node: ast.Continue) -> None: 

438 pass 

439 

440 def visit_Assign(self, node: ast.Assign) -> None: 

441 # Validate the RHS *first* under the current scope, then bind 

442 # the LHS targets. This ordering matters for ``x = x + 1``: the 

443 # right-hand ``x`` must already exist as a local; if it does 

444 # not, the ``visit_Name`` lookup fails. Conversely, ``x = 1`` 

445 # introduces ``x`` only after the literal validates. 

446 self.visit(node.value) 

447 for target in node.targets: 

448 self._bind_targets(target) 

449 

450 def visit_AugAssign(self, node: ast.AugAssign) -> None: 

451 if not isinstance(node.op, _ALLOWED_BIN_OPS):  # 451 ↛ 452: line 451 didn't jump to line 452 because the condition on line 451 was never true

452 self._reject( 

453 "binop_not_allowed", 

454 node, 

455 f"augmented operator {type(node.op).__name__} is not allowed", 

456 ) 

457 # ``x += 1`` reads ``x`` then writes ``x``. The target Name 

458 # must be visible already (no defining via aug-assign), and 

459 # the target itself must not be a protected name. We re-use 

460 # ``_bind_local`` for the shadow check; if ``x`` is already 

461 # local the bind is a no-op. 

462 if not isinstance(node.target, ast.Name): 

463 self._reject( 

464 "invalid_target", 

465 node.target, 

466 "augmented assignment target must be a plain identifier", 

467 ) 

468 # Read-side check: target must already be in scope. 

469 self.visit(node.target) 

470 self.visit(node.value) 

471 # Bind defensively — protects against aug-assign on a 

472 # protected name even though the read-side visit above would 

473 # already accept it (protected names ARE visible). The 

474 # shadow check fires here. 

475 self._bind_local(node.target.id, node.target) 

476 

477 def visit_AnnAssign(self, node: ast.AnnAssign) -> None: 

478 # ``x: int = 1`` and ``x: int`` are accepted; ``obj.attr: int`` 

479 # is not (target must be a plain identifier). 

480 if node.value is not None:  # 480 ↛ 482: line 480 didn't jump to line 482 because the condition on line 480 was always true

481 self.visit(node.value) 

482 if node.annotation is not None:  # 482 ↛ 484: line 482 didn't jump to line 484 because the condition on line 482 was always true

483 self.visit(node.annotation) 

484 if not isinstance(node.target, ast.Name): 

485 self._reject( 

486 "invalid_target", 

487 node.target, 

488 "annotated assignment target must be a plain identifier", 

489 ) 

490 self._bind_local(node.target.id, node.target) 

491 

492 def visit_If(self, node: ast.If) -> None: 

493 self.visit(node.test) 

494 for stmt in node.body: 

495 self.visit(stmt) 

496 for stmt in node.orelse:  # 496 ↛ 497: line 496 didn't jump to line 497 because the loop on line 496 never started

497 self.visit(stmt) 

498 

499 def visit_While(self, node: ast.While) -> None: 

500 self.visit(node.test) 

501 for stmt in node.body: 

502 self.visit(stmt) 

503 for stmt in node.orelse: 

504 self.visit(stmt) 

505 

506 def visit_For(self, node: ast.For) -> None: 

507 # Validate the iterable in the *outer* scope, then bind the 

508 # loop targets in the same scope as the body. ``for x in xs:`` 

509 # leaks ``x`` after the loop, matching Python semantics. 

510 self.visit(node.iter) 

511 self._bind_targets(node.target) 

512 for stmt in node.body: 

513 self.visit(stmt) 

514 for stmt in node.orelse: 

515 self.visit(stmt) 

516 

517 def visit_Return(self, node: ast.Return) -> None: 

518 if node.value is not None:  # 518 ↛ exit: line 518 didn't return from function 'visit_Return' because the condition on line 518 was always true

519 self.visit(node.value) 

520 

521 def visit_Raise(self, node: ast.Raise) -> None: 

522 if node.exc is not None:  # 522 ↛ 524: line 522 didn't jump to line 524 because the condition on line 522 was always true

523 self.visit(node.exc) 

524 if node.cause is not None:  # 524 ↛ 525: line 524 didn't jump to line 525 because the condition on line 524 was never true

525 self.visit(node.cause) 

526 

527 def visit_Try(self, node: ast.Try) -> None: 

528 # Body of the try block runs in the current scope. 

529 for stmt in node.body: 

530 self.visit(stmt) 

531 for handler in node.handlers: 

532 # Bare ``except:`` is rejected — operators must name the 

533 # exception class so an unrelated bug is not silently 

534 # swallowed by the same handler that catches a tool 

535 # timeout. 

536 if handler.type is None: 

537 self._reject( 

538 "bare_except", 

539 handler, 

540 "bare 'except:' is not allowed; name the exception class", 

541 ) 

542 self.visit(handler.type) 

543 # ``except Exc as name:`` introduces ``name`` only inside 

544 # the handler block, mirroring Python semantics. Push a 

545 # new scope so the binding does not leak to siblings. 

546 self._push_scope(frozenset()) 

547 try: 

548 if handler.name is not None: 

549 # ``handler`` is the canonical AST node for the 

550 # binding location; reuse it as the failing-node 

551 # context for shadow rejections. 

552 self._bind_local(handler.name, handler) 

553 for stmt in handler.body: 

554 self.visit(stmt) 

555 finally: 

556 self._pop_scope() 

557 for stmt in node.orelse: 

558 self.visit(stmt) 

559 for stmt in node.finalbody: 

560 self.visit(stmt) 
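
The bare-`except:` test in `visit_Try` relies on how the parser represents handlers: an `ast.ExceptHandler` whose `type` field is `None`. A minimal sketch that finds such handlers, using a hypothetical helper `bare_except_lines` and two illustrative source strings:

```python
import ast


def bare_except_lines(source: str) -> list[int]:
    """Return the line numbers of every ``except:`` clause with no exception class."""
    tree = ast.parse(source, mode="exec")
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]


NAMED = "try:\n    x = 1\nexcept ValueError as e:\n    x = 2\n"
BARE = "try:\n    x = 1\nexcept:\n    x = 2\n"

named_hits = bare_except_lines(NAMED)
bare_hits = bare_except_lines(BARE)
```

`named_hits` is empty and `bare_hits` is `[3]`, the line of the offending `except:` keyword.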

561 

562 def visit_FunctionDef(self, node: ast.FunctionDef) -> None: 

563 # Decorators are gated by a dedicated allowlist so the security 

564 # surface stays small. The list is currently empty. 

565 for deco in node.decorator_list: 

566 if not (isinstance(deco, ast.Name) and deco.id in _ALLOWED_DECORATORS):  # 566 ↛ 565: line 566 didn't jump to line 565 because the condition on line 566 was always true

567 self._reject( 

568 "decorator_not_allowed", 

569 deco, 

570 "decorators are not allowed on script functions", 

571 ) 

572 # Bind the function name in the *current* scope so the rest of 

573 # the module can call it. The body opens a new scope under 

574 # which arguments live. 

575 self._bind_local(node.name, node) 

576 self._validate_function_signature_and_body(node.args, node.body, node) 

577 

578 def _validate_function_signature_and_body( 

579 self, 

580 args: ast.arguments, 

581 body: list[ast.stmt], 

582 owner: ast.AST, 

583 ) -> None: 

584 # Defaults that touch the outer scope are not forbidden, but 

585 # the default expressions still validate under the *outer* 

586 # scope (Python evaluates them once at def time, not per call). 

587 for default in args.defaults: 

588 self.visit(default) 

589 for kw_default in args.kw_defaults: 

590 if kw_default is not None: 

591 self.visit(kw_default) 

592 

593 # Collect parameter names. Reject duplicates and protected 

594 # shadows up front so the body sees a coherent local frame. 

595 param_names: list[tuple[str, ast.AST]] = [] 

596 

597 def _collect_arg(arg: ast.arg) -> None: 

598 param_names.append((arg.arg, arg)) 

599 if arg.annotation is not None:  # 599 ↛ 600: line 599 didn't jump to line 600 because the condition on line 599 was never true

600 self.visit(arg.annotation) 

601 

602 for arg in args.posonlyargs:  # 602 ↛ 603: line 602 didn't jump to line 603 because the loop on line 602 never started

603 _collect_arg(arg) 

604 for arg in args.args: 

605 _collect_arg(arg) 

606 if args.vararg is not None: 

607 _collect_arg(args.vararg) 

608 for arg in args.kwonlyargs: 

609 _collect_arg(arg) 

610 if args.kwarg is not None: 

611 _collect_arg(args.kwarg) 

612 

613 # Push a fresh frame; bindings inside the function do not 

614 # leak to the module-level scope. 

615 self._push_scope(frozenset()) 

616 try: 

617 seen: set[str] = set() 

618 for name, owning_node in param_names: 

619 if name in seen: 

620 self._reject( 

621 "duplicate_parameter", 

622 owning_node, 

623 f"duplicate parameter '{name}'", 

624 ) 

625 seen.add(name) 

626 self._bind_local(name, owning_node) 

627 for stmt in body: 

628 self.visit(stmt) 

629 finally: 

630 self._pop_scope() 

631 

632 # ---- expressions --------------------------------------------------- 

633 

634 def visit_Constant(self, node: ast.Constant) -> None: 

635 # Reject dunder strings even when used as plain data. The same 

636 # rationale as in the predicate sandbox: a string like 

637 # ``"__class__"`` only ever appears in source code as part of 

638 # an introspection escape pattern (``getattr(x, "__class__")``, 

639 # ``locals()["__import__"]``). Forbidding them at the constant 

640 # level closes those off even if a future change widened the 

641 # call or attribute allowlist. 

642 if isinstance(node.value, str) and self._is_dunder(node.value): 

643 self._reject( 

644 "dunder_string", 

645 node, 

646 "string constants starting with '__' are not allowed", 

647 ) 

648 

649 def visit_Name(self, node: ast.Name) -> None: 

650 if self._is_dunder(node.id):  # 650 ↛ 651: line 650 didn't jump to line 651 because the condition on line 650 was never true

651 self._reject( 

652 "dunder_name", 

653 node, 

654 f"identifier '{node.id}' starts with '__'", 

655 ) 

656 if not self._name_is_visible(node.id): 

657 self._reject( 

658 "name_not_allowed", 

659 node, 

660 f"name '{node.id}' is not in the script allowlist", 

661 ) 

662 

663 def visit_NamedExpr(self, node: ast.NamedExpr) -> None: 

664 # ``(x := expr)`` — the walrus binds ``x`` in the enclosing 

665 # scope. Validate the value first, then route through the 

666 # standard binding helper so the protected-name shadow check 

667 # fires for ``(mission := ...)`` etc. 

668 self.visit(node.value) 

669 if not isinstance(node.target, ast.Name):  # 669 ↛ 670: line 669 didn't jump to line 670 because the condition on line 669 was never true

670 self._reject( 

671 "invalid_target", 

672 node.target, 

673 "walrus target must be a plain identifier", 

674 ) 

675 self._bind_local(node.target.id, node.target) 

676 

677 def visit_Lambda(self, node: ast.Lambda) -> None: 

678 # Lambdas are scoped expressions: validate parameters + body 

679 # under a fresh frame, exactly like a ``FunctionDef`` minus 

680 # the decorator list and statement body. The lambda itself 

681 # produces no binding in the enclosing scope. 

682 self._validate_function_signature_and_body(node.args, [ast.Expr(value=node.body)], node) 

683 

684 # ---- containers ---------------------------------------------------- 

685 

686 def visit_List(self, node: ast.List) -> None: 

687 for elt in node.elts: 

688 self.visit(elt) 

689 

690 def visit_Tuple(self, node: ast.Tuple) -> None: 

691 for elt in node.elts: 

692 self.visit(elt) 

693 

694 def visit_Set(self, node: ast.Set) -> None: 

695 for elt in node.elts: 

696 self.visit(elt) 

697 

698 def visit_Dict(self, node: ast.Dict) -> None: 

699 for key in node.keys: 

700 if key is not None: 

701 self.visit(key) 

702 else: 

703 # ``{**other}`` would let a script splat arbitrary 

704 # mappings into a dict literal; reject for the same 

705 # reason as in the predicate sandbox. 

706 self._reject( 

707 "dict_unpacking", 

708 node, 

709 "dict unpacking is not allowed in a script", 

710 ) 

711 for value in node.values: 

712 self.visit(value) 

713 

714 def visit_Starred(self, node: ast.Starred) -> None: 

715 # ``[*xs]``, ``f(*xs)``, ``a, *rest = xs`` — recurse into the 

716 # inner expression so the nested Name still hits the 

717 # allowlist check. 

718 self.visit(node.value) 

719 

720 # ---- operators ----------------------------------------------------- 

721 

722 def visit_BinOp(self, node: ast.BinOp) -> None: 

723 if not isinstance(node.op, _ALLOWED_BIN_OPS):  # 723 ↛ 724: line 723 didn't jump to line 724 because the condition on line 723 was never true

724 self._reject( 

725 "binop_not_allowed", 

726 node, 

727 f"binary operator {type(node.op).__name__} is not allowed", 

728 ) 

729 self.visit(node.left) 

730 self.visit(node.right) 

731 

732 def visit_UnaryOp(self, node: ast.UnaryOp) -> None: 

733 if not isinstance(node.op, _ALLOWED_UNARY_OPS): 

734 self._reject( 

735 "unaryop_not_allowed", 

736 node, 

737 f"unary operator {type(node.op).__name__} is not allowed", 

738 ) 

739 self.visit(node.operand) 

740 

741 def visit_BoolOp(self, node: ast.BoolOp) -> None: 

742 if not isinstance(node.op, _ALLOWED_BOOL_OPS): 

743 self._reject( 

744 "boolop_not_allowed", 

745 node, 

746 f"bool operator {type(node.op).__name__} is not allowed", 

747 ) 

748 for value in node.values: 

749 self.visit(value) 

750 

751 def visit_Compare(self, node: ast.Compare) -> None: 

752 for op in node.ops: 

753 if not isinstance(op, _ALLOWED_COMPARE_OPS):  # 753 ↛ 754: line 753 didn't jump to line 754 because the condition on line 753 was never true

754 self._reject( 

755 "compareop_not_allowed", 

756 node, 

757 f"comparison operator {type(op).__name__} is not allowed", 

758 ) 

759 self.visit(node.left) 

760 for comparator in node.comparators: 

761 self.visit(comparator) 

762 

763 def visit_IfExp(self, node: ast.IfExp) -> None: 

764 self.visit(node.test) 

765 self.visit(node.body) 

766 self.visit(node.orelse) 

767 

768 # ---- attribute and subscript -------------------------------------- 

769 

770 def visit_Attribute(self, node: ast.Attribute) -> None: 

771 # The script surface allows attribute access on exactly one 

772 # name — the ``mission`` namespace — and only for the two 

773 # helper attributes ``observe`` and ``event``. Every other 

774 # ``foo.bar`` read raises ``ScriptRejected``: tool results are 

775 # opaque values, not deep object graphs, so a script that 

776 # needs nested data should use subscripting on a return value. 

777 if self._is_dunder(node.attr): 

778 self._reject( 

779 "dunder_attribute", 

780 node, 

781 f"attribute '{node.attr}' starts with '__'", 

782 ) 

783 if not isinstance(node.value, ast.Name): 

784 self._reject( 

785 "attribute_target_not_name", 

786 node, 

787 "attribute access is only allowed on the 'mission' namespace", 

788 ) 

789 if node.value.id != _MISSION_NAMESPACE_NAME: 

790 self._reject( 

791 "attribute_target_not_allowed", 

792 node, 

793 "attribute access is only allowed on the 'mission' namespace", 

794 ) 

795 if node.attr not in _MISSION_HELPER_ATTRIBUTES: 

796 self._reject( 

797 "attribute_not_allowed", 

798 node, 

799 f"'mission.{node.attr}' is not an allowed helper", 

800 ) 

801 # ``mission`` itself is a base-scope name; visit it for 

802 # regularity so any future Name-side check still fires here. 

803 self.visit(node.value) 

804 

805 def visit_Subscript(self, node: ast.Subscript) -> None: 

806 # Recurse into both the value and the slice. The base of the 

807 # chain falls out as a ``Name`` lookup that hits the 

808 # allowlist; slices may themselves contain Names and Calls 

809 # that go through the same validation path. 

810 self.visit(node.value) 

811 self.visit(node.slice) 

812 

813 def visit_Slice(self, node: ast.Slice) -> None: 

814 if node.lower is not None: 

815 self.visit(node.lower) 

816 if node.upper is not None: 

817 self.visit(node.upper) 

818 if node.step is not None: 

819 self.visit(node.step) 

820 

821 # ---- calls --------------------------------------------------------- 

822 

823 def visit_Call(self, node: ast.Call) -> None: 

824 # The callee form decides which rule applies. Two forms are 

825 # allowed: 

826 # 

827 # * ``name(...)`` — bare name call. The name must already be 

828 # visible (base-scope or script-bound local). 

829 # * ``mission.observe(...)`` / ``mission.event(...)`` — the 

830 # only attribute-call shape supported. 

831 # 

832 # ``foo()()`` (call returning a callable, then call), ``a[0]()`` 

833 # (subscript-then-call), and ``x.y()`` for any ``y`` not on the 

834 # mission helper list are all rejected outright. 

835 func = node.func 

836 if isinstance(func, ast.Name): 

837 # ``__import__``, ``exec``, ``eval``, ``compile`` are 

838 # rejected by name even if a script defined a local with 

839 # one of those names. The dunder filter in 

840 # :meth:`visit_Name` already rejects ``__import__`` for 

841 # plain reads; the explicit list is what blocks the 

842 # ``def exec(): ...; exec()`` shadow attempt. 

843 if func.id in _FORBIDDEN_CALL_TARGETS: 

844 self._reject( 

845 "forbidden_call_target", 

846 node, 

847 f"call to '{func.id}' is not allowed", 

848 ) 

849 # Visit the Name so the visibility / dunder check fires. 

850 self.visit(func) 

851 elif isinstance(func, ast.Attribute): 

852 # Only ``mission.observe`` / ``mission.event``. The 

853 # attribute visit raises with a structured reason for 

854 # every other shape (non-Name base, non-mission base, 

855 # disallowed attribute), so we just recurse here. 

856 self.visit(func) 

857 else: 

858 # ``f()()``, ``xs[0]()``, ``(lambda: ...)()`` — the 

859 # callee is neither a Name nor a single ``mission.<x>`` 

860 # attribute access. Reject without descending; the 

861 # blanket ``call_target_shape`` reason captures all three. 

862 self._reject( 

863 "call_target_shape", 

864 node, 

865 "script calls must target a bare name or 'mission.<helper>'", 

866 ) 

867 for arg in node.args: 

868 self.visit(arg) 

869 for kw in node.keywords: 

870 # ``**kwargs`` shows up as a keyword with arg=None; allow 

871 # the value but recurse so its content is still validated 

872 # against the same name and call rules. 

873 self.visit(kw.value) 

874 
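The callee rule in ``visit_Call`` can be illustrated outside the validator. The sketch below is not part of the module: ``callee_shape`` is a hypothetical helper that only classifies the ``func`` node of a call expression, which is the first decision ``visit_Call`` makes before any name or attribute checks run.

```python
import ast


def callee_shape(source: str) -> str:
    """Classify the callee of a single call expression (hypothetical helper)."""
    call = ast.parse(source, mode="eval").body
    if not isinstance(call, ast.Call):
        raise ValueError("source is not a call expression")
    if isinstance(call.func, ast.Name):
        return "name"       # f(...)
    if isinstance(call.func, ast.Attribute):
        return "attribute"  # x.y(...)
    return "other"          # f()(), xs[0](), (lambda: 0)()


for src in ("f(1)", "mission.observe(k=1)", "f()()", "xs[0]()", "(lambda: 0)()"):
    print(f"{src:24} -> {callee_shape(src)}")
```

Only the ``"other"`` bucket corresponds to ``call_target_shape``; the ``"name"`` and ``"attribute"`` buckets still have to pass the name and attribute checks above.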

875 def visit_Await(self, node: ast.Await) -> None: 

876 # The runtime layer (:class:`MissionSandbox`) exposes every 

877 # allowlisted tool through Monty's ``external_functions`` 

878 # channel, where a registered async callable surfaces inside 

879 # the script as a coroutine factory. Calling 

880 # ``find_examples(query="gpu")`` from inside a script returns 

881 # a coroutine object, not the dispatcher's return value; 

882 # consuming the value requires writing ``await 

883 # find_examples(query="gpu")``. The two ``mission`` helpers 

884 # ride the same channel — the runtime layer prepends a small 

885 # source-level shim that makes ``mission.observe`` / 

886 # ``mission.event`` route into host-side closures via the same 

887 # coroutine-factory channel, so awaiting them is required for 

888 # the side effect (an observation row, an event row) to land 

889 # on the iteration's audit log. The validator therefore opens 

890 # ``Await`` for exactly two shapes: 

891 # 

892 # * ``await <name>(...)`` where ``<name>`` is on the per- 

893 # session tool allowlist. 

894 # * ``await mission.observe(...)`` / ``await mission.event(...)`` 

895 # — attribute calls on the ``mission`` namespace whose 

896 # attribute is one of the two helper names that 

897 # :meth:`visit_Attribute` already accepts. 

898 # 

899 # Both forms route the wrapped Call back through 

900 # :meth:`visit_Call` so kwargs, positional args, and the 

901 # forbidden-call-target rules apply unchanged. 

902 # 

903 # Rejected (folded into ``await_not_allowed``): 

904 # 

905 # * ``await x`` — bare name (no Call inside). 

906 # * ``await some_other_tool()`` — call on a Name that is not 

907 # on the per-session tool allowlist (a safe builtin, an 

908 # exception class, ``mission`` itself, a script-bound local, 

909 # or simply unknown). 

910 # * ``await mission.foo(...)`` for any ``foo`` outside the 

911 # helper set — :meth:`visit_Attribute` would already reject 

912 # the inner call, but the early reject here keeps the reason 

913 # token stable as ``await_not_allowed``. 

914 # * ``await x.observe(...)`` for any ``x`` other than 

915 # ``mission`` — same rationale. 

916 # * ``await (lambda: ...)()`` / ``await xs[0]()`` — 

917 # subscript-then-call / call-of-call shapes; the underlying 

918 # Call would already fail :meth:`visit_Call`'s 

919 # ``call_target_shape`` check, but reject at the await 

920 # level too so the reason token stays ``await_not_allowed``. 

921 # 

922 # ``AsyncFunctionDef`` / ``AsyncFor`` / ``AsyncWith`` continue 

923 # to fall through to :meth:`generic_visit` and stay rejected 

924 # with ``forbidden_node`` — the relaxation here covers only 

925 # the bare ``Await`` expression on the two accepted call 

926 # shapes. 

927 inner = node.value 

928 if not isinstance(inner, ast.Call): 

929 self._reject( 

930 "await_not_allowed", 

931 node, 

932 "'await' may only be used on a call to an allowlisted " 

933 "tool or a 'mission.<helper>' call", 

934 ) 

935 func = inner.func 

936 if isinstance(func, ast.Name): 

937 if func.id not in self._tool_allowlist: 937 ↛ 938 (line 937 didn't jump to line 938 because the condition on line 937 was never true)

938 self._reject( 

939 "await_not_allowed", 

940 node, 

941 "'await' may only be used on a call to an allowlisted " 

942 "tool or a 'mission.<helper>' call", 

943 ) 

944 elif isinstance(func, ast.Attribute): 944 ↛ 958 (line 944 didn't jump to line 958 because the condition on line 944 was always true)

945 # Only ``mission.observe(...)`` / ``mission.event(...)``. 

946 if not ( 946 ↛ 951 (line 946 didn't jump to line 951 because the condition on line 946 was never true)

947 isinstance(func.value, ast.Name) 

948 and func.value.id == _MISSION_NAMESPACE_NAME 

949 and func.attr in _MISSION_HELPER_ATTRIBUTES 

950 ): 

951 self._reject( 

952 "await_not_allowed", 

953 node, 

954 "'await' may only be used on a call to an allowlisted " 

955 "tool or a 'mission.<helper>' call", 

956 ) 

957 else: 

958 self._reject( 

959 "await_not_allowed", 

960 node, 

961 "'await' may only be used on a call to an allowlisted " 

962 "tool or a 'mission.<helper>' call", 

963 ) 

964 # Hand the Call node back to the existing call-validation 

965 # machinery so kwargs, positional args, and the 

966 # forbidden-call-target check all fire exactly as they would 

967 # for the non-awaited form. 

968 self.visit(inner) 

969 
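The accepted and rejected ``await`` shapes listed in the comment can be checked with a few lines of ``ast`` code. This is a simplified sketch, not the validator itself: ``await_is_accepted`` and ``HELPERS`` are hypothetical names, the function returns a boolean instead of raising ``ScriptRejected``, and it does not go on to validate the call's arguments.

```python
import ast

# Assumed to mirror the two helper names mentioned in the comment above.
HELPERS = frozenset({"observe", "event"})


def await_is_accepted(source: str, tool_allowlist: frozenset[str]) -> bool:
    """Return True when ``source`` is one ``await`` statement in an accepted shape."""
    stmt = ast.parse(source, mode="exec").body[0]
    if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Await)):
        return False
    inner = stmt.value.value
    if not isinstance(inner, ast.Call):
        return False  # ``await x``: nothing is being called
    func = inner.func
    if isinstance(func, ast.Name):
        return func.id in tool_allowlist
    if isinstance(func, ast.Attribute):
        return (
            isinstance(func.value, ast.Name)
            and func.value.id == "mission"
            and func.attr in HELPERS
        )
    return False  # ``await xs[0]()``, ``await f()()``, ...
```

For example, with ``tools = frozenset({"find_examples"})``, the source ``await find_examples(query="gpu")`` is accepted while ``await len(xs)`` and ``await mission.foo()`` are not.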

970 # ---- f-strings ----------------------------------------------------- 

971 

972 def visit_JoinedStr(self, node: ast.JoinedStr) -> None: 

973 for value in node.values: 

974 self.visit(value) 

975 

976 def visit_FormattedValue(self, node: ast.FormattedValue) -> None: 

977 self.visit(node.value) 

978 if node.format_spec is not None: 978 ↛ 979 (line 978 didn't jump to line 979 because the condition on line 978 was never true)

979 self.visit(node.format_spec) 

980 

981 # ---- comprehensions ----------------------------------------------- 

982 

983 def _validate_comprehensions(self, generators: list[ast.comprehension]) -> frozenset[str]: 

984 """Walk comprehension generators and return their target names. 

985 

986 Each generator's ``iter`` is validated against the *outer* 

987 scope (it cannot reference targets of its own generator), then 

988 the targets are added to the local set, which is in scope only 

989 while that generator's ``ifs`` are visited. Async generators 

990 (``async for``) are rejected; the script body is sync. 

991 """ 

992 accumulated: set[str] = set() 

993 for gen in generators: 

994 if gen.is_async: 994 ↛ 995 (line 994 didn't jump to line 995 because the condition on line 994 was never true)

995 self._reject( 

996 "async_comprehension", 

997 gen.iter, 

998 "async comprehensions are not allowed", 

999 ) 

1000 self.visit(gen.iter) 

1001 target_names = self._collect_target_names(gen.target) 

1002 for name_node in target_names: 

1003 if self._is_dunder(name_node.id): 

1004 self._reject( 

1005 "dunder_comprehension_target", 

1006 name_node, 

1007 f"comprehension target '{name_node.id}' starts with '__'", 

1008 ) 

1009 if name_node.id in self._base_scope: 

1010 self._reject( 

1011 "shadows_protected_name", 

1012 name_node, 

1013 f"comprehension target '{name_node.id}' shadows a protected name", 

1014 ) 

1015 accumulated.add(name_node.id) 

1016 self._push_scope(frozenset(accumulated)) 

1017 try: 

1018 for if_clause in gen.ifs: 

1019 self.visit(if_clause) 

1020 finally: 

1021 self._pop_scope() 

1022 return frozenset(accumulated) 

1023 

1024 def _visit_comprehension_like( 

1025 self, 

1026 node: ast.ListComp | ast.SetComp | ast.GeneratorExp, 

1027 ) -> None: 

1028 locals_ = self._validate_comprehensions(node.generators) 

1029 self._push_scope(locals_) 

1030 try: 

1031 self.visit(node.elt) 

1032 finally: 

1033 self._pop_scope() 

1034 

1035 def visit_ListComp(self, node: ast.ListComp) -> None: 

1036 self._visit_comprehension_like(node) 

1037 

1038 def visit_SetComp(self, node: ast.SetComp) -> None: 

1039 self._visit_comprehension_like(node) 

1040 

1041 def visit_GeneratorExp(self, node: ast.GeneratorExp) -> None: 

1042 self._visit_comprehension_like(node) 

1043 

1044 def visit_DictComp(self, node: ast.DictComp) -> None: 

1045 locals_ = self._validate_comprehensions(node.generators) 

1046 self._push_scope(locals_) 

1047 try: 

1048 self.visit(node.key) 

1049 self.visit(node.value) 

1050 finally: 

1051 self._pop_scope() 

1052 

1053 

1054# --------------------------------------------------------------------------- 

1055# Public API 

1056# --------------------------------------------------------------------------- 

1057 

1058 

1059def validate_script_ast(script: str, allowlist: list[str]) -> None: 

1060 """Parse and validate a Mission script source string. 

1061 

1062 On success, the function returns ``None`` and the caller may pass 

1063 ``script`` to the sandbox runtime layer. On any disallowed 

1064 construct, raises :class:`ScriptRejected` carrying ``reason``, 

1065 ``failing_node``, ``lineno``, and ``col_offset``. The script is 

1066 *never* executed by this function; it only walks the AST. 

1067 

1068 ``allowlist`` is the per-session list of MCP tool names the script 

1069 may call. Each name becomes a visible bare-Name and a permitted 

1070 call target. Names not in the allowlist (and not in the safe 

1071 builtin / exception / mission set) are rejected at every Name 

1072 lookup. 

1073 """ 

1074 if not isinstance(script, str): 1074 ↛ 1075 (line 1074 didn't jump to line 1075 because the condition on line 1074 was never true)

1075 raise ScriptRejected( 

1076 "not_a_string", 

1077 message="script source must be a str", 

1078 ) 

1079 try: 

1080 parsed = ast.parse(script, mode="exec") 

1081 except SyntaxError as exc: 

1082 rejection = ScriptRejected( 

1083 "syntax_error", 

1084 message=f"could not parse script: {exc.msg}", 

1085 ) 

1086 rejection.lineno = exc.lineno 

1087 rejection.col_offset = exc.offset 

1088 raise rejection from exc 

1089 _ScriptValidator(allowlist).visit(parsed) 

1090 

1091 

1092# =========================================================================== 

1093# Runtime layer — MissionSandbox wrapper around MontySandboxProvider 

1094# =========================================================================== 

1095# 

1096# Where ``validate_script_ast`` above is the parse-time gate, the wrapper 

1097# below is the run-time isolation. A validated script is handed to the 

1098# Monty sandbox under shared duration / memory limits, with two extras 

1099# layered on top: 

1100# 

1101# * The operator-supplied tool allowlist is exposed as a set of async 

1102# callables in the script's namespace. Each callable forwards into the 

1103# engine's tool dispatcher so the existing ``@audit_logged`` / 

1104# feature-flag / allowlist semantics still fire — running inside a 

1105# script is *not* a way to bypass any of those. 

1106# * A ``mission`` namespace object exposes the iteration's read-only 

1107# metadata (deep-copied snapshot of the session's directive, criteria, 

1108# budget, and prior-iteration summaries) plus the two streaming 

1109# helpers ``mission.observe(...)`` / ``mission.event(...)``. The 

1110# helpers append into closure-captured lists that ``MissionSandbox.run`` 

1111# merges into the resulting Observation. 

1112# 

1113 # On any limit violation (duration, memory) or runtime / typing / syntax 

1114 # error raised inside the script, the ``MontyError`` family bubbles out of the 

1115# provider; the wrapper re-raises it as :class:`SandboxTerminated` 

1116# carrying whatever the script collected before it was killed so the 

1117# engine's ``_decide_phase`` can produce a deterministic ``terminate`` 

1118# verdict with the partial observation attached. 

1119 

1120import copy # noqa: E402 — runtime layer below; keep imports near their consumers 

1121import os # noqa: E402 

1122import time # noqa: E402 

1123from collections.abc import Awaitable, Callable # noqa: E402 

1124from datetime import UTC, datetime # noqa: E402 

1125from types import MappingProxyType # noqa: E402 

1126from typing import Any # noqa: E402 

1127 

1128from . import audit as _audit # noqa: E402 

1129 

1130# --------------------------------------------------------------------------- 

1131# Env helpers — module-level so the constants below are read once at import 

1132# time. Tests pin the constants by monkey-patching the module attributes; a 

1133# per-call read of os.environ would defeat that. 

1134# --------------------------------------------------------------------------- 

1135 

1136 

1137def _int_env(name: str, default: int) -> int: 

1138 """Parse an integer env var; fall back to default on missing/empty/non-numeric. 

1139 

1140 Mirrors the helper in :mod:`mcp.server` so the two code-mode entry 

1141 points read the same caps with the same parsing semantics. Empty, 

1142 whitespace-only, and non-numeric values all collapse to ``default`` 

1143 rather than raising — an operator who fat-fingers the env should 

1144 still get a working sandbox. 

1145 """ 

1146 raw = os.environ.get(name, "").strip() 

1147 if not raw: 

1148 return default 

1149 try: 

1150 return int(raw) 

1151 except ValueError: 

1152 return default 

1153 

1154 

1155def _float_env(name: str, default: float) -> float: 

1156 """Parse a float env var; fall back to default on missing/empty/non-numeric. 

1157 

1158 Same fall-back semantics as :func:`_int_env`. The duration cap is a 

1159 float so fractional seconds remain expressible. 

1160 """ 

1161 raw = os.environ.get(name, "").strip() 

1162 if not raw: 

1163 return default 

1164 try: 

1165 return float(raw) 

1166 except ValueError: 

1167 return default 

1168 

1169 

1170# Read the resource caps once at import time. Tests pin behaviour by 

1171# monkey-patching these module-level constants before constructing a 

1172# MissionSandbox. The defaults match the existing precedent in 

1173# ``mcp/server.py`` where the same env names are wired into the 

1174# Code Mode discovery transform's sandbox. 

1175_DURATION_LIMIT_SECS: float = _float_env("GCO_MCP_CODE_MODE_MAX_DURATION_SECS", 30.0) 

1176_MEMORY_LIMIT_BYTES: int = _int_env("GCO_MCP_CODE_MODE_MAX_MEMORY", 200_000_000) 

1177 

1178 

1179# --------------------------------------------------------------------------- 

1180# Lazy import of the runtime dependencies 

1181# --------------------------------------------------------------------------- 

1182# 

1183# The AST validator above must remain importable on a host where 

1184# ``fastmcp`` and ``pydantic_monty`` are not installed (for example a 

1185# CLI-only environment that runs ``gco mission validate`` against a 

1186# stored session JSON without ever wiring an engine). The provider class 

1187# and the error class are pulled in lazily by ``_import_provider`` and 

1188# cached at module level so repeated MissionSandbox constructions in the 

1189# same process pay the import cost exactly once. 

1190 

1191_MONTY_PROVIDER_CLASS: Any = None 

1192_MONTY_ERROR_CLASS: Any = None 

1193 

1194 

1195def _import_provider() -> tuple[Any, Any]: 

1196 """Lazy-import ``MontySandboxProvider`` and ``MontyError`` and cache them. 

1197 

1198 Returns the ``(provider_cls, error_cls)`` pair. The provider class 

1199 is the value the wrapper instantiates with a ``ResourceLimits`` 

1200 dict; the error class is the *base* ``pydantic_monty.MontyError`` 

1201 that covers the whole limit / runtime / typing / syntax family 

1202 raised from inside a script. We catch the base class rather than 

1203 the leaves so a future Monty release that adds a new error type 

1204 still routes through ``SandboxTerminated`` rather than escaping as 

1205 an opaque ``Exception``. 

1206 """ 

1207 global _MONTY_PROVIDER_CLASS, _MONTY_ERROR_CLASS 

1208 if _MONTY_PROVIDER_CLASS is None: 

1209 from fastmcp.experimental.transforms.code_mode import MontySandboxProvider 

1210 from pydantic_monty import MontyError 

1211 

1212 _MONTY_PROVIDER_CLASS = MontySandboxProvider 

1213 _MONTY_ERROR_CLASS = MontyError 

1214 return _MONTY_PROVIDER_CLASS, _MONTY_ERROR_CLASS 

1215 

1216 

1217# --------------------------------------------------------------------------- 

1218# Termination signal 

1219# --------------------------------------------------------------------------- 

1220 

1221 

1222class SandboxTerminated(Exception): 

1223 """Raised when the Monty sandbox kills a script (limit hit or in-script error). 

1224 

1225 The Mission engine catches this exception in its decide-phase and 

1226 produces a ``terminate`` verdict for the iteration. Whatever the 

1227 script collected via ``mission.observe(...)`` / ``mission.event(...)`` 

1228 before being killed is carried on the exception so the engine can 

1229 surface the partial Observation in the iteration's audit record — 

1230 a script that ran for 29 seconds and observed five intermediate 

1231 states should not lose those five states just because the 30-second 

1232 cap fired before the script returned. 

1233 

1234 ``cause`` is the underlying Monty exception's class name (e.g. 

1235 ``"MontyRuntimeError"``, ``"MontyTypingError"``) so callers can render 

1236 a stable structured-error envelope without holding a reference to 

1237 the original Monty exception object. 

1238 """ 

1239 

1240 def __init__( 

1241 self, 

1242 cause: str, 

1243 *, 

1244 partial_observations: list[dict[str, Any]] | None = None, 

1245 partial_events: list[dict[str, Any]] | None = None, 

1246 partial_script_call_log: list[dict[str, Any]] | None = None, 

1247 ) -> None: 

1248 self.cause: str = cause 

1249 # Defensive copies: callers occasionally inspect these lists 

1250 # after the exception has propagated several frames up. A 

1251 # shared reference would let a later mutation in the original 

1252 # closure corrupt the audit record. 

1253 self.partial_observations: list[dict[str, Any]] = list(partial_observations or []) 

1254 self.partial_events: list[dict[str, Any]] = list(partial_events or []) 

1255 # Partial in-script tool-call log captured by the per-tool 

1256 # wrappers up to the moment Monty killed the script. Carrying 

1257 # this onto the exception lets the engine's 

1258 # ``_execute_script`` stash the partial calls on the iteration 

1259 # record so a script that fired ten ``submit_job_sqs(...)`` 

1260 # calls before tripping the duration cap still records all ten 

1261 # in the audit log. Defensive copy for the same reason as the 

1262 # observe / event lists above. 

1263 self.partial_script_call_log: list[dict[str, Any]] = list(partial_script_call_log or []) 

1264 super().__init__(f"sandbox terminated: {cause}") 

1265 

1266 

1267# --------------------------------------------------------------------------- 

1268# Script rewrite — mission.observe/event → _mission_observe/_mission_event 

1269# --------------------------------------------------------------------------- 

1270# 

1271# The AST gate above accepts ``mission.observe(...)`` and 

1272# ``mission.event(...)`` as the only two attribute calls a script may 

1273# write on the ``mission`` namespace. The runtime needs those calls to 

1274# land on host-side closures so the iteration's ``observe_log`` / 

1275# ``event_log`` lists actually receive the appends — passing the 

1276# helpers in through ``inputs={"mission": <object>}`` would not work, 

1277# because :class:`MontySandboxProvider` round-trips ``inputs`` values 

1278# into the Monty VM by value (any in-script mutation lands on the VM 

1279# copy, not the host's). Wrapping the helpers in a small host-side 

1280# class and prepending it to the script as a preamble would not work 

1281# either: Monty's parser does not support ``class`` definitions. 

1282# 

1283# Instead, after validation, the host re-parses the script and 

1284# rewrites every accepted ``mission.<helper>(...)`` Call so its 

1285# callee becomes a bare-Name lookup of the corresponding reserved 

1286# external-function name. The rewritten source is then handed to 

1287# Monty, where ``_mission_observe`` / ``_mission_event`` resolve to 

1288# the host-side closures registered via ``external_functions``. 

1289# Operator scripts cannot reference these names directly: the AST 

1290# validator rejects them under ``name_not_allowed`` (neither is on 

1291# the per-session tool allowlist nor in any safe-builtin / exception 

1292# / mission base set), so the only path that produces those Name 

1293# nodes is the rewrite below. 

1294 

1295_MISSION_HELPER_RUNTIME_NAMES: Final[dict[str, str]] = { 

1296 "observe": "_mission_observe", 

1297 "event": "_mission_event", 

1298} 

1299# The keys must mirror ``_MISSION_HELPER_ATTRIBUTES`` exactly: 

1300# the validator opens up ``mission.<attr>`` for those two attributes, 

1301# and the rewriter below has to translate the same two and only the 

1302# same two. A future widening of the helper set has to add an entry 

1303# here too, or the rewriter would leave the new attribute as an 

1304# ``Attribute`` callee and Monty's parser would reject it. 

1305assert set(_MISSION_HELPER_RUNTIME_NAMES) == set(_MISSION_HELPER_ATTRIBUTES) 

1306 

1307 

1308class _MissionAttributeCallRewriter(ast.NodeTransformer): 

1309 """Rewrite ``mission.observe(...)`` / ``mission.event(...)`` callees. 

1310 

1311 The transformer replaces the ``Attribute`` callee on accepted 

1312 ``mission.<helper>`` Call nodes with a ``Name`` referencing the 

1313 corresponding external-function key (``_mission_observe`` / 

1314 ``_mission_event``). Args and kwargs ride through unchanged: the 

1315 AST validator already vetted them, and the rewrite preserves 

1316 source positions so any subsequent error in those subtrees still 

1317 points at the operator's original column. 

1318 

1319 The validator's :meth:`_ScriptValidator.visit_Attribute` already 

1320 rejects every other ``mission.<x>`` shape, so the transformer 

1321 only ever encounters the two helper attributes; defensive 

1322 fallthrough leaves any other ``Attribute`` callee untouched, but 

1323 in practice such a node would not have passed the gate. 

1324 """ 

1325 

1326 def visit_Call(self, node: ast.Call) -> ast.AST: 

1327 # Recurse into args / kwargs first so a nested 

1328 # ``mission.<helper>(...)`` (e.g. inside an f-string used as 

1329 # an argument) is rewritten too. ``self.generic_visit`` 

1330 # walks children and updates them in place. 

1331 self.generic_visit(node) 

1332 func = node.func 

1333 if ( 

1334 isinstance(func, ast.Attribute) 

1335 and isinstance(func.value, ast.Name) 

1336 and func.value.id == _MISSION_NAMESPACE_NAME 

1337 and func.attr in _MISSION_HELPER_RUNTIME_NAMES 

1338 ): 

1339 replacement = ast.Name( 

1340 id=_MISSION_HELPER_RUNTIME_NAMES[func.attr], 

1341 ctx=ast.Load(), 

1342 ) 

1343 ast.copy_location(replacement, func) 

1344 node.func = replacement 

1345 return node 

1346 

1347 

1348def _rewrite_mission_helpers(script: str) -> str: 

1349 """Re-parse ``script``, rewrite mission helper calls, and unparse. 

1350 

1351 Called after :func:`validate_script_ast` has already accepted the 

1352 source — so ``ast.parse`` cannot fail here on syntax that was 

1353 valid moments ago. Returns a fresh source string suitable for 

1354 handing to ``MontySandboxProvider.run``. 

1355 """ 

1356 tree = ast.parse(script, mode="exec") 

1357 rewritten = _MissionAttributeCallRewriter().visit(tree) 

1358 ast.fix_missing_locations(rewritten) 

1359 return ast.unparse(rewritten) 

1360 
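The effect of the rewrite is easiest to see on concrete input. The block below is a condensed standalone copy of the transformer, with the hypothetical names ``Rewriter``, ``rewrite``, and ``RUNTIME_NAMES``, so it runs without the module. It also shows a side effect of round-tripping through ``ast.unparse``: comments are dropped and string quoting is normalised, so the text handed to the sandbox is not byte-identical to the operator's source.

```python
import ast

RUNTIME_NAMES = {"observe": "_mission_observe", "event": "_mission_event"}


class Rewriter(ast.NodeTransformer):
    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # rewrite nested calls first
        func = node.func
        if (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "mission"
            and func.attr in RUNTIME_NAMES
        ):
            node.func = ast.copy_location(
                ast.Name(id=RUNTIME_NAMES[func.attr], ctx=ast.Load()), func
            )
        return node


def rewrite(source: str) -> str:
    tree = Rewriter().visit(ast.parse(source, mode="exec"))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


print(rewrite("await mission.observe(loss=0.1)"))
# await _mission_observe(loss=0.1)
print(rewrite('mission.event(name="done")  # trailing comment'))
# _mission_event(name='done')
```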

1361 

1362# --------------------------------------------------------------------------- 

1363# Tool callable wrapper 

1364# --------------------------------------------------------------------------- 

1365 

1366 

1367def _make_tool_wrapper( 

1368 tool_name: str, 

1369 ctx: Any | None, 

1370 tool_dispatcher: Callable[[str, dict[str, Any], Any], Awaitable[Any]], 

1371 script_call_log: list[dict[str, Any]], 

1372 session_id: str, 

1373 iteration_index: int, 

1374) -> Callable[..., Awaitable[Any]]: 

1375 """Build the per-tool async wrapper inserted into ``external_functions``. 

1376 

1377 The wrapper is keyword-only by design — the Mission script grammar 

1378 passes tool args as kwargs (``submit_job_sqs(manifest_path=..., 

1379 region=...)``) and rejecting positionals at call time keeps the 

1380 wrapper's record shape aligned with the engine's 

1381 :class:`ToolCallRecord`. A script that calls 

1382 ``submit_job_sqs("examples/x.yaml")`` with a positional argument 

1383 fails immediately with a ``TypeError`` from Python's call 

1384 machinery; that error surfaces through Monty as a 

1385 ``MontyRuntimeError`` and is caught by the wrapper layer in 

1386 :meth:`MissionSandbox.run`. 

1387 

1388 The wrapper appends one record to ``script_call_log`` per call, 

1389 whether the call succeeded or raised. A raised exception still 

1390 propagates out of the wrapper (so Monty surfaces it to the script 

1391 as a Python exception the script can catch with 

1392 ``try``/``except``), but the record carries ``status="failed"`` 

1393 plus a truncated error message so the engine's audit path sees 

1394 every invocation. 

1395 

1396 On both success and failure the wrapper also emits a 

1397 ``mission_script_call_event`` audit row tagged 

1398 ``via_script=True``. The dispatch into ``tool_dispatcher`` runs 

1399 the registered tool function, so the standard ``@audit_logged`` 

1400 entry has already fired by the time the wrapper reaches its emit 

1401 site — the script-call event is a *second*, distinct row that 

1402 lets consumers distinguish in-script invocations from direct 

1403 ``tool_calls`` strategy invocations without having to walk 

1404 timestamps. 

1405 """ 

1406 

1407 async def wrapper(**kwargs: Any) -> Any: 

1408 # Snapshot the kwargs into a fresh dict before dispatch so the 

1409 # log entry preserves exactly what the script passed even if 

1410 # the dispatcher mutates the dict downstream. 

1411 args = dict(kwargs) 

1412 started = time.monotonic() 

1413 try: 

1414 result = await tool_dispatcher(tool_name, args, ctx) 

1415 except Exception as exc: 

1416 duration_ms = max(int((time.monotonic() - started) * 1000), 0) 

1417 error_message = f"{type(exc).__name__}: {exc}"[:200] 

1418 script_call_log.append( 

1419 { 

1420 "tool_name": tool_name, 

1421 "args": args, 

1422 "status": "failed", 

1423 "result_summary": None, 

1424 "duration_ms": duration_ms, 

1425 # Truncated to 200 chars to match the audit 

1426 # module's existing convention for error_message 

1427 # fields elsewhere in the engine. 

1428 "error_message": error_message, 

1429 } 

1430 ) 

1431 # Emit the via_script audit row before re-raising so the 

1432 # event is recorded even when the script catches the 

1433 # exception and continues executing. 

1434 _audit.emit_script_call_event( 

1435 session_id, 

1436 iteration_index, 

1437 tool_name, 

1438 "failed", 

1439 duration_ms, 

1440 error_message=error_message, 

1441 ) 

1442 raise 

1443 duration_ms = max(int((time.monotonic() - started) * 1000), 0) 

1444 record: dict[str, Any] = { 

1445 "tool_name": tool_name, 

1446 "args": args, 

1447 "status": "ok", 

1448 "result_summary": result, 

1449 "duration_ms": duration_ms, 

1450 } 

1451 script_call_log.append(record) 

1452 _audit.emit_script_call_event( 

1453 session_id, 

1454 iteration_index, 

1455 tool_name, 

1456 "ok", 

1457 duration_ms, 

1458 ) 

1459 return result 

1460 

1461 # Setting ``__name__`` makes Monty's traceback render the 

1462 # operator's tool name rather than ``wrapper`` when a call goes 

1463 # wrong inside the sandboxed script. The script_call_log remains 

1464 # the canonical record of what fired. 

1465 wrapper.__name__ = tool_name 

1466 return wrapper 

1467 
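The wrapper's logging behaviour can be exercised with a fake dispatcher. This is a reduced sketch under stated assumptions: ``make_wrapper`` and ``fake_dispatcher`` are hypothetical, the audit emit and timing are omitted, the dispatcher takes no ``ctx``, and the wrapper is called directly from ``asyncio`` rather than through Monty.

```python
import asyncio
from typing import Any


def make_wrapper(tool_name: str, dispatcher, call_log: list[dict[str, Any]]):
    async def wrapper(**kwargs: Any) -> Any:
        args = dict(kwargs)  # snapshot before dispatch
        try:
            result = await dispatcher(tool_name, args)
        except Exception as exc:
            call_log.append({
                "tool_name": tool_name, "args": args, "status": "failed",
                "error_message": f"{type(exc).__name__}: {exc}"[:200],
            })
            raise
        call_log.append({"tool_name": tool_name, "args": args, "status": "ok"})
        return result

    wrapper.__name__ = tool_name
    return wrapper


async def fake_dispatcher(name: str, args: dict[str, Any]) -> Any:
    if args.get("fail"):
        raise RuntimeError("boom")
    return {"echo": args}


async def demo() -> tuple[list[dict[str, Any]], bool]:
    log: list[dict[str, Any]] = []
    tool = make_wrapper("submit_job_sqs", fake_dispatcher, log)
    await tool(region="us-east-1")
    try:
        await tool(fail=True)
    except RuntimeError:
        pass  # the failure was logged before the re-raise
    positional_rejected = False
    try:
        tool("examples/x.yaml")  # TypeError is raised at call time
    except TypeError:
        positional_rejected = True
    return log, positional_rejected


call_log, positional_rejected = asyncio.run(demo())
print([row["status"] for row in call_log], positional_rejected)
```

In this plain-Python sketch the positional call fails before the wrapper body starts, so it leaves no row in the log; only calls that reach the body are recorded.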

1468 

1469# --------------------------------------------------------------------------- 

1470# Observation assembly 

1471# --------------------------------------------------------------------------- 

1472 

1473 

1474def _annotate_call_result(call: dict[str, Any]) -> Any: 

1475 """Wrap a script-call ``result_summary`` with per-call markers. 

1476 

1477 Mirrors :meth:`MissionEngine._annotate_tool_result` for the 

1478 scripted-strategy path so the Observation's ``tool_results`` list 

1479 always carries the ``_status`` and ``tool_name`` markers the 

1480 predicate evaluator and the ``tool_call_succeeded`` evaluator 

1481 rely on, regardless of the underlying tool's return shape. 

1482 

1483 Strategy: 

1484 

1485 * **Dict result_summary**: augment a shallow copy with ``_status`` and 

1486 ``tool_name`` only when those keys are absent. This keeps any 

1487 caller-supplied marker visible while ensuring evaluators always 

1488 find them. 

1489 * **Non-dict result_summary**: wrap in a fresh dict carrying 

1490 the call's ``_status`` / ``tool_name`` plus a ``result`` field 

1491 that holds the original payload so predicates can still walk 

1492 into it. 

1493 """ 

1494 result = call.get("result_summary") 

1495 status = call.get("status") or "unknown" 

1496 tool_name = call.get("tool_name") 

1497 if isinstance(result, dict): 

1498 annotated = dict(result) 

1499 annotated.setdefault("_status", status) 

1500 annotated.setdefault("tool_name", tool_name) 

1501 return annotated 

1502 return { 

1503 "_status": status, 

1504 "tool_name": tool_name, 

1505 "result": result, 

1506 } 
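
The two annotation strategies above can be exercised in isolation. The following is a minimal standalone sketch that restates the function's logic under a hypothetical name (``annotate``) with made-up call records; it is meant only to illustrate the copy-then-``setdefault`` behaviour, not to replace the module's helper.

```python
from typing import Any


def annotate(call: dict[str, Any]) -> dict[str, Any]:
    """Standalone restatement of the annotation logic (illustrative only)."""
    result = call.get("result_summary")
    status = call.get("status") or "unknown"
    tool_name = call.get("tool_name")
    if isinstance(result, dict):
        # Shallow copy: the logged record is left untouched.
        annotated = dict(result)
        # setdefault keeps any marker the tool already supplied.
        annotated.setdefault("_status", status)
        annotated.setdefault("tool_name", tool_name)
        return annotated
    # Non-dict payloads are wrapped so evaluators can still reach them.
    return {"_status": status, "tool_name": tool_name, "result": result}


# Hypothetical call records, shaped like script_call_log entries.
dict_call = {
    "tool_name": "probe",
    "status": "ok",
    "result_summary": {"_status": "custom", "value": 3},
}
scalar_call = {"tool_name": "probe", "status": "ok", "result_summary": 42}

annotated_dict = annotate(dict_call)
annotated_scalar = annotate(scalar_call)
```

In this sketch the caller-supplied ``_status`` of ``"custom"`` survives, while the missing ``tool_name`` is filled in from the call record.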

1507 

1508 

1509def _build_script_observation( 

1510 *, 

1511 script_call_log: list[dict[str, Any]], 

1512 observe_log: list[dict[str, Any]], 

1513 event_log: list[dict[str, Any]], 

1514 phase_started_at: str, 

1515 phase_ended_at: str, 

1516) -> dict[str, Any]: 

1517 """Merge the closure-captured logs into an Observation dict. 

1518 

1519 Mirrors :meth:`MissionEngine._build_observation` for the 

1520 ``tool_calls`` strategy path so a downstream Evaluate_Phase / 

1521 Decide_Phase consumer cannot tell, from the Observation shape 

1522 alone, whether the iteration ran a scripted or a non-scripted 

1523 Strategy: 

1524 

1525 * ``tool_results`` lists every call's annotated ``result_summary`` (including 

1526 failures, for stable indexing against ``script_call_log``). 

1527 * ``metrics`` lifts any top-level ``metrics`` dict from a 

1528 successful tool result, exactly like the engine does. 

1529 * ``events`` pools the events emitted by tool results with the 

1530 ``mission.event(...)`` calls so the criteria evaluator only 

1531 walks one list. 

1532 * ``errors`` carries failed / skipped calls in the same shape the 

1533 engine uses, so the decide-phase heuristic that triggers 

1534 ``adjust`` on new errors keeps working unchanged. 

1535 

1536 The ``mission.observe(...)`` rows fold into a dedicated 

1537 ``observations`` bucket inside ``metrics`` rather than flat-merging 

1538 so a script-collected key cannot silently overwrite a tool-derived 

1539 metric of the same name. A criterion that wants a script-collected 

1540 key reads ``metrics.observations.<key>``; a criterion that wants a 

1541 tool-derived metric reads ``metrics.<key>``. The two namespaces 

1542 stay distinct. 

1543 """ 

1544 tool_results: list[Any] = [] 

1545 metrics: dict[str, Any] = {} 

1546 events: list[dict[str, Any]] = [] 

1547 errors: list[dict[str, Any]] = [] 

1548 

1549 for call in script_call_log: 

1550 tool_results.append(_annotate_call_result(call)) 

1551 if call.get("status") == "ok":  # 1551 ↛ 1563: line 1551 didn't jump to line 1563 because the condition on line 1551 was always true

1552 result = call.get("result_summary") 

1553 if isinstance(result, dict):  # 1553 ↛ 1554: line 1553 didn't jump to line 1554 because the condition on line 1553 was never true

1554 result_metrics = result.get("metrics") 

1555 if isinstance(result_metrics, dict): 

1556 metrics.update(result_metrics) 

1557 result_events = result.get("events") 

1558 if isinstance(result_events, list): 

1559 for event in result_events: 

1560 if isinstance(event, dict): 

1561 events.append(event) 

1562 else: 

1563 errors.append( 

1564 { 

1565 "tool_name": call.get("tool_name"), 

1566 "status": call.get("status"), 

1567 "error_message": call.get("error_message"), 

1568 } 

1569 ) 

1570 

1571 # Pool the script-side ``mission.event(...)`` calls with 

1572 # tool-derived events. ``dict(ev)`` is a defensive copy so a later 

1573 # mutation of the closure list does not bleed into the persisted 

1574 # Observation. 

1575 for ev in event_log: 

1576 events.append(dict(ev)) 

1577 

1578 # ``mission.observe(...)`` rows fold into a dedicated bucket on 

1579 # metrics so they remain addressable without colliding with 

1580 # tool-derived metric names. 

1581 if observe_log:  # 1581 ↛ 1587: line 1581 didn't jump to line 1587 because the condition on line 1581 was always true

1582 observations_bucket: dict[str, Any] = {} 

1583 for entry in observe_log: 

1584 observations_bucket[entry["key"]] = entry["value"] 

1585 metrics["observations"] = observations_bucket 

1586 

1587 observation: dict[str, Any] = { 

1588 "tool_results": tool_results, 

1589 "metrics": metrics, 

1590 "events": events, 

1591 "phase_started_at": phase_started_at, 

1592 "phase_ended_at": phase_ended_at, 

1593 } 

1594 if errors:  # 1594 ↛ 1595: line 1594 didn't jump to line 1595 because the condition on line 1594 was never true

1595 observation["errors"] = errors 

1596 return observation 
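
The namespacing rule described in the docstring (script-collected keys live under ``metrics.observations``, tool-derived keys stay at the top level) can be sketched on its own. The helper name ``fold_metrics`` and the sample values below are assumptions made for illustration; only the bucket layout and the last-write-wins behaviour per key are taken from the function above.

```python
from typing import Any


def fold_metrics(
    tool_metrics: dict[str, Any],
    observe_log: list[dict[str, Any]],
) -> dict[str, Any]:
    """Illustrative fold of observe rows into a dedicated bucket."""
    metrics = dict(tool_metrics)
    if observe_log:
        bucket: dict[str, Any] = {}
        for entry in observe_log:
            # A later observe() call for the same key replaces the earlier one.
            bucket[entry["key"]] = entry["value"]
        metrics["observations"] = bucket
    return metrics


# Hypothetical data: a tool metric and two script observations that
# deliberately reuse the same key name.
folded = fold_metrics(
    {"latency_ms": 120},
    [
        {"key": "latency_ms", "value": 999},
        {"key": "latency_ms", "value": 5},
    ],
)
empty = fold_metrics({"latency_ms": 120}, [])
```

Here the tool-derived ``latency_ms`` is preserved at the top level, the script-collected value is reachable as ``metrics["observations"]["latency_ms"]``, and no ``observations`` key appears when the script observed nothing.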

1597 

1598 

1599# --------------------------------------------------------------------------- 

1600# MissionSandbox 

1601# --------------------------------------------------------------------------- 

1602 

1603 

1604class MissionSandbox: 

1605 """Run a validated Mission script under ``MontySandboxProvider`` limits. 

1606 

1607 One sandbox per iteration. The constructor freezes the per-iteration 

1608 ``mission`` namespace as a :class:`types.MappingProxyType` snapshot 

1609 (so a script cannot reach back through ``mission`` and mutate the 

1610 session record), pins the operator's tool allowlist, and builds the 

1611 underlying ``MontySandboxProvider`` with the duration / memory 

1612 limits read from the module-level constants. :meth:`run` then 

1613 drives a single script execution end to end: 

1614 

1615 1. AST validate via :func:`validate_script_ast` — propagation of 

1616 :class:`ScriptRejected` is the engine's signal to fail the 

1617 Execute_Phase with reason ``script_rejected``. 

1618 2. Build the ``external_functions`` map: one async wrapper per 

1619 allowlisted tool, each forwarding into the engine's tool 

1620 dispatcher so the wrapper preserves the existing 

1621 ``@audit_logged`` / feature-flag / allowlist semantics — running 

1622 inside a script is *not* a way to bypass any of those. 

1623 3. Execute under Monty's caps. Any ``MontyError`` (limit / 

1624 runtime / typing / syntax) is re-raised as 

1625 :class:`SandboxTerminated` carrying whatever the script 

1626 collected before being killed. 

1627 4. Fold the closure-captured tool log, observe log, and event log 

1628 into an Observation dict whose shape exactly matches the 

1629 engine's tool-calls path. 

1630 

1631 The sandbox is immutable after construction: there are no setters, 

1632 no rebuild methods, and the underlying provider is held by 

1633 reference rather than recreated per call. Each iteration gets its 

1634 own MissionSandbox so a stale frozen namespace cannot leak across 

1635 iterations. 

1636 """ 

1637 

1638 def __init__( 

1639 self, 

1640 allowlist: list[str], 

1641 session: Any, 

1642 ) -> None: 

1643 # Defensive copy of the allowlist: the engine pins the 

1644 # allowlist on the session at create time, but a shared list 

1645 # reference would let later mutations slip past the AST 

1646 # validator's frozenset (which is constructed once per 

1647 # validation call from ``self._allowlist``). 

1648 self._allowlist: list[str] = list(allowlist) 

1649 

1650 # Build the per-iteration mission namespace as an immutable 

1651 # snapshot. Each iteration summary carries only the four 

1652 # fields a script needs to reason about prior progress — 

1653 # full IterationRecord shapes would be both heavy and 

1654 # tempting for a script to walk in ways the engine does not 

1655 # support. 

1656 iteration_summaries: list[dict[str, Any]] = [] 

1657 for it in session.get("iterations") or []:  # 1657 ↛ 1658: line 1657 didn't jump to line 1658 because the loop on line 1657 never started

1658 iteration_summaries.append( 

1659 { 

1660 "iteration_index": it.get("iteration_index"), 

1661 "verdict": it.get("verdict"), 

1662 "verdict_reason": it.get("verdict_reason"), 

1663 "checkpoint_evaluated": it.get("checkpoint_evaluated"), 

1664 } 

1665 ) 

1666 # ``copy.deepcopy`` on criteria + budget so a script that 

1667 # walks them via subscripting cannot mutate the session 

1668 # record even if Python's MappingProxyType were ever 

1669 # bypassed by a future change. 

1670 ns: dict[str, Any] = { 

1671 "session_id": session["session_id"], 

1672 "iteration_index": len(session.get("iterations") or []), 

1673 "directive_text": session.get("directive_text", ""), 

1674 "criteria": copy.deepcopy(session.get("criteria") or []), 

1675 "budget": copy.deepcopy(session.get("budget") or {}), 

1676 "iterations": iteration_summaries, 

1677 } 

1678 self._frozen_mission_ns: MappingProxyType[str, Any] = MappingProxyType(ns) 

1679 

1680 # Construct the provider once and pin it on the instance. 

1681 # The provider holds no per-call state, so reusing it across 

1682 # multiple ``run`` calls would be safe in principle, but the 

1683 # one-sandbox-per-iteration lifetime keeps the failure 

1684 # surface small and matches the rest of the per-iteration 

1685 # state above. 

1686 provider_cls, _ = _import_provider() 

1687 self._provider = provider_cls( 

1688 limits={ 

1689 "max_duration_secs": _DURATION_LIMIT_SECS, 

1690 "max_memory": _MEMORY_LIMIT_BYTES, 

1691 } 

1692 ) 

1693 

1694 # ---- read-only accessors ------------------------------------------ 

1695 

1696 @property 

1697 def frozen_mission_ns(self) -> MappingProxyType[str, Any]: 

1698 """The iteration's frozen ``mission`` namespace snapshot.""" 

1699 return self._frozen_mission_ns 

1700 

1701 @property 

1702 def allowlist(self) -> list[str]: 

1703 """Defensive copy of the per-session tool allowlist.""" 

1704 return list(self._allowlist) 

1705 

1706 # ---- public surface ----------------------------------------------- 

1707 

1708 async def run( 

1709 self, 

1710 script: str, 

1711 ctx: Any | None, 

1712 tool_dispatcher: Callable[[str, dict[str, Any], Any], Awaitable[Any]], 

1713 ) -> tuple[dict[str, Any], list[dict[str, Any]]]: 

1714 """Validate, execute, and observe a Mission script. 

1715 

1716 Returns ``(observation, script_call_log)`` matching the shape 

1717 the engine's ``_execute_script`` expects: the observation is a 

1718 plain dict (the engine casts it to :class:`Observation` at the call 

1719 site) and the call log is a list of 

1720 :class:`ToolCallRecord`-shaped dicts. 

1721 

1722 On any ``MontyError`` from the provider — duration cap, memory 

1723 cap, runtime / typing / syntax error inside the script — the 

1724 method re-raises as :class:`SandboxTerminated` carrying the 

1725 closure-captured partial observations and events. The engine's 

1726 decide-phase pattern-matches on this exception and produces a 

1727 ``terminate`` verdict for the iteration. 

1728 

1729 ``ScriptRejected`` from the AST validator propagates upward 

1730 unchanged: the engine's Execute_Phase treats that as a 

1731 ``script_rejected`` failure and never reaches the runtime path 

1732 below. 

1733 """ 

1734 # Step 1: AST gate. Propagating ``ScriptRejected`` upward is 

1735 # deliberate — the engine's _execute_phase wraps it as a 

1736 # phase failure with reason ``script_rejected``; doing the 

1737 # rejection here means the runtime path never sees a 

1738 # disallowed source. 

1739 validate_script_ast(script, self._allowlist) 

1740 

1741 _, monty_error_cls = _import_provider() 

1742 

1743 # Closure-captured collectors. Populated synchronously by the 

1744 # host-side helper closures registered as 

1745 # ``external_functions`` and the per-tool wrappers; observed 

1746 # post-run (or post-termination) to build the Observation. 

1747 # Lists rather than dicts so the order in which the script 

1748 # called ``mission.event`` / ``mission.observe`` is preserved 

1749 # in the final record. 

1750 observe_log: list[dict[str, Any]] = [] 

1751 event_log: list[dict[str, Any]] = [] 

1752 script_call_log: list[dict[str, Any]] = [] 

1753 

1754 # Host-side helpers for ``mission.observe`` and 

1755 # ``mission.event``. Routing them through the 

1756 # ``external_functions`` channel — rather than as bound 

1757 # methods on a dataclass shipped via ``inputs`` — is what 

1758 # makes script-side mutations visible to the host: 

1759 # ``MontySandboxProvider`` round-trips ``inputs`` values into 

1760 # the underlying Monty VM by value, so appends made through a 

1761 # method of an ``inputs`` dataclass would only ever reach the 

1762 # VM-side copy of the list. The external-functions 

1763 # channel runs each call back in host Python, so the lists 

1764 # below receive the appends. 

1765 # 

1766 # The signatures match the original ``mission.observe`` / 

1767 # ``mission.event`` script-facing surface: ``observe`` takes 

1768 # ``(key, value)`` positionally, ``event`` takes ``name`` 

1769 # positionally plus arbitrary keyword arguments. The AST 

1770 # rewrite below replaces the attribute callee with a bare 

1771 # Name lookup but leaves args / kwargs unchanged, so the 

1772 # call shape that lands on these helpers is exactly what an 

1773 # operator would write at the script surface. 

1774 async def _mission_observe(key: str, value: Any) -> None: 

1775 observe_log.append({"key": key, "value": value}) 

1776 

1777 async def _mission_event(name: str, **kwargs: Any) -> None: 

1778 event_row: dict[str, Any] = {"event_name": name} 

1779 event_row.update(kwargs) 

1780 event_log.append(event_row) 

1781 

1782 # The frozen mission namespace remains pinned on this 

1783 # sandbox instance (``self._frozen_mission_ns``) so a future 

1784 # widening of the script surface can expose it without 

1785 # rebuilding the construction-time snapshot. It does *not* 

1786 # ride through the ``inputs`` channel today: the validator 

1787 # never accepts attribute access on anything other than 

1788 # ``mission`` (and the only two ``mission`` attributes are 

1789 # the ``observe`` / ``event`` helpers handled by the 

1790 # preamble below), so a script has no way to read the 

1791 # snapshot through Monty's runtime. Holding it on the host 

1792 # side is the simpler shape; routing it as a ``Mapping`` 

1793 # through ``inputs`` would require Monty to convert the 

1794 # full mapping + nested dicts to its own value model and 

1795 # pay a per-iteration translation cost for data nothing 

1796 # observes. 

1797 

1798 # Build the external_functions mapping. Each tool name maps 

1799 # to an async wrapper; Monty's ``external_functions`` channel 

1800 # auto-wraps sync callables to async, but we register native 

1801 # async functions so the dispatcher's ``await`` chain stays 

1802 # explicit and the wrapper can do its own timing. 

1803 external_functions: dict[str, Callable[..., Any]] = {} 

1804 # Pull the per-iteration identifiers off the frozen namespace 

1805 # snapshot built at construction time so the wrapper records 

1806 # the same ``session_id`` / ``iteration_index`` the rest of 

1807 # the iteration's audit rows carry. 

1808 session_id = self._frozen_mission_ns["session_id"] 

1809 iteration_index = self._frozen_mission_ns["iteration_index"] 

1810 for tool_name in self._allowlist: 

1811 external_functions[tool_name] = _make_tool_wrapper( 

1812 tool_name, 

1813 ctx, 

1814 tool_dispatcher, 

1815 script_call_log, 

1816 session_id, 

1817 iteration_index, 

1818 ) 

1819 

1820 # The two helper functions ride alongside the per-tool 

1821 # wrappers under reserved underscore-prefixed names. Operator 

1822 # scripts cannot collide with these: the AST validator 

1823 # rejects ``_mission_observe`` and ``_mission_event`` as 

1824 # bare names (neither is on the per-session tool allowlist 

1825 # nor any of the safe-builtin / exception / mission base 

1826 # sets), so a script that wrote ``_mission_observe(...)`` 

1827 # directly would fail the gate with ``name_not_allowed``. 

1828 # Only the AST rewrite below — applied *after* the gate — 

1829 # ever produces those Name nodes. 

1830 external_functions["_mission_observe"] = _mission_observe 

1831 external_functions["_mission_event"] = _mission_event 

1832 

1833 # The validated operator source is re-parsed and rewritten 

1834 # so every accepted ``mission.<helper>(...)`` Call's callee 

1835 # becomes a bare-Name lookup of the corresponding reserved 

1836 # external-function name. Monty's parser does not accept 

1837 # ``class`` / nested-attribute shims that would otherwise 

1838 # let us preserve the surface attribute call, so the 

1839 # rewrite happens on the AST itself before the source ever 

1840 # reaches the underlying VM. Operator code keeps its 

1841 # author-time surface (``await mission.observe(key, value)``); 

1842 # only the run-time surface differs. 

1843 final_source = _rewrite_mission_helpers(script) 

1844 

1845 phase_started_at = datetime.now(UTC).isoformat() 

1846 

1847 try: 

1848 await self._provider.run( 

1849 code=final_source, 

1850 inputs={}, 

1851 external_functions=external_functions, 

1852 ) 

1853 except monty_error_cls as exc: 

1854 # ``MontyError`` is the base of the limit / runtime / 

1855 # typing / syntax error family. Catching the base class 

1856 # rather than the leaves means a future Monty release 

1857 # adding a new error type still routes through 

1858 # ``SandboxTerminated`` rather than escaping as an opaque 

1859 # ``Exception``. 

1860 raise SandboxTerminated( 

1861 type(exc).__name__, 

1862 partial_observations=list(observe_log), 

1863 partial_events=list(event_log), 

1864 partial_script_call_log=list(script_call_log), 

1865 ) from exc 

1866 

1867 phase_ended_at = datetime.now(UTC).isoformat() 

1868 

1869 # The script's return value is intentionally ignored: the 

1870 # contract documented for the script surface is "use 

1871 # ``mission.observe(...)`` / ``mission.event(...)`` to report 

1872 # data". A script that returned a dict would conflict with 

1873 # the helper-driven observation list, and the engine's 

1874 # observe-phase already accepts a pre-built Observation 

1875 # without consulting any return value. 

1876 observation = _build_script_observation( 

1877 script_call_log=script_call_log, 

1878 observe_log=observe_log, 

1879 event_log=event_log, 

1880 phase_started_at=phase_started_at, 

1881 phase_ended_at=phase_ended_at, 

1882 ) 

1883 return observation, list(script_call_log) 
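
The helper rewrite that ``run`` applies after the AST gate can be pictured with a small transformer. The real ``_rewrite_mission_helpers`` is defined elsewhere in this module and is not shown here, so the sketch below is an assumption about one way to get the described effect: the class name ``_HelperRewriter``, the function ``rewrite_mission_helpers_sketch``, and the mapping are hypothetical, while the reserved names ``_mission_observe`` / ``_mission_event`` come from the comments above. It relies only on the standard-library ``ast`` module (``ast.unparse`` needs Python 3.9+).

```python
import ast

# Surface attribute -> reserved external-function name (names from the
# comments above; the mapping object itself is illustrative).
_HELPERS = {"observe": "_mission_observe", "event": "_mission_event"}


class _HelperRewriter(ast.NodeTransformer):
    """Replace ``mission.<helper>(...)`` callees with bare Name lookups."""

    def visit_Call(self, node: ast.Call) -> ast.AST:
        # Rewrite nested calls in args / kwargs first.
        self.generic_visit(node)
        func = node.func
        if (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "mission"
            and func.attr in _HELPERS
        ):
            # Only the callee changes; args and keywords are left as written.
            node.func = ast.copy_location(
                ast.Name(id=_HELPERS[func.attr], ctx=ast.Load()), func
            )
        return node


def rewrite_mission_helpers_sketch(source: str) -> str:
    tree = _HelperRewriter().visit(ast.parse(source, mode="exec"))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


rewritten = rewrite_mission_helpers_sketch(
    "mission.observe('k', 1)\n"
    "mission.event('done', count=2)\n"
    "other.observe(3)"
)
```

Because the rewrite runs only after validation, a script that spelled ``_mission_observe(...)`` directly would already have been rejected by the gate; the sketch does not attempt to reproduce that check.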

1884 

1885 

1886# --------------------------------------------------------------------------- 

1887# Default factory 

1888# --------------------------------------------------------------------------- 

1889 

1890 

1891def make_default_sandbox_runner( 

1892 allowlist: list[str], 

1893 session: Any, 

1894) -> Callable[ 

1895 [str, Any, Callable[[str, dict[str, Any], Any], Awaitable[Any]]], 

1896 Awaitable[tuple[dict[str, Any], list[dict[str, Any]]]], 

1897]: 

1898 """Build the default ``sandbox_runner`` callable for the engine. 

1899 

1900 The :class:`MissionEngine` takes a callable matching the 

1901 ``SandboxRunner`` protocol (``(script, ctx, tool_dispatcher) -> 

1902 (observation_dict, script_call_log)``); this helper wraps a fresh 

1903 :class:`MissionSandbox` for a given session and returns the bound 

1904 :meth:`MissionSandbox.run` method so the engine can drive the 

1905 sandbox without depending on the sandbox class itself. 

1906 

1907 One sandbox per session: the constructor freezes a snapshot of the 

1908 session's directive, criteria, budget, and prior-iteration 

1909 summaries into the ``mission`` namespace, so reusing a runner 

1910 across sessions would leak stale state. The engine's normal 

1911 construction path therefore calls this factory once per 

1912 ``mission_start`` and pins the returned callable on the engine 

1913 instance for the session's lifetime. 

1914 """ 

1915 sandbox = MissionSandbox( 

1916 allowlist=allowlist, 

1917 session=session, 

1918 ) 

1919 return sandbox.run 
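
Returning the bound ``run`` method lets the engine depend on a call signature rather than on the sandbox class. The sketch below illustrates that pattern with stand-ins: ``FakeSandbox``, ``make_runner``, ``echo_dispatcher``, and the ``SandboxRunner`` protocol definition are all hypothetical and only mirror the ``(script, ctx, tool_dispatcher) -> (observation_dict, script_call_log)`` shape described in the docstring above.

```python
import asyncio
from typing import Any, Awaitable, Callable, Protocol

ToolDispatcher = Callable[[str, dict[str, Any], Any], Awaitable[Any]]
RunResult = tuple[dict[str, Any], list[dict[str, Any]]]


class SandboxRunner(Protocol):
    """Assumed shape of the runner protocol the engine accepts."""

    def __call__(
        self, script: str, ctx: Any, tool_dispatcher: ToolDispatcher
    ) -> Awaitable[RunResult]: ...


class FakeSandbox:
    """Stand-in for MissionSandbox; it forwards one call and logs it."""

    def __init__(self, session: dict[str, Any]) -> None:
        self._session_id = session["session_id"]

    async def run(
        self, script: str, ctx: Any, tool_dispatcher: ToolDispatcher
    ) -> RunResult:
        result = await tool_dispatcher("echo", {"text": script}, ctx)
        log = [{"tool_name": "echo", "status": "ok", "result_summary": result}]
        return {"tool_results": [result]}, log


def make_runner(session: dict[str, Any]) -> SandboxRunner:
    # The bound method carries the sandbox instance with it, so the
    # caller never needs to import or name the sandbox class.
    return FakeSandbox(session).run


async def echo_dispatcher(name: str, args: dict[str, Any], ctx: Any) -> Any:
    return args["text"]


runner = make_runner({"session_id": "s-1"})
observation, call_log = asyncio.run(runner("hello", None, echo_dispatcher))
```

A test double for the engine can be built the same way: any callable with this signature satisfies the protocol, whether or not a sandbox sits behind it.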

1920 

1921 

1922# --------------------------------------------------------------------------- 

1923# Public surface 

1924# --------------------------------------------------------------------------- 

1925 

1926 

1927__all__ = [ 

1928 "MissionSandbox", 

1929 "ScriptRejected", 

1930 "SandboxTerminated", 

1931 "make_default_sandbox_runner", 

1932 "validate_script_ast", 

1933]