Coverage for mcp/mission/sandbox.py: 88%

1"""Restricted AST validator for Mission ``Strategy.script`` source.

3Where a ``Criterion(kind="predicate")`` carries a single expression, a

4``Strategy.script`` carries a multi-statement Python module that runs

5inside the Mission sandbox to drive an iteration. Both surfaces accept

6untrusted operator input, so both go through a parse-time AST allowlist

7before any execution. This module owns the script side: it parses

8scripts in ``exec`` mode, walks the tree with an explicit list of

9allowed nodes, and rejects everything else with :class:`ScriptRejected`.

11The script surface is wider than the predicate surface — multi-statement

12control flow, helper function definitions, named-exception ``try`` /

13``except`` / ``finally`` blocks, plus calls to the operator-supplied

14tool allowlist — so this module is its own validator rather than a

15shared base class. The structural decisions (an :class:`ast.NodeVisitor`

16that defines a ``visit_*`` for every accepted node and rejects in

17``generic_visit``, an exception type carrying ``reason`` /

18``failing_node`` / ``lineno`` / ``col_offset``, dunder filtering on

19strings and identifiers, comprehension-target shadowing checks) mirror

20:mod:`mcp.mission.predicate` so the two layers reject the same shapes

21the same way.

23Two layers, same as the predicate sandbox:

251. **Parse-time validation.** :func:`validate_script_ast` parses the

26 source in ``exec`` mode and walks the tree with

27 :class:`_ScriptValidator`. The first disallowed construct raises

28 :class:`ScriptRejected`; the script never runs.

292. **Run-time isolation.** The runtime layer (the

30 :class:`MissionSandbox` wrapper around ``MontySandboxProvider``)

31 executes a validated script under shared duration / memory limits

32 with an explicit namespace that withholds dangerous builtins like

33 ``open`` / ``getattr`` / ``__import__``. Even a tree that slipped

34 past this validator would fail at name lookup.

36Allowed surface

37---------------

38**Statements:** ``Module``, ``Expr``, ``Assign``, ``AugAssign``,

39``AnnAssign``, ``If``, ``While``, ``For``, ``Pass``, ``Break``,

40``Continue``, ``Return``, ``FunctionDef`` (no decorators), ``Try``

41(named-exception handlers only), ``Raise``.

43**Expressions:** constants, names from the allowlist, container

44literals (``List`` / ``Tuple`` / ``Set`` / ``Dict``), comprehensions

45(``ListComp`` / ``SetComp`` / ``DictComp`` / ``GeneratorExp``),

46``BinOp`` / ``UnaryOp`` / ``BoolOp`` / ``Compare`` / ``IfExp``,

47subscript and slice access, f-strings, lambdas, the walrus operator,

48plus calls.

50**Names visible to a script (the *base scope*):**

52- ``mission`` — the per-iteration namespace; the only allowed

53 attribute accesses are ``mission.observe`` and ``mission.event``.

54- The pure builtin callables ``len``, ``min``, ``max``, ``sum``,

55 ``abs``, ``any``, ``all``, ``sorted``, ``range``, ``enumerate``,

56 ``zip``, ``list``, ``dict``, ``tuple``, ``set``, ``str``, ``int``,

57 ``float``, ``bool``.

58- A small set of built-in exception classes so ``raise ValueError(...)``

59 and ``except KeyError as e:`` both work without importing.

60- Every tool name the operator placed on the per-session allowlist.

62**Calls** may target a bare name from the base scope, a name a script

63introduced (a function it defined or a value it bound), or one of the

64two attribute calls ``mission.observe(...)`` / ``mission.event(...)``.

65``exec``, ``eval``, ``compile``, and ``__import__`` are rejected by

66name even if a script binds those identifiers locally.

68Rejected outright

69-----------------

70``Import`` / ``ImportFrom``, ``ClassDef``, ``AsyncFunctionDef`` /

71``AsyncFor`` / ``AsyncWith``, ``Yield`` / ``YieldFrom``, ``Global`` /

72``Nonlocal``, ``Match``, ``With``, ``Assert``, ``Delete``, decorators

73(the allowlist is currently empty), bare ``except:`` clauses,

74attribute access on anything other than ``mission``, calls on

75attributes / subscripts / other calls, dunder strings and identifiers,

76and any binding (``Assign``, ``AnnAssign``, ``AugAssign``, walrus,

77function parameter, function name, comprehension target, ``for``

78target, ``except as`` name) whose name shadows a base-scope identifier.

80``Await`` carries a single, narrow exception: ``await <tool>(...)``

81where ``<tool>`` is a bare name on the per-session tool allowlist.

82The runtime layer below exposes every allowlisted tool through the

83underlying Monty ``external_functions`` channel as a coroutine

84factory, so the script must ``await`` the call to receive the

85dispatcher's return value rather than a coroutine object. Every

86other ``Await`` shape — ``await name`` on a non-call, ``await

87mission.observe(...)``, ``await some_other_tool()`` for a tool not on

88the allowlist, ``await (lambda: ...)()`` — stays rejected with

89reason ``await_not_allowed``.

90"""
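The deny-by-default structure described in the docstring above can be sketched in a few lines. This is a minimal illustration, not the module's validator: ``Rejected``, ``TinyValidator`` and ``is_accepted`` are hypothetical names, and the accepted surface is deliberately tiny (bare constant expressions only).

```python
import ast


class Rejected(Exception):
    """Hypothetical stand-in for ScriptRejected."""


class TinyValidator(ast.NodeVisitor):
    """Only node types with an explicit visit_* method are accepted."""

    def generic_visit(self, node):
        # Anything without a dedicated visitor lands here and is refused.
        raise Rejected(f"{type(node).__name__} is not allowed")

    def visit_Module(self, node):
        for stmt in node.body:
            self.visit(stmt)

    def visit_Expr(self, node):
        self.visit(node.value)

    def visit_Constant(self, node):
        # Leaf node: nothing to recurse into, but it must opt in.
        pass


def is_accepted(source):
    """Parse in exec mode and report whether every node is on the allowlist."""
    try:
        TinyValidator().visit(ast.parse(source, mode="exec"))
    except Rejected:
        return False
    return True
```

In this sketch ``is_accepted("import os")`` is false because ``Import`` has no visitor, and so is ``is_accepted("x = 1")`` because the sketch never opted in to ``Assign``.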

92from __future__ import annotations

94import ast

95from collections.abc import Iterable

96from typing import Final, NoReturn

98# <pyflowchart-code-diagram> BEGIN - auto-inserted, do not edit

99# Flowchart(s) generated from this file:

100# * ``validate_script_ast`` -> ``diagrams/code_diagrams/mcp/mission/sandbox.validate_script_ast.html``

101# (PNG: ``diagrams/code_diagrams/mcp/mission/sandbox.validate_script_ast.png``)

102# Regenerate with ``python diagrams/code_diagrams/generate.py``.

103# <pyflowchart-code-diagram> END

104

105

106# ---------------------------------------------------------------------------

107# Allowlists

108# ---------------------------------------------------------------------------

109

110_SAFE_BUILTINS: Final[frozenset[str]] = frozenset(

111 {

112 "len",

113 "min",

114 "max",

115 "sum",

116 "abs",

117 "any",

118 "all",

119 "sorted",

120 "range",

121 "enumerate",

122 "zip",

123 "list",

124 "dict",

125 "tuple",

126 "set",

127 "str",

128 "int",

129 "float",

130 "bool",

131 }

132)

133"""Pure builtin callables a script may look up by bare name."""

134

135_ALLOWED_EXCEPTION_NAMES: Final[frozenset[str]] = frozenset(

136 {

137 "Exception",

138 "ValueError",

139 "TypeError",

140 "KeyError",

141 "IndexError",

142 "AttributeError",

143 "LookupError",

144 "RuntimeError",

145 "ArithmeticError",

146 "ZeroDivisionError",

147 "OverflowError",

148 "OSError",

149 "FileNotFoundError",

150 "TimeoutError",

151 "ConnectionError",

152 "StopIteration",

153 "AssertionError",

154 }

155)

156"""Built-in exception classes a script may name in ``raise`` and ``except``.

157

158Including these in the base scope is what lets a script say

159``except ValueError as e:`` or ``raise RuntimeError("msg")`` without an

160``import``. Constructing an exception instance is side-effect-free, so

161exposing the class is no broader than exposing the safe builtins.

162"""

163

164_MISSION_NAMESPACE_NAME: Final[str] = "mission"

165"""Top-level identifier reserved for the per-iteration helper namespace."""

166

167_MISSION_HELPER_ATTRIBUTES: Final[frozenset[str]] = frozenset({"observe", "event"})

168"""Only attributes the validator accepts on the ``mission`` namespace."""

169

170_FORBIDDEN_CALL_TARGETS: Final[frozenset[str]] = frozenset(

171 {"exec", "eval", "compile", "__import__"}

172)

173"""Names whose call form is rejected by name even if a script shadows them.

174

175A script could in principle write ``def exec(): ...`` and then call its

176own local. Rejecting these names at the call site as well as via the

177dunder filter (for ``__import__``) closes the gap.

178"""
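To see why a by-name check at the call site catches the shadow attempt, the following sketch scans a source string for bare-name calls to the forbidden identifiers. It is an assumption-laden illustration: it uses ``ast.walk`` rather than the module's visitor, and ``FORBIDDEN`` and ``forbidden_calls`` are hypothetical names.

```python
import ast

# Same four names as the constant above, repeated so the sketch is self-contained.
FORBIDDEN = frozenset({"exec", "eval", "compile", "__import__"})


def forbidden_calls(source):
    """Return the forbidden bare-name call targets found in ``source``."""
    hits = []
    for node in ast.walk(ast.parse(source, mode="exec")):
        # Only ``name(...)`` calls are inspected; the name is compared as text,
        # so a locally defined ``exec`` is flagged exactly like the builtin.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN:
                hits.append(node.func.id)
    return hits
```

A script that defines its own ``exec`` and then calls it still produces a hit, because the check never consults what the name is bound to.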

179

180_ALLOWED_DECORATORS: Final[frozenset[str]] = frozenset()

181"""Decorator names a function definition may carry.

182

183Currently empty: any ``@decorator`` on a ``FunctionDef`` is rejected.

184The hook is here so a future iteration can vet a small set of operator-

185facing helpers (e.g. a retry decorator) by editing only this constant.

186"""

187

188_ALLOWED_BIN_OPS: Final[tuple[type[ast.operator], ...]] = (

189 ast.Add,

190 ast.Sub,

191 ast.Mult,

192 ast.Div,

193 ast.FloorDiv,

194 ast.Mod,

195 ast.Pow,

196 ast.MatMult,

197)

198

199_ALLOWED_UNARY_OPS: Final[tuple[type[ast.unaryop], ...]] = (

200 ast.UAdd,

201 ast.USub,

202 ast.Not,

203 ast.Invert,

204)

205

206_ALLOWED_COMPARE_OPS: Final[tuple[type[ast.cmpop], ...]] = (

207 ast.Eq,

208 ast.NotEq,

209 ast.Lt,

210 ast.LtE,

211 ast.Gt,

212 ast.GtE,

213 ast.Is,

214 ast.IsNot,

215 ast.In,

216 ast.NotIn,

217)

218

219_ALLOWED_BOOL_OPS: Final[tuple[type[ast.boolop], ...]] = (ast.And, ast.Or)
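The operator tuples above are shaped for ``isinstance`` checks against ``node.op``. A minimal sketch of that check for binary operators follows; ``ALLOWED_BIN_OPS`` duplicates the tuple above and ``binop_allowed`` is a hypothetical helper, not part of the module.

```python
import ast

ALLOWED_BIN_OPS = (
    ast.Add, ast.Sub, ast.Mult, ast.Div,
    ast.FloorDiv, ast.Mod, ast.Pow, ast.MatMult,
)


def binop_allowed(expr):
    """Return True if the single binary operation in ``expr`` uses an allowed operator."""
    node = ast.parse(expr, mode="eval").body
    if not isinstance(node, ast.BinOp):
        raise ValueError("expected a single binary operation")
    # Bitwise and shift operators (BitOr, BitAnd, BitXor, LShift, RShift)
    # are absent from the tuple, so they fail this test.
    return isinstance(node.op, ALLOWED_BIN_OPS)
```

Under this sketch ``a + b`` passes while ``a << b`` and ``a | b`` do not, which are the kinds of inputs that would exercise the rejection branch of a ``BinOp`` check.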

220

221

222# ---------------------------------------------------------------------------

223# Exception

224# ---------------------------------------------------------------------------

225

226

227class ScriptRejected(Exception):

228 """Raised when a script source contains a disallowed construct.

229

230 Mirror of :class:`mcp.mission.predicate.PredicateRejected` so callers

231 can render uniform structured errors regardless of which sandbox

232 layer rejected the input. ``reason`` is a short stable token (e.g.

233 ``"forbidden_node"``, ``"shadows_protected_name"``) suitable for

234 machine-readable error envelopes; ``failing_node`` is the

235 :class:`ast.AST` that triggered rejection (``None`` only when the

236 source failed to parse at all).

237 """

238

239 def __init__(

240 self,

241 reason: str,

242 *,

243 failing_node: ast.AST | None = None,

244 message: str | None = None,

245 ) -> None:

246 self.reason: str = reason

247 self.failing_node: ast.AST | None = failing_node

248 self.lineno: int | None = (

249 getattr(failing_node, "lineno", None) if failing_node is not None else None

250 )

251 self.col_offset: int | None = (

252 getattr(failing_node, "col_offset", None) if failing_node is not None else None

253 )

254 rendered = message if message is not None else reason

255 if self.lineno is not None:

256 rendered = f"{rendered} (line {self.lineno}, col {self.col_offset})"

257 super().__init__(rendered)
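A caller might turn the structured attributes into a machine-readable error envelope along the following lines. This is a sketch under stated assumptions: ``SketchRejected`` is a stand-in that copies only the attribute layout of the class above, and ``to_envelope`` and its key names are hypothetical.

```python
import ast


class SketchRejected(Exception):
    """Hypothetical stand-in mirroring reason / lineno / col_offset."""

    def __init__(self, reason, *, failing_node=None):
        self.reason = reason
        # getattr with a default also covers failing_node=None.
        self.lineno = getattr(failing_node, "lineno", None)
        self.col_offset = getattr(failing_node, "col_offset", None)
        super().__init__(reason)


def to_envelope(exc):
    """Flatten the exception into a plain dict for a structured error response."""
    return {"reason": exc.reason, "line": exc.lineno, "col": exc.col_offset}


tree = ast.parse("x = 1\nimport os\n")
import_node = tree.body[1]  # the Import statement on line 2
envelope = to_envelope(SketchRejected("forbidden_node", failing_node=import_node))
```

Because the position is read from the failing node, a source that fails to parse at all (no node) yields ``None`` for both ``line`` and ``col``.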

258

259

260# ---------------------------------------------------------------------------

261# Validator

262# ---------------------------------------------------------------------------
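The validator in this section keeps script-bound names in a stack of immutable frames, where each pushed frame starts as a copy of its parent. A minimal sketch of that accumulated-frame model, using a hypothetical ``ScopeStack`` class rather than the validator's own helpers:

```python
class ScopeStack:
    """Stack of frozensets; each frame accumulates its parent's names."""

    def __init__(self):
        self._frames = [frozenset()]

    def push(self, names=frozenset()):
        # The new frame inherits everything visible in the parent.
        self._frames.append(self._frames[-1] | frozenset(names))

    def pop(self):
        self._frames.pop()

    def bind(self, name):
        # Replace the top frame instead of mutating it, so parent
        # frames never observe names bound in a child.
        self._frames[-1] = self._frames[-1] | {name}

    def visible(self, name):
        return name in self._frames[-1]


stack = ScopeStack()
stack.bind("helper")   # module-level binding
stack.push()           # enter a function body
stack.bind("i")        # parameter or local inside the function
inside = (stack.visible("helper"), stack.visible("i"))
stack.pop()            # leave the function body
outside = (stack.visible("helper"), stack.visible("i"))
```

Inside the pushed frame both names resolve; after the pop only ``helper`` does, which is the no-leak property the class docstring below describes for a parameter ``i``.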

263

264

265class _ScriptValidator(ast.NodeVisitor):

266 """Walk a script AST and reject any construct outside the allowlist.

267

268 The validator tracks two things across the walk:

269

270 * **The base scope** — the union of the operator-supplied tool

271 allowlist, the safe builtins, the allowed exception names, and

272 the ``mission`` namespace. These names are *protected*: a script

273 may read them but may not bind, rebind, or shadow them with a

274 local of any kind (assignment, walrus, function parameter,

275 function name, comprehension target, ``for`` target,

276 ``except as`` name). Protecting them keeps the security model

277 one-line-tall: if you see a Name in the source whose ``id`` is

278 ``submit_job_sqs``, you can be sure it resolves to the registered

279 tool.

280 * **A scope stack** — entries onto the stack carry the names a

281 script has bound at module level plus the names introduced by

282 function parameters, comprehension targets, ``for`` loops, and

283 ``except as`` clauses. The stack is what makes a helper function

284 that defines a parameter ``i`` validate cleanly without ``i``

285 leaking into the module-level scope.

286 """

287

288 def __init__(self, allowlist: Iterable[str]) -> None:

289 # Order does not matter; keep as a frozenset for fast membership.

290 self._tool_allowlist: frozenset[str] = frozenset(allowlist)

291

292 # Names that are visible from the start of the script and that

293 # script-introduced bindings may NOT shadow. The mission

294 # namespace counts as protected: rebinding it would defeat the

295 # one-allowed-attribute-base rule in :meth:`visit_Attribute`.

296 # The forbidden call targets (``eval``, ``exec``, ``compile``,

297 # ``__import__``) are folded into the protected set so that a

298 # script trying to shadow them — ``(eval := 1)``, ``def exec():

299 # ...``, ``for compile in xs:`` — is rejected at the binding

300 # site with ``shadows_protected_name``, in addition to the

301 # call-site rejection in :meth:`visit_Call`. Two layers of

302 # defense for the same risk: a reader does not have to chase

303 # every later use to know whether the shadow is harmful.

304 self._base_scope: frozenset[str] = (

305 self._tool_allowlist

306 | _SAFE_BUILTINS

307 | _ALLOWED_EXCEPTION_NAMES

308 | _FORBIDDEN_CALL_TARGETS

309 | {_MISSION_NAMESPACE_NAME}

310 )

311

312 # Stack of frozensets of script-bound names (function params,

313 # for-loop targets, comprehension targets, assignment targets,

314 # function definitions). The base frame is empty; each scope

315 # push appends a new frame whose contents accumulate from the

316 # parent frame so a nested lookup can see outer locals.

317 self._scopes: list[frozenset[str]] = [frozenset()]

318

319 # ---- helpers -------------------------------------------------------

320

321 def _current_locals(self) -> frozenset[str]:

322 return self._scopes[-1]

323

324 def _name_is_visible(self, name: str) -> bool:

325 return name in self._base_scope or name in self._current_locals()

326

327 @staticmethod

328 def _is_dunder(name: str) -> bool:

329 return name.startswith("__")

330

331 @staticmethod

332 def _reject(reason: str, node: ast.AST, message: str | None = None) -> NoReturn:

333 raise ScriptRejected(reason, failing_node=node, message=message)

334

335 def _push_scope(self, locals_: frozenset[str]) -> None:

336 self._scopes.append(self._current_locals() | locals_)

337

338 def _pop_scope(self) -> None:

339 self._scopes.pop()

340

341 def _bind_local(self, name: str, node: ast.AST) -> None:

342 """Add ``name`` to the current frame, rejecting protected shadows.

343

344 Used by every binding form (assignment, walrus, function name,

345 function parameter, ``for`` target, comprehension target,

346 ``except as`` name). The shadow check is what prevents a

347 script from rebinding ``submit_job_sqs`` or ``mission`` and

348 thereby sneaking past later name-based validation.

349 """

350 if self._is_dunder(name):

351 self._reject(

352 "dunder_binding",

353 node,

354 f"binding to '{name}' is not allowed (starts with '__')",

355 )

356 if name in self._base_scope:

357 self._reject(

358 "shadows_protected_name",

359 node,

360 f"binding to '{name}' shadows a protected name",

361 )

362 # The accumulated-frame model means we replace the top frame

363 # rather than mutate it in place: every ``_push_scope`` already

364 # captured the parent, and append-adds at the leaf are local to

365 # this frame.

366 self._scopes[-1] = self._scopes[-1] | {name}

367

368 def _collect_target_names(self, target: ast.AST) -> list[ast.Name]:

369 """Flatten an assignment / for / comprehension target.

370

371 Tuples and lists nest (``for (a, b) in pairs``). ``Starred``

372 wraps (``a, *rest = xs``). Anything else under a target —

373 ``Subscript``, ``Attribute`` — would be a write into a

374 non-local namespace and is rejected by the caller via the

375 ``invalid_target`` reason.

376 """

377 if isinstance(target, ast.Name):

378 return [target]

379 if isinstance(target, (ast.Tuple, ast.List)):

380 collected: list[ast.Name] = []

381 for elt in target.elts:

382 collected.extend(self._collect_target_names(elt))

383 return collected

384 if isinstance(target, ast.Starred):

385 return self._collect_target_names(target.value)

386 self._reject(

387 "invalid_target",

388 target,

389 "assignment / loop target must be a plain identifier",

390 )

391 return [] # unreachable; _reject raises

392

393 def _bind_targets(self, target: ast.AST) -> None:

394 for name_node in self._collect_target_names(target):

395 self._bind_local(name_node.id, name_node)

396

397 # ---- top-level entry ----------------------------------------------

398

399 def visit_Module(self, node: ast.Module) -> None:

400 # ``ast.parse(..., mode="exec")`` produces a Module whose body

401 # is a list of statements. Walk each in order so any forward

402 # binding (e.g. a function definition followed by a call)

403 # validates with the binding visible in the same module scope.

404 for stmt in node.body:

405 self.visit(stmt)

406

407 # ---- catch-all -----------------------------------------------------

408

409 def generic_visit(self, node: ast.AST) -> None:

410 # Default rejection: the validator opts in to every supported

411 # node via a dedicated ``visit_*`` method. Anything reaching

412 # ``generic_visit`` is something the operator wrote that the

413 # script surface deliberately does not support — ``Import``,

414 # ``ClassDef``, ``Global``, ``Nonlocal``, ``Match``, ``With``,

415 # ``Assert``, ``Delete``, ``Yield``, ``AsyncFunctionDef`` /

416 # ``AsyncFor`` / ``AsyncWith`` (``Await`` is handled by its

417 # own narrow visitor), etc.

418 self._reject(

419 "forbidden_node",

420 node,

421 f"{type(node).__name__} is not allowed in a script",

422 )

423

424 # ---- statements ----------------------------------------------------

425

426 def visit_Expr(self, node: ast.Expr) -> None:

427 self.visit(node.value)

428

429 def visit_Pass(self, node: ast.Pass) -> None:

430 # No children; the visitor still has to opt in to keep

431 # generic_visit from rejecting it.

432 pass

433

434 def visit_Break(self, node: ast.Break) -> None:

435 pass

436

437 def visit_Continue(self, node: ast.Continue) -> None:

438 pass

439

440 def visit_Assign(self, node: ast.Assign) -> None:

441 # Validate the RHS *first* under the current scope, then bind

442 # the LHS targets. This ordering matters for ``x = x + 1``: the

443 # right-hand ``x`` must already exist as a local; if it does

444 # not, the ``visit_Name`` lookup fails. Conversely, ``x = 1``

445 # introduces ``x`` only after the literal validates.

446 self.visit(node.value)

447 for target in node.targets:

448 self._bind_targets(target)

449

450 def visit_AugAssign(self, node: ast.AugAssign) -> None:

451 if not isinstance(node.op, _ALLOWED_BIN_OPS):  # coverage: 451 ↛ 452, line 451 didn't jump to line 452 because the condition on line 451 was never true

452 self._reject(

453 "binop_not_allowed",

454 node,

455 f"augmented operator {type(node.op).__name__} is not allowed",

456 )

457 # ``x += 1`` reads ``x`` then writes ``x``. The target Name

458 # must be visible already (no defining via aug-assign), and

459 # the target itself must not be a protected name. We re-use

460 # ``_bind_local`` for the shadow check; if ``x`` is already

461 # local the bind is a no-op.

462 if not isinstance(node.target, ast.Name):

463 self._reject(

464 "invalid_target",

465 node.target,

466 "augmented assignment target must be a plain identifier",

467 )

468 # Read-side check: target must already be in scope.

469 self.visit(node.target)

470 self.visit(node.value)

471 # Bind defensively — protects against aug-assign on a

472 # protected name even though the read-side visit above would

473 # already accept it (protected names ARE visible). The

474 # shadow check fires here.

475 self._bind_local(node.target.id, node.target)

476

477 def visit_AnnAssign(self, node: ast.AnnAssign) -> None:

478 # ``x: int = 1`` and ``x: int`` are accepted; ``obj.attr: int``

479 # is not (target must be a plain identifier).

480 if node.value is not None:  # coverage: 480 ↛ 482, line 480 didn't jump to line 482 because the condition on line 480 was always true

481 self.visit(node.value)

482 if node.annotation is not None:  # coverage: 482 ↛ 484, line 482 didn't jump to line 484 because the condition on line 482 was always true

483 self.visit(node.annotation)

484 if not isinstance(node.target, ast.Name):

485 self._reject(

486 "invalid_target",

487 node.target,

488 "annotated assignment target must be a plain identifier",

489 )

490 self._bind_local(node.target.id, node.target)

491

492 def visit_If(self, node: ast.If) -> None:

493 self.visit(node.test)

494 for stmt in node.body:

495 self.visit(stmt)

496 for stmt in node.orelse:  # coverage: 496 ↛ 497, line 496 didn't jump to line 497 because the loop on line 496 never started

497 self.visit(stmt)

498

499 def visit_While(self, node: ast.While) -> None:

500 self.visit(node.test)

501 for stmt in node.body:

502 self.visit(stmt)

503 for stmt in node.orelse:

504 self.visit(stmt)

505

506 def visit_For(self, node: ast.For) -> None:

507 # Validate the iterable in the *outer* scope, then bind the

508 # loop targets in the same scope as the body. ``for x in xs:``

509 # leaks ``x`` after the loop, matching Python semantics.

510 self.visit(node.iter)

511 self._bind_targets(node.target)

512 for stmt in node.body:

513 self.visit(stmt)

514 for stmt in node.orelse:

515 self.visit(stmt)

516

517 def visit_Return(self, node: ast.Return) -> None:

518 if node.value is not None:  # coverage: 518 ↛ exit, line 518 didn't return from function 'visit_Return' because the condition on line 518 was always true

519 self.visit(node.value)

520

521 def visit_Raise(self, node: ast.Raise) -> None:

522 if node.exc is not None:  # coverage: 522 ↛ 524, line 522 didn't jump to line 524 because the condition on line 522 was always true

523 self.visit(node.exc)

524 if node.cause is not None:  # coverage: 524 ↛ 525, line 524 didn't jump to line 525 because the condition on line 524 was never true

525 self.visit(node.cause)

526

527 def visit_Try(self, node: ast.Try) -> None:

528 # Body of the try block runs in the current scope.

529 for stmt in node.body:

530 self.visit(stmt)

531 for handler in node.handlers:

532 # Bare ``except:`` is rejected — operators must name the

533 # exception class so an unrelated bug is not silently

534 # swallowed by the same handler that catches a tool

535 # timeout.

536 if handler.type is None:

537 self._reject(

538 "bare_except",

539 handler,

540 "bare 'except:' is not allowed; name the exception class",

541 )

542 self.visit(handler.type)

543 # ``except Exc as name:`` introduces ``name`` only inside

544 # the handler block, mirroring Python semantics. Push a

545 # new scope so the binding does not leak to siblings.

546 self._push_scope(frozenset())

547 try:

548 if handler.name is not None:

549 # ``handler`` is the canonical AST node for the

550 # binding location; reuse it as the failing-node

551 # context for shadow rejections.

552 self._bind_local(handler.name, handler)

553 for stmt in handler.body:

554 self.visit(stmt)

555 finally:

556 self._pop_scope()

557 for stmt in node.orelse:

558 self.visit(stmt)

559 for stmt in node.finalbody:

560 self.visit(stmt)

561

562 def visit_FunctionDef(self, node: ast.FunctionDef) -> None:

563 # Decorators are gated by a dedicated allowlist so the security

564 # surface stays small. The list is currently empty.

565 for deco in node.decorator_list:

566 if not (isinstance(deco, ast.Name) and deco.id in _ALLOWED_DECORATORS):  # coverage: 566 ↛ 565, line 566 didn't jump to line 565 because the condition on line 566 was always true

567 self._reject(

568 "decorator_not_allowed",

569 deco,

570 "decorators are not allowed on script functions",

571 )

572 # Bind the function name in the *current* scope so the rest of

573 # the module can call it. The body opens a new scope under

574 # which arguments live.

575 self._bind_local(node.name, node)

576 self._validate_function_signature_and_body(node.args, node.body, node)

577

578 def _validate_function_signature_and_body(

579 self,

580 args: ast.arguments,

581 body: list[ast.stmt],

582 owner: ast.AST,

583 ) -> None:

584 # Default expressions are allowed, but they validate under

585 # the *outer* scope, not the function's own frame

586 # (Python evaluates them once at def time, not per call).

587 for default in args.defaults:

588 self.visit(default)

589 for kw_default in args.kw_defaults:

590 if kw_default is not None:

591 self.visit(kw_default)

592

593 # Collect parameter names. Reject duplicates and protected

594 # shadows up front so the body sees a coherent local frame.

595 param_names: list[tuple[str, ast.AST]] = []

596

597 def _collect_arg(arg: ast.arg) -> None:

598 param_names.append((arg.arg, arg))

599 if arg.annotation is not None:  # coverage: 599 ↛ 600, line 599 didn't jump to line 600 because the condition on line 599 was never true

600 self.visit(arg.annotation)

601

602 for arg in args.posonlyargs:  # coverage: 602 ↛ 603, line 602 didn't jump to line 603 because the loop on line 602 never started

603 _collect_arg(arg)

604 for arg in args.args:

605 _collect_arg(arg)

606 if args.vararg is not None:

607 _collect_arg(args.vararg)

608 for arg in args.kwonlyargs:

609 _collect_arg(arg)

610 if args.kwarg is not None:

611 _collect_arg(args.kwarg)

612

613 # Push a fresh frame; bindings inside the function do not

614 # leak to the module-level scope.

615 self._push_scope(frozenset())

616 try:

617 seen: set[str] = set()

618 for name, owning_node in param_names:

619 if name in seen:

620 self._reject(

621 "duplicate_parameter",

622 owning_node,

623 f"duplicate parameter '{name}'",

624 )

625 seen.add(name)

626 self._bind_local(name, owning_node)

627 for stmt in body:

628 self.visit(stmt)

629 finally:

630 self._pop_scope()

631

632 # ---- expressions ---------------------------------------------------

633

634 def visit_Constant(self, node: ast.Constant) -> None:

635 # Reject dunder strings even when used as plain data. The same

636 # rationale as in the predicate sandbox: a string like

637 # ``"__class__"`` only ever appears in source code as part of

638 # an introspection escape pattern (``getattr(x, "__class__")``,

639 # ``locals()["__import__"]``). Forbidding them at the constant

640 # level closes those off even if a future change widened the

641 # call or attribute allowlist.

642 if isinstance(node.value, str) and self._is_dunder(node.value):

643 self._reject(

644 "dunder_string",

645 node,

646 "string constants starting with '__' are not allowed",

647 )

648

649 def visit_Name(self, node: ast.Name) -> None:

650 if self._is_dunder(node.id):  # coverage: 650 ↛ 651, line 650 didn't jump to line 651 because the condition on line 650 was never true

651 self._reject(

652 "dunder_name",

653 node,

654 f"identifier '{node.id}' starts with '__'",

655 )

656 if not self._name_is_visible(node.id):

657 self._reject(

658 "name_not_allowed",

659 node,

660 f"name '{node.id}' is not in the script allowlist",

661 )

662

663 def visit_NamedExpr(self, node: ast.NamedExpr) -> None:

664 # ``(x := expr)`` — the walrus binds ``x`` in the enclosing

665 # scope. Validate the value first, then route through the

666 # standard binding helper so the protected-name shadow check

667 # fires for ``(mission := ...)`` etc.

668 self.visit(node.value)

669 if not isinstance(node.target, ast.Name):  # coverage: 669 ↛ 670, line 669 didn't jump to line 670 because the condition on line 669 was never true

670 self._reject(

671 "invalid_target",

672 node.target,

673 "walrus target must be a plain identifier",

674 )

675 self._bind_local(node.target.id, node.target)

676

677 def visit_Lambda(self, node: ast.Lambda) -> None:

678 # Lambdas are scoped expressions: validate parameters + body

679 # under a fresh frame, exactly like a ``FunctionDef`` minus

680 # the decorator list and statement body. The lambda itself

681 # produces no binding in the enclosing scope.

682 self._validate_function_signature_and_body(node.args, [ast.Expr(value=node.body)], node)

683

684 # ---- containers ----------------------------------------------------

685

686 def visit_List(self, node: ast.List) -> None:

687 for elt in node.elts:

688 self.visit(elt)

689

690 def visit_Tuple(self, node: ast.Tuple) -> None:

691 for elt in node.elts:

692 self.visit(elt)

693

694 def visit_Set(self, node: ast.Set) -> None:

695 for elt in node.elts:

696 self.visit(elt)

697

698 def visit_Dict(self, node: ast.Dict) -> None:

699 for key in node.keys:

700 if key is not None:

701 self.visit(key)

702 else:

703 # ``{**other}`` would let a script splat arbitrary

704 # mappings into a dict literal; reject for the same

705 # reason as in the predicate sandbox.

706 self._reject(

707 "dict_unpacking",

708 node,

709 "dict unpacking is not allowed in a script",

710 )

711 for value in node.values:

712 self.visit(value)

713

714 def visit_Starred(self, node: ast.Starred) -> None:

715 # ``[*xs]``, ``f(*xs)``, ``a, *rest = xs`` — recurse into the

716 # inner expression so the nested Name still hits the

717 # allowlist check.

718 self.visit(node.value)

719

720 # ---- operators -----------------------------------------------------

721

722 def visit_BinOp(self, node: ast.BinOp) -> None:

723 if not isinstance(node.op, _ALLOWED_BIN_OPS):  # coverage: 723 ↛ 724, line 723 didn't jump to line 724 because the condition on line 723 was never true

724 self._reject(

725 "binop_not_allowed",

726 node,

727 f"binary operator {type(node.op).__name__} is not allowed",

728 )

729 self.visit(node.left)

730 self.visit(node.right)

731

732 def visit_UnaryOp(self, node: ast.UnaryOp) -> None:

733 if not isinstance(node.op, _ALLOWED_UNARY_OPS):

734 self._reject(

735 "unaryop_not_allowed",

736 node,

737 f"unary operator {type(node.op).__name__} is not allowed",

738 )

739 self.visit(node.operand)

740

741 def visit_BoolOp(self, node: ast.BoolOp) -> None:

742 if not isinstance(node.op, _ALLOWED_BOOL_OPS):

743 self._reject(

744 "boolop_not_allowed",

745 node,

746 f"bool operator {type(node.op).__name__} is not allowed",

747 )

748 for value in node.values:

749 self.visit(value)

750

751 def visit_Compare(self, node: ast.Compare) -> None:

752 for op in node.ops:

753 if not isinstance(op, _ALLOWED_COMPARE_OPS):  # coverage: 753 ↛ 754, line 753 didn't jump to line 754 because the condition on line 753 was never true

754 self._reject(

755 "compareop_not_allowed",

756 node,

757 f"comparison operator {type(op).__name__} is not allowed",

758 )

759 self.visit(node.left)

760 for comparator in node.comparators:

761 self.visit(comparator)

762

763 def visit_IfExp(self, node: ast.IfExp) -> None:

764 self.visit(node.test)

765 self.visit(node.body)

766 self.visit(node.orelse)

767

768 # ---- attribute and subscript --------------------------------------

769

770 def visit_Attribute(self, node: ast.Attribute) -> None:

771 # The script surface allows attribute access on exactly one

772 # name — the ``mission`` namespace — and only for the two

773 # helper attributes ``observe`` and ``event``. Every other

774 # ``foo.bar`` read raises ``ScriptRejected``: tool results are

775 # opaque values, not deep object graphs, so a script that

776 # needs nested data should use subscripting on a return value.

777 if self._is_dunder(node.attr):

778 self._reject(

779 "dunder_attribute",

780 node,

781 f"attribute '{node.attr}' starts with '__'",

782 )

783 if not isinstance(node.value, ast.Name):

784 self._reject(

785 "attribute_target_not_name",

786 node,

787 "attribute access is only allowed on the 'mission' namespace",

788 )

789 if node.value.id != _MISSION_NAMESPACE_NAME:

790 self._reject(

791 "attribute_target_not_allowed",

792 node,

793 "attribute access is only allowed on the 'mission' namespace",

794 )

795 if node.attr not in _MISSION_HELPER_ATTRIBUTES:

796 self._reject(

797 "attribute_not_allowed",

798 node,

799 f"'mission.{node.attr}' is not an allowed helper",

800 )

801 # ``mission`` itself is a base-scope name; visit it for

802 # regularity so any future Name-side check still fires here.

803 self.visit(node.value)

804

805 def visit_Subscript(self, node: ast.Subscript) -> None:

806 # Recurse into both the value and the slice. The base of the

807 # chain falls out as a ``Name`` lookup that hits the

808 # allowlist; slices may themselves contain Names and Calls

809 # that go through the same validation path.

810 self.visit(node.value)

811 self.visit(node.slice)

812

813 def visit_Slice(self, node: ast.Slice) -> None:

814 if node.lower is not None:

815 self.visit(node.lower)

816 if node.upper is not None:

817 self.visit(node.upper)

818 if node.step is not None:

819 self.visit(node.step)

820

821 # ---- calls ---------------------------------------------------------

822

823 def visit_Call(self, node: ast.Call) -> None:

824 # The callee form decides which rule applies. Two shapes are

825 # allowed:

826 #

827 # * ``name(...)`` — bare name call. The name must already be

828 # visible (base-scope or script-bound local).

829 # * ``mission.observe(...)`` / ``mission.event(...)`` — the

830 # only attribute-call shape supported.

831 #

832 # ``foo()()`` (call returning a callable, then call), ``a[0]()``

833 # (subscript-then-call), and ``x.y()`` for any ``y`` not on the

834 # mission helper list are all rejected outright.

835 func = node.func

836 if isinstance(func, ast.Name):

837 # ``__import__``, ``exec``, ``eval``, ``compile`` are

838 # rejected by name even if a script defined a local with

839 # one of those names. The dunder filter in

840 # :meth:`visit_Name` already rejects ``__import__`` for

841 # plain reads; the explicit list is what blocks the

842 # ``def exec(): ...; exec()`` shadow attempt.

843 if func.id in _FORBIDDEN_CALL_TARGETS:

844 self._reject(

845 "forbidden_call_target",

846 node,

847 f"call to '{func.id}' is not allowed",

848 )

849 # Visit the Name so the visibility / dunder check fires.

850 self.visit(func)

851 elif isinstance(func, ast.Attribute):

852 # Only ``mission.observe`` / ``mission.event``. The

853 # attribute visit raises with a structured reason for

854 # every other shape (non-Name base, non-mission base,

855 # disallowed attribute), so we just recurse here.

856 self.visit(func)

857 else:

858 # ``f()()``, ``xs[0]()``, ``(lambda: ...)()`` — the

859 # callee is neither a Name nor a single ``mission.<x>``

860 # attribute access. Reject without descending; the

861 # blanket ``call_target_shape`` reason captures all three.

862 self._reject(

863 "call_target_shape",

864 node,

865 "script calls must target a bare name or 'mission.<helper>'",

866 )

867 for arg in node.args:

868 self.visit(arg)

869 for kw in node.keywords:

870 # ``**kwargs`` shows up as a keyword with arg=None; allow

871 # the value but recurse so its content is still validated

872 # against the same name and call rules.

873 self.visit(kw.value)

874

875 def visit_Await(self, node: ast.Await) -> None:

876 # The runtime layer (:class:`MissionSandbox`) exposes every

877 # allowlisted tool through Monty's ``external_functions``

878 # channel, where a registered async callable surfaces inside

879 # the script as a coroutine factory. Calling

880 # ``find_examples(query="gpu")`` from inside a script returns

881 # a coroutine object, not the dispatcher's return value;

882 # consuming the value requires writing ``await

883 # find_examples(query="gpu")``. The two ``mission`` helpers

884 # ride the same channel — the runtime layer prepends a small

885 # source-level shim that makes ``mission.observe`` /

886 # ``mission.event`` route into host-side closures via the same

887 # coroutine-factory channel, so awaiting them is required for

888 # the side effect (an observation row, an event row) to land

889 # on the iteration's audit log. The validator therefore opens

890 # ``Await`` for exactly two shapes:

891 #

892 # * ``await <name>(...)`` where ``<name>`` is on the per-

893 # session tool allowlist.

894 # * ``await mission.observe(...)`` / ``await mission.event(...)``

895 # — attribute calls on the ``mission`` namespace whose

896 # attribute is one of the two helper names that

897 # :meth:`visit_Attribute` already accepts.

898 #

899 # Both forms route the wrapped Call back through

900 # :meth:`visit_Call` so kwargs, positional args, and the

901 # forbidden-call-target rules apply unchanged.

902 #

903 # Rejected (folded into ``await_not_allowed``):

904 #

905 # * ``await x`` — bare name (no Call inside).

906 # * ``await some_other_tool()`` — call on a Name that is not

907 # on the per-session tool allowlist (a safe builtin, an

908 # exception class, ``mission`` itself, a script-bound local,

909 # or simply unknown).

910 # * ``await mission.foo(...)`` for any ``foo`` outside the

911 # helper set — :meth:`visit_Attribute` would already reject

912 # the inner call, but the early reject here keeps the reason

913 # token stable as ``await_not_allowed``.

914 # * ``await x.observe(...)`` for any ``x`` other than

915 # ``mission`` — same rationale.

916 # * ``await (lambda: ...)()`` / ``await xs[0]()`` —

917 # subscript-then-call / call-of-call shapes; the underlying

918 # Call would already fail :meth:`visit_Call`'s

919 # ``call_target_shape`` check, but reject at the await

920 # level too so the reason token stays ``await_not_allowed``.

921 #

922 # ``AsyncFunctionDef`` / ``AsyncFor`` / ``AsyncWith`` continue

923 # to fall through to :meth:`generic_visit` and stay rejected

924 # with ``forbidden_node`` — the relaxation here covers only

925 # the bare ``Await`` expression on the two accepted call

926 # shapes.

927 inner = node.value

928 if not isinstance(inner, ast.Call):

929 self._reject(

930 "await_not_allowed",

931 node,

932 "'await' may only be used on a call to an allowlisted "

933 "tool or a 'mission.<helper>' call",

934 )

935 func = inner.func

936 if isinstance(func, ast.Name):

937 if func.id not in self._tool_allowlist:  # partial branch 937 ↛ 938: condition was never true

938 self._reject(

939 "await_not_allowed",

940 node,

941 "'await' may only be used on a call to an allowlisted "

942 "tool or a 'mission.<helper>' call",

943 )

944 elif isinstance(func, ast.Attribute):  # partial branch 944 ↛ 958: condition was always true

945 # Only ``mission.observe(...)`` / ``mission.event(...)``.

946 if not (  # partial branch 946 ↛ 951: condition was never true

947 isinstance(func.value, ast.Name)

948 and func.value.id == _MISSION_NAMESPACE_NAME

949 and func.attr in _MISSION_HELPER_ATTRIBUTES

950 ):

951 self._reject(

952 "await_not_allowed",

953 node,

954 "'await' may only be used on a call to an allowlisted "

955 "tool or a 'mission.<helper>' call",

956 )

957 else:

958 self._reject(

959 "await_not_allowed",

960 node,

961 "'await' may only be used on a call to an allowlisted "

962 "tool or a 'mission.<helper>' call",

963 )

964 # Hand the Call node back to the existing call-validation

965 # machinery so kwargs, positional args, and the

966 # forbidden-call-target check all fire exactly as they would

967 # for the non-awaited form.

968 self.visit(inner)
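# Illustrative sketch, not part of this module: the accepted ``await``
# shapes described above reduce to a few ``isinstance`` tests. The names
# ``await_shape_allowed`` and ``SKETCH_HELPERS`` are hypothetical, and the
# sketch leaves out the follow-up pass through ``visit_Call`` that the real
# validator performs on the inner Call.

```python
import ast

# Assumed to mirror the two helper attributes named in the comments above.
SKETCH_HELPERS = frozenset({"observe", "event"})


def await_shape_allowed(source: str, tool_allowlist: set[str]) -> bool:
    """Return True when ``source`` is a single accepted ``await`` expression."""
    # ast.parse accepts a top-level ``await``; only compile() would object.
    stmt = ast.parse(source, mode="exec").body[0]
    if not isinstance(stmt, ast.Expr) or not isinstance(stmt.value, ast.Await):
        return False
    inner = stmt.value.value
    if not isinstance(inner, ast.Call):
        return False  # e.g. ``await x``
    func = inner.func
    if isinstance(func, ast.Name):
        # Bare-name call: only per-session tools may be awaited.
        return func.id in tool_allowlist
    if isinstance(func, ast.Attribute):
        # Attribute call: only ``mission.observe`` / ``mission.event``.
        return (
            isinstance(func.value, ast.Name)
            and func.value.id == "mission"
            and func.attr in SKETCH_HELPERS
        )
    return False  # ``await xs[0]()``, ``await f()()``, and similar shapes
```

# Every ``False`` path in the sketch corresponds to the single
# ``await_not_allowed`` reason token used by the method above.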

969

970 # ---- f-strings -----------------------------------------------------

971

972 def visit_JoinedStr(self, node: ast.JoinedStr) -> None:

973 for value in node.values:

974 self.visit(value)

975

976 def visit_FormattedValue(self, node: ast.FormattedValue) -> None:

977 self.visit(node.value)

978 if node.format_spec is not None:  # partial branch 978 ↛ 979: condition was never true

979 self.visit(node.format_spec)

980

981 # ---- comprehensions -----------------------------------------------

982

983 def _validate_comprehensions(self, generators: list[ast.comprehension]) -> frozenset[str]:

984 """Walk comprehension generators and return their target names.

985

986 Each generator's ``iter`` is validated against the *outer*

987 scope (it cannot reference targets of its own generator), then

988 the targets are added to the local set so the next generator's

989 ``ifs`` and any later ``iter`` can see them. Async generators

990 (``async for``) are rejected; the script body is sync.

991 """

992 accumulated: set[str] = set()

993 for gen in generators:

994 if gen.is_async:  # partial branch 994 ↛ 995: condition was never true

995 self._reject(

996 "async_comprehension",

997 gen.iter,

998 "async comprehensions are not allowed",

999 )

1000 self.visit(gen.iter)

1001 target_names = self._collect_target_names(gen.target)

1002 for name_node in target_names:

1003 if self._is_dunder(name_node.id):

1004 self._reject(

1005 "dunder_comprehension_target",

1006 name_node,

1007 f"comprehension target '{name_node.id}' starts with '__'",

1008 )

1009 if name_node.id in self._base_scope:

1010 self._reject(

1011 "shadows_protected_name",

1012 name_node,

1013 f"comprehension target '{name_node.id}' shadows a protected name",

1014 )

1015 accumulated.add(name_node.id)

1016 self._push_scope(frozenset(accumulated))

1017 try:

1018 for if_clause in gen.ifs:

1019 self.visit(if_clause)

1020 finally:

1021 self._pop_scope()

1022 return frozenset(accumulated)

1023

1024 def _visit_comprehension_like(

1025 self,

1026 node: ast.ListComp | ast.SetComp | ast.GeneratorExp,

1027 ) -> None:

1028 locals_ = self._validate_comprehensions(node.generators)

1029 self._push_scope(locals_)

1030 try:

1031 self.visit(node.elt)

1032 finally:

1033 self._pop_scope()

1034

1035 def visit_ListComp(self, node: ast.ListComp) -> None:

1036 self._visit_comprehension_like(node)

1037

1038 def visit_SetComp(self, node: ast.SetComp) -> None:

1039 self._visit_comprehension_like(node)

1040

1041 def visit_GeneratorExp(self, node: ast.GeneratorExp) -> None:

1042 self._visit_comprehension_like(node)

1043

1044 def visit_DictComp(self, node: ast.DictComp) -> None:

1045 locals_ = self._validate_comprehensions(node.generators)

1046 self._push_scope(locals_)

1047 try:

1048 self.visit(node.key)

1049 self.visit(node.value)

1050 finally:

1051 self._pop_scope()

1052

1053

1054# ---------------------------------------------------------------------------

1055# Public API

1056# ---------------------------------------------------------------------------

1057

1058

1059def validate_script_ast(script: str, allowlist: list[str]) -> None:

1060 """Parse and validate a Mission script source string.

1061

1062 On success, the function returns ``None`` and the caller may pass

1063 ``script`` to the sandbox runtime layer. On any disallowed

1064 construct, raises :class:`ScriptRejected` carrying ``reason``,

1065 ``failing_node``, ``lineno``, and ``col_offset``. The script is

1066 *never* executed by this function; it only walks the AST.

1067

1068 ``allowlist`` is the per-session list of MCP tool names the script

1069 may call. Each name becomes a visible bare-Name and a permitted

1070 call target. Names not in the allowlist (and not in the safe

1071 builtin / exception / mission set) are rejected at every Name

1072 lookup.

1073 """

1074 if not isinstance(script, str):  # partial branch 1074 ↛ 1075: condition was never true

1075 raise ScriptRejected(

1076 "not_a_string",

1077 message="script source must be a str",

1078 )

1079 try:

1080 parsed = ast.parse(script, mode="exec")

1081 except SyntaxError as exc:

1082 rejection = ScriptRejected(

1083 "syntax_error",

1084 message=f"could not parse script: {exc.msg}",

1085 )

1086 rejection.lineno = exc.lineno

1087 rejection.col_offset = exc.offset

1088 raise rejection from exc

1089 _ScriptValidator(allowlist).visit(parsed)
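# Illustrative sketch, not part of this module: the allowlist-visitor
# pattern behind ``validate_script_ast`` in miniature. ``SketchRejected``,
# ``SketchValidator`` and ``sketch_validate`` are hypothetical stand-ins
# that accept only bare-name calls with constant arguments; the real
# ``_ScriptValidator`` accepts a much wider grammar.

```python
import ast


class SketchRejected(Exception):
    """Hypothetical stand-in for ScriptRejected (reason token plus line)."""

    def __init__(self, reason, lineno=None):
        super().__init__(reason)
        self.reason = reason
        self.lineno = lineno


class SketchValidator(ast.NodeVisitor):
    """Define a visit_* for each accepted node; reject all others."""

    def __init__(self, allowlist):
        self._names = set(allowlist)

    def generic_visit(self, node):
        # Any node without an explicit visit_* method lands here.
        raise SketchRejected("forbidden_node", getattr(node, "lineno", None))

    def visit_Module(self, node):
        for stmt in node.body:
            self.visit(stmt)

    def visit_Expr(self, node):
        self.visit(node.value)

    def visit_Constant(self, node):
        return None

    def visit_Name(self, node):
        if node.id not in self._names:
            raise SketchRejected("name_not_allowed", node.lineno)

    def visit_Call(self, node):
        if not isinstance(node.func, ast.Name):
            raise SketchRejected("call_target_shape", node.lineno)
        self.visit(node.func)
        for arg in node.args:
            self.visit(arg)
        for kw in node.keywords:
            self.visit(kw.value)


def sketch_validate(script, allowlist):
    """Parse in exec mode and walk the tree; the script is never executed."""
    try:
        tree = ast.parse(script, mode="exec")
    except SyntaxError as exc:
        raise SketchRejected("syntax_error", exc.lineno) from exc
    SketchValidator(allowlist).visit(tree)
```

# With this sketch, ``sketch_validate('find_examples(query="gpu")',
# ["find_examples"])`` returns ``None``, while ``sketch_validate("import os",
# [])`` raises with ``reason == "forbidden_node"``.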

1090

1091

1092# ===========================================================================

1093# Runtime layer — MissionSandbox wrapper around MontySandboxProvider

1094# ===========================================================================

1095#

1096# Where ``validate_script_ast`` above is the parse-time gate, the wrapper

1097# below is the run-time isolation. A validated script is handed to the

1098# Monty sandbox under shared duration / memory limits, with two extras

1099# layered on top:

1100#

1101# * The operator-supplied tool allowlist is exposed as a set of async

1102# callables in the script's namespace. Each callable forwards into the

1103# engine's tool dispatcher so the existing ``@audit_logged`` /

1104# feature-flag / allowlist semantics still fire — running inside a

1105# script is *not* a way to bypass any of those.

1106# * A ``mission`` namespace object exposes the iteration's read-only

1107# metadata (deep-copied snapshot of the session's directive, criteria,

1108# budget, and prior-iteration summaries) plus the two streaming

1109# helpers ``mission.observe(...)`` / ``mission.event(...)``. The

1110# helpers append into closure-captured lists that ``MissionSandbox.run``

1111# merges into the resulting Observation.

1112#

1113 # On any limit violation (duration, memory) or any runtime / typing /

1114 # syntax error inside the script, the ``MontyError`` family bubbles out of the

1115# provider; the wrapper re-raises it as :class:`SandboxTerminated`

1116# carrying whatever the script collected before it was killed so the

1117# engine's ``_decide_phase`` can produce a deterministic ``terminate``

1118# verdict with the partial observation attached.

1119

1120import copy # noqa: E402 — runtime layer below; keep imports near their consumers

1121import os # noqa: E402

1122import time # noqa: E402

1123from collections.abc import Awaitable, Callable # noqa: E402

1124from datetime import UTC, datetime # noqa: E402

1125from types import MappingProxyType # noqa: E402

1126from typing import Any # noqa: E402

1127

1128from . import audit as _audit # noqa: E402

1129

1130# ---------------------------------------------------------------------------

1131# Env helpers — module-level so the constants below are read once at import

1132# time. Tests pin the constants by monkey-patching the module attributes; a

1133# per-call read of os.environ would defeat that.

1134# ---------------------------------------------------------------------------

1135

1136

1137def _int_env(name: str, default: int) -> int:

1138 """Parse an integer env var; fall back to default on missing/empty/non-numeric.

1139

1140 Mirrors the helper in :mod:`mcp.server` so the two code-mode entry

1141 points read the same caps with the same parsing semantics. Empty,

1142 whitespace-only, and non-numeric values all collapse to ``default``

1143 rather than raising — an operator who fat-fingers the env should

1144 still get a working sandbox.

1145 """

1146 raw = os.environ.get(name, "").strip()

1147 if not raw:

1148 return default

1149 try:

1150 return int(raw)

1151 except ValueError:

1152 return default

1153

1154

1155def _float_env(name: str, default: float) -> float:

1156 """Parse a float env var; fall back to default on missing/empty/non-numeric.

1157

1158 Same fall-back semantics as :func:`_int_env`. The duration cap is a

1159 float so fractional seconds remain expressible.

1160 """

1161 raw = os.environ.get(name, "").strip()

1162 if not raw:

1163 return default

1164 try:

1165 return float(raw)

1166 except ValueError:

1167 return default

1168

1169

1170# Read the resource caps once at import time. Tests pin behaviour by

1171# monkey-patching these module-level constants before constructing a

1172# MissionSandbox. The defaults match the existing precedent in

1173# ``mcp/server.py`` where the same env names are wired into the

1174# Code Mode discovery transform's sandbox.

1175_DURATION_LIMIT_SECS: float = _float_env("GCO_MCP_CODE_MODE_MAX_DURATION_SECS", 30.0)

1176_MEMORY_LIMIT_BYTES: int = _int_env("GCO_MCP_CODE_MODE_MAX_MEMORY", 200_000_000)

1177

1178

1179# ---------------------------------------------------------------------------

1180# Lazy import of the runtime dependencies

1181# ---------------------------------------------------------------------------

1182#

1183# The AST validator above must remain importable on a host where

1184# ``fastmcp`` and ``pydantic_monty`` are not installed (for example a

1185# CLI-only environment that runs ``gco mission validate`` against a

1186# stored session JSON without ever wiring an engine). The provider class

1187# and the error class are pulled in lazily by ``_import_provider`` and

1188# cached at module level so repeated MissionSandbox constructions in the

1189# same process pay the import cost exactly once.

1190

1191_MONTY_PROVIDER_CLASS: Any = None

1192_MONTY_ERROR_CLASS: Any = None

1193

1194

1195def _import_provider() -> tuple[Any, Any]:

1196 """Lazy-import ``MontySandboxProvider`` and ``MontyError`` and cache them.

1197

1198 Returns the ``(provider_cls, error_cls)`` pair. The provider class

1199 is the value the wrapper instantiates with a ``ResourceLimits``

1200 dict; the error class is the *base* ``pydantic_monty.MontyError``

1201 that covers the whole limit / runtime / typing / syntax family

1202 raised from inside a script. We catch the base class rather than

1203 the leaves so a future Monty release that adds a new error type

1204 still routes through ``SandboxTerminated`` rather than escaping as

1205 an opaque ``Exception``.

1206 """

1207 global _MONTY_PROVIDER_CLASS, _MONTY_ERROR_CLASS

1208 if _MONTY_PROVIDER_CLASS is None:

1209 from fastmcp.experimental.transforms.code_mode import MontySandboxProvider

1210 from pydantic_monty import MontyError

1211

1212 _MONTY_PROVIDER_CLASS = MontySandboxProvider

1213 _MONTY_ERROR_CLASS = MontyError

1214 return _MONTY_PROVIDER_CLASS, _MONTY_ERROR_CLASS

1215

1216

1217# ---------------------------------------------------------------------------

1218# Termination signal

1219# ---------------------------------------------------------------------------

1220

1221

1222class SandboxTerminated(Exception):

1223 """Raised when the Monty sandbox kills a script on a limit or in-script error.

1224

1225 The Mission engine catches this exception in its decide-phase and

1226 produces a ``terminate`` verdict for the iteration. Whatever the

1227 script collected via ``mission.observe(...)`` / ``mission.event(...)``

1228 before being killed is carried on the exception so the engine can

1229 surface the partial Observation in the iteration's audit record —

1230 a script that ran for 29 seconds and observed five intermediate

1231 states should not lose those five states just because the 30-second

1232 cap fired before the script returned.

1233

1234 ``cause`` is the underlying Monty exception's class name (e.g.

1235 ``"MontyRuntimeError"``, ``"MontyTypingError"``) so callers can render

1236 a stable structured-error envelope without holding a reference to

1237 the original Monty exception object.

1238 """

1239

1240 def __init__(

1241 self,

1242 cause: str,

1243 *,

1244 partial_observations: list[dict[str, Any]] | None = None,

1245 partial_events: list[dict[str, Any]] | None = None,

1246 partial_script_call_log: list[dict[str, Any]] | None = None,

1247 ) -> None:

1248 self.cause: str = cause

1249 # Defensive copies: callers occasionally inspect these lists

1250 # after the exception has propagated several frames up. A

1251 # shared reference would let a later mutation in the original

1252 # closure corrupt the audit record.

1253 self.partial_observations: list[dict[str, Any]] = list(partial_observations or [])

1254 self.partial_events: list[dict[str, Any]] = list(partial_events or [])

1255 # Partial in-script tool-call log captured by the per-tool

1256 # wrappers up to the moment Monty killed the script. Carrying

1257 # this onto the exception lets the engine's

1258 # ``_execute_script`` stash the partial calls on the iteration

1259 # record so a script that fired ten ``submit_job_sqs(...)``

1260 # calls before tripping the duration cap still records all ten

1261 # in the audit log. Defensive copy for the same reason as the

1262 # observe / event lists above.

1263 self.partial_script_call_log: list[dict[str, Any]] = list(partial_script_call_log or [])

1264 super().__init__(f"sandbox terminated: {cause}")
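# Illustrative sketch, not part of this module: an exception that carries
# partial results with defensive copies, in the spirit of
# ``SandboxTerminated``. ``PartialResultError`` and its single ``partial``
# field are hypothetical simplifications of the three lists above.

```python
class PartialResultError(Exception):
    """Carry whatever was collected before the run was cut short."""

    def __init__(self, cause, *, partial=None):
        self.cause = cause
        # list(...) copies the list itself, so later appends to the
        # caller's list do not show up on the exception.
        self.partial = list(partial or [])
        super().__init__(f"terminated: {cause}")


collected = [{"key": "step", "value": 1}]
try:
    raise PartialResultError("MontyRuntimeError", partial=collected)
except PartialResultError as exc:
    caught = exc

# A later append on the original list is invisible to the exception...
collected.append({"key": "step", "value": 2})
# ...but the copy is shallow: the dicts inside are still shared objects.
collected[0]["value"] = 99
```

# After the block runs, ``caught.partial`` still has one entry, while its
# first dict reflects the in-place change. A ``list(...)`` copy guards
# against appends only, not against mutation of the rows themselves.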

1265

1266

1267# ---------------------------------------------------------------------------

1268# Script rewrite — mission.observe/event → _mission_observe/_mission_event

1269# ---------------------------------------------------------------------------

1270#

1271# The AST gate above accepts ``mission.observe(...)`` and

1272# ``mission.event(...)`` as the only two attribute calls a script may

1273# write on the ``mission`` namespace. The runtime needs those calls to

1274# land on host-side closures so the iteration's ``observe_log`` /

1275# ``event_log`` lists actually receive the appends — passing the

1276# helpers in through ``inputs={"mission": <object>}`` would not work,

1277# because :class:`MontySandboxProvider` round-trips ``inputs`` values

1278# into the Monty VM by value (any in-script mutation lands on the VM

1279# copy, not the host's). Wrapping the helpers in a small host-side

1280# class and prepending it to the script as a preamble would not work

1281# either: Monty's parser does not support ``class`` definitions.

1282#

1283# Instead, after validation, the host re-parses the script and

1284# rewrites every accepted ``mission.<helper>(...)`` Call so its

1285# callee becomes a bare-Name lookup of the corresponding reserved

1286# external-function name. The rewritten source is then handed to

1287# Monty, where ``_mission_observe`` / ``_mission_event`` resolve to

1288# the host-side closures registered via ``external_functions``.

1289# Operator scripts cannot reference these names directly: the AST

1290# validator rejects them under ``name_not_allowed`` (neither is on

1291# the per-session tool allowlist nor in any safe-builtin / exception

1292# / mission base set), so the only path that produces those Name

1293# nodes is the rewrite below.

1294

1295_MISSION_HELPER_RUNTIME_NAMES: Final[dict[str, str]] = {

1296 "observe": "_mission_observe",

1297 "event": "_mission_event",

1298}

1299# The keys must mirror ``_MISSION_HELPER_ATTRIBUTES`` exactly:

1300# the validator opens up ``mission.<attr>`` for those two attributes,

1301# and the rewriter below has to translate the same two and only the

1302# same two. A future widening of the helper set has to add an entry

1303# here too, or the rewriter would leave the new attribute as an

1304# ``Attribute`` callee and Monty's parser would reject it.

1305assert set(_MISSION_HELPER_RUNTIME_NAMES) == set(_MISSION_HELPER_ATTRIBUTES)

1306

1307

1308class _MissionAttributeCallRewriter(ast.NodeTransformer):

1309 """Rewrite ``mission.observe(...)`` / ``mission.event(...)`` callees.

1310

1311 The transformer replaces the ``Attribute`` callee on accepted

1312 ``mission.<helper>`` Call nodes with a ``Name`` referencing the

1313 corresponding external-function key (``_mission_observe`` /

1314 ``_mission_event``). Args and kwargs ride through unchanged: the

1315 AST validator already vetted them, and the rewrite copies the

1316 original callee's location onto the replacement ``Name``. Note that

1317 ``ast.unparse`` regenerates the source, so columns may still shift.

1318

1319 The validator's :meth:`_ScriptValidator.visit_Attribute` already

1320 rejects every other ``mission.<x>`` shape, so the transformer

1321 only ever encounters the two helper attributes; defensive

1322 fallthrough leaves any other ``Attribute`` callee untouched, but

1323 in practice such a node would not have passed the gate.

1324 """

1325

1326 def visit_Call(self, node: ast.Call) -> ast.AST:

1327 # Recurse into args / kwargs first so a nested

1328 # ``mission.<helper>(...)`` (e.g. inside an f-string used as

1329 # an argument) is rewritten too. ``self.generic_visit``

1330 # walks children and updates them in place.

1331 self.generic_visit(node)

1332 func = node.func

1333 if (

1334 isinstance(func, ast.Attribute)

1335 and isinstance(func.value, ast.Name)

1336 and func.value.id == _MISSION_NAMESPACE_NAME

1337 and func.attr in _MISSION_HELPER_RUNTIME_NAMES

1338 ):

1339 replacement = ast.Name(

1340 id=_MISSION_HELPER_RUNTIME_NAMES[func.attr],

1341 ctx=ast.Load(),

1342 )

1343 ast.copy_location(replacement, func)

1344 node.func = replacement

1345 return node

1346

1347

1348def _rewrite_mission_helpers(script: str) -> str:

1349 """Re-parse ``script``, rewrite mission helper calls, and unparse.

1350

1351 Called after :func:`validate_script_ast` has already accepted the

1352 source — so ``ast.parse`` cannot fail here on syntax that was

1353 valid moments ago. Returns a fresh source string suitable for

1354 handing to ``MontySandboxProvider.run``.

1355 """

1356 tree = ast.parse(script, mode="exec")

1357 rewritten = _MissionAttributeCallRewriter().visit(tree)

1358 ast.fix_missing_locations(rewritten)

1359 return ast.unparse(rewritten)
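# Illustrative sketch, not part of this module: the parse / transform /
# unparse round trip on a one-line script. ``SketchRewriter``,
# ``sketch_rewrite`` and ``SKETCH_RUNTIME_NAMES`` are hypothetical names;
# the mapping is assumed to match ``_MISSION_HELPER_RUNTIME_NAMES`` above.
# ``ast.unparse`` needs Python 3.9 or later.

```python
import ast

SKETCH_RUNTIME_NAMES = {"observe": "_mission_observe", "event": "_mission_event"}


class SketchRewriter(ast.NodeTransformer):
    def visit_Call(self, node):
        # Rewrite nested calls in args / kwargs first.
        self.generic_visit(node)
        func = node.func
        if (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "mission"
            and func.attr in SKETCH_RUNTIME_NAMES
        ):
            replacement = ast.Name(id=SKETCH_RUNTIME_NAMES[func.attr], ctx=ast.Load())
            node.func = ast.copy_location(replacement, func)
        return node


def sketch_rewrite(script):
    tree = SketchRewriter().visit(ast.parse(script, mode="exec"))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


print(sketch_rewrite("mission.observe(key='loss', value=0.1)"))
# prints: _mission_observe(key='loss', value=0.1)
```

# Attribute calls on any other base, such as ``other.observe(1)``, pass
# through the sketch unchanged.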

1360

1361

1362# ---------------------------------------------------------------------------

1363# Tool callable wrapper

1364# ---------------------------------------------------------------------------

1365

1366

1367def _make_tool_wrapper(

1368 tool_name: str,

1369 ctx: Any | None,

1370 tool_dispatcher: Callable[[str, dict[str, Any], Any], Awaitable[Any]],

1371 script_call_log: list[dict[str, Any]],

1372 session_id: str,

1373 iteration_index: int,

1374) -> Callable[..., Awaitable[Any]]:

1375 """Build the per-tool async wrapper inserted into ``external_functions``.

1376

1377 The wrapper is keyword-only by design — the Mission script grammar

1378 passes tool args as kwargs (``submit_job_sqs(manifest_path=...,

1379 region=...)``) and rejecting positionals at call time keeps the

1380 wrapper's record shape aligned with the engine's

1381 :class:`ToolCallRecord`. A script that calls

1382 ``submit_job_sqs("examples/x.yaml")`` with a positional argument

1383 fails immediately with a ``TypeError`` from Python's call

1384 machinery; that error surfaces through Monty as a

1385 ``MontyRuntimeError`` and is caught by the wrapper layer in

1386 :meth:`MissionSandbox.run`.

1387

1388 The wrapper appends one record to ``script_call_log`` per call,

1389 whether the call succeeded or raised. A raised exception still

1390 propagates out of the wrapper (so Monty surfaces it to the script

1391 as a Python exception the script can catch with

1392 ``try``/``except``), but the record carries ``status="failed"``

1393 plus a truncated error message so the engine's audit path sees

1394 every invocation.

1395

1396 On both success and failure the wrapper also emits a

1397 ``mission_script_call_event`` audit row tagged

1398 ``via_script=True``. The dispatch into ``tool_dispatcher`` runs

1399 the registered tool function, so the standard ``@audit_logged``

1400 entry has already fired by the time the wrapper reaches its emit

1401 site — the script-call event is a *second*, distinct row that

1402 lets consumers distinguish in-script invocations from direct

1403 ``tool_calls`` strategy invocations without having to walk

1404 timestamps.

1405 """

1406

1407 async def wrapper(**kwargs: Any) -> Any:

1408 # Snapshot the kwargs into a fresh dict before dispatch so the

1409 # log entry preserves exactly what the script passed even if

1410 # the dispatcher mutates the dict downstream.

1411 args = dict(kwargs)

1412 started = time.monotonic()

1413 try:

1414 result = await tool_dispatcher(tool_name, args, ctx)

1415 except Exception as exc:

1416 duration_ms = max(int((time.monotonic() - started) * 1000), 0)

1417 error_message = f"{type(exc).__name__}: {exc}"[:200]

1418 script_call_log.append(

1419 {

1420 "tool_name": tool_name,

1421 "args": args,

1422 "status": "failed",

1423 "result_summary": None,

1424 "duration_ms": duration_ms,

1425 # Truncated to 200 chars to match the audit

1426 # module's existing convention for error_message

1427 # fields elsewhere in the engine.

1428 "error_message": error_message,

1429 }

1430 )

1431 # Emit the via_script audit row before re-raising so the

1432 # event is recorded even when the script catches the

1433 # exception and continues executing.

1434 _audit.emit_script_call_event(

1435 session_id,

1436 iteration_index,

1437 tool_name,

1438 "failed",

1439 duration_ms,

1440 error_message=error_message,

1441 )

1442 raise

1443 duration_ms = max(int((time.monotonic() - started) * 1000), 0)

1444 record: dict[str, Any] = {

1445 "tool_name": tool_name,

1446 "args": args,

1447 "status": "ok",

1448 "result_summary": result,

1449 "duration_ms": duration_ms,

1450 }

1451 script_call_log.append(record)

1452 _audit.emit_script_call_event(

1453 session_id,

1454 iteration_index,

1455 tool_name,

1456 "ok",

1457 duration_ms,

1458 )

1459 return result

1460

1461 # Setting ``__name__`` makes Monty's traceback render the

1462 # operator's tool name rather than ``wrapper`` when a call goes

1463 # wrong inside the sandboxed script. The script_call_log remains

1464 # the canonical record of what fired.

1465 wrapper.__name__ = tool_name

1466 return wrapper
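# Illustrative sketch, not part of this module: a keyword-only async
# wrapper that logs every call, whether it succeeds or raises.
# ``make_logged_wrapper`` and ``fake_dispatcher`` are hypothetical; the
# sketch drops the ``ctx`` argument and the audit emit that the real
# wrapper performs.

```python
import asyncio
import time
from collections.abc import Awaitable, Callable
from typing import Any


def make_logged_wrapper(
    tool_name: str,
    dispatcher: Callable[[str, dict[str, Any]], Awaitable[Any]],
    call_log: list[dict[str, Any]],
) -> Callable[..., Awaitable[Any]]:
    async def wrapper(**kwargs: Any) -> Any:
        args = dict(kwargs)  # snapshot before dispatch
        started = time.monotonic()
        try:
            result = await dispatcher(tool_name, args)
        except Exception as exc:
            call_log.append(
                {
                    "tool_name": tool_name,
                    "args": args,
                    "status": "failed",
                    "duration_ms": max(int((time.monotonic() - started) * 1000), 0),
                    "error_message": f"{type(exc).__name__}: {exc}"[:200],
                }
            )
            raise  # the caller still sees the original exception
        call_log.append(
            {
                "tool_name": tool_name,
                "args": args,
                "status": "ok",
                "result_summary": result,
                "duration_ms": max(int((time.monotonic() - started) * 1000), 0),
            }
        )
        return result

    wrapper.__name__ = tool_name
    return wrapper


async def fake_dispatcher(name: str, args: dict[str, Any]) -> Any:
    if args.get("fail"):
        raise ValueError("boom")
    return {"echo": args}


async def demo() -> tuple[list[dict[str, Any]], bool]:
    log: list[dict[str, Any]] = []
    tool = make_logged_wrapper("echo_tool", fake_dispatcher, log)
    await tool(x=1)
    try:
        await tool(fail=True)
    except ValueError:
        pass
    positional_rejected = False
    try:
        await tool("positional")  # TypeError: wrapper is keyword-only
    except TypeError:
        positional_rejected = True
    return log, positional_rejected


log, positional_rejected = asyncio.run(demo())
```

# The positional call never reaches the dispatcher, so the log holds two
# records: one ``ok`` and one ``failed``.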

1467

1468

1469# ---------------------------------------------------------------------------

1470# Observation assembly

1471# ---------------------------------------------------------------------------

1472

1473

1474def _annotate_call_result(call: dict[str, Any]) -> Any:

1475 """Wrap a script-call ``result_summary`` with per-call markers.

1476

1477 Mirrors :meth:`MissionEngine._annotate_tool_result` for the

1478 scripted-strategy path so the Observation's ``tool_results`` list

1479 always carries the ``_status`` and ``tool_name`` markers the

1480 predicate evaluator and the ``tool_call_succeeded`` evaluator

1481 rely on, regardless of the underlying tool's return shape.

1482

1483 Strategy:

1484

1485 * **Dict result_summary**: augment a shallow copy with ``_status``

1486 and ``tool_name`` only when those keys are absent. This keeps any

1487 caller-supplied marker visible while ensuring evaluators always

1488 find them.

1489 * **Non-dict result_summary**: wrap in a fresh dict carrying

1490 the call's ``_status`` / ``tool_name`` plus a ``result`` field

1491 that holds the original payload so predicates can still walk

1492 into it.

1493 """

1494 result = call.get("result_summary")

1495 status = call.get("status") or "unknown"

1496 tool_name = call.get("tool_name")

1497 if isinstance(result, dict):

1498 annotated = dict(result)

1499 annotated.setdefault("_status", status)

1500 annotated.setdefault("tool_name", tool_name)

1501 return annotated

1502 return {

1503 "_status": status,

1504 "tool_name": tool_name,

1505 "result": result,

1506 }
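# Illustrative sketch, not part of this module: the two annotation
# branches applied to sample call records. ``annotate_sketch`` is a
# hypothetical copy of the logic above, and the tool names and payloads
# are invented for the example.

```python
from typing import Any


def annotate_sketch(call: dict[str, Any]) -> dict[str, Any]:
    result = call.get("result_summary")
    status = call.get("status") or "unknown"
    tool_name = call.get("tool_name")
    if isinstance(result, dict):
        annotated = dict(result)  # shallow copy; the logged dict is untouched
        annotated.setdefault("_status", status)  # existing marker wins
        annotated.setdefault("tool_name", tool_name)
        return annotated
    return {"_status": status, "tool_name": tool_name, "result": result}


dict_call = {
    "tool_name": "list_jobs",
    "status": "ok",
    "result_summary": {"_status": "custom", "count": 2},
}
scalar_call = {"tool_name": "count_jobs", "status": "ok", "result_summary": 7}

print(annotate_sketch(dict_call))
# prints: {'_status': 'custom', 'count': 2, 'tool_name': 'list_jobs'}
print(annotate_sketch(scalar_call))
# prints: {'_status': 'ok', 'tool_name': 'count_jobs', 'result': 7}
```

# The dict branch keeps the caller-supplied ``_status`` because
# ``setdefault`` only fills keys that are missing.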

1507

1508

1509def _build_script_observation(

1510 *,

1511 script_call_log: list[dict[str, Any]],

1512 observe_log: list[dict[str, Any]],

1513 event_log: list[dict[str, Any]],

1514 phase_started_at: str,

1515 phase_ended_at: str,

1516) -> dict[str, Any]:

1517 """Merge the closure-captured logs into an Observation dict.

1518

1519 Mirrors :meth:`MissionEngine._build_observation` for the

1520 ``tool_calls`` strategy path so a downstream Evaluate_Phase /

1521 Decide_Phase consumer cannot tell, from the Observation shape

1522 alone, whether the iteration ran a scripted or a non-scripted

1523 Strategy:

1524

1525 * ``tool_results`` lists every call's annotated ``result_summary`` (including

1526 failures, for stable indexing against ``script_call_log``).

1527 * ``metrics`` lifts any top-level ``metrics`` dict from a

1528 successful tool result, exactly like the engine does.

1529 * ``events`` pools the events emitted by tool results with the

1530 ``mission.event(...)`` calls so the criteria evaluator only

1531 walks one list.

1532 * ``errors`` carries failed / skipped calls in the same shape the

1533 engine uses, so the decide-phase heuristic that triggers

1534 ``adjust`` on new errors keeps working unchanged.

1535

1536 The ``mission.observe(...)`` rows fold into a dedicated

1537 ``observations`` bucket inside ``metrics`` rather than flat-merging

1538 so a script-collected key cannot silently overwrite a tool-derived

1539 metric of the same name. A criterion that wants a script-collected

1540 key reads ``metrics.observations.<key>``; a criterion that wants a

1541 tool-derived metric reads ``metrics.<key>``. The two namespaces

1542 stay distinct.

1543 """

1544 tool_results: list[Any] = []

1545 metrics: dict[str, Any] = {}

1546 events: list[dict[str, Any]] = []

1547 errors: list[dict[str, Any]] = []

1548

1549 for call in script_call_log:

1550 tool_results.append(_annotate_call_result(call))

1551 if call.get("status") == "ok":  # 1551 ↛ 1563: line 1551 didn't jump to line 1563 because the condition on line 1551 was always true

1552 result = call.get("result_summary")

1553 if isinstance(result, dict):  # 1553 ↛ 1554: line 1553 didn't jump to line 1554 because the condition on line 1553 was never true

1554 result_metrics = result.get("metrics")

1555 if isinstance(result_metrics, dict):

1556 metrics.update(result_metrics)

1557 result_events = result.get("events")

1558 if isinstance(result_events, list):

1559 for event in result_events:

1560 if isinstance(event, dict):

1561 events.append(event)

1562 else:

1563 errors.append(

1564 {

1565 "tool_name": call.get("tool_name"),

1566 "status": call.get("status"),

1567 "error_message": call.get("error_message"),

1568 }

1569 )

1570

1571 # Pool the script-side ``mission.event(...)`` calls with

1572 # tool-derived events. ``dict(ev)`` is a defensive copy so a later

1573 # mutation of the closure list does not bleed into the persisted

1574 # Observation.

1575 for ev in event_log:

1576 events.append(dict(ev))

1577

1578 # ``mission.observe(...)`` rows fold into a dedicated bucket on

1579 # metrics so they remain addressable without colliding with

1580 # tool-derived metric names.

1581 if observe_log:  # 1581 ↛ 1587: line 1581 didn't jump to line 1587 because the condition on line 1581 was always true

1582 observations_bucket: dict[str, Any] = {}

1583 for entry in observe_log:

1584 observations_bucket[entry["key"]] = entry["value"]

1585 metrics["observations"] = observations_bucket

1586

1587 observation: dict[str, Any] = {

1588 "tool_results": tool_results,

1589 "metrics": metrics,

1590 "events": events,

1591 "phase_started_at": phase_started_at,

1592 "phase_ended_at": phase_ended_at,

1593 }

1594 if errors:  # 1594 ↛ 1595: line 1594 didn't jump to line 1595 because the condition on line 1594 was never true

1595 observation["errors"] = errors

1596 return observation
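
The namespace separation described in the docstring can be illustrated with a minimal sketch. The sample metric names and values are hypothetical; the folding logic mirrors the loop above.

```python
from typing import Any

# Hypothetical tool-derived metrics and script-side ``mission.observe`` rows.
tool_metrics: dict[str, Any] = {"latency_ms": 12}
observe_log: list[dict[str, Any]] = [
    {"key": "latency_ms", "value": 999},  # same name as a tool-derived metric
    {"key": "retries", "value": 1},
    {"key": "retries", "value": 2},  # repeated key: the later row wins
]

metrics: dict[str, Any] = dict(tool_metrics)
if observe_log:
    # Script-collected keys live under ``metrics["observations"]`` only.
    metrics["observations"] = {row["key"]: row["value"] for row in observe_log}
```

The tool-derived `latency_ms` stays at `metrics["latency_ms"]` while the script-collected value is read from `metrics["observations"]["latency_ms"]`. Note that within the bucket, a key observed more than once keeps only its last value.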

1597

1598

1599# ---------------------------------------------------------------------------

1600# MissionSandbox

1601# ---------------------------------------------------------------------------

1602

1603

1604class MissionSandbox:

1605 """Run a validated Mission script under ``MontySandboxProvider`` limits.

1606

1607 One sandbox per iteration. The constructor freezes the per-iteration

1608 ``mission`` namespace as a :class:`types.MappingProxyType` snapshot

1609 (so a script cannot reach back through ``mission`` and mutate the

1610 session record), pins the operator's tool allowlist, and builds the

1611 underlying ``MontySandboxProvider`` with the duration / memory

1612 limits read from the module-level constants. :meth:`run` then

1613 drives a single script execution end to end:

1614

1615 1. AST validate via :func:`validate_script_ast` — propagation of

1616 :class:`ScriptRejected` is the engine's signal to fail the

1617 Execute_Phase with reason ``script_rejected``.

1618 2. Build the ``external_functions`` map: one async wrapper per

1619 allowlisted tool, each forwarding into the engine's tool

1620 dispatcher so the wrapper preserves the existing

1621 ``@audit_logged`` / feature-flag / allowlist semantics — running

1622 inside a script is *not* a way to bypass any of those.

1623 3. Execute under Monty's caps. Any ``MontyError`` (limit /

1624 runtime / typing / syntax) is re-raised as

1625 :class:`SandboxTerminated` carrying whatever the script

1626 collected before being killed.

1627 4. Fold the closure-captured tool log, observe log, and event log

1628 into an Observation dict whose shape exactly matches the

1629 engine's tool-calls path.

1630

1631 The sandbox is immutable after construction: there are no setters,

1632 no rebuild methods, and the underlying provider is held by

1633 reference rather than recreated per call. Each iteration gets its

1634 own MissionSandbox so a stale frozen namespace cannot leak across

1635 iterations.

1636 """

1637

1638 def __init__(

1639 self,

1640 allowlist: list[str],

1641 session: Any,

1642 ) -> None:

1643 # Defensive copy of the allowlist: the engine pins the

1644 # allowlist on the session at create time, but a shared list

1645 # reference would let later mutations slip past the AST

1646 # validator's frozenset (which is constructed once per

1647 # validation call from ``self._allowlist``).

1648 self._allowlist: list[str] = list(allowlist)

1649

1650 # Build the per-iteration mission namespace as an immutable

1651 # snapshot. Each iteration summary carries only the four

1652 # fields a script needs to reason about prior progress —

1653 # full IterationRecord shapes would be both heavy and

1654 # tempting for a script to walk in ways the engine does not

1655 # support.

1656 iteration_summaries: list[dict[str, Any]] = []

1657 for it in session.get("iterations") or []:  # 1657 ↛ 1658: line 1657 didn't jump to line 1658 because the loop on line 1657 never started

1658 iteration_summaries.append(

1659 {

1660 "iteration_index": it.get("iteration_index"),

1661 "verdict": it.get("verdict"),

1662 "verdict_reason": it.get("verdict_reason"),

1663 "checkpoint_evaluated": it.get("checkpoint_evaluated"),

1664 }

1665 )

1666 # ``copy.deepcopy`` on criteria + budget: ``MappingProxyType``

1667 # only blocks writes to the top-level mapping, so without the

1668 # copy a script that walks the nested values via subscripting

1669 # could still mutate the session record.

1670 ns: dict[str, Any] = {

1671 "session_id": session["session_id"],

1672 "iteration_index": len(session.get("iterations") or []),

1673 "directive_text": session.get("directive_text", ""),

1674 "criteria": copy.deepcopy(session.get("criteria") or []),

1675 "budget": copy.deepcopy(session.get("budget") or {}),

1676 "iterations": iteration_summaries,

1677 }

1678 self._frozen_mission_ns: MappingProxyType[str, Any] = MappingProxyType(ns)

1679

1680 # Construct the provider once and pin it on the instance.

1681 # The provider holds no per-call state, so reusing it across

1682 # multiple ``run`` calls would be safe in principle, but the

1683 # one-sandbox-per-iteration lifetime keeps the failure

1684 # surface small and matches the rest of the per-iteration

1685 # state above.

1686 provider_cls, _ = _import_provider()

1687 self._provider = provider_cls(

1688 limits={

1689 "max_duration_secs": _DURATION_LIMIT_SECS,

1690 "max_memory": _MEMORY_LIMIT_BYTES,

1691 }

1692 )

1693

1694 # ---- read-only accessors ------------------------------------------

1695

1696 @property

1697 def frozen_mission_ns(self) -> MappingProxyType[str, Any]:

1698 """The iteration's frozen ``mission`` namespace snapshot."""

1699 return self._frozen_mission_ns

1700

1701 @property

1702 def allowlist(self) -> list[str]:

1703 """Defensive copy of the per-session tool allowlist."""

1704 return list(self._allowlist)

1705

1706 # ---- public surface -----------------------------------------------

1707

1708 async def run(

1709 self,

1710 script: str,

1711 ctx: Any | None,

1712 tool_dispatcher: Callable[[str, dict[str, Any], Any], Awaitable[Any]],

1713 ) -> tuple[dict[str, Any], list[dict[str, Any]]]:

1714 """Validate, execute, and observe a Mission script.

1715

1716 Returns ``(observation, script_call_log)`` matching the shape

1717 the engine's ``_execute_script`` expects: the observation is a

1718 plain dict (the engine casts it to :class:`Observation` at the call

1719 site) and the call log is a list of

1720 :class:`ToolCallRecord`-shaped dicts.

1721

1722 On any ``MontyError`` from the provider — duration cap, memory

1723 cap, runtime / typing / syntax error inside the script — the

1724 method re-raises as :class:`SandboxTerminated` carrying the

1725 closure-captured partial observations and events. The engine's

1726 decide-phase pattern-matches on this exception and produces a

1727 ``terminate`` verdict for the iteration.

1728

1729 ``ScriptRejected`` from the AST validator propagates upward

1730 unchanged: the engine's Execute_Phase treats that as a

1731 ``script_rejected`` failure and never reaches the runtime path

1732 below.

1733 """

1734 # Step 1: AST gate. Propagating ``ScriptRejected`` upward is

1735 # deliberate — the engine's _execute_phase wraps it as a

1736 # phase failure with reason ``script_rejected``; doing the

1737 # rejection here means the runtime path never sees a

1738 # disallowed source.

1739 validate_script_ast(script, self._allowlist)

1740

1741 _, monty_error_cls = _import_provider()

1742

1743 # Closure-captured collectors. Populated synchronously by the

1744 # host-side helper closures registered as

1745 # ``external_functions`` and the per-tool wrappers; observed

1746 # post-run (or post-termination) to build the Observation.

1747 # Lists rather than dicts so the order in which the script

1748 # called ``mission.event`` / ``mission.observe`` is preserved

1749 # in the final record.

1750 observe_log: list[dict[str, Any]] = []

1751 event_log: list[dict[str, Any]] = []

1752 script_call_log: list[dict[str, Any]] = []

1753

1754 # Host-side helpers for ``mission.observe`` and

1755 # ``mission.event``. Routing them through the

1756 # ``external_functions`` channel — rather than as bound

1757 # methods on a dataclass shipped via ``inputs`` — is what

1758 # makes script-side mutations visible to the host:

1759 # ``MontySandboxProvider`` round-trips ``inputs`` values into

1760 # the underlying Monty VM by value, so a closure list

1761 # captured on a method body of an ``inputs`` dataclass would

1762 # only ever see the VM-side copy. The external-functions

1763 # channel runs each call back in host Python, so the lists

1764 # below receive the appends.

1765 #

1766 # The signatures match the original ``mission.observe`` /

1767 # ``mission.event`` script-facing surface: ``observe`` takes

1768 # ``(key, value)`` positionally, ``event`` takes ``name``

1769 # positionally plus arbitrary keyword arguments. The AST

1770 # rewrite below replaces the attribute callee with a bare

1771 # Name lookup but leaves args / kwargs unchanged, so the

1772 # call shape that lands on these helpers is exactly what an

1773 # operator would write at the script surface.

1774 async def _mission_observe(key: str, value: Any) -> None:

1775 observe_log.append({"key": key, "value": value})

1776

1777 async def _mission_event(name: str, **kwargs: Any) -> None:

1778 event_row: dict[str, Any] = {"event_name": name}

1779 event_row.update(kwargs)

1780 event_log.append(event_row)

1781

1782 # The frozen mission namespace remains pinned on this

1783 # sandbox instance (``self._frozen_mission_ns``) so a future

1784 # widening of the script surface can expose it without

1785 # rebuilding the construction-time snapshot. It does *not*

1786 # ride through the ``inputs`` channel today: the validator

1787 # never accepts attribute access on anything other than

1788 # ``mission`` (and the only two ``mission`` attributes are

1789 # the ``observe`` / ``event`` helpers handled by the

1790 # preamble below), so a script has no way to read the

1791 # snapshot through Monty's runtime. Holding it on the host

1792 # side is the simpler shape; routing it as a ``Mapping``

1793 # through ``inputs`` would require Monty to convert the

1794 # full dataclass + nested dicts to its own value model and

1795 # pay a per-iteration translation cost for data nothing

1796 # observes.

1797

1798 # Build the external_functions mapping. Each tool name maps

1799 # to an async wrapper; Monty's ``external_functions`` channel

1800 # auto-wraps sync callables to async, but we register native

1801 # async functions so the dispatcher's ``await`` chain stays

1802 # explicit and the wrapper can do its own timing.

1803 external_functions: dict[str, Callable[..., Any]] = {}

1804 # Pull the per-iteration identifiers off the frozen namespace

1805 # snapshot built at construction time so the wrapper records

1806 # the same ``session_id`` / ``iteration_index`` the rest of

1807 # the iteration's audit rows carry.

1808 session_id = self._frozen_mission_ns["session_id"]

1809 iteration_index = self._frozen_mission_ns["iteration_index"]

1810 for tool_name in self._allowlist:

1811 external_functions[tool_name] = _make_tool_wrapper(

1812 tool_name,

1813 ctx,

1814 tool_dispatcher,

1815 script_call_log,

1816 session_id,

1817 iteration_index,

1818 )

1819

1820 # The two helper functions ride alongside the per-tool

1821 # wrappers under reserved underscore-prefixed names. Operator

1822 # scripts cannot collide with these: the AST validator

1823 # rejects ``_mission_observe`` and ``_mission_event`` as

1824 # bare names (neither is on the per-session tool allowlist

1825 # nor any of the safe-builtin / exception / mission base

1826 # sets), so a script that wrote ``_mission_observe(...)``

1827 # directly would fail the gate with ``name_not_allowed``.

1828 # Only the AST rewrite below — applied *after* the gate —

1829 # ever produces those Name nodes.

1830 external_functions["_mission_observe"] = _mission_observe

1831 external_functions["_mission_event"] = _mission_event

1832

1833 # The validated operator source is re-parsed and rewritten

1834 # so every accepted ``mission.<helper>(...)`` Call's callee

1835 # becomes a bare-Name lookup of the corresponding reserved

1836 # external-function name. Monty's parser does not accept

1837 # ``class`` / nested-attribute shims that would otherwise

1838 # let us preserve the surface attribute call, so the

1839 # rewrite happens on the AST itself before the source ever

1840 # reaches the underlying VM. Operator code keeps its

1841 # author-time surface (``await mission.observe(key, value)``);

1842 # only the run-time surface differs.

1843 final_source = _rewrite_mission_helpers(script)

1844

1845 phase_started_at = datetime.now(UTC).isoformat()

1846

1847 try:

1848 await self._provider.run(

1849 code=final_source,

1850 inputs={},

1851 external_functions=external_functions,

1852 )

1853 except monty_error_cls as exc:

1854 # ``MontyError`` is the base of the limit / runtime /

1855 # typing / syntax error family. Catching the base class

1856 # rather than the leaves means a future Monty release

1857 # adding a new error type still routes through

1858 # ``SandboxTerminated`` rather than escaping as an opaque

1859 # ``Exception``.

1860 raise SandboxTerminated(

1861 type(exc).__name__,

1862 partial_observations=list(observe_log),

1863 partial_events=list(event_log),

1864 partial_script_call_log=list(script_call_log),

1865 ) from exc

1866

1867 phase_ended_at = datetime.now(UTC).isoformat()

1868

1869 # The script's return value is intentionally ignored: the

1870 # contract documented for the script surface is "use

1871 # ``mission.observe(...)`` / ``mission.event(...)`` to report

1872 # data". A script that returned a dict would conflict with

1873 # the helper-driven observation list, and the engine's

1874 # observe-phase already accepts a pre-built Observation

1875 # without consulting any return value.

1876 observation = _build_script_observation(

1877 script_call_log=script_call_log,

1878 observe_log=observe_log,

1879 event_log=event_log,

1880 phase_started_at=phase_started_at,

1881 phase_ended_at=phase_ended_at,

1882 )

1883 return observation, list(script_call_log)
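
The helper rewrite that `run` applies after the AST gate is not shown in this part of the file. The following is only a sketch of how such a rewrite could look with `ast.NodeTransformer`; the class name `MissionHelperRewriter`, the `HELPER_NAMES` table, and the use of `ast.unparse` (Python 3.9+) are assumptions, and the module's own `_rewrite_mission_helpers` may differ.

```python
import ast

# Assumed mapping from ``mission.<attr>`` to the reserved external-function name.
HELPER_NAMES = {"observe": "_mission_observe", "event": "_mission_event"}


class MissionHelperRewriter(ast.NodeTransformer):
    """Replace ``mission.observe(...)`` / ``mission.event(...)`` callees
    with bare Name lookups, leaving args and kwargs untouched."""

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # rewrite nested calls first
        func = node.func
        if (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "mission"
            and func.attr in HELPER_NAMES
        ):
            new_callee = ast.Name(id=HELPER_NAMES[func.attr], ctx=ast.Load())
            node.func = ast.copy_location(new_callee, func)
        return node


def rewrite_mission_helpers(source: str) -> str:
    tree = MissionHelperRewriter().visit(ast.parse(source, mode="exec"))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


# Hypothetical operator script using the author-time surface.
script = (
    "async def step():\n"
    "    await mission.observe('k', 1)\n"
    "    await mission.event('done', code=0)\n"
)
rewritten = rewrite_mission_helpers(script)
```

Running the rewrite only after validation matters: the validator rejects `_mission_observe` and `_mission_event` as bare names, so these identifiers can reach the VM solely through the rewrite.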

1884

1885

1886# ---------------------------------------------------------------------------

1887# Default factory

1888# ---------------------------------------------------------------------------

1889

1890

1891def make_default_sandbox_runner(

1892 allowlist: list[str],

1893 session: Any,

1894) -> Callable[

1895 [str, Any, Callable[[str, dict[str, Any], Any], Awaitable[Any]]],

1896 Awaitable[tuple[dict[str, Any], list[dict[str, Any]]]],

1897]:

1898 """Build the default ``sandbox_runner`` callable for the engine.

1899

1900 The :class:`MissionEngine` takes a callable matching the

1901 ``SandboxRunner`` protocol (``(script, ctx, tool_dispatcher) ->

1902 (observation_dict, script_call_log)``); this helper wraps a fresh

1903 :class:`MissionSandbox` for a given session and returns the bound

1904 :meth:`MissionSandbox.run` method so the engine can drive the

1905 sandbox without depending on the sandbox class itself.

1906

1907 One sandbox per session: the constructor freezes a snapshot of the

1908 session's directive, criteria, budget, and prior-iteration

1909 summaries into the ``mission`` namespace, so reusing a runner

1910 across sessions would leak stale state. The engine's normal

1911 construction path therefore calls this factory once per

1912 ``mission_start`` and pins the returned callable on the engine

1913 instance for the session's lifetime.

1914 """

1915 sandbox = MissionSandbox(

1916 allowlist=allowlist,

1917 session=session,

1918 )

1919 return sandbox.run
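
The snapshot that the constructor freezes combines `MappingProxyType` with `copy.deepcopy`. A minimal sketch of why both are useful, assuming a hypothetical session dict: the proxy blocks writes to the top-level mapping only, while nested values stay mutable unless they are copied.

```python
import copy
from types import MappingProxyType

# Hypothetical session record with one nested mapping.
session = {"session_id": "s-1", "budget": {"max_iterations": 5}}

# Proxy over a namespace that shares the nested dict with the session.
shallow = MappingProxyType({"budget": session["budget"]})
# Proxy over a namespace that holds a deep copy instead.
isolated = MappingProxyType({"budget": copy.deepcopy(session["budget"])})

try:
    shallow["budget"] = {}  # top-level assignment is rejected by the proxy
    top_level_blocked = False
except TypeError:
    top_level_blocked = True

shallow["budget"]["max_iterations"] = 0  # nested write goes through
leaked = session["budget"]["max_iterations"]

session["budget"]["max_iterations"] = 5  # restore for the second check
isolated["budget"]["max_iterations"] = 0  # only the copy changes
preserved = session["budget"]["max_iterations"]
```

With the shared nested dict the write leaks back into the session record; with the deep copy the session value is preserved.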

1920

1921

1922# ---------------------------------------------------------------------------

1923# Public surface

1924# ---------------------------------------------------------------------------

1925

1926

1927__all__ = [

1928 "MissionSandbox",

1929 "ScriptRejected",

1930 "SandboxTerminated",

1931 "make_default_sandbox_runner",

1932 "validate_script_ast",

1933]