Multi-Agent Orchestration

From prompt to production — per-job containers, weak/strong tier execution, contracts-first verification

Overview

The Synoema orchestrator is a distributed agent fleet daemon. A human submits a single job — free-form task text plus an acceptance.toml — and the daemon drives the entire lifecycle to a verified deploy. It plans the work into a DAG of subtasks, runs each subtask in an isolated Docker container with its own short-lived virtual API key, watches the container logs for failures, classifies them, fixes the trivial ones, escalates the rest, captures a pre-deploy git snapshot so it can roll back if a fresh deploy starts crashing, and gates the whole pipeline on optional human approvals.

LLM coding agents are individually capable but operationally unsupervised. Running ten of them in parallel on the same workstation invites leaked credentials, runaway cost, container OOM kills, regressions deployed to production, and an inbox full of "I'm not sure" half-answers. The orchestrator is the substrate that turns "ten agents in parallel" into a system you can leave running overnight.

Three load-bearing constraints define the design philosophy:

One job per container. Nothing is multiplexed inside a container. Each subtask gets its own ephemeral filesystem, its own short-lived virtual API key, its own resource caps. Parallelism comes from spawning more containers, never from sharing one.
Weak / strong model tiering. Planning, review, and goal-drift detection use a "strong" model (Claude Opus / GPT-4 class); implementation, tests, and docs use a "weak" model (Sonnet / GPT-4-mini / Haiku class). Configured per-tier in config::ModelRoster, weighted-sampled at dispatch.
Contracts-first verification at every boundary. Every transition is gated: budget check before the LLM call, type/contract check before write, drift score before deploy, snapshot before deploy, approval before deploy when configured, log classifier after deploy, automatic rollback if a Critical event fires within the post-deploy window. The orchestrator does not believe an agent's "done" — it verifies it.
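
As an illustration of the per-tier weighted sampling mentioned above, here is a minimal sketch. The `ModelEntry` struct, its fields, and `pick_weighted` are hypothetical names chosen for this example; they are not the actual `config::ModelRoster` API. The random draw is passed in by the caller so the function stays deterministic and testable.

```rust
/// Hypothetical roster entry: a model name plus its relative sampling weight.
/// (Illustrative only; not the real `config::ModelRoster` layout.)
#[derive(Debug, Clone, PartialEq)]
struct ModelEntry {
    name: &'static str,
    weight: u32,
}

/// Pick one entry in proportion to its weight.
/// `roll` is a uniform draw in [0, 1) supplied by the caller.
fn pick_weighted(entries: &[ModelEntry], roll: f64) -> Option<&ModelEntry> {
    let total: u32 = entries.iter().map(|e| e.weight).sum();
    if total == 0 {
        return None; // empty roster, or every weight is zero
    }
    // Map the roll onto the cumulative weight line [0, total).
    let target = (roll.clamp(0.0, 0.999_999) * total as f64) as u32;
    let mut cumulative = 0;
    for entry in entries {
        cumulative += entry.weight;
        if target < cumulative {
            return Some(entry);
        }
    }
    entries.last()
}
```

With weights 3 and 1, a roll of 0.5 lands on the first entry and a roll of 0.8 on the second, so the first model receives roughly three quarters of the dispatches over many draws.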

What the orchestrator is not. Not a CI server (no GitHub triggers, no PR comments). Not a Kubernetes scheduler (Docker subprocess driver, no orchestration of long-running services). Not a model router (delegates to LiteLLM via synoema-cred-broker). Not the agent itself — that lives in synoema-agent and runs inside the container.

The execution loop

One tokio task ticks every 10 seconds, BFS-walks queued subtasks, and dispatches anything whose dependencies are satisfied. Dispatch hands the subtask to an injectable ExecFn; in production this is default_docker_exec, which spawns a docker run, streams output via docker logs --follow, and writes a per-subtask cost report when the container exits. Status changes broadcast over an SSE channel so the dashboard updates in real time.

sno orchestrator submit
        │
        ▼
┌──────────────────────────────────────────┐
│  HTTP server  (axum)                     │
│  POST /jobs → row in jobs table          │
│  optional Socratic clarification         │
└────────────────┬─────────────────────────┘
                 ▼
┌──────────────────────────────────────────┐
│  Planner  (planner.rs)                   │
│  strong-tier LLM call                    │
│  fallback: dag::plan canned 4-subtask    │
│  (plan → impl ‖ tests → review)          │
└────────────────┬─────────────────────────┘
                 ▼
┌──────────────────────────────────────────┐
│  ExecutorLoop  (executor.rs, every 10s)  │
│  BFS-walk DAG → ready set                │
│  for each ready subtask:                 │
│    BudgetTracker.check                   │
│    cred-broker.issue_key                 │
│    ExecFn.spawn(docker run)              │
│    LimitsPoller.attach                   │
└────────────────┬─────────────────────────┘
                 ▼
┌──────────────────────────────────────────┐
│  Container  (one per subtask)            │
│  synoema-agent runs inside               │
│  writes telemetry JSONL → stdout         │
└────────────────┬─────────────────────────┘
                 ▼
┌──────────────────────────────────────────┐
│  log_watcher  (3-tier classifier)        │
│  Tier 1: regex patterns                  │
│  Tier 2: known-bug-signatures table      │
│  Tier 3: LLM stub                        │
│  → auto_fix or escalation                │
└────────────────┬─────────────────────────┘
                 ▼
┌──────────────────────────────────────────┐
│  Drift gate + snapshot + deploy          │
│  drift_score < 5 blocks deploy           │
│  pre-deploy git SHA → snapshots table    │
│  webhook → host-side deploy daemon       │
│  rollback on Critical (3/day cap)        │
└────────────────┬─────────────────────────┘
                 ▼
┌──────────────────────────────────────────┐
│  Cost report → SSE broadcast             │
│  audit row → approvals / rollback /      │
│  deploy / escalations tables             │
└──────────────────────────────────────────┘
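
The "ready set" step of the loop can be sketched as follows. This is a simplified, assumption-laden illustration: the `Subtask` and `Status` types are hypothetical, and the function does a single pass over the subtasks rather than a literal BFS walk. It only shows the rule that a queued subtask becomes dispatchable once all of its dependencies are done.

```rust
use std::collections::HashSet;

/// Hypothetical subtask states; the real FSM likely has more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Queued,
    Running,
    Done,
}

/// Hypothetical subtask record with explicit dependency ids.
struct Subtask {
    id: &'static str,
    deps: Vec<&'static str>,
    status: Status,
}

/// Return the ids of queued subtasks whose dependencies are all Done.
fn ready_set(subtasks: &[Subtask]) -> Vec<&'static str> {
    let done: HashSet<&str> = subtasks
        .iter()
        .filter(|s| s.status == Status::Done)
        .map(|s| s.id)
        .collect();
    subtasks
        .iter()
        .filter(|s| s.status == Status::Queued)
        .filter(|s| s.deps.iter().all(|d| done.contains(d)))
        .map(|s| s.id)
        .collect()
}
```

Applied to the canned fallback DAG (plan → impl ‖ tests → review), once `plan` is done both `impl` and `tests` are ready in the same tick, while `review` waits for both of them.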

Safety nets

Six independent gates catch the failure modes that bite multi-agent fleets in practice. Each one is a separate module, separately testable, and writes its own audit row.

Budget caps (`budget.rs`)

Three independent caps are enforced before every LLM call: a daily USD cap across the whole daemon, a per-job USD cap for the active job, and a semaphore that bounds the number of in-flight container dispatches. Exceeding any cap returns a structured BudgetExceeded error and never spawns the container.
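
A minimal sketch of the three-cap check, under stated assumptions: the field names, the use of integer cents, the `BudgetExceeded` variants, and the single-job simplification are all hypothetical. Only the names `BudgetTracker` and `BudgetExceeded` and the existence of three caps come from the text above.

```rust
/// Hypothetical shape of the structured error; variant names are assumptions.
#[derive(Debug, PartialEq, Eq)]
enum BudgetExceeded {
    Daily { spent_cents: u64, cap_cents: u64 },
    Job { spent_cents: u64, cap_cents: u64 },
    Concurrency { in_flight: usize, cap: usize },
}

/// Simplified tracker for a single active job, in integer cents.
struct BudgetTracker {
    daily_cap_cents: u64,
    job_cap_cents: u64,
    max_in_flight: usize,
    daily_spent_cents: u64,
    job_spent_cents: u64,
    in_flight: usize,
}

impl BudgetTracker {
    /// Check all three caps before a dispatch estimated at `estimated_cents`.
    /// Returns the first cap that would be exceeded.
    fn check(&self, estimated_cents: u64) -> Result<(), BudgetExceeded> {
        if self.daily_spent_cents + estimated_cents > self.daily_cap_cents {
            return Err(BudgetExceeded::Daily {
                spent_cents: self.daily_spent_cents,
                cap_cents: self.daily_cap_cents,
            });
        }
        if self.job_spent_cents + estimated_cents > self.job_cap_cents {
            return Err(BudgetExceeded::Job {
                spent_cents: self.job_spent_cents,
                cap_cents: self.job_cap_cents,
            });
        }
        if self.in_flight >= self.max_in_flight {
            return Err(BudgetExceeded::Concurrency {
                in_flight: self.in_flight,
                cap: self.max_in_flight,
            });
        }
        Ok(())
    }
}
```

Returning a typed error rather than a boolean lets the caller record which cap tripped and by how much, which fits the audit-row approach described for the other gates.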

Approval gates (`container.rs`, `approvals` table)

Every container has an ApprovalMode field: Auto (no gate), Ask (writes a pending row and waits for a human), or Block (always rejected; a human must edit the mode). The mode is consulted by deploy_container_handler at the deploy gate and by auto_fix::decide at the auto-fix gate. Every decision is recorded in the approvals audit table with the decider, timestamp, and reason.
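
The three modes map naturally onto an enum and a match. The sketch below assumes a hypothetical `GateOutcome` type and a free function `gate`; the actual handlers (`deploy_container_handler`, `auto_fix::decide`) are not shown and may be structured differently.

```rust
/// The three approval modes named in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ApprovalMode {
    Auto,
    Ask,
    Block,
}

/// Hypothetical result of consulting a gate.
#[derive(Debug, PartialEq, Eq)]
enum GateOutcome {
    /// No gate: continue immediately.
    Proceed,
    /// A pending approvals row was written; wait for a human decision.
    AwaitHuman,
    /// Always rejected until a human changes the mode.
    Rejected,
}

/// Translate a container's approval mode into what the gate should do.
fn gate(mode: ApprovalMode) -> GateOutcome {
    match mode {
        ApprovalMode::Auto => GateOutcome::Proceed,
        ApprovalMode::Ask => GateOutcome::AwaitHuman,
        ApprovalMode::Block => GateOutcome::Rejected,
    }
}
```

Because the match is exhaustive, adding a fourth mode later would fail to compile until every gate decides how to handle it.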

Snapshots and rollback (`rollback.rs`, `snapshots` table)

Every deploy captures the pre-deploy git SHA into the snapshots table before the webhook fires. If log_watcher classifies a post-deploy event as Critical within the post-deploy observation window, the orchestrator rolls back to the captured SHA and re-deploys. Capped at 3 automatic rollbacks per day to prevent flap loops.
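
The rollback rule can be sketched as a pure decision function. Several things here are assumptions: the `DeployState` and `Action` types, the use of plain seconds for timestamps, and the observation window being passed as a parameter (the text does not state its length). What happens once the daily cap is reached is also not specified above, so the sketch only reports `CapReached`.

```rust
/// Cap from the text: at most 3 automatic rollbacks per day.
const MAX_ROLLBACKS_PER_DAY: u32 = 3;

/// Hypothetical view of the most recent deploy.
struct DeployState {
    snapshot_sha: String, // pre-deploy git SHA captured before the webhook fired
    deployed_at_secs: u64,
    rollbacks_today: u32,
}

#[derive(Debug, PartialEq, Eq)]
enum Action {
    RollbackTo(String),
    CapReached, // handling of this case is not described in the text
    Ignore,
}

/// Decide what to do with a classified post-deploy event.
fn on_event(
    state: &DeployState,
    is_critical: bool,
    event_at_secs: u64,
    window_secs: u64,
) -> Action {
    let in_window = event_at_secs >= state.deployed_at_secs
        && event_at_secs - state.deployed_at_secs <= window_secs;
    if !is_critical || !in_window {
        return Action::Ignore;
    }
    if state.rollbacks_today >= MAX_ROLLBACKS_PER_DAY {
        return Action::CapReached;
    }
    Action::RollbackTo(state.snapshot_sha.clone())
}
```

Keeping the decision free of I/O makes the flap-loop cap easy to unit test without a real deploy target.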

Drift block (`drift.rs`)

The original prompt is hashed at submit. After the DAG completes, a strong-tier reviewer compares the resulting diff against the original prompt and emits a drift score. A score below 5 (out of 10) blocks the deploy and queues an escalation. This catches the case where an agent has technically passed every contract but quietly delivered something unrelated to what was asked.
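
The threshold logic itself is small. The sketch below assumes the score arrives as an integer from 0 to 10 and uses hypothetical names (`DriftDecision`, `drift_gate`); how the reviewer produces the score is out of scope here.

```rust
/// From the text: scores below 5 (out of 10) block the deploy.
const DRIFT_BLOCK_BELOW: u8 = 5;
const DRIFT_MAX: u8 = 10;

#[derive(Debug, PartialEq, Eq)]
enum DriftDecision {
    Proceed,
    BlockAndEscalate,
}

/// Apply the drift threshold; reject scores outside the 0..=10 scale.
fn drift_gate(score: u8) -> Result<DriftDecision, String> {
    if score > DRIFT_MAX {
        return Err(format!(
            "drift score {} is outside 0..={}",
            score, DRIFT_MAX
        ));
    }
    if score < DRIFT_BLOCK_BELOW {
        Ok(DriftDecision::BlockAndEscalate)
    } else {
        Ok(DriftDecision::Proceed)
    }
}
```

Note the direction of the scale: a low score blocks, so a score of exactly 5 passes the gate.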

Integration tests (`integration_test.rs`)

The [test.*] sections of the submitted acceptance.toml are parsed and enqueued as [INT-TEST] subtasks after the main DAG completes. They run in their own ephemeral containers with the deploy artifact mounted read-only. Failure blocks deploy; pass writes to the integration_tests table.

Log classifier (`log_watcher.rs`)

A 3-tier classifier reads container stdout/stderr line by line. Tier 1 is hand-rolled regex against known panic / OOM / segfault / TLS-handshake-failure signatures, which match in microseconds. Tier 2 consults the known_bug_signatures table for project-specific patterns observed in past runs. Tier 3 is an LLM stub for unknown patterns. Events classified as (Minor, Trivial) become [AUTO-FIX] subtasks; everything else becomes an inbox row.
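
The tier fall-through can be sketched as below. To stay within the standard library, this example substitutes plain substring matching for the regex patterns Tier 1 actually uses, and models the known_bug_signatures table as an in-memory slice. The signature strings, labels, and the `Verdict` type are all hypothetical.

```rust
/// Hypothetical classification result, tagged by the tier that matched.
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Tier1(&'static str),
    Tier2(String),
    Unknown, // stands in for the Tier 3 LLM stub
}

/// Illustrative Tier 1 signatures as (needle, label) pairs.
/// Substring matching stands in for the real regex patterns.
const TIER1_SIGNATURES: &[(&str, &str)] = &[
    ("panicked at", "panic"),
    ("Out of memory", "oom"),
    ("Segmentation fault", "segfault"),
    ("handshake failure", "tls"),
];

/// Classify one log line, trying the cheapest tier first.
/// `known_bugs` holds (pattern, label) pairs, modelling the Tier 2 table.
fn classify(line: &str, known_bugs: &[(String, String)]) -> Verdict {
    for (needle, label) in TIER1_SIGNATURES {
        if line.contains(*needle) {
            return Verdict::Tier1(*label);
        }
    }
    for (pattern, label) in known_bugs {
        if line.contains(pattern.as_str()) {
            return Verdict::Tier2(label.clone());
        }
    }
    Verdict::Unknown
}
```

Ordering the tiers from cheapest to most expensive means most lines are resolved without a table lookup or a model call.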

Subsystem map

Nine logical subsystems. Each maps to one or more modules under lang/crates/synoema-orchestrator/src/.

Subsystem	Modules	Role
HTTP server + state	`server.rs`, `router.rs`	axum listener, ~40 routes, SSE channel, model roster, scheduled tasks registry, long-lived background tasks.
Persistence	`db.rs`, `error.rs`	`OrchestratorDb` wraps a `rusqlite::Connection` behind `Arc<Mutex<…>>`. 19 tables on first open, additive `ALTER TABLE` migrations on reopen.
Job + DAG model	`job.rs`, `subtask.rs`, `dag.rs`, `planner.rs`, `session.rs`	Job/Subtask FSM. `dag::plan` canned fallback (plan → impl ‖ tests → review). `planner::call_llm_planner` calls strong-tier LLM for richer DAGs.
Container management	`container.rs`, `vault.rs`, `limits.rs`, `docker.rs`	`ContainerConfig` (disk, env, deploy_mode, limits, approval_mode). `Vault` = AES-256-GCM keyed off `SNO_ORCH_VAULT_KEY`. `LimitsPoller` polls `docker stats`, kills containers exceeding caps.
Model roster + credentials	`config.rs`, `cred.rs`	`ModelRoster::strong`/`weak` weighted-sample. `cred.rs` shims `synoema-cred-broker` for per-job virtual keys against LiteLLM.
Execution + scheduling	`executor.rs`, `poller.rs`, `scheduled.rs`, `gc.rs`	`ExecutorLoop` ticks every 10s. `scheduled.rs` runs cron-driven builtin tasks (nightly log audit, weekly GC, daily report). `gc.rs` retention-prunes old job rows.
Deployment + safety	`deploy.rs`, `rollback.rs`	`dispatch_deploy` POSTs to webhook (host-side `sno orchestrator deploy-hook` daemon does `rsync` or `docker pull && docker restart`). `rollback.rs` auto-rolls-back on Critical events with 3/day cap.
Observability + classification	`log_watcher.rs`, `topology.rs`, `telemetry/`, `aggregator/`	3-tier log classifier. `topology.rs` polls `docker ps` every 60s and renders an SVG graph. `telemetry/` ingests JSONL counters; `aggregator/` rolls them up daily.
Quality + decision gates	`budget.rs`, `drift.rs`, `socratic.rs`, `auto_fix.rs`, `integration_test.rs`, `verifier.rs`, `escalation.rs`	Three budget caps. Drift score < 5 blocks deploy. Socratic detects uncertainty markers. Integration tests gate deploy. `escalation.rs` is the kind-tagged inbox shared by all gates.

Available CLI commands

Daemon and client live in the same binary. Run from any directory once sno is on PATH.

sno orchestrator start [--port 7777] [--detach]
        Start the daemon. Writes ~/.sno/orchestrator.pid; --detach forks
        and re-execs as a background process.

sno orchestrator stop
        Graceful daemon shutdown via POST /shutdown. Drains in-flight
        subtasks, then exits.

sno orchestrator status [--daemon | --job <id>] [--json]
        Daemon-level health (default) or per-job status snapshot.

sno orchestrator submit
        --task <text> --acceptance <toml>
        [--policy <toml>] [--budget '$X,Ymin,Zturns'] [--workspace <path>]
        Submit a job; the rest is autonomous.

sno orchestrator logs <job-id> [--follow]
        Stream job logs over SSE.

sno orchestrator cancel <job-id>
        Async cancel — graceful drain then docker stop.

sno orchestrator inbox [--json]
        List pending escalations awaiting human resolution.

sno orchestrator resolve <job-id> <subtask-id>
        --pick <option> | --abort
        Resolve an escalation row.

sno orchestrator review <job-id>
        Print the final acceptance report (verifier output).

sno orchestrator metrics
        [--aggregate | --by-profile | --by-model | --by-day]
        [--drift-map | --doc-coherence | --export <path>]
        Telemetry-hub dashboards over the aggregator tables.

Status

Beta. 17 modules, ~7000 LOC, 287 unit tests + 14 integration tests = 301 tests, 0 warnings. SQLite store with 19 tables — including dedicated audit tables for approvals, rollback, deploy, and the escalations inbox.

What is production-ready today: dispatch loop, budget caps, approval gates, snapshots and rollback, drift gate, integration tests, Tier-1 + Tier-2 log classifier, HTTP API, CLI, telemetry aggregation.

What is stubbed (and the architecture doc is explicit about it): Tier-3 LLM log classifier, planner LLM call (falls back to a canned 4-subtask DAG), reviewer LLM agents. These are seams the project expects to fill in later; they don't block the gates above.

Read the full architecture document

This page is the introduction. The full reference — data flow walk-throughs, state machines, full SQL schema, concurrency model, security model, extension points — lives in the Synoema repo:

docs/architecture/orchestrator.md on GitHub — 705 lines, 9 sections.
CLI Reference — the rest of the sno command surface.
Architecture — the compilation pipeline the orchestrator dispatches against.
AI Agent — the per-container worker (synoema-agent) the orchestrator runs inside each Docker container.