Multi-Agent Orchestration: From Prompt to Production
The Problem No Demo Shows You
Every agent demo looks the same. Type a prompt, watch a spinner, get a pull request. The code probably runs — on the demo machine. What the recording does not show is the part where you have to actually ship what the agent produced. That is where most agent stacks fall apart.
Production code generation is not a single LLM call. It is a pipeline with hard requirements that nobody talks about: every edit needs verification before it touches the main branch; the bill cannot run away while you sleep; when something explodes — and it will explode — there must be a recovery story that does not involve reading 4,000 lines of agent transcript at 3am. Most "autonomous coding" frameworks treat these as version-two problems. The Synoema orchestrator was built starting from them.
This article is a tour of the orchestrator as it stands today: how a prompt becomes a verified, shipped, rollback-able change. The system is in beta — 17 subsystems, 301 tests, 19 SQL tables — and several pieces are still stubs waiting for production LLM wiring. We will be honest about which.
The Execution Loop
The daemon is a single tokio process that exposes 11 HTTP endpoints (axum + SSE for log streaming) and a SQLite job store. Submitting a job writes a row, returns a ULID, and queues work. A poller wakes every few hundred milliseconds, picks the next runnable subtask, and hands it to the Docker driver.
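The submit-then-poll flow can be sketched in a few lines. This is a minimal Python illustration, not the daemon's code: the table layout, the function names, and the FIFO ordering are assumptions made for the example (the real planner works on a DAG), and a random UUID stands in for the ULID.

```python
import sqlite3
import uuid

# Hypothetical schema: one row per subtask, status moves queued -> running.
SCHEMA = """
CREATE TABLE IF NOT EXISTS subtasks (
    id      TEXT PRIMARY KEY,
    job_id  TEXT NOT NULL,
    seq     INTEGER NOT NULL,
    prompt  TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'queued'
)
"""


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db


def submit(db, prompts):
    """Write one row per subtask and return the job id."""
    job_id = uuid.uuid4().hex  # stand-in: the article's daemon returns a ULID
    with db:  # commits on success, rolls back on error
        for seq, prompt in enumerate(prompts):
            db.execute(
                "INSERT INTO subtasks (id, job_id, seq, prompt) VALUES (?, ?, ?, ?)",
                (uuid.uuid4().hex, job_id, seq, prompt),
            )
    return job_id


def claim_next(db):
    """Pick the oldest queued subtask and mark it running, or return None."""
    with db:
        row = db.execute(
            "SELECT id, job_id, seq, prompt FROM subtasks "
            "WHERE status = 'queued' ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        db.execute("UPDATE subtasks SET status = 'running' WHERE id = ?", (row[0],))
    return row
```

A poller would call claim_next on a timer and hand each returned row to the driver; a None result means there is nothing runnable on this tick.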
The driver runs the same fixed sequence for every subtask: spin up a fresh container, mount a workspace volume, run sno code with the chosen model and prompt, capture stdout/stderr, and collect a JSON cost report as the container exits. Container teardown is unconditional, so the next subtask gets a clean filesystem and a clean process tree.
Per-container isolation is not free: we pay 1–3 seconds of cold start per task. The isolation properties are worth it. A misbehaving agent cannot poison the next agent's environment. A failed task does not leak sockets, file handles, or background processes. The container is the unit of isolation, of resource accounting, and (with the cost report) of billing.
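As a rough Python sketch of one driver pass: the image name, the agent command, and the shape of the cost report are all assumptions here (the article does not specify them), and the report is assumed to arrive as the last non-empty line of stdout. The docker flags used (--rm, -v, -w) are standard ones.

```python
import json
import subprocess


def docker_argv(image, workspace, agent_cmd):
    """Build a `docker run` invocation for one subtask.

    --rm removes the container when it exits, which gives the
    unconditional teardown. `workspace` should be an absolute host path.
    """
    return ["docker", "run", "--rm",
            "-v", f"{workspace}:/workspace",
            "-w", "/workspace",
            image, *agent_cmd]


def parse_cost_report(stdout):
    """Assumption: the report is a JSON object on the last non-empty line."""
    lines = [ln for ln in stdout.splitlines() if ln.strip()]
    if not lines:
        return None
    try:
        report = json.loads(lines[-1])
    except json.JSONDecodeError:
        return None
    return report if isinstance(report, dict) else None


def run_subtask(image, workspace, agent_cmd, run=subprocess.run):
    """Run one subtask in a fresh container; `run` is injectable for tests."""
    proc = run(docker_argv(image, workspace, agent_cmd),
               capture_output=True, text=True)
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "cost": parse_cost_report(proc.stdout),
    }
```

Treating a missing or malformed report as None rather than zero keeps an unbilled run visible to whatever does the accounting.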
Tier 1 vs Tier 2: Spending Tokens Where They Matter
Not every step in a software change needs the same brain. Reading a file, applying a small edit, running tests — these are mechanical. Designing the change, reviewing the diff, deciding whether the agent is going off the rails — these need judgment. The orchestrator splits work along that line.
Tier 2 (cheap, fast: Haiku-class) handles execution. Each subtask in the DAG — a single edit, a single verification, a single test run — runs against Tier 2. Tier 1 (slow, expensive: Opus-class) handles planning, code review, drift checking, and escalations. A typical job burns 30–50 Tier-2 calls and 2–4 Tier-1 calls.
The cost asymmetry is brutal in our favor. A multi-file feature that would cost $4 end-to-end on a frontier model lands closer to $0.60 with this split — and the verification gates catch the cases where the cheap model went wrong. The principle: do not pay frontier prices for token-shoveling work.
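The routing rule itself is small. A minimal Python sketch, assuming subtasks carry a kind label; the kind names below are hypothetical, chosen to mirror the roles listed above:

```python
from enum import Enum


class Tier(Enum):
    TIER_1 = "tier-1"  # slow, expensive: planning and judgment
    TIER_2 = "tier-2"  # cheap, fast: mechanical execution


# Hypothetical kind labels, mirroring the roles named in the text.
JUDGMENT_KINDS = {"plan", "review", "drift_check", "escalation"}
EXECUTION_KINDS = {"edit", "verify", "test_run"}


def tier_for(kind):
    """Route a subtask kind to a model tier; unknown kinds are an error."""
    if kind in JUDGMENT_KINDS:
        return Tier.TIER_1
    if kind in EXECUTION_KINDS:
        return Tier.TIER_2
    raise ValueError(f"unknown subtask kind: {kind!r}")
```

Raising on an unknown kind, instead of defaulting to one tier, avoids silently sending judgment work to the cheap model or mechanical work to the expensive one.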
Budget Gates: Block Before You Spend
Three budget gates run before any container starts. A daily cap (configurable, defaults to a sane spend ceiling) checks aggregate spend across all jobs. A per-job cap is set at submission. A concurrent-execution semaphore caps the number of in-flight containers regardless of cost.
If any gate would be violated, the job stays queued and a structured warning lands in the inbox. The principle is "block before you spend" — the budget check happens against the projected cost of the next subtask before the docker run, not after. There is no scenario where the bill silently doubles because a runaway loop generated 500 retry calls overnight.
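A sketch of the pre-spend check in Python, under the assumption that the orchestrator can estimate the next subtask's cost up front. The field and gate names are illustrative, not taken from the codebase:

```python
from dataclasses import dataclass


@dataclass
class BudgetState:
    daily_cap_usd: float
    spent_today_usd: float
    job_cap_usd: float
    job_spent_usd: float
    max_in_flight: int
    in_flight: int


def blocked_by(state, projected_usd):
    """Return the gates the next subtask would violate (empty list = go)."""
    violations = []
    if state.spent_today_usd + projected_usd > state.daily_cap_usd:
        violations.append("daily_cap")
    if state.job_spent_usd + projected_usd > state.job_cap_usd:
        violations.append("job_cap")
    if state.in_flight >= state.max_in_flight:
        violations.append("concurrency")
    return violations
```

The caller would start the container only when the list is empty, and otherwise leave the job queued and put the violated gate names into the inbox warning.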
The Verification Chain: Three Checks at Three Boundaries
This is the part most agent stacks skip. After every meaningful edit, three independent checks run, each at a different boundary.
Drift score. Every prompt-edit pair gets hashed. A Tier-1 reviewer compares the diff against the original task and emits a score 0–1 for how far the change has wandered from its stated intent. High drift — the agent silently expanded scope or "fixed" a bug nobody asked about — trips a guard and the change is rejected. The check at the intent level: did we do what we said we would?
Socratic uncertainty. The executing agent emits self-graded confidence at each subtask boundary: "I am 80% sure this matches the spec; I had to guess about X." When confidence drops below threshold, the subtask escalates to the inbox instead of cascading downstream. The check at the execution level: do we know what we just did?
Integration test reviewer. A Tier-1 reviewer reads the final diff plus existing tests, generates an integration test that exercises the new behavior end-to-end, and runs it in a fresh container. Pass = green. Fail = red, the change does not advance. The check at the behavior level: does the system actually do the new thing?
Three checks, three independent failure modes. Drift catches scope creep. Socratic catches uncertain reasoning. Integration catches "passes unit tests but breaks in real use." Each check has a known error rate, but a bad change has to get past all three to ship. To the extent the gates fail independently, their miss rates multiply: as an illustration, three checks that each miss one bad change in ten would together miss about one in a thousand. (Drift reviewer and integration generator are currently stubbed pending production LLM wiring; the framework is in place, the prompts are pending.)
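How the three results might combine into one decision, as a Python sketch. The thresholds are placeholders (the article does not state the real ones), and the ordering of the checks is an assumption:

```python
from dataclasses import dataclass


@dataclass
class Checks:
    drift: float              # 0 = on task, 1 = far from the stated intent
    confidence: float         # agent's self-graded confidence, 0 to 1
    integration_passed: bool  # result of the generated integration test


def verdict(checks, max_drift=0.3, min_confidence=0.7):
    """Combine the three gates; thresholds here are illustrative only."""
    if checks.drift > max_drift:
        return "reject"      # scope creep: the change is rejected
    if checks.confidence < min_confidence:
        return "escalate"    # goes to the inbox instead of cascading downstream
    if not checks.integration_passed:
        return "reject"      # red: the change does not advance
    return "advance"
```

With the values from the sample log later in this article (drift 0.08, socratic 0.92, integration PASS), this sketch returns "advance".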
Approval Flow: Auto, Ask, Block
Every subtask has an approval mode. Auto — pass verification, advance silently. Ask — pass verification, queue for human review in the inbox before advancing. Block — never auto-advance, always require explicit human resolve.
The default for "production" deploys is Block. This is deliberate. The cost of bothering a human with an unnecessary review is low. The cost of an autonomous agent shipping the wrong thing to production is unbounded. Every approval — auto or human — lands in an audit trail keyed by job + subtask, with the diff, the verification scores, the model versions, and the resolving identity. If something ever goes wrong in prod, you have the receipt.
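The approval decision and its audit record can be sketched as follows, in Python. Two assumptions are baked in: a subtask that fails verification is rejected in every mode, and the audit trail is one JSON object per line. The field names are hypothetical.

```python
import json
from enum import Enum


class Mode(Enum):
    AUTO = "auto"
    ASK = "ask"
    BLOCK = "block"


def next_state(mode, verified, human_approved=None):
    """Decide what happens to a subtask.

    human_approved is None while nobody has resolved the inbox item.
    """
    if not verified:
        return "rejected"  # assumption: failed verification stops every mode
    if mode is Mode.AUTO:
        return "advance"
    # ASK and BLOCK both wait for an explicit human decision.
    if human_approved is None:
        return "awaiting_human"
    return "advance" if human_approved else "rejected"


def audit_line(job_id, subtask, mode, outcome, resolved_by, scores):
    """Serialize one approval as a single JSON line."""
    return json.dumps({
        "job": job_id,
        "subtask": subtask,
        "mode": mode.value,
        "outcome": outcome,
        "resolved_by": resolved_by,  # "auto" or a human identity
        "scores": scores,
    }, sort_keys=True)
```

In this simplified form Ask and Block behave the same once verification passes; whatever else distinguishes them in the real system is not spelled out above, so the sketch does not guess at it.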
The Recovery Story: When Things Break
Verification catches most problems before they ship. Recovery handles the ones that slip through anyway. This is the longest section because this is where most agent infrastructure has a hand-wave.
The log watcher is a 3-tier classifier that reads logs from running containers and deployed services. Its tiers are classifier stages, separate from the Tier 1/Tier 2 model split described earlier. Watcher tier 1: fast pattern matching against a catalog of error signatures (OOM, panic, segfault, well-known stack-trace shapes). Watcher tier 2: structural classification (regression vs. environmental failure vs. flaky test). Watcher tier 3: a call to the Tier-1 (Opus-class) model when the first two cannot decide. The expensive call only fires for genuinely ambiguous cases.
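The cheap-first cascade looks roughly like this in Python. The signature strings, the structural rules, and the evidence fields are all invented for illustration; only the three-stage shape comes from the description above. Plain substring matching is used, in keeping with the no-regex rule mentioned later.

```python
# Hypothetical signature catalog: (substring, label).
SIGNATURES = (
    ("out of memory", "oom"),
    ("panicked at", "panic"),
    ("segmentation fault", "segfault"),
)


def match_signature(line):
    """Watcher tier 1: cheap substring match against known signatures."""
    lowered = line.lower()
    for needle, label in SIGNATURES:
        if needle in lowered:
            return label
    return None


def classify_structure(passes_on_retry, passes_on_previous_commit, infra_error):
    """Watcher tier 2: hypothetical structural rules; None means undecided."""
    if infra_error:
        return "environmental"
    if passes_on_retry:
        return "flaky"
    if passes_on_previous_commit:
        return "regression"
    return None


def classify(line, evidence, ask_model):
    """Try the cheap stages first; call the model only if both are undecided."""
    label = match_signature(line)
    if label is not None:
        return ("signature", label)
    label = classify_structure(**evidence)
    if label is not None:
        return ("structural", label)
    return ("model", ask_model(line))
```

Passing ask_model in as a parameter means the expensive stage can be replaced with a stub in tests, and it is only invoked on the last line of classify.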
When a real bug is identified, the orchestrator can spawn an auto-fix subtask — a fresh job with the failing trace as input. Hard cap: one auto-fix per identified bug per day, with a dedup key from the trace fingerprint. Without dedup, a single recurring crash spawns 200 fix attempts in an hour, all failing the same way, all costing money. With dedup, the system tries once; if that fails the bug stays open for human attention.
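A minimal Python sketch of the dedup tracker. The normalization step (dropping digits and collapsing whitespace so line numbers and addresses do not change the key) is a crude stand-in for whatever fingerprinting the real system uses, and "per day" is assumed to mean calendar day.

```python
import hashlib
from datetime import date

_DIGITS = str.maketrans("", "", "0123456789")


def trace_fingerprint(trace):
    """Hash a trace after dropping digits and collapsing whitespace."""
    normalized = " ".join(trace.translate(_DIGITS).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class AutoFixDedup:
    """Allow at most one auto-fix per fingerprint per calendar day."""

    def __init__(self):
        self._seen = set()

    def should_spawn(self, trace, today):
        key = (trace_fingerprint(trace), today)
        if key in self._seen:
            return False  # already tried today; leave it for a human
        self._seen.add(key)
        return True
```

For example, two crashes whose traces differ only in a line number produce the same key, so the second call on the same day returns False.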
For deploys, every promotion captures a git snapshot first. If the post-deploy log watcher detects a regression within the monitoring window, auto-rollback runs: revert to snapshot, redeploy, file the incident. Auto-rollback is capped at 3 per day, after which the system goes manual-only. Three rollbacks in a day signal that something is structurally wrong, not that there is a bug worth retrying.
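The daily cap is a small counter that resets when the date changes. A Python sketch, with hypothetical names and the assumption that the cap is per calendar day:

```python
from datetime import date


class RollbackBudget:
    """Permit a fixed number of automatic rollbacks per calendar day."""

    def __init__(self, max_per_day=3):
        self.max_per_day = max_per_day
        self._day = None
        self._used = 0

    def try_consume(self, today):
        """Return True and count one rollback, or False once the cap is hit."""
        if today != self._day:
            self._day, self._used = today, 0  # new day: reset the counter
        if self._used >= self.max_per_day:
            return False  # manual-only from here on
        self._used += 1
        return True
```

The rollback executor would call try_consume before reverting; a False result is the signal to stop and hand the incident to a human.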
The general shape: a budget for every recovery action, a cap on automatic retries, and a clean handoff to a human when the automatic system has run out of options. Recovery is a feature, not a fallback. (Auto-fix LLM wiring is in flight; the dedup tracker, snapshot capture, and rollback executor are live.)
The Topology Map
A 60-second poll runs docker ps against the host, joins the result with the in-memory job table, and renders an SVG dashboard at /dashboard. Containers, current subtask, model in use, accumulated cost, and last log line — one screen, no JavaScript framework, no client-side state. It is unglamorous, and it has saved us several times when "the orchestrator is doing something weird" turned out to be visible at a glance.
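The join step can be sketched in Python as below. It assumes the poll uses docker ps --format '{{json .}}', which prints one JSON object per container, and that the job table can be looked up by container name; neither detail is confirmed above.

```python
import json


def parse_docker_ps(output):
    """Parse `docker ps --format '{{json .}}'` output, one object per line."""
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def join_with_jobs(containers, jobs_by_container):
    """Attach job info to each running container.

    jobs_by_container is a hypothetical mapping: container name -> job info.
    """
    rows = []
    for c in containers:
        rows.append({
            "container": c["Names"],
            "status": c["Status"],
            # None marks a container the orchestrator does not know about.
            "job": jobs_by_container.get(c["Names"]),
        })
    return rows
```

Rows with no matching job are worth rendering too: a container the job table has never heard of is exactly the kind of anomaly a one-screen dashboard makes visible.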
What We Learned Building It
A few design decisions paid for themselves repeatedly.
Hand-rolled, not pulled. No regex dep, no cron crate, no serde in the orchestrator surface. The classifier patterns, the schedule parser, the JSON cost-report reader are all hand-written. This sounds like a vanity rule, but it has a payoff: the entire orchestrator compiles in under 30 seconds, the binary is small, and there is no transitive dependency chain to audit. When a subsystem broke, we could read it.
Additive-only schema. 19 SQL tables, zero destructive migrations. New columns are added with defaults; old columns are deprecated but not removed. This is critical when the daemon is supposed to keep running across upgrades — a half-migrated table is a half-broken orchestrator.
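An additive migration in SQLite comes down to ALTER TABLE ... ADD COLUMN with a default, guarded so it can run on every startup. A Python sketch with hypothetical table and column names:

```python
import sqlite3


def column_names(db, table):
    """Return the set of column names; PRAGMA table_info puts the name at index 1."""
    return {row[1] for row in db.execute(f"PRAGMA table_info({table})")}


def add_column_if_missing(db, table, column, decl):
    """Additive migration: add a column with a default, never drop or rewrite.

    `table`, `column`, and `decl` must be trusted constants, since SQL
    identifiers cannot be bound as parameters.
    """
    if column in column_names(db, table):
        return False  # already migrated; safe to call again
    db.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True
```

Existing rows pick up the default immediately, so a daemon running the old code and one running the new code can both read the table while an upgrade is in progress.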
Injectable function pointers for tests. Every external call — docker, the LLM endpoint, git, the clock — goes through an injectable seam. The 301 tests run with everything external mocked. Test runtime is under 8 seconds. We can rebuild any failure scenario locally without burning a real LLM call. The principle: production code only depends on traits, never on concrete external clients.
Mock everything external first. Every new subsystem started with its mock. The real client came after the integration tests passed against the mock. This is slower for the first hour and dramatically faster for the next month.
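To make the seam idea concrete, here is a Python sketch of a clock injected into a timer such as the post-deploy monitoring window. The class name is hypothetical, and the orchestrator itself is not written in Python; the shape of the seam is the point.

```python
import time


class MonitoringWindow:
    """A deadline whose clock is injected, so tests never have to sleep."""

    def __init__(self, seconds, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + seconds

    def expired(self):
        return self._clock() >= self._deadline


# In a test, a fake clock is advanced by hand instead of waiting five minutes.
fake_now = [100.0]
window = MonitoringWindow(300, clock=lambda: fake_now[0])
fake_now[0] += 300
window_expired_after_advance = window.expired()
```

Production code passes nothing and gets the real monotonic clock; the test controls time completely, which is what keeps a suite of this kind fast.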
Status — and What's Next
The orchestrator is in beta. The HTTP daemon, Docker driver, DAG planner, budget gates, snapshot system, log watcher, escalation queue, and telemetry hub are all live and exercised by the 301-test suite. The CLI surface (sno orchestrator submit / status / logs / inbox / resolve / metrics / cancel) is stable.
What is in flight: production LLM wiring for the drift reviewer, the integration test generator, and the auto-fix subtask runner — the framework is there, the prompts are being tuned. After that, Podman runtime support (drop-in alternative to Docker for hardened environments) and a richer dashboard with historical cost rollups.
If you want to try it today, here is the minimum loop:
$ sno orchestrator start --port 7777 --detach
orchestrator started, pid 48211, dashboard at http://localhost:7777/dashboard
$ sno orchestrator submit \
    --task "add a /healthz endpoint that returns build sha + uptime" \
    --acceptance ./acceptance.toml \
    --policy ./policy.toml \
    --budget '$2,15min,40turns' \
    --workspace .
job submitted: 01HXYZK8QS9F6D2VTAK7N7PT3M
$ sno orchestrator logs 01HXYZK8QS9F6D2VTAK7N7PT3M --follow
[plan] 5 subtasks queued (4 tier-2, 1 tier-1 review)
[exec] subtask 1/5 (tier-2): edit src/server.sno ... ok ($0.04, 1.2s)
[exec] subtask 2/5 (tier-2): add tests/healthz_test.sno ... ok ($0.06, 1.7s)
[verify] drift score 0.08, socratic 0.92, integration PASS
[ask] subtask 5/5 awaiting human approval (mode=Ask)
$ sno orchestrator inbox
1 pending: job 01HXYZK8QS9F6D2VTAK7N7PT3M, subtask 5, deploy-to-prod (Ask)
$ sno orchestrator resolve 01HXYZK8QS9F6D2VTAK7N7PT3M 5 --pick approve
deploy queued. snapshot captured: a3f7b2c. monitoring window: 5min.
And when something goes wrong — because eventually something does — the recovery side looks like this:
$ sno orchestrator review 01HXYZK8QS9F6D2VTAK7N7PT3M
job: 01HXYZK8QS9F6D2VTAK7N7PT3M status: rolled_back
deploy: a3f7b2c -> 7d1e9f4 reverted at +3min12s
trigger: log_watcher tier-2 (regression: 5xx rate +180% vs baseline)
auto-rollback: 1/3 today
artifacts:
  - audit-trail: ~/.sno/audit/01HXYZK8QS9F6D2VTAK7N7PT3M.jsonl
  - cost-report: $0.42 (under $2 budget)
  - snapshot: git ref a3f7b2c (recoverable)
That is the whole loop. Submit, plan, execute under tier-split, verify at three boundaries, approve under policy, ship with a snapshot, watch with a classifier, recover with a budget. None of it is novel in isolation — the contribution is in actually wiring all of it together and being honest about where the seams still are.
Related Articles
How We Automated Our Entire Dev Workflow with Claude Code Skills
The skills system that the orchestrator builds on: directory-based, parallel-capable, zero-config.
Explainer: How a Compiler Catches AI Mistakes Before They Run
Verification at the language level — the foundation underneath the orchestrator's verification chain.
Results: From Zero to 41%: Building an AI That Writes Working Code
Where the agent quality numbers come from and where the remaining 59% of failures live.