Benchmarks
Performance data for token efficiency, JIT compilation, parallel speedup, and IoT artifact sizes
1. Token Efficiency
Synoema is designed so that LLMs generate correct programs with the fewest possible tokens. Fewer tokens per program means lower API cost, faster generation, and fewer opportunities for the model to drift off course. Token counts use cl100k_base (tiktoken, exact — not estimated). SPDX headers are stripped for a fair comparison.
The benchmark suite covers 16 tasks across 5 languages (Phase A). Tasks include: factorial, fibonacci, quicksort, filter_map, binary_search, string_ops, json_build, error_handling, pattern_match, type_definition, and more.
| Benchmark suite | Tasks | Method |
|---|---|---|
| Phase A — Token Efficiency | 16 | cl100k_base BPE count, 5 languages |
| Phase B — Runtime Performance | 12 | Median of 5 runs, 3 warm-ups discarded |
| Phase C — LLM Generation | 9 | 5 repeats, temperature 0.2 |
| Phase D — Model Size Reduction | 30 | Qwen2.5-Coder 0.5B–7B, 3 configs |
Token counting methodology
- Counting: tiktoken cl100k_base (exact, not approximate)
- SPDX license headers stripped for all languages before counting
- Same algorithm implemented identically in each language
- Synoema uses no extra boilerplate: no import noise, no class wrappers, no type annotations required
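The counting steps above can be sketched roughly as follows. This is a minimal illustration, not the benchmark runner's code: the helper names `strip_spdx` and `count_tokens` are hypothetical, and it assumes an SPDX header is any line containing `SPDX-License-Identifier`. The tokenizer is passed in as a function; with the tiktoken package installed, `tiktoken.get_encoding("cl100k_base").encode` could be supplied to get real cl100k_base counts.

```python
from typing import Callable, Sequence

# Assumption: a license header is any line carrying this SPDX tag.
SPDX_MARKER = "SPDX-License-Identifier"


def strip_spdx(source: str) -> str:
    """Drop every line that carries an SPDX license tag."""
    kept = [line for line in source.splitlines() if SPDX_MARKER not in line]
    return "\n".join(kept).lstrip("\n")


def count_tokens(source: str, encode: Callable[[str], Sequence]) -> int:
    """Count tokens in `source` after removing SPDX header lines."""
    return len(encode(strip_spdx(source)))


# Demo with a whitespace splitter as a stand-in tokenizer.
# It is NOT cl100k_base; it only shows that the header is excluded.
sample = "# SPDX-License-Identifier: MIT\nx = 1 + 2\n"
print(count_tokens(sample, str.split))
```

Passing the encoder as a parameter keeps the stripping logic testable without any tokenizer dependency.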
Run it yourself
cd benchmarks
cargo run --manifest-path runner/Cargo.toml -- run --phases token
Results are saved to benchmarks/results/<date>_run_<NNN>/summary.txt.
2. JIT Performance
Synoema compiles to native code via the Cranelift JIT, the same code generator used by Wasmtime and by rustc's optional Cranelift backend for debug builds. The JIT is invoked with sno jit file.sno.
Runtime benchmark setup (Phase B)
- 12 tasks: factorial, fibonacci, quicksort, mergesort, collatz, gcd, fizzbuzz, filter_map, binary_search, matrix_mult, string_ops, and more
- Measured via subprocess timing (std::time::Instant)
- 3 warm-up runs discarded; 5 runs measured; median reported
- C++ compiled with -O2 -std=c++17
- Synoema uses a pre-built release binary (not debug build)
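The warm-up and median procedure described above could look roughly like this. It is a sketch in Python, not the Rust runner itself; `measure` and `run_command` are hypothetical names, and `time.perf_counter` stands in for `std::time::Instant`.

```python
import statistics
import subprocess
import time
from typing import Callable, List, Tuple


def measure(run_once: Callable[[], None],
            warmups: int = 3,
            runs: int = 5) -> Tuple[float, List[float]]:
    """Discard `warmups` runs, time `runs` runs, return (median, samples) in seconds."""
    for _ in range(warmups):
        run_once()  # warm-up: executed but not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), samples


def run_command(cmd: List[str]) -> Callable[[], None]:
    """Wrap a command line so each call spawns it as a subprocess."""
    def run_once() -> None:
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return run_once
```

For example, `measure(run_command(["sno", "jit", "file.sno"]))` would time the JIT path end to end, compilation included, since the whole subprocess is measured.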
Runtime benchmark numbers require a full Phase B run against your hardware. See docs/benchmarks.md for instructions.
Run it yourself
# Token + runtime (no API key needed)
cargo run --manifest-path runner/Cargo.toml -- run --phases token,runtime
# Verbose: see exact commands, timing per run
cargo run --manifest-path runner/Cargo.toml -- run --phases runtime -v
JIT vs interpreter
For single-threaded numeric workloads, the JIT backend typically delivers significantly lower latency than the tree-walking interpreter. The interpreter is the default for sno run; the JIT is invoked explicitly with sno jit. Both share the same type system and semantics.
Caveats
- Runtime benchmarks include JIT compilation time for Synoema (not just execution)
- TypeScript via tsx includes startup overhead (not representative of production TS)
- C++ compiled with -O2, not -O3 or -Ofast
3. Parallel Speedup (pmap)
pmap is the parallel map combinator in Synoema. It distributes work across available CPU cores using real OS threads. Below the threshold of 64 elements (PMAP_PAR_THRESHOLD), pmap falls back to sequential execution to avoid closure-cloning overhead on short lists.
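The threshold fallback can be illustrated with a small Python sketch. It mirrors only the dispatch logic described above, not Synoema's implementation: the worker count taken from `os.cpu_count()` is an assumption, and because Python threads share the interpreter lock, this version will not show the CPU-bound speedups reported below.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")

# Value taken from the text; lists shorter than this run sequentially.
PMAP_PAR_THRESHOLD = 64


def pmap(fn: Callable[[T], R], items: Iterable[T],
         threshold: int = PMAP_PAR_THRESHOLD) -> List[R]:
    """Map `fn` over `items`, using OS threads only for long lists."""
    data = list(items)
    if len(data) < threshold:
        # Short list: plain sequential map, no thread setup cost.
        return [fn(x) for x in data]
    workers = os.cpu_count() or 1  # assumption: one worker per logical core
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order in its results.
        return list(pool.map(fn, data))
```

Results come back in input order in both branches, so callers cannot observe which path was taken except through timing.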
Measured speedup — PM-4 stress test
Measured via cargo test -p synoema-codegen --test stress pm4 and cargo test -p synoema-eval --test stress pm4 on a local Apple Silicon workstation (10 logical cores).
| Backend | List size | Per-element work | Cores | Sequential map | Parallel pmap | Speedup |
|---|---|---|---|---|---|---|
| JIT | 1024 | 400-iter heavy loop | 10 | 120 ms | 60 ms | 2.04× |
| Interpreter | 128 | 20-iter heavy loop | 10 | 2055 ms | 582 ms | 3.53× |
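As a quick check on the table, speedup is the sequential time divided by the parallel time, and dividing that by the core count gives parallel efficiency. The snippet below recomputes both from the rounded timings in the table; the function names are illustrative.

```python
def speedup(sequential_ms: float, parallel_ms: float) -> float:
    """Ratio of sequential to parallel wall-clock time."""
    return sequential_ms / parallel_ms


def efficiency(sequential_ms: float, parallel_ms: float, cores: int) -> float:
    """Fraction of the ideal (linear) speedup actually achieved."""
    return speedup(sequential_ms, parallel_ms) / cores


# (backend, sequential ms, parallel ms, cores) as listed in the table
rows = [("JIT", 120, 60, 10), ("Interpreter", 2055, 582, 10)]
for name, seq, par, cores in rows:
    print(f"{name}: {speedup(seq, par):.2f}x speedup, "
          f"{efficiency(seq, par, cores):.0%} of ideal")
```

The interpreter row reproduces the listed 3.53×. The JIT row's rounded timings give 2.00×, so the 2.04× in the table presumably comes from unrounded measurements. Both rows sit well below the 10× ideal for 10 cores.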
Honest caveats
- Sample size: single runs during PM-4 test execution; timings vary ±20% under concurrent cargo-test harness load
- Assertion in CI is loose (ratio ≥ 1.0× on ≥4 cores) to avoid flakes; the numbers above come from clean runs
- Lists below PMAP_PAR_THRESHOLD = 64 run sequentially; pmap of short lists is slower than map by a small constant (closure-cloning overhead)
- Nested pmap falls back to sequential on the inner call (PM-8); no dedicated benchmark yet
Syntax
-- pmap: parallel version of map
results = pmap (\x -> heavy_computation x) large_list
-- Falls back to sequential for short lists
results_short = pmap (\x -> x * 2) [1 2 3] -- sequential, list < 64
Pending
- Throughput numbers for send/recv at varying payload sizes
- Scaling studies for spawn/scope across core counts (1/2/4/8 sweep)
- Longevity results (RSS growth, thread-count stability)
4. IoT WASM Artifact Sizes
Synoema compiles IoT automation rules to WebAssembly (sno wasm rule.sno). The resulting .wasm files are the artifacts deployed to microcontrollers and edge devices via wasm3 or wasmtime.
Measured artifact sizes
| Metric | Value | Context |
|---|---|---|
| Mean | 82.1 B | Across 15 real vertical IoT rules |
| Max | 107 B | Most complex rule in the set |
| Min | 71 B | Simplest threshold rule |
These measurements come from the 15 rules across Home / Industrial / Wearable verticals. All rules compile from a single Synoema source file with no external dependencies.
Wave 2 results (30 rules)
Wave 2 expanded to 3 additional verticals (Automotive, Agriculture, Healthcare) for 30 rules total.
| Metric | Value |
|---|---|
| Compile pass (sno check) | 30 / 30 |
| WASM compile pass (sno wasm) | 29 / 30 (96.7%) |
| Mean artifact size | 200 B |
The one failing rule uses list pattern matching (Nil/Cons) not yet registered in WASM ctor_tags — a known limitation, not a language design issue.
Why size matters
Microcontrollers have flash storage measured in kilobytes, not megabytes. A 107 B rule takes up a negligible share of that budget and deploys over-the-air in a single UDP packet.
Run it yourself
# Compile one IoT rule to WASM
sno wasm lang/examples/iot/rules/rule_fan_on_temp.sno
# Check artifact size
ls -la *.wasm
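To summarize sizes across many compiled rules instead of reading `ls` output by eye, a small script along these lines could be used. It is a sketch: `wasm_size_summary` is a hypothetical helper, and it assumes the `.wasm` files sit directly in the given directory.

```python
import statistics
from pathlib import Path


def wasm_size_summary(directory: str) -> dict:
    """Return count, min, max and mean size in bytes of *.wasm files in `directory`."""
    sizes = sorted(p.stat().st_size for p in Path(directory).glob("*.wasm"))
    if not sizes:
        return {"count": 0}
    return {
        "count": len(sizes),
        "min": sizes[0],
        "max": sizes[-1],
        "mean": statistics.mean(sizes),
    }


# Summarize whatever artifacts are in the current directory.
print(wasm_size_summary("."))
```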
5. LLM Generation Quality (Phase C & D)
Phase C measures how well LLMs generate correct Synoema code from natural-language prompts. Phase D specifically measures the impact of compact references and multi-pass error correction on small model (0.5B–7B) generation quality.
Phase C models (3 tiers)
| Tier | Models |
|---|---|
| Frontier | GPT-4o, Gemini 2.5 Pro, Qwen3 Max |
| Mid | GPT-4o-mini, DeepSeek V3, Qwen3 Coder, Llama 4 Maverick |
| Weak | Qwen3.5 9B, LFM 1.2B (free), Reka Edge 7B |
Phase D configurations
| Config | Reference | Tokens | Multi-pass | Purpose |
|---|---|---|---|---|
| baseline | docs/llm/synoema.md | ~1800 | No | Control |
| compact | docs/llm/synoema-compact.md | ~900 | No | Compact reference impact |
| multipass | docs/llm/synoema-compact.md | ~900 | Yes (2 retries) | Error feedback impact |
Multi-pass uses temperature decay (0.7 → 0.4 → 0.2) with llm_hint error feedback from the compiler.
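The multi-pass loop can be sketched as below. This is an illustration under stated assumptions, not the benchmark harness: `generate` and `check` are hypothetical callables standing in for the model call and the compiler, and feeding back only the most recent error message is a guess at how the llm_hint feedback is used.

```python
from typing import Callable, Optional, Tuple

# One initial attempt plus two retries, with the decay given in the text.
TEMPERATURES = (0.7, 0.4, 0.2)


def generate_multipass(
    prompt: str,
    generate: Callable[[str, float], str],   # hypothetical: (prompt, temperature) -> code
    check: Callable[[str], Optional[str]],   # hypothetical: code -> error text, or None if OK
) -> Tuple[Optional[str], int]:
    """Return (accepted code or None, number of attempts used)."""
    feedback = ""
    for attempt, temperature in enumerate(TEMPERATURES, start=1):
        code = generate(prompt + feedback, temperature)
        error = check(code)
        if error is None:
            return code, attempt
        # Assumption: only the latest compiler error is appended to the prompt.
        feedback = f"\n\nPrevious attempt failed with:\n{error}"
    return None, len(TEMPERATURES)
```

Lowering the temperature on each retry trades diversity for determinism once the model has concrete error text to work from.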
Fine-tuned models (local, no API key)
| Model | Size | run_pass | Best for |
|---|---|---|---|
| synoema-coder-3b-v3 | 1.8 GB | 70.2% | Daily use, low memory |
| synoema-coder-7b-v1 | 4.4 GB | 71.2% | Higher accuracy |
| synoema-iot-lite-1.5b-v2 | — | 97.1%* | IoT rules only |
* compile_pass on IoT rule test set
Async throughput
Async event-loop reactor benchmarks (timer wheel, file I/O pool, TCP via mio) are still being measured. So far, the reactor handles 1,000 concurrent 5 ms sleeps in ~5 ms using a timer wheel on a single OS thread.
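The shape of that measurement can be reproduced with any single-threaded event loop. The sketch below uses Python's asyncio purely as an analogy: it is not Synoema's reactor, asyncio schedules timers with a heap and not a timer wheel, and its absolute numbers say nothing about Synoema. It only shows why many concurrent sleeps finish in roughly the duration of one.

```python
import asyncio
import time


async def run_sleeps(count: int = 1000, delay_s: float = 0.005) -> float:
    """Start `count` sleeps at once on one thread; return total elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(asyncio.sleep(delay_s) for _ in range(count)))
    return time.perf_counter() - start


elapsed = asyncio.run(run_sleeps())
print(f"1000 concurrent 5 ms sleeps took {elapsed * 1000:.1f} ms")
```

Run sequentially, the same sleeps would take about 5 seconds, so any result near the single-sleep duration indicates the timers are being multiplexed and not serialized.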
6. Run the Benchmark Suite
# Token efficiency only (fastest, no runtime deps)
cd benchmarks
cargo run --manifest-path runner/Cargo.toml -- run --phases token
# Token + runtime (no API key needed)
cargo run --manifest-path runner/Cargo.toml -- run --phases token,runtime
# Full suite including LLM code generation (OpenRouter)
cargo run --manifest-path runner/Cargo.toml -- run --all --openrouter-key YOUR_KEY
# Local ollama (no API key needed)
cargo run --manifest-path runner/Cargo.toml -- run --all --ollama
Results are saved to benchmarks/results/<date>_run_<NNN>/: summary.txt, details.txt, raw.json.
Full methodology: docs/benchmarks.md