Benchmarks

Performance data for token efficiency, JIT compilation, parallel speedup, and IoT artifact sizes

1. Token Efficiency

Synoema is designed so that LLMs generate correct programs with the fewest possible tokens. Fewer tokens per program means lower API cost, faster generation, and fewer opportunities for the model to drift off course. Token counts use cl100k_base (tiktoken, exact — not estimated). SPDX headers are stripped for a fair comparison.

Phase A of the benchmark suite covers 16 tasks across 5 languages. Tasks include factorial, fibonacci, quicksort, filter_map, binary_search, string_ops, json_build, error_handling, pattern_match, and type_definition, among others.

Benchmark suite	Tasks	Method
Phase A — Token Efficiency	16	cl100k_base BPE count, 5 languages
Phase B — Runtime Performance	12	Median of 5 runs, 3 warm-ups discarded
Phase C — LLM Generation	9	5 repeats, temperature 0.2
Phase D — Model Size Reduction	30	Qwen2.5-Coder 0.5B–7B, 3 configs

Token counting methodology

Counting: tiktoken cl100k_base (exact, not approximate)
SPDX license headers stripped for all languages before counting
Same algorithm implemented identically in each language
Synoema uses no extra boilerplate: no import noise, no class wrappers, no type annotations required
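
The counting rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the benchmark runner's code: the helper names (strip_spdx, count_tokens, cl100k_encoder) and the set of comment prefixes recognised as SPDX headers are assumptions. The tokenizer is passed in as a function so the stripping logic can be checked without tiktoken installed.

```python
import re
from typing import Callable, List

# Whole-line SPDX header in a few common comment styles.
# Assumption: headers look like "-- SPDX-License-Identifier: MIT" (or //, #, /*).
SPDX_LINE = re.compile(r"^\s*(--|//|#|/\*)\s*SPDX-[A-Za-z-]+:.*$")


def strip_spdx(source: str) -> str:
    """Drop lines that consist only of an SPDX header comment."""
    kept = [line for line in source.splitlines() if not SPDX_LINE.match(line)]
    return "\n".join(kept).strip() + "\n"


def count_tokens(source: str, encode: Callable[[str], List[int]]) -> int:
    """Count tokens in `source` after SPDX headers are removed."""
    return len(encode(strip_spdx(source)))


def cl100k_encoder() -> Callable[[str], List[int]]:
    """Return tiktoken's cl100k_base encoder (requires `pip install tiktoken`)."""
    import tiktoken  # imported lazily so the helpers above work without it

    return tiktoken.get_encoding("cl100k_base").encode


if __name__ == "__main__":
    sample = "-- SPDX-License-Identifier: MIT\nfact n = n * fact (n - 1)\n"
    print(count_tokens(sample, cl100k_encoder()))
```

Running the same function over each language's implementation of a task gives directly comparable counts, since every file goes through the same stripping step and the same encoder.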

Run it yourself

cd benchmarks
cargo run --manifest-path runner/Cargo.toml -- run --phases token

Results are saved to benchmarks/results/<date>_run_<NNN>/summary.txt.

2. JIT Performance

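Synoema compiles to native code via the Cranelift JIT. Cranelift is the code generator used by Wasmtime and is also available as an alternative rustc backend aimed at debug builds (rustc's default backend is LLVM). The JIT is invoked with sno jit file.sno.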

Runtime benchmark setup (Phase B)

12 tasks: factorial, fibonacci, quicksort, mergesort, collatz, gcd, fizzbuzz, filter_map, binary_search, matrix_mult, string_ops, and more
Measured via subprocess timing (std::time::Instant)
3 warm-up runs discarded; 5 runs measured; median reported
C++ compiled with -O2 -std=c++17
Synoema uses a pre-built release binary (not debug build)
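
The measurement protocol above (discard 3 warm-ups, time 5 runs, report the median) can be reproduced roughly as follows. The actual runner is written in Rust and uses std::time::Instant; this Python version is only a sketch of the same procedure, and the function names are made up for illustration.

```python
import statistics
import subprocess
import sys
import time
from typing import List, Sequence


def time_once_ms(cmd: Sequence[str]) -> float:
    """Wall-clock time of one subprocess run, in milliseconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return (time.perf_counter() - start) * 1000.0


def median_runtime_ms(cmd: Sequence[str], warmups: int = 3, runs: int = 5) -> float:
    """Run `cmd` `warmups` times unmeasured, then return the median of `runs` timings."""
    for _ in range(warmups):
        time_once_ms(cmd)  # result discarded
    samples: List[float] = [time_once_ms(cmd) for _ in range(runs)]
    return statistics.median(samples)


if __name__ == "__main__":
    # Placeholder command; substitute e.g. ["sno", "jit", "file.sno"].
    cmd = [sys.executable, "-c", "pass"]
    print(f"median: {median_runtime_ms(cmd):.1f} ms")
```

Because the whole subprocess is timed, process startup and (for Synoema) JIT compilation are part of every sample, which matches the caveats listed further down.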

Runtime numbers are hardware-dependent and are not listed here; to obtain them, run the full Phase B suite on your own machine. See docs/benchmarks.md for instructions.

Run it yourself

# Token + runtime (no API key needed)
cargo run --manifest-path runner/Cargo.toml -- run --phases token,runtime

# Verbose: see exact commands, timing per run
cargo run --manifest-path runner/Cargo.toml -- run --phases runtime -v

JIT vs interpreter

For single-threaded numeric workloads, the JIT backend typically delivers significantly lower latency than the tree-walking interpreter. The interpreter is the default for sno run; the JIT is invoked explicitly with sno jit. Both share the same type system and semantics.

Caveats

Runtime benchmarks include JIT compilation time for Synoema (not just execution)
TypeScript via tsx includes startup overhead (not representative of production TS)
C++ compiled with -O2, not -O3 or -Ofast

3. Parallel Speedup (pmap)

pmap is the parallel map combinator in Synoema. It distributes work across available CPU cores using real OS threads. Below the threshold of 64 elements (PMAP_PAR_THRESHOLD), pmap falls back to sequential execution to avoid closure-cloning overhead on short lists.

Measured speedup — PM-4 stress test

Measured via cargo test -p synoema-codegen --test stress pm4 and cargo test -p synoema-eval --test stress pm4 on a local Apple Silicon workstation (10 logical cores).

Backend	List size	Per-element work	Cores	Sequential map	Parallel pmap	Speedup
JIT	1024	400-iter heavy loop	10	120 ms	60 ms	2.04×
Interpreter	128	20-iter heavy loop	10	2055 ms	582 ms	3.53×

Honest caveats

Sample size: single runs during PM-4 test execution; timings vary ±20% under concurrent cargo-test harness load
Assertion in CI is loose (ratio ≥ 1.0× on ≥4 cores) to avoid flakes; the numbers above come from clean runs
Lists below PMAP_PAR_THRESHOLD = 64 run sequentially — pmap of short lists is slower than map by a small constant (closure-cloning overhead)
Nested pmap falls back to sequential on the inner call (PM-8) — no dedicated benchmark yet

Syntax

-- pmap: parallel version of map
results = pmap (\x -> heavy_computation x) large_list

-- Falls back to sequential for short lists
results_short = pmap (\x -> x * 2) [1 2 3]  -- sequential, list < 64
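
For readers who want to see the fallback rule in isolation, here is a small Python model of it. It is a sketch of the dispatch logic only, assuming a thread pool as a stand-in for Synoema's runtime; it is not how pmap is implemented, and Python threads will not reproduce the speedups measured above for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, TypeVar

A = TypeVar("A")
B = TypeVar("B")

PMAP_PAR_THRESHOLD = 64  # threshold value taken from the text above


def runs_parallel(n: int) -> bool:
    """True when a list of length n is long enough to be parallelised."""
    return n >= PMAP_PAR_THRESHOLD


def pmap(f: Callable[[A], B], items: Iterable[A],
         workers: Optional[int] = None) -> List[B]:
    """Order-preserving map that only uses a thread pool for long inputs."""
    xs = list(items)
    if not runs_parallel(len(xs)):
        # Short list: plain sequential map, no pool setup cost.
        return [f(x) for x in xs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(f, xs))


if __name__ == "__main__":
    print(pmap(lambda x: x * 2, [1, 2, 3]))        # sequential path
    print(len(pmap(lambda x: x * x, range(1024))))  # pooled path
```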

Pending

Throughput numbers for send/recv at varying payload sizes
Scaling studies for spawn/scope across core counts (1/2/4/8 sweep)
Longevity results (RSS growth, thread-count stability)

4. IoT WASM Artifact Sizes

Synoema compiles IoT automation rules to WebAssembly (sno wasm rule.sno). The resulting .wasm files are the artifacts deployed to microcontrollers and edge devices via wasm3 or wasmtime.

Measured artifact sizes

Metric	Value	Context
Mean	82.1 B	Across 15 real vertical IoT rules
Max	107 B	Most complex rule in the set
Min	71 B	Simplest threshold rule

These measurements cover 15 rules across the Home, Industrial, and Wearable verticals. Each rule compiles from a single Synoema source file with no external dependencies.

Wave 2 results (30 rules)

Wave 2 expanded to 3 additional verticals (Automotive, Agriculture, Healthcare) for 30 rules total.

Metric	Value
Compile pass (sno check)	30 / 30
WASM compile pass (sno wasm)	29 / 30 (96.7%)
Mean artifact size	200 B

The one failing rule uses list pattern matching (Nil/Cons) not yet registered in WASM ctor_tags — a known limitation, not a language design issue.

Why size matters

Microcontrollers have flash storage measured in kilobytes, not megabytes. A 107 B rule takes a negligible share of that flash, fits in the instruction cache of MCUs that have one, and can be deployed over the air in a single UDP packet.

Run it yourself

# Compile one IoT rule to WASM
sno wasm lang/examples/iot/rules/rule_fan_on_temp.sno

# Check artifact size
ls -la *.wasm
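
To summarise sizes across many compiled rules rather than reading ls output by eye, a short script like the one below may help. It is a sketch under the assumption that the .wasm files sit together in one directory; the function name wasm_size_stats is made up for this example.

```python
from pathlib import Path
from statistics import mean
from typing import Dict, Union


def wasm_size_stats(directory: Union[str, Path] = ".") -> Dict[str, float]:
    """Return count, min, mean, and max size in bytes of *.wasm files in `directory`."""
    sizes = [p.stat().st_size for p in sorted(Path(directory).glob("*.wasm"))]
    if not sizes:
        raise ValueError(f"no .wasm files found in {directory}")
    return {
        "count": len(sizes),
        "min": min(sizes),
        "mean": round(mean(sizes), 1),
        "max": max(sizes),
    }


if __name__ == "__main__":
    try:
        print(wasm_size_stats("."))
    except ValueError as err:
        print(err)
```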

5. LLM Generation Quality (Phase C & D)

Phase C measures how well LLMs generate correct Synoema code from natural-language prompts. Phase D measures how compact references and multi-pass error correction affect generation quality for small models (0.5B–7B).

Phase C models (3 tiers)

Tier	Models
Frontier	GPT-4o, Gemini 2.5 Pro, Qwen3 Max
Mid	GPT-4o-mini, DeepSeek V3, Qwen3 Coder, Llama 4 Maverick
Weak	Qwen3.5 9B, LFM 1.2B (free), Reka Edge 7B

Phase D configurations

Config	Reference	Tokens	Multi-pass	Purpose
baseline	docs/llm/synoema.md	~1800	No	Control
compact	docs/llm/synoema-compact.md	~900	No	Compact reference impact
multipass	docs/llm/synoema-compact.md	~900	Yes (2 retries)	Error feedback impact

Multi-pass uses temperature decay (0.7 → 0.4 → 0.2) with llm_hint error feedback from the compiler.
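
The retry scheme can be pictured as the loop below: one initial attempt plus two retries, each at a lower temperature, with the compiler's hint appended to the prompt after a failure. This is a minimal sketch; the callables generate and check, and the way the hint is spliced into the prompt, are assumptions made for illustration and are not taken from the benchmark runner.

```python
from typing import Callable, Optional, Tuple

# Schedule from the text: first attempt, then 2 retries.
TEMPERATURES = (0.7, 0.4, 0.2)


def generate_multipass(
    prompt: str,
    generate: Callable[[str, float], str],    # hypothetical: (prompt, temperature) -> code
    check: Callable[[str], Optional[str]],    # hypothetical: code -> None if OK, else a hint
) -> Tuple[str, int, bool]:
    """Return (last code, attempts used, whether the last code passed the check)."""
    current_prompt = prompt
    code = ""
    for attempt, temperature in enumerate(TEMPERATURES, start=1):
        code = generate(current_prompt, temperature)
        hint = check(code)
        if hint is None:
            return code, attempt, True
        # Feed the failed attempt and the compiler hint back to the model.
        current_prompt = (
            f"{prompt}\n\nPrevious attempt:\n{code}\n\nCompiler hint:\n{hint}"
        )
    return code, len(TEMPERATURES), False
```

Lowering the temperature on each retry makes later attempts more deterministic, so the model is nudged to apply the hint rather than explore a new solution.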

Fine-tuned models (local, no API key)

Model	Size	run_pass	Best for
synoema-coder-3b-v3	1.8 GB	70.2%	Daily use, low memory
synoema-coder-7b-v1	4.4 GB	71.2%	Higher accuracy
synoema-iot-lite-1.5b-v2	—	97.1%*	IoT rules only

* compile_pass on IoT rule test set

Async throughput

Benchmarks for the async event loop reactor (timer wheel, file IO pool, TCP via mio) are still being measured. So far, the reactor handles 1,000 concurrent 5 ms sleeps in ~5 ms using a timer wheel on a single OS thread.
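
The shape of that workload is easy to reproduce in another runtime. The sketch below uses Python's asyncio purely as an analogy: it shows why many concurrent timers on one thread finish in roughly one sleep interval instead of the sum of all intervals. It says nothing about Synoema's reactor itself, and absolute timings will differ.

```python
import asyncio
import time


async def concurrent_sleeps(n: int = 1000, delay_s: float = 0.005) -> float:
    """Start n sleeps at once on one event loop; return total elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(asyncio.sleep(delay_s) for _ in range(n)))
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = asyncio.run(concurrent_sleeps())
    sequential = 1000 * 0.005
    print(f"concurrent: {elapsed * 1000:.1f} ms, sequential would be {sequential:.1f} s")
```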

6. Run the Benchmark Suite

# Token efficiency only (fastest, no runtime deps)
cd benchmarks
cargo run --manifest-path runner/Cargo.toml -- run --phases token

# Token + runtime (no API key needed)
cargo run --manifest-path runner/Cargo.toml -- run --phases token,runtime

# Full suite including LLM code generation (OpenRouter)
cargo run --manifest-path runner/Cargo.toml -- run --all --openrouter-key YOUR_KEY

# Local ollama (no API key needed)
cargo run --manifest-path runner/Cargo.toml -- run --all --ollama

Results are saved to benchmarks/results/<date>_run_<NNN>/: summary.txt, details.txt, raw.json.

Full methodology: docs/benchmarks.md