Synoema

Benchmarks

Performance data for token efficiency, JIT compilation, parallel speedup, and IoT artifact sizes

1. Token Efficiency

Synoema is designed so that LLMs generate correct programs with the fewest possible tokens. Fewer tokens per program means lower API cost, faster generation, and fewer opportunities for the model to drift off course. Token counts use cl100k_base (tiktoken, exact — not estimated). SPDX headers are stripped for a fair comparison.

The benchmark suite covers 16 tasks across 5 languages (Phase A). Tasks include: factorial, fibonacci, quicksort, filter_map, binary_search, string_ops, json_build, error_handling, pattern_match, type_definition, and more.

Benchmark suite | Tasks | Method
Phase A: Token Efficiency | 16 | cl100k_base BPE count, 5 languages
Phase B: Runtime Performance | 12 | Median of 5 runs, 3 warm-ups discarded
Phase C: LLM Generation | 9 | 5 repeats, temperature 0.2
Phase D: Model Size Reduction | 30 | Qwen2.5-Coder 0.5B–7B, 3 configs

Token counting methodology

  • Counting: tiktoken cl100k_base (exact, not approximate)
  • SPDX license headers stripped for all languages before counting
  • Same algorithm implemented identically in each language
  • Synoema uses no extra boilerplate: no import noise, no class wrappers, no type annotations required
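
The counting steps above can be sketched in a few lines of Python. This is a minimal illustration and not the benchmark runner's actual code: the helper names strip_spdx and count_tokens are hypothetical, and the sketch assumes SPDX tags sit on their own comment lines. The tokenizer is passed in as a callable so the example runs without third-party packages.

```python
from typing import Callable, Sequence


def strip_spdx(source: str) -> str:
    """Drop every line carrying an SPDX tag (assumed to be a standalone comment line)."""
    kept = [line for line in source.splitlines() if "SPDX-" not in line]
    return "\n".join(kept).strip() + "\n"


def count_tokens(source: str, encode: Callable[[str], Sequence]) -> int:
    """Count tokens in the source after SPDX headers have been removed."""
    return len(encode(strip_spdx(source)))


# With tiktoken installed, the cl100k_base encoder would be plugged in like this:
#   import tiktoken
#   encode = tiktoken.get_encoding("cl100k_base").encode
# A whitespace splitter stands in here so the sketch has no dependencies;
# its counts are NOT comparable to real BPE token counts.
sample = "# SPDX-License-Identifier: MIT\ndef double(x):\n    return x * 2\n"
print(count_tokens(sample, str.split))
```

Passing the encoder as a parameter keeps the header stripping identical for every language, so only the program text itself affects the count.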

Run it yourself

cd benchmarks
cargo run --manifest-path runner/Cargo.toml -- run --phases token

Results are saved to benchmarks/results/<date>_run_<NNN>/summary.txt.

2. JIT Performance

Synoema compiles to native code via the Cranelift JIT, the code generator that also powers Wasmtime and rustc's optional Cranelift backend (intended for debug builds). The JIT is invoked with sno jit file.sno.

Runtime benchmark setup (Phase B)

  • 12 tasks: factorial, fibonacci, quicksort, mergesort, collatz, gcd, fizzbuzz, filter_map, binary_search, matrix_mult, string_ops, and more
  • Measured via subprocess timing (std::time::Instant)
  • 3 warm-up runs discarded; 5 runs measured; median reported
  • C++ compiled with -O2 -std=c++17
  • Synoema uses a pre-built release binary (not debug build)
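
The measurement protocol above (3 warm-ups discarded, 5 measured runs, median reported) can be sketched as follows. The real runner is a Rust program that times subprocesses with std::time::Instant; this Python version only illustrates the protocol, and the names time_command and median_runtime are hypothetical.

```python
import statistics
import subprocess
import time
from typing import Callable, Sequence


def time_command(cmd: Sequence[str]) -> float:
    """Run one subprocess to completion and return its wall-clock time in ms."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return (time.perf_counter() - start) * 1000.0


def median_runtime(run_once: Callable[[], float], warmups: int = 3, runs: int = 5) -> float:
    """Discard `warmups` runs, then return the median of `runs` measured runs."""
    for _ in range(warmups):
        run_once()  # result intentionally ignored
    samples = [run_once() for _ in range(runs)]
    return statistics.median(samples)


# Example call (requires the sno binary on PATH):
#   median_runtime(lambda: time_command(["sno", "jit", "file.sno"]))
```

Because the whole subprocess is timed, a measurement taken this way includes process startup and, for Synoema, JIT compilation.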

Runtime benchmark numbers require a full Phase B run against your hardware. See docs/benchmarks.md for instructions.

Run it yourself

# Token + runtime (no API key needed)
cargo run --manifest-path runner/Cargo.toml -- run --phases token,runtime

# Verbose: see exact commands, timing per run
cargo run --manifest-path runner/Cargo.toml -- run --phases runtime -v

JIT vs interpreter

For single-threaded numeric workloads, the JIT backend typically delivers significantly lower latency than the tree-walking interpreter. The interpreter is the default for sno run; the JIT is invoked explicitly with sno jit. Both share the same type system and semantics.

Caveats

  • Runtime benchmarks include JIT compilation time for Synoema (not just execution)
  • TypeScript via tsx includes startup overhead (not representative of production TS)
  • C++ compiled with -O2, not -O3 or -Ofast

3. Parallel Speedup (pmap)

pmap is the parallel map combinator in Synoema. It distributes work across available CPU cores using real OS threads. Below the threshold of 64 elements (PMAP_PAR_THRESHOLD), pmap falls back to sequential execution to avoid closure-cloning overhead on short lists.
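
The threshold rule can be illustrated with a short Python sketch. It is not Synoema's implementation; it only mirrors the dispatch logic described above, with a thread pool standing in for the runtime's OS threads. In standard CPython, threads do not speed up CPU-bound pure-Python work, so the sketch shows the control flow and not the speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")

PMAP_PAR_THRESHOLD = 64  # threshold value quoted in the text


def pmap(f: Callable[[A], B], items: Iterable[A]) -> List[B]:
    """Map f over items, using worker threads only for lists at or above the threshold."""
    xs = list(items)
    if len(xs) < PMAP_PAR_THRESHOLD:
        # Short list: plain sequential map, no thread or closure hand-off cost.
        return [f(x) for x in xs]
    with ThreadPoolExecutor() as pool:
        # pool.map yields results in input order, like a sequential map.
        return list(pool.map(f, xs))


print(pmap(lambda x: x * 2, [1, 2, 3]))          # sequential path
print(sum(pmap(lambda x: x * x, range(1000))))   # threaded path
```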

Measured speedup — PM-4 stress test

Measured via cargo test -p synoema-codegen --test stress pm4 and cargo test -p synoema-eval --test stress pm4 on a local Apple Silicon workstation (10 logical cores).

Backend | List size | Per-element work | Cores | Sequential map | Parallel pmap | Speedup
JIT | 1024 | 400-iter heavy loop | 10 | 120 ms | 60 ms | 2.04×
Interpreter | 128 | 20-iter heavy loop | 10 | 2055 ms | 582 ms | 3.53×
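
The speedup column is the sequential time divided by the parallel time. The sketch below recomputes it from the timings in the table and adds parallel efficiency (speedup divided by core count), a derived figure that the benchmark itself does not report. The function names are illustrative. The displayed JIT timings give 2.00×, so the reported 2.04× presumably comes from unrounded measurements.

```python
def speedup(sequential_ms: float, parallel_ms: float) -> float:
    """Ratio of sequential to parallel wall-clock time."""
    return sequential_ms / parallel_ms


def efficiency(sequential_ms: float, parallel_ms: float, cores: int) -> float:
    """Fraction of ideal linear scaling achieved on `cores` cores."""
    return speedup(sequential_ms, parallel_ms) / cores


# (sequential ms, parallel ms, cores) as read from the table above
rows = {"JIT": (120.0, 60.0, 10), "Interpreter": (2055.0, 582.0, 10)}
for name, (seq, par, cores) in rows.items():
    print(f"{name}: {speedup(seq, par):.2f}x, {efficiency(seq, par, cores):.0%} of ideal")
```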

Honest caveats

  • Sample size: single runs during PM-4 test execution; timings vary ±20% under concurrent cargo-test harness load
  • Assertion in CI is loose (ratio ≥ 1.0× on ≥4 cores) to avoid flakes; the numbers above come from clean runs
  • Lists below PMAP_PAR_THRESHOLD = 64 run sequentially — pmap of short lists is slower than map by a small constant (closure-cloning overhead)
  • Nested pmap falls back to sequential on the inner call (PM-8) — no dedicated benchmark yet

Syntax

-- pmap: parallel version of map
results = pmap (\x -> heavy_computation x) large_list

-- Falls back to sequential for short lists
results_short = pmap (\x -> x * 2) [1 2 3]  -- sequential, list < 64

Pending

  • Throughput numbers for send/recv at varying payload sizes
  • Scaling studies for spawn/scope across core counts (1/2/4/8 sweep)
  • Longevity results (RSS growth, thread-count stability)

4. IoT WASM Artifact Sizes

Synoema compiles IoT automation rules to WebAssembly (sno wasm rule.sno). The resulting .wasm files are the artifacts deployed to microcontrollers and edge devices via wasm3 or wasmtime.

Measured artifact sizes

Metric | Value | Context
Mean | 82.1 B | Across 15 real vertical IoT rules
Max | 107 B | Most complex rule in the set
Min | 71 B | Simplest threshold rule

These measurements come from the 15 rules across Home / Industrial / Wearable verticals. All rules compile from a single Synoema source file with no external dependencies.

Wave 2 results (30 rules)

Wave 2 expanded to 3 additional verticals (Automotive, Agriculture, Healthcare) for 30 rules total.

Metric | Value
Compile pass (sno check) | 30 / 30
WASM compile pass (sno wasm) | 29 / 30 (96.7%)
Mean artifact size | 200 B

The one failing rule uses list pattern matching (Nil/Cons) not yet registered in WASM ctor_tags — a known limitation, not a language design issue.

Why size matters

Microcontrollers have flash storage measured in kilobytes, not megabytes. A 107 B rule takes up a negligible share of that budget and is small enough to deploy over the air in a single UDP packet.
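
The single-packet claim can be checked with simple arithmetic. The sketch below assumes a standard 1500-byte Ethernet MTU, a 20-byte IPv4 header without options, and an 8-byte UDP header. The framing_overhead parameter is a hypothetical placeholder for whatever header a deployment protocol adds; the source does not describe one.

```python
def max_udp_payload(mtu: int = 1500, ip_header: int = 20, udp_header: int = 8) -> int:
    """Largest UDP payload that avoids IP fragmentation for the given MTU."""
    return mtu - ip_header - udp_header


def fits_in_single_datagram(artifact_bytes: int, framing_overhead: int = 0, **link) -> bool:
    """True if the artifact plus any protocol framing fits in one unfragmented datagram."""
    return artifact_bytes + framing_overhead <= max_udp_payload(**link)


print(max_udp_payload())                      # IPv4 over Ethernet
print(fits_in_single_datagram(107))           # largest rule in the first set
print(fits_in_single_datagram(200, mtu=1280, ip_header=40))  # Wave 2 mean, IPv6 minimum MTU
```

Even on the 1280-byte IPv6 minimum MTU, the payload budget is several times larger than the sizes reported above.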

Run it yourself

# Compile one IoT rule to WASM
sno wasm lang/examples/iot/rules/rule_fan_on_temp.sno

# Check artifact size
ls -la *.wasm
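
To summarize a whole directory of compiled rules instead of reading ls output by hand, a small script along these lines could compute the same min, max, and mean figures. The function name wasm_size_stats is hypothetical, and the demo writes placeholder files of known sizes (not valid WASM modules) so that it runs anywhere.

```python
import statistics
import tempfile
from pathlib import Path


def wasm_size_stats(directory: str) -> dict:
    """Return count, min, max, and mean size in bytes of the .wasm files in a directory."""
    sizes = [p.stat().st_size for p in sorted(Path(directory).glob("*.wasm"))]
    if not sizes:
        raise ValueError(f"no .wasm files found in {directory}")
    return {
        "count": len(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "mean": statistics.mean(sizes),
    }


# Demo with placeholder files; point the function at a real output directory instead.
with tempfile.TemporaryDirectory() as tmp:
    for name, size in (("a.wasm", 71), ("b.wasm", 107)):
        (Path(tmp) / name).write_bytes(b"\0" * size)
    demo = wasm_size_stats(tmp)
print(demo)
```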

5. LLM Generation Quality (Phase C & D)

Phase C measures how well LLMs generate correct Synoema code from natural-language prompts. Phase D measures how compact references and multi-pass error correction affect generation quality for small models (0.5B–7B).

Phase C models (3 tiers)

Tier | Models
Frontier | GPT-4o, Gemini 2.5 Pro, Qwen3 Max
Mid | GPT-4o-mini, DeepSeek V3, Qwen3 Coder, Llama 4 Maverick
Weak | Qwen3.5 9B, LFM 1.2B (free), Reka Edge 7B

Phase D configurations

Config | Reference | Tokens | Multi-pass | Purpose
baseline | docs/llm/synoema.md | ~1800 | No | Control
compact | docs/llm/synoema-compact.md | ~900 | No | Compact reference impact
multipass | docs/llm/synoema-compact.md | ~900 | Yes (2 retries) | Error feedback impact

Multi-pass uses temperature decay (0.7 → 0.4 → 0.2) with llm_hint error feedback from the compiler.
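
A minimal sketch of such a multi-pass loop, assuming one initial attempt plus two retries at the temperatures listed above. The generate and check callables are hypothetical stand-ins for the model call and for the compiler's llm_hint feedback; neither reflects the real benchmark runner's interfaces.

```python
from typing import Callable, Optional, Tuple

TEMPERATURES = (0.7, 0.4, 0.2)  # first attempt, then two retries


def generate_with_feedback(
    prompt: str,
    generate: Callable[[str, float, Optional[str]], str],
    check: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], int]:
    """Try each temperature in turn, feeding the previous error hint back to the model.

    `check` returns None when the code compiles, otherwise an error hint string.
    Returns (code, attempts used), or (None, attempts used) if every pass failed.
    """
    hint: Optional[str] = None
    for attempt, temperature in enumerate(TEMPERATURES, start=1):
        code = generate(prompt, temperature, hint)
        hint = check(code)
        if hint is None:
            return code, attempt
    return None, len(TEMPERATURES)
```

Lowering the temperature on each retry makes later attempts more conservative, while the hint tells the model what to fix.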

Fine-tuned models (local, no API key)

Model | Size | run_pass | Best for
synoema-coder-3b-v3 | 1.8 GB | 70.2% | Daily use, low memory
synoema-coder-7b-v1 | 4.4 GB | 71.2% | Higher accuracy
synoema-iot-lite-1.5b-v2 | (not listed) | 97.1%* | IoT rules only

* compile_pass on IoT rule test set

Async throughput

Benchmarks for the async event-loop reactor (timer wheel, file IO pool, TCP via mio) are still being measured. The reactor already handles 1k concurrent 5 ms sleeps in ~5 ms using a timer wheel on a single OS thread.
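
The idea behind that figure (many timers multiplexed on one thread finish in roughly the time of a single timer) can be reproduced as an analogy with Python's asyncio. This is not the Synoema reactor: asyncio schedules timers with a heap, not a timer wheel, and its absolute timings will differ.

```python
import asyncio
import time


async def run_sleeps(count: int = 1000, delay_s: float = 0.005) -> float:
    """Start `count` concurrent sleeps on one event loop and return the total elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(asyncio.sleep(delay_s) for _ in range(count)))
    return time.perf_counter() - start


elapsed = asyncio.run(run_sleeps())
# Run back to back, 1000 sleeps of 5 ms would take about 5 s.
print(f"1000 concurrent 5 ms sleeps took {elapsed * 1000:.1f} ms on one thread")
```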

6. Run the Benchmark Suite

# Token efficiency only (fastest, no runtime deps)
cd benchmarks
cargo run --manifest-path runner/Cargo.toml -- run --phases token

# Token + runtime (no API key needed)
cargo run --manifest-path runner/Cargo.toml -- run --phases token,runtime

# Full suite including LLM code generation (OpenRouter)
cargo run --manifest-path runner/Cargo.toml -- run --all --openrouter-key YOUR_KEY

# Local ollama (no API key needed)
cargo run --manifest-path runner/Cargo.toml -- run --all --ollama

Results are saved to benchmarks/results/<date>_run_<NNN>/: summary.txt, details.txt, raw.json.

Full methodology: docs/benchmarks.md