The Scientific Method Behind Synoema

Research April 13, 2026 · ~14 min read

LLM research has a reproducibility problem. Results depend on exact prompts, model versions, temperature settings, and evaluation criteria that are often reported imprecisely. One team reports 73% accuracy; another team with a slightly different setup reports 41%. Neither is lying — they're measuring different things. But neither result is very useful either.

We've tried to do this differently. This article describes the methodology behind Synoema's empirical claims: how we state hypotheses, what we measure, how we handle statistical uncertainty, and how to reproduce our results.

The Hypothesis Framework

Every empirical claim in Synoema research is stated as a falsifiable hypothesis before the experiment runs. We don't look at data and find patterns to explain post-hoc; we state what we expect to find, design an experiment to test it, run the experiment, and report whatever comes out.

Phase D (baseline LLM evaluation) tested five hypotheses:

ID	Hypothesis	Verdict
H1	Compact reference (~900 tokens) performs as well as or better than baseline (~1800 tokens)	DISPROVED
H2	Larger models in the same family produce better Synoema code than smaller ones	CONFIRMED
H3	Multipass self-correction improves run rate	MODEL-SIZE-DEPENDENT
H4	Feature difficulty follows: fundamentals < data_structures < applications < abstractions	CONFIRMED
H5	Architecture and training objective produce meaningfully different results at similar parameter counts	CONFIRMED

H1 being disproved is important to highlight. We expected the compact reference to help. It significantly hurt. We're reporting this because disproved hypotheses are as scientifically valuable as confirmed ones — they tell you something about the space you're exploring.

Phase E (fine-tuning evaluation) adds seven more hypotheses:

ID	Hypothesis	Status
H6	Fine-tuning raises 7B run rate from 41% to ≥75%	Testing
H7	Fine-tuned 1.5B exceeds baseline 7B (41%) run rate	Testing
H8	Improvement is largest for fundamentals (most training examples)	Testing
H9	Applications category remains hardest post fine-tuning	Testing
H10	Token output length decreases after fine-tuning (model learns concise Synoema)	Testing
H11	3B and 7B show diminishing returns vs 1.5B	Testing
H12	Fine-tuned models retain >80% accuracy on non-Synoema coding tasks (no catastrophic forgetting)	Testing

H7 is the most interesting from a practical standpoint: if a fine-tuned 1.5B model can match a baseline 7B model, that's a meaningful result for edge deployment. H12 guards against a common failure mode in domain-specific fine-tuning: the model learns the new domain but forgets general coding ability.

Research Questions

Behind the hypotheses are eight research questions that motivate the overall program:

ID	Question	Type
RQ1	Does domain-specific fine-tuning significantly improve run rate for small LLMs on a novel PL?	Primary
RQ2	How does the improvement scale with model size (1.5B → 3B → 7B)?	Primary
RQ3	Can a fine-tuned 1.5B match or exceed the baseline 7B run rate?	Primary
RQ4	Which syntactic/semantic categories benefit most from fine-tuning?	Secondary
RQ5	Does fine-tuning reduce token verbosity or increase it?	Secondary
RQ6	What task categories remain unsolved after fine-tuning?	Secondary (roadmap)
RQ7	Is the improvement robust to prompt phrasing variations?	Secondary
RQ8	Does model size correlate with catastrophic forgetting?	Secondary

RQ3 and RQ7 are particularly important for practical deployment: RQ3 determines whether small models are viable with fine-tuning; RQ7 determines whether the model is robust enough to work in real-world conditions where prompts vary.

Measurement: What We Count and How

The primary metric is run rate: the percentage of generated programs that, when executed, produce the correct output as specified in the task. A program that parses but produces wrong output is a failure. A program that type-checks but crashes at runtime is a failure. Only programs that produce exactly the expected output (within a timeout) count as successes.

We also measure syntax rate separately: the percentage of programs that parse without errors. The gap between syntax rate and run rate is informative — a large gap suggests the model has learned surface syntax but not semantics.

Additional metrics collected per model and configuration:

type%: percentage that pass Hindley-Milner type checking
tokens_out: mean BPE tokens in the generated response (verbosity indicator)
pass@1: run rate with greedy decoding (temperature=0)
pass@3: at least 1 correct in 3 samples at temperature=0.7

Tasks are standardized: 9 tasks for the standard benchmark, 50 tasks for the extended corpus benchmark. Both are fixed — we don't change the task set between runs. Models are compared on exactly the same tasks with exactly the same evaluation criteria.

Statistical Methodology

For proportion comparisons (is model A's run rate significantly different from model B's?), we use the 2-proportion z-test with Bonferroni correction for multiple comparisons.

With n=5 repeats per task on 9 tasks, we have 45 binary observations per model-config. This gives us the following statistical properties:

Significance level: α = 0.05
Bonferroni-corrected threshold: α' = 0.05/7 = 0.007 (for 7 model comparisons)
Minimum detectable effect at 80% power: ±15 percentage points

This means we can detect differences of 15pp or more with 80% power, given our sample size. Smaller differences are below our detection threshold and should not be interpreted as significant. We flag this explicitly when reporting results near the threshold.

Effect sizes are reported using Cohen's h for proportion comparisons:

h = 2 · arcsin(√p1) − 2 · arcsin(√p2)

Interpretation: |h| < 0.2 is small, 0.2–0.5 is medium, >0.5 is large. This allows comparing the practical significance of results across different base rates — a 20pp improvement from 10% to 30% is a larger effect size than a 20pp improvement from 60% to 80%.

Each metric is reported as: mean ± 95% CI (Wilson interval for proportions). Example:

run% = 0.74 ± 0.06  [n=45, p<0.001 vs baseline 0.41, h=0.68 (large)]

Experimental Controls

Controlling for confounds is where LLM research most often goes wrong. We control for:

Variable	Control
Prompt template	Same baseline (1800-token reference) for all models
Temperature	0 for pass@1, 0.7 for pass@3
Max tokens	512 for all models — key finding: truncation at exactly 512 was an artifact in run_010 for Qwen2.5-Coder-7B that we caught and corrected
Task set	Same 9 standard tasks for all model size comparisons
Corpus	Same 5,037 examples for all fine-tuning runs
Hardware	Fine-tuning: AMD RX 7900 GRE 16GB, ROCm 6.4; inference: Ollama + OpenRouter

The truncation artifact deserves explanation. In run_010, Qwen2.5-Coder-7B on baseline showed average output tokens of 512 — exactly the max_tokens limit. This indicated the model was being truncated before completing its output, producing 0% syntax and run rate. We identified this as an artifact, not a true result, and re-ran with 3 repeats in the parallel benchmark (par_7b) with properly configured token limits. The true result (56% syntax, 41% run) is what we report.

This kind of artifact-catching requires careful attention to the raw data. We log average output tokens for every run precisely because it catches this failure mode: if avg_tokens is at or near max_tokens, the model is being truncated.

The Corpus: Validated by the Compiler

Our 5,037-example training corpus is unusual in one important respect: every example was validated by the Synoema compiler itself, not by human review.

The generation pipeline:

Generate a task description and expected output
Prompt Gemma 3 12B (via OpenRouter) to write the Synoema program
Run synoema run on the generated program and capture output
Compare output to expected — if they match, add to corpus; if not, discard

This creates a compiler-validated corpus: by definition, every training example is a correct Synoema program that produces the expected output. There are no label errors, no ambiguous cases, no programs that "look correct" but might not be.

The 99.9% validation pass rate (5,037 of 5,041 generated) reflects both the quality of the generating model and the clarity of the task specifications. The 4 rejected programs had off-by-one errors in output formatting or edge case handling — caught by the compiler comparison.

This approach is the core of "constitutional AI by compiler": the compiler is the ground truth, not human judgment. It scales: generating 50,000 examples requires the same pipeline, just more API calls.

Reproducibility

All results referenced in our articles can be reproduced from the repository:

Phase D benchmark scripts: benchmarks/scripts/run_phase_d.py, run_h2h5.py, run_openrouter.py
Raw results: benchmarks/results/2026-04-07_run_001 through run_010 and parallel runs
Corpus sample: benchmarks/tasks/corpus_sample_50.jsonl
Fine-tuning scripts: research/finetune/
Hypothesis test results: docs/research/hypothesis-test-results.md

To reproduce the Phase D standard-task benchmark:

cd lang/
cargo run -p synoema-repl -- run examples/website/website.sno  # verify language works
cd ../benchmarks/
python3 scripts/run_phase_d.py \
  --model qwen2.5-coder:7b \
  --config baseline \
  --repeats 5

The benchmark requires Ollama running locally with the target model pulled, or an OpenRouter API key for cloud models. Results are written to benchmarks/results/ in JSON format with full metadata.

What We Don't Claim

Scientific rigor requires being explicit about what the evidence doesn't support:

We don't claim Synoema is better than Python for all code generation tasks. It isn't. Python achieves better token efficiency on string operations, and Python's ecosystem makes it the right choice for data science, web development, and most general-purpose applications. Synoema's advantages are specific: recursive algorithms, pattern matching, typed domain modeling, constrained decoding.

We don't claim 41% is a good result in absolute terms. It isn't. 59% failure on a task set is a high failure rate. We report it because it's the honest starting point, and because understanding the failure modes is the first step to improving them.

We don't claim the fine-tuning results will confirm H6 or H7. We expect them to, based on prior work in domain-specific fine-tuning. But the benchmarks will determine the truth, and we commit to reporting whatever they show.

We don't claim generalizability to other novel programming languages without similar evidence. The results are specific to Synoema's design (BPE-aligned syntax, HM types, small grammar). A language with different design choices might show different learning curves.

Future Methodology

As the project evolves, so does the methodology. Planned additions:

Prompt robustness testing (RQ7): Run the same tasks with 3 prompt variants (standard, verbose, terse) to measure sensitivity. Models that only work with exactly the right prompt wording are less useful than models that generalize across phrasing variations.

Catastrophic forgetting test (H12): Run 10 Python coding tasks through the fine-tuned models to verify that Synoema-specific training didn't erase general coding ability. Success criterion: ≥80% correct on Python tasks.

Scaling law analysis: With results across 1.5B, 3B, and 7B, we can fit a scaling law: run rate as a function of model size, for both baseline and fine-tuned conditions. If the law holds, we can predict what 13B or 34B models would achieve without running them.

The goal is to build a rigorous empirical foundation for a research question that matters: can you design a programming language to be significantly more learnable by AI than existing languages? Our preliminary evidence says yes, within a specific niche. The full answer requires more data, more models, and more honest reporting of results — whatever they show.

Results