Intermediate Results: v6 Fine-Tuning
These are intermediate results — not a final announcement. We publish them now because honest intermediate data is more useful than polished end results that arrive six months later. The numbers are real, the regression is real, and the investigation is ongoing.
What We Did in v6
v6 represents the first major format change in our fine-tuning pipeline: we switched from Alpaca format to ChatML format, increased corpus size from 5,037 to 5,907 validated examples, and trained on all three model sizes simultaneously (1.5B, 3B, 7B).
The corpus expansion focused on semantic depth: error chains, data pipelines, ADT state machines, and contract-based tests. Generation was done via OpenRouter (Gemma 3 12B) with compiler validation — every example must execute with returncode=0 in under 10 seconds.
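The validation gate can be sketched as a small harness. This is an illustrative sketch, not the actual pipeline code: the real Synoema compiler invocation is not shown here, so the function takes the compiler command as a parameter (in the usage below, the Python interpreter stands in for it).

```python
import os
import subprocess
import sys
import tempfile

def validate_example(source: str, compiler_cmd: list[str], timeout_s: float = 10.0) -> bool:
    """Return True iff the example executes with returncode == 0 within the timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".syn", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        result = subprocess.run(
            compiler_cmd + [path],
            capture_output=True,
            timeout=timeout_s,  # reject examples that run longer than the budget
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```

With `[sys.executable]` as a stand-in for the compiler, `validate_example("print(1)", [sys.executable])` passes and a program that exits non-zero is rejected.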
v6 Results: 7B Model
The 7B model (Qwen2.5-Coder-7B-Instruct) is the first fully evaluated v6 model. Results from the Phase D benchmark (9 standard tasks, 5 repeats, pass@1 at temperature=0):
| Metric | Baseline 7B | Fine-tuned 7B v6 | Delta |
|---|---|---|---|
| syntax_pass | 56% | 100% | +44pp |
| run_pass | 41% | 90.5% | +49.5pp |
| constructs_pass | — (v5 fine-tune: 52.7%) | 44.6% | −8.1pp vs v5 |
The run rate improvement is the main result: +49.5 percentage points. The model went from generating syntactically broken or runtime-failing code 59% of the time to generating runnable code 90.5% of the time.
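The aggregates above can be computed as below. This is one common convention for pass@1 (mean per-task pass rate over repeats); the exact weighting used by the Phase D harness may differ.

```python
from statistics import mean

def pass_at_1(results: dict[str, list[bool]]) -> float:
    """pass@1 in percent: per-task pass rate over repeats, averaged across tasks.

    `results` maps a task id to one bool per repeat. At temperature=0 the
    repeats are near-deterministic, so they mostly agree.
    """
    return 100.0 * mean(mean(runs) for runs in results.values())

def delta_pp(finetuned: float, baseline: float) -> float:
    """Difference in percentage points, as reported in the table."""
    return finetuned - baseline
```

For example, `delta_pp(90.5, 41.0)` gives the +49.5pp run_pass improvement reported above.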
The Regression: Constructs
There is a problem. The constructs pass rate — which measures whether the model uses the specific Synoema language constructs asked for in the task — dropped from 52.7% (v5) to 44.6% (v6).
A program that runs but avoids the constructs it was asked to use is not the goal. If the task says "use |> pipe chains", we want pipe chains. If the task says "use and_then combinator", we want and_then, not manual pattern matching that happens to produce the same result.
The regression is real. We're not minimizing it.
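The distinction can be sketched in Python, since Synoema syntax is not reproduced on this page. Here `and_then` is a hypothetical Maybe-bind stand-in for the Synoema combinator; the two computations are functionally identical, but only the first uses the construct a task would name.

```python
from typing import Callable, Optional

def and_then(value: Optional[int], f: Callable[[int], Optional[int]]) -> Optional[int]:
    """Maybe-bind: apply f only when a value is present."""
    return None if value is None else f(value)

def half(n: int) -> Optional[int]:
    return n // 2 if n % 2 == 0 else None

# What the task asks for: combinator style.
idiomatic = and_then(and_then(8, half), half)

# What the model tends to produce: manual branching.
# Same result, but it avoids the construct the task named.
step = half(8)
manual = half(step) if step is not None else None
```

Both `idiomatic` and `manual` evaluate to 2; a run_pass check cannot tell them apart, which is exactly why constructs_pass is measured separately.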
Failure Analysis
From 40 semantic evaluation tasks, the most common construct failures were:
- Pipe operator `|>` — models solve data pipeline tasks with explicit let-bindings instead of pipe chains. Most common failure.
- `test` keyword — models generate the code body but omit the `test` declaration. Only 37 `test` examples in the v6 corpus (need 100+).
- Constructor naming — when asked for an `Empty`/`Some` ADT, models default to Haskell (`Nothing`) or Rust (`None`) naming. The correct name was in the task, but models ignore it.
- `bind_maybe` — 7B uses explicit pattern matching instead of the combinator. Functionally correct, not idiomatic.
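A minimal sketch of what a constructs check can look like. The actual Phase D checker is not shown here; the regex patterns and construct names below are illustrative simplifications.

```python
import re

# Hypothetical mapping from task-required constructs to detection patterns.
CONSTRUCT_PATTERNS = {
    "pipe": re.compile(r"\|>"),
    "test": re.compile(r"\btest\b"),
    "bind_maybe": re.compile(r"\bbind_maybe\b"),
}

def constructs_pass(code: str, required: list[str]) -> bool:
    """True iff every construct the task asked for appears in the generated code."""
    return all(CONSTRUCT_PATTERNS[name].search(code) for name in required)
```

So `constructs_pass("xs |> double |> total", ["pipe"])` passes, while an equivalent let-binding solution fails the check even if it runs.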
Why ChatML May Be the Cause
The format change from Alpaca to ChatML is the most significant variable between v5 and v6. Our current hypothesis (H14) is that ChatML improved the model's ability to generate runnable code — it "speaks" Synoema more fluently — but changed how it interprets construct-specific task instructions.
Three sub-hypotheses under investigation:
- H15: The regression is a corpus coverage artifact. The v6 corpus has proportionally fewer pipe chain examples relative to total size. More `|>` examples → constructs_pass recovers.
- H14b: ChatML's instruction format changes how the model weighs task-specific details (like "use `|>`") vs. its prior distribution. The model is more fluent but less instruction-following for construct selection.
- H16: The semantic eval benchmark is testing a different distribution than the training corpus — the tasks ask for constructs that are syntactically optional (the code runs either way), so the model learns to avoid them.
Status of 3B and 1.5B
At the time of writing:
- 3B model: Training completed. Full benchmark evaluation pending.
- 1.5B model: Evaluation in progress.
We will update this page when those results are available. Based on the semantic eval subset (40 tasks), the 3B v6 model shows run_pass 97.5% and constructs_pass 90.0% — both stronger than the 7B on this specific benchmark. We are running the full Phase D benchmark to confirm.
Is This a "Production" Model?
No. By our internal production criteria (RULES.md §7):
- run_pass ≥ baseline + 15pp: YES (90.5% vs 41% baseline = +49.5pp)
- syntax_pass ≥ 95%: YES (100%)
- constructs_pass ≥ baseline − 5pp: NO (regression is −8.1pp, exceeds the −5pp tolerance)
- catastrophic_forgetting ≥ 80%: pending
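The gate above can be expressed mechanically. The thresholds are copied from the list; the function name is illustrative, and the pending catastrophic_forgetting criterion is omitted since it has no value yet. The constructs baseline is taken as v5's 52.7%.

```python
def production_ready(run_pass: float, baseline_run_pass: float,
                     syntax_pass: float,
                     constructs_pass: float, baseline_constructs_pass: float):
    """All values in percent. Returns (verdict, list of failed criteria)."""
    checks = {
        "run_pass >= baseline + 15pp": run_pass >= baseline_run_pass + 15,
        "syntax_pass >= 95%": syntax_pass >= 95,
        "constructs_pass >= baseline - 5pp": constructs_pass >= baseline_constructs_pass - 5,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return len(failed) == 0, failed

# v6 7B numbers from this page:
ok, failed = production_ready(90.5, 41.0, 100.0, 44.6, 52.7)
```

Here `ok` is False and `failed` contains only the constructs criterion, matching the verdict below.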
The constructs regression prevents production classification. The 7B v6 model is a research artifact with a known weakness, not a production release.
What Happens Next
v7 corpus work is planned to address the regression:
- Add 100+ `test` declaration examples (currently 37)
- Add 50+ pipe chain examples with complex data (pairs, records, nested transforms)
- Add constructor naming discipline examples — following task-specified names
- Add `bind_maybe` combinator preference examples
- Consider a controlled A/B: Alpaca v7 vs ChatML v7 to isolate the format effect
We will also run baseline models in ChatML format before making v5 vs v6 comparisons — the format switch makes direct metric comparisons potentially invalid (apples vs oranges).
What the Numbers Mean
A run_pass rate of 90.5% means that if you ask the v6 7B model to write a Synoema program, it will produce working, runnable code 9 times out of 10. That is a massive improvement over the 41% baseline.
A constructs_pass rate of 44.6% means that in fewer than half of the evaluated tasks, the model used the specific language constructs the task asked for. For strict instruction-following (e.g., "write this using |> and and_then"), that is a problem.
For the use case of "generate working Synoema code given a description" — the model is very good now. For the use case of "generate idiomatic, construct-specific Synoema code" — the model needs more work.
Raw Data
All eval results are versioned in the repository under `research/finetune/eval/results/`. The v6 semantic evaluation (2026-04-13) is at:

- `eval/results/2026-04-13-semantic-v6/finetune_7b_v6.json`
- `eval/results/2026-04-13-semantic-v6/analysis.md`
All numbers on this page come from those files.