Why Build a New Programming Language in the Age of AI?

There is something strange happening in software development. Large language models can write Python, TypeScript, Java, and dozens of other languages with remarkable fluency. Ask GPT-4 or Claude to implement quicksort, and you'll get clean, idiomatic code in seconds. Ask for a REST API, a data pipeline, a parser — it delivers.

And yet when that code runs, it often doesn't work.

Not "doesn't work" in the sense of a subtle algorithm bug that even a human expert might miss. "Doesn't work" in the sense of a type error on line 3. A variable used before assignment. A function called with the wrong number of arguments. Errors that would be caught immediately by any competent reviewer, but that the model produced confidently, fluently, and at length.

This is the paradox we started with: AI that is clearly intelligent, in some sense, producing output that is clearly broken, in a very obvious sense. Why does this happen? And more importantly, what can be done about it?

The Wrong Diagnosis

The most common explanation is that AI models just aren't good enough yet. Give them more parameters, more data, more compute, and the problem will go away. This explanation has a certain appeal — it requires no action, just patience.

We don't think it's wrong exactly, but we think it's incomplete. The deeper issue is not the model's capability; it's the mismatch between how programming languages were designed and how LLMs generate code.

Programming languages were designed for humans. Python's readability is legendary — it was explicitly designed so that code reads like English. TypeScript adds gradual types to JavaScript, but in a way that a human can choose to ignore. Even Haskell, with its strong type system, was designed with the assumption that a human is reading the error messages and correcting the code interactively.

LLMs don't work that way. An LLM generates code token by token, left to right, without the ability to backtrack. It doesn't run the code; it doesn't see the error messages; it doesn't get a second pass. It commits to each token as it goes, guided only by its probability distribution over the vocabulary.

When a language is designed for humans, its structure is optimized for human readability and writability. When the same language is used for LLM generation, that structure may actively work against correctness. Keywords that span multiple tokens create opportunities for the model to get lost mid-expression. Implicit type coercions mean that a type error might not be visible in the output at all. The sheer surface area of Python's standard library means the model must navigate a vast space of possibilities with no structural constraint.

The Token-by-Token Problem

To understand why token alignment matters, consider how LLMs actually generate code. At each step, the model assigns a probability to every token in its vocabulary — typically 50,000 to 100,000 tokens — and samples from that distribution. The generated token is appended to the context, and the process repeats.
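
The loop just described can be sketched in a few lines of Python. Everything here is a toy stand-in (a six-token vocabulary and a hand-written scoring function instead of a real network), but the shape of the process is the same: score, sample, append, repeat, never backtrack.

```python
import math
import random

def softmax(logits):
    # Turn raw scores into a probability distribution over the vocabulary.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(score_fn, vocab, max_tokens, seed=0):
    # Autoregressive decoding: score the current context, sample one
    # token, append it, repeat. No backtracking and no second pass.
    rng = random.Random(seed)
    context = []
    for _ in range(max_tokens):
        probs = softmax(score_fn(context))
        token = rng.choices(vocab, weights=probs, k=1)[0]
        context.append(token)
    return context

# Toy "model": strongly prefers whichever token continues the sequence
# "? cond -> x : y" (a stand-in for a real network's logits).
VOCAB = ["?", "cond", "->", "x", ":", "y"]

def toy_scores(context):
    i = len(context) % len(VOCAB)
    return [8.0 if j == i else 0.0 for j in range(len(VOCAB))]

print(generate(toy_scores, VOCAB, max_tokens=6))
```

Sampling rather than taking the argmax is what makes each step a fresh chance to go wrong: even a model that puts 99% of its mass on the right token will, over a long enough program, eventually pick a wrong one.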

Now consider the Python keyword if. In the cl100k_base tokenizer (used by GPT-4; Llama 3 and Mistral use similar BPE vocabularies of their own), if is a single token. So is else. elif is also a single token, while else if is two. The model must generate the correct sequence from the space of all possible token continuations, with no structural guarantee that what it generates will be syntactically valid.

In Synoema, every operator and keyword is exactly one BPE token in cl100k_base. The ternary expression — Synoema's equivalent of if/then/else — is written as:

? condition -> then_value : else_value

That's three tokens: ?, ->, and :. Compare Python's conditional expression, then_value if condition else else_value, which reuses the keywords if and else but in a different order, with different semantics, from the statement form. The model must remember that in an expression context, if comes after the then-value, not before the condition. And a slip does not reliably fail to parse: reorder the operands and the result can still be a valid expression that silently computes the wrong value.
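
This failure mode is easy to reproduce in Python itself. The function below (clamp_floor is a hypothetical example, not from the benchmark suite) shows a plausible operand-order slip that still parses:

```python
def clamp_floor(x):
    # Intended behavior: return x when it is non-negative, otherwise 0.
    return x if x >= 0 else 0

def clamp_floor_swapped(x):
    # A plausible LLM slip: condition and then-value swapped. This is
    # still valid Python -- it parses as (x >= 0) if x else 0, so the
    # comparison becomes the *value* and the function returns a boolean.
    return x >= 0 if x else 0

print(clamp_floor(-5))          # prints 0
print(clamp_floor_swapped(-5))  # prints False
```

No parser, linter, or type checker in stock Python flags the second function; the mistake only shows up when something downstream chokes on a boolean where a number was expected.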

The 15% average token reduction we observe across 16 benchmark tasks is not the primary goal — it's a side effect of a more fundamental design choice: making the language structurally simpler for token-by-token generation. Fewer tokens, fewer chances to go wrong at each step.

Why Types Are Different From Linting

The second problem with existing languages for LLM generation is the absence — or inadequacy — of type information.

Python has type hints, added in Python 3.5, but they are optional and not enforced at runtime. TypeScript has types, but they are erased before execution. JavaScript, the language of the web, has no static types at all. Even languages with mandatory types — Java, C# — have type systems that were designed for human programmers writing code interactively with IDE support, not for machine generation where the entire program is produced in a single pass.

The research literature is unambiguous: type errors are the largest single category of failure in LLM-generated code. A 2025 study (Tambon et al.) found that type errors account for 33.6% of all compilation failures in LLM-generated programs. Another study (Mundler et al., PLDI 2025) found that type-constrained decoding — using type information to constrain the model's token choices at generation time — reduces compilation errors by 74.8%, compared to only 9.0% for syntax-only constraints.

Synoema uses Hindley-Milner type inference: the discipline underlying Haskell and OCaml, for which inference is sound and complete, so every well-typed program has a principal type that the algorithm finds. Every function has an inferred type. Every application is type-checked. The compiler rejects programs with type errors before they run. And crucially, because inference is automatic, the model never has to write a type annotation; the type system is a constraint on the output space, not an additional thing the model must produce.
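
To make the mechanism concrete, here is a minimal Algorithm W-style sketch for a toy lambda calculus: unification plus inference over integer literals, variables, lambdas, and application. None of this is Synoema's actual implementation; the tuple representation and helpers are illustrative only, and let-generalization and the occurs check are omitted for brevity.

```python
import itertools

_fresh = itertools.count()

def fresh():
    return ("var", next(_fresh))      # a fresh type variable

INT = ("con", "Int")                   # a type constant

def fun(a, b):
    return ("fun", a, b)               # the function type a -> b

def resolve(t, subst):
    # Follow substitution links until we reach a concrete type or a
    # type variable with no binding.
    while t[0] == "var" and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst):
    # Make two types equal by extending the substitution, or fail.
    a, b = resolve(a, subst), resolve(b, subst)
    if a == b:
        return subst
    if a[0] == "var":
        return {**subst, a: b}
    if b[0] == "var":
        return {**subst, b: a}
    if a[0] == "fun" and b[0] == "fun":
        subst = unify(a[1], b[1], subst)
        return unify(a[2], b[2], subst)
    raise TypeError(f"cannot unify {a} and {b}")

def infer(expr, env, subst):
    kind = expr[0]
    if kind == "lit":                  # ("lit", 3)
        return INT, subst
    if kind == "ref":                  # ("ref", "x")
        return env[expr[1]], subst
    if kind == "lam":                  # ("lam", "x", body)
        arg = fresh()
        body_t, subst = infer(expr[2], {**env, expr[1]: arg}, subst)
        return fun(arg, body_t), subst
    if kind == "app":                  # ("app", f, x)
        f_t, subst = infer(expr[1], env, subst)
        x_t, subst = infer(expr[2], env, subst)
        result = fresh()
        subst = unify(f_t, fun(x_t, result), subst)
        return resolve(result, subst), subst
    raise ValueError(kind)

# (\x -> x) 3 infers to Int; no annotation was written anywhere.
identity = ("lam", "x", ("ref", "x"))
t, s = infer(("app", identity, ("lit", 3)), {}, {})
print(resolve(t, s))                   # prints ('con', 'Int')
```

The property that matters carries over from the sketch to the real thing: applying a non-function, or mixing types across an application, is rejected by unification before anything runs.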

The Grammar Layer

Beyond types, there is a third layer of correctness that existing languages don't provide for LLM generation: grammar-constrained decoding.

A formal grammar defines exactly which token sequences are syntactically valid. If you can express your language's grammar in a machine-readable format such as GBNF, the grammar notation used by llama.cpp, you can use it to constrain the model's token choices at generation time: at each step, only tokens that continue a valid prefix are allowed.
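
A toy version of that mask fits in a few lines. The "grammar" below just enumerates valid ternaries of the form ? E -> E : E (with E standing in for any expression) up to a fixed nesting depth; a real implementation like llama.cpp's walks the grammar incrementally instead of enumerating, but the mask it computes has the same meaning.

```python
VOCAB = ["?", "E", "->", ":"]

def sentences(depth):
    # Enumerate valid token sequences up to a nesting depth: an
    # expression is either the terminal E or a nested ternary.
    if depth == 0:
        yield ["E"]
        return
    yield ["E"]
    for c in sentences(depth - 1):
        for t in sentences(depth - 1):
            for e in sentences(depth - 1):
                yield ["?", *c, "->", *t, ":", *e]

VALID = [tuple(s) for s in sentences(2)]

def allowed_next(context):
    # The decoding-time mask: only tokens that keep the prefix
    # extendable toward some valid sentence survive.
    n = len(context)
    return sorted({s[n] for s in VALID
                   if len(s) > n and s[:n] == tuple(context)})

print(allowed_next([]))            # prints ['?', 'E']
print(allowed_next(["?", "E"]))    # prints ['->']
```

Note what the mask rules out: a program can never start with :, and after ? E the only legal continuation is ->. The model's probabilities still choose among the survivors, but a syntactically invalid choice is simply not available.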

Synoema's grammar is 162 lines and 48 rules. It's compact enough to be practical and complete enough to express the full language. With constrained decoding enabled, the model cannot generate a syntactically invalid program — the grammar makes it impossible at the token level.
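
Synoema's actual 48-rule grammar is not reproduced in this post. As a flavor of the format, a hypothetical GBNF fragment covering just the ternary expression above might read:

```
root    ::= expr
expr    ::= atom | ternary
ternary ::= "?" ws expr ws "->" ws expr ws ":" ws expr
atom    ::= [a-zA-Z_] [a-zA-Z0-9_]*
ws      ::= " "*
```

Each rule names the exact terminals and sub-rules that may follow, which is precisely the information a constrained decoder needs at every step.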

This is not a small thing. A 2022 study (Nguyen & Nadi) found that 24% of GitHub Copilot suggestions contain compilation errors. With grammar-constrained decoding, that number goes to zero by construction — not by making the model smarter, but by making it structurally impossible to be wrong at the syntactic level.

What We Are Building and Why

Synoema is not a general-purpose programming language competing with Python or Rust. We're not trying to win the ecosystem wars. The project exists to answer a specific question:

If you design a programming language specifically for LLM code generation — optimizing token alignment, type-inference coverage, and grammar-constrained decoding — how much better can you make the output?

The answer, so far, is: measurably better, but not completely solved. Our best results on standard tasks show a 60% run rate for a 3B model on a baseline prompt — meaning 60% of generated programs parse, type-check, and produce the correct output. On harder tasks from a 50-task corpus, a 7B model achieves 41% run rate. For a 70B model on the standard task set, 61%.

These numbers are genuinely interesting. They are not "solved AI code generation": the 39 to 59 percent of programs that still fail, depending on model and task set, is real and important. But they show that language design choices have a measurable effect on generation quality, independent of model size.

The Honesty Requirement

We want to be direct about what Synoema is and isn't.

It is a research project. The language is in alpha, the corpus is 5,037 examples, and the fine-tuned models haven't been benchmarked in full yet. The scientific findings we report are real, but they're preliminary — we are reporting them as we go, including the hypothesis that was disproved (H1: that a compact reference would help models generate better code — it actually hurt significantly).

It is not a production tool for most use cases. If you're writing Python or TypeScript by hand, Synoema doesn't help you — it's not designed for human authorship. If you need a web frontend, a data science stack, or an enterprise backend, the right answer is the language with the ecosystem for that use case.

Where Synoema has a genuine advantage is the narrow but important niche of machine-generated, correctness-critical code: automated microservices, LLM agent outputs, executable specifications. In those contexts, the combination of token efficiency, type inference, and grammar-constrained decoding provides something that no general-purpose language currently offers.

What Comes Next

We're currently running a fine-tuning experiment: three model sizes (1.5B, 3B, 7B), 5,037 training examples validated by the Synoema compiler itself, QLoRA on AMD hardware. The 1.5B model finished training in 34 minutes; the 3B in 45 minutes. Benchmarks are pending.

The question we're most interested in is H7 from our test plan: can a fine-tuned 1.5B model exceed the baseline 7B model's 41% run rate on the same task set? If yes, that would be a significant result — it would mean that language-specific fine-tuning at small scale can compensate for a 5x difference in model size. That has practical implications for edge deployment.

If no — if fine-tuning doesn't close that gap — that's also an interesting result. It would tell us something about the relative importance of model capacity versus language design choices, and would push us toward different approaches: curriculum learning, self-play RL, or more aggressive type-constrained decoding at inference time.

Either way, we'll report what we find. That's the project.

Related Articles

Why AI Writes Broken Code — and How Type Systems Can Fix It

The statistics behind LLM code failures and how type-guided generation addresses the dominant error class.

Introducing Synoema: A Language Machines Can Verify

A tour of the language features with code examples: pattern matching, ADTs, contracts, and more.

What We Learned Teaching AI a New Language

Experimental results across 10+ models: what worked, what didn't, and what surprised us.