Why AI Writes Broken Code — and How Type Systems Can Fix It

Ask an LLM to write a function that adds two numbers, and it will succeed. Ask it to implement a binary search tree with insertion and deletion, and there's a good chance something will be wrong — even if the code looks correct at a glance. The interesting question isn't whether LLMs make mistakes; it's which kind of mistakes they make, and whether that pattern tells us something we can act on.

It does.

The Anatomy of LLM Code Failures

A 2025 study by Tambon et al. analyzed a large corpus of LLM-generated programs and categorized their failures. The results were striking in their consistency: type errors account for 33.6% of all compilation failures. Not "program produces wrong output" — those are semantic errors, a different category. Type errors specifically: passing a string where an integer is expected, calling a method that doesn't exist on a type, returning the wrong type from a function.

Why type errors specifically? To understand this, you have to think about how LLMs generate code.

A language model doesn't compile the code it's writing. It doesn't maintain a symbol table, doesn't track which variable has which type, doesn't verify that a function call matches a signature. It predicts the next token based on the statistical patterns in its training data. Those patterns are extraordinarily rich — rich enough to produce code that is syntactically valid and semantically coherent for simple cases. But for complex cases, the model is essentially doing pattern completion, not type inference.

The result is predictable: the model knows the general shape of the code it's writing, but it can get the types wrong. The failures cluster where types flow across function boundaries: a function returns a value that gets passed to another function several lines later, or a type must stay consistent across multiple pattern-matching branches.
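A minimal Python illustration of that failure mode, with hypothetical names (parse_pair, total); this is not model output, it just shows how a cross-boundary type mismatch stays invisible until the code actually runs:

```python
# Hypothetical example of a cross-boundary type error that plain Python
# only reports at runtime. parse_pair returns a tuple, but the caller
# treats the result as a single number several lines later.

def parse_pair(text):
    """Split 'a,b' into two integers."""
    left, right = text.split(",")
    return int(left), int(right)  # returns a tuple, not an int

def total(text):
    value = parse_pair(text)  # value is a tuple here
    return value + 1          # TypeError: can't add a tuple and an int

try:
    total("3,4")
except TypeError as exc:
    print("caught only at runtime:", exc)
```

Nothing at the definition site of either function flags the mismatch; the error surfaces only on the first call.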

A Concrete Example

Consider implementing a simple stack in Python. A model might write:

class Stack:
    def __init__(self):
        self.items = []

    def push(self, item):
        self.items.append(item)

    def pop(self):
        return self.items.pop()

    def peek(self):
        return self.items[-1]

    def is_empty(self):
        return len(self.items) == 0

    def size(self):
        return len(self.items)

This code looks fine and will work for many inputs. But now ask the model to add a method that returns the sum of all elements:

    def sum(self):
        return sum(self.items)

At first glance the method name appears to shadow the built-in sum, but Python's scoping rules save it: class-level names are not visible from inside method bodies, so sum(self.items) still resolves to the built-in. The genuine problem is type-adjacent: the method works if the items are numbers, fails with a confusing TypeError at runtime if they're not, and nothing at definition time signals either fact. A type checker couldn't catch this without element-type annotations. It's emblematic of the pattern: the model applies a locally plausible pattern without tracking global consistency.
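The runtime behavior is easy to verify directly. A quick check, reproducing the class above with just the relevant methods:

```python
# Reproducing the Stack from above (push and sum only) to check its
# runtime behavior with different element types.

class Stack:
    def __init__(self):
        self.items = []

    def push(self, item):
        self.items.append(item)

    def sum(self):
        # This call still reaches the built-in sum: Python does not look
        # up names in the class namespace from inside a method body.
        return sum(self.items)

ints = Stack()
ints.push(1)
ints.push(2)
print(ints.sum())  # 3: works for numeric items

words = Stack()
words.push("a")
words.push("b")
try:
    words.sum()
except TypeError as exc:
    print("fails at runtime for strings:", exc)
```

The same method definition is fine for one element type and a runtime crash for another, and nothing about the class itself distinguishes the two cases.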

In Synoema, the same stack looks like:

Stack a = Empty | Node a (Stack a) derive (Show)

push item stack = Node item stack

pop Empty = error "empty stack"
pop (Node x rest) = (x, rest)

peek Empty = error "empty stack"
peek (Node x _) = x

is_empty Empty = true
is_empty _ = false

size Empty = 0
size (Node _ rest) = 1 + size rest

The type of Stack is inferred automatically: it's a polymorphic type Stack a parameterized by the element type. The pop function has inferred type Stack a -> (a, Stack a). If you try to use the result of pop as if it were just an element (not a tuple), the type checker catches the error before the program runs.
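For contrast, the same tuple-returning pop can be approximated in annotated Python. This is a hand-written sketch with hypothetical Empty/Node classes, checked by an external tool such as mypy rather than inferred, but it shows the kind of misuse a checker rejects before the program runs:

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Generic, Tuple, TypeVar, Union

T = TypeVar("T")

@dataclass
class Empty:
    pass

@dataclass
class Node(Generic[T]):
    value: T
    rest: "Stack[T]"

# A stack is either empty or a node holding a value and the rest.
Stack = Union[Empty, Node[T]]

def pop(stack: Stack[T]) -> Tuple[T, Stack[T]]:
    if isinstance(stack, Empty):
        raise IndexError("empty stack")
    return stack.value, stack.rest

s: Stack[int] = Node(1, Node(2, Empty()))
top, rest = pop(s)   # fine: unpacks the (value, rest) pair
print(top)           # 1
# bad = pop(s) + 1   # rejected by mypy: pop returns a tuple, not an int
```

The annotations must be written by hand here; the point of Hindley-Milner inference is that the equivalent guarantees come for free, with no annotations in the generated code at all.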

What Hindley-Milner Inference Provides

Hindley-Milner type inference — the type system used by Haskell, OCaml, and Synoema — has a precise mathematical property: its inference algorithm is sound and complete. Sound means that if the type checker accepts a program, the program will not produce a type error at runtime (Milner's "well-typed programs don't go wrong"). Complete means that if an expression has any valid type, the algorithm will find it, and in fact will find its principal type, the most general one, without requiring any annotations.

For LLM code generation, the completeness property matters most: the model never has to write a type annotation. Types are inferred. The type system is a constraint on the output, not an additional thing the model must produce and get right.

Consider how this changes the error landscape. When a model generates Python code, a type error may be invisible at generation time and surface only at runtime — potentially hours later, in production, on an unexpected input. When a model generates Synoema code, type errors are caught by the compiler immediately, as part of the normal development workflow.

More importantly for constrained decoding: types can be checked incrementally, as the model generates tokens. At each step, the type checker can determine which token choices are type-consistent and which aren't — allowing the decoding process to rule out type-incorrect continuations before they're generated.
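The mechanism can be sketched in a few lines. This toy uses strings instead of token IDs, and a stand-in is_consistent predicate in place of a real incremental type checker:

```python
# Toy constrained decoding step: at each position, mask out candidate
# tokens that the checker rejects, then sample only from the survivors.

def constrained_step(prefix, candidates, is_consistent):
    """Return the candidates allowed to follow `prefix`."""
    return [tok for tok in candidates if is_consistent(prefix + tok)]

# Stand-in checker: the running text must remain a prefix of a valid
# integer expression (digits and '+', no leading '+', no '++').
def is_consistent(text):
    if not text or text[0] == "+" or "++" in text:
        return False
    return all(c.isdigit() or c == "+" for c in text)

allowed = constrained_step("1+", ["2", "+", "x"], is_consistent)
print(allowed)  # ['2'] -- '+' and 'x' are ruled out before generation
```

A real implementation checks type consistency incrementally and operates on the model's logits, but the shape is the same: invalid continuations get zero probability, so they can never be emitted.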

The Research Evidence

Two studies quantify the impact of type-constrained generation:

Mundler et al. (PLDI 2025) found that type-constrained decoding reduces compilation errors by 74.8%, compared to 9.0% for syntax-only constraints, making the type constraint roughly 8x more powerful than the syntax constraint alone. The study used a simply-typed subset of TypeScript and showed the result holds across multiple model sizes.

Tambon et al. (2025) confirmed the 33.6% type-error share across a different corpus and showed that the distribution is consistent across model families and sizes — it's a property of how LLMs generate typed code, not of any particular model.

Both studies point to the same conclusion: if you want to reduce LLM code failures, the highest-leverage intervention is type-constrained decoding, not more training data or larger models. The gains from type constraints are complementary to gains from model size — they stack.

Types as Generation Constraint, Not Just Verification

The key insight is the difference between two ways of using a type system:

  1. Post-hoc verification: generate code, then type-check it, report errors. This is what TypeScript and mypy do. It's useful, but it doesn't prevent errors during generation.
  2. Generation constraint: use the type system to constrain which tokens the model can generate at each step. This prevents type errors from being generated in the first place.

Synoema's design targets the second approach. The GBNF grammar handles the first layer (syntactic validity). The Hindley-Milner type system provides the semantic layer. Combined, they create a generation process where a large class of errors is structurally impossible.

The practical implementation uses XGrammar (Dong et al., 2024), which achieves a 100x speedup over naive grammar-based decoding through context-free grammar preprocessing. The overhead of constrained decoding is reduced to near-zero, making it practical for production use rather than just research.

What Types Cannot Fix

Being honest about limits: type systems catch a specific category of errors. They don't catch algorithmic errors (a correctly typed function that computes the wrong thing), semantic errors (code that type-checks but doesn't match the specification), or performance issues.

Our own data confirms this. In Phase D testing, we observed tasks where a model achieved 91% syntax correctness but 0% run correctness — the model generated syntactically valid, type-correct Synoema code that produced wrong output. The examples were pattern_match and type_definition tasks where the model understood the surface syntax but failed on the semantics of algebraic data types.

This is the gap that verification contracts address. In Synoema:

discount : Int -> Int
  requires total > 0
  ensures result <= 500
discount total = ? total > 1000 -> min (total / 10) 500 : 0

The requires and ensures clauses are checked at runtime. They don't prevent semantic errors in general, but they can catch a specific class of logic errors that type inference misses: violations of the function's contract.
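Python has no built-in equivalent, but a small decorator can mimic the same runtime checks. The contract helper below is an illustrative analog of requires/ensures, not Synoema's implementation:

```python
# Illustrative runtime contract checking, mirroring the discount example.

def contract(requires=None, ensures=None):
    """Check a precondition on the argument and a postcondition on the result."""
    def wrap(fn):
        def checked(total):
            if requires and not requires(total):
                raise ValueError(f"requires violated: total={total}")
            result = fn(total)
            if ensures and not ensures(result):
                raise ValueError(f"ensures violated: result={result}")
            return result
        return checked
    return wrap

@contract(requires=lambda total: total > 0,
          ensures=lambda result: result <= 500)
def discount(total):
    # 10% discount, capped at 500, only for orders over 1000
    return min(total // 10, 500) if total > 1000 else 0

print(discount(2000))   # 200
print(discount(20000))  # 500 (the cap holds)
```

Calling discount(0) raises immediately with a message naming the violated clause, which is exactly the kind of early, localized failure that makes a contract violation debuggable.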

The Three-Layer Stack

Synoema's approach to LLM code correctness uses three complementary layers:

  1. GBNF grammar — eliminates all syntactic errors. The model cannot generate invalid syntax. Effect: ~24% of Copilot errors eliminated by construction.
  2. HM type inference — eliminates type errors. The model cannot generate type-incorrect programs that pass type checking. Effect: 74.8% reduction in compilation errors (Mundler et al.).
  3. Verification contracts — catches logic errors for functions with explicit contracts. Effect: depends on contract coverage; particularly valuable for financial logic, API invariants, business rules.

These layers are not redundant — each catches a different class of error. Together, they address the dominant failure modes in LLM code generation: syntax errors (layer 1), type errors (layer 2), and contract violations (layer 3).

The run rate we observe experimentally — 41% on a 50-task corpus with a 7B model — is after applying layers 1 and 2. The remaining failures are mostly semantic: the model understands the shape of the code but not the algorithm. That's the harder problem, and it requires a different approach: better training data, fine-tuning, and more sophisticated prompting strategies. Layer 3 helps at the margins but doesn't solve it fundamentally.

The 41% number, with its implied 59% failure rate, is not a success story. But it is a meaningful starting point — and more importantly, it's a starting point with a clear structure: we know which failures each layer catches, we can measure the contribution of each, and we can target improvements systematically.

Related Articles

Why Build a New Programming Language in the Age of AI?

The case for a language designed for machines — token alignment, type inference, constrained decoding.

How a Compiler Catches AI Mistakes Before They Run

The three-layer verification pipeline in detail, with LLM-friendly error messages.

What We Learned Teaching AI a New Language

Experimental results across 10+ models: what the data actually shows.