Thread-based Concurrency

spawn · scope · pmap · race · gather · channels

Synoema has two concurrency layers. This page covers the thread-based layer: real OS threads, deep-copy message passing, and parallel map. For the async/await layer (stackless state machines, IO reactor, async TCP), see /async.

Numbers on this page are measured, not estimated. pmap speedup on 10 cores (PM-4 stress test): 2.04× in JIT (list size 1024, 400-iter per-element loop) and 3.53× in the interpreter (list size 128, 20-iter per-element loop). Source: docs/benchmarks.md §Concurrency.


1. spawn — fire-and-forget OS thread

spawn starts a new OS thread and returns immediately. The caller does not wait for the spawned thread to finish. Use it for background tasks that communicate results via a channel.

Primitive	Type	Description
`spawn f`	`(Unit -> a) -> Unit`	Start `f ()` in a new OS thread; caller continues immediately

-- Fire-and-forget: print from a background thread
main = scope {
  spawn (\_ -> print "hello from background")
  print "hello from main"
  -- Both lines print; order is not guaranteed
}

spawn is only valid inside a scope block. All spawned threads are joined when the scope closes, so values computed in a scope can safely outlive the threads that produced them.
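
For readers who know Rust, the join-on-exit rule resembles `std::thread::scope`. The following is a minimal sketch of that analogy, not Synoema's runtime code. The name `run_scoped` is hypothetical, and the atomic counter exists only so the effect of the join can be observed.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Spawns `workers` threads inside one scope and reports how many finished.
/// `run_scoped` is a hypothetical name used for illustration only.
pub fn run_scoped(workers: usize) -> usize {
    let finished = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..workers {
            // Analogue of `spawn`: returns immediately, the closure runs on a new OS thread.
            s.spawn(|| {
                finished.fetch_add(1, Ordering::SeqCst);
            });
        }
        // Leaving this closure joins every thread spawned on `s`.
    });
    // All workers have been joined here, so the count is complete.
    finished.load(Ordering::SeqCst)
}
```

Because the scope cannot return before its threads are joined, the value read after the scope always equals the number of workers.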

spawn with channels

-- worker sends a triangular number over a channel
tri n acc = ? n == 0 -> acc : tri (n - 1) (acc + n)

worker ch n = send ch (tri n 0)

main = scope {
  ch = chan
  spawn (worker ch 15)   -- tri 15 = 120
  spawn (worker ch 16)   -- tri 16 = 136
  a = recv ch
  b = recv ch
  a + b    -- → 256
}

Channels are the primary way to return values from spawned threads. chan creates an unbounded MPSC channel; send ch v enqueues a value; recv ch blocks the caller until a value is available.
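
The same worker pattern can be sketched in Rust with `std::sync::mpsc`, which is also an unbounded multi-producer, single-consumer channel. This is an illustrative analogue under that assumption, not the Synoema implementation; `sum_of_workers` is a hypothetical name.

```rust
use std::sync::mpsc;
use std::thread;

/// Triangular number by accumulation, mirroring `tri n acc` from the example.
fn tri(n: u64, acc: u64) -> u64 {
    if n == 0 { acc } else { tri(n - 1, acc + n) }
}

/// Starts one worker per input, each sending `tri n 0` over a shared channel,
/// then sums one received value per worker. Hypothetical helper.
pub fn sum_of_workers(inputs: &[u64]) -> u64 {
    let (tx, rx) = mpsc::channel(); // unbounded MPSC channel
    thread::scope(|s| {
        for &n in inputs {
            let tx = tx.clone(); // each producer gets its own sender handle
            s.spawn(move || tx.send(tri(n, 0)).unwrap());
        }
        // One blocking recv per worker; arrival order is not deterministic,
        // which does not matter for a sum.
        (0..inputs.len()).map(|_| rx.recv().unwrap()).sum::<u64>()
    })
}
```

Calling `sum_of_workers(&[15, 16])` corresponds to the two-worker example above.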

2. scope — scoped parallel block

scope { ... } creates a structured concurrency region. All spawn calls inside the block are joined before scope returns. The result of the scope block is the value of the last expression.

Primitive	Type	Description
`scope { ... }`	`a`	Structured block; all spawned threads are joined on exit; evaluates to the last expression

-- scope_parallel.sno — 4 parallel workers, channel-aggregated sum
-- tri 15 = 120, tri 16 = 136, tri 31 = 496, tri 32 = 528
-- Expected: 120 + 136 + 496 + 528 = 1280

tri n acc = ? n == 0 -> acc : tri (n - 1) (acc + n)

worker ch n = send ch (tri n 0)

main = scope {
  ch = chan
  spawn (worker ch 15)
  spawn (worker ch 16)
  spawn (worker ch 31)
  spawn (worker ch 32)
  a = recv ch
  b = recv ch
  c = recv ch
  d = recv ch
  a + b + c + d
}

Deep-copy across thread boundaries

When a value crosses a thread boundary (via send, or captured in a spawned closure), it is deep-copied. There is no shared mutable state between threads. Each thread owns its data independently. This eliminates data races by design — no locks, no reference counting across arenas.

Internally each thread gets its own bump-allocation arena. When send transmits a value, the runtime walks the value graph and copies every node into a fresh arena before the receiving thread touches it. Reference counting (Perceus RC) therefore stays local to a single arena, which is the same model used for WASM heap management.
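
The copy-on-send rule can be imitated in Rust by cloning a value before it is sent. In this sketch `Vec::clone` stands in for the runtime's graph walk, which is an assumption made for illustration; `send_copy` is a hypothetical name and the arena allocator itself is not modeled.

```rust
use std::sync::mpsc;
use std::thread;

/// Sends a copy of a list to another thread and returns both versions.
/// Hypothetical helper; the clone stands in for the runtime's deep copy.
pub fn send_copy() -> (Vec<i64>, Vec<i64>) {
    let xs = vec![1, 2, 3, 4, 5];
    let (tx, rx) = mpsc::channel();
    let copy = xs.clone(); // independent copy made before crossing the boundary
    let receiver = thread::spawn(move || {
        let mut received: Vec<i64> = rx.recv().unwrap();
        received.push(6); // changes only the receiver's copy
        received
    });
    tx.send(copy).unwrap();
    let theirs = receiver.join().unwrap();
    (xs, theirs) // the sender's list is untouched
}
```

The sender's list and the receiver's list share no memory, so a change on one side is invisible on the other.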

3. pmap — parallel map

pmap applies a function to each element of a list in parallel across all available cores. It is a drop-in replacement for map when the per-element work is CPU-bound and the list is long enough to amortize thread overhead.

Primitive	Type	Description
`pmap f xs`	`(a -> b) -> [a] -> [b]`	Apply `f` to each element of `xs` in parallel; return results in original order

-- parallel_compute.sno — CPU-bound pmap example
-- heavy x: sum_{i=1..400} (i*i - i) — result 21333200 independent of x
-- list [1..129] — 129 elements, well above PMAP_PAR_THRESHOLD=64
-- expected sum: 129 * 21333200 = 2751982800

heavy n acc i = ? i == 0 -> acc : heavy n (acc + (i * i - i)) (i - 1)

main = sum (pmap (\x -> heavy x 0 400) [1..129])
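
The constants quoted in the comments of this example can be checked by hand with the standard closed forms for sums of integers and squares. A short derivation, with n = 400 iterations and 129 list elements:

```latex
\begin{align*}
\sum_{i=1}^{n} (i^2 - i)
  &= \frac{n(n+1)(2n+1)}{6} - \frac{n(n+1)}{2}
   = \frac{(n-1)\,n\,(n+1)}{3}, \\
\sum_{i=1}^{400} (i^2 - i)
  &= \frac{399 \cdot 400 \cdot 401}{3} = 21\,333\,200, \\
129 \times 21\,333\,200 &= 2\,751\,982\,800 .
\end{align*}
```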

Measured speedup

Backend	List size	Per-elem work	Cores	map (sequential)	pmap (parallel)	Speedup
JIT	1024	400-iter heavy loop	10	120 ms	60 ms	2.04×
Interpreter	128	20-iter heavy loop	10	2055 ms	582 ms	3.53×

Source: PM-4 stress test in docs/benchmarks.md §Concurrency.

Threshold and nested pmap

PMAP_PAR_THRESHOLD = 64 — lists shorter than 64 elements run sequentially. pmap on a short list is identical in result to map but carries a small closure-cloning overhead. Use map for short lists (PM-5).
Nested pmap falls back to sequential — if a function passed to pmap itself calls pmap, the inner call runs sequentially. This prevents unbounded thread explosion (PM-8).
Workers — std::thread::Builder with 64 MB stacks across available_parallelism() cores. Worker count is determined once at program start and is not configurable at runtime.
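
The three rules above (threshold, sequential fallback for nested calls, large-stack workers) can be sketched together in Rust. This is a rough model under stated assumptions, not the actual runtime: `pmap_sketch`, `PAR_THRESHOLD`, and the thread-local `IN_WORKER` flag are hypothetical, and splitting the list into one chunk per core is a guess at the scheduling strategy.

```rust
use std::cell::Cell;
use std::thread;

const PAR_THRESHOLD: usize = 64; // mirrors PMAP_PAR_THRESHOLD from the text

thread_local! {
    // True while the current thread is a pmap worker (assumed mechanism).
    static IN_WORKER: Cell<bool> = Cell::new(false);
}

/// Order-preserving parallel map over a slice. Hypothetical sketch.
pub fn pmap_sketch<A, B, F>(f: F, xs: &[A]) -> Vec<B>
where
    A: Sync,
    B: Send,
    F: Fn(&A) -> B + Sync,
{
    // Short lists and nested calls take the sequential path.
    if xs.len() < PAR_THRESHOLD || IN_WORKER.with(|w| w.get()) {
        return xs.iter().map(|x| f(x)).collect();
    }
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let chunk_len = (xs.len() + workers - 1) / workers; // ceiling division
    let f = &f;
    thread::scope(|s| {
        let handles: Vec<_> = xs
            .chunks(chunk_len)
            .map(|chunk| {
                thread::Builder::new()
                    .stack_size(64 * 1024 * 1024) // 64 MB stack per worker
                    .spawn_scoped(s, move || {
                        IN_WORKER.with(|w| w.set(true));
                        chunk.iter().map(|x| f(x)).collect::<Vec<B>>()
                    })
                    .expect("failed to spawn worker")
            })
            .collect();
        // Joining in spawn order keeps results in the original list order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

A call such as `pmap_sketch(|x| x * 2, &xs)` returns the same vector as a sequential map. If `f` itself calls `pmap_sketch`, the inner call sees the worker flag and runs sequentially.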

4. race and gather

race and gather are task combinators that run a list of async tasks in parallel. They are part of the async layer (Phase H1) but serve as natural complements to the thread-based primitives.

Combinator	Type	Semantics
`race tasks`	`[Task a] -> Task a`	Run all tasks in parallel; return the first to complete (others are dropped)
`gather tasks`	`[Task a] -> Task [a]`	Run all tasks in parallel; return a list of results in original order

-- race and gather in action
-- Run: sno jit examples/async_race.sno

async fn fast_task = 1
async fn slow_task = _ = await (async_sleep 200); 2

async fn main =
  -- race: first to complete wins; slow_task result is discarded
  winner = await (race [slow_task fast_task])

  -- gather: both finish, results collected in order
  both = await (gather [fast_task fast_task])

  winner   -- → 1

race uses an AtomicBool winner flag so only the first completion is propagated. gather uses an AtomicUsize completion counter and collects results in original index order. Each sub-task gets its own OS thread. race and gather require async fn / await — see /async for the full async stack.
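
The winner flag and the completion counter can be sketched in Rust as follows. This is a simplified model built on plain closures and OS threads, assuming one thread per sub-task as described above; it does not model `Task a`, `await`, or the async reactor. The names `Job`, `race_sketch`, and `gather_sketch` are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// A boxed unit of work; stands in for an async task in this sketch.
pub type Job<T> = Box<dyn FnOnce() -> T + Send + 'static>;

/// Returns the result of whichever job finishes first; later results are dropped.
pub fn race_sketch<T: Send + 'static>(jobs: Vec<Job<T>>) -> T {
    let won = Arc::new(AtomicBool::new(false));
    let (tx, rx) = mpsc::channel();
    for job in jobs {
        let (won, tx) = (Arc::clone(&won), tx.clone());
        thread::spawn(move || {
            let value = job();
            // Only the first finisher flips the flag and publishes its value.
            if won
                .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
                .is_ok()
            {
                let _ = tx.send(value);
            }
        });
    }
    drop(tx); // with no jobs, recv fails instead of blocking forever
    rx.recv().expect("race needs at least one job")
}

/// Runs every job on its own thread and returns results in input order.
pub fn gather_sketch<T: Send + 'static>(jobs: Vec<Job<T>>) -> Vec<T> {
    let slots: Vec<Mutex<Option<T>>> = jobs.iter().map(|_| Mutex::new(None)).collect();
    let completed = AtomicUsize::new(0);
    thread::scope(|s| {
        for (i, job) in jobs.into_iter().enumerate() {
            let (slots, completed) = (&slots, &completed);
            s.spawn(move || {
                let value = job();
                *slots[i].lock().unwrap() = Some(value); // slot index = input index
                completed.fetch_add(1, Ordering::AcqRel);
            });
        }
    });
    assert_eq!(completed.load(Ordering::Acquire), slots.len());
    slots
        .into_iter()
        .map(|m| m.into_inner().unwrap().expect("slot was filled"))
        .collect()
}
```

In `race_sketch` the losing threads keep running until their job returns, but their values never reach the caller. In `gather_sketch` each result lands in the slot matching its input position, so completion order does not affect output order.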

5. Channels

Channels are the primary synchronization primitive. Values sent over a channel are deep-copied, so the sender retains ownership of its copy and the receiver gets a fresh independent copy.

Primitive	Type	Description
`chan`	`Chan a`	Create an unbounded channel
`bounded_chan n`	`Int -> Chan a`	Create a bounded channel with capacity `n`; `send` blocks when full
`send ch v`	`Chan a -> a -> Unit`	Send value `v` (deep-copied); non-blocking on unbounded, blocking on bounded when full
`recv ch`	`Chan a -> a`	Receive next value; blocks until one is available
`try_send ch v`	`Chan a -> a -> Bool`	Non-blocking send; returns false if the bounded channel is full
`try_recv ch`	`Chan a -> Maybe a`	Non-blocking receive; returns `Nothing` if the channel is empty
`recv_timeout ms ch`	`Int -> Chan a -> Maybe a`	Receive with timeout in milliseconds; returns `Nothing` on timeout
`select chans`	`[Chan a] -> a`	Block until any channel in the list has a value; return the first received value

-- bounded producer/consumer
main = scope {
  ch = bounded_chan 4     -- capacity 4; send blocks when full
  spawn (\_ ->
    send ch 1
    send ch 2
    send ch 3
    send ch 4
    send ch 5)            -- 5th send blocks if the buffer is still full
  a = recv ch
  b = recv ch
  c = recv ch
  d = recv ch
  e = recv ch
  a + b + c + d + e       -- → 15
}
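
Backpressure on a bounded channel can be observed directly in Rust with `std::sync::mpsc::sync_channel`, which behaves like the `bounded_chan` and `try_send` rows of the table. This is an analogue for illustration, not Synoema's channel implementation; `bounded_demo` is a hypothetical name.

```rust
use std::sync::mpsc::{self, TrySendError};
use std::thread;

/// Fills a capacity-4 channel, shows that a non-blocking send is rejected,
/// then lets a blocking send complete once the consumer drains the buffer.
pub fn bounded_demo() -> (bool, i32) {
    let (tx, rx) = mpsc::sync_channel::<i32>(4); // capacity 4
    for v in 1..=4 {
        tx.send(v).unwrap(); // buffer has room, so these do not block
    }
    // Buffer is full: the non-blocking variant reports failure instead of waiting.
    let rejected = matches!(tx.try_send(5), Err(TrySendError::Full(_)));
    // The blocking variant waits on a background thread until a slot frees up.
    let producer = thread::spawn(move || tx.send(5).unwrap());
    let total: i32 = (0..5).map(|_| rx.recv().unwrap()).sum();
    producer.join().unwrap();
    (rejected, total)
}
```

The first component of the result records that the fifth non-blocking send was refused while the buffer was full; the second is the sum of all five values once the consumer has drained the channel.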

-- select: fan-in from multiple channels
main = scope {
  ch1 = chan
  ch2 = chan
  spawn (\_ -> send ch1 "from ch1")
  spawn (\_ -> send ch2 "from ch2")
  first = select [ch1 ch2]   -- whichever arrives first
  first
}
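
Rust's standard library has no built-in select over channels, so the sketch below approximates fan-in by polling each receiver with `try_recv`. It only illustrates the "first available value wins" behavior; how Synoema's `select` is implemented is not stated here, and `select_sketch` is a hypothetical name. Unlike `select`, it returns `None` when every sender has gone away, so it cannot hang.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};
use std::thread;
use std::time::Duration;

/// Polls each receiver in turn until one yields a value.
/// Returns None if all channels are disconnected (or the list is empty).
pub fn select_sketch<T>(chans: &[Receiver<T>]) -> Option<T> {
    loop {
        let mut disconnected = 0;
        for rx in chans {
            match rx.try_recv() {
                Ok(value) => return Some(value), // first available value wins
                Err(TryRecvError::Empty) => {}   // nothing yet, try the next one
                Err(TryRecvError::Disconnected) => disconnected += 1,
            }
        }
        if disconnected == chans.len() {
            return None; // no sender is left, so nothing can ever arrive
        }
        thread::sleep(Duration::from_millis(1)); // avoid a hot spin loop
    }
}
```

The `Empty` case of `try_recv` plays the role of `Nothing` in the `try_recv ch` row of the table. Rust's `recv_timeout` on a receiver offers the deadline-based variant.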

6. Deep-copy semantics

Synoema's concurrency model has no shared mutable state. Every value that crosses a thread boundary is deep-copied. This is a deliberate design choice that enables the compiler to guarantee data-race freedom without requiring locks, atomic types, or borrow checking.

-- Deep-copy example: sender retains its own copy
main = scope {
  ch = chan
  xs = [1 2 3 4 5]
  spawn (\_ -> send ch xs)   -- xs is deep-copied into ch
  -- xs is still valid here: the send did not move it
  received = recv ch
  -- received is an independent copy of [1 2 3 4 5]
  sum received               -- → 15
}

Consequences:

No data races — by construction, not by convention
Large values (big lists, nested records) incur a copy cost on send — use channels for results, not for streaming large data inside a tight loop
Each thread has its own bump-allocation arena; the GC (Perceus RC) never crosses thread boundaries
This is the same model used for WASM v2/v3 heap management

7. When to use what

Situation	Primitive	Notes
Background task, result returned via channel	`scope { spawn ... ; recv ch }`	Standard pattern; all threads joined on scope exit
CPU-bound parallel transform on a list	`pmap f xs`	Use when the list has at least 64 elements and per-element work is significant
Parallel I/O or async operations	`async fn` / `await`	Stackless; no OS thread per task; see /async
First result wins, others discarded	`race [t1 t2 ...]`	Requires `async fn`; each sub-task gets its own OS thread
All results needed in parallel, ordered	`gather [t1 t2 ...]`	Requires `async fn`; results returned in original list order
Producer/consumer with backpressure	`bounded_chan n`	Blocks producer when capacity `n` is full
Fan-in from multiple channels	`select [ch1 ch2 ...]`	Returns the first value available across all channels
Non-blocking send/receive	`try_send` / `try_recv`	Returns Bool / Maybe instead of blocking
Receive with deadline	`recv_timeout ms ch`	Returns `Nothing` after `ms` milliseconds
Short lists (<64 elements)	`map f xs`	pmap overhead exceeds benefit below threshold

8. Limitations (honest)

spawn is only valid inside scope. Calling spawn outside a scope block is a runtime error. The structured-concurrency model requires a join point.
pmap threshold is fixed at 64. There is no API to override PMAP_PAR_THRESHOLD. Lists shorter than 64 elements always run sequentially.
Nested pmap is sequential. The inner pmap call inside a parallel worker falls back to map. This is intentional to prevent unbounded thread creation (PM-8).
Deep-copy cost. Large values sent over channels incur a full deep-copy. This is not a problem for typical channel use (sending results, not streaming bulk data in a tight loop).
race/gather require async fn. The race and gather combinators work on Task a values. Pure synchronous functions must be wrapped in async fn to use them.
No shared memory. There are no atomics, mutexes, or shared references in user code. All cross-thread communication goes through channels.