ACE — Atomic Constraint Engine

Throughput · 3 atoms → 3 billion

Does it scale? On a GPU?

All-resident solvers (JS, GPU) hold the whole network in memory, so they hit a wall in the low millions of atoms, marked in red. Fused streaming + GPU holds only a moving window and ships each window to the GPU: it simulates all the way to 3 billion atoms while memory stays flat. Simulate the whole thing; keep only the active window; spill the rest.

[Interactive chart: throughput against atom count. Controls: samples per run, memory budget. Legend: JS (all-resident); GPU (all-resident); fused stream+GPU (flat memory); red bar marks the memory wall.]

One Monte Carlo sample

A constraint with no data kills everything behind it

An atom is ready when every gate it holds has flipped — all its predecessors finished. Then its own work takes a sampled duration. A dark atom (no data from the world) never finishes, and everything downstream goes dark with it. That's the engine refusing to fake a date.
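The rule above fits in a few lines. This is a minimal sketch, not the engine's code: the four-atom network, the field names (`preds`, `dark`, `min`, `max`) and the uniform duration draw are all assumptions made for illustration.

```javascript
// Hypothetical toy network, listed in topological order:
// every predecessor index is smaller than the atom's own index.
// `dark: true` marks an atom with no data from the world.
const atoms = [
  { name: "survey",  preds: [],     dark: false, min: 2, max: 4 },
  { name: "permit",  preds: [0],    dark: true,  min: 0, max: 0 },
  { name: "pour",    preds: [0],    dark: false, min: 1, max: 3 },
  { name: "inspect", preds: [1, 2], dark: false, min: 1, max: 2 },
];

// Run one Monte Carlo sample.
// Returns one finish time per atom, or null where the atom is dark.
function runSample(atoms, rand = Math.random) {
  const finish = new Array(atoms.length).fill(null);
  for (let i = 0; i < atoms.length; i++) {
    const a = atoms[i];
    if (a.dark) continue; // no data: this atom never finishes
    let ready = 0;
    let blocked = false;
    for (const p of a.preds) {
      if (finish[p] === null) { blocked = true; break; } // one dark gate is enough
      ready = Math.max(ready, finish[p]); // ready once the slowest predecessor is done
    }
    if (blocked) continue; // darkness propagates downstream
    // Uniform sampling between min and max is an assumption of this sketch.
    finish[i] = ready + a.min + rand() * (a.max - a.min);
  }
  return finish;
}

// Fixing the random draw at 0.5 makes the sample deterministic.
const sample = runSample(atoms, () => 0.5);
console.log(sample); // [3, null, 5, null]
```

"permit" has no data, so "inspect" goes dark with it even though "pour" finished normally.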

Why two loops

Streaming for memory, samples for speed

The outer loop walks the network in windows — only one window is ever in memory, so a 3-billion-atom project costs the same RAM as a 3-thousand-atom one. The inner axis is the samples: every Monte Carlo run is independent, so each window ships to the GPU as thousands of parallel threads. Memory comes from the outer loop; speed comes from the inner one.
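A rough sketch of the two loops, under heavy assumptions: the network is a simple chain (each atom gated on the one before it), durations are uniform between placeholder bounds, and the only state carried across a window boundary is one finish time per sample. The names `windows`, `simulate` and `peakResident` are hypothetical. The inner loop runs serially here; on a GPU each sample would be its own thread.

```javascript
// Yields one window of duration bounds at a time, reusing the same two buffers,
// so nothing outside the current window is ever materialised.
function* windows(totalAtoms, windowSize) {
  const min = new Float32Array(windowSize);
  const max = new Float32Array(windowSize);
  for (let start = 0; start < totalAtoms; start += windowSize) {
    const count = Math.min(windowSize, totalAtoms - start);
    for (let i = 0; i < count; i++) {
      min[i] = 1; // placeholder bounds; a real source would stream these in
      max[i] = 3;
    }
    yield { count, min, max };
  }
}

function simulate(totalAtoms, windowSize, samples, rand = Math.random) {
  // Per-sample finish time at the current window boundary.
  const carry = new Float64Array(samples);
  let peakResident = 0; // largest number of array elements held at once
  for (const w of windows(totalAtoms, windowSize)) {   // outer loop: memory
    peakResident = Math.max(peakResident, w.min.length + w.max.length + carry.length);
    for (let s = 0; s < samples; s++) {                // inner axis: independent samples
      let t = carry[s];
      for (let i = 0; i < w.count; i++) {
        t += w.min[i] + rand() * (w.max[i] - w.min[i]);
      }
      carry[s] = t;
    }
  }
  return { finish: carry, peakResident };
}

const small = simulate(3000, 1000, 8);
const large = simulate(300000, 1000, 8);
console.log(small.peakResident === large.peakResident); // true
```

Resident size depends on the window size and the sample count, not on the total atom count; only the running time grows with the network.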

One citizen, stored as columns

The atom

There is exactly one type. An atom is identity + state + the gates it holds — its inputs only. It never tracks what depends on it; that's derived by inverting the inputs once. For speed and GPU-portability, atoms aren't objects — they're columns of typed arrays. This is the form the engine actually runs.
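As an illustration of "columns, not objects", here is one possible layout for four atoms, plus the one-time inversion that derives dependents from inputs. The column names and the CSR (compressed sparse row) encoding of gates are assumptions of this sketch, not the engine's actual schema.

```javascript
// Hypothetical column layout: one typed array per field, indexed by atom.
const atomCount = 4;
const columns = {
  state:  new Uint8Array(atomCount),           // e.g. 0 = waiting, 1 = done
  minDur: new Float32Array([2, 1, 1, 1]),
  maxDur: new Float32Array([4, 2, 3, 2]),
  // Gates in CSR form: atom i's inputs are gateSrc[gateStart[i] .. gateStart[i + 1]).
  // Here: atom 0 has no inputs, atoms 1 and 2 wait on 0, atom 3 waits on 1 and 2.
  gateStart: new Uint32Array([0, 0, 1, 2, 4]),
  gateSrc:   new Uint32Array([0, 0, 1, 2]),
};

// Invert inputs into dependents with a counting sort. Done once, never stored on the atom.
function invertGates(gateStart, gateSrc, n) {
  const outStart = new Uint32Array(n + 1);
  for (let k = 0; k < gateSrc.length; k++) outStart[gateSrc[k] + 1]++; // count dependents per atom
  for (let i = 0; i < n; i++) outStart[i + 1] += outStart[i];          // prefix sum gives row offsets
  const cursor = outStart.slice(0, n); // next free slot in each row
  const outDst = new Uint32Array(gateSrc.length);
  for (let i = 0; i < n; i++) {
    for (let k = gateStart[i]; k < gateStart[i + 1]; k++) {
      outDst[cursor[gateSrc[k]]++] = i; // atom i depends on gateSrc[k]
    }
  }
  return { outStart, outDst };
}

const { outStart, outDst } = invertGates(columns.gateStart, columns.gateSrc, atomCount);
console.log(Array.from(outStart)); // [0, 2, 3, 4, 4]
console.log(Array.from(outDst));   // [1, 2, 3, 3]
```

Flat typed arrays like these can be copied into GPU buffers as they are, which is the portability argument made above.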

The exact shader that runs in tab 01

GPU compute kernel · WGSL

Not a mockup. One thread = one full Monte Carlo sample, walking the entire window: real gates via CSR (compressed sparse row) arrays, real dark-blocking via a −1 sentinel (a finish time can never be negative, so −1 means "missing: no data from the world"), real conjunctive deadness (one dark input is enough to make an atom dark). Same computation as the JS path, so the two P80s must agree within Monte Carlo noise.
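The shader itself is not reproduced here. As a CPU-side stand-in, the following sketch shows what one thread's work could look like in JS: CSR gates, the −1 sentinel, and a nearest-rank P80 across samples. The window fields (`known`, `gateStart`, `gateSrc`, `minDur`, `maxDur`), the uniform draw, and the assumption that atoms arrive in topological order with window-local indices are all illustrative.

```javascript
const DARK = -1; // sentinel: a real finish time can never be negative

// One sample over one window. finish[i] receives atom i's finish time, or DARK.
function sampleWindow(w, finish, rand) {
  for (let i = 0; i < w.count; i++) {
    let ready = 0;
    let dead = w.known[i] === 0; // no data from the world
    for (let k = w.gateStart[i]; k < w.gateStart[i + 1] && !dead; k++) {
      const f = finish[w.gateSrc[k]];
      if (f === DARK) dead = true;   // conjunctive: one dark input blocks the atom
      else if (f > ready) ready = f; // otherwise wait for the slowest input
    }
    finish[i] = dead ? DARK : ready + w.minDur[i] + rand() * (w.maxDur[i] - w.minDur[i]);
  }
}

// Nearest-rank 80th percentile; DARK if any sample is dark.
function p80(values) {
  if (values.some((v) => v === DARK)) return DARK;
  const sorted = Float64Array.from(values).sort(); // typed arrays sort numerically
  return sorted[Math.ceil((4 * sorted.length) / 5) - 1];
}

// Hypothetical four-atom window: 1 and 2 wait on 0, 3 waits on 1 and 2.
const w = {
  count: 4,
  known:     new Uint8Array([1, 1, 1, 1]),
  gateStart: new Uint32Array([0, 0, 1, 2, 4]),
  gateSrc:   new Uint32Array([0, 0, 1, 2]),
  minDur:    new Float32Array([2, 1, 1, 1]),
  maxDur:    new Float32Array([4, 2, 3, 2]),
};

const samples = 1000;
const ends = new Float64Array(samples);
const finish = new Float32Array(w.count); // per-thread scratch on the GPU
for (let s = 0; s < samples; s++) {
  sampleWindow(w, finish, Math.random);
  ends[s] = finish[3];
}
const projectP80 = p80(ends);

// Remove the data behind atom 1 and the end date goes dark.
w.known[1] = 0;
sampleWindow(w, finish, Math.random);
const darkEnd = finish[3]; // -1
```

A float sentinel keeps the finish column a plain `Float32Array`, so the same buffer layout works on both the CPU and the GPU.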

The API that drives it

WebGPU dispatch

Adapter → device → buffers → bind group → dispatch → read back. Windows tile into passes so per-thread scratch never exceeds GPU memory. That tiling is the honest answer to the memory wall.
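The sequence above, sketched with standard WebGPU calls. Two caveats: the WGSL string is a placeholder kernel, not the page's shader, and `planPasses` encodes one reading of the tiling sentence (scratch grows with atoms per pass times threads, so a window is cut into atom ranges that fit a byte budget). The names `planPasses` and `runPass` and the 4-bytes-per-atom figure are assumptions.

```javascript
// Placeholder kernel: each thread just records its own index.
const WGSL = `
@group(0) @binding(0) var<storage, read_write> out : array<f32>;
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id : vec3<u32>) {
  if (id.x < arrayLength(&out)) { out[id.x] = f32(id.x); }
}`;

// Cut a window into atom ranges whose scratch (threads * bytesPerAtom per atom) fits the budget.
function planPasses(windowAtoms, threads, bytesPerAtom, budgetBytes) {
  const perPass = Math.max(1, Math.floor(budgetBytes / (threads * bytesPerAtom)));
  const passes = [];
  for (let first = 0; first < windowAtoms; first += perPass) {
    passes.push({ first, count: Math.min(perPass, windowAtoms - first) });
  }
  return passes;
}

// Browser-only: requires navigator.gpu.
async function runPass(threads) {
  const adapter = await navigator.gpu.requestAdapter();            // adapter
  if (!adapter) throw new Error("WebGPU adapter unavailable");
  const device = await adapter.requestDevice();                    // device
  const bytes = threads * 4;
  const storage = device.createBuffer({                            // buffers
    size: bytes,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  });
  const staging = device.createBuffer({
    size: bytes,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });
  const pipeline = device.createComputePipeline({
    layout: "auto",
    compute: { module: device.createShaderModule({ code: WGSL }), entryPoint: "main" },
  });
  const bindGroup = device.createBindGroup({                       // bind group
    layout: pipeline.getBindGroupLayout(0),
    entries: [{ binding: 0, resource: { buffer: storage } }],
  });
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(threads / 64));                // dispatch
  pass.end();
  encoder.copyBufferToBuffer(storage, 0, staging, 0, bytes);
  device.queue.submit([encoder.finish()]);
  await staging.mapAsync(GPUMapMode.READ);                         // read back
  const result = new Float32Array(staging.getMappedRange().slice(0));
  staging.unmap();
  return result;
}

// 10,000 atoms, 1,024 threads, 4 bytes of scratch per atom per thread, 16 MiB budget.
const plan = planPasses(10000, 1024, 4, 16 * 1024 * 1024);
```

Storage buffers cannot be mapped for reading directly, which is why the result is copied into a separate `MAP_READ` staging buffer before `mapAsync`.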