Does it scale? On a GPU?
All-resident solvers (JS, GPU) hold the whole network in memory, so they hit a wall in the low millions of atoms (marked in red). Fused streaming + GPU holds only a moving window and ships each window to the GPU: it simulates all the way to 3 billion atoms while memory stays flat. Simulate the whole thing; keep only the active window; spill the rest.
A constraint with no data kills everything behind it
An atom is ready when every gate it holds has flipped, that is, when all its predecessors have finished. Its own work then takes a sampled duration. A dark atom (no data from the world) never finishes, and everything downstream goes dark with it. That's the engine refusing to fake a date.
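The readiness rule can be sketched as a small event loop. This is a minimal illustration, not the engine's code: `propagate`, `preds`, and `durations` are hypothetical names, durations are fixed numbers rather than sampled, and a dark atom is marked by a `null` duration. Each atom counts the gates it still waits on; a finishing predecessor flips one gate, and a dark atom flips none, so nothing behind it ever becomes ready.

```javascript
// Sketch only: names and the null-means-dark convention are assumptions.
// preds[i]     = indices of the atoms that atom i waits on (its gates)
// durations[i] = time atom i takes, or null if there is no data (dark)
function propagate(preds, durations) {
  const n = preds.length;
  // Derive "who depends on me" by inverting the inputs once.
  const dependents = Array.from({ length: n }, () => []);
  const gatesLeft = new Array(n);
  for (let i = 0; i < n; i++) {
    gatesLeft[i] = preds[i].length;
    for (const p of preds[i]) dependents[p].push(i);
  }

  const start = new Array(n).fill(0);
  const finish = new Array(n).fill(null); // null = never finishes
  const ready = [];
  for (let i = 0; i < n; i++) if (gatesLeft[i] === 0) ready.push(i);

  while (ready.length > 0) {
    const atom = ready.pop();
    if (durations[atom] === null) continue; // dark: flips no gates downstream
    finish[atom] = start[atom] + durations[atom];
    for (const d of dependents[atom]) {
      start[d] = Math.max(start[d], finish[atom]); // wait for the latest predecessor
      if (--gatesLeft[d] === 0) ready.push(d);     // last gate flipped: ready
    }
  }
  return finish;
}
```

With `preds = [[], [0], [1], [0]]` and atom 1 dark, atoms 1 and 2 never finish, while atom 3, which depends only on atom 0, still gets a finish time.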
Streaming for memory, samples for speed
The outer loop walks the network in windows. Only one window is ever in memory, so a 3-billion-atom project costs the same RAM as a 3-thousand-atom one. The inner axis is the samples: every Monte Carlo run is independent, so each window ships to the GPU as thousands of parallel threads. The flat memory comes from the outer loop; the speed comes from the inner one.
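The two loops can be sketched on the CPU under a strong simplifying assumption: the network is a single serial chain, so the only state that crosses a window boundary is one running finish time per sample. `simulateStreaming` and `sampleDuration` are hypothetical names; a real engine would carry a richer frontier between windows and run the inner loop as GPU threads.

```javascript
// Sketch only: assumes a serial chain, so one carried value per sample suffices.
// sampleDuration(atomIndex, sampleIndex) is a caller-supplied (hypothetical) sampler.
function simulateStreaming(totalAtoms, windowSize, sampleCount, sampleDuration) {
  const carry = new Float64Array(sampleCount);  // finish time entering the window
  const scratch = new Float64Array(windowSize); // reused per-window finish times

  // Outer loop: one window at a time; memory is bounded by windowSize.
  for (let start = 0; start < totalAtoms; start += windowSize) {
    const count = Math.min(windowSize, totalAtoms - start);

    // Inner axis: samples are independent, so this loop could be parallel threads.
    for (let s = 0; s < sampleCount; s++) {
      let prev = carry[s];
      for (let i = 0; i < count; i++) {
        prev += sampleDuration(start + i, s);
        scratch[i] = prev;
      }
      carry[s] = scratch[count - 1]; // only the frontier survives the window
    }
  }
  return carry; // project finish time per sample
}
```

Allocation depends on `windowSize` and `sampleCount` only, never on `totalAtoms`, and the result is the same whatever window size is chosen.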
The atom
There is exactly one type. An atom is identity + state + the gates it holds — its inputs only. It never tracks what depends on it; that's derived by inverting the inputs once. For speed and GPU-portability, atoms aren't objects — they're columns of typed arrays. This is the form the engine actually runs.
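One way to picture the column layout, as a sketch with hypothetical names (`gateStart`, `gatePred`, `invertGates`): each atom's gates are the slice `gatePred[gateStart[i] .. gateStart[i + 1])`, the usual CSR form. The dependents are never stored; they are derived by a single counting pass that inverts the inputs.

```javascript
// Sketch only: column names are assumptions, not the engine's actual schema.
// Four atoms: 0 has no gates, 1 and 2 wait on 0, 3 waits on 1 and 2.
const atoms = {
  id: new Uint32Array([100, 101, 102, 103]),     // identity
  state: new Uint8Array(4),                       // e.g. 0 = waiting
  gateStart: new Uint32Array([0, 0, 1, 2, 4]),    // CSR row offsets (n + 1 entries)
  gatePred: new Uint32Array([0, 0, 1, 2]),        // predecessor index per gate
};

// Invert inputs -> dependents, also in CSR form, in O(atoms + gates).
function invertGates(gateStart, gatePred) {
  const n = gateStart.length - 1;
  const depStart = new Uint32Array(n + 1);
  // Count how many atoms depend on each predecessor.
  for (let g = 0; g < gatePred.length; g++) depStart[gatePred[g] + 1]++;
  // Prefix sum turns the counts into row offsets.
  for (let i = 0; i < n; i++) depStart[i + 1] += depStart[i];
  // Fill each predecessor's row with the atoms that hold a gate on it.
  const cursor = depStart.slice(0, n);
  const depAtom = new Uint32Array(gatePred.length);
  for (let i = 0; i < n; i++) {
    for (let g = gateStart[i]; g < gateStart[i + 1]; g++) {
      depAtom[cursor[gatePred[g]]++] = i;
    }
  }
  return { depStart, depAtom };
}

const { depStart, depAtom } = invertGates(atoms.gateStart, atoms.gatePred);
```

Flat typed arrays like these can be copied into GPU buffers byte for byte, which is the portability argument made above.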
GPU compute kernel · WGSL
Not a mockup. One thread = one full Monte Carlo sample, walking the entire window: real gates via CSR arrays, real dark-blocking via a −1 sentinel (a finish time can never be negative, so −1 means "missing, no data from the world"), and real conjunctive deadness (an atom is dead if any one of its gates is dead). It is the same computation as the JS path, so the P80s must agree within Monte Carlo noise.
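A CPU-side sketch of what one thread does, written in JavaScript rather than WGSL. It assumes atoms are stored in topological order and that a dark atom is flagged by a negative duration; both are assumptions of this example, as are the names `walkSample` and `p80`. The percentile uses the nearest-rank rule and reports −1 if any sample is dark instead of inventing a date.

```javascript
// Sketch only: one call = one Monte Carlo sample over one window.
// Assumes topological order, so every predecessor's finish is already written.
function walkSample(gateStart, gatePred, duration, finish) {
  const n = duration.length;
  for (let i = 0; i < n; i++) {
    let start = 0;
    let dead = duration[i] < 0; // assumption: negative duration marks a dark atom
    for (let g = gateStart[i]; g < gateStart[i + 1] && !dead; g++) {
      const f = finish[gatePred[g]];
      if (f < 0) dead = true;         // any dead gate kills the atom
      else if (f > start) start = f;  // otherwise wait for the latest gate
    }
    finish[i] = dead ? -1 : start + duration[i]; // -1 sentinel = missing
  }
  return finish[n - 1]; // finish time of the last atom in the window
}

// Nearest-rank 80th percentile; -1 if any sample is dark.
function p80(samples) {
  const sorted = Float64Array.from(samples).sort(); // typed arrays sort numerically
  if (sorted.length === 0 || sorted[0] < 0) return -1;
  return sorted[Math.ceil((sorted.length * 4) / 5) - 1];
}
```

Running the same arrays through a GPU kernel and this function, then comparing the two `p80` values, is one way to check the agreement the text asks for.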
WebGPU dispatch
Adapter → device → buffers → bind group → dispatch → read back. Windows tile into passes so per-thread scratch never exceeds GPU memory. That tiling is the honest answer to the memory wall.
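The sequence above maps onto the standard WebGPU calls roughly as follows. This is a sketch under stated assumptions, not the page's actual code: the shader is assumed to expose a `main` entry point with `@workgroup_size(64)`, an input storage buffer at binding 0 and an output buffer of one `f32` per thread at binding 1. `planPasses` shows one plausible reading of the tiling, splitting the samples into passes so that total per-thread scratch fits a byte budget; its name and the budget parameter are hypothetical.

```javascript
// Sketch only: split samples into passes so scratch stays within budgetBytes.
function planPasses(sampleCount, atomsPerWindow, budgetBytes) {
  const scratchPerThread = atomsPerWindow * 4; // assumes one f32 finish time per atom
  const threadsPerPass = Math.floor(budgetBytes / scratchPerThread);
  if (threadsPerPass < 1) throw new Error("window too large for the scratch budget");
  const passes = [];
  for (let first = 0; first < sampleCount; first += threadsPerPass) {
    passes.push({ firstSample: first, threadCount: Math.min(threadsPerPass, sampleCount - first) });
  }
  return passes;
}

// Adapter -> device -> buffers -> bind group -> dispatch -> read back.
// Requires a browser with WebGPU; wgslSource and its bindings are assumptions.
async function dispatchWindow(wgslSource, input, threadCount) {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU is not available");
  const device = await adapter.requestDevice();

  const outBytes = threadCount * 4;
  const inBuf = device.createBuffer({
    size: input.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  const outBuf = device.createBuffer({
    size: outBytes,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  });
  const readBuf = device.createBuffer({
    size: outBytes,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(inBuf, 0, input);

  const shaderModule = device.createShaderModule({ code: wgslSource });
  const pipeline = device.createComputePipeline({
    layout: "auto",
    compute: { module: shaderModule, entryPoint: "main" },
  });
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: inBuf } },
      { binding: 1, resource: { buffer: outBuf } },
    ],
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(threadCount / 64)); // 64 = assumed workgroup size
  pass.end();
  encoder.copyBufferToBuffer(outBuf, 0, readBuf, 0, outBytes);
  device.queue.submit([encoder.finish()]);

  await readBuf.mapAsync(GPUMapMode.READ);
  const result = new Float32Array(readBuf.getMappedRange().slice(0)); // copy before unmap
  readBuf.unmap();
  return result;
}
```

For example, 10,000 samples over a 1,000-atom window with a 16 MB scratch budget come out as three passes of 4,000, 4,000 and 2,000 threads.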