Multithreading with Wasm
How Multithreading Works in WebAssembly
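JavaScript in the browser runs on a single main thread by default. Web Workers supply the additional threads, and to run Wasm code in parallel across them the platform provides two building blocks for sharing state, SharedArrayBuffer and Atomics: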
Main Thread Web Workers (Thread Pool)
┌─────────────────┐ ┌─────────────────┐
│ JavaScript │ │ Worker 1 │
│ + Wasm instance │ │ Wasm instance │
│ │ spawn │ (shared memory) │
│ postMessage() ──┼───────>│ │
│ │ └─────────────────┘
│ │ ┌─────────────────┐
│ │ │ Worker 2 │
│ SharedArray ───┼───────>│ Wasm instance │
│ Buffer │ │ (shared memory) │
│ │ └─────────────────┘
│ │ ┌─────────────────┐
│ │ │ Worker 3 │
│ Atomics.wait() │ │ Wasm instance │
│ Atomics.notify()│<──────>│ Atomics ops │
└─────────────────┘ └─────────────────┘
SharedArrayBuffer
SharedArrayBuffer is the key primitive. Unlike ArrayBuffer, it can be shared between the main thread and workers without copying. Wasm's linear memory can be backed by a SharedArrayBuffer, giving all threads access to the same memory:
// Creating shared Wasm memory (each Wasm page is 64 KiB)
const memory = new WebAssembly.Memory({
  initial: 256,  // 256 pages (16 MiB)
  maximum: 4096, // 4096 pages (256 MiB); shared memory must declare a maximum
  shared: true,  // ← This enables SharedArrayBuffer backing
});
Atomics
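The Atomics API provides low-level synchronization primitives that map directly to Wasm's atomic instructions (the i32.atomic.* family for loads, stores, and read-modify-write operations, and memory.atomic.* for wait and notify). In browsers, the blocking Atomics.wait() is only permitted inside workers; calling it on the main thread throws a TypeError. The main correspondences are: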
| Atomics Method | Purpose | Wasm Equivalent |
|---|---|---|
| Atomics.load() | Read shared value | i32.atomic.load |
| Atomics.store() | Write shared value | i32.atomic.store |
| Atomics.add() | Atomic increment | i32.atomic.rmw.add |
| Atomics.compareExchange() | CAS operation | i32.atomic.rmw.cmpxchg |
| Atomics.wait() | Block until notified (futex) | memory.atomic.wait32 |
| Atomics.notify() | Wake waiting threads | memory.atomic.notify |
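As a quick illustration of the operations in the table, the following sketch runs them on a single thread against a one-page shared Wasm memory. The variable names (`cells`, `previous`, and so on) are made up for this example, and the snippet assumes an environment where shared memory is available (a cross-origin isolated page or a recent Node.js).

```javascript
// One page (64 KiB) of shared linear memory, viewed as 32-bit integers.
const memory = new WebAssembly.Memory({ initial: 1, maximum: 1, shared: true });
const cells = new Int32Array(memory.buffer);

Atomics.store(cells, 0, 10);                // atomic write: cells[0] = 10
const previous = Atomics.add(cells, 0, 5);  // returns the old value; cells[0] is now 15

// compareExchange writes only if the cell still holds the expected value,
// and always returns the value it actually found there.
const swapped = Atomics.compareExchange(cells, 0, 15, 99);  // expected matches: writes 99
const rejected = Atomics.compareExchange(cells, 0, 15, 7);  // cell holds 99: no write

const current = Atomics.load(cells, 0);     // atomic read
const woken = Atomics.notify(cells, 0, 1);  // count of waiters woken; none are waiting here
```

Both Atomics.add() and Atomics.compareExchange() return the value the cell held before the operation, which is how a caller can tell whether a CAS attempt succeeded.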
wasm-bindgen-rayon: Parallel Iterators in Wasm
The rayon crate provides data-parallel iterators in Rust. The wasm-bindgen-rayon adapter bridges rayon's thread pool to Web Workers:
// Cargo.toml
// [dependencies]
// rayon = "1.8"
// wasm-bindgen = "0.2"
// wasm-bindgen-rayon = "1.2"
use rayon::prelude::*;
use wasm_bindgen::prelude::*;
#[wasm_bindgen]
pub fn parallel_sum(data: &[f64]) -> f64 {
    data.par_iter().sum()
}
#[wasm_bindgen]
pub fn parallel_mandelbrot(width: u32, height: u32) -> Vec<u8> {
    // compute_pixel(x, y, width, height) -> u8 is assumed to be defined elsewhere in the crate
    (0..height)
        .into_par_iter() // ← rows are processed in parallel
        // flat_map_iter (not flat_map) because the per-row iterator is a serial one
        .flat_map_iter(|y| (0..width).map(move |x| compute_pixel(x, y, width, height)))
        .collect()
}
Setup
// lib.rs: re-export this so JavaScript can start the thread pool before any rayon call
pub use wasm_bindgen_rayon::init_thread_pool;
// JavaScript side
import init, { initThreadPool, parallel_sum } from './pkg/my_crate';
async function main() {
  await init();
  await initThreadPool(navigator.hardwareConcurrency);
  // Now rayon's par_iter() uses Web Workers!
  const result = parallel_sum(new Float64Array([1, 2, 3, 4, 5]));
}
Build Configuration
# .cargo/config.toml
[target.wasm32-unknown-unknown]
rustflags = ["-C", "target-feature=+atomics,+bulk-memory,+mutable-globals"]
# build-std is an unstable Cargo feature, so this needs a nightly toolchain
[unstable]
build-std = ["panic_abort", "std"]
COOP/COEP Headers (Required!)
SharedArrayBuffer is gated behind Cross-Origin Isolation. Your server must send these headers:
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
Browser Security Model:
Without headers:          With headers:
┌───────────────────┐     ┌───────────────────┐
│ SharedArrayBuffer │     │ SharedArrayBuffer │
│ BLOCKED           │     │ ALLOWED           │
│                   │     │                   │
│ "ReferenceError:  │     │ Cross-origin      │
│ SharedArray       │     │ isolated = true   │
│ Buffer is not     │     │                   │
│ defined"          │     │ COOP: same-origin │
└───────────────────┘     │ COEP: require-corp│
                          └───────────────────┘
Server configuration examples
# Nginx
add_header Cross-Origin-Opener-Policy same-origin;
add_header Cross-Origin-Embedder-Policy require-corp;
# Apache (.htaccess)
Header set Cross-Origin-Opener-Policy "same-origin"
Header set Cross-Origin-Embedder-Policy "require-corp"
# Vite (vite.config.ts)
export default {
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp',
    },
  },
};
When Threading Helps vs Hurts
Threading adds overhead (worker creation, synchronization, message passing). It only helps when the work is large enough to amortize that cost:
Speedup
│
│                      ● ideal (linear)
│                  ●╱
│              ●╱          ●── real (Amdahl's law)
│          ●╱        ●
│       ●╱      ●
│    ●╱    ●               ●── diminishing returns
│  ╱●
│●
├───────────────────── Number of threads
1    2    4    8    16
Amdahl's Law: Speedup = 1 / (S + P/N)
S = serial fraction
P = parallel fraction (S + P = 1)
N = number of threads
| Workload | Threads Help? | Why |
|---|---|---|
| Mandelbrot rendering (1024x1024) | Yes | Embarrassingly parallel, CPU-heavy |
| Image filter (large image) | Yes | Each pixel independent |
| Sorting 1000 items | No | Too small, overhead > savings |
| JSON parsing | No | Mostly sequential |
| Matrix multiplication (large) | Yes | Divide rows across threads |
| DOM manipulation | No | Must happen on main thread |
| Physics simulation (1000+ bodies) | Yes | Each body update is independent |
| SHA-256 of a small string | No | Sequential algorithm, tiny input |
Rule of thumb: if the sequential work takes less than ~5ms, threading overhead will eat any gains.
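To make Amdahl's law and the rule of thumb concrete, here is a small sketch that evaluates the formula and then adds a fixed overhead term. The function names and the 5 ms overhead figure are assumptions chosen for illustration, not measurements of any browser; the overhead model (a constant added to the parallel run time) is a deliberate simplification.

```javascript
// Amdahl's law: speedup = 1 / (S + P / N), with P = 1 - S.
function amdahlSpeedup(serialFraction, threads) {
  const parallelFraction = 1 - serialFraction;
  return 1 / (serialFraction + parallelFraction / threads);
}

// Hypothetical extension: charge a fixed overhead (worker wake-up, synchronization)
// on top of the parallel run time. The overhead value is an assumption.
function speedupWithOverhead(workMs, serialFraction, threads, overheadMs) {
  const parallelMs = workMs / amdahlSpeedup(serialFraction, threads) + overheadMs;
  return workMs / parallelMs;
}

const ideal8 = amdahlSpeedup(0.1, 8);                  // 10% serial work on 8 threads
const smallJob = speedupWithOverhead(5, 0.1, 4, 5);    // 5 ms of work, 5 ms overhead
const largeJob = speedupWithOverhead(500, 0.1, 4, 5);  // 500 ms of work, same overhead
console.log(ideal8.toFixed(2), smallJob.toFixed(2), largeJob.toFixed(2));
// prints "4.71 0.75 2.99"
```

Under these assumed numbers the 5 ms job gets slower when threaded (a speedup below 1), while the 500 ms job comes close to the roughly 3.08x that Amdahl's law allows for 4 threads with a 10% serial fraction.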
Thread Pool Architecture
┌─────────────────────────────────────────────────────────┐
│ Main Thread │
│ │
│ 1. initThreadPool(4) │
│ ├── spawn Worker 1 ──┐ │
│ ├── spawn Worker 2 ──┤ Workers load same .wasm │
│ ├── spawn Worker 3 ──┤ with shared memory │
│ └── spawn Worker 4 ──┘ │
│ │
│ 2. parallel_sum(data) │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ rayon work-stealing scheduler │ │
│ │ │ │
│ │ Task queue: [chunk1][chunk2]... │ │
│ │ │ │
│ │ W1 ← steal ← W2 ← steal ← W3 │ │
│ │ │ │
│ │ Each worker processes chunks from │ │
│ │ shared memory via atomic load/store │ │
│ └──────────────────────────────────────┘ │
│ │
│ 3. Result returned to main thread │
└─────────────────────────────────────────────────────────┘
Browser Support
| Browser | SharedArrayBuffer | Wasm Threads | Status |
|---|---|---|---|
| Chrome 91+ | Yes | Yes | Full support |
| Firefox 79+ | Yes | Yes | Full support |
| Safari 15.2+ | Yes | Yes | Full support |
| Edge 91+ | Yes | Yes | Full support (Chromium) |
| Node.js 16+ | Yes | Yes | Enabled by default (older releases needed --experimental-wasm-threads) |
| Deno 1.9+ | Yes | Yes | Full support |
Feature detection
function supportsWasmThreads() {
  try {
    // Check SharedArrayBuffer (throws a ReferenceError if it is not exposed)
    new SharedArrayBuffer(1);
    // Check Atomics
    if (typeof Atomics === 'undefined') return false;
    // Check cross-origin isolation; runtimes that do not define the flag skip this check
    if (typeof crossOriginIsolated !== 'undefined' && !crossOriginIsolated) return false;
    // Check Wasm threads support
    const mem = new WebAssembly.Memory({ initial: 1, maximum: 1, shared: true });
    return mem.buffer instanceof SharedArrayBuffer;
  } catch {
    return false;
  }
}
Practical Example: Parallel Image Processing
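use rayon::prelude::*;
use wasm_bindgen::prelude::*;
#[wasm_bindgen]
pub fn blur_image(pixels: &mut [u8], width: usize, height: usize, radius: usize) {
    // Bail out on a zero width or a buffer that is not width * height RGBA pixels
    // (par_chunks_mut panics on a chunk size of 0, and the indexing below assumes this layout)
    if width == 0 || pixels.len() != width * height * 4 {
        return;
    }
    // Read from an untouched copy so already-blurred rows never feed into later ones
    let input = pixels.to_vec();
    // Process rows in parallel: each chunk is one row of RGBA pixels
    pixels
        .par_chunks_mut(width * 4)
        .enumerate()
        .for_each(|(y, row)| {
            for x in 0..width {
                let (mut r, mut g, mut b, mut count) = (0u32, 0u32, 0u32, 0u32);
                // Average the (2 * radius + 1)^2 neighborhood, clamping at the image edges
                for dy in -(radius as i32)..=(radius as i32) {
                    for dx in -(radius as i32)..=(radius as i32) {
                        let ny = (y as i32 + dy).clamp(0, height as i32 - 1) as usize;
                        let nx = (x as i32 + dx).clamp(0, width as i32 - 1) as usize;
                        let idx = (ny * width + nx) * 4;
                        r += input[idx] as u32;
                        g += input[idx + 1] as u32;
                        b += input[idx + 2] as u32;
                        count += 1;
                    }
                }
                let idx = x * 4;
                row[idx] = (r / count) as u8;
                row[idx + 1] = (g / count) as u8;
                row[idx + 2] = (b / count) as u8;
                // Alpha unchanged
            }
        });
}
Summary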
Wasm multithreading uses SharedArrayBuffer for shared memory, Atomics for synchronization, and Web Workers as the thread pool. The wasm-bindgen-rayon crate makes it feel like writing normal Rust parallel code: use par_iter() instead of iter(). Remember to set the COOP/COEP headers, build with +atomics (plus +bulk-memory and +mutable-globals), and only parallelize workloads that are large enough to overcome the threading overhead. As of 2024, current versions of the major browsers support Wasm threads.