Computer architecture · final project
CPU vs GPU
Two architectures, three workloads, one wall-clock.
Cores: 8 P · 16 T
L3: 36 MB
SMs: 170
VRAM: 32 GB
Three problems, three parallelism patterns
Each task isolates a different reason GPUs are fast — or aren't.
01 · Embarrassingly parallel · Mandelbrot
Per-pixel iteration with no data dependencies. The cleanest possible win for GPUs — every thread runs the same kernel on independent data.
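To make the pattern concrete, a minimal sketch of a per-pixel kernel; the kernel name, viewport mapping, and iteration cap are illustrative, not the project's actual code:

```cuda
// Sketch of the per-pixel pattern; the viewport mapping and maxIter
// threshold are illustrative, not the project's actual parameters.
__global__ void mandelbrot(int *iters, int width, int height, int maxIter) {
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    // Map the pixel into the complex plane, roughly [-2,1] x [-1.5,1.5].
    float cr = -2.0f + 3.0f * px / width;
    float ci = -1.5f + 3.0f * py / height;

    float zr = 0.0f, zi = 0.0f;
    int i = 0;
    while (zr * zr + zi * zi < 4.0f && i < maxIter) {
        float t = zr * zr - zi * zi + cr;   // z <- z^2 + c
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++i;
    }
    // Each thread writes exactly one pixel and reads no other pixel.
    iters[py * width + px] = i;
}
```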
02 · Reduction · Dot Product
Element-wise multiply followed by a global sum. Forces a tree-style merge — naive GPU code lags until you use warp-level primitives.
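The kind of warp-level merge that closes the gap, sketched with `__shfl_down_sync`; the kernel name and grid-stride setup are illustrative, and the sketch assumes a block size that is a multiple of 32 and an output zeroed before launch:

```cuda
// Sketch of a warp-level tree merge; assumes blockDim.x is a
// multiple of 32 and *out is zeroed before the kernel launches.
__device__ float warpReduceSum(float v) {
    // Each step halves the active lanes: 16, 8, 4, 2, 1.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;   // lane 0 ends up holding the warp's sum
}

__global__ void dotPartial(const float *a, const float *b, float *out, int n) {
    float sum = 0.0f;
    // Grid-stride loop: each thread accumulates its share of a[i] * b[i].
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += a[i] * b[i];

    sum = warpReduceSum(sum);
    // One atomicAdd per warp instead of one per element.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, sum);
}
```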
03 · Stencil · Heat Equation
Each cell reads four neighbors per step. Memory-bound; shared-memory tiling is what unlocks the GPU here.
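The tiling move, sketched below: each block stages its tile plus a one-cell halo in shared memory so the four neighbor reads hit on-chip storage instead of DRAM. The tile size, kernel name, and fixed-boundary handling are assumptions, and the grid size is assumed to be a multiple of the tile width:

```cuda
// Sketch of shared-memory tiling for the 5-point stencil; TILE and
// names are illustrative, and n is assumed to be a multiple of TILE.
#define TILE 16

__global__ void heatStep(const float *in, float *out, int n, float alpha) {
    __shared__ float s[TILE + 2][TILE + 2];   // tile plus a 1-cell halo

    int x = blockIdx.x * TILE + threadIdx.x;        // global column
    int y = blockIdx.y * TILE + threadIdx.y;        // global row
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1; // local, inside the halo

    s[ly][lx] = in[y * n + x];
    // Edge threads of the block also stage the halo cells they border.
    if (threadIdx.x == 0        && x > 0)     s[ly][0]        = in[y * n + x - 1];
    if (threadIdx.x == TILE - 1 && x < n - 1) s[ly][TILE + 1] = in[y * n + x + 1];
    if (threadIdx.y == 0        && y > 0)     s[0][lx]        = in[(y - 1) * n + x];
    if (threadIdx.y == TILE - 1 && y < n - 1) s[TILE + 1][lx] = in[(y + 1) * n + x];
    __syncthreads();

    if (x > 0 && x < n - 1 && y > 0 && y < n - 1)
        // All four neighbor reads now hit shared memory, not DRAM.
        out[y * n + x] = s[ly][lx] + alpha *
            (s[ly][lx - 1] + s[ly][lx + 1] + s[ly - 1][lx] + s[ly + 1][lx]
             - 4.0f * s[ly][lx]);
    else
        out[y * n + x] = s[ly][lx];   // fixed boundary carried forward
}
```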
The serial side · CPU
Fewer cores, deeper pipelines.
The CPU side of every speedup number on this site. It mirrors the composition the GPU gets in the hero: model on the left, so the two chips face each other across the page.
Built to chase the next instruction
- Out-of-order, speculative, branch-prediction heavy.
- Wins when the next computation depends on the last one.
- Our OpenMP runs use all 16 threads; the baseline uses one. A sketch of the parallel-for pattern follows this list.
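A minimal sketch of that baseline pattern, using the dot product as the workload; the function name is illustrative, and the single-threaded baseline is the same loop without the pragma:

```cpp
// Sketch of the parallel-for baseline; compile with -fopenmp.
// The single-threaded baseline is this exact loop minus the pragma.
double dotProduct(const float *a, const float *b, long n) {
    double sum = 0.0;
    // reduction(+:sum) gives each thread a private accumulator and
    // merges them once at the end, avoiding a shared-counter bottleneck.
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i)
        sum += (double)a[i] * b[i];
    return sum;
}
```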
Cores / threads: 8 / 16
Boost clock: 5.4 GHz
L3 cache: 36 MB
Tools we used
CUDA · GPU kernels, shared memory, warp shuffles
OpenMP · CPU parallel-for baselines on 8 cores
C++17 · Both CPU and GPU host code
Next.js 14 · App Router + TypeScript for this report
Recharts · Log-log timing plots
Key findings
The full breakdown is on the benchmarks page — these are the ones that surprised us.
98× · Mandelbrot speedup · optimized GPU vs single-threaded CPU at 4096×4096
40× · Dot product speedup · optimized GPU vs single-threaded CPU at 1B elements
32× · Heat equation speedup · tiled GPU vs single-threaded CPU at 2048×2048, 1000 steps
7× · Naive → optimized GPU · the reduction kernel gains the most from warp-level merging