Where does the GPU start winning?
A crossover point is the input size at which the GPU implementation becomes faster than the best CPU implementation. Below the crossover, the CPU wins because kernel-launch overhead and host↔device memory transfer cost more than the actual computation. Above it, raw parallelism dominates.
The crossover point is the single most useful number for deciding whether to port a workload to the GPU at all.
Crossover summary
Mandelbrot Set
- CPU wins until: never
- Crossover at: < 512
Mandelbrot has no crossover in our measured range — the GPU wins from the first data point. To see a CPU win you would need to drop below ~256×256, where kernel-launch overhead dominates.
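For reference, the per-pixel work is just the escape-time iteration below; on the GPU each thread runs this loop for exactly one pixel, which is why there is no cross-thread communication at all. This is a plain C++ sketch of the inner loop, not the actual kernel from the benchmark.

```cpp
// Escape-time iteration for one point c = cr + ci*i of the Mandelbrot set.
// Returns the iteration at which |z| exceeds 2, or max_iter if it never does.
// On the GPU, one thread runs exactly this loop for one pixel: no shared
// state, so the workload is embarrassingly parallel.
int escape_time(double cr, double ci, int max_iter) {
    double zr = 0.0, zi = 0.0;
    for (int i = 0; i < max_iter; ++i) {
        double zr2 = zr * zr, zi2 = zi * zi;
        if (zr2 + zi2 > 4.0) return i;   // |z|^2 > 4  =>  |z| > 2, escaped
        zi = 2.0 * zr * zi + ci;         // z = z^2 + c (imaginary part)
        zr = zr2 - zi2 + cr;             // z = z^2 + c (real part)
    }
    return max_iter;                     // point is (probably) in the set
}
```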
Dot Product
- CPU wins until: ~1M elements
- Crossover at: ~5M elements
Below ~1M elements the single-threaded CPU is actually competitive — host↔device transfer eats the GPU advantage. The naive GPU kernel only catches OpenMP around 10M elements; the optimized kernel crosses earlier, around 5M.
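The shape of the optimized kernel is a tree reduction: each block loads a tile into shared memory and halves the active range each step. A CPU-side sketch of that stride-halving pattern (assuming a power-of-two length for brevity) looks like this; the real kernel runs the same strides across shared memory within one block.

```cpp
#include <vector>
#include <cstddef>

// Dot product via the same stride-halving tree reduction a GPU block
// performs in shared memory. Assumes a power-of-two length for brevity;
// real kernels pad the input or handle the tail separately.
double dot_tree(std::vector<double> a, const std::vector<double>& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] *= b[i];                         // elementwise products
    for (std::size_t stride = a.size() / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)
            a[i] += a[i + stride];            // fold upper half into lower
    return a.empty() ? 0.0 : a[0];
}
```

The log2(n) folding steps are why reductions need scale to pay off: each step has half the parallelism of the last, and the final sum still has to cross the PCIe bus back to the host.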
Heat Equation (2D Stencil)
- CPU wins until: ~256×256
- Crossover at: ~256×256
The stencil hands the GPU a clear win once the grid hits 512×512. At 256×256 OpenMP is within 1.5× of the naive GPU because the working set fits comfortably in L2.
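For context, one time step of the 2D heat equation is the 5-point update below (a plain C++ sketch; the fixed boundary and the coefficient `k` are illustrative choices, not the benchmark's exact parameters). The GPU's tiled version stages each block's neighborhood into shared memory so the four neighbor reads hit fast on-chip storage instead of global memory.

```cpp
#include <vector>
#include <cstddef>

// One Jacobi step of the 2D heat equation on an n x n grid with a fixed
// boundary. k stands in for (diffusivity * dt / dx^2); the value used in
// the test below, k = 0.25, is purely illustrative.
using Grid = std::vector<std::vector<double>>;

Grid heat_step(const Grid& u, double k) {
    Grid out = u;                             // copies the boundary unchanged
    const std::size_t n = u.size();
    for (std::size_t i = 1; i + 1 < n; ++i)
        for (std::size_t j = 1; j + 1 < n; ++j)
            out[i][j] = u[i][j] + k * (u[i - 1][j] + u[i + 1][j]
                                     + u[i][j - 1] + u[i][j + 1]
                                     - 4.0 * u[i][j]);
    return out;
}
```

Every interior cell reads its four neighbors, so neighboring threads share most of their inputs; that reuse is exactly what shared-memory tiling captures on the GPU, and what the L2 cache captures for OpenMP at 256×256.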
Three rules of thumb
Not every workload is worth porting. The crossover analysis gives a practical filter.
1. Embarrassingly parallel: port it
If your problem looks like Mandelbrot (independent per-pixel work, no shared state), the GPU wins at almost every size; kernel-launch overhead is the only floor.

2. Reductions: only at scale
The dot-product crossover sits in the millions of elements. Below that, host↔device transfer dominates and you are better off on a multi-core CPU.

3. Stencils: optimize before you compare
A naive stencil kernel looks unimpressive. Shared-memory tiling is what separates a 3× speedup from a 30× speedup, so measure the optimized version when deciding whether to port.