Crossover analysis

Where does the GPU start winning?

A crossover point is the input size at which the GPU implementation becomes faster than the best CPU implementation. Below the crossover, the CPU wins because kernel-launch overhead and host↔device memory transfer cost more than the actual computation. Above it, raw parallelism dominates.

The crossover point is the single most useful number for deciding whether to port a workload to the GPU at all.

At a glance

Crossover summary

| Workload      | Crossover | Result                                                |
|---------------|-----------|-------------------------------------------------------|
| Mandelbrot    | < 512     | GPU wins across the entire measured range             |
| Dot product   | ~5M       | Optimized GPU overtakes 8-core OpenMP                 |
| Heat equation | ~256      | Tiled GPU pulls ahead at 512×512 and never looks back |
Embarrassingly parallel

Mandelbrot Set

CPU wins until: never
Crossover at: < 512

Mandelbrot has no crossover in our measured range — the GPU wins from the first data point. To see a CPU win you would need to drop below ~256×256, where kernel-launch overhead dominates.
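The reason is visible in the per-pixel work itself. A sketch of the escape-time iteration for one pixel (a standard formulation, not necessarily the benchmarked code): each pixel depends only on its own coordinate, so the workload maps onto one GPU thread per pixel with no shared state.

```cpp
// Escape-time count for one pixel of the Mandelbrot set.
// No reads or writes outside this function's locals: every pixel
// is independent, which is what "embarrassingly parallel" means here.
int mandel_iters(double cr, double ci, int max_iters) {
    double zr = 0.0, zi = 0.0;
    int i = 0;
    while (zr * zr + zi * zi <= 4.0 && i < max_iters) {
        double tmp = zr * zr - zi * zi + cr;  // z = z^2 + c (real part)
        zi = 2.0 * zr * zi + ci;              // imaginary part
        zr = tmp;
        ++i;
    }
    return i;
}
```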

Reduction

Dot Product

CPU wins until: ~1M
Crossover at: ~5M

Below ~1M elements the single-threaded CPU is actually competitive — host↔device transfer eats the GPU advantage. The naive GPU kernel only catches OpenMP around 10M elements; the optimized kernel crosses earlier, around 5M.
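The multi-core baseline the GPU has to beat is cheap to write. A sketch of an OpenMP dot-product reduction (representative of an 8-core baseline, not necessarily the exact benchmarked code; the pragma is ignored if OpenMP is disabled, so the function is correct either way):

```cpp
#include <vector>
#include <cstddef>

// Straightforward multi-core dot product: each thread accumulates a
// private partial sum, and reduction(+:sum) combines them at the end.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}
```

With one multiply-add per element, arithmetic is nearly free; moving the two input arrays across PCIe is the dominant GPU cost, which is why the crossover sits in the millions.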

Stencil / nearest-neighbor

Heat Equation (2D Stencil)

CPU wins until: ~256
Crossover at: ~256

The stencil hands the GPU a clear win once the grid hits 512×512. At 256×256 OpenMP is within 1.5× of the naive GPU because the working set fits comfortably in L2.
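For reference, the per-cell work is a 5-point update. A minimal CPU-side sketch of one Jacobi step on a flattened row-major n×n grid (the `alpha` coefficient and fixed-boundary handling are illustrative assumptions, not the benchmarked code):

```cpp
#include <vector>

// One explicit heat-equation step: out = u + alpha * laplacian(u).
// Boundary cells are left untouched (held fixed).
void heat_step(const std::vector<double>& u, std::vector<double>& out,
               int n, double alpha) {
    for (int y = 1; y < n - 1; ++y)
        for (int x = 1; x < n - 1; ++x) {
            int idx = y * n + x;
            // Each output cell reads its 4 neighbors; that data reuse
            // is exactly what shared-memory tiling exploits on the GPU.
            out[idx] = u[idx] + alpha * (u[idx - 1] + u[idx + 1] +
                                         u[idx - n] + u[idx + n] -
                                         4.0 * u[idx]);
        }
}
```

Every interior value is read five times per step, so a tiled GPU kernel that stages a block of the grid in shared memory cuts global-memory traffic by roughly that factor.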

Takeaway

Three rules of thumb

Not every workload is worth porting. The crossover analysis gives a practical filter.

  1. Embarrassingly parallel: port it

    If your problem looks like Mandelbrot — independent threads, no shared state — the GPU wins at almost every size. The launch overhead is the only floor.

  2. Reductions: only at scale

    The dot-product crossover sits in the millions of elements. Below that, the transfer cost dominates and you are better off on a multi-core CPU.

  3. Stencils: optimize before you compare

    A naive stencil kernel looks unimpressive. Shared-memory tiling is what separates a 3× speedup from a 30× speedup — measure the optimized version when deciding whether to port.