Benchmarks
Three workloads, four implementations each
All measurements are wall-clock time. CPU runs cover two variants: a single thread on an Intel Core i7, and an OpenMP build using all 8 cores. GPU runs use an NVIDIA RTX 3060. Both plot axes are logarithmic: slopes tell you scaling behavior, vertical gaps tell you raw speedup.
Mandelbrot Set
Every pixel is independent — no shared state, no synchronization. The canonical case where the GPU should crush the CPU.
Dot Product
A pairwise multiply followed by a global sum. The reduction step needs a tree-style merge — a textbook case where naive GPU code loses to optimized GPU code.
Heat Equation (2D Stencil)
Each cell reads its four neighbors per iteration. Memory access dominates — tiling into shared memory is the key optimization.
How we measured
- Warm-up: Each kernel runs once before timing to amortize JIT compilation, GPU context init, and page faults.
- Repetitions: 10 runs per data point. We report the median.
- Transfer cost: GPU timings include host↔device memcpy. This is the honest number for a one-shot computation.
- Compiler: g++ -O3 -fopenmp for CPU; nvcc -O3 -arch=sm_86 for GPU.