Benchmarks

Three workloads, four implementations each

All measurements are wall-clock time. CPU runs come in two configurations: a single-threaded build on an Intel Core i7, and an OpenMP build on 8 cores of the same chip. GPU runs use an NVIDIA RTX 3060. Both axes of the plots are logarithmic, so slopes tell you scaling behavior and vertical gaps tell you raw speedup.

Embarrassingly parallel

Mandelbrot Set

Every pixel is independent — no shared state, no synchronization. The canonical case where the GPU should crush the CPU.

Reduction

Dot Product

A pairwise multiply followed by a global sum. The reduction step needs a tree-style merge — a textbook case where naive GPU code loses to optimized GPU code.

Stencil / nearest-neighbor

Heat Equation (2D Stencil)

Each cell reads its four neighbors per iteration. Memory access dominates — tiling into shared memory is the key optimization.

Methodology

How we measured

  • Warm-up: Each kernel runs once, untimed, before measurement, to keep one-time costs (JIT compilation, GPU context init, first-touch page faults) out of the numbers.
  • Repetitions: 10 runs per data point. We report the median.
  • Transfer cost: GPU timings include host↔device memcpy. This is the honest number for a one-shot computation.
  • Compiler: g++ -O3 -fopenmp for CPU, nvcc -O3 -arch=sm_86 for GPU.