Computer architecture · final project
CPU vs GPU
Two architectures, three workloads, one wall-clock.
Cores: 8 P · 16 T
L3: 36 MB
SMs: 170
VRAM: 32 GB
Three problems, three parallelism patterns
Each task isolates a different reason GPUs are fast — or aren't.
01 · Embarrassingly parallel · Mandelbrot
Per-pixel iteration with no data dependencies. The cleanest possible win for GPUs — every thread runs the same kernel on independent data.
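To make the pattern concrete, a minimal sketch of a per-pixel kernel; the kernel name, viewport mapping, and iteration cap are illustrative, not the project's actual code:

```cuda
// Sketch of the per-pixel pattern; the viewport mapping and maxIter
// threshold are illustrative, not the project's actual parameters.
__global__ void mandelbrot(int *iters, int width, int height, int maxIter) {
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    // Map the pixel into the complex plane, roughly [-2,1] x [-1.5,1.5].
    float cr = -2.0f + 3.0f * px / width;
    float ci = -1.5f + 3.0f * py / height;

    float zr = 0.0f, zi = 0.0f;
    int i = 0;
    while (zr * zr + zi * zi < 4.0f && i < maxIter) {
        float t = zr * zr - zi * zi + cr;   // z <- z^2 + c
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++i;
    }
    // Each thread writes exactly one pixel and reads no other pixel.
    iters[py * width + px] = i;
}
```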
02 · Reduction · Dot Product
Element-wise multiply followed by a global sum. Forces a tree-style merge — naive GPU code lags until you use warp-level primitives.
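The kind of warp-level merge that closes the gap, sketched with `__shfl_down_sync`; the kernel name and grid-stride setup are illustrative, and the sketch assumes a block size that is a multiple of 32 and an output zeroed before launch:

```cuda
// Sketch of a warp-level tree merge; assumes blockDim.x is a
// multiple of 32 and *out is zeroed before the kernel launches.
__device__ float warpReduceSum(float v) {
    // Each step halves the active lanes: 16, 8, 4, 2, 1.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;   // lane 0 ends up holding the warp's sum
}

__global__ void dotPartial(const float *a, const float *b, float *out, int n) {
    float sum = 0.0f;
    // Grid-stride loop: each thread accumulates its share of a[i] * b[i].
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += a[i] * b[i];

    sum = warpReduceSum(sum);
    // One atomicAdd per warp instead of one per element.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, sum);
}
```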
03 · Stencil · Heat Equation
Each cell reads four neighbors per step. Memory-bound; shared-memory tiling is what unlocks the GPU here.
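The tiling move, sketched below: each block stages its tile plus a one-cell halo in shared memory so the four neighbor reads hit on-chip storage instead of DRAM. The tile size, kernel name, and fixed-boundary handling are assumptions, and the grid size is assumed to be a multiple of the tile width:

```cuda
// Sketch of shared-memory tiling for the 5-point stencil; TILE and
// names are illustrative, and n is assumed to be a multiple of TILE.
#define TILE 16

__global__ void heatStep(const float *in, float *out, int n, float alpha) {
    __shared__ float s[TILE + 2][TILE + 2];   // tile plus a 1-cell halo

    int x = blockIdx.x * TILE + threadIdx.x;        // global column
    int y = blockIdx.y * TILE + threadIdx.y;        // global row
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1; // local, inside the halo

    s[ly][lx] = in[y * n + x];
    // Edge threads of the block also stage the halo cells they border.
    if (threadIdx.x == 0        && x > 0)     s[ly][0]        = in[y * n + x - 1];
    if (threadIdx.x == TILE - 1 && x < n - 1) s[ly][TILE + 1] = in[y * n + x + 1];
    if (threadIdx.y == 0        && y > 0)     s[0][lx]        = in[(y - 1) * n + x];
    if (threadIdx.y == TILE - 1 && y < n - 1) s[TILE + 1][lx] = in[(y + 1) * n + x];
    __syncthreads();

    if (x > 0 && x < n - 1 && y > 0 && y < n - 1)
        // All four neighbor reads now hit shared memory, not DRAM.
        out[y * n + x] = s[ly][lx] + alpha *
            (s[ly][lx - 1] + s[ly][lx + 1] + s[ly - 1][lx] + s[ly + 1][lx]
             - 4.0f * s[ly][lx]);
    else
        out[y * n + x] = s[ly][lx];   // fixed boundary carried forward
}
```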
The serial side · CPU
Fewer cores, deeper pipelines.
The CPU side of every speedup number on this site. It mirrors the composition the GPU gets in the hero: model on the left, so the two chips face each other across the page.
Built to chase the next instruction
- Out-of-order, speculative, branch-prediction heavy.
- Wins when the next computation depends on the last one.
- Our OpenMP runs use all 16 threads; the baseline uses one. A sketch of the parallel-for pattern follows this list.
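A minimal sketch of that baseline pattern, using the dot product as the workload; the function name is illustrative, and the single-threaded baseline is the same loop without the pragma:

```cpp
// Sketch of the parallel-for baseline; compile with -fopenmp.
// The single-threaded baseline is this exact loop minus the pragma.
double dotProduct(const float *a, const float *b, long n) {
    double sum = 0.0;
    // reduction(+:sum) gives each thread a private accumulator and
    // merges them once at the end, avoiding a shared-counter bottleneck.
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i)
        sum += (double)a[i] * b[i];
    return sum;
}
```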
Cores / threads: 8 / 16
Boost clock: 5.4 GHz
L3 cache: 36 MB
Tools we used
CUDA · GPU kernels, shared memory, warp shuffles
OpenMP · CPU parallel-for baselines on 8 cores
C++17 · Both CPU and GPU host code
Next.js 14 · App Router + TypeScript for this report
Recharts · Log-log timing plots
Key findings
The full breakdown is on the benchmarks page — these are the ones that surprised us.
98× · Mandelbrot speedup · optimized GPU vs single-threaded CPU at 4096×4096
40× · Dot product speedup · optimized GPU vs single-threaded CPU at 1B elements
32× · Heat equation speedup · tiled GPU vs single-threaded CPU at 2048×2048, 1000 steps
7× · Naive → optimized GPU · the reduction kernel gains the most from warp-level merging