yrmo/SGEMM_CUDA


Fast CUDA SGEMM from Scratch

Step-by-step optimization of matrix multiplication, implemented in CUDA. For an explanation of each kernel, see siboehm.com/CUDA-MMM.
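As a point of reference for what every kernel in the list computes, here is a minimal CPU sketch of SGEMM, C = alpha * (A * B) + beta * C. It is not taken from this repository: the function name `sgemm_reference`, the row-major layout, and the use of `std::vector` are assumptions made for illustration. A slow reference like this can be useful for checking a GPU kernel's output on small matrices.

```cpp
#include <vector>

// Reference SGEMM on the CPU: C = alpha * (A * B) + beta * C.
// A is M x K, B is K x N, C is M x N.
// Assumption of this sketch: all matrices are stored row-major.
void sgemm_reference(int M, int N, int K, float alpha,
                     const std::vector<float>& A,
                     const std::vector<float>& B,
                     float beta, std::vector<float>& C) {
  for (int row = 0; row < M; ++row) {
    for (int col = 0; col < N; ++col) {
      // Dot product of one row of A with one column of B.
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        acc += A[row * K + k] * B[k * N + col];
      }
      C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
  }
}
```

Each output element needs K multiplies and K adds, which is where the usual 2 * M * N * K operation count for matrix multiplication comes from.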

Overview

Running the kernels on an NVIDIA A6000 (Ampere):

GFLOPs/s at matrix size 4096x4096, sorted from slowest to fastest:

| Kernel | GFLOPs/s | Performance relative to cuBLAS |
|---|---|---|
| 1: Naive | 309.0 | 1.3% |
| 2: GMEM Coalescing | 1986.5 | 8.5% |
| 3: SMEM Caching | 2980.3 | 12.8% |
| 4: 1D Blocktiling | 8474.7 | 36.5% |
| 5: 2D Blocktiling | 15971.7 | 68.7% |
| 7: Avoid Bank Conflicts (Linearize) | 16213.4 | 69.7% |
| 8: Avoid Bank Conflicts (Offset) | 16459.2 | 70.8% |
| 11: Double Buffering | 17278.3 | 74.3% |
| 6: Vectorized Mem Access | 18237.3 | 78.4% |
| 9: Autotuning | 19721.0 | 84.8% |
| 10: Warptiling | 21779.3 | 93.7% |
| 0: cuBLAS | 23249.6 | 100.0% |

Setup

  1. Install dependencies: CUDA toolkit 12, Python (+ Seaborn), CMake, Ninja. See environment.yml.
  2. Configure NVCC compilation parameters. Look up your GPU's compute capability on NVIDIA's CUDA GPUs page, then edit CMakeLists.txt and set it accordingly, for example:
    set(CUDA_COMPUTE_CAPABILITY 80)
  3. Build: mkdir build && cd build && cmake .. && cmake --build .
  4. Run one of the kernels: DEVICE=<device_id> ./sgemm <kernel number>
  5. Profile a kernel with NVIDIA Nsight Compute (ncu): make profile KERNEL=<kernel number>

Credit goes to wangzyon/NVIDIA_SGEMM_PRACTICE for the benchmarking setup.
