gpu-mode/lectures

Supplementary Material for Lectures

YouTube Channel

The PMPP Book: Programming Massively Parallel Processors: A Hands-on Approach (Amazon link)

Lecture 1: Profiling and Integrating CUDA kernels in PyTorch

Lecture 2: Recap Ch. 1-3 from the PMPP book

Lecture 3: Getting Started With CUDA

Lecture 4: Intro to Compute and Memory Architecture

Lecture 5: Going Further with CUDA for Python Programmers

Lecture 6: Optimizing PyTorch Optimizers

Lecture 7: Advanced Quantization

Lecture 8: CUDA Performance Checklist

Lecture 9: Reductions

Lecture 10: Build a Prod Ready CUDA Library

Lecture 11: Sparsity

Lecture 12: Flash Attention

Lecture 13: Ring Attention

Lecture 14: Practitioner's Guide to Triton

Lecture 15: CUTLASS

Lecture 16: Hands-On Profiling

Bonus Lecture: CUDA C++ llm.cpp

Lecture 17: GPU Collective Communication (NCCL)

Lecture 18: Fused Kernels

Lecture 19: Data Processing on GPUs

Lecture 20: Scan Algorithm

Lecture 21: Scan Algorithm Part 2

Lecture 22: Hacker's Guide to Speculative Decoding in vLLM

Lecture 23: Tensor Cores

  • Speaker: Vijay Thakkar & Pradeep Ramani
  • Slides

Lecture 24: Scan at the Speed of Light

  • Speaker: Jake Hemstad & Georgii Evtushenko

Lecture 25: Speaking Composable Kernel

  • Speaker: Haocong Wang
  • Slides

Lecture 26: SYCL MODE (Intel GPU)

Lecture 27: gpu.cpp

Lecture 28: Liger Kernel

Lecture 29: Triton Internals

Lecture 30: Quantized training

Lecture 31: Beginners Guide to Metal Kernels

Lecture 32: Unsloth - LLM Systems Engineering

Lecture 33: BitBLAS

Lecture 34: Low Bit Triton Kernels

Lecture 35: SGLang Performance Optimization

Lecture 36: CUTLASS and Flash Attention 3

Lecture 37: Introduction to SASS & GPU Microarchitecture

Lecture 38: Low-Bit Kernels for ARM CPU

Lecture 39: TorchTitan

  • Speaker: Mark Saroufim and Tianyu Liu

Lecture 40: Flash Infer

Lecture 41: CUDA Docs for Humans

Lecture 42: Mosaic GPU

Lecture 43:

  • Speaker: Erik Schultheis
  • Slides

Lecture 57: CuTe

Lecture 67: NCCL & NVSHMEM

Lecture 69: Quartet 4-Bit Training

Lecture 70: Fault-Tolerant Communication Collectives

Lecture 71: [ScaleML Series] FlexOlmo: Open Language Models for Flexible Data Use

Lecture 72: [ScaleML Series] Efficient & Effective Long-Context Modeling for Large Language Models

Lecture 74: [ScaleML Series] Positional Encodings and PaTH Attention

Lecture 75: [ScaleML Series] GPU Programming Fundamentals + ThunderKittens

Lecture 78: Iris: Multi-GPU Programming in Triton

  • Speakers: Muhammad Awad, Muhammad Osama & Brandon Potter

Lecture 79: Mirage (MPK): Compiling LLMs into Mega Kernels

  • Speakers: Mengdi Wu & Xinhao Cheng

Lecture 84: Numerics and AI

  • Speaker: Paulius Micikevicius

Lecture 86: Introduction to CuTeDSL (for NVIDIA competition)

  • Speaker: Vicki Wang
