gpu-mode/lectures

Supplementary Material for Lectures

YouTube Channel

The PMPP Book: Programming Massively Parallel Processors: A Hands-on Approach (Amazon link)

Lecture 1: Profiling and Integrating CUDA kernels in PyTorch

Lecture 2: Recap Ch. 1-3 from the PMPP book

Lecture 3: Getting Started With CUDA

Lecture 4: Intro to Compute and Memory Architecture

Lecture 5: Going Further with CUDA for Python Programmers

Lecture 6: Optimizing PyTorch Optimizers

Lecture 7: Advanced Quantization

Lecture 8: CUDA Performance Checklist

Lecture 9: Reductions

Lecture 10: Build a Prod Ready CUDA Library

Lecture 11: Sparsity

Lecture 12: Flash Attention

Lecture 13: Ring Attention

Lecture 14: Practitioner's Guide to Triton

Lecture 15: CUTLASS

Lecture 16: Hands-On Profiling

Bonus Lecture: CUDA C++ llm.cpp

Lecture 17: GPU Collective Communication (NCCL)

Lecture 18: Fused Kernels

Lecture 19: Data Processing on GPUs

Lecture 20: Scan Algorithm

Lecture 21: Scan Algorithm Part 2

Lecture 22: Hacker's Guide to Speculative Decoding in vLLM

Lecture 23: Tensor Cores

  • Speaker: Vijay Thakkar & Pradeep Ramani
  • Slides

Lecture 24: Scan at the Speed of Light

  • Speaker: Jake Hemstad & Georgii Evtushenko

Lecture 25: Speaking Composable Kernel

  • Speaker: Haocong Wang
  • Slides

Lecture 26: SYCL MODE (Intel GPU)

Lecture 27: gpu.cpp

Lecture 28: Liger Kernel

Lecture 29: Triton Internals

Lecture 30: Quantized training

Lecture 31: Beginners Guide to Metal Kernels

Lecture 32: Unsloth - LLM Systems Engineering

Lecture 33: BitBLAS

Lecture 34: Low Bit Triton Kernels

Lecture 35: SGLang Performance Optimization

Lecture 36: CUTLASS and Flash Attention 3

Lecture 37: Introduction to SASS & GPU Microarchitecture

Lecture 38: Low-Bit Kernels for ARM CPU

Lecture 39: TorchTitan

  • Speaker: Mark Saroufim and Tianyu Liu

Lecture 40: Flash Infer

Lecture 41: CUDA Docs for Humans

Lecture 42: Mosaic GPU

Lecture 43:

  • Speaker: Erik Schultheis
  • Slides

Lecture 57: CuTe

Lecture 67: NCCL & NVSHMEM

Lecture 69: Quartet 4-Bit Training

Lecture 70: Fault-Tolerant Communication Collectives

Lecture 71: [ScaleML Series] FlexOlmo: Open Language Models for Flexible Data Use

Lecture 72: [ScaleML Series] Efficient & Effective Long-Context Modeling for Large Language Models

Lecture 74: [ScaleML Series] Positional Encodings and PaTH Attention

Lecture 75: [ScaleML Series] GPU Programming Fundamentals + ThunderKittens

Lecture 78: Iris: Multi-GPU Programming in Triton

  • Speakers: Muhammad Awad, Muhammad Osama & Brandon Potter

Lecture 79: Mirage (MPK): Compiling LLMs into Mega Kernels

  • Speakers: Mengdi Wu & Xinhao Cheng

Lecture 84: Numerics and AI

  • Speaker: Paulius Micikevicius

Lecture 86: Introduction to CuTeDSL (for NVIDIA competition)

  • Speaker: Vicki Wang
