📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
Updated May 29, 2026 - Cuda
Samples for CUDA developers that demonstrate features in the CUDA Toolkit
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.
CUDA Core Compute Libraries
Fast LLM speculative inference server for consumer hardware.
Deep learning in Rust, with shape checked tensors and neural networks
Safe Rust wrapper around the CUDA toolkit
A machine learning library with a TypeScript API and Rust backend. CUDA and WebGPU compatibility. Built to understand how ML frameworks and models work internally.
CUDA Kernel Benchmarking Library
cuDNN Frontend is NVIDIA's modern, open-source entry point to the cuDNN library and a growing collection of high-performance open-source kernels.
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
Kernel Tuner
Simple utilities to enable code reuse and portability between CUDA C/C++ and standard C/C++.
FlashRT is a high-performance realtime inference engine for small-batch, latency-sensitive AI workloads. The flagship integration is production VLA control for Pi0, Pi0.5, GROOT N1.6, and Pi0-FAST. It also supports LLMs, e.g., qwen3.6-27B.
This is an archive of materials produced for an introductory class on CUDA programming at Stanford University in 2010
Comprehensive CUDA tutorials for Maths & ML with examples
Amplifier allows .NET developers to easily run complex applications with intensive mathematical computation on Intel CPU/GPU, NVIDIA, AMD without writing any additional C kernel code. Write your function in .NET and Amplifier will take care of running it on your favorite hardware.
Triton implementation of FlashAttention2 that adds Custom Masks.
CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.