cuda-kernels

Here are 342 public repositories matching this topic...

xlite-dev / LeetCUDA

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

cuda cuda-kernels cuda-demo cuda-toolkit cuda-library cuda-kernel learn-cuda cuda-cpp hgemm flash-attention leet-cuda cuda-12

Updated May 29, 2026
Cuda

NVIDIA / cuda-samples

Samples for CUDA developers that demonstrate features in the CUDA Toolkit

cuda cuda-kernels cuda-driver-api cuda-opengl

Updated May 27, 2026
C++

InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

llama cuda-kernels deepspeed llm fastertransformer llm-inference turbomind internlm llama2 codellama llama3

Updated Jun 5, 2026
Python

Rust-GPU / rust-cuda

Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.

rust gpu cuda rust-lang gpgpu cuda-kernels gpu-programming cuda-programming

Updated Apr 29, 2026
Rust

NVIDIA / cccl

CUDA Core Compute Libraries

cpp hpc gpu modern-cpp parallel-computing cuda nvidia gpu-acceleration cuda-kernels gpu-computing parallel-algorithm parallel-programming nvidia-gpu gpu-programming cuda-library cpp-programming cuda-programming accelerated-computing cuda-cpp

Updated Jun 7, 2026
C++

Luce-Org / lucebox-hub

Fast LLM speculative inference server for consumer hardware.

kernel cuda cuda-kernels nvidia-cuda luce rtx3090 llama-cpp local-ai qwen speculative-decoding dflash megakernel speculative-prefill pflash lucebox

Updated Jun 6, 2026
C++

chelsea0x3b / dfdx

Deep learning in Rust, with shape checked tensors and neural networks

rust machine-learning deep-neural-networks deep-learning neural-network gpu cuda autograd rust-lang gpu-acceleration cuda-kernels tensor gpu-computing backpropagation cudnn cuda-toolkit cuda-support autodiff autodifferentiation

Updated Jul 23, 2024
Rust

chelsea0x3b / cudarc

Safe Rust wrapper around the CUDA toolkit

rust gpu cuda cublas gpu-acceleration cuda-kernels cudnn cuda-toolkit nccl curand cuda-programming nvrtc

Updated May 15, 2026
Rust

mni-ml / framework

A machine learning library with a TypeScript API and Rust backend. CUDA and WebGPU compatibility. Built to understand how ML frameworks and models work internally.

rust machine-learning cuda cuda-kernels webgpu pytorch-implementation

Updated Apr 20, 2026
Rust

NVIDIA / nvbench

CUDA Kernel Benchmarking Library

benchmark performance gpu cuda nvidia cuda-kernels kernel-benchmark

Updated Jun 4, 2026
Cuda

NVIDIA / cudnn-frontend

cuDNN Frontend is NVIDIA's modern, open-source entry point to the cuDNN library and a growing collection of high-performance open-source kernels.

Updated Jun 5, 2026
Python

deepreinforce-ai / CUDA-L2

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

reinforcement-learning cublas nvidia matrix-multiplication cuda-kernels large-language-models

Updated Mar 30, 2026
Cuda

KernelTuner / kernel_tuner

Kernel Tuner

python c testing machine-learning cplusplus gpu optimization opencl cuda autotuning software-development opencl-kernels kernel-tuner cuda-kernels gpu-computing auto-tuning

Updated Jun 5, 2026
Python

harrism / hemi

Simple utilities to enable code reuse and portability between CUDA C/C++ and standard C/C++.

c-plus-plus gpu cuda cuda-kernels cuda-device hemi

Updated Apr 14, 2022
C++

LiangSu8899 / FlashRT

FlashRT is a high-performance realtime inference engine for small-batch, latency-sensitive AI workloads. The flagship integration is production VLA control for Pi0, Pi0.5, GR00T N1.6, and Pi0-FAST. It also supports LLMs, e.g. qwen3.6-27B.

cuda pi thor cuda-kernels wan vla jetson motus jetson-orin qwen gr00t wan22-5b realtime-inference pi05 jetson-thor qwen3-6 gr00t-n1-6-3b realtime-vla qwen3-6-27b

Updated Jun 6, 2026
C++

jaredhoberock / stanford-cs193g-sp2010

This is an archive of materials produced for an introductory class on CUDA programming at Stanford University in 2010

cuda cuda-kernels gpu-programming cuda-programming

Updated Jun 24, 2022
C++

HenryNdubuaku / cuda-tutorials

Comprehensive CUDA tutorials for Maths & ML with examples

machine-learning cuda cuda-kernels maths cuda-programming

Updated Jun 11, 2025
Cuda

deepakkumar1984 / Amplifier.NET

Amplifier allows .NET developers to easily run complex applications with intensive mathematical computation on Intel CPU/GPU, NVIDIA, AMD without writing any additional C kernel code. Write your function in .NET and Amplifier will take care of running it on your favorite hardware.

compiler opencl simd gpgpu opencl-kernels cuda-kernels gpgpu-computing gpgpu-sim

Updated Dec 23, 2025
C#

alexzhang13 / flashattention2-custom-mask

Triton implementation of FlashAttention2 that adds Custom Masks.

deep-learning triton attention cuda-kernels attention-mechanism triton-lang flash-attention flash-attention-2

Updated Aug 14, 2024
Python

psmarter / CUDA-Practice

CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.

parallel-computing cuda high-performance-computing cuda-kernels quantization cutlass gemm performance-optimization nccl gpu-programming roofline-model tensor-core llm-inference flash-attention nsight-compute

Updated May 11, 2026
Cuda
