kernel-fusion

Fused Triton kernels for TurboQuant KV cache compression — 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.

Updated Apr 1, 2026
Python

nopperl / pytorch-fused-lamb

LAMB go brrr

cuda optimizer pytorch triton lamb kernel-fusion triton-lang

Updated Apr 11, 2024
Python

svdrecbd / mhc-mlx

MLX + Metal implementation of mHC: Manifold-Constrained Hyper-Connections by DeepSeek-AI.

performance deep-learning metal gpu transformers mhc mlx sinkhorn kernel-fusion sinkhorn-knopp apple-silicon metal-kernel mlx-explore fused-kernels manifold-constrained-hyper-connections hyperconnections birkhoff-polytope

Updated Jan 13, 2026
Python

LessUp / triton-fused-ops

Fused Triton kernels for Transformer inference: RMSNorm+RoPE, Gated MLP, and FP8 GEMM.

Updated Apr 29, 2026
Python

fraidakis / PDS_BitonicSortCUDA

Assignment 3 for the "Parallel & Distributed Systems" course (ECE, AUTh) - Fall 2024

cuda shared-memory radix-sort bitonic-sort nvidia-gpu kernel-fusion

Updated Mar 16, 2025
Cuda

PwnKit-Labs / noeris

Noeris — autonomous kernel fusion discovery + Triton autotuning for LLM kernels and Gemma layer deeper fusion (A100/H100 wins).

benchmarking cuda pytorch triton autotuning gemma gpu-kernels github-actions kernel-fusion llm-training llm-inference kernel-optimization

Updated May 2, 2026
Python

ParCoreLab / gpu-fusion

GPU fusion code and algorithm

gpu cuda kernel-fusion

Updated May 24, 2024
Cuda

ShkalikovOleh / alpaka_expr_trees

Compile-time kernel fusion and expression trees as an Alpaka boost.odeint backend. This is my team project, developed in collaboration with and under the supervision of HZDR.

cuda accelerators kernel-fusion alpaka

Updated Feb 20, 2024
C++

LessUp / tiny-dl-inference

Zero-dependency WebGPU deep learning inference engine (~50KB vs TensorFlow.js ~2MB)

machine-learning typescript browser deep-learning neural-network wasm inference mnist tensor gpu-computing webgpu kernel-fusion wgsl webgpu-compute

Updated May 1, 2026
TypeScript

varad-more / fused-triton-rmsnorm-residual-qkv

Production-grade Triton kernel fusing residual add + RMSNorm + packed QKV projection into a single GPU launch for decoder-only transformer inference (Llama-3, Mistral, Qwen2). +2.4% tok/s, -1.5 GB VRAM on A10G.

cuda pytorch transformer triton llama memory-bandwidth gpu-kernels kernel-fusion rmsnorm llm-inference

Updated Apr 22, 2026
Python

abgnydn / webgpu-q

WebGPU quantum many-body simulator — statevector + MPS + kernel fusion + chemistry. 160 tests, ITensor-validated, PySCF to 7 decimals. Browser-native.

typescript browser quantum-computing quantum-chemistry dmrg mps quantum-simulator webgpu itensor tensor-networks phase-transition many-body-physics vqe kernel-fusion wgsl

Updated May 4, 2026
TypeScript

abgnydn / wgpu-adas-bench

ADAS sensor fusion benchmark — 11-stage fused wgpu-native vs multi-kernel PyTorch. 12-15x faster on same GPU.

rust benchmark metal vulkan pytorch autonomous-driving sensor-fusion adas webgpu kernel-fusion wgpu

Updated May 4, 2026
Rust

JonSnow1807 / Fused-LayerNorm-CUDA-Operator

High-performance CUDA implementation of LayerNorm for PyTorch achieving 1.46x speedup through kernel fusion. Optimized for large language models (4K-8K hidden dims) with vectorized memory access, warp-level primitives, and mixed precision support. Drop-in replacement for nn.LayerNorm with 25% memory reduction.

deep-learning cuda pytorch gpu-optimization kernel-fusion layernorm

Updated Aug 17, 2025
Python

abgnydn / webgpu-fusion-max

Pushing fused WebGPU transformer kernels to max model size — int4, tiled FFN, Phi-3-mini 3.6B in Chrome

inference transformer quantization webgpu kernel-fusion wgsl llm phi-3 browser-llm

Updated May 4, 2026
HTML

kernel-fusion

Here are 18 public repositories matching this topic...

tracel-ai / burn

ROCm / iris

chhzh123 / Krill

wu-kan / GoPTX

Argonaut790 / fused-turboquant

nopperl / pytorch-fused-lamb

svdrecbd / mhc-mlx

LessUp / triton-fused-ops

fraidakis / PDS_BitonicSortCUDA

PwnKit-Labs / noeris

ParCoreLab / gpu-fusion

ShkalikovOleh / alpaka_expr_trees

LessUp / tiny-dl-inference

varad-more / fused-triton-rmsnorm-residual-qkv

abgnydn / webgpu-q

abgnydn / wgpu-adas-bench

JonSnow1807 / Fused-LayerNorm-CUDA-Operator

abgnydn / webgpu-fusion-max
