Flash-Kernels v2 – Quick Reference

A lean playground for experimenting with CUDA GPU kernels.

Directory highlights

  1. benchmarks/ – micro-benchmarks + visualisations
  2. new_kernels/ – hand-tuned CUDA kernels (LayerNorm, Softmax, …)
  3. evals/ – PyTest correctness & regression suite
  4. agents/ – LLM pipeline that auto-writes kernels for KernelBench

1 · Installation

python -m venv venv && source venv/bin/activate
pip install -r requirements.txt      # PyTorch, Triton (for benchmarking), plotting libs

GPU prerequisites

  • CUDA 11.8+ drivers
  • Compute Capability ≥ 7.0 (RTX 30-series, A100/H100, …)
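
A quick way to confirm both prerequisites is a short PyTorch check. The sketch below assumes nothing beyond the requirements.txt install; save it to a scratch file and run it with the venv active:

import torch

# Verify that a CUDA device is visible and meets the Compute Capability >= 7.0 requirement.
assert torch.cuda.is_available(), "No CUDA device found; check the driver install"
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
assert (major, minor) >= (7, 0), "Compute Capability >= 7.0 is required"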

2 · Running the tests (evals/)

pytest -q evals       # FP32 + BF16 when supported

Tests compare the CUDA kernels against PyTorch reference implementations using strict assert_verbose_allclose tolerances.
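
For orientation, the comparison pattern looks roughly like the sketch below. The kernel import and tolerances are illustrative only, and torch.testing.assert_close stands in for the repo's own assert_verbose_allclose helper:

import pytest
import torch

# Hypothetical import: the real kernels live under new_kernels/ with their own module names.
from new_kernels.layer_norm import layer_norm_forward

@pytest.mark.parametrize("dtype", [torch.float32, torch.bfloat16])
def test_layer_norm_matches_pytorch(dtype):
    if dtype is torch.bfloat16 and not torch.cuda.is_bf16_supported():
        pytest.skip("BF16 not supported on this GPU")
    x = torch.randn(8, 4096, device="cuda", dtype=dtype)
    w = torch.randn(4096, device="cuda", dtype=dtype)
    b = torch.randn(4096, device="cuda", dtype=dtype)

    ref = torch.nn.functional.layer_norm(x, (4096,), w, b)  # PyTorch reference
    out = layer_norm_forward(x, w, b)                        # custom CUDA kernel (assumed signature)

    # assert_verbose_allclose in evals/ plays this role with its own strict tolerances.
    tol = dict(rtol=1.6e-2, atol=1e-2) if dtype is torch.bfloat16 else dict(rtol=1e-5, atol=1e-6)
    torch.testing.assert_close(out, ref, **tol)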


3 · Benchmarking (benchmarks/)

3.1 One-liner drivers 🚀

For the common cases you do not need to pass any arguments – just execute one of the convenience shell scripts and grab a coffee:

# Forward-only roofline runs
bash benchmarks/run_layer_norm_sol.sh
bash benchmarks/run_softmax_sol.sh
bash benchmarks/run_diagonal_matmul_sol.sh
bash benchmarks/run_fused_linear_rowsum_sol.sh

Each script will:

  1. call the corresponding benchmarks/scripts/benchmark_*.py file,
  2. append results to benchmarks/data/all_benchmark_data.csv,
  3. auto-generate PNGs in benchmarks/visualizations/ (one per extra-config).

3.2 Manual control

Prefer explicit flags? You can run the python scripts directly:

python benchmarks/scripts/benchmark_layer_norm.py --overwrite
python benchmarks/benchmark_visualizer.py \
       --kernel-name layer_norm --metric-name speed \
       --kernel-operation-mode forward --display

⚠️ The visualiser iterates over every unique extra_benchmark_config unless --extra-config-filter is supplied. Expect several PNGs.
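
To decide what to pass to --extra-config-filter, one option is to peek at the collected data first. A small sketch, assuming pandas is installed and the CSV column is literally named extra_benchmark_config:

import pandas as pd

# List the distinct extra-benchmark configs recorded so far, i.e. the candidate filter values.
df = pd.read_csv("benchmarks/data/all_benchmark_data.csv")
print(df["extra_benchmark_config"].drop_duplicates().to_string(index=False))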


4 · Reproducing the paper numbers

Full-scale regeneration (~25 min on an H100):

source venv/bin/activate  # ensure deps are present
for s in benchmarks/scripts/benchmark_*.py; do python "$s" --overwrite; done
for k in fused_linear_rowsum layer_norm softmax diag_matmul; do
  python benchmarks/benchmark_visualizer.py --kernel-name "$k" --metric-name speed
  python benchmarks/benchmark_visualizer.py --kernel-name "$k" --metric-name memory || true
done

Hardware: an Ampere or newer GPU with ≥ 40 GB VRAM, to cover the largest cases. The regenerated all_benchmark_data.csv should match the committed copy (only the timestamps differ).
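
To check the match while ignoring timestamps, one rough approach is to diff the regenerated file against the version stored in git. This is a sketch, not part of the repo: the timestamp column name is guessed, and the rtol of 5% is an arbitrary allowance for run-to-run timing noise.

import io
import subprocess
import pandas as pd

# Load the committed copy straight from git, then compare it to the freshly regenerated file.
committed = subprocess.run(
    ["git", "show", "HEAD:benchmarks/data/all_benchmark_data.csv"],
    capture_output=True, text=True, check=True,
).stdout
old = pd.read_csv(io.StringIO(committed))
new = pd.read_csv("benchmarks/data/all_benchmark_data.csv")

def drop_timestamps(df):
    # Guessed: timestamp columns contain "timestamp" in their name.
    return df.drop(columns=[c for c in df.columns if "timestamp" in c.lower()])

pd.testing.assert_frame_equal(drop_timestamps(new), drop_timestamps(old),
                              check_like=True, rtol=0.05)
print("Regenerated results match the committed copy (timestamps ignored).")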

