kv-cache

Here are 606 public repositories matching this topic...

LMCache / LMCache

LMCache: Supercharge Your LLM with the Fastest KV Cache Layer

fast amd cuda inference pytorch speed rocm kv-cache llm vllm

Updated Jul 24, 2026
Python

HDT3213 / godis

A Redis server and distributed cluster implemented in Go.

go redis golang cluster redis-server redis-cluster godis kv-cache

Updated Sep 14, 2025
Go

Zefan-Cai / KVCache-Factory

Unified KV Cache Compression Methods for Auto-Regressive Models

kv-cache llm kv-cache-compression

Updated Jul 10, 2026
Python

NVIDIA / kvpress

LLM KV cache compression made easy

python transformers inference pytorch kv-cache large-language-models llm long-context kv-cache-compression

Updated Jul 9, 2026
Python

harleyszhang / llm_note

Notes on LLMs, covering model inference, transformer model structure, and LLM framework code analysis.

cuda-programming transformer-models kv-cache llm vllm llm-inference triton-kernels

Updated Jul 2, 2026
Python

Anbeeld / beellama.cpp

KVarN, KV cache precision tail, low-bit quants in llama.cpp for longer context of better precision in the same VRAM

inference quantization kv-cache llm llm-serving llama-cpp ggml llm-inference speculative-decoding

Updated Jul 23, 2026
C++

therealoliver / Deepdive-llama3-from-scratch

Implement llama3 inference step by step: grasp the core concepts, master the derivation, and write the code.

Updated Feb 24, 2025
Jupyter Notebook

zengxiao-he / tessera

From teacher to tiles: a from-scratch LLM distillation & serving engine with custom Triton/CUDA kernels, FSDP distillation, paged-KV continuous batching, speculative decoding, a Rust gateway, a JAX oracle, and interpretability tooling.

rust cuda pytorch triton quantization knowledge-distillation inference-engine jax kv-cache ml-systems llm mechanistic-interpretability fsdp flash-attention speculative-decoding paged-attention

Updated Jun 5, 2026
Python

raymin0223 / mixture_of_recursions

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation (NeurIPS 2025)

router early-exiting adaptive-computation kv-cache llm recursive-transformers

Updated Sep 26, 2025
Python

openinfer-project / openinfer

Pure Rust + CUDA LLM inference engine — no PyTorch, OpenAI-compatible, serves Qwen3 to Kimi-K2

Updated Jul 24, 2026
Rust

FMInference / H2O

[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.

sparsity high-throughput heavy-hitters kv-cache gpt-3 large-language-models

Updated Aug 1, 2024
Python

Zefan-Cai / Awesome-LLM-KV-Cache

Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes.

kv-cache llm kv-cache-quantization kv-cache-compression

Updated Jun 17, 2026

huawei-csl / KVarN

KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

quantization kv-cache llm long-context vllm llm-inference agentic-ai

Updated Jun 22, 2026
Python

thu-nics / C2C

[ICLR'26] The official code implementation for "Cache-to-Cache: Direct Semantic Communication Between Large Language Models"

multi-agent kv-cache llm

Updated Mar 13, 2026
Python

quantumaikr / quant.cpp

LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.

embeddable transformer pure-c quantization delta-compression kv-cache llm llm-inference gguf turboquant

Updated Apr 26, 2026
C

dipampaul17 / KVSplit

Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.

metal optimization quantization m2 m3 m1 memory-optimization kv-cache apple-silicon llm generative-ai llama-cpp

Updated May 21, 2025
Python

jjiantong / Awesome-KV-Cache-Optimization

[ACL 2026] Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization

machine-learning ai system computer-architecture neural-language-processing mlsys kv-cache serving-ml llm llm-serving llm-inference

Updated Jun 28, 2026
Python

QuanjianSong / FashionChameleon

Official PyTorch code for the paper "FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization"

streaming real-time fashion interactive rewards dmd sft teacher-forcing kv-cache video-diffusion-models video-customization garment-switch self-forcing

Updated May 31, 2026
Python

alibaba / tair-kvcache

Alibaba Cloud's high-performance KVCache system for LLM inference, with components for global cache management, inference simulation (HiSim), and more.

simulator kv-cache llm kvcache hisim

Updated Jul 24, 2026
C++

NVIDIA-Merlin / HierarchicalKV

HierarchicalKV is part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. Its key capability is storing key-value feature embeddings in the high-bandwidth memory (HBM) of GPUs and in host memory. It can also be used as generic key-value storage.

gpu cuda recommender-system hashtable key-value-store kv-cache dynamic-embedding embedding-storage

Updated May 22, 2026
Cuda
