kv-cache-compression

First open-source implementation of Google TurboQuant (ICLR 2026) -- near-optimal KV cache compression for LLM inference. 5x compression with near-zero quality loss.

machine-learning compression deep-learning pytorch transformer attention quantization iclr vector-quantization memory-optimization kv-cache google-research llm vllm llm-inference kv-cache-compression

Updated Apr 17, 2026
Python

JIA-Lab-research / Q-LLM

This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"

fast-inference inference-acceleration large-language-models long-context kv-cache-compression

Updated Jul 16, 2024
Python

abdelfattah-lab / xKV

xKV: Cross-Layer SVD for KV-Cache Compression

mla low-rank long-context llm-inference deepseek kv-cache-compression inter-layer

Updated Nov 30, 2025
Python

Linking-ai / SCOPE

(ACL2025 oral) SCOPE: Optimizing KV Cache Compression in Long-context Generation

long-context kv-cache-compression kvcache

Updated May 28, 2025
Jupyter Notebook

Janghyun1230 / FastKVzip

Accurate and fast KV cache compression with a gating mechanism

large-language-models kv-cache-compression

Updated Apr 5, 2026
Python

OnlyTerp / kvtc

First open-source KVTC implementation (NVIDIA, ICLR 2026) -- 8-32x KV cache compression via PCA + adaptive quantization + entropy coding

compression pytorch nvidia transformer pca attention dynamic-programming quantization deflate entropy-coding memory-optimization kv-cache llm llm-inference kv-cache-compression iclr-2026

Updated Apr 17, 2026
Python

aivrar / vllm-windows-build

Native Windows build of vLLM 0.19.1: no WSL, no Docker. Pre-built wheels + 34-file Windows patch + Multi-TurboQuant KV cache compression (6 methods, 2x cache capacity). PyTorch 2.10 + CUDA 12.6 + Triton + Flash-Attention 2.

Updated Apr 26, 2026
Python

FluffyAIcode / LLM-KV--Cache-compress

Discrete Kakeya cover for LLM KV cache: D4/E8 nested-lattice quantisation realising a Kakeya-style tube-cover over the direction sphere. 2.4x-2.8x compression at <1% perplexity loss on Qwen3, Llama-3, DeepSeek, GLM-4, Gemma. Drop-in transformers.DynamicCache. pip install kakeyalattice.

transformers quantization discrete-geometry kv-cache long-context vllm llm-inference kv-cache-compression qwen3 lattice-quantization e8-lattice d4-lattice kakeya kakeya-set

Updated Apr 30, 2026
Python

MGDDestiny / Lava

LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation

llm kv-cache-compression

Updated Sep 17, 2025
Python

Ryuketsukami / turboquant-skill

AI agent skill implementing Google's TurboQuant compression algorithm (ICLR 2026) — 6x KV cache memory reduction, 8x speedup, zero accuracy loss. Compatible with Claude Code, Codex CLI, and all Agent Skills-compatible tools.

Updated Mar 28, 2026
Python

NAME0x0 / AVA

Research and training stack for AVA — a tool-using, memory-aware virtual assistant targeting 4 GB VRAM. Spans custom transformers, verifier-RL, external memory, multi-domain benchmarks, and Gemma 4 inference optimization.

Updated Apr 7, 2026
Python

Ryuketsukami / turboquant-compression

Near-optimal vector quantization for LLM KV cache compression. Python implementation of TurboQuant (ICLR 2026) — PolarQuant + QJL for 3-bit quantization with minimal accuracy loss and up to 8x memory reduction.

Updated Mar 28, 2026
Python

alexluchen / pitfalls-of-kv-cache-compression

Repository for the paper: https://arxiv.org/abs/2510.00231

machine-learning artificial-intelligence kv-cache-compression

Updated Oct 6, 2025
Python

kv-cache-compression

Here are 31 public repositories matching this topic...

Zefan-Cai / KVCache-Factory

NVIDIA / kvpress

Zefan-Cai / Awesome-LLM-KV-Cache

snu-mllab / KVzip

itsnamgyu / block-transformer

shadowpa0327 / Palu

snu-mllab / Context-Memory

OnlyTerp / turboquant

JIA-Lab-research / Q-LLM

abdelfattah-lab / xKV

Linking-ai / SCOPE

Janghyun1230 / FastKVzip

OnlyTerp / kvtc

aivrar / vllm-windows-build

FluffyAIcode / LLM-KV--Cache-compress

MGDDestiny / Lava

Ryuketsukami / turboquant-skill

NAME0x0 / AVA

Ryuketsukami / turboquant-compression

alexluchen / pitfalls-of-kv-cache-compression
