ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs

ARCQuant is a high-performance quantization framework for low-bit LLMs that improves accuracy under fine-grained formats such as NVFP4, while preserving a unified and efficient inference pipeline.

While fine-grained quantization formats such as NVFP4 effectively isolate quantization noise, activation outliers can still cause severe accuracy degradation in critical channels. Traditional mixed-precision methods address this by splitting computations into separate branches, which introduces additional kernel launch overhead and memory fragmentation.

ARCQuant takes a different approach. Instead of treating outliers as a separate computation path, we leverage the structural sparsity of quantization errors in fine-grained settings. We capture the quantization residuals of critical channels and fuse them back into the computation as Augmented Residual Channels (ARC).
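To make the idea concrete, here is a minimal NumPy sketch of the augmented-channel trick (not the actual ARCQuant implementation): a simple symmetric per-group fake quantizer stands in for NVFP4, the residual columns of a few activation-critical channels are appended to the weight matrix, and the matching activation columns are appended to the input, so a single GEMM absorbs the correction. All shapes, the group size, and the channel-selection heuristic here are illustrative assumptions.

```python
import numpy as np

def fake_quant_groups(w, bits=4, group=16):
    """Symmetric per-group fake quantization. A stand-in for NVFP4's
    fine-grained scaling (real NVFP4 uses FP4 e2m1 values with FP8 scales)."""
    out, cin = w.shape
    g = w.reshape(out, cin // group, group)
    scale = np.abs(g).max(axis=-1, keepdims=True) / (2 ** (bits - 1) - 1)
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(g / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (q * scale).reshape(out, cin)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 64))
X[:, :4] *= 20.0                      # a few outlier-heavy activation channels
W = rng.normal(size=(128, 64))

Wq = fake_quant_groups(W)
# Pick "critical" input channels by max |activation| (illustrative heuristic).
idx = np.argsort(np.abs(X).max(axis=0))[-8:]

# Augmented Residual Channels: append the weight residuals of the critical
# channels (kept unquantized here for clarity; ARC would quantize them too)
# and the matching activation columns, so one GEMM absorbs the correction.
R = (W - Wq)[:, idx]
X_aug = np.concatenate([X, X[:, idx]], axis=1)
W_aug = np.concatenate([Wq, R], axis=1)

Y_ref = X @ W.T
err_base = np.abs(X @ Wq.T - Y_ref).mean()   # plain quantized GEMM error
err_arc = np.abs(X_aug @ W_aug.T - Y_ref).mean()
assert err_arc < err_base                     # residual channels cut the error
```

Because the correction rides along as extra columns of the same GEMM, there is no second branch, no extra kernel launch, and no mixed-precision memory fragmentation.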

News

  • [2026/04] 🏆 ARCQuant has been accepted to ACL 2026 Main Conference!
  • [2026/03] 🔥 ARCQuant has been integrated into NVIDIA TensorRT-LLM, with contributions from Tracin!
  • [2026/01] 🔥 ARCQuant is publicly available on arXiv! Check out our paper here.

To Do

  • Release code for reproducing results.
  • Release CUDA kernels for NVFP4.
  • Support vLLM integration.
  • Model Support: Add support for more model families and architectures:
    • Qwen3
    • Mixtral
    • Wan2.2

Installation

conda create -n arcquant python=3.10 -y
conda activate arcquant

Please make sure that CUDA 12.8 is available in your environment.

git clone --recurse-submodules https://github.com/actypedef/ARCQuant.git
cd ARCQuant
pip install -r requirements.txt

Usage

Building Kernels

sudo apt-get update
sudo apt-get install python3-dev
conda install pybind11
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
cd kernels/
bash remake.sh

This process may take a few minutes.

Preprocessing

Precomputed reorder_indices and select_num are required for quantization:

python reorder_indices.py --model /PATH/TO/YOUR/MODEL/ --samples 128 --seqlen 2048 --act_sort_metric max

The generated files will be saved to ./saved/.
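For intuition, the --act_sort_metric max step can be sketched as follows. This is an illustrative guess at the metric, not the repository's code: rank input channels by their maximum absolute activation over the calibration samples, so outlier-heavy channels sort first. The array shapes and the outlier channel are made-up stand-ins for real calibration activations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in calibration activations: (samples * seqlen, hidden) flattened.
acts = rng.normal(size=(4096, 64))
acts[:, 7] *= 50.0                          # one strongly outlier-heavy channel

metric = np.abs(acts).max(axis=0)           # the "max" activation metric
reorder_indices = np.argsort(metric)[::-1]  # most critical channels first
assert reorder_indices[0] == 7              # the outlier channel ranks first
```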

Accuracy Evaluation

ARCQuant supports multiple formats, including NVFP4, MXFP4, HiF4, and INT4. You can modify the quant_type parameter as needed.

bash evaluate.sh /PATH/TO/YOUR/MODEL/

Efficiency Evaluation

FlashInfer:

cd third-party/flashinfer
python -m pip install -v .

vLLM-based efficiency evaluation scripts will be released in a future update.

Citation

@article{meng2026arcquant,
  title={ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs},
  author={Meng, Haoqian and Luo, Yilun and Zhao, Yafei and Liu, Wenyuan and Zhang, Peng and Ma, Xindian},
  journal={arXiv preprint arXiv:2601.07475},
  year={2026}
}

Acknowledgements

This project builds on several excellent open-source efforts. We sincerely thank the community for their contributions.
