This repository contains the code for the ICML 2025 paper *BoA: Attention-Aware Post-Training Quantization without Backpropagation*.
The current release includes the following features:
- Implementation of the proposed BoA: `boa.py`
- Quantization of OPT, Llama, Llama2, Llama3, Qwen2.5, and Qwen3 models: `main.py`
- Evaluation of the perplexity and 0-shot accuracy (8 tasks) of quantized models (a minimal evaluation sketch follows this list)
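For reference, perplexity evaluation typically follows the standard sliding-window recipe. Below is a minimal sketch assuming a HuggingFace-style causal LM and the WikiText-2 test split; the model identifier and the 2048-token window are illustrative assumptions, not the exact settings of `main.py`.

```python
# Minimal Wiki2 perplexity sketch (assumptions: HuggingFace-style model,
# WikiText-2 raw test split, 2048-token windows; not the exact main.py settings).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"  # hypothetical choice for illustration
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).cuda().eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.cuda()

seqlen, nlls = 2048, []
with torch.no_grad():
    for i in range(ids.shape[1] // seqlen):
        chunk = ids[:, i * seqlen:(i + 1) * seqlen]
        # labels=chunk makes the model return the mean cross-entropy of the window
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"Wiki2 PPL: {ppl.item():.2f}")
```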
## Requirements

See `requirements.txt`.
## Arguments

- `block_v`: whether to apply the block-wise objective to the value projection layer. In memory-limited settings, deactivating this option significantly reduces memory usage, at the expense of a slight performance degradation.
- `act_order_col`: whether to re-order columns before quantization based on the column-wise Hessian $\mathbf{H}_{col}$ (GPTQ heuristic)
- `act_order_row`: whether to re-order rows before quantization based on the row-wise Hessian $\mathbf{H}_{row}$
- `qparam_comput`: how to select the quantization grids. Grids can be determined naively (MinMax), by minimizing the weight perturbation (MMSE), or by minimizing the layer-wise reconstruction error (Hessian). A sketch of the re-ordering heuristic and the grid-selection strategies follows below.
For more details on other arguments, please refer to `process_args.py`.
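To make the options above concrete, here is a small illustrative sketch of the two ideas they control: Hessian-diagonal re-ordering (the GPTQ heuristic) and quantization-grid selection. It assumes per-output-channel symmetric quantization and a precomputed Hessian `H`; it is a sketch of the general techniques, not the code in `boa.py`.

```python
import torch

def reorder_by_hessian(W: torch.Tensor, H: torch.Tensor):
    """GPTQ-style act-order heuristic: process the most sensitive columns
    first, where sensitivity is measured by the diagonal of the Hessian."""
    perm = torch.argsort(torch.diag(H), descending=True)
    return W[:, perm], H[perm][:, perm], perm  # keep perm to undo the ordering

def minmax_step(w: torch.Tensor, n_bits: int = 3) -> torch.Tensor:
    # Naive MinMax grid: cover the full per-channel weight range.
    qmax = 2 ** (n_bits - 1) - 1
    return w.abs().amax(dim=1, keepdim=True) / qmax

def mmse_step(w: torch.Tensor, n_bits: int = 3, n_grid: int = 100) -> torch.Tensor:
    # MMSE grid: shrink the MinMax step and keep the scale minimizing the
    # weight perturbation ||w - Q(w)||^2. (Per the argument description, the
    # `Hessian` option instead targets the layer-wise reconstruction error.)
    qmax = 2 ** (n_bits - 1) - 1
    base = minmax_step(w, n_bits)
    best_s = base.clone()
    best_err = torch.full_like(base, float("inf"))
    for k in range(1, n_grid + 1):
        s = base * k / n_grid
        q = torch.clamp(torch.round(w / s), -qmax - 1, qmax) * s
        err = ((w - q) ** 2).sum(dim=1, keepdim=True)
        better = err < best_err
        best_s = torch.where(better, s, best_s)
        best_err = torch.where(better, err, best_err)
    return best_s
```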
## Setup

- An NVIDIA H100 GPU has been used.
- The `block_v` option has been activated, and the `qparam_comput` option has been set to `Hessian`.
- All four combinations of `act_order_row` and `act_order_col` have been tested, and the best results with respect to Wiki2 PPL are reported (a sweep sketch follows this list).
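The sweep over `act_order_row` and `act_order_col` can be scripted as below; `quantize_and_eval` is a hypothetical placeholder standing in for one run of `main.py` with those options, since the exact CLI is defined in `process_args.py`.

```python
import itertools

def quantize_and_eval(act_order_row: bool, act_order_col: bool) -> float:
    """Hypothetical stand-in for one quantization + evaluation run of main.py
    with the given options; replace with an actual call into the repo."""
    return float("inf")  # placeholder result

# Try all four on/off combinations and keep the best Wiki2 PPL.
results = {
    (row, col): quantize_and_eval(row, col)
    for row, col in itertools.product([False, True], repeat=2)
}
best = min(results, key=results.get)
print(f"Best: act_order_row={best[0]}, act_order_col={best[1]}, "
      f"Wiki2 PPL={results[best]:.2f}")
```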
## Results

### Qwen2.5

#### INT2 weight-only quantization
| Size | `act_order_row` | `act_order_col` | Wiki2 ($\downarrow$) | C4-new ($\downarrow$) | 0-shot ($\uparrow$) |
|------|-----------------|-----------------|----------------------|-----------------------|---------------------|
| 0.5B | O | O | 144.7 | 455.8 | 32.56 |
| 1.5B | O | O | 58.09 | 235.7 | 36.93 |
| 3B   | X | O | 26.55 | 90.77 | 43.28 |
| 7B   | X | O | 23.14 | 103.4 | 43.79 |
| 14B  | O | O | 12.05 | 37.64 | 57.4  |

(O: option activated, X: option deactivated)
#### INT3 weight-only quantization
| Size | `act_order_row` | `act_order_col` | Wiki2 ($\downarrow$) | C4-new ($\downarrow$) | 0-shot ($\uparrow$) |
|------|-----------------|-----------------|----------------------|-----------------------|---------------------|
| 0.5B | O | O | 20.12 | 40.14 | 45.68 |
| 1.5B | X | O | 12.03 | 23.52 | 56.90 |
| 3B   | O | O | 9.541 | 17.32 | 60.29 |
| 7B   | O | O | 9.054 | 18.19 | 66.12 |
| 14B  | O | O | 6.492 | 12.20 | 71.80 |
#### Quantization processing time
| Size | Time (min) |
|------|------------|
| 0.5B | 6.145 |
| 1.5B | 22.81 |
| 3B   | 39.59 |
| 7B   | 63.14 |
| 14B  | 150.9 |
### Qwen3

#### INT2 weight-only quantization
| Size | `act_order_row` | `act_order_col` | Wiki2 ($\downarrow$) | C4-new ($\downarrow$) | 0-shot ($\uparrow$) |
|------|-----------------|-----------------|----------------------|-----------------------|---------------------|
| 4B   | X | O | 78.57 | 302.8 | 35.13 |
| 8B   | X | O | 32.53 | 96.43 | 38.47 |
| 14B  | X | O | 24.29 | 75.76 | 42.41 |
#### INT3 weight-only quantization
| Size | `act_order_row` | `act_order_col` | Wiki2 ($\downarrow$) | C4-new ($\downarrow$) | 0-shot ($\uparrow$) |
|------|-----------------|-----------------|----------------------|-----------------------|---------------------|
| 4B   | X | O | 28.19 | 48.24 | 45.25 |
| 8B   | X | O | 15.62 | 29.60 | 53.69 |
| 14B  | X | O | 10.62 | 18.61 | 66.69 |
#### Quantization processing time
| Size | Time (min) |
|------|------------|
| 4B   | 43.10 |
| 8B   | 83.77 |
| 14B  | 132.6 |
## License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC).
## Citation

If you find this work useful for your research, please cite our paper:
```bibtex
@inproceedings{kimboa,
  title={BoA: Attention-aware Post-training Quantization without Backpropagation},
  author={Kim, Junhan and Kim, Ho-young and Cho, Eulrang and Lee, Chungman and Kim, Joonyoung and Jeon, Yongkweon},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025}
}
```