BoA

This repository contains the code for the ICML 2025 paper BoA: Attention-Aware Post-Training Quantization without Backpropagation.

The current release includes the following features:

  • Implementation of the proposed BoA: boa.py
  • Quantization of OPT, Llama, Llama2, Llama3, Qwen2.5, Qwen3 models: main.py
  • Evaluation of the perplexity and 0-shot accuracy (8 tasks) of quantized models (a minimal perplexity sketch follows this list)
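For reference, the snippet below is a minimal sketch of how WikiText-2 perplexity is commonly measured for a causal language model with Hugging Face transformers; it is not the repository's evaluation code, and the model id and context length are placeholder assumptions.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id; in practice this would be the quantized model.
model_id = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).cuda().eval()

# Concatenate the WikiText-2 test split and cut it into fixed-length segments.
test_set = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
input_ids = tokenizer("\n\n".join(test_set["text"]), return_tensors="pt").input_ids
seq_len = 2048  # assumed evaluation context length

nlls = []
with torch.no_grad():
    for i in range(0, input_ids.size(1) - seq_len, seq_len):
        batch = input_ids[:, i : i + seq_len].cuda()
        # The returned loss is the mean token-level negative log-likelihood.
        nlls.append(model(batch, labels=batch).loss.float())

ppl = torch.exp(torch.stack(nlls).mean())
print(f"WikiText-2 perplexity: {ppl.item():.2f}")
```

The 0-shot accuracies over the eight tasks are typically obtained by running an evaluation harness such as lm-evaluation-harness on the same quantized checkpoint.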

Dependencies

  • see requirements.txt

BoA options

  • block_v: whether to apply the block-wise objective to the value projection layer. In memory-limited settings, deactivating this option reduces memory usage significantly, at the expense of a slight performance degradation.
  • act_order_col: whether to re-order columns before quantization based on the column-wise Hessian $\mathbf{H}_{col}$ (the GPTQ act-order heuristic)
  • act_order_row: whether to re-order rows before quantization based on the row-wise Hessian $\mathbf{H}_{row}$
  • qparam_comput: how to select the quantization grids. Grids can be determined naively from the weight range (MinMax), by minimizing the weight perturbation (MMSE), or by minimizing the layer-wise reconstruction error (Hessian); see the sketch below.

For more details on other arguments, please refer to process_args.py.
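For intuition, the following is a minimal NumPy sketch of what the act_order_* and qparam_comput choices correspond to conceptually; it is not the implementation in boa.py, and the function names are illustrative only.

```python
import numpy as np

def act_order_permutation(H):
    """GPTQ-style act-order heuristic: quantize the columns (or rows) with
    the largest Hessian diagonal, i.e. the most sensitive ones, first."""
    return np.argsort(-np.diag(H))

def minmax_scale(w, n_bits):
    """Naive MinMax grid: the largest-magnitude weight maps to the largest
    integer level (symmetric quantization assumed for illustration)."""
    qmax = 2 ** (n_bits - 1) - 1
    return np.abs(w).max() / qmax

def mmse_scale(w, n_bits, num_candidates=100):
    """MMSE grid: search over clipping ratios and keep the scale that
    minimizes the weight perturbation ||w - q(w)||^2."""
    qmax = 2 ** (n_bits - 1) - 1
    best_scale, best_err = minmax_scale(w, n_bits), np.inf
    for ratio in np.linspace(0.5, 1.0, num_candidates):
        scale = ratio * np.abs(w).max() / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
        err = np.sum((w - q) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale
```

The Hessian option follows the same search pattern but scores candidate grids by the layer-wise reconstruction error rather than the raw weight perturbation.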

Experimental Results

  • Setup
    • An NVIDIA H100 GPU was used.
    • The block_v option was activated.
    • The qparam_comput option was set to Hessian.
    • All four combinations of act_order_row and act_order_col were tested, and the best result with respect to Wiki2 PPL is reported.

Results on Qwen2.5 Models

  • INT2 weight-only quantization

    | Size | act_order_row | act_order_col | Wiki2 ($\downarrow$) | C4-new ($\downarrow$) | 0-shot ($\uparrow$) |
    |------|---------------|---------------|----------------------|-----------------------|---------------------|
    | 0.5B | O | O | 144.7 | 455.8 | 32.56 |
    | 1.5B | O | O | 58.09 | 235.7 | 36.93 |
    | 3B   | X | O | 26.55 | 90.77 | 43.28 |
    | 7B   | X | O | 23.14 | 103.4 | 43.79 |
    | 14B  | O | O | 12.05 | 37.64 | 57.4  |
  • INT3 weight-only quantization

    | Size | act_order_row | act_order_col | Wiki2 ($\downarrow$) | C4-new ($\downarrow$) | 0-shot ($\uparrow$) |
    |------|---------------|---------------|----------------------|-----------------------|---------------------|
    | 0.5B | O | O | 20.12 | 40.14 | 45.68 |
    | 1.5B | X | O | 12.03 | 23.52 | 56.90 |
    | 3B   | O | O | 9.541 | 17.32 | 60.29 |
    | 7B   | O | O | 9.054 | 18.19 | 66.12 |
    | 14B  | O | O | 6.492 | 12.20 | 71.80 |
  • Quantization processing time

    | Size | Time (min) |
    |------|------------|
    | 0.5B | 6.145 |
    | 1.5B | 22.81 |
    | 3B   | 39.59 |
    | 7B   | 63.14 |
    | 14B  | 150.9 |

Results on Qwen3 Models

  • INT2 weight-only quantization

    | Size | act_order_row | act_order_col | Wiki2 ($\downarrow$) | C4-new ($\downarrow$) | 0-shot ($\uparrow$) |
    |------|---------------|---------------|----------------------|-----------------------|---------------------|
    | 4B  | X | O | 78.57 | 302.8 | 35.13 |
    | 8B  | X | O | 32.53 | 96.43 | 38.47 |
    | 14B | X | O | 24.29 | 75.76 | 42.41 |
  • INT3 weight-only quantization

    | Size | act_order_row | act_order_col | Wiki2 ($\downarrow$) | C4-new ($\downarrow$) | 0-shot ($\uparrow$) |
    |------|---------------|---------------|----------------------|-----------------------|---------------------|
    | 4B  | X | O | 28.19 | 48.24 | 45.25 |
    | 8B  | X | O | 15.62 | 29.60 | 53.69 |
    | 14B | X | O | 10.62 | 18.61 | 66.69 |
  • Quantization processing time

    | Size | Time (min) |
    |------|------------|
    | 4B  | 43.10 |
    | 8B  | 83.77 |
    | 14B | 132.6 |

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC).

Citation

If you find this work useful for your research, please cite our paper:

@inproceedings{kimboa,
  title={BoA: Attention-aware Post-training Quantization without Backpropagation},
  author={Kim, Junhan and Kim, Ho-young and Cho, Eulrang and Lee, Chungman and Kim, Joonyoung and Jeon, Yongkweon},
  booktitle={Forty-second International Conference on Machine Learning}
}
