gguf-py : simplify support for quant types #8838
Conversation
`gguf-py/gguf/constants.py` (outdated)
```python
# Default quantization type for each file type
# Keep this the same as in llama_model_quantize_internal from llama.cpp
LlamaFileTypeMap: dict[LlamaFileType, GGMLQuantizationType] = {
    LlamaFileType.MOSTLY_Q4_0:     GGMLQuantizationType.Q4_0,
    LlamaFileType.MOSTLY_Q4_1:     GGMLQuantizationType.Q4_1,
    LlamaFileType.MOSTLY_Q5_0:     GGMLQuantizationType.Q5_0,
    LlamaFileType.MOSTLY_Q5_1:     GGMLQuantizationType.Q5_1,
    LlamaFileType.MOSTLY_Q8_0:     GGMLQuantizationType.Q8_0,
    LlamaFileType.MOSTLY_F16:      GGMLQuantizationType.F16,
    LlamaFileType.MOSTLY_BF16:     GGMLQuantizationType.BF16,
    LlamaFileType.ALL_F32:         GGMLQuantizationType.F32,

    # K-quants
    LlamaFileType.MOSTLY_Q2_K_S:   GGMLQuantizationType.Q2_K,
    LlamaFileType.MOSTLY_Q2_K:     GGMLQuantizationType.Q2_K,
    LlamaFileType.MOSTLY_IQ3_XS:   GGMLQuantizationType.IQ3_S,
    LlamaFileType.MOSTLY_Q3_K_S:   GGMLQuantizationType.Q3_K,
    LlamaFileType.MOSTLY_Q3_K_M:   GGMLQuantizationType.Q3_K,
    LlamaFileType.MOSTLY_Q3_K_L:   GGMLQuantizationType.Q3_K,
    LlamaFileType.MOSTLY_Q4_K_S:   GGMLQuantizationType.Q4_K,
    LlamaFileType.MOSTLY_Q4_K_M:   GGMLQuantizationType.Q4_K,
    LlamaFileType.MOSTLY_Q5_K_S:   GGMLQuantizationType.Q5_K,
    LlamaFileType.MOSTLY_Q5_K_M:   GGMLQuantizationType.Q5_K,
    LlamaFileType.MOSTLY_Q6_K:     GGMLQuantizationType.Q6_K,
    LlamaFileType.MOSTLY_IQ2_XXS:  GGMLQuantizationType.IQ2_XXS,
    LlamaFileType.MOSTLY_IQ2_XS:   GGMLQuantizationType.IQ2_XS,
    LlamaFileType.MOSTLY_IQ2_S:    GGMLQuantizationType.IQ2_XS,
    LlamaFileType.MOSTLY_IQ2_M:    GGMLQuantizationType.IQ2_S,
    LlamaFileType.MOSTLY_IQ3_XXS:  GGMLQuantizationType.IQ3_XXS,
    LlamaFileType.MOSTLY_IQ1_S:    GGMLQuantizationType.IQ1_S,
    LlamaFileType.MOSTLY_IQ1_M:    GGMLQuantizationType.IQ1_M,
    LlamaFileType.MOSTLY_IQ4_NL:   GGMLQuantizationType.IQ4_NL,
    LlamaFileType.MOSTLY_IQ4_XS:   GGMLQuantizationType.IQ4_XS,
    LlamaFileType.MOSTLY_IQ3_S:    GGMLQuantizationType.IQ3_S,
    LlamaFileType.MOSTLY_IQ3_M:    GGMLQuantizationType.IQ3_S,
    LlamaFileType.MOSTLY_Q4_0_4_4: GGMLQuantizationType.Q4_0_4_4,
    LlamaFileType.MOSTLY_Q4_0_4_8: GGMLQuantizationType.Q4_0_4_8,
    LlamaFileType.MOSTLY_Q4_0_8_8: GGMLQuantizationType.Q4_0_8_8,
}
```
I'm adding this because it's now used in `convert_hf_to_gguf.py` to get the default quantization type from a file type, but I'm not sure whether the file types that are not used in the convert script should still be mapped. Does anyone have an opinion on that?
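For context, a minimal sketch of the kind of lookup this map enables in the convert script; the helper name and the F16 fallback are illustrative assumptions, not the actual `convert_hf_to_gguf.py` code:

```python
from gguf.constants import GGMLQuantizationType, LlamaFileType

# Assumes the LlamaFileTypeMap from the diff above is defined alongside these enums.
from gguf.constants import LlamaFileTypeMap


def default_quant_type(ftype: LlamaFileType) -> GGMLQuantizationType:
    # Map a requested file type to its default tensor quantization type;
    # falling back to F16 for unmapped file types is an illustrative choice.
    return LlamaFileTypeMap.get(ftype, GGMLQuantizationType.F16)


assert default_quant_type(LlamaFileType.MOSTLY_Q8_0) == GGMLQuantizationType.Q8_0
```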
In general we should avoid coupling `gguf` with `llama.cpp` specifically. The `llama_ftype` enum is specific to `llama.cpp`, so maybe it would be better to avoid it.
Maybe at some point we should move the `LlamaFileType` enum from `gguf-py/gguf/constants.py` to a new `llama.cpp/constants.py`, and that file could hold the `llama.cpp`-specific file type logic and potentially other stuff.
* gguf-py : use classes for quants
* convert_hf : simplify internal quantization type selection
* gguf-py : fix flake8 lint
* gguf-py : fix BF16 numpy view type
* gguf-py : remove LlamaFileTypeMap
  Too specific to 'llama.cpp', and would be a maintenance burden to keep up to date.
* gguf-py : add generic quantize and dequantize functions
  The quant classes no longer need to be known, only the target or the source type, for 'quantize' and 'dequantize', respectively (see the usage sketch below).
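To illustrate the last commit above, here is a rough usage sketch of the generic entry points, treating the exact module path and signatures as assumptions based on the commit description rather than confirmed API:

```python
import numpy as np

from gguf.constants import GGMLQuantizationType
from gguf import quants  # assumed to expose the generic functions from gguf-py/gguf/quants.py

# Q8_0 uses 32-element blocks, so keep the last dimension a multiple of 32.
data = np.random.rand(4, 64).astype(np.float32)

# Only the target type is needed to quantize...
packed = quants.quantize(data, GGMLQuantizationType.Q8_0)

# ...and only the source type is needed to dequantize back to float32.
restored = quants.dequantize(packed, GGMLQuantizationType.Q8_0)

assert restored.shape == data.shape
```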
There are only 2 types right now supported by `gguf-py/gguf/quants.py` (`BF16` and `Q8_0`), but there will be more over time, especially for dequantization, because this could enable interesting things as in #8831.

Here, I'm reducing the amount of quant-type-specific code that needs to be written by using an abstract base class.
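A minimal sketch of the abstract-base-class pattern being described, with illustrative class and method names (the actual names in `gguf-py/gguf/quants.py` may differ):

```python
from __future__ import annotations

from abc import ABC, abstractmethod
import numpy as np

from gguf.constants import GGMLQuantizationType, GGML_QUANT_SIZES


class Quant(ABC):
    # Registry mapping each quantization type to its implementing subclass,
    # so generic quantize()/dequantize() only need the type, not the class.
    registry: dict[GGMLQuantizationType, type[Quant]] = {}

    qtype: GGMLQuantizationType

    def __init_subclass__(cls, qtype: GGMLQuantizationType, **kwargs):
        super().__init_subclass__(**kwargs)
        cls.qtype = qtype
        cls.block_size, cls.type_size = GGML_QUANT_SIZES[qtype]
        Quant.registry[qtype] = cls

    @classmethod
    @abstractmethod
    def quantize_blocks(cls, blocks: np.ndarray) -> np.ndarray:
        # Convert float32 blocks of shape (n_blocks, block_size)
        # into raw bytes of shape (n_blocks, type_size).
        ...


class Q8_0(Quant, qtype=GGMLQuantizationType.Q8_0):
    @classmethod
    def quantize_blocks(cls, blocks: np.ndarray) -> np.ndarray:
        # Per-block scale: largest magnitude mapped to 127.
        d = np.abs(blocks).max(axis=-1, keepdims=True) / 127
        qs = np.round(blocks / np.where(d == 0, 1, d)).astype(np.int8)
        # Pack the f16 scale followed by the 32 int8 quants for each block.
        return np.concatenate([d.astype(np.float16).view(np.uint8), qs.view(np.uint8)], axis=-1)
```

With this pattern, adding a new quant type only requires a new subclass with its block conversion; the registration and the shared reshaping/dispatch logic live in the base class.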
I've also simplified the type selection logic in `convert_hf_to_gguf.py`, which should allow making overrides like in #8715 simpler to implement in a more maintainable way.

I've tested with https://huggingface.co/Qwen/Qwen2-0.5B-Instruct that conversion with `convert_hf_to_gguf.py` still gives the same result as using `llama-quantize` on an F32 conversion. The files ending with `-q` were made with `llama-quantize`.

I've also tested https://huggingface.co/state-spaces/mamba-130m-hf: