Even more quantization types? #5063

ikawrakow started this conversation in General
Jan 21, 2024 · 12 comments · 28 replies

In addition to the IQ2_XXS, IQ2_XS, Q2_K_S (and now Q3_K_S via PR #5060) that were recently added to llama.cpp, I have experimented with a number of other quantization types in a private development repository. Before embarking on a journey to add some of those to llama.cpp, I think it is useful to discuss if this will be considered a welcome addition:

  • PRO: more quants allow for more fine-grained control over the model-size vs generation-quality tradeoff, which can be very useful for "Inference at the edge", the main focus of this project
  • CON: more quants means more code and the associated maintenance burden, along with even more stuff for users to remember/understand

To get the discussion going, in what follows I give a brief summary of what these additional quants bring to the table:

  1. Row-wise quantization
  2. Non-linear quantization
  3. k-means clustering quantization

1. Row-wise quantization

All existing llama.cpp quantization types utilize a block-wise structure - either blocks of 32 quants (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0), or blocks of 16 or 32 quants in super-blocks of 256 for the k-quants. Each super-block of 256 quants has 1 or 2 floating-point scales that convert the quants to actual model weights. My experiments show that the increase in quantization error when going from super-block scales to row-wise scales is very minor. Hence, one can go to scales per tensor row. There are two main benefits:

  • One can quantize models where the number of columns in some tensors is not a multiple of 256. Early examples of such models were Falcon-7B and OpenLLaMA-3B. A more recent example that seems quite popular is Qwen-14B (and derivatives). The current llama.cpp solution for a situation where k-quants cannot be used is to replace the quant type with one of Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, resulting in a larger model or a lower quality quantization.
  • There is a small saving in model size (e.g., 0.0625 bits-per-weight (bpw) for a quantization similar to Q4_K, so about 1.5%; see the back-of-the-envelope sketch below)
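
To make the size argument concrete, here is a back-of-the-envelope sketch in plain C (not llama.cpp code; the 4096-column row is just a typical example) of the scale overhead in bits-per-weight for super-block vs row-wise scales:

    #include <stdio.h>

    /* scale overhead in bpw, assuming one 16-bit scale per super-block of 256
     * weights vs one 16-bit scale per tensor row of 4096 weights */
    int main(void) {
        const double scale_bits     = 16.0;
        const double bpw_superblock = scale_bits / 256.0;    /* 0.0625 bpw  */
        const double bpw_row        = scale_bits / 4096.0;   /* ~0.0039 bpw */
        printf("super-block scales: %.4f bpw, row-wise scales: %.4f bpw\n",
               bpw_superblock, bpw_row);
        /* at ~4.5 bpw (Q4_K-like) the ~0.06 bpw saved is roughly 1.5% */
        return 0;
    }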

2. Non-linear quantization

All existing llama.cpp quantization types use a linear mapping between quants and de-quantized weights (i.e., x = a * q or x = a * q + b, where x are the de-quantized model weights, q are the quants, and a, b are block-wise quantization constants). The non-linear quants that I have experimented with use a 3rd-order relation, i.e., x = a q^3 + b q^2 + c q + d. The key benefit from doing so is that one can achieve a very similar quantization quality to the k-quants with larger blocks, thus saving precious bits and reducing quantized model size. As an example, a 3rd-order non-linear quantization with 4.125 bpw is comparable to Q4_K, which uses 4.5 bpw, for almost a 10% reduction in quantized model size. This comes at the expense of slightly lower performance (typically a few percent). But when the model does not fit into the available GPU, one can squeeze a few more layers onto the GPU that way, which can more than offset the slightly lower kernel performance. Why a 3rd-order polynomial? I can give a more detailed explanation in the PR if it comes to that.
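
As an illustration of the idea (a minimal sketch, not the actual implementation; the function name and the per-row constants are assumptions), de-quantization with such a 3rd-order mapping could look like this:

    #include <stdint.h>

    /* illustrative only: de-quantize one row with x = a*q^3 + b*q^2 + c*q + d,
     * where a, b, c, d are per-row (or per-block) constants determined at
     * quantization time by minimizing the (weighted) squared error */
    static void dequantize_row_nl3(const int8_t * q, float * x, int n,
                                   float a, float b, float c, float d) {
        for (int i = 0; i < n; ++i) {
            const float t = (float)q[i];
            x[i] = ((a*t + b)*t + c)*t + d;   /* Horner form of the cubic */
        }
    }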

3. k-means clustering quantization

k-means clustering is what is used in, e.g., SqueezeLLM. Basically, instead of scaling quants block-wise into the range provided by a given number of bits, one performs (weighted) k-means clustering on all weights in a tensor row, thus mapping weights to clusters (with the number of clusters defined by the bpw one wants to spend). In this way one can have a "true" N-bit quantization (only using 2^N bytes per tensor row for the cluster means in addition to the N bpw for the quants). k-means clustering is a tricky business and the final outcome strongly depends on the details of the clustering algorithm and the model weights used. My implementation is different from SqueezeLLM and does slightly worse on LLaMA-v1-7B (PPL = 6.04 vs their 6.03) but much better for LLaMA-v2-7B with PPL = 5.91 vs their 5.96. (I don't have their PPL values for other models. PR #3093, which would add SqueezeLLM support to llama.cpp if accepted, is ARM_NEON only, so it takes a long time to run perplexities, so I only did it for the 7B LLaMAs.) This type of quantization is never as good as k-quants or non-linear quants, but it does squeeze out a few more bits from a quantized model.
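
For readers unfamiliar with the technique, here is a purely conceptual sketch of weighted k-means over one tensor row (not my implementation, which differs from SqueezeLLM in its details; all names are illustrative):

    #include <float.h>
    #include <stdint.h>

    /* one row: x[i] are the weights, w[i] their importance weights,
     * means[0..k-1] the cluster centers (k = 2^N, so k <= 256 here),
     * idx[i] the N-bit index assigned to each weight */
    static void kmeans_row(const float * x, const float * w, int n,
                           float * means, uint8_t * idx, int k, int iters) {
        for (int it = 0; it < iters; ++it) {
            /* assignment step: nearest cluster center */
            for (int i = 0; i < n; ++i) {
                float best = FLT_MAX; int jbest = 0;
                for (int j = 0; j < k; ++j) {
                    const float d = (x[i] - means[j])*(x[i] - means[j]);
                    if (d < best) { best = d; jbest = j; }
                }
                idx[i] = (uint8_t)jbest;
            }
            /* update step: weighted mean of the members of each cluster */
            for (int j = 0; j < k; ++j) {
                double sum = 0, sumw = 0;
                for (int i = 0; i < n; ++i) {
                    if (idx[i] == j) { sum += (double)w[i]*x[i]; sumw += w[i]; }
                }
                if (sumw > 0) means[j] = (float)(sum/sumw);
            }
        }
    }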

Conclusion

Just for fun, below is a copy-paste of the ggml_type enum from my development repo. Obviously I would never add all of these to llama.cpp/ggml but only pick a few select types that offer the best model-size-vs-quality tradeoff, if the consensus is that this would be valuable.

    enum ggml_type {
        GGML_TYPE_F32  = 0, 
        GGML_TYPE_F16  = 1, 
        GGML_TYPE_Q4_0 = 2, 
        GGML_TYPE_Q4_1 = 3, 
        // GGML_TYPE_Q4_2 = 4, support has been removed
        // GGML_TYPE_Q4_3 (5) support has been removed
        GGML_TYPE_Q5_0 = 6, 
        GGML_TYPE_Q5_1 = 7, 
        GGML_TYPE_Q8_0 = 8, 
        GGML_TYPE_Q8_1 = 9, 
        // k-quantizations
        GGML_TYPE_Q2_K = 10,
        GGML_TYPE_Q3_K = 11,
        GGML_TYPE_Q4_K = 12,
        GGML_TYPE_Q5_K = 13,
        GGML_TYPE_Q6_K = 14,
        GGML_TYPE_Q8_K = 15,
        // i-quantizations
        GGML_TYPE_IQ2_XXS      = 16,
        GGML_TYPE_IQ2_XS       = 17,
        GGML_TYPE_IQ3_0        = 18,
        GGML_TYPE_IQ3_NL_B16   = 19,
        GGML_TYPE_IQ3_NL_B32   = 20,
        GGML_TYPE_IQ3_SQ       = 21,
        GGML_TYPE_IQ4_0        = 22,
        GGML_TYPE_IQ4_NL_B16   = 23,
        GGML_TYPE_IQ4_NL_B32   = 24,
        GGML_TYPE_IQ4_NL_B64   = 25,
        GGML_TYPE_IQ4_SQ       = 26,
        GGML_TYPE_IQ4_K        = 27,
        GGML_TYPE_IQ5_0        = 28,
        GGML_TYPE_IQ5_NL       = 29,
        GGML_TYPE_IQ5_SQ       = 30,
        GGML_TYPE_IQ5_K        = 31,
        GGML_TYPE_IQ6_0        = 32,
        GGML_TYPE_IQ6_NL       = 33,
        GGML_TYPE_IQ6_K        = 34,
        GGML_TYPE_IQ2_K        = 35,
        GGML_TYPE_IQ2_S        = 36,
        GGML_TYPE_IQ2_XS_RW    = 37,
        GGML_TYPE_IQ2_0        = 38,
        GGML_TYPE_IQ3_T        = 39,
        GGML_TYPE_IQ8_0        = 40,
        GGML_TYPE_Q3_I         = 41,
        GGML_TYPE_I8,
        GGML_TYPE_I16,
        GGML_TYPE_I32,
        GGML_TYPE_COUNT,
    };   

Replies: 12 comments · 28 replies


The non-linear quantization one is kind of interesting.
I have a question.
Row-wise quantization doesn't need blocks of 32 or 256,
but does non-linear quantization need them like the k-quants do?

2 replies
@ikawrakow

All quantization types after k-quants in my repository are implemented row-wise, not block-wise. The row- vs block-wise approach is independent of non-linear quantization. Yes, when you go to row-wise quantization, there is no need for the number of columns to be divisible by 32 or 256. But in practice I assume that there is at least divisibility by 32, else the implementation becomes too cumbersome. To my knowledge, all LLMs currently out there are divisible by at least 32 (if not even 64).

@sorasoras

32 is pretty much sufficient in most cases.


It would be great to have more choice, of course.
Like Jon Snow, I know nothing about development beyond fiddling around to satisfy my exact needs (compiling, fixing an easy compile mistake, measuring performance to pick the best quant and rope settings for my needs, etc.).

But basically, I'd go for the obvious here : the best compromises for each range of situation.
Let's look at this example :

[Screenshot from PR #5060 "Add Q3_K_XS" by ikawrakow: perplexity/size comparison table]

In this case, the improved Q2_K (pre-SOTA) and the Q3_K_S are competing with each other.
In PPL terms (and HellaSwag terms), the best bang for our buck between the two is this Q2_K, because the gain in size clearly outweighs the bump in perplexity in percentage terms, and 1k more context at equal VRAM usage is quite a boon. Yet the Q3_K_S was retained in the recent version of llama.cpp. In my opinion, the opposite choice should have been made, if a choice between the two had to be made (both could also be kept for granularity, as Q3_K_S and Q3_K_XS; the new quant offered by ikawrakow could then be named Q3_K_XXS, since this naming scheme already exists).

About the code, you guys know best, but I guess you apply a principle of "lowest maintenance", factorizing what can be factorized, which means that any code that can be shared between the quants should be part of common code, leaving the smallest amount of separate quant-specific code. (Don't flame me, what's obvious for you guys is sophisticated for me!)

About the IQ Quants, I'll pass on commenting because it's beyond my paygrade lol, but the IQ2_XS is great, and same general principles ofc apply!

0 replies

I think 2) is what I have been thinking about for some time but never got around to trying in practice:

  • We know that in-context learning is related to outliers, and we know that transformers are trained with layernorm which keeps numbers close to each other (and that's why these models generally quantize so well).

  • Outliers are also what is getting shaved/destroyed first by the (linear) quantization, and they are very important for coding tasks (and for MMLU). Usually models below 4 bpw get stupid really quickly. It seems reasonable to infer that outliers are responsible for this.

So what if we had an entirely different datatype, something exponential for example, which still allows encoding outliers somehow, but it's only one bit of the precision. In 3-bit, that could be -127, -2, -1, -0.001, 0.001, 1, 2, 127 or something like that. I believe this is close to your proposal; you just came up with a formula and I'd go with a handpicked/observed scale.

Another idea would be to use some kind of entropy coding instead (or even on top of that), because we know that not all weights are likely to be outliers, so if we've already seen N_MAX outliers, we can use those bits to encode "ordinary" weights (-127 being -3, for example). But I'm not sure how practical it would be to do this in a shader, so this is just a very abstract sketch. Maybe we could also save the offset of the first outlier, and use that together with the counter?

Also, I'm not sure what the actual scale of different outliers is, but from my intuitive understanding, it should be more or less the same "power" compared to ordinary weights.

Forgive me, if I'm mumbling nonsense, I am not an expert.

0 replies

Outliers are also what is getting shaved/destroyed first by the (linear) quantization, ...

This is not quite true for the quants in llama.cpp. It is actually exactly the opposite: "outliers" are kept while normal weights are "destroyed". To understand this, let's consider a block of 32 weights and Q4_0 quantization (4 bits, quantized values are in -8...7). Let's assume that we have an "outlier" in the block with a value of 1.0, and all other weights in the block are in the normal range of ±0.06 (typical values for 7B LLaMA models). The block scale becomes -1.0/8, and quants are obtained by RTN after multiplication with the inverse scale = -8/1.0. With this, we end up with a quantization for this block where the "outlier" has a value of -8, while all other quants have a value of zero. I.e., we kept the "outlier" but lost the "normal" model weights.
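
The argument is easy to verify numerically with a simplified Q4_0-style round-to-nearest sketch (illustrative only, not the exact llama.cpp kernel):

    #include <math.h>
    #include <stdio.h>

    /* one block of 32 weights: a single "outlier" of 1.0, the rest at +/-0.06 */
    int main(void) {
        float x[32];
        for (int i = 0; i < 32; ++i) x[i] = (i & 1) ? 0.06f : -0.06f;
        x[7] = 1.0f;                          /* the outlier */

        float amax = 0.0f, max = 0.0f;        /* value with the largest magnitude */
        for (int i = 0; i < 32; ++i) {
            if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); max = x[i]; }
        }
        const float d  = -max / 8.0f;         /* block scale   = -1.0/8  */
        const float id = d ? 1.0f/d : 0.0f;   /* inverse scale = -8/1.0  */

        for (int i = 0; i < 32; ++i) {
            int q = (int)roundf(x[i]*id);     /* RTN into -8...7 */
            if (q < -8) q = -8;
            if (q >  7) q =  7;
            printf("x = %+.3f -> q = %2d -> back = %+.3f\n", x[i], q, d*q);
        }
        /* the outlier survives as q = -8; every +/-0.06 weight rounds to q = 0 */
        return 0;
    }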

Outliers are lost in, e.g., quantization schemes based on clustering.

6 replies
@ikawrakow

Yes, this is the idea behind the non-linear quantization. A typical set of 16 non-linear quant values in a 4-bit quantization mapped to the uint8_t range may look like this:

quant_values[16] = {-128, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 88, 112};

You see how the difference between two consecutive quants at the beginning/end is 24 units, but it is only 11-12 units in the middle. This widens the dynamic range by a factor of two compared to a linear scale.
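
De-quantization with such a table is then just a lookup followed by the usual scaling. A minimal sketch (illustrative only; it assumes two 4-bit indices packed per byte and a single per-row or per-block scale):

    #include <stdint.h>

    static const int8_t quant_values[16] = {
        -128, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 88, 112,
    };

    /* n must be even; q holds n/2 bytes with two 4-bit indices each */
    static void dequantize_row_nl4(const uint8_t * q, float * x, int n, float scale) {
        for (int i = 0; i < n/2; ++i) {
            x[2*i+0] = scale * quant_values[q[i] & 0x0F];   /* low nibble  */
            x[2*i+1] = scale * quant_values[q[i] >>   4];   /* high nibble */
        }
    }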

@ggerganov

Do the non-linear quantizations require row-wise scaling, or could they be implemented in the framework of 256 superblocks?

@ikawrakow

The non-linear quantization types not based on k-means clustering could be implemented block-wise with some size penalty. But for me one of the main motivations for adding more quants is to not be bound to 256 divisibility.

@ggerganov

I'm worried that supporting row-sized blocks would require a lot of changes to the code base. There are assumptions such as nb[0] == ggml_type_size(type);. I guess it would be a relatively big change.

@ikawrakow

I hope to have found and fixed all such assumptions in my repository. It runs there without issues (except, perhaps, in the multi-GPU portions as I don't have the ability to test that). But I agree, it will be tedious as mainline has changed quite a bit since I last synced my repo about 3 weeks ago.


@TheBloke Any reason why there's no imatrix and the new 2 bit quants yet? It seems you're still uploading the old ones.

0 replies

In another discussion around quantization somebody brought up the recent AQLM paper. I spent some time making a more detailed comparison between the results of this paper and the quantization types available in llama.cpp. Unlike all previous claims of being SOTA, this time around the AQLM authors appear to have achieved truly SOTA results for 2- and 3-bit quantization. More on this below. The reason I'm bringing it up here is that the graph below nicely illustrates the quality-vs-size tradeoff difference between the k- and i-quants available in llama.cpp and those I was proposing to add, so I thought it could be a useful input for this discussion.

Now back to the actual comparison.

  • I'll focus exclusively on PPL as I honestly don't understand the HellaSwag/Winogrande/ARC results in the AQLM paper. For instance, according to Table 1 of the AQLM paper, LLaMA-v2-7B has a 0-shot HellaSwag score of 56.69 for the fp16 model. In contrast, using the full HellaSwag validation dataset, I compute a score of 75.7, which is more in line with the few-shot scores one finds on the HF Leader Board.
  • Apparently some researchers haven't heard of Mistral or Mixtral yet, and keep publishing LLaMA-v2 results even in January of 2024, so LLaMA-v2 it is for the sake of the comparison.
  • In all quantization papers I have seen, they keep the token embedding and the output tensor as fp16, but do not count this entirely non-negligible amount of extra bits in the bit balance. Hence, I have adjusted the bpw as output by llama.cpp, which is the total number of bits (including token embedding and output tensor) divided by the total number of parameters, by removing the bits contributed by the token embedding and output weight tensors from the bit balance (see the small sketch after this list).
  • As the PPL computed by llama.cpp differs from the PPL computed by Python tools, I'm showing the quantization error defined as PPL(Q)/PPL(fp16)-1, which should account for any differences in the way perplexity is computed.
  • It seems researchers publishing papers on LLM quantization always use the full context of the model when they publish PPL results, so I'm using a context of 4096 in the llama.cpp calculations.
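
For clarity, the two adjustments described above amount to the following (a trivial sketch; the function names are mine):

    /* bpw over the repeating layers only: remove the bits and the parameters
     * contributed by the token embedding and output tensors */
    static double adjusted_bpw(double total_bits, double emb_out_bits,
                               double total_params, double emb_out_params) {
        return (total_bits - emb_out_bits) / (total_params - emb_out_params);
    }

    /* quantization error as used in the graph below: PPL(Q)/PPL(fp16) - 1 */
    static double quantization_error(double ppl_q, double ppl_f16) {
        return ppl_q/ppl_f16 - 1.0;
    }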

The graph shows the quantization error as a function of bits-per-weight (bpw). Please note the logarithmic scale of the y-axis, chosen to be able to represent the large variation in quantization error as one goes from 2- to 4-bit quants. I have taken the GPTQ, QuIP#, and SpQR results from the AQLM paper (so don't blame me if they are wrong).

GPTQ is the orange circles. It is clearly far behind the competition, so I'll not discuss it further (although, even today, one comes across claims around the Internet of GPTQ being the SOTA when it comes to quantization). SpQR (blue triangles down, available at 3 and 4 bpw) and QuIP# (magenta triangles up, available at 2 and 4 bpw) results are respectable, quite similar to each other, but clearly not SOTA, being above the black line with squares representing published k- and i-quants. The AQLM results, shown with red circles/dashed line, are clearly the new SOTA at 2 and 3 bpw, so congratulations to the AQLM authors!

At 4 bpw it is not quite as clear. There is no 4.0-bit k- or i-quantization, and Q4_K_S at 4.5 bpw beats the AQLM 4-bit quantization by a large margin (0.9% vs 1.8% quantization error). And this brings us to the cyan squares. They represent the unpublished i-quants that I was referring to in my post above. The two at around 4.5 and 4.25 bpw are 4-bit non-linear quants. The one that is very close to the AQLM 4-bit point is simply a mix between a 4.25-bpw and a 3.25-bpw quantization to bring the bpw as close as possible to the AQLM 4-bit point (which is actually at 4.044 bpw), thus showing that AQLM is not really SOTA at 4 bits. Two of the cyan points between 3 and 3.5 bpw are non-linear 3-bit quants; the 3rd is similar to IQ3_XXS, but uses 512 instead of 256 D4 grid points. The cyan point at around 2.5 bpw is similar to IQ2_XS but uses 1024 E8-lattice points instead of 512.

I have also added the dashed black line, which shows where current k- and i-quants would sit if they were implemented row-wise instead of block-wise (and seeing the effect of row-wise quantization on this plot has been quite useful for me to gain the fresh perspective that blocks are perfectly fine unless one is fighting for the 2-bit SOTA crown, where the 0.0625 bpw added by the blocks does make a difference). In any case, at 2 bit, IQ2_XS outperforms the AQLM result by a significant margin (28.4% vs 35.4% quantization error) at just one step past "true" 2-bit quants. But at 3 bit, the AQLM result is truly remarkable. Without the unpublished i-quants in cyan, one needs to go to Q3_K_M (3.8 bpw) to beat the AQLM 3-bit quantization (it does look better for the 13B and 70B LLaMAs, where Q3_K_S at 3.44 bpw matches AQLM). So, then, by adding the 3- and 4-bit non-linear quants, llama.cpp will be SOTA again at 4 bits, and will have a better answer to the new 3-bit SOTA established by AQLM, with only a few percent larger model size required to achieve a lower PPL.

[Graph: LLaMA-v2 quantization error vs bits-per-weight (l2_qerr)]

4 replies
@ggerganov

From this analysis, to me it seems the non-linear 4-bit quants are the most interesting of the unpublished i-quants (plus, it would be nice to reclaim the SOTA crown 😄 )

If you think it's worth it to do it with the existing block-wise approach, we can proceed. For the row-wise scaling - I'm not sure if it would be worth it. Mainly because I still don't have a good idea about how much effort it would be to change ggml to support that, so I'm hesitating. And also I don't consider the 256 divisibility restriction to be a big limitation - with the fallback option to higher quants, we lose a bit of bpw but we win some ppl. And I think it is typically the case that only a fraction of the tensors in the model have ne[0] % 256 != 0 - the majority are usually divisible.

I will take a deeper look at supporting row-wise scaling, but I want to do some GPU kernel development and llama.cpp refactoring in the short term, and after that I'll dedicate some time to this.

@sorasoras

I think even with the existing block-wise approach, you could combine blocks of 256 and blocks of 64 to avoid falling back to legacy quants.
Qwen 13B has 40 tensors that are not divisible by 256, and I can see a significant output-quality impact from those 40 tensors. When legacy quants got imatrix support, the overall quality of the model improved visibly.
Row-wise scaling could be less of a priority for now, but I do think it could improve performance in theory if there is less of an odd quant mix in a model.

@turian

@ikawrakow Do you mind discussing in more detail how you are computing perplexity?

"As the PPL computed by llama.cpp differs from the PPL computed by Python tools, I'm showing the quantization error defined as PPL(Q)/PPL(fp16)-1, which should account for any differences in the way perplexity is computed." For PPL by llama.cpp, do you mean perplexity.cpp in examples? And what Python tools are you using? When you do "PPL(Q)/PPL(fp16)-1", are you doing this with llama.cpp or Python PPL?

Like perplexity.cpp, do you use stride = n_ctx/2, and then for each context window compute the perplexity only over the second half of the tokens?

@MfAl2

The AQLM Paper's QuIP# numbers are identical to those from ablation tests in the QuIP# paper.

They say they did their own "calibration", so there's some plausible deniability, but the fine-tune for QuIP# used a tiny number of sequences. If AQLM really did their own and it performed that much worse, that would require both explanation and justification for using the worse method.


I am very confused by the recent changes in quants. There are not enough tests and no clear guidelines on how to use imatrixes.

Combining quants times imatrixes times tests leads to combinatorial explosion, I can't possibly test everything, but I've done what I can.
Here is the output of my stupidly complicated script for Nous Hermes 2 Solar 10.7B

It runs mmlu, winogrande, arc and perplexity against almost all quants, pretty prints the results and compares them against Q8_0 and TheBloke quants computed a ~month ago.
nous-hermes-2-solar.flow.results.txt

The 'X' indicates if the result is statistically significant, if I didn't screw up the math.

It seems that multiple choice tests are not very helpful because the results simply do not reach statistical significance. Perplexity is better, but it's unclear how that translates into real-world performance. I wonder what could be done.

If you are looking for something specific, I have all the output from llama.cpp for tests, quants, and imatrix preparations.

Hope this helps.

1 reply
@kinchahoy

I wonder if it's worth recreating this with a smaller top model, so it's easier to iterate / test? I'm happy to dedicate some cycles to Llama-3 instruct if you share your script.


All existing llama.cpp quantization types use a linear mapping between quants and de-quantized weights (i.e., x = a * q or x = a * q + b, where x are the de-quantized model weights, q are the quants, and a, b are block-wise quantization constants). The non-linear quants that I have experimented with use a 3rd-order relation, i.e., x = a q^3 + b q^2 + c q + d.

https://en.wikipedia.org/wiki/Chebyshev_polynomials
https://en.wikipedia.org/wiki/Legendre_polynomials

Might be of interest, especially if you ever need to optimize the parameters (i.e., the standard polynomial basis coefficients aren't independent, and altering, say, c in your equation will require a and b to move a bit to compensate). Chebyshev polynomials also have the interesting property of spreading the error associated with zeroing a coefficient over the whole range (so in your case it might make the effects of increasing and decreasing the order and block sizes clearer). I can't remember where I read it, but there is also some link to information theory (again something to do with the coefficients being independent).

Another interesting thing to look at are Padé approximants:

https://en.wikipedia.org/wiki/Pad%C3%A9_approximant

The wiki page doesn't really do a good job of explaining their purpose, but this does: https://www.youtube.com/watch?v=szMaPkJEMrw

Finally, if you've never heard of it, the book The End of Error: Unum Computing is a fascinating read. Likely not that helpful directly, but it makes a good case for variable-length floating point types, and might be food for thought! :)

0 replies

Hi all,

Thank you for your previous discussions and insights on vector quantization and the support for VQ-based weights in llama.cpp. We've recently developed a method called VPTQ (Vector Post-Training Quantization), which you can explore on GitHub: https://github.com/microsoft/VPTQ. This method quantizes weights into index and vector components that form lookup tables.

Here’s a brief overview of VPTQ:

  1. Dequantization: VPTQ decompresses weights by reading vectors through an index, currently implemented with a simple CUDA dequantization kernel (a conceptual sketch of this lookup is given after this list).

  2. Vector Quantization: It organizes continuous out feature data into vectors, which should facilitate integration with existing systems like llama.cpp/ggml, avoiding issues encountered in AQLM.

  3. Storage: The index is packed and stored in INT32, while the lookup table resides in the embedding operator, which should also ease integration with current systems.
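
For reference, here is a conceptual sketch of the index-plus-lookup-table de-quantization described in point 1 (not the actual VPTQ CUDA kernel; the names, the packing details, and the padding assumption are mine):

    #include <stdint.h>
    #include <string.h>

    /* reconstruct one row of n weights: each group of vec_dim weights is one
     * codebook vector selected by an idx_bits-wide index read from packed_idx.
     * Assumes n is a multiple of vec_dim, idx_bits < 32, and that packed_idx
     * is padded with one extra 32-bit word at the end. */
    static void vq_dequant_row(const uint32_t * packed_idx, const float * codebook,
                               float * x, int n, int vec_dim, int idx_bits) {
        const uint32_t mask = (1u << idx_bits) - 1;
        uint64_t bitpos = 0;
        for (int i = 0; i < n; i += vec_dim) {
            /* read 64 bits so an index straddling two words is handled */
            const uint64_t word = (uint64_t)packed_idx[bitpos >> 5] |
                                  ((uint64_t)packed_idx[(bitpos >> 5) + 1] << 32);
            const uint32_t idx  = (uint32_t)(word >> (bitpos & 31)) & mask;
            bitpos += idx_bits;
            /* copy the selected centroid vector into the output row */
            memcpy(x + i, codebook + (size_t)idx*vec_dim, vec_dim*sizeof(float));
        }
    }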

Questions:

  1. I am planning to integrate VPTQ (vqlinear) into llama.cpp in a fork. What steps should I take to begin this process?
  2. I welcome any feedback or suggestions on this method. What are your thoughts?

Looking forward to your insights and suggestions.

Thanks!
Yang

7 replies
@YingkunZhou

@YangWang92 I also checked your models on Hugging Face -- Meta-Llama-3-8B-v6-k4096-4096-woft.
If I didn't misunderstand, the total bitwidth is log2(4096)/6 + log2(4096)/6 = 4 bits, but I need to download 4.57 + 1.05 = 5.62 GB.
However, at this model size I can download Meta-Llama-3-8B-Instruct-Q5_K_S.gguf instead.

Also, let's check out the metric mentioned by @ikawrakow -- PPL(Q)/PPL(fp16)-1.

No offense, but are you serious your quantization algorithm is SOTA?

@matt-c1

Hi @YingkunZhou! I'm an unaffiliated bystander but I can explain the GGUF discrepancy.

This is because you don't load the entire model file into VRAM: there are some layers, like (in Llama 3) token_embd.weight and output.weight (see Edit 2), which stay in RAM. GGUF quantizes these anyway, but EXL2 or VPTQ don't and leave them in fp16. This causes a higher file size, but not higher VRAM use.

Let's look at the exact numbers with that in mind, which you can do with HuggingFace's built-in explorers of safetensors and GGUF files:

                                                Meta-Llama-3-8B            Meta-Llama-3-8B-Instruct
                                                v6-k4096-4096-woft         Q5_K_S.gguf
    entire model file            size           5.62 GB                    5.60 GB
                                 parameters     8.03 B                     8.03 B
                                 bpw            5.60                       5.58
    token_embd.weight +          size           2.10 GB                    0.79 GB
    output.weight (in RAM)       parameters     1.05 B                     1.05 B
                                 bpw            16.00                      6.03
    the rest (weights in VRAM)   size           3.52 GB                    4.81 GB
                                 parameters     6.98 B                     6.98 B
                                 bpw            4.03                       5.51

So you can see that their "4 bit model" is indeed 4 bpw (bits per weight) where it matters, but their valid choice to leave some layers in fp16 makes the model files on disk larger.

I don't think there is a GGUF type that would use 4 bpw weights - the Q4 types tend to use significantly above 4 but below 5, and Q3 above 3 but below 4. And comparing to Q5 is out of the question, as that's over 5 bpw - 5.5 bpw in case of Q5_K. So there's no 1:1 perfect comparison to be made here. But the closest I think are:

  • IQ4_XS - 2.34% PPL error (at bpw > 4.0)
  • IQ3_K_L - 5.37% PPL error (at bpw < 4.0)

Which makes the VPTQ 4-bit score of 4.56% seem OK, I think, as both its bpw and this particular quality metric are between these two GGUF variants.

Edit: now that I think of it, I'm not sure if GGUF / llama.cpp still loads these layers into VRAM or not. But I still think that inference backends for quant types which don't quantize these layers don't load them into VRAM, like Exllamav2 (see Edit 2). This is brought up now and then when people bring up the stark file size difference between GGUF and EXL2, despite that extra size not affecting VRAM use.

Edit 2: Exllama v2 leaves embed_tokens.weight at fp16, which seemingly does not affect VRAM use or performance (significantly). It does quantize the "head" (output.weight or lm_head.*) to 8 or 6 bpw usually, which does affect VRAM use. Meanwhile, llama.cpp with GGUF quantizes both of these but I don't know how they affect performance or memory use. For llama 3 8B, both of these layers are around 525 million parameters.

@YangWang92

Hi @matt-c1

Thank you for helping me clarify the issue regarding bpw. I apologize for my unfamiliarity with the GGUF format. I’ve started to gradually explore contributing VPTQ to llama.cpp this week (at least contributing it to our own fork). In our paper, the bitwidth calculation includes the overhead of both the index and lookup table, so I believe our bit per weight (bpw) is very close to the nominal bitwidth while taking other costs into account.

From our experiments with VPTQ, we observed that even an increase of 0.1 bits in bpw can significantly improve the accuracy of the quantized models. Therefore, we believe that such precise calculations and comparisons can help us better understand the performance of different methods. Additionally, I am preparing evaluations of various quantization methods to obtain more accurate results.

@YangWang92

Hi @YingkunZhou,
Thank you very much for your question. I’ve been extremely busy these past couple of weeks, so I apologize for the delayed response. We appreciate any suggestions you may have for us.

VPTQ was originally submitted to EMNLP 2024 (deadline June 15, 2024). I carefully checked, and AQLM updated their paper on June 8, 2024, and again on September 11, 2024—right before and after the deadline, respectively. I consulted with our co-author and here is our detailed response:

Thank you for your interest in our work. Below are our responses to the PPL-related questions you raised:

  • LLaMA-2 2-bit quantization results (Table 2 of VPTQ): The AQLM PPL results we report are based on version 2 of their paper on arXiv. Our paper was submitted to EMNLP 2024 with a June 15 deadline, while their version 3 was published on June 8. The results you mentioned appear to come from version 4, which was released on September 11. We had not noticed the updated results at the time, so thank you for bringing this to our attention.

  • LLaMA-2 3-bit and 4-bit quantization results (Table 5 of VPTQ): The results in Table 5 for VPTQ’s 3-bit and 4-bit quantization do not include end-to-end fine-tuning (as explained in Section 5.2). When compared to AQLM's results, both methods demonstrate strengths under different settings for LLaMA-2 7b/13b/70b. For example, AQLM achieves better results for LLaMA-2 7b in 4-bit, while our 3-bit PPL is lower. Therefore, we believe VPTQ and AQLM perform comparably (with an absolute PPL difference of less than 0.05). However, our primary focus with vector quantization is on extreme low-bit compression (2-bit), and comparisons for 3-bit and 4-bit quantization are not our main focus.

@YangWang92

Hi all,

Additionally, I understand that the low-bit quantization field is highly competitive, with many constantly updating their results. Therefore, outside of places like our GitHub repository or where the paper initially claimed SOTA, we haven’t emphasized or declared that we hold the SOTA position (although I do believe we have an advantage in the 2-bit domain with lower decoding costs). Of course, I also hope that more papers and research will accurately calculate bit-per-weight, rather than hiding more parameters within the models.

We will continue to update our work, so please stay tuned for more developments and discussions. Do you have any suggestions for applying this method to llama.cpp?

As provided by the open-source community, the weights of current models (e.g., on Hugging Face) are stored as INT32 tensors (for packed index) and embeddings (for centroids/lookup tables, in FP16), which seems to be compatible with GGUF representation. After reviewing the GGUF format, it appears I may not need an additional format for storage. Would it be sufficient to add the VPTQ dequantization function in our own fork to easily deploy VPTQ?


I normally don't hang around here anymore, but given that I started this thread, I decided to chime in.

Thank you, @matt-c1, for sorting out that the token embedding and output tensors are left as f16 in the VPTQ models published on HF. Yes, quantization researchers always ignore those two in their bit balances, and I was curious if someone will point it out. I'll say more about the choice to not quantize these tensors below.

Which makes the VPTQ 4-bit score of 4.56% seem OK, I think, as both its bpw and this particular quality metric are between these two GGUF variants.

It is OK, but is it SOTA as claimed?

Here is a graph that shows quantization error (PPL(Q)/PPL(f16)-1) as a function of bits-per-weight (bpw) for LLaMA-3.1-8B-Instruct. To not compare apples to oranges, I have left token_embd.weight and output.weight as f16 (just use --token-embedding-type f16 --output-tensor-type f16 when quantizing), and have modified llama.cpp to output the bpw excluding these two tensors, so the bpw used in the graph is for the repeating layers only, just like reported in quantization literature papers. Blue symbols are k-quants (published June 2023), black symbols are i-quants (published Jan-Feb 2024), and the cyan symbols are new quants that I have developed in the last 2-3 months in my repository, where I still play around with stuff around LLM inference using a llama.cpp fork. The magenta squares are VPTQ results taken from their paper. y-axis is logarithmic, so differences between the various curves are quite significant, even if they appear close to each other.
[Graph: quantization error vs bits-per-weight for LLaMA-3.1-8B-Instruct (il31)]

So, is VPTQ quantization SOTA?

  • Definitely not at 4 bpw.
  • Basically the same as IQ3_XXS at 3 bpw. I personally wouldn't call matching the performance of something that has been around for a while SOTA.
  • At 2 bpw, sure, VPTQ is better. But it is only better after fine-tuning. I don't have the impression that any of the research groups competing for the Quantization Crown™ on arXiv can match the quants here without fine-tuning. Which, in my mind, brings up the question of how practical approaches such as this are. Who has the time, energy, interest, and computational resources to be fine-tuning quantized models for hours (per model) on a very high-end GPU? Perhaps one can do it for the major models, but what about the countless fine-tunes on HF?

Is it OK to leave the token embedding and output tensors as f16? I would say yes for the token embeddings. It is left on the host in llama.cpp as well, and the computational expense of extracting the input token embeddings from it is negligible. But the output tensor? If it is not uploaded to the GPU, then the matrix multiplication that produces the final model output must be done on the CPU, and this has a significant impact on overall inference performance. Especially for the newer models with their very large vocabulary, the performance penalty may not be acceptable.

4 replies
@YangWang92

Hi @ikawrakow,

Thank you very much for participating in this discussion on the thread.

  1. VPTQ indeed does not quantize the embeddings and the final linear layer, and in fact, many quantization methods overlook this as well. I sincerely appreciate you pointing this out.

  2. VPTQ's bit per weight calculation includes both the index and centroids in the linear layer. Based on our accumulated experience, even a 0.1 bpw increase can significantly help with PPL. In the paper, we carefully calculated the bit per weight for other methods, many of which have additional overhead that affects their claimed bitwidth.

  3. Regarding PPL, honestly, I’ve grown a bit weary of making overly detailed comparisons. The measurement environment (even the GPU model or random seed) can have a significant impact on PPL. Focusing on PPL differences in the decimal places also feels quite exhausting. For our paper’s results, we made every effort to measure all results under the same random seed, device, and dataset. I’m not particularly interested in maintaining the claim of SOTA for our paper, as that in itself is tiring.
    Our method around 4 bits performs similarly to others (under a strict bit per weight comparison, rather than comparing with methods using 5 bits). I agree that at higher bit widths (>3 bits), most methods saturate, and much of the work is just fitting the data distribution. However, the 2-bit and below range is a truly interesting area.

  4. As for fine-tuning, VPTQ’s original intent was not to rely on fine-tuning, as it can introduce the risk of overfitting. We had to use fine-tuning because the methods we compared against gradually incorporated fine-tuning as well, and we needed to demonstrate VPTQ’s performance under similar conditions.

@YangWang92

Hi all,
In conclusion, I also feel a little bit tired of making such overly detailed comparisons. When everyone is focusing on a 0.1 PPL difference, many potential opportunities are actually being missed. I’d like to seek your advice, as well as that of others, on the following three suggestions:

  1. I’ve been thinking about how to differentiate new quantization methods from others in future development and research. What kind of quantization method would be widely accepted? Should I consider working on quantization for VLM (multimodal models)? I’ve noticed that this area is still largely unexplored.

  2. I plan to conduct a detailed evaluation of the differences between VPTQ and other methods in the future, not to promote VPTQ, but to understand the specific capabilities of compressed large models (e.g., 70B 2-bit) vs. (e.g., 32B 4-bit), even if they have similar PPL. Do you have any advice on this direction?

  3. Regarding the llama.cpp format itself, I am still attempting to run VPTQ on llama.cpp (though it might just be on our own fork; I understand merging into the main branch could be difficult). Currently, VPTQ stores the index in an INT32 tensor (packed) and centroids in the embedding (FP16/BF16). Based on my limited understanding of GGUF, it seems I may not need a separate format to convert to a GGUF model. However, should I actually customize a model structure? Or perhaps integrate a new dequantization operator? I would love to hear your thoughts on VPTQ.

Thank you very much, and I look forward to all of your replies.

@kinchahoy

+1 for quantization of VLM models

@YangWang92

+1 for quantization of VLM models

Got it! I will speed up my research on VLM and will promptly inform you once I have some preliminary results.


Hi @ikawrakow,

We understand that adding a new quantization data type is a very difficult decision, depending on factors such as continued support for the quantization method, the maintainability of the software system, and so on. Currently, VPTQ only uses indices packed into int32 and a lookup table in fp16/bf16 (in the embedding operator). I would like to ask, if I want to support this kind of quantization method for VPTQ in llama.cpp, even on my own fork, which approach should I take:

  1. Define a series of new models (e.g., vptq-llama3.1) using existing data types (int32, fp16), hiding model dequantization within a separate dequant op.

  2. Or define a new quant dtype (e.g., some lookup-table data structure).

Which approach would you prefer, and which one is more likely to be merged into the main branch?

Thanks!
Yang

3 replies
@ikawrakow

Hello Yang,

I no longer contribute to llama.cpp, so it is totally not up to me to decide if and how you add VPTQ to llama.cpp.

You are more than welcome to contribute VPTQ to my llama.cpp fork here. IMHO, quantization innovation currently happens there and not here (but I can of course see that you may not want to spend your time contributing to a low-profile repository). Either way, when you get started seriously with integrating VPTQ into llama.cpp, you will find that ggml, the underlying ML library, is quite opinionated when it comes to how tensor data must be organized and stored. This inevitably leads to difficulties when one wants to integrate things such as variable length/per tensor codebooks, per tensor row or per tensor scales, etc. This is one of the reasons why the i-quants here use fixed rather than per tensor codebooks. In my fork there is at least some infrastructure already in place to allow departures from ggml opinions. You can take a look at the implementation of the new quantization types IQ4_KS, IQ4_KSS, IQ2_KS, IQ1_TN and IQ2_TN, which all use per tensor row scales, to see how you can have quantized tensors that do not adhere to the strict block-wise quantization ggml rule.
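
For illustration (this is not the actual IQ4_KS/IQ2_KS layout, just a hypothetical example of what departing from the fixed-block rule means), a row-wise quantized type might store its scale once at the start of each row instead of once per block:

    #include <stdint.h>

    /* hypothetical row-wise layout: one fp16 scale for the whole row followed
     * by the packed quants; such a row is not made of equal-sized, self-contained
     * blocks, which is what breaks the usual ggml assumptions */
    typedef struct {
        uint16_t d;     /* per-row scale (fp16 bits) */
        uint8_t  qs[];  /* packed quants for the entire row */
    } row_quant_t;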

@YangWang92

Hi @ikawrakow,

Thank you for your quick reply. I don't mind working on a forked version of llama.cpp. I am seriously trying to integrate VPTQ into llama.cpp/ik_llama.cpp, regardless of whether it's a popular fork or not. I am carefully looking into the implementations of ggml and gguf, and discussing with the community has been very helpful to me.

I'll continue exploring how I should approach this. I've forked your repository here: https://github.com/VPTQ/ik_llama.cpp. I'm thinking about how to minimize the impact on llama.cpp/ik_llama.cpp when integrating VPTQ. I can continue the discussion on ik_llama.cpp regarding my thoughts on this fork. Please bear with me as I'm still experimenting.

Thanks again for your advice! Also, feel free to share more suggestions on VPTQ!
Yang

@ikawrakow

I can continue the discussion on ik_llama.cpp regarding my thoughts on this fork. Please bear with me as I'm still experimenting.

Sure, please feel free to ask any questions you may have.


Hello @ikawrakow and the rest of the readers here. I am glad to see that you now have your own fork. You have meant so much to the llama.cpp project! Your ingenious contributions were very exciting to try. What do you think about QTIP?

Sources:

https://www.reddit.com/r/LocalLLaMA/comments/1ggwrx6/new_quantization_method_qtip_quantization_with/
https://arxiv.org/pdf/2406.11235
https://huggingface.co/collections/relaxml/qtip-quantized-models-66fa253ad3186746f4b62803

Do you think this could be useful, or is it another project claiming SOTA results? Curious to hear your thoughts. It might also be interesting for your fork. I assume that people can still borrow code from your fork for llama.cpp? I am not sure why you felt the need to make your own fork in the first place, but I presume it has to do with having more control over what gets implemented?

Once again, I sincerely appreciate all of the great and innovative work you've done so far!

1 reply
@ikawrakow

I did spend some time looking into the QTIP quantization approach. I have a PR in my repository implementing some of it. See comments there.
