imatrix : use GGUF to store importance matrices #9400


Open · wants to merge 27 commits into master
Conversation

@compilade (Collaborator) commented on Sep 10, 2024:

Follow-up from ikawrakow/ik_llama.cpp#15 (reply in thread).

Using GGUF as the format for imatrix files will be useful for further experiments (e.g. with L²QER) and compatibility with existing or future GGUF tooling (e.g. GGUF previews on HuggingFace, graphical GGUF viewer(s) #6715, some kind of gguf-diff, etc.).

This addresses multiple problems with imatrix:

  • An ad-hoc format which isn't really readable by other projects (and which cannot be extended backward-compatibly, except by appending more data at the end)
  • Non-deterministic tensor order, depending on unordered_map iteration order (which makes sha256sum useless for comparing imatrix files made on the same dataset)
  • Broken behavior at small -ub (intermediate saves happen far too often)
  • Inability to use a bigger batch size than the chunk size

Summary of changes

  • Use GGUF to store imatrix data.
    • general.type is imatrix
    • no general.architecture
      • can't really know the architecture from old imatrix files.
  • store *.in_sum2 and *.counts for each tensor with imatrix data (see the reading sketch after this list).
    • *.in_sum2 are the per-channel sums of squared activations
      • Stored in F32, like before.
    • *.counts are the number of activations (also the number of tokens), useful to calculate the mean squared activations (which is what llama-quantize uses)
      • Why not simply store the mean? To allow merging imatrix files together with --in-file: merging requires adding the underlying sums and counts element-wise, which the means alone don't preserve.
      • They are stored in F32 even though they hold integer values, because when calculating the mean they would be converted to F32 anyway to perform the division.
  • Add convert_legacy_imatrix_to_gguf.py to convert old imatrix.dat files to imatrix.gguf
    • Conversion is not strictly necessary, since llama-quantize can still read the old format (with a warning); llama-imatrix can also convert old files directly, when the output file has the .gguf suffix.
  • Allow computing multiple chunks per batch with llama-imatrix, like llama-perplexity since #5946 (perplexity : support using multiple sequences to allow larger batch sizes)
    • This should be useful for huge models like Llama-405B when they don't fit completely in RAM.
  • Use fused multiply-add (with std::fma) when accumulating the sums of activations; see the sketch after this list
    • (decided against using it for now, for easier comparisons with llama-imatrix from master)
    • Shouldn't hurt to somewhat reduce rounding errors
      • (Obviously F64 would be even better, but I'm not sure it's worth it yet. For the curious, using double for the intermediate accumulations can be tried by changing a single line in IMatrixStats: vector<float> values to vector<double> values.)
  • Sort the tensor names before serializing
    • This makes the tensor order deterministic, because otherwise it depended on the iteration order of unordered_map.
      • Determinism between runs means sha256sum can be meaningfully used to compare imatrix files generated in very similar conditions.
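
For illustration, here is a minimal sketch of how another tool could read such an imatrix GGUF and recover the mean squared activations, using ggml's GGUF C API. This is not code from this PR: the tensor layout follows the description above, error handling is minimal, and only the first channel of each tensor is printed.

#include "ggml.h"
#include "gguf.h" // GGUF API (was part of ggml.h in older trees)

#include <cstdio>
#include <string>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s imatrix.gguf\n", argv[0]);
        return 1;
    }

    struct ggml_context * ctx_data = nullptr;

    struct gguf_init_params params = {
        /*.no_alloc =*/ false,
        /*.ctx      =*/ &ctx_data,
    };

    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        return 1;
    }

    const int64_t n_tensors = gguf_get_n_tensors(ctx);
    for (int64_t i = 0; i < n_tensors; i++) {
        const std::string name   = gguf_get_tensor_name(ctx, i);
        const std::string suffix = ".in_sum2";
        if (name.size() < suffix.size() ||
            name.compare(name.size() - suffix.size(), suffix.size(), suffix) != 0) {
            continue; // only visit *.in_sum2; the matching *.counts is fetched below
        }
        const std::string base = name.substr(0, name.size() - suffix.size());

        const struct ggml_tensor * sums   = ggml_get_tensor(ctx_data, name.c_str());
        const struct ggml_tensor * counts = ggml_get_tensor(ctx_data, (base + ".counts").c_str());
        if (!sums || !counts) {
            continue;
        }

        // *.in_sum2 holds per-channel sums of squared activations,
        // *.counts holds how many activations were summed;
        // their ratio is the mean squared activation
        const float * s = (const float *) sums->data;
        const float * c = (const float *) counts->data;
        printf("%s: mean_sq[0] = %g\n", base.c_str(), s[0] / c[0]);
    }

    gguf_free(ctx);
    ggml_free(ctx_data);
    return 0;
}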

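The fused multiply-add accumulation mentioned above would look roughly like this (the helper name and the vector-based layout are illustrative, not this PR's actual code):

#include <cmath>
#include <cstddef>
#include <vector>

// Accumulate per-channel sums of squared activations.
// std::fma(x, x, sum) computes x*x + sum with a single rounding step
// instead of two, slightly reducing accumulated rounding error.
static void accumulate_in_sum2(std::vector<float> & in_sum2, const float * x, size_t n) {
    for (size_t j = 0; j < n; j++) {
        in_sum2[j] = std::fma(x[j], x[j], in_sum2[j]);
    }
}
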
TODO

  • Compare old llama-quantize with an old imatrix.dat against new llama-quantize with the converted imatrix.gguf
    • Seemed to work, but might need re-testing. The resulting quantized model(s) should have the same sha256sum.
  • Test new llama-imatrix at different batch sizes
    • Same checksums with -ub 64 -b 512 and -ub 512 -b 2048 for a chunk size of 512 (-c 512)
  • Perplexity test(s) with i-quants with old llama-imatrix vs new llama-imatrix
  • Test with MoE models (perplexity with i-quants should be in the same ballpark as before)
  • Test --in-file with llama-imatrix
    • single .imatrix or .gguf imatrix (for round-trip conversions)
    • multiple (for merging)
  • (maybe) Implement a cleaner general.architecture exclusion.
    • Currently, this uses a subclass to make self.add_architecture() a no-op, but maybe general.architecture should simply be excluded when self.arch == "". Not sure how to prevent using the other self.add_* methods (in GGUFWriter), which expect self.arch to be set.
    • Or maybe the architecture should be included?
      • What about conversions from older imatrix.dat files?

@compilade added labels on Sep 10, 2024: enhancement, breaking change, refactoring, examples, python, Review Complexity : Medium
compilade and others added 2 commits on September 10, 2024:

Sums and counts tensors no longer need to be consecutive.

* imatrix : more sanity checks when loading multiple imatrix files

* imatrix : use ggml_format_name instead of std::string concatenation

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
@compilade marked this pull request as a draft on September 13, 2024.
@compilade (Author) commented:
I'm setting this to "draft" because of concerns raised by @ikawrakow in ikawrakow/ik_llama.cpp#15 (comment) and ikawrakow/ik_llama.cpp#15 (comment), mostly related to the fact that GGUF is harder to parse than imatrix.dat files.

More details near the end of ikawrakow/ik_llama.cpp#15 (reply in thread).

I'll need some days to think about how to go further with this.

@ggerganov (Member) commented:
@compilade This is a good change and I think it would be useful to bring it to a completion.

In the future, we can extend libllama with an interface for saving/loading imatrix data. This way the implementation for reading and writing the imatrix data would be localized in libllama and could be kept in sync more easily. This can be combined with the refactoring of llama_model_quantize_params to not pass C++ objects.
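
Purely as a sketch, such an interface might look like the following; none of these symbols exist in llama.h today, and all names are hypothetical:

// hypothetical libllama API for imatrix data, keeping the reader and
// writer implementations in one place (all names illustrative only)
struct llama_imatrix; // opaque handle owned by libllama

struct llama_imatrix * llama_imatrix_load(const char * fname);
bool                   llama_imatrix_save(const struct llama_imatrix * imatrix, const char * fname);
void                   llama_imatrix_free(struct llama_imatrix * imatrix);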

@compilade marked this pull request as ready for review on June 23, 2025.
Comment on lines 38 to 44
static bool str_remove_suffix(std::string & str, const std::string & suffix) {
    bool has_suffix = str_has_suffix(str, suffix);
    if (has_suffix) {
        str = str.substr(0, str.size() - suffix.size());
    }
    return has_suffix;
}
A collaborator commented:
Suggested change:

-static bool str_remove_suffix(std::string & str, const std::string & suffix) {
-    bool has_suffix = str_has_suffix(str, suffix);
+static bool str_remove_suffix(std::string & str, const std::string_view & suffix) {
+    bool has_suffix = string_ends_with(str, suffix);
     if (has_suffix) {
         str = str.substr(0, str.size() - suffix.size());
     }
     return has_suffix;
 }

This is a nice complement to string_ends_with; it should be moved to common.cpp as string_remove_suffix.

@compilade (Author) replied:
Shouldn't std::string_view be passed by value instead of by const reference in these functions?

For example, https://quuxplusone.github.io/blog/2021/11/09/pass-string-view-by-value/ suggests that std::string_view should always be passed by value.

The collaborator replied:
I did wonder about that, but didn't know enough of the details behind it; reading that, it certainly seems so, and it would be worth going over the usage of std::string_view elsewhere as well.
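
For reference, a minimal sketch of the by-value variant being discussed; string_ends_with is inlined here so the snippet is self-contained (in the tree it lives in common):

#include <string>
#include <string_view>

static bool string_ends_with(const std::string & str, std::string_view suffix) {
    return str.size() >= suffix.size() &&
           str.compare(str.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// std::string_view is just a pointer and a length; passing it by value is
// cheap and lets the compiler pass it in registers instead of through memory
static bool str_remove_suffix(std::string & str, std::string_view suffix) {
    bool has_suffix = string_ends_with(str, suffix);
    if (has_suffix) {
        str = str.substr(0, str.size() - suffix.size());
    }
    return has_suffix;
}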

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@CISC (Collaborator) left a review:
LGTM. I did some light testing with old and new formats, converting them back and forth and checking that quantization was unchanged; all seems good.

The only slightly annoying thing is the garbage output to the console (from gguf_init_from_file) when using the old format. Resolved in #14381.

Will probably benefit from testing by more people before merging, @bartowski1182 @danielhanchen ?

@JohannesGaessler (Collaborator) commented:
Thank you for working on this. I've been thinking that storing the imatrix as GGUF would be nice for investigating the use of gradients instead of activations.
