imatrix : use GGUF to store importance matrices #9400


Open · wants to merge 27 commits into master
Conversation

@compilade (Collaborator) commented on Sep 10, 2024:

Follow-up from ikawrakow/ik_llama.cpp#15 (reply in thread).

Using GGUF as the format for imatrix files will be useful for further experiments (e.g. with L²QER) and compatibility with existing or future GGUF tooling (e.g. GGUF previews on HuggingFace, graphical GGUF viewer(s) #6715, some kind of gguf-diff, etc.).

This addresses multiple problems with imatrix:

  • An ad-hoc format which isn't really readable by other projects (and which cannot be extended backward-compatibly, except by appending more data at the end)
  • Non-deterministic tensor order, depending on unordered_map iteration order (which makes sha256sum useless for comparing imatrix files made on the same dataset)
  • Broken behavior at small -ub (intermediate saves happen far too often)
  • Inability to use a bigger batch size than the chunk size

Summary of changes

  • Use GGUF to store imatrix data.
    • general.type is imatrix
    • no general.architecture
      • can't really know the architecture from old imatrix files.
  • store *.in_sum2 and *.counts for each tensor with imatrix data (see the reading sketch after this list).
    • *.in_sum2 are the per-channel sums of squared activations
      • Stored in F32, like before.
    • *.counts are the number of activations (also the number of tokens), useful to calculate the mean squared activations (which is what llama-quantize uses)
      • Why not simply store the mean? To allow merging imatrix files together with --in-file: merging requires adding the underlying sums and counts element-wise, which the means alone don't preserve.
      • They are stored in F32 even though they hold integer values, because when calculating the mean they would be converted to F32 anyway to perform the division.
  • Add convert_legacy_imatrix_to_gguf.py to convert old imatrix.dat files to imatrix.gguf
    • Conversion is not strictly necessary, since llama-quantize can still read the old format (with a warning); llama-imatrix can also convert old files directly, when the output file has the .gguf suffix.
  • Allow computing multiple chunks per batch with llama-imatrix, like llama-perplexity since #5946 (perplexity : support using multiple sequences to allow larger batch sizes)
    • This should be useful for huge models like Llama-405B when they don't fit completely in RAM.
  • Use fused multiply-add (with std::fma) when accumulating the sums of activations; see the sketch after this list
    • (decided against using it for now, for easier comparisons with llama-imatrix from master)
    • Shouldn't hurt to somewhat reduce rounding errors
      • (Obviously F64 would be even better, but I'm not sure it's worth it yet. For the curious, using double for the intermediate accumulations can be tried by changing a single line in IMatrixStats: vector<float> values to vector<double> values.)
  • Sort the tensor names before serializing
    • This makes the tensor order deterministic, because otherwise it depended on the iteration order of unordered_map.
      • Determinism between runs means sha256sum can be meaningfully used to compare imatrix files generated in very similar conditions.
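
For illustration, here is a minimal sketch of how another tool could read such an imatrix GGUF and recover the mean squared activations, using ggml's GGUF C API. This is not code from this PR: the tensor layout follows the description above, error handling is minimal, and only the first channel of each tensor is printed.

#include "ggml.h"
#include "gguf.h" // GGUF API (was part of ggml.h in older trees)

#include <cstdio>
#include <string>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s imatrix.gguf\n", argv[0]);
        return 1;
    }

    struct ggml_context * ctx_data = nullptr;

    struct gguf_init_params params = {
        /*.no_alloc =*/ false,
        /*.ctx      =*/ &ctx_data,
    };

    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        return 1;
    }

    const int64_t n_tensors = gguf_get_n_tensors(ctx);
    for (int64_t i = 0; i < n_tensors; i++) {
        const std::string name   = gguf_get_tensor_name(ctx, i);
        const std::string suffix = ".in_sum2";
        if (name.size() < suffix.size() ||
            name.compare(name.size() - suffix.size(), suffix.size(), suffix) != 0) {
            continue; // only visit *.in_sum2; the matching *.counts is fetched below
        }
        const std::string base = name.substr(0, name.size() - suffix.size());

        const struct ggml_tensor * sums   = ggml_get_tensor(ctx_data, name.c_str());
        const struct ggml_tensor * counts = ggml_get_tensor(ctx_data, (base + ".counts").c_str());
        if (!sums || !counts) {
            continue;
        }

        // *.in_sum2 holds per-channel sums of squared activations,
        // *.counts holds how many activations were summed;
        // their ratio is the mean squared activation
        const float * s = (const float *) sums->data;
        const float * c = (const float *) counts->data;
        printf("%s: mean_sq[0] = %g\n", base.c_str(), s[0] / c[0]);
    }

    gguf_free(ctx);
    ggml_free(ctx_data);
    return 0;
}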

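The fused multiply-add accumulation mentioned above would look roughly like this (the helper name and the vector-based layout are illustrative, not this PR's actual code):

#include <cmath>
#include <cstddef>
#include <vector>

// Accumulate per-channel sums of squared activations.
// std::fma(x, x, sum) computes x*x + sum with a single rounding step
// instead of two, slightly reducing accumulated rounding error.
static void accumulate_in_sum2(std::vector<float> & in_sum2, const float * x, size_t n) {
    for (size_t j = 0; j < n; j++) {
        in_sum2[j] = std::fma(x[j], x[j], in_sum2[j]);
    }
}
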
TODO

  • Compare old llama-quantize with an old imatrix.dat against new llama-quantize with the converted imatrix.gguf
    • Seemed to work, but might need re-testing. The resulting quantized model(s) should have the same sha256sum.
  • Test new llama-imatrix at different batch sizes
    • Same checksums with -ub 64 -b 512 and -ub 512 -b 2048 for a chunk size of 512 (-c 512)
  • Perplexity test(s) with i-quants with old llama-imatrix vs new llama-imatrix
  • Test with MoE models (perplexity with i-quants should be in the same ballpark as before)
  • Test --in-file with llama-imatrix
    • single .imatrix or .gguf imatrix (for round-trip conversions)
    • multiple (for merging)
  • (maybe) Implement a cleaner general.architecture exclusion.
    • Currently, this uses a subclass to make self.add_architecture() a no-op, but maybe general.architecture should simply be excluded when self.arch == "". Not sure how to prevent using the other self.add_* methods (in GGUFWriter), which expect self.arch to be set.
    • Or maybe the architecture should be included?
      • What about conversions from older imatrix.dat files?

@compilade added labels on Sep 10, 2024: enhancement, breaking change, refactoring, examples, python, Review Complexity : Medium
compilade and others added 2 commits on September 10, 2024:

Sums and counts tensors no longer need to be consecutive.

* imatrix : more sanity checks when loading multiple imatrix files

* imatrix : use ggml_format_name instead of std::string concatenation

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
@compilade marked this pull request as a draft on September 13, 2024.
@compilade (Author) commented:
I'm setting this to "draft" because of concerns raised by @ikawrakow in ikawrakow/ik_llama.cpp#15 (comment) and ikawrakow/ik_llama.cpp#15 (comment), mostly related to the fact that GGUF is harder to parse than imatrix.dat files.

More details near the end of ikawrakow/ik_llama.cpp#15 (reply in thread).

I'll need some days to think about how to go further with this.

@ggerganov (Member) commented:
@compilade This is a good change and I think it would be useful to bring it to a completion.

In the future, we can extend libllama with an interface for saving/loading imatrix data. This way the implementation for reading and writing the imatrix data would be localized in libllama and could be kept in sync more easily. This can be combined with the refactoring of llama_model_quantize_params to not pass C++ objects.
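
Purely as a sketch, such an interface might look like the following; none of these symbols exist in llama.h today, and all names are hypothetical:

// hypothetical libllama API for imatrix data, keeping the reader and
// writer implementations in one place (all names illustrative only)
struct llama_imatrix; // opaque handle owned by libllama

struct llama_imatrix * llama_imatrix_load(const char * fname);
bool                   llama_imatrix_save(const struct llama_imatrix * imatrix, const char * fname);
void                   llama_imatrix_free(struct llama_imatrix * imatrix);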

@compilade marked this pull request as ready for review on June 23, 2025.
Comment on lines 38 to 44
static bool str_remove_suffix(std::string & str, const std::string & suffix) {
    bool has_suffix = str_has_suffix(str, suffix);
    if (has_suffix) {
        str = str.substr(0, str.size() - suffix.size());
    }
    return has_suffix;
}
A collaborator commented:
Suggested change:

-static bool str_remove_suffix(std::string & str, const std::string & suffix) {
-    bool has_suffix = str_has_suffix(str, suffix);
+static bool str_remove_suffix(std::string & str, const std::string_view & suffix) {
+    bool has_suffix = string_ends_with(str, suffix);
     if (has_suffix) {
         str = str.substr(0, str.size() - suffix.size());
     }
     return has_suffix;
 }

This is a nice complement to string_ends_with; it should be moved to common.cpp as string_remove_suffix.

@compilade (Author) replied:
Shouldn't std::string_view be passed by value instead of by const reference in these functions?

For example, https://quuxplusone.github.io/blog/2021/11/09/pass-string-view-by-value/ suggests that std::string_view should always be passed by value.

The collaborator replied:
I did wonder about that, but didn't know enough of the details behind it; reading that, it certainly seems so, and it would be worth going over the usage of std::string_view elsewhere as well.
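
For reference, a minimal sketch of the by-value variant being discussed; string_ends_with is inlined here so the snippet is self-contained (in the tree it lives in common):

#include <string>
#include <string_view>

static bool string_ends_with(const std::string & str, std::string_view suffix) {
    return str.size() >= suffix.size() &&
           str.compare(str.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// std::string_view is just a pointer and a length; passing it by value is
// cheap and lets the compiler pass it in registers instead of through memory
static bool str_remove_suffix(std::string & str, std::string_view suffix) {
    bool has_suffix = string_ends_with(str, suffix);
    if (has_suffix) {
        str = str.substr(0, str.size() - suffix.size());
    }
    return has_suffix;
}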

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@CISC (Collaborator) left a review:
LGTM. I did some light testing with old and new formats, converting them back and forth and checking that quantization was unchanged; all seems good.

The only slightly annoying thing is the garbage output to the console (from gguf_init_from_file) when using the old format. Resolved in #14381.

Will probably benefit from testing by more people before merging, @bartowski1182 @danielhanchen ?

@JohannesGaessler (Collaborator) commented:
Thank you for working on this. I've been thinking that storing the imatrix as GGUF would be nice for investigating the use of gradients instead of activations.
