Implement GGML_CPU_ALL_VARIANTS for PowerPC #14286
Can you try moving these variables to the function below? This should delay initialization until the function is called.

```diff
diff --git a/ggml/src/ggml-cpu/repack.cpp b/ggml/src/ggml-cpu/repack.cpp
index 5c6715d5c..e1f338686 100644
--- a/ggml/src/ggml-cpu/repack.cpp
+++ b/ggml/src/ggml-cpu/repack.cpp
@@ -1397,44 +1397,44 @@ template <typename BLOC_TYPE, int64_t INTER_SIZE, int64_t NB_COLS, ggml_type PAR
     }
 };
-// instance for Q4
-static const tensor_traits<block_q4_0, 4, 4, GGML_TYPE_Q8_0> q4_0_4x4_q8_0;
-static const tensor_traits<block_q4_0, 8, 4, GGML_TYPE_Q8_0> q4_0_4x8_q8_0;
-static const tensor_traits<block_q4_0, 8, 8, GGML_TYPE_Q8_0> q4_0_8x8_q8_0;
-static const tensor_traits<block_q4_K, 8, 8, GGML_TYPE_Q8_K> q4_K_8x8_q8_K;
-
-// instance for IQ4
-static const tensor_traits<block_iq4_nl, 4, 4, GGML_TYPE_Q8_0> iq4_nl_4x4_q8_0;
-
 } // namespace ggml::cpu::repack
 static const ggml::cpu::tensor_traits * ggml_repack_get_optimal_repack_type(const struct ggml_tensor * cur) {
+    // instance for Q4
+    static const ggml::cpu::repack::tensor_traits<block_q4_0, 4, 4, GGML_TYPE_Q8_0> q4_0_4x4_q8_0;
+    static const ggml::cpu::repack::tensor_traits<block_q4_0, 8, 4, GGML_TYPE_Q8_0> q4_0_4x8_q8_0;
+    static const ggml::cpu::repack::tensor_traits<block_q4_0, 8, 8, GGML_TYPE_Q8_0> q4_0_8x8_q8_0;
+    static const ggml::cpu::repack::tensor_traits<block_q4_K, 8, 8, GGML_TYPE_Q8_K> q4_K_8x8_q8_K;
+
+    // instance for IQ4
+    static const ggml::cpu::repack::tensor_traits<block_iq4_nl, 4, 4, GGML_TYPE_Q8_0> iq4_nl_4x4_q8_0;
+
     if (cur->type == GGML_TYPE_Q4_0) {
         if (ggml_cpu_has_avx2() || (ggml_cpu_has_sve() && ggml_cpu_has_matmul_int8() && ggml_cpu_get_sve_cnt() == QK8_0)) {
             if (cur->ne[1] % 8 == 0) {
-                return &ggml::cpu::repack::q4_0_8x8_q8_0;
+                return &q4_0_8x8_q8_0;
             }
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
             if (cur->ne[1] % 4 == 0) {
-                return &ggml::cpu::repack::q4_0_4x8_q8_0;
+                return &q4_0_4x8_q8_0;
             }
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
             if (cur->ne[1] % 4 == 0) {
-                return &ggml::cpu::repack::q4_0_4x4_q8_0;
+                return &q4_0_4x4_q8_0;
             }
         }
     } else if (cur->type == GGML_TYPE_Q4_K) {
         if (ggml_cpu_has_avx2()) {
             if (cur->ne[1] % 8 == 0) {
-                return &ggml::cpu::repack::q4_K_8x8_q8_K;
+                return &q4_K_8x8_q8_K;
             }
         }
     } else if (cur->type == GGML_TYPE_IQ4_NL) {
         if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
             if (cur->ne[1] % 4 == 0) {
-                return &ggml::cpu::repack::iq4_nl_4x4_q8_0;
+                return &iq4_nl_4x4_q8_0;
             }
         }
     }
```
Yes, that seems to have worked perfectly, thanks! Do you want to change this, should I file a separate PR for it, or should I just include the change in this PR?

Just include the change here.
When using `GGML_BACKEND_DL=ON`, these initializations might use instructions that are not supported by the current CPU.
OK, here are the latest results:

**What works**

On little-endian, backends are scored and loaded correctly, here from my POWER9 test machine with VSX: […]

Performance of the loaded backend was on par with the […]

**What probably works**

On the big-endian POWER8 where I worked on this, backends also loaded correctly, but the model […]

**What doesn't work**

Features only get detected for architectures that self-report, through […]

**Next steps**

I hope the above is good enough for an initial version for PowerPC. I intend to improve upon this in another PR soon where I address the platform issue in a more general way.
* mamba2-sync: (24 commits)
  - sync : ggml
  - Add `ggml_roll` (ggml/1274)
  - docs : fix the link to llama.h (ggml-org#14293)
  - CUDA: add conv_2d_transpose (ggml-org#14287)
  - lint : remove trailing whitepace (ggml-org#14304)
  - vocab : prevent tokenizer overflow (ggml-org#14301)
  - sycl: add usage of enqueue_functions extension (ggml-org#14244)
  - Implement GGML_CPU_ALL_VARIANTS for PowerPC (ggml-org#14286)
  - llama : improve sep token handling (ggml-org#14272)
  - cuda : synchronize graph capture and cublas handle destruction (ggml-org#14288)
  - ggml : fix repack work size for mul_mat_id (ggml-org#14292)
  - ggml: Update KleidiAI to v1.9.0 (ggml-org#14277)
  - model : more uniform output id handling (ggml-org#14275)
  - ubatch : new splitting logic (ggml-org#14217)
  - CUDA: add conv_2d_dw (ggml-org#14265)
  - ggml-cpu : remove unnecesary arm feature detection (ggml-org#14281)
  - gguf-py : make sentencepiece optional (ggml-org#14200)
  - server : add server parameters for draft model cache type (ggml-org#13782)
  - build : suppress gcc15 compile warnings (ggml-org#14261)
  - sycl: Cleanup codepaths in Get Rows in sycl backend (ggml-org#14215)
  - ...
This first draft follows the recent ARM approach and should technically be OK, though it is far from perfect. It introduces the platform into the backend scoring, but in a "dumb" way, scoring it like any other on/off feature. I have an improvement for this in the works, but it needs to be implemented at the same time for all architectures that build variants (x86, ARM), so that will be a follow-up PR.
However, this PowerPC build runs into `SIGILL` as soon as a backend built for a newer architecture than the current CPU supports is loaded. From the looks of it, I think this is the scenario that @slaren commented on in #14049: the compiler sees certain instructions enabled and attempts to use them during initialization at repack.cpp#1401, even if the code itself doesn't use intrinsics yet at that point. So the overall program crashes before we can even "kick" the backend out as unsupported.

If my interpretation is right, then this is a general `DL_BACKEND` issue that just happens to manifest itself with PowerPC first. I'm not yet familiar with that part of the code, so I don't see an obvious solution; if anyone has an idea, I would appreciate it. My first instinct would be to separate this part of the code from the scoring, but I would expect that to add complexity. Perhaps there is a simpler solution to the initialization above that I'm just missing.
Backtrace:
I was testing this on POWER8 big-endian and POWER9 little-endian (the porter boxes that Debian has available). The test command was `llama-cli --version`.