Implement GGML_CPU_ALL_VARIANTS for PowerPC #14286


Merged: 3 commits merged into ggml-org:master on Jun 20, 2025

Conversation

@ckastner (Collaborator) commented Jun 19, 2025

This first draft follows the recent ARM approach and should be technically sound, though far from perfect. It introduces the platform into the backend scoring, but in a "dumb" way: the platform is scored like any other on/off feature. I have an improvement for this in the works, but it needs to be implemented for all architectures that build variants (x86, ARM) at the same time, so that will be a follow-up PR.
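To make that concrete, here is a minimal sketch of on/off feature scoring; the struct, function, and weights are invented for illustration and are not the actual ggml-backend-reg code:

```cpp
#include <cstdio>

// Hypothetical feature set; the real backend tracks many more flags.
struct cpu_features {
    bool vsx;
    bool power9;    // the platform, treated as just another on/off feature
    bool power10;
};

// A variant requiring a feature the host lacks scores 0 (rejected);
// otherwise each matched feature adds a power-of-two weight, which is
// consistent with the kind of scores seen in the logs below (67, 73, ...).
static int score_variant(const cpu_features & host, const cpu_features & built) {
    int score = 1;  // base score so the featureless baseline still loads
    if (built.vsx)     { if (!host.vsx)     return 0; score += 1 << 1; }
    if (built.power9)  { if (!host.power9)  return 0; score += 1 << 2; }
    if (built.power10) { if (!host.power10) return 0; score += 1 << 3; }
    return score;
}

int main() {
    const cpu_features host = { /*vsx=*/true, /*power9=*/true, /*power10=*/false };
    const cpu_features p9   = { true, true, false };
    const cpu_features p10  = { true, true, true  };
    std::printf("power9 variant:  %d\n", score_variant(host, p9));   // highest match wins
    std::printf("power10 variant: %d\n", score_variant(host, p10));  // 0 -> rejected
}
```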

However, this PowerPC build runs into SIGILL as soon as a backend built for a newer architecture than the current CPU supports is loaded. From the looks of it, I think this is the scenario that @slaren described in #14049: the compiler sees certain instructions enabled and emits them during static initialization at repack.cpp#1401, even though the code itself doesn't use intrinsics at that point. So the program crashes before we can even "kick" the backend out as unsupported.

If my interpretation is right, then this is a general GGML_BACKEND_DL issue that just happens to manifest first on PowerPC.

I'm not yet familiar with that part of the code, so I don't see an obvious solution; if anyone has an idea, I would appreciate it. My first instinct would be to separate this part of the code from the scoring, but I would expect that to add complexity. Perhaps there is a simpler solution to the initialization above that I'm just missing.
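For readers unfamiliar with the failure mode, a contrived sketch of the pattern (an illustration, not the actual repack.cpp code): a namespace-scope static object is constructed while dlopen() is still running, so any unsupported instructions the compiler emitted for its constructor execute before the loader can score and reject the backend.

```cpp
// Build this TU with e.g. -mcpu=power10: the compiler may auto-vectorize
// the constructor with POWER10 instructions. Because the object below has
// namespace scope, its constructor runs from the shared object's
// .init_array during dlopen(), before any CPU-feature check can reject
// the backend, so an older CPU hits SIGILL right there.
struct tensor_traits_like {
    float lut[64];
    tensor_traits_like() {
        for (int i = 0; i < 64; ++i) {
            lut[i] = i * 0.0625f;  // trivial loop, but auto-vectorizable
        }
    }
};

static const tensor_traits_like load_time_instance;  // constructed at load time
```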

Backtrace:

```
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power7_2.so score: 67
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power8_1.so score: 5
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power7_1.so score: 3
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power9.so score: 73
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power8_2.so score: 69

Program received signal SIGILL, Illegal instruction.
__static_initialization_and_destruction_0 () at /home/ckk/llama.cpp-le/ggml/src/ggml-cpu/repack.cpp:1401
1401    static const tensor_traits<block_q4_0, 4, 4, GGML_TYPE_Q8_0> q4_0_4x4_q8_0;
(gdb) bt
#0  __static_initialization_and_destruction_0 () at /home/ckk/llama.cpp-le/ggml/src/ggml-cpu/repack.cpp:1401
#1  0x00007ffff520ec00 in _GLOBAL__sub_I_repack.cpp(void) () at /home/ckk/llama.cpp-le/ggml/src/ggml-cpu/repack.cpp:1555
#2  0x00007ffff7f96b0c in call_init (l=<optimized out>, argc=<optimized out>, argv=<optimized out>, env=<optimized out>)
    at dl-init.c:74
#3  _dl_init (main_map=0x1004f7250, argc=2, argv=0x7ffffffff378, env=0x7ffffffff390) at dl-init.c:121
#4  0x00007ffff7faa28c in call_dl_init (closure=<optimized out>, closure@entry=0x7ffffffde210) at dl-open.c:493
#5  0x00007ffff7f916e0 in __GI__dl_catch_exception (exception=exception@entry=0x0, operate=<optimized out>, 
    operate@entry=0x7ffff7faa260 <call_dl_init>, args=<optimized out>, args@entry=0x7ffffffde210) at dl-catch.c:215
#6  0x00007ffff7faa41c in dl_open_worker (a=0x7ffffffde210) at dl-open.c:799
#7  dl_open_worker (a=a@entry=0x7ffffffde210) at dl-open.c:750
#8  0x00007ffff7f9163c in __GI__dl_catch_exception (exception=exception@entry=0x7ffffffde258, 
    operate=operate@entry=0x7ffff7faa2f0 <dl_open_worker>, args=args@entry=0x7ffffffde210) at dl-catch.c:241
#9  0x00007ffff7fabccc in _dl_open (file=0x1004fb0d0 "/home/ckk/llama.cpp-le/build/bin/libggml-cpu-power10.so", 
    mode=-2147483646, caller_dlopen=0x7ffff7f196bc <dl_load_library(std::filesystem::__cxx11::path const&)+84>, nsid=-2, 
    argc=2, argv=0x7ffffffff378, env=0x7ffffffff390) at dl-open.c:874
#10 0x00007ffff6aa73f4 in dlopen_doit (a=a@entry=0x7ffffffde6e8) at dlopen.c:56
#11 0x00007ffff7f9163c in __GI__dl_catch_exception (exception=exception@entry=0x7ffffffde600, 
    operate=0x7ffff6aa7370 <dlopen_doit>, args=0x7ffffffde6e8) at dl-catch.c:241
#12 0x00007ffff7f917bc in _dl_catch_error (objname=objname@entry=0x7ffffffde678, errstring=errstring@entry=0x7ffffffde680, 
    mallocedp=mallocedp@entry=0x7ffffffde677, operate=<optimized out>, args=<optimized out>) at dl-catch.c:260
#13 0x00007ffff6aa6cd8 in _dlerror_run (operate=<optimized out>, operate@entry=0x7ffff6aa7370 <dlopen_doit>, 
    args=<optimized out>, args@entry=0x7ffffffde6e8) at dlerror.c:138
#14 0x00007ffff6aa7504 in dlopen_implementation (file=<optimized out>, mode=<optimized out>, dl_caller=<optimized out>)
    at dlopen.c:71
#15 ___dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:81
#16 0x00007ffff7f196bc in dl_load_library (
    path=filesystem::path "/home/ckk/llama.cpp-le/build/bin/libggml-cpu-power10.so" = {...})
    at /home/ckk/llama.cpp-le/ggml/src/ggml-backend-reg.cpp:140
#17 0x00007ffff7f1ab14 in ggml_backend_load_best (name=0x7ffff7f2c780 "cpu", silent=false, user_search_path=0x0)
    at /home/ckk/llama.cpp-le/ggml/src/ggml-backend-reg.cpp:517
#18 0x00007ffff7f1b570 in ggml_backend_load_all_from_path (dir_path=0x0)
    at /home/ckk/llama.cpp-le/ggml/src/ggml-backend-reg.cpp:580
#19 0x00007ffff7f1b400 in ggml_backend_load_all () at /home/ckk/llama.cpp-le/ggml/src/ggml-backend-reg.cpp:559
#20 0x00000001000c0308 in common_params_parser_init (params=..., ex=LLAMA_EXAMPLE_MAIN, 
    print_usage=0x1000850f8 <print_usage(int, char**)>) at /home/ckk/llama.cpp-le/common/arg.cpp:1225
#21 0x00000001000ae54c in common_params_parse (argc=2, argv=0x7ffffffff378, params=..., ex=LLAMA_EXAMPLE_MAIN, 
    print_usage=0x1000850f8 <print_usage(int, char**)>) at /home/ckk/llama.cpp-le/common/arg.cpp:1180
#22 0x0000000100085900 in main (argc=2, argv=0x7ffffffff378) at /home/ckk/llama.cpp-le/tools/main/main.cpp:89
```

I was testing this on POWER8 big-endian and POWER9 little-endian (the porter boxes that Debian has available). The test command was llama-cli --version.

@slaren (Member) commented Jun 19, 2025

Can you try moving these variables to the function below? This should delay initialization until the function is called.

```diff
diff --git a/ggml/src/ggml-cpu/repack.cpp b/ggml/src/ggml-cpu/repack.cpp
index 5c6715d5c..e1f338686 100644
--- a/ggml/src/ggml-cpu/repack.cpp
+++ b/ggml/src/ggml-cpu/repack.cpp
@@ -1397,44 +1397,44 @@ template <typename BLOC_TYPE, int64_t INTER_SIZE, int64_t NB_COLS, ggml_type PAR
     }
 };

-// instance for Q4
-static const tensor_traits<block_q4_0, 4, 4, GGML_TYPE_Q8_0> q4_0_4x4_q8_0;
-static const tensor_traits<block_q4_0, 8, 4, GGML_TYPE_Q8_0> q4_0_4x8_q8_0;
-static const tensor_traits<block_q4_0, 8, 8, GGML_TYPE_Q8_0> q4_0_8x8_q8_0;
-static const tensor_traits<block_q4_K, 8, 8, GGML_TYPE_Q8_K> q4_K_8x8_q8_K;
-
-// instance for IQ4
-static const tensor_traits<block_iq4_nl, 4, 4, GGML_TYPE_Q8_0> iq4_nl_4x4_q8_0;
-
 }  // namespace ggml::cpu::repack

 static const ggml::cpu::tensor_traits * ggml_repack_get_optimal_repack_type(const struct ggml_tensor * cur) {
+    // instance for Q4
+    static const ggml::cpu::repack::tensor_traits<block_q4_0, 4, 4, GGML_TYPE_Q8_0> q4_0_4x4_q8_0;
+    static const ggml::cpu::repack::tensor_traits<block_q4_0, 8, 4, GGML_TYPE_Q8_0> q4_0_4x8_q8_0;
+    static const ggml::cpu::repack::tensor_traits<block_q4_0, 8, 8, GGML_TYPE_Q8_0> q4_0_8x8_q8_0;
+    static const ggml::cpu::repack::tensor_traits<block_q4_K, 8, 8, GGML_TYPE_Q8_K> q4_K_8x8_q8_K;
+
+    // instance for IQ4
+    static const ggml::cpu::repack::tensor_traits<block_iq4_nl, 4, 4, GGML_TYPE_Q8_0> iq4_nl_4x4_q8_0;
+
     if (cur->type == GGML_TYPE_Q4_0) {
         if (ggml_cpu_has_avx2() || (ggml_cpu_has_sve() && ggml_cpu_has_matmul_int8() && ggml_cpu_get_sve_cnt() == QK8_0)) {
             if (cur->ne[1] % 8 == 0) {
-                return &ggml::cpu::repack::q4_0_8x8_q8_0;
+                return &q4_0_8x8_q8_0;
             }
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
             if (cur->ne[1] % 4 == 0) {
-                return &ggml::cpu::repack::q4_0_4x8_q8_0;
+                return &q4_0_4x8_q8_0;
             }
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
             if (cur->ne[1] % 4 == 0) {
-                return &ggml::cpu::repack::q4_0_4x4_q8_0;
+                return &q4_0_4x4_q8_0;
             }
         }
     } else if (cur->type == GGML_TYPE_Q4_K) {
         if (ggml_cpu_has_avx2()) {
             if (cur->ne[1] % 8 == 0) {
-                return &ggml::cpu::repack::q4_K_8x8_q8_K;
+                return &q4_K_8x8_q8_K;
             }
         }
     } else if (cur->type == GGML_TYPE_IQ4_NL) {
         if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
             if (cur->ne[1] % 4 == 0) {
-                return &ggml::cpu::repack::iq4_nl_4x4_q8_0;
+                return &iq4_nl_4x4_q8_0;
             }
         }
     }
```
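To illustrate why moving the objects helps (a standalone sketch, not ggml code): a namespace-scope static is constructed during dlopen()/program startup, whereas a function-local static is constructed lazily on the first call to the enclosing function, which C++11 also guarantees to be thread-safe. By the time the repack lookup is first called, an unsupported backend has already been rejected, so the risky constructors never run on an incompatible CPU.

```cpp
#include <cstdio>

struct Probe {
    explicit Probe(const char * where) { std::printf("constructed: %s\n", where); }
};

static Probe eager("at load time");      // constructed inside dlopen()/before main()

const Probe & lazy_probe() {
    static Probe lazy("on first call");  // constructed only if this function runs
    return lazy;
}

int main() {
    std::printf("main entered\n");       // "at load time" has already printed
    lazy_probe();                        // only now does "on first call" print
    return 0;
}
```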

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jun 19, 2025
@ckastner (Collaborator, Author)

Yes, that seems to have worked perfectly. Thanks!

Do you want to change this, or should I file a separate PR for it, or should I just include the change in this PR?

@slaren (Member) commented Jun 19, 2025

Just include the change here.

Commit pushed: "When using GGML_BACKEND_DL=ON, these initializations might use instructions that are not supported by the current CPU."
@ckastner (Collaborator, Author)
OK, here are the latest results:

What works

On little-endian, backends are scored and loaded correctly, here from my POWER9 test machine with VSX:

```
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power7_2.so score: 67
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power8_1.so score: 5
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power7_1.so score: 3
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power9.so score: 73
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power8_2.so score: 69
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power10.so score: 0
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power0.so score: 1
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power11.so score: 0
load_backend: loaded CPU backend from /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power9.so
```

Performance of the loaded backend was on par with the GGML_NATIVE=ON backend.

What probably works

On the big-endian POWER8 machine where I worked on this, backends also loaded correctly, but the model ggml-model-q4_0.gguf I used failed to load. I then learned that GGUF files are endianness-sensitive, and my model was little-endian. I did find some big-endian models, but I could not test them yet. I don't see why they shouldn't work, though.

What doesn't work

Features only get detected on hosts that self-report, via getauxval(AT_PLATFORM), as powerN. The cmake code also tests for CMAKE_SYSTEM_PROCESSOR=powerpc64le, but I don't have a testbed for this; I don't even know whether that is a valid value for AT_PLATFORM, or just a cmake-ism. Either way, if the host supports VSX, at least that will get picked up.
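For reference, a minimal sketch of the self-reporting mechanism mentioned above (assumed usage on Linux/glibc, not the exact detection code in this PR): getauxval(AT_PLATFORM) returns the address of a platform string such as "power9".

```cpp
#include <sys/auxv.h>   // getauxval, AT_PLATFORM (Linux/glibc)
#include <cstdio>
#include <cstring>

int main() {
    // AT_PLATFORM's auxv value is (the address of) a NUL-terminated string.
    const char * platform = reinterpret_cast<const char *>(getauxval(AT_PLATFORM));
    if (platform != nullptr && std::strncmp(platform, "power", 5) == 0) {
        std::printf("self-reported PowerPC platform: %s\n", platform);  // e.g. "power9"
    } else {
        std::printf("platform not self-reported as powerN\n");
    }
    return 0;
}
```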

Next steps

I hope the above is good enough for an initial version for PowerPC.

I intend to improve upon this in another PR soon where I address the platform issue in a more general way.

@ckastner ckastner marked this pull request as ready for review June 19, 2025 21:05
@ckastner ckastner merged commit 6369be0 into ggml-org:master Jun 20, 2025
47 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jun 20, 2025
* mamba2-sync: (24 commits)
sync : ggml
Add `ggml_roll` (ggml/1274)
docs : fix the link to llama.h (ggml-org#14293)
CUDA: add conv_2d_transpose (ggml-org#14287)
lint : remove trailing whitepace (ggml-org#14304)
vocab : prevent tokenizer overflow (ggml-org#14301)
sycl: add usage of enqueue_functions extension (ggml-org#14244)
Implement GGML_CPU_ALL_VARIANTS for PowerPC (ggml-org#14286)
llama : improve sep token handling (ggml-org#14272)
cuda : synchronize graph capture and cublas handle destruction (ggml-org#14288)
ggml : fix repack work size for mul_mat_id (ggml-org#14292)
ggml: Update KleidiAI to v1.9.0 (ggml-org#14277)
model : more uniform output id handling (ggml-org#14275)
ubatch : new splitting logic (ggml-org#14217)
CUDA: add conv_2d_dw (ggml-org#14265)
ggml-cpu : remove unnecesary arm feature detection (ggml-org#14281)
gguf-py : make sentencepiece optional (ggml-org#14200)
server : add server parameters for draft model cache type (ggml-org#13782)
build : suppress gcc15 compile warnings (ggml-org#14261)
sycl: Cleanup codepaths in Get Rows in sycl backend (ggml-org#14215)
...
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jun 21, 2025