Implement GGML_CPU_ALL_VARIANTS for PowerPC #14286


Merged: 3 commits merged into ggml-org:master on Jun 20, 2025

Conversation

@ckastner (Collaborator) commented Jun 19, 2025

This first draft follows the recent ARM approach and should be technically sound, though far from perfect. It introduces the platform into the backend scoring, but in a "dumb" way: the platform is scored like any other on/off feature. I have an improvement for this in the works, but it needs to be implemented for all architectures that build variants (x86, ARM) at the same time, so that will be a follow-up PR.
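To make that concrete, here is a minimal sketch of on/off feature scoring; the struct, function, and weights are invented for illustration and are not the actual ggml-backend-reg code:

```cpp
#include <cstdio>

// Hypothetical feature set; the real backend tracks many more flags.
struct cpu_features {
    bool vsx;
    bool power9;    // the platform, treated as just another on/off feature
    bool power10;
};

// A variant requiring a feature the host lacks scores 0 (rejected);
// otherwise each matched feature adds a power-of-two weight, which is
// consistent with the kind of scores seen in the logs below (67, 73, ...).
static int score_variant(const cpu_features & host, const cpu_features & built) {
    int score = 1;  // base score so the featureless baseline still loads
    if (built.vsx)     { if (!host.vsx)     return 0; score += 1 << 1; }
    if (built.power9)  { if (!host.power9)  return 0; score += 1 << 2; }
    if (built.power10) { if (!host.power10) return 0; score += 1 << 3; }
    return score;
}

int main() {
    const cpu_features host = { /*vsx=*/true, /*power9=*/true, /*power10=*/false };
    const cpu_features p9   = { true, true, false };
    const cpu_features p10  = { true, true, true  };
    std::printf("power9 variant:  %d\n", score_variant(host, p9));   // highest match wins
    std::printf("power10 variant: %d\n", score_variant(host, p10));  // 0 -> rejected
}
```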

However, this PowerPC build runs into SIGILL as soon as a backend built for a newer architecture than the current CPU supports is loaded. From the looks of it, I think this is the scenario that @slaren described in #14049: the compiler sees certain instructions enabled and emits them during static initialization at repack.cpp#1401, even though the code itself doesn't use intrinsics at that point. So the program crashes before we can even "kick" the backend out as unsupported.

If my interpretation is right, then this is a general GGML_BACKEND_DL issue that just happens to manifest first on PowerPC.

I'm not yet familiar with that part of the code, so I don't see an obvious solution; if anyone has an idea, I would appreciate it. My first instinct would be to separate this part of the code from the scoring, but I would expect that to add complexity. Perhaps there is a simpler solution to the initialization above that I'm just missing.
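For readers unfamiliar with the failure mode, a contrived sketch of the pattern (an illustration, not the actual repack.cpp code): a namespace-scope static object is constructed while dlopen() is still running, so any unsupported instructions the compiler emitted for its constructor execute before the loader can score and reject the backend.

```cpp
// Build this TU with e.g. -mcpu=power10: the compiler may auto-vectorize
// the constructor with POWER10 instructions. Because the object below has
// namespace scope, its constructor runs from the shared object's
// .init_array during dlopen(), before any CPU-feature check can reject
// the backend, so an older CPU hits SIGILL right there.
struct tensor_traits_like {
    float lut[64];
    tensor_traits_like() {
        for (int i = 0; i < 64; ++i) {
            lut[i] = i * 0.0625f;  // trivial loop, but auto-vectorizable
        }
    }
};

static const tensor_traits_like load_time_instance;  // constructed at load time
```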

Backtrace:

```
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power7_2.so score: 67
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power8_1.so score: 5
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power7_1.so score: 3
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power9.so score: 73
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power8_2.so score: 69

Program received signal SIGILL, Illegal instruction.
__static_initialization_and_destruction_0 () at /home/ckk/llama.cpp-le/ggml/src/ggml-cpu/repack.cpp:1401
1401    static const tensor_traits<block_q4_0, 4, 4, GGML_TYPE_Q8_0> q4_0_4x4_q8_0;
(gdb) bt
#0  __static_initialization_and_destruction_0 () at /home/ckk/llama.cpp-le/ggml/src/ggml-cpu/repack.cpp:1401
#1  0x00007ffff520ec00 in _GLOBAL__sub_I_repack.cpp(void) () at /home/ckk/llama.cpp-le/ggml/src/ggml-cpu/repack.cpp:1555
#2  0x00007ffff7f96b0c in call_init (l=<optimized out>, argc=<optimized out>, argv=<optimized out>, env=<optimized out>)
    at dl-init.c:74
#3  _dl_init (main_map=0x1004f7250, argc=2, argv=0x7ffffffff378, env=0x7ffffffff390) at dl-init.c:121
#4  0x00007ffff7faa28c in call_dl_init (closure=<optimized out>, closure@entry=0x7ffffffde210) at dl-open.c:493
#5  0x00007ffff7f916e0 in __GI__dl_catch_exception (exception=exception@entry=0x0, operate=<optimized out>, 
    operate@entry=0x7ffff7faa260 <call_dl_init>, args=<optimized out>, args@entry=0x7ffffffde210) at dl-catch.c:215
#6  0x00007ffff7faa41c in dl_open_worker (a=0x7ffffffde210) at dl-open.c:799
#7  dl_open_worker (a=a@entry=0x7ffffffde210) at dl-open.c:750
#8  0x00007ffff7f9163c in __GI__dl_catch_exception (exception=exception@entry=0x7ffffffde258, 
    operate=operate@entry=0x7ffff7faa2f0 <dl_open_worker>, args=args@entry=0x7ffffffde210) at dl-catch.c:241
#9  0x00007ffff7fabccc in _dl_open (file=0x1004fb0d0 "/home/ckk/llama.cpp-le/build/bin/libggml-cpu-power10.so", 
    mode=-2147483646, caller_dlopen=0x7ffff7f196bc <dl_load_library(std::filesystem::__cxx11::path const&)+84>, nsid=-2, 
    argc=2, argv=0x7ffffffff378, env=0x7ffffffff390) at dl-open.c:874
#10 0x00007ffff6aa73f4 in dlopen_doit (a=a@entry=0x7ffffffde6e8) at dlopen.c:56
#11 0x00007ffff7f9163c in __GI__dl_catch_exception (exception=exception@entry=0x7ffffffde600, 
    operate=0x7ffff6aa7370 <dlopen_doit>, args=0x7ffffffde6e8) at dl-catch.c:241
#12 0x00007ffff7f917bc in _dl_catch_error (objname=objname@entry=0x7ffffffde678, errstring=errstring@entry=0x7ffffffde680, 
    mallocedp=mallocedp@entry=0x7ffffffde677, operate=<optimized out>, args=<optimized out>) at dl-catch.c:260
#13 0x00007ffff6aa6cd8 in _dlerror_run (operate=<optimized out>, operate@entry=0x7ffff6aa7370 <dlopen_doit>, 
    args=<optimized out>, args@entry=0x7ffffffde6e8) at dlerror.c:138
#14 0x00007ffff6aa7504 in dlopen_implementation (file=<optimized out>, mode=<optimized out>, dl_caller=<optimized out>)
    at dlopen.c:71
#15 ___dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:81
#16 0x00007ffff7f196bc in dl_load_library (
    path=filesystem::path "/home/ckk/llama.cpp-le/build/bin/libggml-cpu-power10.so" = {...})
    at /home/ckk/llama.cpp-le/ggml/src/ggml-backend-reg.cpp:140
#17 0x00007ffff7f1ab14 in ggml_backend_load_best (name=0x7ffff7f2c780 "cpu", silent=false, user_search_path=0x0)
    at /home/ckk/llama.cpp-le/ggml/src/ggml-backend-reg.cpp:517
#18 0x00007ffff7f1b570 in ggml_backend_load_all_from_path (dir_path=0x0)
    at /home/ckk/llama.cpp-le/ggml/src/ggml-backend-reg.cpp:580
#19 0x00007ffff7f1b400 in ggml_backend_load_all () at /home/ckk/llama.cpp-le/ggml/src/ggml-backend-reg.cpp:559
#20 0x00000001000c0308 in common_params_parser_init (params=..., ex=LLAMA_EXAMPLE_MAIN, 
    print_usage=0x1000850f8 <print_usage(int, char**)>) at /home/ckk/llama.cpp-le/common/arg.cpp:1225
#21 0x00000001000ae54c in common_params_parse (argc=2, argv=0x7ffffffff378, params=..., ex=LLAMA_EXAMPLE_MAIN, 
    print_usage=0x1000850f8 <print_usage(int, char**)>) at /home/ckk/llama.cpp-le/common/arg.cpp:1180
#22 0x0000000100085900 in main (argc=2, argv=0x7ffffffff378) at /home/ckk/llama.cpp-le/tools/main/main.cpp:89
```

I was testing this on POWER8 big-endian and POWER9 little-endian (the porter boxes that Debian has available). The test command was llama-cli --version.

@slaren (Member) commented Jun 19, 2025

Can you try moving these variables to the function below? This should delay initialization until the function is called.

```diff
diff --git a/ggml/src/ggml-cpu/repack.cpp b/ggml/src/ggml-cpu/repack.cpp
index 5c6715d5c..e1f338686 100644
--- a/ggml/src/ggml-cpu/repack.cpp
+++ b/ggml/src/ggml-cpu/repack.cpp
@@ -1397,44 +1397,44 @@ template <typename BLOC_TYPE, int64_t INTER_SIZE, int64_t NB_COLS, ggml_type PAR
     }
 };

-// instance for Q4
-static const tensor_traits<block_q4_0, 4, 4, GGML_TYPE_Q8_0> q4_0_4x4_q8_0;
-static const tensor_traits<block_q4_0, 8, 4, GGML_TYPE_Q8_0> q4_0_4x8_q8_0;
-static const tensor_traits<block_q4_0, 8, 8, GGML_TYPE_Q8_0> q4_0_8x8_q8_0;
-static const tensor_traits<block_q4_K, 8, 8, GGML_TYPE_Q8_K> q4_K_8x8_q8_K;
-
-// instance for IQ4
-static const tensor_traits<block_iq4_nl, 4, 4, GGML_TYPE_Q8_0> iq4_nl_4x4_q8_0;
-
 }  // namespace ggml::cpu::repack

 static const ggml::cpu::tensor_traits * ggml_repack_get_optimal_repack_type(const struct ggml_tensor * cur) {
+    // instance for Q4
+    static const ggml::cpu::repack::tensor_traits<block_q4_0, 4, 4, GGML_TYPE_Q8_0> q4_0_4x4_q8_0;
+    static const ggml::cpu::repack::tensor_traits<block_q4_0, 8, 4, GGML_TYPE_Q8_0> q4_0_4x8_q8_0;
+    static const ggml::cpu::repack::tensor_traits<block_q4_0, 8, 8, GGML_TYPE_Q8_0> q4_0_8x8_q8_0;
+    static const ggml::cpu::repack::tensor_traits<block_q4_K, 8, 8, GGML_TYPE_Q8_K> q4_K_8x8_q8_K;
+
+    // instance for IQ4
+    static const ggml::cpu::repack::tensor_traits<block_iq4_nl, 4, 4, GGML_TYPE_Q8_0> iq4_nl_4x4_q8_0;
+
     if (cur->type == GGML_TYPE_Q4_0) {
         if (ggml_cpu_has_avx2() || (ggml_cpu_has_sve() && ggml_cpu_has_matmul_int8() && ggml_cpu_get_sve_cnt() == QK8_0)) {
             if (cur->ne[1] % 8 == 0) {
-                return &ggml::cpu::repack::q4_0_8x8_q8_0;
+                return &q4_0_8x8_q8_0;
             }
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
             if (cur->ne[1] % 4 == 0) {
-                return &ggml::cpu::repack::q4_0_4x8_q8_0;
+                return &q4_0_4x8_q8_0;
             }
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
             if (cur->ne[1] % 4 == 0) {
-                return &ggml::cpu::repack::q4_0_4x4_q8_0;
+                return &q4_0_4x4_q8_0;
             }
         }
     } else if (cur->type == GGML_TYPE_Q4_K) {
         if (ggml_cpu_has_avx2()) {
             if (cur->ne[1] % 8 == 0) {
-                return &ggml::cpu::repack::q4_K_8x8_q8_K;
+                return &q4_K_8x8_q8_K;
             }
         }
     } else if (cur->type == GGML_TYPE_IQ4_NL) {
         if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
             if (cur->ne[1] % 4 == 0) {
-                return &ggml::cpu::repack::iq4_nl_4x4_q8_0;
+                return &iq4_nl_4x4_q8_0;
             }
         }
     }
```
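To illustrate why moving the objects helps (a standalone sketch, not ggml code): a namespace-scope static is constructed during dlopen()/program startup, whereas a function-local static is constructed lazily on the first call to the enclosing function, which C++11 also guarantees to be thread-safe. By the time the repack lookup is first called, an unsupported backend has already been rejected, so the risky constructors never run on an incompatible CPU.

```cpp
#include <cstdio>

struct Probe {
    explicit Probe(const char * where) { std::printf("constructed: %s\n", where); }
};

static Probe eager("at load time");      // constructed inside dlopen()/before main()

const Probe & lazy_probe() {
    static Probe lazy("on first call");  // constructed only if this function runs
    return lazy;
}

int main() {
    std::printf("main entered\n");       // "at load time" has already printed
    lazy_probe();                        // only now does "on first call" print
    return 0;
}
```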

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jun 19, 2025
@ckastner (Collaborator, Author)

Yes, that seems to have worked perfectly. Thanks!

Do you want to change this, or should I file a separate PR for it, or should I just include the change in this PR?

@slaren (Member) commented Jun 19, 2025

Just include the change here.

Commit pushed: "When using GGML_BACKEND_DL=ON, these initializations might use instructions that are not supported by the current CPU."
@ckastner (Collaborator, Author)
OK, here are the latest results:

What works

On little-endian, backends are scored and loaded correctly, here from my POWER9 test machine with VSX:

```
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power7_2.so score: 67
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power8_1.so score: 5
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power7_1.so score: 3
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power9.so score: 73
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power8_2.so score: 69
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power10.so score: 0
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power0.so score: 1
ggml_backend_load_best: /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power11.so score: 0
load_backend: loaded CPU backend from /home/ckk/llama.cpp-le/build/bin/libggml-cpu-power9.so
```

Performance of the loaded backend was on par with the GGML_NATIVE=ON backend.

What probably works

On the big-endian POWER8 machine where I worked on this, backends also loaded correctly, but the model ggml-model-q4_0.gguf I used failed to load. I then learned that GGUF files are endianness-sensitive, and my model was little-endian. I did find some big-endian models, but I could not test them yet. I don't see why they shouldn't work, though.

What doesn't work

Features only get detected on hosts that self-report, via getauxval(AT_PLATFORM), as powerN. The cmake code also tests for CMAKE_SYSTEM_PROCESSOR=powerpc64le, but I don't have a testbed for this; I don't even know whether that is a valid value for AT_PLATFORM, or just a cmake-ism. Either way, if the host supports VSX, at least that will get picked up.
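For reference, a minimal sketch of the self-reporting mechanism mentioned above (assumed usage on Linux/glibc, not the exact detection code in this PR): getauxval(AT_PLATFORM) returns the address of a platform string such as "power9".

```cpp
#include <sys/auxv.h>   // getauxval, AT_PLATFORM (Linux/glibc)
#include <cstdio>
#include <cstring>

int main() {
    // AT_PLATFORM's auxv value is (the address of) a NUL-terminated string.
    const char * platform = reinterpret_cast<const char *>(getauxval(AT_PLATFORM));
    if (platform != nullptr && std::strncmp(platform, "power", 5) == 0) {
        std::printf("self-reported PowerPC platform: %s\n", platform);  // e.g. "power9"
    } else {
        std::printf("platform not self-reported as powerN\n");
    }
    return 0;
}
```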

Next steps

I hope the above is good enough for an initial version for PowerPC.

I intend to improve upon this in another PR soon where I address the platform issue in a more general way.

@ckastner ckastner marked this pull request as ready for review June 19, 2025 21:05
@ckastner ckastner merged commit 6369be0 into ggml-org:master Jun 20, 2025
47 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jun 20, 2025
* mamba2-sync: (24 commits)
sync : ggml
Add `ggml_roll` (ggml/1274)
docs : fix the link to llama.h (ggml-org#14293)
CUDA: add conv_2d_transpose (ggml-org#14287)
lint : remove trailing whitepace (ggml-org#14304)
vocab : prevent tokenizer overflow (ggml-org#14301)
sycl: add usage of enqueue_functions extension (ggml-org#14244)
Implement GGML_CPU_ALL_VARIANTS for PowerPC (ggml-org#14286)
llama : improve sep token handling (ggml-org#14272)
cuda : synchronize graph capture and cublas handle destruction (ggml-org#14288)
ggml : fix repack work size for mul_mat_id (ggml-org#14292)
ggml: Update KleidiAI to v1.9.0 (ggml-org#14277)
model : more uniform output id handling (ggml-org#14275)
ubatch : new splitting logic (ggml-org#14217)
CUDA: add conv_2d_dw (ggml-org#14265)
ggml-cpu : remove unnecesary arm feature detection (ggml-org#14281)
gguf-py : make sentencepiece optional (ggml-org#14200)
server : add server parameters for draft model cache type (ggml-org#13782)
build : suppress gcc15 compile warnings (ggml-org#14261)
sycl: Cleanup codepaths in Get Rows in sycl backend (ggml-org#14215)
...
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jun 21, 2025