vulkan: Add fusion support for RMS_NORM+MUL #14366


Merged · 12 commits merged into ggml-org:master on Jun 29, 2025

Conversation

jeffbolznv (Collaborator):

  • Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
  • Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
  • Add detection logic and basic fusion logic in ggml-vulkan.
  • Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test.

I picked rms_norm+mul to fuse because it's commonly generated in llm_graph_context::build_norm.
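As a rough illustration of the detection idea, here is a minimal C sketch. The helper name and its exact placement are assumptions for illustration; only the use_count field and the op pair come from the description above.

#include "ggml.h"

// Sketch only: an RMS_NORM can be folded into the MUL that consumes it
// when the intermediate result has no other consumers.
static bool can_fuse_rms_norm_mul(const struct ggml_tensor * norm,
                                  const struct ggml_tensor * mul) {
    if (norm->op != GGML_OP_RMS_NORM || mul->op != GGML_OP_MUL) {
        return false;
    }
    // the MUL must read the RMS_NORM output (in either operand slot)
    if (mul->src[0] != norm && mul->src[1] != norm) {
        return false;
    }
    // assumed per-tensor counter added by this PR: if anything else reads
    // the norm result, the intermediate tensor must stay materialized
    return norm->use_count == 1;
}

When the check passes, the fused path has the rms_norm shader also read the mul operand and write the mul node's output directly, and the separate mul dispatch is skipped.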

@jeffbolznv jeffbolznv requested a review from slaren June 24, 2025 20:12
@github-actions github-actions bot added labels testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning) Jun 24, 2025
@jeffbolznv jeffbolznv marked this pull request as draft June 24, 2025 20:13
// Since norm is the first operand of mul, it must be the same shape
GGML_ASSERT(ggml_are_same_shape(mul, norm));

// XXX TODO: Do we need a way to indicate that the user doesn't need the intermediate result?
jeffbolznv (Collaborator, Author):

@slaren I'd like to get your thoughts on this. How do we know that the user doesn't want to have all the intermediate results available to them? Seems like they'd need to opt in or out of fusion somehow... GGML_TENSOR_FLAG_OUTPUT seems like almost what we would want, but it's slightly different and doesn't seem to be used much?

slaren (Member):

Checking for GGML_TENSOR_FLAG_OUTPUT is enough. This flag guarantees that the tensor will not be overwritten by another computation, so it is necessary to set this flag if the user wants the output of a tensor. In cases like imatrix, the graph is evaluated one operation at a time, so fusion should never be active.
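In code terms, that amounts to one more condition on the intermediate tensor. A hedged sketch (the helper name is made up for illustration):

// The flag guarantees the tensor will not be overwritten by a later
// computation; a fused kernel would never write it at all, so fusion has
// to be skipped whenever the user asked for this output.
static bool rms_norm_result_can_be_elided(const struct ggml_tensor * norm) {
    return (norm->flags & GGML_TENSOR_FLAG_OUTPUT) == 0;
}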

slaren (Member):

Precisely because getting details like this one right is not completely obvious, it would be good to have a function in ggml similar to this one that checks if multiple operations can be fused, that all the backends can use to implement operator fusion.

jeffbolznv (Collaborator, Author):

Yeah, I can move some of the logic there, but I think eventually some of the implementation details may differ between backends. Hopefully there's a way to have some common logic and some backend-specific logic.

jeffbolznv (Collaborator, Author):

OK, I've extracted out some common logic into a helper function.

slaren (Member):

Can you also extract the remaining logic in ggml_can_fuse_rms_norm_mul to a generic ggml function that takes a number of ops? Similar to what was proposed in #14158 (comment). While it may not work in every case, it will work in many cases, and it will avoid duplicating this code in every backend.
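For context, a sketch of the shape such a generic helper could take. The name, signature, and exact checks here are assumptions, not the merged API; it only relies on the public graph accessors:

#include "ggml.h"

// Check that a run of consecutive graph nodes matches the given op sequence
// and that each intermediate result feeds only the next node in the chain.
static bool can_fuse_ops(struct ggml_cgraph * cgraph, int node_idx,
                         const enum ggml_op * ops, int num_ops) {
    if (node_idx + num_ops > ggml_graph_n_nodes(cgraph)) {
        return false;
    }
    for (int i = 0; i < num_ops; ++i) {
        struct ggml_tensor * node = ggml_graph_node(cgraph, node_idx + i);
        if (node->op != ops[i]) {
            return false;
        }
        if (i > 0) {
            struct ggml_tensor * prev = ggml_graph_node(cgraph, node_idx + i - 1);
            // the intermediate must feed the next op (either operand slot)
            // and must not be requested as a user-visible output
            if (node->src[0] != prev && node->src[1] != prev) {
                return false;
            }
            if (prev->flags & GGML_TENSOR_FLAG_OUTPUT) {
                return false;
            }
            // a use-count check (as added in this PR) is still needed to rule
            // out other consumers of the intermediate result
        }
    }
    return true;
}

A backend would then call it with, for example, { GGML_OP_RMS_NORM, GGML_OP_MUL } before dispatching the fused shader.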

jeffbolznv (Collaborator, Author):

I will, but I'm skeptical that in the long run there will be a lot of commonality.

@jeffbolznv (Collaborator, Author):

Forgot to include perf results. This helps prompt processing (pp), but there is no meaningful change to token generation (tg).

before:
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\Phi-3-mini-4k-instruct-q4.gguf -fa 1 -n 128 -p 512 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |      6213.91 ± 37.97 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        140.52 ± 0.86 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\meta-llama-3-8b-instruct.Q4_K_M.gguf -fa 1 -n 128 -p 512 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |      3600.60 ± 15.70 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         84.19 ± 0.21 |

after:
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\Phi-3-mini-4k-instruct-q4.gguf -fa 1 -n 128 -p 512 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           pp512 |      6456.74 ± 56.08 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        140.58 ± 1.12 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\meta-llama-3-8b-instruct.Q4_K_M.gguf -fa 1 -n 128 -p 512 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           pp512 |      3644.10 ± 12.41 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         84.04 ± 0.22 |

@slaren (Member) commented Jun 26, 2025:

test-thread-safety with thread sanitizer reports a data race, which is likely due to the use_count of the model weights being incremented in multiple threads simultaneously. It's not likely to cause problems for this use case, but at least the error should be suppressed.

@jeffbolznv (Collaborator, Author):

Oh, maybe I should make the use counts part of the graph rather than part of the nodes. I'll look into that.
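A rough sketch of that direction (the function name and approach are assumptions): derive the counts per graph by scanning each node's sources, so the shared weight tensors are never written and the race goes away.

#include <stdlib.h>

#include "ggml.h"

// Per-graph use counting instead of a counter stored on the tensor itself.
// Each thread only touches its own array, so model weights stay read-only.
// Returns an array indexed by node position; the caller frees it.
static int * graph_use_counts(struct ggml_cgraph * cgraph) {
    const int n = ggml_graph_n_nodes(cgraph);
    int * counts = calloc(n, sizeof(int));
    for (int i = 0; i < n; ++i) {
        struct ggml_tensor * node = ggml_graph_node(cgraph, i);
        for (int s = 0; s < GGML_MAX_SRC; ++s) {
            if (node->src[s] == NULL) {
                continue;
            }
            // find the source among the graph nodes and bump its count;
            // quadratic for clarity, a hash table is the obvious upgrade
            for (int j = 0; j < n; ++j) {
                if (ggml_graph_node(cgraph, j) == node->src[s]) {
                    counts[j]++;
                    break;
                }
            }
        }
    }
    return counts;
}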

New commits pushed:
  • fix type.
  • handle mul operands being swapped.
@jeffbolznv jeffbolznv marked this pull request as ready for review June 28, 2025 13:37
@jeffbolznv jeffbolznv requested a review from 0cc4m June 28, 2025 13:37
@0cc4m (Collaborator) left a comment:

Looks good to me, just some minor points. I do see the minor performance increase in pp on all my hardware.

@0cc4m 0cc4m merged commit bd9c981 into ggml-org:master Jun 29, 2025
48 checks passed
@CISC (Collaborator) commented Jun 29, 2025:

We are getting a suspicious failure on CUDA: https://github.com/ggml-org/ci/blob/results/llama.cpp/bd/9c981d7226107f18deb8344c3301450311bb8b/ggml-4-x86-cuda-v100/ctest_release-ctest.log#L6239

@0cc4m (Collaborator) commented Jun 29, 2025:

> We are getting a suspicious failure on CUDA: https://github.com/ggml-org/ci/blob/results/llama.cpp/bd/9c981d7226107f18deb8344c3301450311bb8b/ggml-4-x86-cuda-v100/ctest_release-ctest.log#L6239

Slightly over the error limit can just be the random test data aligning in an unlikely and unfortunate way, it happens sometimes. If it happens every run or most runs it's an actual issue.

@CISC (Collaborator) commented Jun 29, 2025:

> We are getting a suspicious failure on CUDA: https://github.com/ggml-org/ci/blob/results/llama.cpp/bd/9c981d7226107f18deb8344c3301450311bb8b/ggml-4-x86-cuda-v100/ctest_release-ctest.log#L6239

> Slightly over the error limit can just be the random test data aligning in an unlikely and unfortunate way, it happens sometimes. If it happens every run or most runs it's an actual issue.

Two in a row so far...

@CISC (Collaborator) commented Jun 29, 2025:

I see the test was just re-enabled in this PR, so it might have always been failing for CUDA?

@CISC (Collaborator) commented Jun 29, 2025:

Ok, looks like it's randomly failing locally for me, so might be a fluke.

@jeffbolznv (Collaborator, Author):

If it's failing we should probably disable the test again. But I haven't seen failures on the Vulkan backend.

@CISC (Collaborator) commented Jun 29, 2025:

It failed for the third time in a row now, probably worth looking at. @JohannesGaessler

@CISC (Collaborator) commented Jun 29, 2025:

Seems it fails on SYCL too.

cc/ @qnixsynapse

Labels: ggml (changes relating to the ggml tensor library for machine learning), testing (Everything test related), Vulkan (Issues specific to the Vulkan backend)
5 participants