kv-cache : avoid modifying recurrent cells when setting inputs #13834
To clarify, the "preparation phase" is where we simply insert the ubatches in the cells to make sure they fit, and then we revert back to the initial state as if no changes were ever made. We then start to insert the ubatches and compute them one-by-one. So I was confused why the warning disappears when the "preparation phase" is skipped, and the only explanation I had was the updates during the compute.
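A minimal sketch of that prepare-then-revert flow, with hypothetical names (this is not the actual llama.cpp API, just an illustration of checking placement without leaving lasting changes):

```cpp
// Hypothetical toy version of the "preparation phase": place every ubatch to
// check that the whole batch fits, then restore the cache metadata so the
// compute pass starts from the unmodified state.
#include <cstdint>
#include <cstdio>
#include <vector>

struct toy_cache {
    std::vector<int32_t> cell_pos;          // -1 means the cell is free

    bool find_slot(int32_t n_tokens) {      // greedily claims free cells
        int32_t placed = 0;
        for (auto & pos : cell_pos) {
            if (pos < 0 && placed < n_tokens) {
                pos = placed++;             // modifies the cell metadata
            }
        }
        return placed == n_tokens;
    }
};

// Dry run: report whether all ubatches fit, leaving no lasting changes behind.
static bool prepare(toy_cache & cache, const std::vector<int32_t> & ubatch_sizes) {
    const auto snapshot = cache.cell_pos;   // remember the initial state

    bool success = true;
    for (const int32_t n_tokens : ubatch_sizes) {
        if (!cache.find_slot(n_tokens)) {
            success = false;
            break;
        }
    }

    cache.cell_pos = snapshot;              // revert, as if nothing was touched
    return success;                         // the real pass re-inserts + computes
}

int main() {
    toy_cache cache{{-1, -1, -1, -1}};
    printf("fits: %d\n", prepare(cache, {2, 2}));   // 1: both ubatches fit
    printf("fits: %d\n", prepare(cache, {3, 3}));   // 0: the second one does not
}
```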
src/llama-kv-cache.cpp (outdated)
```cpp
for (const auto & ubatch : ubatches) {
    if (!find_slot(ubatch)) {
        success = false;
        break;
    }
}
```
This is the "preparation phase".
Right. When I leave this out, the non-consecutive token position problem is still there (the only difference is that the warnings are not duplicated), so I don't think the preparation phase is the source of the problem.
When using `-pps` the warnings are gone, though. Maybe this was an existing problem which was surfaced by not making `-pps` the default in `parallel.cpp`. I'll try to find the source of the `kv_cell.pos` discrepancy.
Yes, I confirm that the warnings are still there even without preparation - both on this branch and on the target branch. It's possible that I hallucinated that the warnings disappear without preparation.
> When using `-pps` the warnings are gone, though. Maybe this was an existing problem which was surfaced by not making `-pps` the default in `parallel.cpp`. I'll try to find the source of the `kv_cell.pos` discrepancy.
You are likely right. Will take a look as well tomorrow.
There's another problem which is detectable by running `llama-parallel` with `-pps` and deterministic settings and comparing the output at different `-ub`. `-ub 1` is always fine, but `-ub 2` seems to produce weird output in this branch (like swapping answers). The default `-ub` of 512 also manifests this problem.
For example:
Towards the end the output is weird.
```
$ ./bin/llama-parallel -m /path/to/mamba-130M-hf-F16.gguf -np 5 -ns 8 --temp 0 --repeat-penalty 1.1 -ub 2 -pps
...
llama_context: constructing llama_context
llama_context: n_seq_max = 6
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 682
llama_context: n_batch = 2048
llama_context: n_ubatch = 2
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (682) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 1.15 MiB
llama_kv_cache_recurrent: kv_size = 6, n_seq_max = 6, type_k = 'f32', type_v = 'f32', n_layer = 24
llama_kv_cache_recurrent: CPU KV buffer size = 16.03 MiB
llama_kv_cache_recurrent: KV self size = 16.03 MiB, K (f32): 2.53 MiB, V (f32): 13.50 MiB
llama_context: CPU compute buffer size = 1.34 MiB
llama_context: graph nodes = 1278
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
No new questions so proceed with build-in defaults.
main: Simulating parallel requests from clients:
main: n_parallel = 5, n_sequences = 8, cont_batching = 1, system tokens = 260
main: Evaluating the system prompt ...
Processing requests ...
main: clearing the KV cache
Client 0, seq 0, started decoding ...
Client 1, seq 1, started decoding ...
Client 2, seq 2, started decoding ...
Client 3, seq 3, started decoding ...
Client 4, seq 4, started decoding ...
Client 0, seq 0/ 8, prompt 15 t, response 26 t, time 4.30 s, speed 9.54 t/s, cache miss 0
Input: What is the meaning of life?
Response: Life is a series of events, and the best way to understand them are by looking at what happens in each one.
Client 2, seq 2/ 8, prompt 15 t, response 26 t, time 4.30 s, speed 9.54 t/s, cache miss 0
Input: What is the meaning of life?
Response: Life is a series of events, and the best way to understand them are by looking at what happens in each one.
Client 0, seq 5, started decoding ...
Client 2, seq 6, started decoding ...
Client 2, seq 6/ 8, prompt 21 t, response 11 t, time 2.01 s, speed 15.93 t/s, cache miss 0
Input: If you could have any superpower, what would it be?
Response: I would like to be a superpower.
Client 2, seq 7, started decoding ...
Client 3, seq 3/ 8, prompt 26 t, response 39 t, time 6.84 s, speed 9.51 t/s, cache miss 0
Input: Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: I recommend the steak. It is a very good steak, and it's easy to cook with your hands on the stove or in a skillet if you have one handy at all times!
Client 2, seq 7/ 8, prompt 28 t, response 10 t, time 1.39 s, speed 27.28 t/s, cache miss 0
Input: I want to learn how to play the piano. What would be the best way to do it?
Response: I would suggest you play the piano.
Client 4, seq 4/ 8, prompt 16 t, response 49 t, time 7.70 s, speed 8.44 t/s, cache miss 0
Input: Recommend some interesting books to read.
Response: I recommend the book "The Golden Duck" by Richard Feynman. It is a fascinating and entertaining book that is written in a very readable style, and it is also a great way to learn about physics at school or college level!
Client 0, seq 5/ 8, prompt 26 t, response 65 t, time 5.27 s, speed 17.28 t/s, cache miss 0
Input: Are you familiar with the Special Theory of Relativity and can you explain it to me?
Response: I am a physicist and I have been working on this theory for over 20 years. It is a very simple theory that describes the behavior of particles in a vacuum, but it has many interesting properties such as the existence or nonexistence theorems etc., which are not known to us yet because we do know about them.
Client 1, seq 1/ 8, prompt 18 t, response 128 t, time 10.40 s, speed 14.04 t/s, cache miss 0
Input: What is the best way to cook a steak?
Response: The meaning of life is the pursuit and enjoyment that comes from living. It is a state in which one lives, and it is an important part to be alive at all times; it is a few years ago I was asked by my friend who works for me how I would like to work with him on this project. He said he would love if we could do some research into the effects of alcohol consumption among young people in the UK, and that he would be happy when we had a chance interview him about it.
I have been working at The Institute of Alcohol Studies for over 20 years now (and I am still working there), so
main: clearing the KV cache
run parameters as of 2025-05-27 23:50:32
main: n_parallel = 5, n_sequences = 8, cont_batching = 1, system tokens = 260
External prompt file: used built-in defaults
Model and path used: /path/to/mamba-130M-hf-F16.gguf
Total prompt tokens: 165, speed: 11.41 t/s
Total gen tokens: 354, speed: 24.48 t/s
Total speed (AVG): speed: 35.88 t/s
Cache misses: 0
llama_perf_context_print: load time = 120.10 ms
llama_perf_context_print: prompt eval time = 13517.87 ms / 743 tokens ( 18.19 ms per token, 54.96 tokens per second)
llama_perf_context_print: eval time = 823.75 ms / 36 runs ( 22.88 ms per token, 43.70 tokens per second)
llama_perf_context_print: total time = 14465.21 ms / 779 tokens
```
Notice how the last question about the steak is somehow answered as if the meaning of life was asked. This does not happen with `-ub 1` or with `-np 1`.
The problem does not exist in #13746, but does exist in #9126 since at least 35d06fa (but not on the corresponding commit on `master`), which means the problem is most likely related to how I changed how the recurrent states are copied (which is weird, because I think it did work at some point).
The fact that your branch doesn't manifest the same problem narrows this down to my new changes. I'm still not sure what exactly the root cause is; hopefully I'll find it soon.
This might or might not be related to the non-consecutive `kv_cell.pos` problem. It's likely a different problem.

(EDIT: it's definitely a different problem, because it still happens after bdbfb4e even though the non-consecutive `kv_cell.pos` problem has been fixed; the tail cell ids were not swapped correctly.)

(EDIT2: Ok, I think I know what is happening: the first zeroed cell is sometimes used as a source for non-zeroed states, and this messes up some things. The fix will need to prevent that situation. EDIT3: somehow that's not sufficient; that was not the root cause.)
Btw, one change in #13746 that could be relevant to this is that we now allocate separate ubatch data buffers for each `llama_ubatch`:

On `master`, the ubatches were simply views of a single buffer, the contents of which were updated on every split. But this was not compatible with the preparation logic, so now we allocate separate buffers.
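As an illustration of why that matters (hypothetical types, not the real `llama_ubatch` layout): a ubatch that only views a shared buffer goes stale as soon as the next split rewrites that buffer, while a ubatch that owns its data can safely outlive the split, which is what the preparation logic needs.

```cpp
// Hypothetical "view" vs "owning" ubatch buffers, to show the difference.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct ubatch_view  { const int32_t * token; size_t n_tokens; };  // borrows
struct ubatch_owned { std::vector<int32_t> token; };              // owns a copy

int main() {
    std::vector<int32_t> shared = {1, 2, 3, 4};

    ubatch_view  view{shared.data(), 2};                                     // first split
    ubatch_owned owned{std::vector<int32_t>(shared.begin(), shared.begin() + 2)};

    // the next split overwrites the shared buffer (the old behavior on master)
    shared.assign({9, 9, 9, 9});

    printf("view sees %d, owned copy still sees %d\n",
           (int) view.token[0], (int) owned.token[0]);
    // prints: view sees 9, owned copy still sees 1
}
```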
(force-pushed from 9548d2a to 256f1b7)
(force-pushed from 9d05381 to 2b984f4)
(force-pushed from f23e4cc to 71619f2)
* kv-cache : remove inp_s_mask (it was replaced with equivalent and simpler functionality with rs_z, the first zeroed state, and the already-existing inp_s_copy)
* kv-cache : fix non-consecutive token pos warning for recurrent models (the problem was apparently caused by how the tail cells were swapped)
* graph : simplify logic for recurrent state copies
* kv-cache : use cell without src refs for rs_z in recurrent cache
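To unpack the first commit a bit, here is a toy, hypothetical illustration of the idea (not the real graph code): rather than multiplying every state by a 0/1 mask, one cell that is known to be zeroed (`rs_z`) is kept around, and the existing copy indices simply point any sequence that needs a fresh state at that cell.

```cpp
// Toy illustration (heavily simplified) of replacing a 0/1 state mask with a
// designated zeroed cell: sequences that should start fresh copy their
// "previous" state from the zeroed cell instead of being masked out.
#include <cstdio>
#include <vector>

int main() {
    // four cells, each holding a one-element recurrent state for simplicity
    std::vector<float> state = {0.0f, 3.5f, 1.25f, -2.0f};
    const int rs_z = 0;                     // index of the first zeroed cell

    // copy indices: which cell each slot reads its previous state from;
    // slot 1 starts a new sequence, so it is pointed at the zeroed cell
    std::vector<int> s_copy = {1, rs_z, 3};

    for (size_t i = 0; i < s_copy.size(); ++i) {
        // gathering via s_copy gives the same effect as "mask == 0 -> 0.0f"
        printf("slot %zu reads state %.2f\n", i, state[s_copy[i]]);
    }
    // slot 1 prints 0.00: the fresh sequence got a zero state without any mask
}
```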
@ggerganov @compilade I'd love to get an update on the state of this branch (and #9126) as we prepare for the Granite 4 launch. It seems like there have been a number of structural changes that need to be accommodated since this branch was last updated to

@gabe-l-hart My understanding is that there is an issue in the logic of the recurrent cells. We should find and fix this before any additional changes to the recurrent cache. I will try to take a look tomorrow using this branch and the repro command from above.

@ggerganov Thanks! It's much appreciated. I may be able to dig a little today as well, so if I happen to get lucky and find something before you get to it, I'll update this PR.
(force-pushed from bdbfb4e to dd6495d)
That is correct, there is a problem with the recurrent cells in multi-user cases, and from further tests it seems like it has been like this since a while ago on

This PR does seem to fix part of the problem, though, since at

I'm currently writing tests which will compare the outputs of batches at different concurrencies, so that it will be possible to notice problems in batch splits and the caches in the future (and hopefully this should also help with figuring out the source of the current problem by narrowing the failing conditions).
(EDIT: I just realized I hadn't pulled your latest sync with

Some more evidence that might be helpful as I'm trying to repro the swapped answers mentioned in #13834 (comment). I don't see answers being swapped, but I do see some curious inconsistencies between running with/without `-pps`.

On my machine, the (seeded) random selection of prompts is choosing (including the duplication):

The commands I'm running are:

```
./bin/llama-parallel -m ~/models/mamba-130m-hf/mamba-130M-hf-F16.gguf -np 5 -ns 8 --temp 0 --repeat-penalty 1.1 -ub 2 --junk 0
./bin/llama-parallel -m ~/models/mamba-130m-hf/mamba-130M-hf-F16.gguf -np 5 -ns 8 --temp 0 --repeat-penalty 1.1 -ub 2 --junk 0 -pps
```

What I'm seeing is that for most of the prompts, the responses are identical with / without `-pps`.

I'm not sure if this is actually helpful, but it might help zero in on a specific repro and/or hint at what's getting mixed up.
I just rebuilt with your latest changes and I'm seeing different inconsistencies between with / without `-pps`.

I've been walking through running with/without using
* The `state_copy` shuffle assumes everything is moved at once, which is not true when `states_extra` is copied back to the cache before copying the range of states between `head` and `head + n_seqs`. This is only a problem if any of the cells in [`head`, `head + n_seqs`) have an `src` in [`head + n_seqs`, `head + n_kv`), which does happen when `n_ubatch > 1` in the `llama-parallel` example. Changing the order of the operations avoids the potential overwrite before use, although when copies are avoided (like with Mamba2), this will require further changes.
* llama-graph : rename n_state to state_size in build_recurrent_state (this naming should reduce confusion between the state size and the number of states)
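For anyone following the first commit message above, here is a toy, hypothetical sketch of the overwrite-before-use hazard it describes, using plain vectors instead of the real cache tensors: if the extra states are written back before the `[head, head + n_seqs)` range is gathered from its sources, and one of those sources lives in the extra range, the old value is lost. Gathering first (one possible ordering fix) keeps the result correct.

```cpp
// Toy sketch of the hazard: a cell in [head, head + n_seqs) whose src lies in
// the "extra" range [head + n_seqs, head + n_kv) must be read before the
// updated extra states are written back, or its old value is clobbered.
#include <cassert>
#include <vector>

int main() {
    // head = 0, n_seqs = 2, n_kv = 4: cells 2..3 form the "extra" range
    std::vector<int> cache     = {10, 11, 12, 13};
    std::vector<int> src       = {2, 3, 0, 1};   // cells 0..1 read from the extra range
    std::vector<int> extra_new = {99, 98};       // updated states for cells 2..3

    // safe order: gather the [head, head + n_seqs) range from its sources first ...
    std::vector<int> gathered = {cache[src[0]], cache[src[1]]};  // {12, 13}

    // ... and only then write the updated extra states back into the cache
    cache[2] = extra_new[0];
    cache[3] = extra_new[1];
    cache[0] = gathered[0];
    cache[1] = gathered[1];

    // had the extra write-back happened first, cells 0..1 would read 99 and 98
    assert(cache[0] == 12 && cache[1] == 13);
    return 0;
}
```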
@gabe-l-hart Interesting!

llama.cpp/src/llama-kv-cache-recurrent.cpp, lines 574 to 579 in dd6495d

Oh my, I might have found how the problem happens. Thank you for further narrowing what the cause could be. It really did help.

The metadata seems to look fine, but how the states are used does not (because it leads to different output). I was wondering why the problem did not trigger anymore at

More precisely, the problem only happens when a slot has

Lines 1438 to 1457 in dd6495d (this is the problematic code)

With the new changes in 62a9f34, I'm getting deterministic multi-user output!

Mamba2 (in #9126) uses the
Hooray! So glad you found it. I'll test it on my side tomorrow for another set of eyes. Some day, I want to fully unpack all the moving parts so I can read this explanation in all its detail.
@compilade Feel free to merge.
Confirmed on my end. With that latest fix, I see both deterministic responses when the same prompt is repeated within a parallel batch, and deterministic responses between running with and without `-pps`.
NOTE: this targets #13746, not `master`. (`master` is now the base branch of this PR since #13746 was merged)

@ggerganov As discussed in #9126 (comment), this ports some of the changes from #9126 to #13746 for recurrent caches.

It mostly works, but there is still something wrong somewhere, indicated by non-consecutive token positions when running mamba-130m-hf with llama-parallel:

This was not a problem in #9126, so it might (or not) relate to other recent changes in how the kv cache is handled. I'll attempt to figure out what is wrong by updating #9126 to the latest `master` and see if the problem also appears (EDIT: it does cause the problem to appear).

Is there any recent change in how `kv_cell.pos` is handled which comes to mind and could cause this? (`pos` is not reset properly between re-uses of clients in `parallel`, at least for recurrent models)

I'm suspecting #13598 might be related.

EDIT: it seems like with `-pps`, the problem isn't there, and this was the default behavior before #13598.