ubatch : new splitting logic #14217

Merged: 1 commit merged into master from gg/ubatch-rework on Jun 20, 2025

Conversation

ggerganov
Member

@ggerganov ggerganov commented Jun 16, 2025

  • Remove llama_sbatch
  • llama_batch_allocr now handles ubatch splitting
  • llama_batch_allocr precomputes various index maps and guarantees the inputs are consistent
  • llama_ubatch can now iterate over unique sequence ids (a rough sketch of this pattern is included below, after the TODO)
  • Change notion of llama_ubatch.n_seqs from "number of sequences" to "number of sequence sets"
  • Enable pooling for n_tokens <= seq_id. Remove padding hack from llama-server
  • Detailed batch debug output

TODO:

  • Fix this:
    make -j && LLAMA_BATCH_DEBUG=2 ./bin/llama-mtmd-cli -hf ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q8_0 --image ~/Downloads/rects.png -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." --seed 1 -ngl 99 --temp 0.0 -c 20000 -b 1
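As a rough illustration of the unique-sequence-id iteration mentioned above (not a verbatim excerpt from this PR; the field names n_seqs_unq / seq_id_unq are assumed from the description, and llama_ubatch is an internal struct from src/llama-batch.h):

```cpp
#include <cstdint>
#include <cstdio>

#include "llama-batch.h" // internal header; defines llama_ubatch (assumed)

// Hedged sketch: the ubatch precomputes the unique sequence ids it touches,
// so per-sequence work (e.g. pooled embeddings) can loop over them directly
// instead of rescanning the per-token seq_id arrays.
static void for_each_unique_seq(const llama_ubatch & ubatch) {
    for (uint32_t s = 0; s < ubatch.n_seqs_unq; ++s) {
        const llama_seq_id seq_id = ubatch.seq_id_unq[s];

        printf("ubatch touches sequence %d\n", seq_id);
    }
}
```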

@ggerganov ggerganov marked this pull request as ready for review June 17, 2025 12:36
@ggerganov ggerganov requested a review from ngxson as a code owner June 17, 2025 12:36
@ggerganov ggerganov requested a review from compilade June 17, 2025 12:36
@compilade
Collaborator

compilade commented Jun 17, 2025

This breaks shuffled batches for equal splits.

When running test-model-random (from #14139) with this I get

Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK
init: sequence 0 does not start from the last position stored in the memory
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
get_logits_ith: invalid logits id 0, reason: batch.logits[0] != true
/path/to/llama.cpp/tests/test-model-random.cpp:841: GGML_ASSERT(out) failed

But there's also something else which did not happen before:

Comparing output for 'Llama4', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK
Error for seq_id 3 is 0.008624 at n_past=525
Error for seq_id 4 is 0.005619 at n_past=487
Error for seq_id 4 is 0.133501 at n_past=590
Comparing output for 'Llama4', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=1: (40%) FAILED
Error for seq_id 3 is 0.008624 at n_past=525
Error for seq_id 4 is 0.005619 at n_past=487
Error for seq_id 4 is 0.133501 at n_past=590
Comparing output for 'Llama4', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=2: (40%) FAILED
Error for seq_id 3 is 0.008624 at n_past=525
Error for seq_id 4 is 0.005619 at n_past=487
Error for seq_id 4 is 0.133501 at n_past=590
Comparing output for 'Llama4', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=512: (40%) FAILED

Seems like multiple sequences with chunked SWA have some inconsistency.

@ggerganov
Member Author

But there's also something else which did not happen before:

Does it trigger consistently? It passes on my end:

~/development/github/llama.cpp/build (gg/ubatch-rework)
$ git show
commit c0df4490c4d6e04ec8e2421fdba2655cbc3d5b44 (HEAD -> gg/ubatch-rework)
Merge: cc7952b42 04b8f5143
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Tue Jun 17 18:44:41 2025 +0300
    Merge remote-tracking branch 'origin/compilade/test-model-random' into gg/ubatch-rework

~/development/github/llama.cpp/build (gg/ubatch-rework)
$ git diff
diff --git a/tests/test-model-random.cpp b/tests/test-model-random.cpp
index 218cfcb82..b5c1d7248 100644
--- a/tests/test-model-random.cpp
+++ b/tests/test-model-random.cpp
@@ -1004,7 +1004,7 @@ int main(int argc, char ** argv) {
                     llama_free(ref_ctx);
                 }
 
-                for (bool shuffle : { false, true }) {
+                for (bool shuffle : { false, }) {
 
                     // skip shuffling the batch for non-recurrent models
// (simple splits don't handle shuffled batches correctly)

~/development/github/llama.cpp/build (gg/ubatch-rework)
$ a=$(make -j > /dev/null) && ./bin/test-model-random
..............
Comparing output for 'Llama2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK
Comparing output for 'Llama2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK
.............................
Comparing output for 'Llama4', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK
Comparing output for 'Llama4', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK
................
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK
............
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK

@compilade
Collaborator

compilade commented Jun 17, 2025

Does it trigger consistently?

It does on a Pixel 9 Pro in Termux. But it seems like this might not be a regression from here since it also happens in #14139 (sorry, I didn't test that branch on this hardware before).

-- ARM feature DOTPROD enabled
-- ARM feature SVE enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native+dotprod+i8mm+sve+nosme

Reproducing commands:

$ git switch compilade/test-model-random
$ mkdir build
$ cd build
$ cmake .. --fresh
$ make -j6 test-model-random
$ ./bin/test-model-random

So it's not a problem caused by this PR, sorry for misreporting.

(The shuffled batch regression however, is)

@ggerganov
Member Author

Yes, I can reproduce it on my Mac as well when I disable Metal or force ngl = 0, so it's very likely a bug in one of the CPU kernels.

@ggerganov
Member Author

ggerganov commented Jun 17, 2025

My best guess is that the summation here overflows FP16:

#if defined(GGML_SIMD)
    const int np = (n & ~(GGML_F16_STEP - 1));

    GGML_F16_VEC sum[GGML_F16_ARR] = { GGML_F16_VEC_ZERO };

    GGML_F16_VEC ax[GGML_F16_ARR];
    GGML_F16_VEC ay[GGML_F16_ARR];

    for (int i = 0; i < np; i += GGML_F16_STEP) {
        for (int j = 0; j < GGML_F16_ARR; j++) {
            ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
            ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);

            sum[j] = GGML_F16_VEC_FMA(sum[j], ax[j], ay[j]);
        }
    }

    // reduce sum0..sum3 to sum0
    GGML_F16_VEC_REDUCE(sumf, sum);
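For reference, here is a standalone sketch of the suspected failure mode (not llama.cpp code; it assumes a compiler and target with the _Float16 extension, e.g. recent GCC or Clang): a long dot product accumulated in FP16 can exceed the FP16 maximum of about 65504, while an F32 accumulator keeps the exact result.

```cpp
#include <cstdio>

int main() {
    _Float16 sum16 = (_Float16) 0.0f;
    float    sum32 = 0.0f;

    // 2048 products of 8.0 * 8.0 = 64.0 each; the true sum is 131072,
    // which is far beyond the largest finite FP16 value (~65504).
    for (int i = 0; i < 2048; ++i) {
        const _Float16 x = (_Float16) 8.0f;
        const _Float16 y = (_Float16) 8.0f;

        sum16 += x * y;                 // overflows to +inf partway through
        sum32 += (float) x * (float) y; // stays exact in F32
    }

    printf("fp16 accumulator: %f\n", (double) sum16); // inf
    printf("fp32 accumulator: %f\n", (double) sum32); // 131072
    return 0;
}
```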

Applying this patch to make the accumulation use F32 (via the leftovers loop) fixes the issue:

diff --git a/ggml/src/ggml-cpu/vec.cpp b/ggml/src/ggml-cpu/vec.cpp
index f7614568e..03044d382 100644
--- a/ggml/src/ggml-cpu/vec.cpp
+++ b/ggml/src/ggml-cpu/vec.cpp
@@ -198,7 +198,7 @@ void ggml_vec_dot_f16(int n, float * GGML_RESTRICT s, size_t bs, ggml_fp16_t * G
     ggml_float sumf = 0.0;
 
 #if defined(GGML_SIMD)
-    const int np = (n & ~(GGML_F16_STEP - 1));
+    const int np = 0;
 
     GGML_F16_VEC sum[GGML_F16_ARR] = { GGML_F16_VEC_ZERO };
 

The best fix for now is probably to set the KV cache types that the test uses to F32 - this also works.

This breaks shuffled batches for equal splits.

I'll take a look at whether this can be handled cleanly, but I'm wondering if this use case is really needed. Do you have any specific applications in mind that require shuffled positions in the input batch?

@compilade
Collaborator

The best fix for now is probably to set the KV cache types that the test uses to F32 - this also works.

That's very likely what I'll end up doing, thanks. (although it's less representative of actual use)

Do you have any specific applications in mind that require shuffled positions in the input batch?

The main benefit is that it makes it really easy to test that sequence aggregation works correctly for proper splitting. If it works with shuffled batches, then it can work with pretty much anything.

For an actual use case, I'm not really sure.

I'll see how the test can be changed to not affect the relative order within the sequences but still shuffle the relative order of tokens of different sequences. This makes the test a bit harder to implement, though it would be more representative of the expected possible batch orderings (and should probably make the test viable for simple splits too).
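For what it's worth, a hedged sketch of one way such a shuffle could look (a standalone helper, not part of test-model-random): interleave the tokens of different sequences in random order while keeping each sequence's internal order intact.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Interleave tokens of different sequences in random order while preserving
// each sequence's internal order. tokens[s] holds the tokens of sequence s.
static std::vector<int> interleave_preserving_order(const std::vector<std::vector<int>> & tokens, std::mt19937 & rng) {
    std::vector<int> out;
    std::vector<size_t> pos(tokens.size(), 0);

    std::vector<size_t> alive; // sequences that still have tokens left
    for (size_t s = 0; s < tokens.size(); ++s) {
        if (!tokens[s].empty()) {
            alive.push_back(s);
        }
    }

    while (!alive.empty()) {
        std::uniform_int_distribution<size_t> pick(0, alive.size() - 1);

        const size_t k = pick(rng);
        const size_t s = alive[k];

        out.push_back(tokens[s][pos[s]++]);

        if (pos[s] == tokens[s].size()) {
            alive.erase(alive.begin() + k);
        }
    }

    return out;
}
```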

@ggerganov
Member Author

I'll see how the test can be changed to not affect the relative order within the sequences but still shuffle the relative order of tokens of different sequences.

Ok, that would be useful. Regarding the fully shuffled batches, I will add checks for such inputs and raise an error.

return ubatch_add(idxs, idxs.size(), false);
}

llama_ubatch llama_batch_allocr::split_equal(uint32_t n_ubatch) {
Collaborator

@compilade compilade Jun 18, 2025


Note that for equal splits, some sequence sets are not compatible (i.e. they can't be put in the same ubatch). For example, a sequence set containing multiple seq_ids cannot be mixed with one having a seq_id in the multi-sequence set.

For example, tokens with seq_ids = { 0, 1, 2, 3 } are not compatible with tokens in seq_ids = { 1 }.

The reason is that the recurrent states are only copied to the target sequences on ubatch boundaries, so dependent tokens cannot be mixed with a shared trunk.

Is this handled here?

Basically the main constraint to check would be that the sequence sets in a ubatch are independent (at least, I think that would be sufficient?).

(Before this PR, it was handled by splitting multi-sequence token groups on their own before the single-sequence tokens)

(I did not implement multi-sequence tests yet in #14139, but that should also be able to answer this question once implemented)

Member Author


For example, a sequence set containing multiple seq_ids cannot be mixed with one having a seq_id in the multi-sequence set.

Yes, this logic here at the beginning of the function determines the unique non-overlapping sequence sets that will be contained in this ubatch:

    // determine the non-overlapping sequence sets participating in this ubatch
    for (int32_t i = 0; i < batch.n_tokens; ++i) {
        if (used[i]) {
            continue;
        }

        bool add = true;

        for (uint32_t s = 0; s < cur_seq_set.size(); ++s) {
            // no overlap with existing sequence sets:
            if (!(cur_seq_set[s] & seq_set[i]).none()) {
                add = false;
                break;
            }
        }

        if (add) {
            cur_seq_set.push_back(seq_set[i]);

            if (cur_seq_set.size() > n_ubatch) {
                break;
            }
        }
    }

    const uint32_t n_seqs = cur_seq_set.size();
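A minimal illustration of how this overlap check plays out on the example above (a standalone sketch with an assumed bitset width, not llama.cpp code):

```cpp
#include <bitset>
#include <cstdio>

int main() {
    std::bitset<64> multi;  // stands in for the seq_set of a shared-trunk token, seq_ids = { 0, 1, 2, 3 }
    multi.set(0); multi.set(1); multi.set(2); multi.set(3);

    std::bitset<64> single; // a token that belongs only to sequence 1
    single.set(1);

    // same check as above: any shared bit means the sets overlap
    const bool overlap = !(multi & single).none();

    printf("overlap: %s\n", overlap ? "yes -> cannot share this ubatch" : "no -> can share this ubatch");
    return 0;
}
```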

@ggerganov
Member Author

@compilade FYI, the tentative plan is to first merge #13979 and then merge this PR (unless you spot some more issues). ETA probably tomorrow.

status(LLAMA_MEMORY_STATUS_SUCCESS) {
state_attn(new llama_kv_cache_unified_state(mem->get_mem_attn(), std::move(heads_attn), this->ubatches)),
state_recr(new llama_memory_recurrent_state(mem->get_mem_recr(), this->ubatches)),
status(llama_memory_status_combine(state_attn->get_status(), state_recr->get_status())) {
Member Author


@gabe-l-hart This status combine was missing in #13979 - added here.

@ggerganov
Member Author

Merging #14139 and running test-model-random produces no errors:

..............
Comparing output for 'Llama2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK (max err: 0)
Comparing output for 'Llama2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK (max err: 0)
Comparing output for 'Llama2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK (max err: 6.5e-08)
Comparing output for 'Llama2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK (max err: 0)
Comparing output for 'Llama2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK (max err: 0)
Comparing output for 'Llama2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK (max err: 9.9e-07)
Comparing output for 'Llama2', with shuffle=1, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK (max err: 0)
Comparing output for 'Llama2', with shuffle=1, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK (max err: 0)
Comparing output for 'Llama2', with shuffle=1, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK (max err: 9.9e-07)
Comparing output for 'Llama2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK (max err: 2.5e-14)
Comparing output for 'Llama2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK (max err: 2.5e-14)
Comparing output for 'Llama2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK (max err: 4.1e-06)
Comparing output for 'Llama2', with shuffle=1, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK (max err: 0)
Comparing output for 'Llama2', with shuffle=1, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK (max err: 2.5e-14)
Comparing output for 'Llama2', with shuffle=1, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK (max err: 4.1e-06)
.............................
Comparing output for 'Llama4', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK (max err: 0)
Comparing output for 'Llama4', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK (max err: 6.2e-13)
Comparing output for 'Llama4', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK (max err: 7.7e-07)
Comparing output for 'Llama4', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK (max err: 8.5e-11)
Comparing output for 'Llama4', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK (max err: 2.2e-10)
Comparing output for 'Llama4', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK (max err: 2.2e-06)
Comparing output for 'Llama4', with shuffle=1, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK (max err: 7.7e-11)
Comparing output for 'Llama4', with shuffle=1, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK (max err: 8.4e-11)
Comparing output for 'Llama4', with shuffle=1, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK (max err: 2.2e-06)
Comparing output for 'Llama4', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK (max err: 1.6e-10)
Comparing output for 'Llama4', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK (max err: 1.2e-10)
Comparing output for 'Llama4', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK (max err: 7.1e-05)
Comparing output for 'Llama4', with shuffle=1, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK (max err: 2e-10)
Comparing output for 'Llama4', with shuffle=1, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK (max err: 1.2e-10)
Comparing output for 'Llama4', with shuffle=1, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK (max err: 7.1e-05)
................
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK (max err: 0)
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK (max err: 8.1e-14)
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK (max err: 2e-08)
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK (max err: 7e-13)
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK (max err: 4.1e-13)
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK (max err: 4e-08)
Comparing output for 'Gemma2', with shuffle=1, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK (max err: 7.8e-13)
Comparing output for 'Gemma2', with shuffle=1, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK (max err: 4.8e-13)
Comparing output for 'Gemma2', with shuffle=1, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK (max err: 4e-08)
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK (max err: 8.6e-13)
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK (max err: 1e-12)
Comparing output for 'Gemma2', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK (max err: 5.3e-07)
Comparing output for 'Gemma2', with shuffle=1, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK (max err: 9e-13)
Comparing output for 'Gemma2', with shuffle=1, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK (max err: 7.6e-13)
Comparing output for 'Gemma2', with shuffle=1, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK (max err: 5.3e-07)
............
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=1: OK (max err: 0)
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=2: OK (max err: 0)
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=643, n_ubatch=512: OK (max err: 0)
Comparing output for 'Mamba', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK (max err: 0)
Comparing output for 'Mamba', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK (max err: 0)
Comparing output for 'Mamba', with shuffle=0, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK (max err: 0)
Comparing output for 'Mamba', with shuffle=1, n_seq_max=2, n_ctx=1286, n_ubatch=1: OK (max err: 0)
Comparing output for 'Mamba', with shuffle=1, n_seq_max=2, n_ctx=1286, n_ubatch=2: OK (max err: 0)
Comparing output for 'Mamba', with shuffle=1, n_seq_max=2, n_ctx=1286, n_ubatch=512: OK (max err: 0)
Comparing output for 'Mamba', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK (max err: 0)
Comparing output for 'Mamba', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK (max err: 0)
Comparing output for 'Mamba', with shuffle=0, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK (max err: 0)
Comparing output for 'Mamba', with shuffle=1, n_seq_max=5, n_ctx=3215, n_ubatch=1: OK (max err: 0)
Comparing output for 'Mamba', with shuffle=1, n_seq_max=5, n_ctx=3215, n_ubatch=2: OK (max err: 0)
Comparing output for 'Mamba', with shuffle=1, n_seq_max=5, n_ctx=3215, n_ubatch=512: OK (max err: 0)

@ggerganov ggerganov merged commit 4c9fdfb into master Jun 20, 2025
55 checks passed
@ggerganov ggerganov deleted the gg/ubatch-rework branch June 20, 2025 07:14
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jun 20, 2025
* mamba2-sync: (24 commits)
sync : ggml
Add `ggml_roll` (ggml/1274)
docs : fix the link to llama.h (ggml-org#14293)
CUDA: add conv_2d_transpose (ggml-org#14287)
lint : remove trailing whitepace (ggml-org#14304)
vocab : prevent tokenizer overflow (ggml-org#14301)
sycl: add usage of enqueue_functions extension (ggml-org#14244)
Implement GGML_CPU_ALL_VARIANTS for PowerPC (ggml-org#14286)
llama : improve sep token handling (ggml-org#14272)
cuda : synchronize graph capture and cublas handle destruction (ggml-org#14288)
ggml : fix repack work size for mul_mat_id (ggml-org#14292)
ggml: Update KleidiAI to v1.9.0 (ggml-org#14277)
model : more uniform output id handling (ggml-org#14275)
ubatch : new splitting logic (ggml-org#14217)
CUDA: add conv_2d_dw (ggml-org#14265)
ggml-cpu : remove unnecesary arm feature detection (ggml-org#14281)
gguf-py : make sentencepiece optional (ggml-org#14200)
server : add server parameters for draft model cache type (ggml-org#13782)
build : suppress gcc15 compile warnings (ggml-org#14261)
sycl: Cleanup codepaths in Get Rows in sycl backend (ggml-org#14215)
...
@rotemdan

Over the past few days there seems to be a new issue or behavior (see #14298) causing calls to llama_decode to fail with:

sequence 0 does not start from the last position stored in the memory

Based on my investigation so far, this happens when llama_decode is called with a batch that intersects the sequence previously stored in the context, meaning it effectively rewrites a part of it.

If I call llama_kv_self_seq_rm to truncate the sequence up to the start position of the new batch before calling llama_decode, the error no longer occurs.

The thing is, this error never happened until a few days ago, and users are reporting it on the server as well (possibly as part of context caching), so it may not be an intentional behavioral change.

It seems possible that it's related to this pull request?

@ggerganov
Member Author

Are you using the stock llama-server or a modified version?

Currently, the design is that the caller manually removes the intersecting tokens, as you noticed, before calling llama_decode. llama-server should be handling this, so if you observe the error with the vanilla version of the server, it might indicate a bug. If so, please provide repro steps.
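For reference, a minimal sketch of that pattern using the public C API (a hypothetical helper, not llama-server code; it assumes batch.pos is populated by the caller):

```cpp
#include "llama.h"

// Before decoding a batch whose positions overlap what is already stored for
// seq_id, remove the overlapping range so the sequence continues from the
// batch's first position.
static int32_t decode_rewriting_tail(llama_context * ctx, const llama_batch & batch, llama_seq_id seq_id) {
    if (batch.n_tokens > 0 && batch.pos != nullptr) {
        // drop everything from the first position of the new batch to the end of the stored sequence
        llama_kv_self_seq_rm(ctx, seq_id, batch.pos[0], -1);
    }

    return llama_decode(ctx, batch);
}
```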

@rotemdan

rotemdan commented Jun 22, 2025

#14298 contains a report for the server (not by myself). I am not using the server, and I did not test it for this issue. I'm using the C API directly.

I'm pretty sure that up until about 1-2 days ago I never got an error when calling llama_decode on a batch that intersected the current sequence. I previously assumed there was implicit truncation in that case and that this was the intended behavior. It seemed to work correctly, even if it was possibly unintended; maybe some of the recent changes affected that?
