Hello!
I have been trying to test the new kv cache loading and ran into an issue: it seems to segfault when running llama_eval.
To save the current cache I do:
import llama_cpp
import pickle
from ctypes import cast

# Some work...

kv_tokens = llama_cpp.llama_get_kv_cache_token_count(ctx)
kv_len = llama_cpp.llama_get_kv_cache_size(ctx)
kv_cache = llama_cpp.llama_get_kv_cache(ctx)
kv_cache = cast(kv_cache, llama_cpp.POINTER(llama_cpp.c_uint8 * kv_len))
kv_cache = bytearray(kv_cache.contents)  # .contents: the bare pointer itself has no buffer to copy from

with open("test.bin", "wb") as f:
    pickle.dump([kv_cache, kv_tokens], f)
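(As a sanity check on the copy itself, the same snapshot can presumably be taken with ctypes.string_at, which reads the bytes straight off the raw pointer and skips the cast entirely; a minimal sketch I have not verified against this version:)

import ctypes

# Alternative snapshot: read kv_len bytes directly from the pointer
# returned by llama_get_kv_cache (sketch, untested).
kv_cache = bytearray(ctypes.string_at(llama_cpp.llama_get_kv_cache(ctx), kv_len))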
Loading:
with open("test.bin", "rb") as f:
    kv_cache, kv_tokens = pickle.load(f)

llama_cpp.llama_set_kv_cache(
    ctx,
    (llama_cpp.c_uint8 * len(kv_cache)).from_buffer(kv_cache),
    len(kv_cache),
    kv_tokens,
)
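(One thought: if the pickled blob's length no longer matches what the context expects, writing it back could plausibly stomp memory. A hedged sketch of that check, assuming the new context was created with the same model and parameters as the saved one:)

# Guard against a size mismatch before restoring; assumes the fresh
# context reports the same cache size as the one that was saved.
expected = llama_cpp.llama_get_kv_cache_size(ctx)
assert len(kv_cache) == expected, f"cache size mismatch: {len(kv_cache)} != {expected}"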
But running llama_cpp.llama_eval afterwards results in a segfault.
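For context, the eval call that crashes has roughly this shape (the token id and thread count below are placeholders, not my real values):

# Hypothetical follow-up evaluation after restoring the cache;
# kv_tokens is the token count loaded from the pickle.
tokens = (llama_cpp.llama_token * 1)(0)  # placeholder token id
llama_cpp.llama_eval(ctx, tokens, 1, kv_tokens, 4)  # segfaults here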
llama-cpp-python version: 0.1.16
How do I fix this?
Thanks