Llama.embed crashes when n_batch > 512 #1762

Open
@lsorber

Description


Expected Behavior

Embedding text with a long-context model like BGE-M3 [1] should be able to output token embeddings for more than 512 tokens (this is of interest for 'late interaction' retrieval [2]).

Llama-cpp-python truncates the input to the first n_batch tokens, where n_batch defaults to 512. The expected behaviour is that setting n_batch to a larger value allows computing token embeddings for longer sequences.

[1] https://huggingface.co/BAAI/bge-m3
[2] https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/
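The truncation described above can be modelled as a one-liner (a standalone sketch of the observed behaviour, not llama-cpp-python's actual code; `embedded_token_count` is a hypothetical name):

```python
def embedded_token_count(n_tokens: int, n_batch: int = 512) -> int:
    """Number of tokens that actually receive embeddings under the
    observed truncation: everything past the first n_batch is dropped."""
    return min(n_tokens, n_batch)

# With the default n_batch, a 2000-token input yields only 512 token embeddings;
# raising n_batch should lift that limit (but currently crashes instead).
print(embedded_token_count(2000))
print(embedded_token_count(2000, n_batch=8192))
```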

Current Behavior

The kernel crashes when embedding text with n_batch > 512. The crash is not specific to this embedding model: it also occurs for the few other models I've tried.

Steps to Reproduce

On a Google Colab T4 instance:

%pip install --quiet --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122 llama-cpp-python==0.3.0

from llama_cpp import LLAMA_POOLING_TYPE_NONE, Llama

embedder = Llama.from_pretrained(
    repo_id="lm-kit/bge-m3-gguf",
    filename="*F16.gguf",
    n_ctx=0,  # Model context is 8192
    n_gpu_layers=-1,
    n_batch=513,  # ← Any value larger than 512 (the default) causes a crash
    embedding=True,
    pooling_type=LLAMA_POOLING_TYPE_NONE,
    verbose=False
)

text = "Hello world" * 1000
embedding = embedder.embed(text)  # ← Crash 💥
len(embedding)
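Until this is fixed, a possible interim workaround is to split the token sequence into windows of at most n_batch tokens and embed each window separately (a sketch; `chunk_tokens` is a hypothetical helper, not part of llama-cpp-python, and each window is embedded without attention to the others, so the result is only an approximation of full-sequence token embeddings):

```python
def chunk_tokens(tokens, n_batch=512):
    """Yield consecutive windows of at most n_batch tokens."""
    for start in range(0, len(tokens), n_batch):
        yield tokens[start : start + n_batch]

# Usage with the embedder above (not runnable without the model):
# token_embeddings = []
# for window in chunk_tokens(embedder.tokenize(text.encode())):
#     token_embeddings.extend(embedder.embed(embedder.detokenize(window).decode()))
```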

Metadata

Assignees: No one assigned
Labels: bug (Something isn't working)
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests
