Fix incorrect token_logprobs (due to indexing after sorting) #453

Merged: 1 commit merged into abetlen:main on Jul 8, 2023

Conversation

wu-qing-157 (Contributor)

Thanks for the wonderful Python wrapper for llama.cpp.
When using your framework, I found a problem in output['logprobs']['token_logprobs'].

For example, on commit ca11673 (the latest main branch at the time of writing), the following code

from llama_cpp import Llama
llama = Llama('.../llama.cpp/models/30B/ggml-model-q8_0.bin', logits_all=True)
print(llama('I', temperature=0, max_tokens=5, logprobs=1))

will output:

{'id': 'cmpl-caa5cef2-ae2f-4157-8323-2dc644bb6308', 'object': 'text_completion', 'created': 1688724957, 'model': '/data/yi/llama.cpp/models/30B/ggml-model-q8_0.bin', 'choices': [{'text': "'m a big fan", 'index': 0, 'logprobs': {'tokens': ["'", 'm', ' a', ' big', ' fan'], 'text_offset': [1, 2, 3, 5, 9], 'token_logprobs': [-19.985351200841283, -19.405950866756548, -7.967424092851152, -12.165255948926252, -17.411318060240337], 'top_logprobs': [{"'": -2.2789093215688228}, {'m': -0.5416520462609419}, {' a': -2.119851766190996}, {' big': -2.872671052838605}, {' fan': -0.29880118020493784}]}, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 2, 'completion_tokens': 5, 'total_tokens': 7}}

Note that the token_logprobs values are implausibly low and differ from the corresponding entries in top_logprobs, even though the two should agree.

Looking into the code, I found that this unexpected behavior is caused by indexing into the already-sorted logprobs. The problem is at L1134, inside this loop:

for token, token_str, logprobs_token in zip(
    all_tokens, all_token_strs, all_logprobs
):
    text_offsets.append(text_offset)
    text_offset += len(token_str)
    tokens.append(token_str)
    sorted_logprobs = list(
        sorted(
            zip(logprobs_token, range(len(logprobs_token))), reverse=True
        )
    )
    token_logprobs.append(sorted_logprobs[int(token)][0])
    top_logprob: Optional[Dict[str, float]] = {
        self.detokenize([i]).decode("utf-8", errors="ignore"): logprob
        for logprob, i in sorted_logprobs[:logprobs]
    }
    top_logprob.update({token_str: logprobs_token[int(token)]})
    top_logprobs.append(top_logprob)
Since sorted_logprobs is already sorted by logprob, indexing it by token id does not return that token's logprob. We should instead index the unsorted logprobs_token by token id, as is already done on L1139.
The same fix applies at L961 and L1036.
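A minimal sketch of the one-line change described above (variable names taken from the snippet; the rest of the loop is unchanged):

# Buggy: sorted_logprobs is ordered by logprob, so position int(token)
# holds the int(token)-th most likely candidate, not the logprob of the
# sampled token itself.
token_logprobs.append(sorted_logprobs[int(token)][0])

# Proposed fix: index the unsorted per-position logprobs by token id,
# the same lookup already used for top_logprob on L1139.
token_logprobs.append(logprobs_token[int(token)])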

After the fix, on commit 9e61661, the same code outputs:

{'id': 'cmpl-74ba65a7-1978-4dcd-aa43-09b35ec8361c', 'object': 'text_completion', 'created': 1688725240, 'model': '/data/yi/llama.cpp/models/30B/ggml-model-q8_0.bin', 'choices': [{'text': "'m a big fan", 'index': 0, 'logprobs': {'tokens': ["'", 'm', ' a', ' big', ' fan'], 'text_offset': [1, 2, 3, 5, 9], 'token_logprobs': [-2.2789093215688228, -0.5416520462609419, -2.119851766190996, -2.872671052838605, -0.29880118020493784], 'top_logprobs': [{"'": -2.2789093215688228}, {'m': -0.5416520462609419}, {' a': -2.119851766190996}, {' big': -2.872671052838605}, {' fan': -0.29880118020493784}]}, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 2, 'completion_tokens': 5, 'total_tokens': 7}}

Here token_logprobs matches the corresponding values in top_logprobs.
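As a quick sanity check (a sketch for illustration, not part of the patch), this consistency can be asserted directly on a completion; with temperature=0 and logprobs=1 the sampled token's logprob should appear in the corresponding top_logprobs entry:

out = llama('I', temperature=0, max_tokens=5, logprobs=1)
lp = out['choices'][0]['logprobs']
for tok, tok_logprob, top in zip(
    lp['tokens'], lp['token_logprobs'], lp['top_logprobs']
):
    # Each top_logprobs entry is a dict mapping a token string to its
    # logprob; after the fix the sampled token's value must match.
    assert top[tok] == tok_logprob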

I think #349 is also due to this bug.

abetlen (Owner) commented on Jul 8, 2023

@wu-qing-157 thanks for the catch! I'd just been testing with openplayground to visualize the logprobs; however, that only uses the top_logprobs. LGTM

@abetlen abetlen merged commit b8e0bed into abetlen:main Jul 8, 2023
antoine-lizee pushed a commit to antoine-lizee/llama-cpp-python that referenced this pull request Oct 30, 2023