Describe the bug
From #1161:
What started as just trying to bump the version of llama_cpp_python (because of the advisory "llama-cpp-python vulnerable to Remote Code Execution by Server-Side Template Injection in Model M...") turned into realizing that our trimming code in chat no longer trims the way we'd expect it to.
When max-ctx-size goes from 25 to 55, certain prompts throw an openai.InternalServerError even though both max-ctx-sizes get rounded up to 64 in llama_cpp. Below is a breakdown of when the trimming code in chat is executed with different prompts and two small max-ctx-sizes.

Messages:
- M1 = "Hello"
- M2 = "Hello, joe was born in 2000. How old is joe"
- M3 = "Hello, I am a ci message that should not finish because I am too long for the context window, tell me about your day please?"
- M4 = "Hello, I am a ci message that should not finish because I am too long for the context window, tell me about your day please? How many tokens could you take today. Could you tell me about the time you could only take 55 tokens"
- M5 = "I was born in 2000. How old am I?"

25 max context window (ilab serve --max-ctx-size 25):
- ilab chat -qq M1 - BadRequestError: Trims message - Responds back fine
- ilab chat -qq M2 - BadRequestError: Trims message - Responds back fine
- ilab chat -qq M3 - BadRequestError: Trims message - BadRequestError: Final message too large for context size
- ilab chat -qq M4 - BadRequestError: Trims message - BadRequestError: Final message too large for context size
- ilab chat -qq M5 - BadRequestError: Trims message - Responds back fine

55 max context window (ilab serve --max-ctx-size 55):
- ilab chat -qq M1 - Responds back fine
- ilab chat -qq M2 - BadRequestError: Trims message - Responds back fine
- ilab chat -qq M3 - BadRequestError: Trims message - Responds back fine
- ilab chat -qq M4 - BadRequestError: Trims message - InternalServerError
- ilab chat -qq M5 - InternalServerError
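To make the expected flow concrete, here is a minimal sketch of the trim-and-retry pattern the chat relies on. This is not the actual ilab chat implementation; the base_url, api_key, model name, and the trim_messages helper are placeholder assumptions, and it assumes an OpenAI-compatible server started with something like ilab serve --max-ctx-size 55.

```python
# Minimal sketch (assumptions noted above), NOT the real ilab chat code.
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="no-key-required")


def trim_messages(messages):
    # Placeholder trimming strategy: drop the oldest message so the prompt shrinks.
    return messages[1:] if len(messages) > 1 else messages


def chat_with_trimming(messages, model="served-model", max_retries=3):
    for _ in range(max_retries):
        try:
            resp = client.chat.completions.create(model=model, messages=messages)
            return resp.choices[0].message.content
        except openai.BadRequestError:
            # Expected signal that the prompt is too large: trim and retry.
            messages = trim_messages(messages)
        except openai.InternalServerError:
            # Per this report, M4/M5 at --max-ctx-size 55 surface here instead of
            # raising BadRequestError, which breaks the trimming loop.
            raise
    raise RuntimeError("Final message too large for context size after trimming")


print(chat_with_trimming([{"role": "user", "content": "I was born in 2000. How old am I?"}]))
```

The point of the sketch is that the trimming loop only works if an over-long prompt consistently comes back as a BadRequestError; an InternalServerError gives the client nothing actionable to trim on.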
To Reproduce
1. Start the server with a small context window, e.g. ilab serve --max-ctx-size 25 or ilab serve --max-ctx-size 55.
2. Send one of the messages above with ilab chat -qq <message>.
3. Compare the errors raised against the breakdown in the description above.
Expected behavior
BadRequestErrors should be thrown every time a trim happens in the situation where max-ctx-size is 55 for M4 and M5. InternalServerErrors should not be thrown when trimming occurs.
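The expectation could be expressed as a test along these lines; this is only a sketch, with the server URL, api_key, and model name assumed, run against a server started with ilab serve --max-ctx-size 55.

```python
# Sketch of a test for the expected behavior: an over-long prompt should surface
# as openai.BadRequestError (so chat can trim), never as openai.InternalServerError.
import openai
import pytest

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="no-key-required")

TOO_LONG = (
    "Hello, I am a ci message that should not finish because I am too long for "
    "the context window, tell me about your day please? How many tokens could "
    "you take today. Could you tell me about the time you could only take 55 tokens"
)


def test_overlong_prompt_raises_bad_request():
    with pytest.raises(openai.BadRequestError):
        client.chat.completions.create(
            model="served-model",
            messages=[{"role": "user", "content": TOO_LONG}],
        )
```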
Device Info (please complete the following information):
- OS Version: Fedora Linux 39
- Python Version: Python 3.12.2
- InstructLab Version: ilab, version 0.14.1.dev235