feat: support llama-cpp-python v0.3.2 #2825
cdoern merged 1 commit into instructlab:main
Conversation
Force-pushed 67880cf to e1a4a14
Force-pushed e1a4a14 to 31f9a46
Force-pushed f6b726e to 565ce5e
Force-pushed 565ce5e to 5e77c56
Force-pushed 5e77c56 to e782906
Update here: as of llama_cpp_python 0.3.z we need to keep track of the max_ctx_size being used by the active server, and then make sure, before we pass the message list to the OpenAI endpoint, that we remove the most recent message if the length of the content in the list is greater than max_ctx_size.
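A minimal sketch of that trimming step, assuming a plain list of chat messages and a character-based length check (the function name and the use of characters rather than tokens are assumptions, not the actual instructlab code):

```python
# Hypothetical sketch of the trimming described above -- not the actual
# instructlab implementation. Drops the most recent messages until the
# combined content length fits under max_ctx_size (counted in characters
# here for simplicity; a real implementation might count tokens instead).
def trim_messages(messages: list[dict], max_ctx_size: int) -> list[dict]:
    trimmed = list(messages)
    while trimmed and sum(len(m.get("content", "")) for m in trimmed) > max_ctx_size:
        trimmed.pop()  # remove the most recent message
    return trimmed


# Example: with a tiny context budget, only the short opening message survives.
history = [
    {"role": "user", "content": "short question"},
    {"role": "assistant", "content": "a very long answer " * 50},
]
print(trim_messages(history, max_ctx_size=40))
```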
Force-pushed e782906 to e50f7e0
Force-pushed e50f7e0 to 283d846
Force-pushed 283d846 to 380bc82
e2e workflow failed on this PR: View run, please investigate.
We'll need #2863 to merge to use the large test here.
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
e2e workflow failed on this PR: View run, please investigate.
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
llama-cpp-python 0.3.5 has a known issue (abetlen/llama-cpp-python#1861); version 0.3.2 has granite 3.0 support and does not have this issue, so bump to that version.

This required some additions to how we handle chat exceptions. As of these newer 0.3.z llama-cpp-python versions, a bad request causes the server to die. This requires us to know the max_ctx_size of the server before sending a completions request, so that we can keep the existing behavior of trimming messages until the request fits.

In order to do this, the config now contains a `current_max_ctx_size` field that we update when spinning up a server. In the case that a user implicitly starts a llama-cpp-python server when calling `ilab model chat`, we set max_tokens to the current `max_ctx_size` in the serve config.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
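A rough illustration of how that field might be wired up, assuming a dataclass-shaped serve config (only the `current_max_ctx_size` field name comes from the description above; the class and helper functions are hypothetical):

```python
# Rough illustration only -- the class and helpers below are hypothetical;
# only the field name current_max_ctx_size comes from the description above.
from dataclasses import dataclass


@dataclass
class ServeConfig:
    # Context size requested in the serve section of the config file.
    max_ctx_size: int = 4096
    # Context size of the server that is actually running; updated
    # whenever a llama-cpp-python server is spun up.
    current_max_ctx_size: int = 4096


def record_server_ctx_size(cfg: ServeConfig, launched_ctx_size: int) -> None:
    """Remember the context size the active llama-cpp-python server was started with."""
    cfg.current_max_ctx_size = launched_ctx_size


def chat_max_tokens(cfg: ServeConfig) -> int:
    """When `ilab model chat` implicitly starts a llama-cpp-python server,
    cap max_tokens at the context size recorded in the serve config."""
    return cfg.current_max_ctx_size
```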
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
e2e workflow succeeded on this PR: View run, congrats!
This PR needs to be manually merged. Given it has two approvals and passes the S, M, and L E2E CI jobs, I will be merging it manually. The container test has been failing and was only triggered because I needed to change build arguments in the various container files. Will merge once all CI except for the container build passes.
@Mergifyio backport release-v0.22 |
✅ Backports have been created
version 0.3.5 of llama-cpp-python has a known issue (abetlen/llama-cpp-python#1861); version 0.3.2 has granite 3.0 support and does not have this issue. Bump to this version.

This required some additions to how we handle chat exceptions. As of these newer 0.3.z llama-cpp-python versions, a bad request causes the server to die. This requires us to know the max_ctx_size of the server before sending a completions request, so that we can keep the existing behavior of trimming messages until the request fits.

In order to do this, the config now contains a `current_max_ctx_size` field that we update when spinning up a server. In the case that a user implicitly starts a llama-cpp-python server when calling `ilab model chat`, we set max_tokens to the current `max_ctx_size` in the serve config.

**Checklist:**

- [ ] **Commit Message Formatting**: Commit titles and messages follow the [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/#summary) guidelines.
- [ ] [Changelog](https://github.com/instructlab/instructlab/blob/main/CHANGELOG.md) updated with breaking and/or notable changes for the next minor release.
- [ ] Documentation has been updated, if necessary.
- [ ] Unit tests have been added, if necessary.
- [ ] Functional tests have been added, if necessary.
- [ ] E2E Workflow tests have been added, if necessary.

---

This is an automatic backport of pull request #2825 done by [Mergify](https://mergify.com).

Approved-by: cdoern
Approved-by: alinaryan