
feat: support llama-cpp-python v0.3.2 #2825

Merged

cdoern merged 1 commit into instructlab:main from cdoern:llama-cpp-bump on Jan 8, 2025

Conversation

@cdoern
Contributor

@cdoern cdoern commented Dec 18, 2024

Version 0.3.5 of llama-cpp-python has a known issue (abetlen/llama-cpp-python#1861). Version 0.3.2 has granite 3.0 support and does not have this issue, so bump to that version.

This required some additions to how we handle chat exceptions. As of these newer 0.3.z llama-cpp-python versions, a bad request causes the server to die. This means we need to know the server's max_ctx_size before sending a completions request, so that we can maintain the behavior of trimming messages until we can respond to one that fits.

In order to do this, the config now contains a `current_max_ctx_size` field that we update when spinning up a server. In the case that a user implicitly starts a llama-cpp-python server when calling `ilab model chat`, we set `max_tokens` to the current `max_ctx_size` in the serve config.
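Roughly, the bookkeeping looks like the following sketch. The class and function names here are hypothetical placeholders for illustration, not the actual instructlab config API:

```python
from dataclasses import dataclass


@dataclass
class ServeConfig:
    """Hypothetical stand-in for the serve section of the ilab config."""
    max_ctx_size: int = 4096          # context size requested by the user
    current_max_ctx_size: int = 0     # context size of the server actually running


def spin_up_server(cfg: ServeConfig) -> None:
    # When a llama-cpp-python server is started (explicitly, or implicitly by
    # `ilab model chat`), record the context size it was given so the chat
    # client can trim messages *before* sending a request.
    cfg.current_max_ctx_size = cfg.max_ctx_size


def resolve_max_tokens(cfg: ServeConfig, requested: int | None = None) -> int:
    # If the user did not set max_tokens explicitly, fall back to the context
    # size of the currently running server.
    return requested if requested is not None else cfg.current_max_ctx_size
```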

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the
    conventional commits.
  • Changelog updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Functional tests have been added, if necessary.
  • E2E Workflow tests have been added, if necessary.

@mergify mergify bot added the CI/CD, container, documentation, testing, dependencies, and ci-failure labels Dec 18, 2024
@mergify mergify bot removed the ci-failure label Dec 18, 2024
@mergify mergify bot added the ci-failure label Dec 18, 2024
@cdoern cdoern force-pushed the llama-cpp-bump branch 2 times, most recently from f6b726e to 565ce5e on December 19, 2024 02:56
@mergify mergify bot added and removed the ci-failure label Dec 19, 2024
@mergify mergify bot added and removed the ci-failure label Dec 19, 2024
@mergify mergify bot removed the ci-failure label Dec 19, 2024
@cdoern
Contributor Author

cdoern commented Dec 19, 2024

Update here: as of llama_cpp_python 0.3.z, a BadRequestError causes the server to become unavailable. This means our method of removing messages which exceed the context window from our list no longer works, since by the point we catch the exception, the server is already dead.

We need to keep track of the max_ctx_size being used by the active server, and then, before we pass the list to the OpenAI endpoint, remove the most recent message if the total length of the content in the list is greater than max_ctx_size.
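A minimal sketch of that pre-send trim, assuming the standard OpenAI Python client and using raw content length as a crude stand-in for token counting; the endpoint, model name, and context size below are placeholders:

```python
from openai import OpenAI  # assumes the openai>=1.x client


def trim_to_fit(messages: list[dict], max_ctx_size: int) -> list[dict]:
    """Drop the most recent message(s) until the total content length fits.

    Character length is only a rough proxy for the server's token-based
    context limit; the important part is that the check happens before the
    request, since a too-large request now takes the server down instead of
    returning an error we can catch.
    """
    trimmed = list(messages)
    while trimmed and sum(len(m["content"]) for m in trimmed) > max_ctx_size:
        trimmed.pop()  # remove the most recent message
    return trimmed


# Hypothetical usage -- endpoint, model name, and context size are placeholders.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="no-key-needed")
history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this very long conversation ..."},
]
response = client.chat.completions.create(
    model="granite-7b-lab",
    messages=trim_to_fit(history, max_ctx_size=4096),
)
```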

@mergify mergify bot added the ci-failure label Dec 19, 2024
@mergify mergify bot added and removed the ci-failure label Dec 19, 2024
@mergify mergify bot added and removed the ci-failure label Dec 20, 2024
@mergify mergify bot added and removed the ci-failure label Dec 20, 2024
@mergify mergify bot added the ci-failure label Jan 7, 2025
@github-actions

github-actions bot commented Jan 7, 2025

e2e workflow failed on this PR: View run, please investigate.

@nathan-weinberg
Member

We'll need #2863 to merge to use the large test here

@mergify mergify bot removed the ci-failure label Jan 7, 2025
@github-actions

github-actions bot commented Jan 7, 2025

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

@mergify mergify bot added the ci-failure label Jan 7, 2025
@github-actions

github-actions bot commented Jan 7, 2025

e2e workflow failed on this PR: View run, please investigate.

@mergify mergify bot added and removed the ci-failure label Jan 7, 2025
@mergify mergify bot removed the ci-failure label Jan 7, 2025
@github-actions

github-actions bot commented Jan 7, 2025

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

@mergify mergify bot added the ci-failure label Jan 7, 2025
@mergify mergify bot removed the ci-failure label Jan 7, 2025
@github-actions

github-actions bot commented Jan 7, 2025

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

@mergify mergify bot added the ci-failure label Jan 7, 2025
@github-actions

github-actions bot commented Jan 8, 2025

e2e workflow succeeded on this PR: View run, congrats!

@cdoern cdoern removed the request for review from jaideepr97 January 8, 2025 00:48
@cdoern
Contributor Author

cdoern commented Jan 8, 2025

This PR needs to be manually merged. Given it has two approvals and passes the S, M, and L E2E CI Jobs, I will be merging it manually.

The container test has been failing, and it was only triggered because I needed to change build arguments in the various container files.

Will merge once all CI except for the container build passes.

@cdoern cdoern merged commit 6a2f56b into instructlab:main Jan 8, 2025
32 of 33 checks passed
@cdoern
Contributor Author

cdoern commented Jan 8, 2025

@Mergifyio backport release-v0.22

@mergify
Contributor

mergify bot commented Jan 8, 2025

backport release-v0.22

✅ Backports have been created


mergify bot added a commit that referenced this pull request Jan 10, 2025
This is an automatic backport of pull request #2825 done by [Mergify](https://mergify.com).


Approved-by: cdoern

Approved-by: alinaryan

Labels

  • CI/CD: Affects CI/CD configuration
  • ci-failure: PR has at least one CI failure
  • container: Affects containerization aspects
  • dependencies: Relates to dependencies
  • documentation: Improvements or additions to documentation
  • hold: In-progress PR. Tag should be removed before merge.
  • testing: Relates to testing


Development

Successfully merging this pull request may close these issues.

Update llama_cpp_python from 0.2.79 to 0.3 for executing new LLM architectures (granite 3.0)

3 participants
