Use proper backend for CLIP #1175


Merged
merged 1 commit into abetlen:main on Feb 11, 2024

Conversation

iamlemec
Contributor

When using LLaVa, CLIP does not load on the default backend (i.e. CUDA/Metal when they are available). This arises because the code in clip.cpp conditions on GGML_USE_XXX rather than LLAMA_USE_XXX. The main llama.cpp CMake file enables the former when the latter is present, but we need to do it manually for LLaVA since we're bypassing the top level.
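For reference, a minimal sketch of how the mismatch shows up from the Python side (assuming a llama-cpp-python build that exposes the llama_supports_gpu_offload binding; the model path is a placeholder):

import llama_cpp
from llama_cpp.llama_chat_format import Llava15ChatHandler

# The core llama library reports whether it was built with a GPU backend
# (CUDA/Metal).
print("llama GPU offload:", llama_cpp.llama_supports_gpu_offload())

# With verbose=True the CLIP loader prints which backend it picked. Before
# this fix it could report the CPU backend even when the line above prints
# True, because clip.cpp conditions on GGML_USE_XXX rather than LLAMA_USE_XXX.
chat_handler = Llava15ChatHandler(
    clip_model_path="path/to/llava/mmproj.bin",  # placeholder path
    verbose=True,
)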

@abetlen
Owner

abetlen commented Feb 11, 2024

@iamlemec that's great, thank you!

abetlen merged commit 19b55ad into abetlen:main on Feb 11, 2024
@eisneim

eisneim commented Mar 13, 2024

With llama-cpp-python 0.2.56, how do I enable CLIP offload to the GPU? My 3090 can do 50 tokens/s, but the total time is far too slow (92s), much slower than my MacBook M3 Max (6s).
I've tried: CMAKE_ARGS="-DLLAMA_CUBLAS=on -DLLAVA_BUILD=on" pip install llama-cpp-python but it does not work.

@iamlemec
Contributor Author

You should try to see whether it is in fact using the GPU. Assuming you're running something like the LLaVa example in the README, you can load the CLIP model in verbose mode with:

chat_handler = Llava15ChatHandler(clip_model_path="path/to/llava/mmproj.bin", verbose=True)

If you're using the CUDA backend, the output should include the line: "clip_model_load: CLIP using CUDA backend".

But yeah, it would be surprising if a 3090 was that much slower than an M3.
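If CLIP is on CUDA but things are still slow, it's worth confirming the LLM layers themselves are offloaded. A sketch along the lines of the README's LLaVA example (paths are placeholders; n_gpu_layers=-1 offloads all layers on a CUDA/Metal build):

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(
    clip_model_path="path/to/llava/mmproj.bin",  # placeholder path
    verbose=True,  # prints which backend CLIP is using
)
llm = Llama(
    model_path="path/to/llava/llama-model.gguf",  # placeholder path
    chat_handler=chat_handler,
    n_ctx=2048,       # room for the image embedding plus the prompt
    n_gpu_layers=-1,  # offload all LLM layers to the GPU backend
    verbose=True,     # the load logs report how many layers were offloaded
)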

@eisneim

eisneim commented Mar 13, 2024

@iamlemec thanks!

@eisneim

eisneim commented Mar 13, 2024

I found the issue! When I set the context size to 4096, the generation never ends. It keeps decoding until the context is full, which is why it took so long to finish, but this doesn't happen on the M3 Max, strangely.

@iamlemec
Contributor Author

Hmm, that's curious. Does the output from CUDA look reasonable? Also, what are you setting max_tokens to?

@eisneim

eisneim commented Mar 14, 2024

I did not set max_tokens, I just set n_ctx=4096. On the MacBook it works fine, but with CUDA it keeps generating and never ends.
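Since create_chat_completion generates until an end token or the context limit when max_tokens is unset, one way to bound a runaway generation is to cap it explicitly. A sketch, assuming an llm built as in the README's LLaVA example (the image URL and prompt are placeholders):

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who describes images."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},  # placeholder
                {"type": "text", "text": "Describe this image in detail."},
            ],
        },
    ],
    max_tokens=512,  # hard upper bound on generated tokens per call
)
print(response["choices"][0]["message"]["content"])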
