Commit fe2da09

feat: Generic Chat Formats, Tool Calling, and Huggingface Pull Support for Multimodal Models (Obsidian, LLaVA1.6, Moondream) (abetlen#1147)
* Test dummy image tags in chat templates
* Format and improve types for llava_cpp.py
* Add from_pretrained support to llava chat format.
* Refactor llava chat format to use a jinja2 template
* Revert chat format test
* Add moondream support (wip)
* Update moondream chat format
* Update moondream chat format
* Update moondream prompt
* Add function calling support
* Cache last image embed
* Add Llava1.6 support
* Add nanollava support
* Add obsidian support
* Remove unnecessary import
* Re-order multimodal chat formats
* Logits all no longer required for multi-modal models
* Update README.md
* Update docs
* Update README
* Fix typo
* Update README
* Fix typo
1 parent 97fb860 commit fe2da09

File tree: 5 files changed, +711 −145 lines changed

‎README.md

+36 −5 (36 additions, 5 deletions)
@@ -490,14 +490,15 @@ Due to discrepancies between llama.cpp and HuggingFace's tokenizers, it is requi
 
 ### Multi-modal Models
 
-`llama-cpp-python` supports the llava1.5 family of multi-modal models which allow the language model to
-read information from both text and images.
+`llama-cpp-python` supports multi-modal models such as llava1.5, which allow the language model to read information from both text and images.
 
 You'll first need to download one of the available multi-modal models in GGUF format:
 
 - [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
 - [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
 - [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)
+- [llava-v1.6-34b](https://huggingface.co/cjpais/llava-v1.6-34B-gguf)
+- [moondream2](https://huggingface.co/vikhyatk/moondream2)
 
 Then you'll need to use a custom chat handler to load the clip model and process the chat messages and images.
 
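The next hunk begins in the middle of the README's Python example, so the chat-handler setup it modifies is not visible here. For context, a minimal sketch of that setup, assuming the `Llava15ChatHandler` API and placeholder paths from the surrounding README:

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Load the CLIP/mmproj model that pairs with the LLaVA GGUF weights (path is a placeholder).
chat_handler = Llava15ChatHandler(clip_model_path="./path/to/llava/mmproj.bin")

llm = Llama(
    model_path="./path/to/llava/llama-model.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
)
```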
@@ -509,22 +510,52 @@ Then you'll need to use a custom chat handler to load the clip model and process
   model_path="./path/to/llava/llama-model.gguf",
   chat_handler=chat_handler,
   n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
-  logits_all=True,# needed to make llava work
 )
 >>> llm.create_chat_completion(
     messages = [
         {"role": "system", "content": "You are an assistant who perfectly describes images."},
         {
             "role": "user",
             "content": [
-                {"type": "image_url", "image_url": {"url": "https://.../image.png"}},
-                {"type" : "text", "text": "Describe this image in detail please."}
+                {"type" : "text", "text": "What's in this image?"},
+                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }
             ]
         }
     ]
 )
 ```
 
+You can also pull the model from the Hugging Face Hub using the `from_pretrained` method.
+
+```python
+>>> from llama_cpp import Llama
+>>> from llama_cpp.llama_chat_format import MoondreamChatHandler
+>>> chat_handler = MoondreamChatHandler.from_pretrained(
+  repo_id="vikhyatk/moondream2",
+  filename="*mmproj*",
+)
+>>> llm = Llama.from_pretrained(
+  repo_id="vikhyatk/moondream2",
+  filename="*text-model*",
+  chat_handler=chat_handler,
+  n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
+)
+>>> llm.create_chat_completion(
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {"type" : "text", "text": "What's in this image?"},
+                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }
+
+            ]
+        }
+    ]
+)
+```
+
+**Note**: Multi-modal models also support tool calling and JSON mode.
+
 <details>
 <summary>Loading a Local Image</summary>
 
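The note added above mentions tool calling and JSON mode without showing either. A minimal sketch of JSON mode with a multi-modal model, assuming the `llm` object from the README example and the existing `response_format` parameter of `create_chat_completion`:

```python
# Constrain the multi-modal reply to valid JSON (sketch; reuses `llm` from the example above).
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Return a JSON object with keys 'subject' and 'setting' describing this image."},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}},
            ],
        }
    ],
    response_format={"type": "json_object"},
)
print(response["choices"][0]["message"]["content"])
```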
‎docs/server.md

+2 −0 (2 additions, 0 deletions)
@@ -98,6 +98,8 @@ You'll first need to download one of the available multi-modal models in GGUF fo
 - [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
 - [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
 - [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)
+- [llava-v1.6-34b](https://huggingface.co/cjpais/llava-v1.6-34B-gguf)
+- [moondream2](https://huggingface.co/vikhyatk/moondream2)
 
 Then when you run the server you'll need to also specify the path to the clip model used for image embedding and the `llava-1-5` chat_format
 
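For reference, that sentence implies a server invocation roughly like the following sketch; the paths are placeholders, and the `--clip_model_path` and `--chat_format` flag names are assumed from the surrounding docs/server.md instructions:

```bash
python3 -m llama_cpp.server \
  --model ./path/to/llava/llama-model.gguf \
  --clip_model_path ./path/to/llava/mmproj.bin \
  --chat_format llava-1-5 \
  --n_ctx 2048
```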