Commit 7da8e0f

Merge branch 'main' of github.com:abetlen/llama_cpp_python into main

2 parents 8474665 + 40b2290
1 file changed (+14 -7)

README.md
@@ -106,14 +106,14 @@ Below is a short example demonstrating how to use the high-level API to generate
 
 ```python
 >>> from llama_cpp import Llama
->>> llm = Llama(model_path="./models/7B/ggml-model.bin")
+>>> llm = Llama(model_path="./models/7B/llama-model.gguf")
 >>> output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
 >>> print(output)
 {
   "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
   "object": "text_completion",
   "created": 1679561337,
-  "model": "./models/7B/ggml-model.bin",
+  "model": "./models/7B/llama-model.gguf",
   "choices": [
     {
       "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
@@ -136,15 +136,15 @@ The context window of the Llama models determines the maximum number of tokens t
 For instance, if you want to work with larger contexts, you can expand the context window by setting the n_ctx parameter when initializing the Llama object:
 
 ```python
-llm = Llama(model_path="./models/7B/ggml-model.bin", n_ctx=2048)
+llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)
 ```
 
 ### Loading llama-2 70b
 
 Llama2 70b must set the `n_gqa` parameter (grouped-query attention factor) to 8 when loading:
 
 ```python
-llm = Llama(model_path="./models/70B/ggml-model.bin", n_gqa=8)
+llm = Llama(model_path="./models/70B/llama-model.gguf", n_gqa=8)
 ```
 
 ## Web Server
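
Note (not part of this diff): the two options above compose. A minimal sketch, assuming a local 70B GGUF file at an illustrative path:

```python
from llama_cpp import Llama

# Larger context window plus the grouped-query attention factor the README
# requires for Llama 2 70B; both keyword arguments appear in the hunk above.
llm = Llama(
    model_path="./models/70B/llama-model.gguf",  # illustrative path
    n_ctx=2048,
    n_gqa=8,
)
```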
@@ -156,17 +156,24 @@ To install the server package and get started:
 
 ```bash
 pip install llama-cpp-python[server]
-python3 -m llama_cpp.server --model models/7B/ggml-model.bin
+python3 -m llama_cpp.server --model models/7B/llama-model.gguf
+```
+Similar to Hardware Acceleration section above, you can also install with GPU (cuBLAS) support like this:
+
+```bash
+CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python[server]
+python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35
 ```
 
 Navigate to [http://localhost:8000/docs](http://localhost:8000/docs) to see the OpenAPI documentation.
 
+
 ## Docker image
 
 A Docker image is available on [GHCR](https://ghcr.io/abetlen/llama-cpp-python). To run the server:
 
 ```bash
-docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/ggml-model-name.bin ghcr.io/abetlen/llama-cpp-python:latest
+docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest
 ```
 [Docker on termux (requires root)](https://gist.github.com/FreddieOliveira/efe850df7ff3951cb62d74bd770dce27) is currently the only known way to run this on phones, see [termux support issue](https://github.com/abetlen/llama-cpp-python/issues/389)
 
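
Note (not part of this diff): once the server from the Web Server section above is running, it exposes an OpenAI-compatible REST API. A hedged sketch of a completion request; the exact endpoint and payload shape should be checked against the OpenAPI page at http://localhost:8000/docs:

```python
import requests

# Assumes the server was started as shown above and is listening on port 8000.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "prompt": "Q: Name the planets in the solar system? A: ",
        "max_tokens": 32,
        "stop": ["Q:", "\n"],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```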
@@ -183,7 +190,7 @@ Below is a short example demonstrating how to use the low-level API to tokenize
 >>> llama_cpp.llama_backend_init(numa=False) # Must be called once at the start of each program
 >>> params = llama_cpp.llama_context_default_params()
 # use bytes for char * params
->>> model = llama_cpp.llama_load_model_from_file(b"./models/7b/ggml-model.bin", params)
+>>> model = llama_cpp.llama_load_model_from_file(b"./models/7b/llama-model.gguf", params)
 >>> ctx = llama_cpp.llama_new_context_with_model(model, params)
 >>> max_tokens = params.n_ctx
 # use ctypes arrays for array params
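
Note (not part of this diff): the low-level hunk above ends before the ctypes array handling. A minimal sketch of the same tokenization step via the high-level wrapper, which allocates the token arrays internally (illustrative model path):

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/llama-model.gguf")

# tokenize()/detokenize() operate on bytes, matching the "use bytes" comment above.
tokens = llm.tokenize(b"Q: Name the planets in the solar system? A: ")
print(tokens)                  # list of token ids
print(llm.detokenize(tokens))  # roughly reconstructs the original bytes
```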
