Commit d8fddcc

Merge branch 'main' of github.com:abetlen/llama_cpp_python into better-server-params-and-fields

2 parents 3008a95 + 397ae97

13 files changed, +341 -142 lines changed

‎.github/ISSUE_TEMPLATE/bug_report.md

80 additions & 0 deletions

@@ -0,0 +1,80 @@
+---
+name: Bug report
+about: Create a report to help us improve
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+# Prerequisites
+
+Please answer the following questions for yourself before submitting an issue.
+
+- [ ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
+- [ ] I carefully followed the [README.md](https://github.com/abetlen/llama-cpp-python/blob/main/README.md).
+- [ ] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
+- [ ] I reviewed the [Discussions](https://github.com/abetlen/llama-cpp-python/discussions), and have a new bug or useful enhancement to share.
+
+# Expected Behavior
+
+Please provide a detailed written description of what you were trying to do, and what you expected `llama-cpp-python` to do.
+
+# Current Behavior
+
+Please provide a detailed written description of what `llama-cpp-python` did, instead.
+
+# Environment and Context
+
+Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
+
+* Physical (or virtual) hardware you are using, e.g. for Linux:
+
+`$ lscpu`
+
+* Operating System, e.g. for Linux:
+
+`$ uname -a`
+
+* SDK version, e.g. for Linux:
+
+```
+$ python3 --version
+$ make --version
+$ g++ --version
+```
+
+# Failure Information (for bugs)
+
+Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
+
+# Steps to Reproduce
+
+Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
+
+1. step 1
+2. step 2
+3. step 3
+4. etc.
+
+**Note: Many issues seem to be regarding performance issues / differences with `llama.cpp`. In these cases we need to confirm that you're comparing against the version of `llama.cpp` that was built with your python package, and which parameters you're passing to the context.**
+
+# Failure Logs
+
+Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.
+
+Also, please try to **avoid using screenshots** if at all possible. Instead, copy/paste the console output and use [Github's markdown](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) to cleanly format your logs for easy readability.
+
+Example environment info:
+```
+llama-cpp-python$ git log | head -1
+commit 47b0aa6e957b93dbe2c29d53af16fbae2dd628f2
+
+llama-cpp-python$ python3 --version
+Python 3.10.10
+
+llama-cpp-python$ pip list | egrep "uvicorn|fastapi|sse-starlette"
+fastapi 0.95.0
+sse-starlette 1.3.3
+uvicorn 0.21.1
+```

.github/ISSUE_TEMPLATE/feature_request.md

20 additions & 0 deletions

@@ -0,0 +1,20 @@
+---
+name: Feature request
+about: Suggest an idea for this project
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+**Is your feature request related to a problem? Please describe.**
+A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+
+**Describe the solution you'd like**
+A clear and concise description of what you want to happen.
+
+**Describe alternatives you've considered**
+A clear and concise description of any alternative solutions or features you've considered.
+
+**Additional context**
+Add any other context or screenshots about the feature request here.

‎.github/dependabot.yml

11 additions & 0 deletions

@@ -0,0 +1,11 @@
+# To get started with Dependabot version updates, you'll need to specify which
+# package ecosystems to update and where the package manifests are located.
+# Please see the documentation for all configuration options:
+# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates
+
+version: 2
+updates:
+  - package-ecosystem: "pip" # See documentation for possible values
+    directory: "/" # Location of package manifests
+    schedule:
+      interval: "weekly"

‎.github/workflows/build-docker.yaml

1 addition & 1 deletion

@@ -36,4 +36,4 @@ jobs:
 push: true # push to registry
 pull: true # always fetch the latest base images
 platforms: linux/amd64,linux/arm64 # build for both amd64 and arm64
-tags: ghcr.io/abetlen/llama-cpp-python:latest
+tags: ghcr.io/abetlen/llama-cpp-python:latest

‎Dockerfile

3 additions & 3 deletions

@@ -1,15 +1,15 @@
-FROM python:3-bullseye
+FROM python:3-slim-bullseye
 
 # We need to set the host to 0.0.0.0 to allow outside access
 ENV HOST 0.0.0.0
 
 COPY . .
 
 # Install the package
-RUN apt update && apt install -y libopenblas-dev
+RUN apt update && apt install -y libopenblas-dev ninja-build build-essential
 RUN python -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette
 
 RUN LLAMA_OPENBLAS=1 python3 setup.py develop
 
 # Run the server
-CMD python3 -m llama_cpp.server
+CMD python3 -m llama_cpp.server

‎README.md

31 additions & 2 deletions

@@ -31,6 +31,10 @@ You can force the use of `cmake` on Linux / MacOS setting the `FORCE_CMAKE=1` en
 
 ## High-level API
 
+The high-level API provides a simple managed interface through the `Llama` class.
+
+Below is a short example demonstrating how to use the high-level API to generate text:
+
 ```python
 >>> from llama_cpp import Llama
 >>> llm = Llama(model_path="./models/7B/ggml-model.bin")
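
The example that begins in this hunk continues in the README; for reference, a complete high-level call looks roughly like the following. This is a minimal sketch that assumes `Llama.__call__` accepts the `max_tokens`, `stop`, and `echo` keyword arguments; the model path is illustrative.

```python
from llama_cpp import Llama

# Load a quantized ggml model from disk (path is illustrative).
llm = Llama(model_path="./models/7B/ggml-model.bin")

# Generate a completion; `stop` sequences end generation early and `echo=True`
# includes the prompt in the returned text.
output = llm(
    "Q: Name the planets in the solar system? A: ",
    max_tokens=32,
    stop=["Q:", "\n"],
    echo=True,
)

# The result follows an OpenAI-style completion schema.
print(output["choices"][0]["text"])
```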
@@ -64,12 +68,20 @@ This allows you to use llama.cpp compatible models with any OpenAI compatible cl
 
 To install the server package and get started:
 
+Linux/MacOS
 ```bash
 pip install llama-cpp-python[server]
 export MODEL=./models/7B/ggml-model.bin
 python3 -m llama_cpp.server
 ```
 
+Windows
+```cmd
+pip install llama-cpp-python[server]
+SET MODEL=..\models\7B\ggml-model.bin
+python3 -m llama_cpp.server
+```
+
 Navigate to [http://localhost:8000/docs](http://localhost:8000/docs) to see the OpenAPI documentation.
 
 ## Docker image
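
The hunk above adds Windows instructions alongside the existing Linux/MacOS ones; either way the result is an OpenAI-compatible HTTP server on port 8000. A minimal sketch of querying it from Python with `requests`, assuming the server exposes the OpenAI-style `/v1/completions` route:

```python
import requests

# Query the locally running llama_cpp.server instance through its
# OpenAI-compatible completions endpoint (assumed to be on port 8000).
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "prompt": "Q: Name the planets in the solar system? A: ",
        "max_tokens": 32,
        "stop": ["Q:", "\n"],
    },
)
print(response.json()["choices"][0]["text"])
```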
@@ -82,8 +94,25 @@ docker run --rm -it -p8000:8000 -v /path/to/models:/models -eMODEL=/models/ggml-
 
 ## Low-level API
 
-The low-level API is a direct `ctypes` binding to the C API provided by `llama.cpp`.
-The entire API can be found in [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/master/llama_cpp/llama_cpp.py) and should mirror [llama.h](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).
+The low-level API is a direct [`ctypes`](https://docs.python.org/3/library/ctypes.html) binding to the C API provided by `llama.cpp`.
+The entire low-level API can be found in [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/master/llama_cpp/llama_cpp.py) and directly mirrors the C API in [llama.h](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).
+
+Below is a short example demonstrating how to use the low-level API to tokenize a prompt:
+
+```python
+>>> import llama_cpp
+>>> import ctypes
+>>> params = llama_cpp.llama_context_default_params()
+# use bytes for char * params
+>>> ctx = llama_cpp.llama_init_from_file(b"./models/7b/ggml-model.bin", params)
+>>> max_tokens = params.n_ctx
+# use ctypes arrays for array params
+>>> tokens = (llama_cpp.llama_token * int(max_tokens))()
+>>> n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, add_bos=llama_cpp.c_bool(True))
+>>> llama_cpp.llama_free(ctx)
+```
+
+Check out the [examples folder](examples/low_level_api) for more examples of using the low-level API.
 
 
 # Documentation
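
As a follow-up to the tokenization example added above, the token ids written into the ctypes array can be mapped back to text before the context is freed. A minimal sketch, assuming the binding also exposes `llama_token_to_str` mirroring the function of the same name in `llama.h`:

```python
# Continues the low-level example above: n_tokens is the count returned by
# llama_tokenize, and tokens is the ctypes array it filled in.
for i in range(n_tokens):
    piece = llama_cpp.llama_token_to_str(ctx, tokens[i])  # returns bytes
    print(tokens[i], piece.decode("utf-8", errors="ignore"))
```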

‎llama_cpp/llama.py

40 additions & 22 deletions

@@ -33,12 +33,10 @@ def _find_key(
                 return k
         return None
 
-    def __getitem__(
-        self, key: Sequence[llama_cpp.llama_token]
-    ) -> Optional["LlamaState"]:
+    def __getitem__(self, key: Sequence[llama_cpp.llama_token]) -> "LlamaState":
         _key = self._find_key(tuple(key))
         if _key is None:
-            return None
+            raise KeyError(f"Key not found: {key}")
         return self.cache_state[_key]
 
     def __contains__(self, key: Sequence[llama_cpp.llama_token]) -> bool:
@@ -53,8 +51,8 @@ class LlamaState:
     def __init__(
         self,
        eval_tokens: Deque[llama_cpp.llama_token],
-        eval_logits: Deque[List[llama_cpp.c_float]],
-        llama_state,
+        eval_logits: Deque[List[float]],
+        llama_state, # type: llama_cpp.Array[llama_cpp.c_uint8]
         llama_state_size: llama_cpp.c_size_t,
     ):
         self.eval_tokens = eval_tokens
@@ -129,7 +127,7 @@ def __init__(
         self.last_n_tokens_size = last_n_tokens_size
         self.n_batch = min(n_ctx, n_batch)
         self.eval_tokens: Deque[llama_cpp.llama_token] = deque(maxlen=n_ctx)
-        self.eval_logits: Deque[List[llama_cpp.c_float]] = deque(
+        self.eval_logits: Deque[List[float]] = deque(
             maxlen=n_ctx if logits_all else 1
         )
 
@@ -247,7 +245,7 @@ def eval(self, tokens: Sequence[llama_cpp.llama_token]):
             n_vocab = llama_cpp.llama_n_vocab(self.ctx)
             cols = int(n_vocab)
             logits_view = llama_cpp.llama_get_logits(self.ctx)
-            logits: List[List[llama_cpp.c_float]] = [
+            logits: List[List[float]] = [
                 [logits_view[i * cols + j] for j in range(cols)] for i in range(rows)
             ]
             self.eval_logits.extend(logits)
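
The type-annotation changes in these hunks reflect that indexing the ctypes logits pointer already yields plain Python floats, so there is no need to carry `c_float` objects around. A standalone sketch of the conversion pattern used in `eval()`, with a made-up buffer standing in for `llama_cpp.llama_get_logits(ctx)`:

```python
import ctypes

# Made-up 2x3 row-major float buffer standing in for the logits buffer
# returned by llama_cpp.llama_get_logits(ctx).
rows, cols = 2, 3
buf = (ctypes.c_float * (rows * cols))(0.1, 0.2, 0.3, 1.0, 2.0, 3.0)
logits_view = ctypes.cast(buf, ctypes.POINTER(ctypes.c_float))

# Same comprehension as in Llama.eval(): indexing a POINTER(c_float) returns
# ordinary Python floats, so the result is a List[List[float]].
logits = [
    [logits_view[i * cols + j] for j in range(cols)] for i in range(rows)
]
print(logits)
```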
@@ -289,7 +287,7 @@ def _sample_top_p_top_k(
             candidates=llama_cpp.ctypes.pointer(candidates),
             penalty=repeat_penalty,
         )
-        if temp == 0.0:
+        if float(temp.value) == 0.0:
             return llama_cpp.llama_sample_token_greedy(
                 ctx=self.ctx,
                 candidates=llama_cpp.ctypes.pointer(candidates),
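
The comparison changes because `temp` reaches this method as a ctypes `c_float`, and a ctypes scalar never compares equal to a Python float; its `.value` has to be unwrapped first. A quick illustration:

```python
import ctypes

temp = ctypes.c_float(0.0)
print(temp == 0.0)               # False: c_float does not compare by value
print(float(temp.value) == 0.0)  # True: .value unwraps the underlying float
```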
@@ -299,21 +297,25 @@ def _sample_top_p_top_k(
                 ctx=self.ctx,
                 candidates=llama_cpp.ctypes.pointer(candidates),
                 k=top_k,
+                min_keep=llama_cpp.c_size_t(1),
             )
             llama_cpp.llama_sample_tail_free(
                 ctx=self.ctx,
                 candidates=llama_cpp.ctypes.pointer(candidates),
                 z=llama_cpp.c_float(1.0),
+                min_keep=llama_cpp.c_size_t(1),
             )
             llama_cpp.llama_sample_typical(
                 ctx=self.ctx,
                 candidates=llama_cpp.ctypes.pointer(candidates),
                 p=llama_cpp.c_float(1.0),
+                min_keep=llama_cpp.c_size_t(1),
             )
             llama_cpp.llama_sample_top_p(
                 ctx=self.ctx,
                 candidates=llama_cpp.ctypes.pointer(candidates),
                 p=top_p,
+                min_keep=llama_cpp.c_size_t(1),
             )
             llama_cpp.llama_sample_temperature(
                 ctx=self.ctx,
@@ -390,18 +392,28 @@ def generate(
         """
         assert self.ctx is not None
 
-        if (
-            reset
-            and len(self.eval_tokens) > 0
-            and tuple(self.eval_tokens) == tuple(tokens[: len(self.eval_tokens)])
-        ):
-            if self.verbose:
-                print("Llama.generate: cache hit", file=sys.stderr)
-            reset = False
-            tokens = tokens[len(self.eval_tokens) :]
+        if reset and len(self.eval_tokens) > 0:
+            longest_prefix = 0
+            for a, b in zip(self.eval_tokens, tokens[:-1]):
+                if a == b:
+                    longest_prefix += 1
+                else:
+                    break
+            if longest_prefix > 0:
+                if self.verbose:
+                    print("Llama.generate: prefix-match hit", file=sys.stderr)
+                reset = False
+                tokens = tokens[longest_prefix:]
+                for _ in range(len(self.eval_tokens) - longest_prefix):
+                    self.eval_tokens.pop()
+                    try:
+                        self.eval_logits.pop()
+                    except IndexError:
+                        pass
 
         if reset:
             self.reset()
+
         while True:
             self.eval(tokens)
             token = self.sample(
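
The rewritten block replaces the previous all-or-nothing cache check with a longest-common-prefix match: a prompt that shares only part of its beginning with the already-evaluated tokens still skips the shared part, and only the remaining tail gets re-evaluated. A standalone sketch of the matching logic with made-up token ids:

```python
# Tokens already evaluated by the model vs. the tokens of a new prompt
# (the values are made up for illustration).
eval_tokens = [101, 7, 42, 13]
tokens = [101, 7, 42, 99, 55]

# Same loop as in Llama.generate(): count matching leading tokens. The new
# prompt's last token is excluded so at least one token is always evaluated.
longest_prefix = 0
for a, b in zip(eval_tokens, tokens[:-1]):
    if a == b:
        longest_prefix += 1
    else:
        break

print(longest_prefix)           # 3
print(tokens[longest_prefix:])  # [99, 55] -- only this tail is re-evaluated
```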
@@ -639,7 +651,10 @@ def _create_completion(
                 self.detokenize([token]).decode("utf-8", errors="ignore")
                 for token in all_tokens
             ]
-            all_logprobs = [Llama._logits_to_logprobs(row) for row in self.eval_logits]
+            all_logprobs = [
+                Llama.logits_to_logprobs(list(map(float, row)))
+                for row in self.eval_logits
+            ]
             for token, token_str, logprobs_token in zip(
                 all_tokens, all_token_strs, all_logprobs
             ):
@@ -958,7 +973,10 @@ def save_state(self) -> LlamaState:
         llama_state_compact = (llama_cpp.c_uint8 * int(n_bytes))()
         llama_cpp.ctypes.memmove(llama_state_compact, llama_state, int(n_bytes))
         if self.verbose:
-            print(f"Llama.save_state: saving {n_bytes} bytes of llama state", file=sys.stderr)
+            print(
+                f"Llama.save_state: saving {n_bytes} bytes of llama state",
+                file=sys.stderr,
+            )
         return LlamaState(
             eval_tokens=self.eval_tokens.copy(),
             eval_logits=self.eval_logits.copy(),
@@ -985,7 +1003,7 @@ def token_bos() -> llama_cpp.llama_token:
         return llama_cpp.llama_token_bos()
 
     @staticmethod
-    def logits_to_logprobs(logits: List[llama_cpp.c_float]) -> List[llama_cpp.c_float]:
+    def logits_to_logprobs(logits: List[float]) -> List[float]:
         exps = [math.exp(float(x)) for x in logits]
         sum_exps = sum(exps)
-        return [llama_cpp.c_float(math.log(x / sum_exps)) for x in exps]
+        return [math.log(x / sum_exps) for x in exps]
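
With the return type switched to plain floats, `logits_to_logprobs` behaves like an ordinary log-softmax and can be sanity-checked in isolation; a small sketch with made-up logits (the printed values are approximate):

```python
import math

from llama_cpp import Llama

# logits_to_logprobs is a @staticmethod, so no model file is needed;
# the logits below are made up.
logprobs = Llama.logits_to_logprobs([2.0, 1.0, 0.1])
print(logprobs)  # roughly [-0.42, -1.42, -2.32]

# Exponentiating the log-probabilities gives probabilities that sum to ~1,
# as expected of a softmax.
print(sum(math.exp(lp) for lp in logprobs))
```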
