Commit d8fddcc

Merge branch 'main' of github.com:abetlen/llama_cpp_python into better-server-params-and-fields

2 parents 3008a95 + 397ae97

13 files changed, +341 -142 lines changed

‎.github/ISSUE_TEMPLATE/bug_report.md

80 additions & 0 deletions

@@ -0,0 +1,80 @@
+---
+name: Bug report
+about: Create a report to help us improve
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+# Prerequisites
+
+Please answer the following questions for yourself before submitting an issue.
+
+- [ ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
+- [ ] I carefully followed the [README.md](https://github.com/abetlen/llama-cpp-python/blob/main/README.md).
+- [ ] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
+- [ ] I reviewed the [Discussions](https://github.com/abetlen/llama-cpp-python/discussions), and have a new bug or useful enhancement to share.
+
+# Expected Behavior
+
+Please provide a detailed written description of what you were trying to do, and what you expected `llama-cpp-python` to do.
+
+# Current Behavior
+
+Please provide a detailed written description of what `llama-cpp-python` did, instead.
+
+# Environment and Context
+
+Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
+
+* Physical (or virtual) hardware you are using, e.g. for Linux:
+
+`$ lscpu`
+
+* Operating System, e.g. for Linux:
+
+`$ uname -a`
+
+* SDK version, e.g. for Linux:
+
+```
+$ python3 --version
+$ make --version
+$ g++ --version
+```
+
+# Failure Information (for bugs)
+
+Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
+
+# Steps to Reproduce
+
+Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
+
+1. step 1
+2. step 2
+3. step 3
+4. etc.
+
+**Note: Many issues seem to be regarding performance issues / differences with `llama.cpp`. In these cases we need to confirm that you're comparing against the version of `llama.cpp` that was built with your python package, and which parameters you're passing to the context.**
+
+# Failure Logs
+
+Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.
+
+Also, please try to **avoid using screenshots** if at all possible. Instead, copy/paste the console output and use [Github's markdown](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) to cleanly format your logs for easy readability.
+
+Example environment info:
+```
+llama-cpp-python$ git log | head -1
+commit 47b0aa6e957b93dbe2c29d53af16fbae2dd628f2
+
+llama-cpp-python$ python3 --version
+Python 3.10.10
+
+llama-cpp-python$ pip list | egrep "uvicorn|fastapi|sse-starlette"
+fastapi 0.95.0
+sse-starlette 1.3.3
+uvicorn 0.21.1
+```

.github/ISSUE_TEMPLATE/feature_request.md

20 additions & 0 deletions

@@ -0,0 +1,20 @@
+---
+name: Feature request
+about: Suggest an idea for this project
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+**Is your feature request related to a problem? Please describe.**
+A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+
+**Describe the solution you'd like**
+A clear and concise description of what you want to happen.
+
+**Describe alternatives you've considered**
+A clear and concise description of any alternative solutions or features you've considered.
+
+**Additional context**
+Add any other context or screenshots about the feature request here.

‎.github/dependabot.yml

11 additions & 0 deletions

@@ -0,0 +1,11 @@
+# To get started with Dependabot version updates, you'll need to specify which
+# package ecosystems to update and where the package manifests are located.
+# Please see the documentation for all configuration options:
+# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates
+
+version: 2
+updates:
+  - package-ecosystem: "pip" # See documentation for possible values
+    directory: "/" # Location of package manifests
+    schedule:
+      interval: "weekly"

‎.github/workflows/build-docker.yaml

1 addition & 1 deletion

@@ -36,4 +36,4 @@ jobs:
 push: true # push to registry
 pull: true # always fetch the latest base images
 platforms: linux/amd64,linux/arm64 # build for both amd64 and arm64
-tags: ghcr.io/abetlen/llama-cpp-python:latest
+tags: ghcr.io/abetlen/llama-cpp-python:latest

‎Dockerfile

3 additions & 3 deletions

@@ -1,15 +1,15 @@
-FROM python:3-bullseye
+FROM python:3-slim-bullseye
 
 # We need to set the host to 0.0.0.0 to allow outside access
 ENV HOST 0.0.0.0
 
 COPY . .
 
 # Install the package
-RUN apt update && apt install -y libopenblas-dev
+RUN apt update && apt install -y libopenblas-dev ninja-build build-essential
 RUN python -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette
 
 RUN LLAMA_OPENBLAS=1 python3 setup.py develop
 
 # Run the server
-CMD python3 -m llama_cpp.server
+CMD python3 -m llama_cpp.server

‎README.md

31 additions & 2 deletions

@@ -31,6 +31,10 @@ You can force the use of `cmake` on Linux / MacOS setting the `FORCE_CMAKE=1` en
 
 ## High-level API
 
+The high-level API provides a simple managed interface through the `Llama` class.
+
+Below is a short example demonstrating how to use the high-level API to generate text:
+
 ```python
 >>> from llama_cpp import Llama
 >>> llm = Llama(model_path="./models/7B/ggml-model.bin")
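
The example that begins in this hunk continues in the README; for reference, a complete high-level call looks roughly like the following. This is a minimal sketch that assumes `Llama.__call__` accepts the `max_tokens`, `stop`, and `echo` keyword arguments; the model path is illustrative.

```python
from llama_cpp import Llama

# Load a quantized ggml model from disk (path is illustrative).
llm = Llama(model_path="./models/7B/ggml-model.bin")

# Generate a completion; `stop` sequences end generation early and `echo=True`
# includes the prompt in the returned text.
output = llm(
    "Q: Name the planets in the solar system? A: ",
    max_tokens=32,
    stop=["Q:", "\n"],
    echo=True,
)

# The result follows an OpenAI-style completion schema.
print(output["choices"][0]["text"])
```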
@@ -64,12 +68,20 @@ This allows you to use llama.cpp compatible models with any OpenAI compatible cl
 
 To install the server package and get started:
 
+Linux/MacOS
 ```bash
 pip install llama-cpp-python[server]
 export MODEL=./models/7B/ggml-model.bin
 python3 -m llama_cpp.server
 ```
 
+Windows
+```cmd
+pip install llama-cpp-python[server]
+SET MODEL=..\models\7B\ggml-model.bin
+python3 -m llama_cpp.server
+```
+
 Navigate to [http://localhost:8000/docs](http://localhost:8000/docs) to see the OpenAPI documentation.
 
 ## Docker image
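
The hunk above adds Windows instructions alongside the existing Linux/MacOS ones; either way the result is an OpenAI-compatible HTTP server on port 8000. A minimal sketch of querying it from Python with `requests`, assuming the server exposes the OpenAI-style `/v1/completions` route:

```python
import requests

# Query the locally running llama_cpp.server instance through its
# OpenAI-compatible completions endpoint (assumed to be on port 8000).
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "prompt": "Q: Name the planets in the solar system? A: ",
        "max_tokens": 32,
        "stop": ["Q:", "\n"],
    },
)
print(response.json()["choices"][0]["text"])
```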
@@ -82,8 +94,25 @@ docker run --rm -it -p8000:8000 -v /path/to/models:/models -eMODEL=/models/ggml-
 
 ## Low-level API
 
-The low-level API is a direct `ctypes` binding to the C API provided by `llama.cpp`.
-The entire API can be found in [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/master/llama_cpp/llama_cpp.py) and should mirror [llama.h](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).
+The low-level API is a direct [`ctypes`](https://docs.python.org/3/library/ctypes.html) binding to the C API provided by `llama.cpp`.
+The entire low-level API can be found in [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/master/llama_cpp/llama_cpp.py) and directly mirrors the C API in [llama.h](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).
+
+Below is a short example demonstrating how to use the low-level API to tokenize a prompt:
+
+```python
+>>> import llama_cpp
+>>> import ctypes
+>>> params = llama_cpp.llama_context_default_params()
+# use bytes for char * params
+>>> ctx = llama_cpp.llama_init_from_file(b"./models/7b/ggml-model.bin", params)
+>>> max_tokens = params.n_ctx
+# use ctypes arrays for array params
+>>> tokens = (llama_cpp.llama_token * int(max_tokens))()
+>>> n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, add_bos=llama_cpp.c_bool(True))
+>>> llama_cpp.llama_free(ctx)
+```
+
+Check out the [examples folder](examples/low_level_api) for more examples of using the low-level API.
 
 
 # Documentation
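
As a follow-up to the tokenization example added above, the token ids written into the ctypes array can be mapped back to text before the context is freed. A minimal sketch, assuming the binding also exposes `llama_token_to_str` mirroring the function of the same name in `llama.h`:

```python
# Continues the low-level example above: n_tokens is the count returned by
# llama_tokenize, and tokens is the ctypes array it filled in.
for i in range(n_tokens):
    piece = llama_cpp.llama_token_to_str(ctx, tokens[i])  # returns bytes
    print(tokens[i], piece.decode("utf-8", errors="ignore"))
```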

‎llama_cpp/llama.py

40 additions & 22 deletions

@@ -33,12 +33,10 @@ def _find_key(
                 return k
         return None
 
-    def __getitem__(
-        self, key: Sequence[llama_cpp.llama_token]
-    ) -> Optional["LlamaState"]:
+    def __getitem__(self, key: Sequence[llama_cpp.llama_token]) -> "LlamaState":
         _key = self._find_key(tuple(key))
         if _key is None:
-            return None
+            raise KeyError(f"Key not found: {key}")
         return self.cache_state[_key]
 
     def __contains__(self, key: Sequence[llama_cpp.llama_token]) -> bool:
@@ -53,8 +51,8 @@ class LlamaState:
     def __init__(
         self,
        eval_tokens: Deque[llama_cpp.llama_token],
-        eval_logits: Deque[List[llama_cpp.c_float]],
-        llama_state,
+        eval_logits: Deque[List[float]],
+        llama_state, # type: llama_cpp.Array[llama_cpp.c_uint8]
         llama_state_size: llama_cpp.c_size_t,
     ):
         self.eval_tokens = eval_tokens
@@ -129,7 +127,7 @@ def __init__(
         self.last_n_tokens_size = last_n_tokens_size
         self.n_batch = min(n_ctx, n_batch)
         self.eval_tokens: Deque[llama_cpp.llama_token] = deque(maxlen=n_ctx)
-        self.eval_logits: Deque[List[llama_cpp.c_float]] = deque(
+        self.eval_logits: Deque[List[float]] = deque(
             maxlen=n_ctx if logits_all else 1
         )
 
@@ -247,7 +245,7 @@ def eval(self, tokens: Sequence[llama_cpp.llama_token]):
             n_vocab = llama_cpp.llama_n_vocab(self.ctx)
             cols = int(n_vocab)
             logits_view = llama_cpp.llama_get_logits(self.ctx)
-            logits: List[List[llama_cpp.c_float]] = [
+            logits: List[List[float]] = [
                 [logits_view[i * cols + j] for j in range(cols)] for i in range(rows)
             ]
             self.eval_logits.extend(logits)
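
The type-annotation changes in these hunks reflect that indexing the ctypes logits pointer already yields plain Python floats, so there is no need to carry `c_float` objects around. A standalone sketch of the conversion pattern used in `eval()`, with a made-up buffer standing in for `llama_cpp.llama_get_logits(ctx)`:

```python
import ctypes

# Made-up 2x3 row-major float buffer standing in for the logits buffer
# returned by llama_cpp.llama_get_logits(ctx).
rows, cols = 2, 3
buf = (ctypes.c_float * (rows * cols))(0.1, 0.2, 0.3, 1.0, 2.0, 3.0)
logits_view = ctypes.cast(buf, ctypes.POINTER(ctypes.c_float))

# Same comprehension as in Llama.eval(): indexing a POINTER(c_float) returns
# ordinary Python floats, so the result is a List[List[float]].
logits = [
    [logits_view[i * cols + j] for j in range(cols)] for i in range(rows)
]
print(logits)
```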
@@ -289,7 +287,7 @@ def _sample_top_p_top_k(
             candidates=llama_cpp.ctypes.pointer(candidates),
             penalty=repeat_penalty,
         )
-        if temp == 0.0:
+        if float(temp.value) == 0.0:
             return llama_cpp.llama_sample_token_greedy(
                 ctx=self.ctx,
                 candidates=llama_cpp.ctypes.pointer(candidates),
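
The comparison changes because `temp` reaches this method as a ctypes `c_float`, and a ctypes scalar never compares equal to a Python float; its `.value` has to be unwrapped first. A quick illustration:

```python
import ctypes

temp = ctypes.c_float(0.0)
print(temp == 0.0)               # False: c_float does not compare by value
print(float(temp.value) == 0.0)  # True: .value unwraps the underlying float
```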
@@ -299,21 +297,25 @@ def _sample_top_p_top_k(
                 ctx=self.ctx,
                 candidates=llama_cpp.ctypes.pointer(candidates),
                 k=top_k,
+                min_keep=llama_cpp.c_size_t(1),
             )
             llama_cpp.llama_sample_tail_free(
                 ctx=self.ctx,
                 candidates=llama_cpp.ctypes.pointer(candidates),
                 z=llama_cpp.c_float(1.0),
+                min_keep=llama_cpp.c_size_t(1),
             )
             llama_cpp.llama_sample_typical(
                 ctx=self.ctx,
                 candidates=llama_cpp.ctypes.pointer(candidates),
                 p=llama_cpp.c_float(1.0),
+                min_keep=llama_cpp.c_size_t(1),
             )
             llama_cpp.llama_sample_top_p(
                 ctx=self.ctx,
                 candidates=llama_cpp.ctypes.pointer(candidates),
                 p=top_p,
+                min_keep=llama_cpp.c_size_t(1),
             )
             llama_cpp.llama_sample_temperature(
                 ctx=self.ctx,
@@ -390,18 +392,28 @@ def generate(
         """
         assert self.ctx is not None
 
-        if (
-            reset
-            and len(self.eval_tokens) > 0
-            and tuple(self.eval_tokens) == tuple(tokens[: len(self.eval_tokens)])
-        ):
-            if self.verbose:
-                print("Llama.generate: cache hit", file=sys.stderr)
-            reset = False
-            tokens = tokens[len(self.eval_tokens) :]
+        if reset and len(self.eval_tokens) > 0:
+            longest_prefix = 0
+            for a, b in zip(self.eval_tokens, tokens[:-1]):
+                if a == b:
+                    longest_prefix += 1
+                else:
+                    break
+            if longest_prefix > 0:
+                if self.verbose:
+                    print("Llama.generate: prefix-match hit", file=sys.stderr)
+                reset = False
+                tokens = tokens[longest_prefix:]
+                for _ in range(len(self.eval_tokens) - longest_prefix):
+                    self.eval_tokens.pop()
+                    try:
+                        self.eval_logits.pop()
+                    except IndexError:
+                        pass
 
         if reset:
             self.reset()
+
         while True:
             self.eval(tokens)
             token = self.sample(
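
The rewritten block replaces the previous all-or-nothing cache check with a longest-common-prefix match: a prompt that shares only part of its beginning with the already-evaluated tokens still skips the shared part, and only the remaining tail gets re-evaluated. A standalone sketch of the matching logic with made-up token ids:

```python
# Tokens already evaluated by the model vs. the tokens of a new prompt
# (the values are made up for illustration).
eval_tokens = [101, 7, 42, 13]
tokens = [101, 7, 42, 99, 55]

# Same loop as in Llama.generate(): count matching leading tokens. The new
# prompt's last token is excluded so at least one token is always evaluated.
longest_prefix = 0
for a, b in zip(eval_tokens, tokens[:-1]):
    if a == b:
        longest_prefix += 1
    else:
        break

print(longest_prefix)           # 3
print(tokens[longest_prefix:])  # [99, 55] -- only this tail is re-evaluated
```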
@@ -639,7 +651,10 @@ def _create_completion(
                 self.detokenize([token]).decode("utf-8", errors="ignore")
                 for token in all_tokens
             ]
-            all_logprobs = [Llama._logits_to_logprobs(row) for row in self.eval_logits]
+            all_logprobs = [
+                Llama.logits_to_logprobs(list(map(float, row)))
+                for row in self.eval_logits
+            ]
             for token, token_str, logprobs_token in zip(
                 all_tokens, all_token_strs, all_logprobs
             ):
@@ -958,7 +973,10 @@ def save_state(self) -> LlamaState:
         llama_state_compact = (llama_cpp.c_uint8 * int(n_bytes))()
         llama_cpp.ctypes.memmove(llama_state_compact, llama_state, int(n_bytes))
         if self.verbose:
-            print(f"Llama.save_state: saving {n_bytes} bytes of llama state", file=sys.stderr)
+            print(
+                f"Llama.save_state: saving {n_bytes} bytes of llama state",
+                file=sys.stderr,
+            )
         return LlamaState(
             eval_tokens=self.eval_tokens.copy(),
             eval_logits=self.eval_logits.copy(),
@@ -985,7 +1003,7 @@ def token_bos() -> llama_cpp.llama_token:
         return llama_cpp.llama_token_bos()
 
     @staticmethod
-    def logits_to_logprobs(logits: List[llama_cpp.c_float]) -> List[llama_cpp.c_float]:
+    def logits_to_logprobs(logits: List[float]) -> List[float]:
         exps = [math.exp(float(x)) for x in logits]
         sum_exps = sum(exps)
-        return [llama_cpp.c_float(math.log(x / sum_exps)) for x in exps]
+        return [math.log(x / sum_exps) for x in exps]
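
With the return type switched to plain floats, `logits_to_logprobs` behaves like an ordinary log-softmax and can be sanity-checked in isolation; a small sketch with made-up logits (the printed values are approximate):

```python
import math

from llama_cpp import Llama

# logits_to_logprobs is a @staticmethod, so no model file is needed;
# the logits below are made up.
logprobs = Llama.logits_to_logprobs([2.0, 1.0, 0.1])
print(logprobs)  # roughly [-0.42, -1.42, -2.32]

# Exponentiating the log-probabilities gives probabilities that sum to ~1,
# as expected of a softmax.
print(sum(math.exp(lp) for lp in logprobs))
```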
