Commit 1783320

Merge branch 'abetlen:main' into main

2 parents: 4cf0861 + 2a0844b
9 files changed: 63 additions & 45 deletions

CHANGELOG.md

7 additions & 0 deletions

@@ -7,6 +7,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.1.78]
+
+### Added
+
+- Grammar based sampling via LlamaGrammar which can be passed to completions
+- Make n_gpu_layers == -1 offload all layers
+
 ## [0.1.77]
 
 - (llama.cpp) Update llama.cpp add support for LLaMa 2 70B
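
For context, the two new [0.1.78] entries can be exercised as in the minimal sketch below; the model path and grammar are placeholders, and the `LlamaGrammar.from_string` helper plus the `grammar` keyword on completion calls are assumed to match this release's API.

```python
# Minimal sketch of the 0.1.78 additions: grammar-constrained sampling and
# full GPU offload. The model path and grammar are placeholder values.
from llama_cpp import Llama, LlamaGrammar

# Constrain generation to the literal strings "yes" or "no" (GBNF syntax).
grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')

llm = Llama(
    model_path="./models/llama-2-7b.ggmlv3.q4_0.bin",  # placeholder path
    n_gpu_layers=-1,  # new in 0.1.78: -1 offloads all layers to the GPU
)

output = llm("Is the sky blue? Answer: ", grammar=grammar, max_tokens=4)
print(output["choices"][0]["text"])
```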

README.md

1 addition & 1 deletion

@@ -201,7 +201,7 @@ This package is under active development and I welcome any contributions.
 To get started, clone the repository and install the package in development mode:
 
 ```bash
-git clone --recurse-submodules git@github.com:abetlen/llama-cpp-python.git
+git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
 cd llama-cpp-python
 
 # Install with pip

docker/README.md

25 additions & 27 deletions

@@ -1,46 +1,55 @@
-# Install Docker Server
-
-**Note #1:** This was tested with Docker running on Linux. If you can get it working on Windows or MacOS, please update this `README.md` with a PR!
+### Install Docker Server
+> [!IMPORTANT]
+> This was tested with Docker running on Linux. <br>If you can get it working on Windows or MacOS, please update this `README.md` with a PR!<br>
 
 [Install Docker Engine](https://docs.docker.com/engine/install)
 
-**Note #2:** NVidia GPU CuBLAS support requires a NVidia GPU with sufficient VRAM (approximately as much as the size in the table below) and Docker NVidia support (see [container-toolkit/install-guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html))
 
-# Simple Dockerfiles for building the llama-cpp-python server with external model bin files
-## openblas_simple - a simple Dockerfile for non-GPU OpenBLAS, where the model is located outside the Docker image
+## Simple Dockerfiles for building the llama-cpp-python server with external model bin files
+### openblas_simple
+A simple Dockerfile for non-GPU OpenBLAS, where the model is located outside the Docker image:
 ```
 cd ./openblas_simple
 docker build -t openblas_simple .
-docker run -e USE_MLOCK=0 -e MODEL=/var/model/<model-path> -v <model-root-path>:/var/model -t openblas_simple
+docker run --cap-add SYS_RESOURCE -e USE_MLOCK=0 -e MODEL=/var/model/<model-path> -v <model-root-path>:/var/model -t openblas_simple
 ```
 where `<model-root-path>/<model-path>` is the full path to the model file on the Docker host system.
 
-## cuda_simple - a simple Dockerfile for CUDA accelerated CuBLAS, where the model is located outside the Docker image
+### cuda_simple
+> [!WARNING]
+> Nvidia GPU CuBLAS support requires an Nvidia GPU with sufficient VRAM (approximately as much as the size in the table below) and Docker Nvidia support (see [container-toolkit/install-guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)) <br>
+
+A simple Dockerfile for CUDA-accelerated CuBLAS, where the model is located outside the Docker image:
+
 ```
 cd ./cuda_simple
 docker build -t cuda_simple .
-docker run -e USE_MLOCK=0 -e MODEL=/var/model/<model-path> -v <model-root-path>:/var/model -t cuda_simple
+docker run --gpus=all --cap-add SYS_RESOURCE -e USE_MLOCK=0 -e MODEL=/var/model/<model-path> -v <model-root-path>:/var/model -t cuda_simple
 ```
 where `<model-root-path>/<model-path>` is the full path to the model file on the Docker host system.
 
-# "Open-Llama-in-a-box"
-## Download an Apache V2.0 licensed 3B paramter Open Llama model and install into a Docker image that runs an OpenBLAS-enabled llama-cpp-python server
+--------------------------------------------------------------------------
+
+### "Open-Llama-in-a-box"
+Download an Apache V2.0 licensed 3B params Open LLaMA model and install into a Docker image that runs an OpenBLAS-enabled llama-cpp-python server:
 ```
 $ cd ./open_llama
 ./build.sh
 ./start.sh
 ```
 
-# Manually choose your own Llama model from Hugging Face
+### Manually choose your own Llama model from Hugging Face
 `python3 ./hug_model.py -a TheBloke -t llama`
 You should now have a model in the current directory and `model.bin` symlinked to it for the subsequent Docker build and copy step. e.g.
 ```
 docker $ ls -lh *.bin
 -rw-rw-r-- 1 user user 4.8G May 23 18:30 <downloaded-model-file>q5_1.bin
 lrwxrwxrwx 1 user user 24 May 23 18:30 model.bin -> <downloaded-model-file>q5_1.bin
 ```
-**Note #1:** Make sure you have enough disk space to download the model. As the model is then copied into the image you will need at least
-**TWICE** as much disk space as the size of the model:
+
+> [!NOTE]
+> Make sure you have enough disk space to download the model. As the model is then copied into the image you will need at least
+**TWICE** as much disk space as the size of the model:<br>
 
 | Model | Quantized size |
 |------:|----------------:|
@@ -50,17 +59,6 @@ lrwxrwxrwx 1 user user 24 May 23 18:30 model.bin -> <downloaded-model-file>q5_
 | 33B | 25 GB |
 | 65B | 50 GB |
 
-**Note #2:** If you want to pass or tune additional parameters, customise `./start_server.sh` before running `docker build ...`
-
-## Use OpenBLAS
-Use if you don't have a NVidia GPU. Defaults to `python:3-slim-bullseye` Docker base image and OpenBLAS:
-### Build:
-`docker build -t openblas .`
-### Run:
-`docker run --cap-add SYS_RESOURCE -t openblas`
 
-## Use CuBLAS
-### Build:
-`docker build --build-arg IMAGE=nvidia/cuda:12.1.1-devel-ubuntu22.04 -t cublas .`
-### Run:
-`docker run --cap-add SYS_RESOURCE -t cublas`
+> [!NOTE]
+> If you want to pass or tune additional parameters, customise `./start_server.sh` before running `docker build ...`

docker/cuda_simple/Dockerfile

14 additions & 3 deletions

@@ -4,13 +4,24 @@ FROM nvidia/cuda:${CUDA_IMAGE}
 # We need to set the host to 0.0.0.0 to allow outside access
 ENV HOST 0.0.0.0
 
+RUN apt-get update && apt-get upgrade -y \
+    && apt-get install -y git build-essential \
+    python3 python3-pip gcc wget \
+    ocl-icd-opencl-dev opencl-headers clinfo \
+    libclblast-dev libopenblas-dev \
+    && mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
+
 COPY . .
 
-# Install the package
-RUN apt update && apt install -y python3 python3-pip
+# setting build related env vars
+ENV CUDA_DOCKER_ARCH=all
+ENV LLAMA_CUBLAS=1
+
+# Install depencencies
 RUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette pydantic-settings
 
-RUN LLAMA_CUBLAS=1 pip install llama-cpp-python
+# Install llama-cpp-python (build with cuda)
+RUN CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
 
 # Run the server
 CMD python3 -m llama_cpp.server
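
As a usage sketch that is not part of the Dockerfile: once the container is running, a client can exercise the OpenAI-compatible completions route served by `llama_cpp.server`, assuming the server's default port 8000 is published (for example `docker run --gpus=all -p 8000:8000 ... cuda_simple`).

```python
# Hypothetical client-side smoke test for the containerized server.
# Assumes the container publishes the server's default port 8000.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={"prompt": "Q: What is the capital of France? A:", "max_tokens": 32},
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```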

llama_cpp/llama.py

0 additions & 1 deletion

@@ -1,5 +1,4 @@
 import os
-from pathlib import Path
 import sys
 import uuid
 import time

llama_cpp/llama_grammar.py

13 additions & 10 deletions

@@ -1031,10 +1031,10 @@ def print_grammar_char(file: TextIO, c: int) -> None:
 # }
 def is_char_element(elem: LlamaGrammarElement) -> bool:
     return elem.type in (
-        llama_gretype.LLAMA_GRETYPE_CHAR.value,
-        llama_gretype.LLAMA_GRETYPE_CHAR_NOT.value,
-        llama_gretype.LLAMA_GRETYPE_CHAR_ALT.value,
-        llama_gretype.LLAMA_GRETYPE_CHAR_RNG_UPPER.value,
+        llama_gretype.LLAMA_GRETYPE_CHAR,
+        llama_gretype.LLAMA_GRETYPE_CHAR_NOT,
+        llama_gretype.LLAMA_GRETYPE_CHAR_ALT,
+        llama_gretype.LLAMA_GRETYPE_CHAR_RNG_UPPER,
     )
 
 
@@ -1054,9 +1054,10 @@ def print_rule(
     # "malformed rule, does not end with LLAMA_GRETYPE_END: " + std::to_string(rule_id));
     # }
     # fprintf(file, "%s ::= ", symbol_id_names.at(rule_id).c_str());
-    if rule.empty() or rule.back().type != llama_gretype.LLAMA_GRETYPE_END.value:
+    if rule.empty() or rule.back().type != llama_gretype.LLAMA_GRETYPE_END:
         raise RuntimeError(
-            "malformed rule, does not end with LLAMA_GRETYPE_END: " + str(rule_id)
+            "malformed rule, does not end with LLAMA_GRETYPE_END: "
+            + str(rule_id)
         )
     print(f"{symbol_id_names.at(rule_id)} ::=", file=file, end=" ")
     # for (size_t i = 0, end = rule.size() - 1; i < end; i++) {
@@ -1100,8 +1101,10 @@ def print_rule(
     # }
     for i, elem in enumerate(rule[:-1]):
         case = elem.type  # type: llama_gretype
-        if case is llama_gretype.LLAMA_GRETYPE_END.value:
-            raise RuntimeError("unexpected end of rule: " + str(rule_id) + "," + str(i))
+        if case is llama_gretype.LLAMA_GRETYPE_END:
+            raise RuntimeError(
+                "unexpected end of rule: " + str(rule_id) + "," + str(i)
+            )
         elif case is llama_gretype.LLAMA_GRETYPE_ALT:
             print("| ", file=file, end="")
         elif case is llama_gretype.LLAMA_GRETYPE_RULE_REF:
@@ -1140,8 +1143,8 @@ def print_rule(
             # fprintf(file, "] ");
             if is_char_element(elem):
                 if rule[i + 1].type in (
-                    llama_gretype.LLAMA_GRETYPE_CHAR_ALT.value,
-                    llama_gretype.LLAMA_GRETYPE_CHAR_RNG_UPPER.value,
+                    llama_gretype.LLAMA_GRETYPE_CHAR_ALT,
+                    llama_gretype.LLAMA_GRETYPE_CHAR_RNG_UPPER,
                 ):
                     pass
                 else:
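
A brief, purely illustrative note on the `.value` change above: when `elem.type` holds a `llama_gretype` enum member, identity and membership checks should compare member to member; comparing a member against its plain-integer `.value` with `is` can never succeed. The toy enum below is hypothetical and merely stands in for `llama_gretype`.

```python
# Illustrative only: a hypothetical stand-in enum for llama_gretype, showing
# why the checks compare enum members rather than their integer .value.
import enum


class Gretype(enum.Enum):
    END = 0
    CHAR = 6


case = Gretype.END                 # suppose elem.type stores an enum member

print(case is Gretype.END)         # True: member compared with member
print(case is Gretype.END.value)   # False: a member is never the same object
                                   # as the plain int 0, so such a check
                                   # could never fire
print(case in (Gretype.CHAR, Gretype.END))  # True: membership against members
```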

pyproject.toml

1 addition & 1 deletion

@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "llama_cpp_python"
-version = "0.1.77"
+version = "0.1.78"
 description = "Python bindings for the llama.cpp library"
 authors = ["Andrei Betlen <abetlen@gmail.com>"]
 license = "MIT"

setup.py

1 addition & 1 deletion

@@ -10,7 +10,7 @@
     description="A Python wrapper for llama.cpp",
     long_description=long_description,
     long_description_content_type="text/markdown",
-    version="0.1.77",
+    version="0.1.78",
     author="Andrei Betlen",
     author_email="abetlen@gmail.com",
     license="MIT",

vendor/llama.cpp (submodule)

