Commit 447a3d2
Merge branch 'main' into setup
2 parents bebe771 + 030fafe
22 files changed, +1091 -250 lines

‎.gitmodules

1 addition & 1 deletion

@@ -1,3 +1,3 @@
 [submodule "vendor/llama.cpp"]
 	path = vendor/llama.cpp
-	url = git@github.com:ggerganov/llama.cpp.git
+	url = https://github.com/ggerganov/llama.cpp.git

‎CHANGELOG.md

12 additions & 0 deletions (new file)

# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added

- Added first version of the changelog

‎CMakeLists.txt

1 addition & 1 deletion

@@ -28,4 +28,4 @@ else()
   LIBRARY DESTINATION llama_cpp
   RUNTIME DESTINATION llama_cpp
 )
-endif(UNIX)
+endif()

‎README.md

19 additions & 5 deletions

@@ -15,6 +15,8 @@ This package provides:
 - OpenAI-like API
 - LangChain compatibility
 
+Documentation is available at [https://abetlen.github.io/llama-cpp-python](https://abetlen.github.io/llama-cpp-python).
+
 ## Installation from PyPI (recommended)
 
 Install from PyPI (requires a C compiler):
@@ -26,6 +28,18 @@ pip install llama-cpp-python
 The above command will attempt to install the package and build `llama.cpp` from source.
 This is the recommended installation method as it ensures that `llama.cpp` is built with the available optimizations for your system.
 
+If you have previously installed `llama-cpp-python` through pip and want to upgrade your version or rebuild the package with different compiler options, please add the following flags to ensure that the package is rebuilt correctly:
+
+```bash
+pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
+```
+
+Note: If you are using an Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports the arm64 architecture. For example:
+```
+wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
+bash Miniforge3-MacOSX-arm64.sh
+```
+Otherwise, while installing it will build the llama.cpp x86 version, which will be 10x slower on Apple Silicon (M1) Macs.
 
 ### Installation with OpenBLAS / cuBLAS / CLBlast
 
@@ -35,19 +49,19 @@ Use the `FORCE_CMAKE=1` environment variable to force the use of `cmake` and ins
 To install with OpenBLAS, set the `LLAMA_OPENBLAS=1` environment variable before installing:
 
 ```bash
-LLAMA_OPENBLAS=1 FORCE_CMAKE=1 pip install llama-cpp-python
+CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
 ```
 
 To install with cuBLAS, set the `LLAMA_CUBLAS=1` environment variable before installing:
 
 ```bash
-LLAMA_CUBLAS=1 FORCE_CMAKE=1 pip install llama-cpp-python
+CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
 ```
 
 To install with CLBlast, set the `LLAMA_CLBLAST=1` environment variable before installing:
 
 ```bash
-LLAMA_CLBLAST=1 FORCE_CMAKE=1 pip install llama-cpp-python
+CMAKE_ARGS="-DLLAMA_CLBLAST=on" FORCE_CMAKE=1 pip install llama-cpp-python
 ```
 
 
@@ -102,7 +116,7 @@ Navigate to [http://localhost:8000/docs](http://localhost:8000/docs) to see the
 A Docker image is available on [GHCR](https://ghcr.io/abetlen/llama-cpp-python). To run the server:
 
 ```bash
-docker run --rm -it -p8000:8000 -v /path/to/models:/models -eMODEL=/models/ggml-model-name.bin ghcr.io/abetlen/llama-cpp-python:latest
+docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/ggml-model-name.bin ghcr.io/abetlen/llama-cpp-python:latest
 ```
 
 ## Low-level API
@@ -120,7 +134,7 @@ Below is a short example demonstrating how to use the low-level API to tokenize
 >>> ctx = llama_cpp.llama_init_from_file(b"./models/7b/ggml-model.bin", params)
 >>> max_tokens = params.n_ctx
 # use ctypes arrays for array params
->>> tokens = (llama_cppp.llama_token * int(max_tokens))()
+>>> tokens = (llama_cpp.llama_token * int(max_tokens))()
 >>> n_tokens = llama_cpp.llama_tokenize(ctx, b"Q: Name the planets in the solar system? A: ", tokens, max_tokens, add_bos=llama_cpp.c_bool(True))
 >>> llama_cpp.llama_free(ctx)
 ```

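Editor's aside on the low-level API fix above (the `llama_cppp` → `llama_cpp` typo): the buffer being fixed is a plain ctypes array. The sketch below is an editor's illustration, not part of this commit; the token ids and sizes are made up, and it uses only the standard `ctypes` module so it runs without a model.

```python
# Illustrative sketch (editor's addition, not from this commit): the
# "(llama_cpp.llama_token * int(max_tokens))()" pattern in the README is plain
# ctypes. The same idiom with ctypes alone: allocate a fixed-size C array,
# let the callee fill a prefix of it, then copy that prefix into a Python list.
import ctypes

max_tokens = 8
tokens = (ctypes.c_int * max_tokens)()             # zero-initialised C array, like the token buffer
n_tokens = 3                                       # pretend the tokenizer reported 3 filled entries
tokens[0], tokens[1], tokens[2] = 1, 15043, 2973   # hypothetical token ids
print([tokens[i] for i in range(n_tokens)])        # -> [1, 15043, 2973]
```

The same slice-to-list pattern applies to the README example once `llama_tokenize` has reported how many entries it filled.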
‎docker/Dockerfile

51 additions & 0 deletions (new file)

# Define the image argument and provide a default value
ARG IMAGE=python:3-slim-bullseye

# Use the image as specified
FROM ${IMAGE}

# Re-declare the ARG after FROM
ARG IMAGE

# Update and upgrade the existing packages
RUN apt-get update && apt-get upgrade -y && apt-get install -y --no-install-recommends \
    python3 \
    python3-pip \
    ninja-build \
    build-essential

RUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette

# Perform the conditional installations based on the image
RUN echo "Image: ${IMAGE}" && \
    if [ "${IMAGE}" = "python:3-slim-bullseye" ] ; then \
        echo "OpenBLAS install:" && \
        apt-get install -y --no-install-recommends libopenblas-dev && \
        LLAMA_OPENBLAS=1 pip install llama-cpp-python --verbose; \
    else \
        echo "CuBLAS install:" && \
        LLAMA_CUBLAS=1 pip install llama-cpp-python --verbose; \
    fi

# Clean up apt cache
RUN rm -rf /var/lib/apt/lists/*

# Set a working directory for better clarity
WORKDIR /app

# Copy files to the app directory
RUN echo "Installing model...this can take some time..."
COPY ./model.bin /app/model.bin
COPY ./start_server.sh /app/start_server.sh

# Make the server start script executable
RUN chmod +x /app/start_server.sh

# Set environment variable for the host
ENV HOST=0.0.0.0

# Expose a port for the server
EXPOSE 8000

# Run the server start script
CMD ["/bin/sh", "/app/start_server.sh"]

‎Dockerfile.cuda renamed to ‎docker/Dockerfile.cuda_simple

3 additions & 2 deletions

@@ -1,4 +1,5 @@
-FROM nvidia/cuda:12.1.1-devel-ubuntu20.04
+ARG CUDA_IMAGE="12.1.1-devel-ubuntu22.04"
+FROM ${CUDA_IMAGE}
 
 # We need to set the host to 0.0.0.0 to allow outside access
 ENV HOST 0.0.0.0
@@ -12,4 +13,4 @@ RUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fa
 RUN LLAMA_CUBLAS=1 python3 setup.py develop
 
 # Run the server
-CMD python3 -m llama_cpp.server
+CMD python3 -m llama_cpp.server

‎docker/README.md

46 additions & 0 deletions (new file)

# Dockerfiles for building the llama-cpp-python server
- `Dockerfile.openblas_simple` - a simple Dockerfile for non-GPU OpenBLAS
- `Dockerfile.cuda_simple` - a simple Dockerfile for CUDA accelerated CuBLAS
- `hug_model.py` - a Python utility for interactively choosing and downloading the latest `5_1` quantized models from [huggingface.co/TheBloke](https://huggingface.co/TheBloke)
- `Dockerfile` - a single OpenBLAS and CuBLAS combined Dockerfile that automatically installs a previously downloaded model `model.bin`

# Get model from Hugging Face
`python3 ./hug_model.py`

You should now have a model in the current directory and `model.bin` symlinked to it for the subsequent Docker build and copy step, e.g.
```
docker $ ls -lh *.bin
-rw-rw-r-- 1 user user 4.8G May 23 18:30 <downloaded-model-file>.q5_1.bin
lrwxrwxrwx 1 user user   24 May 23 18:30 model.bin -> <downloaded-model-file>.q5_1.bin
```
**Note #1:** Make sure you have enough disk space to download the model. As the model is then copied into the image, you will need at least **TWICE** as much disk space as the size of the model:

| Model | Quantized size |
|------:|---------------:|
|    7B |           5 GB |
|   13B |          10 GB |
|   30B |          25 GB |
|   65B |          50 GB |

**Note #2:** If you want to pass or tune additional parameters, customise `./start_server.sh` before running `docker build ...`

# Install Docker Server

**Note #3:** This was tested with Docker running on Linux. If you can get it working on Windows or MacOS, please update this `README.md` with a PR!

[Install Docker Engine](https://docs.docker.com/engine/install)

# Use OpenBLAS
Use this if you don't have an NVidia GPU. Defaults to the `python:3-slim-bullseye` Docker base image and OpenBLAS:
## Build:
`docker build --build-arg -t openblas .`
## Run:
`docker run --cap-add SYS_RESOURCE -t openblas`

# Use CuBLAS
Requires an NVidia GPU with sufficient VRAM (approximately as much as the size above) and Docker NVidia support (see [container-toolkit/install-guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)).
## Build:
`docker build --build-arg IMAGE=nvidia/cuda:12.1.1-devel-ubuntu22.04 -t cublas .`
## Run:
`docker run --cap-add SYS_RESOURCE -t cublas`

‎docker/hug_model.py

116 additions & 0 deletions (new file)

import requests
import json
import os
import struct

def make_request(url, params=None):
    print(f"Making request to {url}...")
    response = requests.get(url, params=params)
    if response.status_code == 200:
        return json.loads(response.text)
    else:
        print(f"Request failed with status code {response.status_code}")
        return None

def check_magic_and_version(filename):
    with open(filename, 'rb') as f:
        # Read the first 6 bytes from the file
        data = f.read(6)

    # Unpack the binary data, interpreting the first 4 bytes as a little-endian unsigned int
    # and the next 2 bytes as a little-endian unsigned short
    magic, version = struct.unpack('<I H', data)

    print(f"magic: 0x{magic:08x}, version: 0x{version:04x}, file: {filename}")

    return magic, version

def download_file(url, destination):
    print(f"Downloading {url} to {destination}...")
    response = requests.get(url, stream=True)
    if response.status_code == 200:
        with open(destination, 'wb') as f:
            total_downloaded = 0
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)
                    total_downloaded += len(chunk)
                    if total_downloaded >= 10485760:  # 10 MB
                        print('.', end='', flush=True)
                        total_downloaded = 0
        print("\nDownload complete.")

        # Creating a symbolic link from destination to "model.bin"
        if os.path.isfile("model.bin"):
            os.remove("model.bin")  # remove the existing link if any
        os.symlink(destination, "model.bin")
    else:
        print(f"Download failed with status code {response.status_code}")

def get_user_choice(model_list):
    # Print the enumerated list
    print("\n")
    for i, (model_id, rfilename) in enumerate(model_list):
        print(f"{i+1}: Model ID: {model_id}, RFilename: {rfilename}")

    # Get user's choice
    choice = input("Choose a model to download by entering the corresponding number: ")
    try:
        index = int(choice) - 1
        if 0 <= index < len(model_list):
            # Return the chosen model
            return model_list[index]
        else:
            print("Invalid choice.")
    except ValueError:
        print("Invalid input. Please enter a number corresponding to a model.")
    except IndexError:
        print("Invalid choice. Index out of range.")

    return None

import argparse

def main():
    # Create an argument parser
    parser = argparse.ArgumentParser(description='Process the model version.')
    parser.add_argument('-v', '--version', type=int, default=0x0003,
                        help='an integer for the version to be used')

    # Parse the arguments
    args = parser.parse_args()

    # Define the parameters
    params = {
        "author": "TheBloke",  # Filter by author
        "tags": "llama"
    }

    models = make_request('https://huggingface.co/api/models', params=params)
    if models is None:
        return

    model_list = []
    # Iterate over the models
    for model in models:
        model_id = model['id']
        model_info = make_request(f'https://huggingface.co/api/models/{model_id}')
        if model_info is None:
            continue

        for sibling in model_info.get('siblings', []):
            rfilename = sibling.get('rfilename')
            if rfilename and 'q5_1' in rfilename:
                model_list.append((model_id, rfilename))

    model_choice = get_user_choice(model_list)
    if model_choice is not None:
        model_id, rfilename = model_choice
        url = f"https://huggingface.co/{model_id}/resolve/main/{rfilename}"
        download_file(url, rfilename)
        _, version = check_magic_and_version(rfilename)
        if version != args.version:
            print(f"Warning: Expected version {args.version}, but found different version in the file.")

if __name__ == '__main__':
    main()

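Editor's aside on `check_magic_and_version()` in `docker/hug_model.py` above: the `'<I H'` format string consumes exactly the six bytes read by `f.read(6)` — a little-endian 4-byte unsigned int (the magic) followed by a 2-byte unsigned short (the version). A self-contained sketch, with hypothetical header bytes chosen only for illustration:

```python
# Illustrative sketch (editor's addition): how struct.unpack('<I H', ...) splits
# a six-byte header into (magic, version). The input bytes here are hypothetical,
# not taken from a real model file.
import struct

header = b"tjgg" + b"\x03\x00"     # 4 magic bytes + 2 version bytes (example values)
magic, version = struct.unpack("<I H", header)
print(f"magic: 0x{magic:08x}, version: 0x{version:04x}")
# -> magic: 0x67676a74, version: 0x0003
```

Running the utility itself is just `python3 ./hug_model.py`, optionally with `-v` to change the expected version, as described in `docker/README.md` above.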
‎docker/start_server.sh

11 additions & 0 deletions (new file)

#!/bin/sh

# For mmap support
ulimit -l unlimited

if [ "$IMAGE" = "python:3-slim-bullseye" ]; then
    python3 -B -m llama_cpp.server --model /app/model.bin
else
    # You may have to reduce --n_gpu_layers=1000 to 20 or less if you don't have enough VRAM
    python3 -B -m llama_cpp.server --model /app/model.bin --n_gpu_layers=1000
fi

‎docs/index.md

4 additions & 0 deletions

@@ -112,8 +112,12 @@ python3 setup.py develop
         show_root_heading: true
 
 ::: llama_cpp.LlamaCache
+    options:
+        show_root_heading: true
 
 ::: llama_cpp.LlamaState
+    options:
+        show_root_heading: true
 
 ::: llama_cpp.llama_cpp
     options:

‎examples/low_level_api/low_level_api_chat_cpp.py

11 additions & 8 deletions

@@ -368,10 +368,10 @@ def generate(self):
                 id = llama_cpp.llama_sample_token_mirostat_v2(self.ctx, candidates_p, llama_cpp.c_float(self.params.mirostat_tau), llama_cpp.c_float(self.params.mirostat_eta), llama_cpp.c_float(mirostat_mu))
             else:
                 # Temperature sampling
-                llama_cpp.llama_sample_top_k(self.ctx, candidates_p, top_k)
-                llama_cpp.llama_sample_tail_free(self.ctx, candidates_p, llama_cpp.c_float(self.params.tfs_z))
-                llama_cpp.llama_sample_typical(self.ctx, candidates_p, llama_cpp.c_float(self.params.typical_p))
-                llama_cpp.llama_sample_top_p(self.ctx, candidates_p, llama_cpp.c_float(self.params.top_p))
+                llama_cpp.llama_sample_top_k(self.ctx, candidates_p, top_k, min_keep=llama_cpp.c_size_t(1))
+                llama_cpp.llama_sample_tail_free(self.ctx, candidates_p, llama_cpp.c_float(self.params.tfs_z), min_keep=llama_cpp.c_size_t(1))
+                llama_cpp.llama_sample_typical(self.ctx, candidates_p, llama_cpp.c_float(self.params.typical_p), min_keep=llama_cpp.c_size_t(1))
+                llama_cpp.llama_sample_top_p(self.ctx, candidates_p, llama_cpp.c_float(self.params.top_p), min_keep=llama_cpp.c_size_t(1))
                 llama_cpp.llama_sample_temperature(self.ctx, candidates_p, llama_cpp.c_float(self.params.temp))
                 id = llama_cpp.llama_sample_token(self.ctx, candidates_p)
             # print("`{}`".format(candidates_p.size))
@@ -382,12 +382,15 @@ def generate(self):
             # replace end of text token with newline token when in interactive mode
             if (id == llama_cpp.llama_token_eos() and self.params.interactive and not self.params.instruct):
                 id = self.llama_token_newline[0]
+                self.embd.append(id)
                 if (self.use_antiprompt()):
                     # tokenize and inject first reverse prompt
                     self.embd_inp += self.first_antiprompt[0]
-
-            # add it to the context
-            self.embd.append(id)
+                    for id in self.first_antiprompt[0]:
+                        self.embd.append(id)
+            else:
+                # add it to the context
+                self.embd.append(id)
 
             # echo this to console
             self.output_echo = True
@@ -493,7 +496,7 @@ def output(self):
             # Contains multi-byte UTF8
             for num, pattern in [(2, 192), (3, 224), (4, 240)]:
                 # Bitwise AND check
-                if pattern & int.from_bytes(cur_char) == pattern:
+                if pattern & int.from_bytes(cur_char, 'little') == pattern:
                     self.multibyte_fix = [cur_char] + ([None] * (num-1))
 
             # Stop incomplete bytes from passing

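Editor's aside on the last hunk above: `int.from_bytes()` only gained a default byte order in Python 3.11, so the added `'little'` argument keeps the check working on older interpreters (for a single byte the byte order does not change the value). The masks 192, 224 and 240 are the UTF-8 lead-byte prefixes for 2-, 3- and 4-byte sequences. A self-contained sketch of the same check, illustrative only and not part of the commit:

```python
# Illustrative sketch (editor's addition): the bitwise lead-byte check used above.
# int.from_bytes() needs an explicit byteorder before Python 3.11, hence 'little'.
cur_char = "€".encode("utf-8")[0:1]   # b'\xe2', the lead byte of a 3-byte sequence
for num, pattern in [(2, 192), (3, 224), (4, 240)]:
    matches = pattern & int.from_bytes(cur_char, "little") == pattern
    print(num, matches)               # prints: 2 True, 3 True, 4 False
```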