Commit 6dde6bd: bug fixing (abetlen#925)

1 parent f3117c0
2 files changed: +86 −8 lines

examples/low_level_api/low_level_api_llama_cpp.py (+25 −8 lines)
```diff
@@ -11,20 +11,34 @@
 
 prompt = b"\n\n### Instruction:\nWhat is the capital of France?\n\n### Response:\n"
 
-lparams = llama_cpp.llama_context_default_params()
+lparams = llama_cpp.llama_model_default_params()
+cparams = llama_cpp.llama_context_default_params()
 model = llama_cpp.llama_load_model_from_file(MODEL_PATH.encode('utf-8'), lparams)
-ctx = llama_cpp.llama_new_context_with_model(model, lparams)
+ctx = llama_cpp.llama_new_context_with_model(model, cparams)
 
 # determine the required inference memory per token:
 tmp = [0, 1, 2, 3]
-llama_cpp.llama_eval(ctx, (llama_cpp.c_int * len(tmp))(*tmp), len(tmp), 0, N_THREADS)
+llama_cpp.llama_eval(
+    ctx = ctx,
+    tokens=(llama_cpp.c_int * len(tmp))(*tmp),
+    n_tokens=len(tmp),
+    n_past=0
+)# Deprecated
 
 n_past = 0
 
 prompt = b" " + prompt
 
 embd_inp = (llama_cpp.llama_token * (len(prompt) + 1))()
-n_of_tok = llama_cpp.llama_tokenize(ctx, prompt, embd_inp, len(embd_inp), True)
+n_of_tok = llama_cpp.llama_tokenize(
+    model=model,
+    text=bytes(str(prompt),'utf-8'),
+    text_len=len(embd_inp),
+    tokens=embd_inp,
+    n_max_tokens=len(embd_inp),
+    add_bos=False,
+    special=False
+)
 embd_inp = embd_inp[:n_of_tok]
 
 n_ctx = llama_cpp.llama_n_ctx(ctx)
@@ -49,8 +63,11 @@
 while remaining_tokens > 0:
     if len(embd) > 0:
         llama_cpp.llama_eval(
-            ctx, (llama_cpp.c_int * len(embd))(*embd), len(embd), n_past, N_THREADS
-        )
+            ctx = ctx,
+            tokens=(llama_cpp.c_int * len(embd))(*embd),
+            n_tokens=len(embd),
+            n_past=n_past
+        )# Deprecated
 
     n_past += len(embd)
     embd = []
@@ -93,7 +110,7 @@
         for id in embd:
             size = 32
             buffer = (ctypes.c_char * size)()
-            n = llama_cpp.llama_token_to_piece_with_model(
+            n = llama_cpp.llama_token_to_piece(
                 model, llama_cpp.llama_token(id), buffer, size)
             assert n <= size
             print(
@@ -109,4 +126,4 @@
 
 llama_cpp.llama_print_timings(ctx)
 
-llama_cpp.llama_free(ctx)
+llama_cpp.llama_free(ctx)
```
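Both llama_eval calls are annotated "# Deprecated" in this commit. The following is only a hedged sketch of the replacement path, assuming the llama_batch_get_one and llama_decode bindings are available in this version of llama_cpp (where llama_batch_get_one takes tokens, n_tokens, pos_0, seq_id):

```python
import llama_cpp

def eval_tokens(ctx, tokens, n_past):
    # Pack the tokens into a single-sequence batch starting at position n_past.
    batch = llama_cpp.llama_batch_get_one(
        (llama_cpp.llama_token * len(tokens))(*tokens),  # token buffer
        len(tokens),  # number of tokens in the batch
        n_past,       # position of the first token in the context
        0,            # sequence id
    )
    # llama_decode returns 0 on success.
    if llama_cpp.llama_decode(ctx, batch) != 0:
        raise RuntimeError("llama_decode failed")
```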
New file (+61 −0 lines):
# Low-Level API for Llama_cpp

## Overview
This Python script, low_level_api_llama_cpp.py, demonstrates how to use the low-level API of the llama_cpp library: it loads a .gguf model and runs inference to generate a response to a given prompt.
### Prerequisites
Before running the script, ensure that you have the following dependencies installed:

- Python 3.6 or higher
- llama_cpp: a C++ library (with Python bindings) for working with .gguf models
- NumPy: a fundamental package for scientific computing with Python
- multiprocessing: a Python standard-library module for parallel computing
### Usage
Install dependencies (ctypes, os, and multiprocessing ship with the Python standard library and do not need to be installed separately):
```bash
python -m pip install llama-cpp-python numpy
```
Run the script:
```bash
python low_level_api_llama_cpp.py
```

## Code Structure
The script is organized as follows:
### Initialization
Load the model from the specified path and create a context for model evaluation.
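A minimal sketch of this step, following the calls in the updated example (the model path below is a placeholder):

```python
import llama_cpp

MODEL_PATH = "./models/model.gguf"  # placeholder; point this at your .gguf file

# Model-level and context-level parameters are configured separately.
lparams = llama_cpp.llama_model_default_params()
cparams = llama_cpp.llama_context_default_params()

model = llama_cpp.llama_load_model_from_file(MODEL_PATH.encode('utf-8'), lparams)
ctx = llama_cpp.llama_new_context_with_model(model, cparams)
```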
### Tokenization
Tokenize the input prompt using the llama_tokenize function and prepare the input tokens for model evaluation.
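A lightly simplified sketch of the tokenization call used in the updated example (here the prompt bytes are passed directly and text_len is the prompt length); `model` and `prompt` are defined earlier in the script:

```python
import llama_cpp

# Reserve one token slot per prompt byte, plus one extra.
embd_inp = (llama_cpp.llama_token * (len(prompt) + 1))()

n_of_tok = llama_cpp.llama_tokenize(
    model=model,
    text=prompt,               # prompt is already a bytes object
    text_len=len(prompt),      # number of bytes to tokenize
    tokens=embd_inp,           # output buffer for token ids
    n_max_tokens=len(embd_inp),
    add_bos=False,
    special=False,
)
embd_inp = embd_inp[:n_of_tok]  # keep only the tokens actually produced
```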
### Inference
Perform model evaluation to generate responses, then sample from the model's output using various strategies (top-k, top-p, temperature).
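A sketch of the evaluation step as used in the example (the sampling calls that follow it are in the full script and omitted here); `ctx`, `embd`, and `n_past` come from the surrounding generation loop:

```python
import llama_cpp

# Feed the pending tokens to the model; n_past tracks how much of the
# context has already been evaluated. llama_eval is marked as deprecated
# in the example itself.
llama_cpp.llama_eval(
    ctx=ctx,
    tokens=(llama_cpp.c_int * len(embd))(*embd),
    n_tokens=len(embd),
    n_past=n_past,
)
n_past += len(embd)
```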
### Output
Print the generated tokens and the corresponding decoded text.
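A sketch of the detokenization loop built around the llama_token_to_piece call shown in the diff; the exact print formatting in the script may differ, and `embd` and `model` come from the earlier steps:

```python
import ctypes
import llama_cpp

for id in embd:
    size = 32
    buffer = (ctypes.c_char * size)()
    # Convert the token id back into its text piece.
    n = llama_cpp.llama_token_to_piece(
        model, llama_cpp.llama_token(id), buffer, size)
    assert n <= size
    print(buffer[:n].decode('utf-8'), end="", flush=True)
```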
### Cleanup
Free resources and print timing information.
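A sketch of the cleanup step, matching the calls at the end of the example:

```python
import llama_cpp

# Print timing statistics collected during evaluation, then release the context.
llama_cpp.llama_print_timings(ctx)
llama_cpp.llama_free(ctx)
# If the model handle is no longer needed, llama_cpp.llama_free_model(model) releases it.
```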
## Configuration
Customize the inference behavior by adjusting the following variables (a sketch follows the list):

- N_THREADS: number of CPU threads to use for model evaluation.
- MODEL_PATH: path to the model file.
- prompt: input prompt for the chatbot.
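A sketch of these variables with illustrative values; the thread count and model path below are assumptions, and only the prompt string is taken from the example itself:

```python
import multiprocessing
import os

N_THREADS = multiprocessing.cpu_count()  # assumed default: use all CPU cores
MODEL_PATH = os.environ.get("MODEL", "./models/model.gguf")  # assumed path
prompt = b"\n\n### Instruction:\nWhat is the capital of France?\n\n### Response:\n"
```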
## Notes
- Ensure that the llama_cpp library is built and available on the system. Follow the instructions in the llama_cpp repository for building and installing the library.
- This script is designed to work with .gguf models and may require modifications for compatibility with other model formats.
## Acknowledgments
This code is based on the llama_cpp library developed by the community. Special thanks to the contributors for their efforts.
## License
This project is licensed under the MIT License; see the LICENSE file for details.
