Commit 03f171e

example: LLM inference with Ray Serve (abetlen#1465)
1 parent b564d05 commit 03f171e

File tree (3 files changed: 42 additions, 0 deletions)

- examples/ray/README.md
- examples/ray/llm.py
- examples/ray/requirements.txt

‎examples/ray/README.md

19 additions, 0 deletions
@@ -0,0 +1,19 @@
This is an example of doing LLM inference with [Ray](https://docs.ray.io/en/latest/index.html) and [Ray Serve](https://docs.ray.io/en/latest/serve/index.html).

First, install the requirements:

```bash
$ pip install -r requirements.txt
```

Deploy a GGUF model to Ray Serve with the following command:

```bash
$ serve run llm:llm_builder model_path='../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf'
```

This will start an API endpoint at `http://localhost:8000/`. You can query the model like this:

```bash
$ curl -k -d '{"prompt": "tell me a joke", "max_tokens": 128}' -X POST http://localhost:8000
```
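For a quick check from Python rather than curl, a short client like the one below (not part of this commit) can issue the same request. It assumes the Ray Serve endpoint above is running locally; the response fields are whatever `llama_cpp.Llama.__call__` returns for a completion, which is an OpenAI-style dict with a `choices` list.

```python
# Hypothetical client sketch; not included in the commit.
import requests

response = requests.post(
    "http://localhost:8000",
    json={"prompt": "tell me a joke", "max_tokens": 128},
    timeout=120,  # CPU inference can take a while for longer completions
)
response.raise_for_status()
completion = response.json()
# Assumes llama-cpp-python's completion dict shape: {"choices": [{"text": ...}, ...], ...}
print(completion["choices"][0]["text"])
```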

‎examples/ray/llm.py

20 additions, 0 deletions
@@ -0,0 +1,20 @@
from starlette.requests import Request
from typing import Dict
from ray import serve
from ray.serve import Application
from llama_cpp import Llama

@serve.deployment
class LlamaDeployment:
    def __init__(self, model_path: str):
        # Load the GGUF model once when the replica starts.
        self._llm = Llama(model_path=model_path)

    async def __call__(self, http_request: Request) -> Dict:
        # Expect a JSON body like {"prompt": "...", "max_tokens": 128}.
        input_json = await http_request.json()
        prompt = input_json["prompt"]
        max_tokens = input_json.get("max_tokens", 64)
        return self._llm(prompt, max_tokens=max_tokens)


def llm_builder(args: Dict[str, str]) -> Application:
    return LlamaDeployment.bind(args["model_path"])
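The `serve run llm:llm_builder model_path=...` command from the README passes the keyword arguments after the import path to `llm_builder` as a dict, following Ray Serve's application-builder convention. As a minimal sketch (not part of this commit), the same application could also be deployed programmatically; the model path below is just the placeholder from the README.

```python
# Hypothetical programmatic deployment, assuming llm.py is importable as `llm`.
from ray import serve

from llm import llm_builder

app = llm_builder({"model_path": "../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf"})
serve.run(app)  # deploys LlamaDeployment; Ray Serve listens on port 8000 by default
```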

‎examples/ray/requirements.txt

3 additions, 0 deletions
@@ -0,0 +1,3 @@
ray[serve]
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
llama-cpp-python
