Description
I've noticed that GPU utilization stays low during model inference, peaking at only about 80%. I'd like to push it to 99%. Which parameters should I adjust?
```
+-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...   Off | 00000000:8A:00.0 Off |                    0 |
| N/A   66C    P0   205W / 250W | 14807MiB / 40960MiB  |      78%     Default |
+-------------------------------+----------------------+----------------------+
```
```python
import multiprocessing

from llama_cpp import Llama

N_THREADS = multiprocessing.cpu_count()

self.runner = Llama(
    model_path=self.model_name,
    n_gpu_layers=-1,                 # offload all layers to the GPU
    chat_format=self.generating_args["chat_format"],
    tokenizer=self.llama_tokenizer,
    flash_attn=True,
    verbose=False,
    n_ctx=1024,                      # context window size (tokens)
    n_threads=N_THREADS // 2,        # CPU threads for per-token generation
    n_threads_batch=N_THREADS,       # CPU threads for prompt/batch processing
)
```
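The thread parameters above are derived from the host's logical core count: all cores for prompt (batch) processing and half for per-token generation. A minimal sketch of that split, with a hypothetical helper name (`pick_thread_counts` is not part of llama-cpp-python):

```python
import multiprocessing

def pick_thread_counts():
    """Hypothetical helper mirroring the thread split used above:
    half the logical cores for token generation, all of them for
    batch (prompt) processing."""
    n = multiprocessing.cpu_count()
    return max(1, n // 2), n

n_threads, n_threads_batch = pick_thread_counts()
```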
```python
x = self.runner.create_chat_completion(
    messages=messages,
    top_p=0.0,
    top_k=1,          # top_k=1 makes decoding greedy, so temperature has no effect
    temperature=1,
    max_tokens=512,
    seed=1337,
)
```
Originally posted by @xiangxinhello in #1669 (comment)