Commit 6a2bc8b

server : added --no-prefill-assistant flag (#13608)
* added no-prefill-assistant flag
* reworded documentation comment
* updated server README.md
1 parent: e3a7cf6 · commit: 6a2bc8b

5 files changed: +17 −1 lines changed

‎common/arg.cpp

+10
@@ -2880,6 +2880,16 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
             params.chat_template = read_file(value);
         }
     ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_CHAT_TEMPLATE_FILE"));
+    add_opt(common_arg(
+        {"--no-prefill-assistant"},
+        string_format(
+            "whether to prefill the assistant's response if the last message is an assistant message (default: prefill enabled)\n"
+            "when this flag is set, if the last message is an assistant message then it will be treated as a full message and not prefilled\n"
+        ),
+        [](common_params & params) {
+            params.prefill_assistant = false;
+        }
+    ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_NO_PREFILL_ASSISTANT"));
     add_opt(common_arg(
         {"-sps", "--slot-prompt-similarity"}, "SIMILARITY",
         string_format("how much the prompt of a request must match the prompt of a slot in order to use that slot (default: %.2f, 0.0 = disabled)\n", params.slot_prompt_similarity),

‎common/common.h

+1
@@ -368,6 +368,7 @@ struct common_params {
     bool use_jinja = false; // NOLINT
     bool enable_chat_template = true;
     common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
+    bool prefill_assistant = true; // if true, any trailing assistant message will be prefilled into the response

     std::vector<std::string> api_keys;

‎tools/server/README.md

+2
@@ -13,6 +13,7 @@ Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
 * Multimodal ([documentation](../../docs/multimodal.md)) / with OpenAI-compatible API support
 * Monitoring endpoints
 * Schema-constrained JSON response format
+* Prefilling of assistant messages similar to the Claude API
 * [Function calling](../../docs/function-calling.md) / tool use for ~any model
 * Speculative decoding
 * Easy-to-use web UI
@@ -175,6 +176,7 @@ The project is under active development, and we are [looking for feedback and co
 | `--reasoning-format FORMAT` | reasoning format (default: deepseek; allowed values: deepseek, none)<br/>controls whether thought tags are extracted from the response, and in which format they're returned. 'none' leaves thoughts unparsed in `message.content`, 'deepseek' puts them in `message.reasoning_content` (for DeepSeek R1 & Command R7B only).<br/>only supported for non-streamed responses<br/>(env: LLAMA_ARG_THINK) |
 | `--chat-template JINJA_TEMPLATE` | set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone3, falcon3, gemma, gigachat, glmedge, granite, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, phi3, phi4, rwkv-world, smolvlm, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
 | `--chat-template-file JINJA_TEMPLATE_FILE` | set custom jinja chat template file (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>only commonly used templates are accepted (unless --jinja is set before this flag):<br/>list of built-in templates:<br/>bailing, chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, deepseek3, exaone3, falcon3, gemma, gigachat, glmedge, granite, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch, openchat, orion, phi3, phi4, rwkv-world, smolvlm, vicuna, vicuna-orca, yandex, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE_FILE) |
+| `--no-prefill-assistant` | whether to prefill the assistant's response if the last message is an assistant message (default: prefill enabled)<br/>when this flag is set, if the last message is an assistant message then it will be treated as a full message and not prefilled<br/>(env: LLAMA_ARG_NO_PREFILL_ASSISTANT) |
 | `-sps, --slot-prompt-similarity SIMILARITY` | how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.50, 0.0 = disabled)<br/> |
 | `--lora-init-without-apply` | load LoRA adapters without applying them (apply later via POST /lora-adapters) (default: disabled) |
 | `--draft-max, --draft, --draft-n N` | number of tokens to draft for speculative decoding (default: 16)<br/>(env: LLAMA_ARG_DRAFT_MAX) |

‎tools/server/server.cpp

+2
@@ -4348,6 +4348,7 @@ int main(int argc, char ** argv) {
         json data = oaicompat_completion_params_parse(
             body,
             params.use_jinja,
+            params.prefill_assistant,
             params.reasoning_format,
             ctx_server.chat_templates.get(),
             ctx_server.mctx,
@@ -4369,6 +4370,7 @@ int main(int argc, char ** argv) {
         json data = oaicompat_completion_params_parse(
             body,
             params.use_jinja,
+            params.prefill_assistant,
             params.reasoning_format,
             ctx_server.chat_templates.get(),
             ctx_server.mctx,

‎tools/server/utils.hpp

+2 −1
@@ -583,6 +583,7 @@ static json oaicompat_completion_params_parse(const json & body) {
 static json oaicompat_completion_params_parse(
     const json & body, /* openai api json semantics */
     bool use_jinja,
+    bool prefill_assistant,
     common_reasoning_format reasoning_format,
     const struct common_chat_templates * tmpls,
     bool allow_non_text,
@@ -732,7 +733,7 @@ static json oaicompat_completion_params_parse(

     // if the assistant message appears at the end of list, we do not add end-of-turn token
     // for ex. this can be useful to modify the reasoning process in reasoning models
-    bool prefill_assistant_message = !inputs.messages.empty() && inputs.messages.back().role == "assistant";
+    bool prefill_assistant_message = !inputs.messages.empty() && inputs.messages.back().role == "assistant" && prefill_assistant;
     common_chat_msg last_message;
     if (prefill_assistant_message) {
         last_message = inputs.messages.back();
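To make the change in control flow explicit, here is a minimal self-contained sketch of the condition the modified line implements. `chat_msg` and `chat_inputs` are simplified stand-ins for the server's internal types (not the actual `common_chat_msg` definition); only the boolean logic mirrors the diff.

```cpp
#include <string>
#include <vector>

// Simplified stand-ins for the server's message/input types.
struct chat_msg {
    std::string role;
    std::string content;
};

struct chat_inputs {
    std::vector<chat_msg> messages;
};

// Mirrors the changed condition: prefill happens only when the last message
// is an assistant message AND the new prefill_assistant switch is still true
// (i.e. --no-prefill-assistant was not passed).
static bool should_prefill(const chat_inputs & inputs, bool prefill_assistant) {
    return !inputs.messages.empty()
        && inputs.messages.back().role == "assistant"
        && prefill_assistant;
}

int main() {
    chat_inputs inputs;
    inputs.messages.push_back({"user", "Name a prime number."});
    inputs.messages.push_back({"assistant", "The smallest prime is"});

    bool with_default = should_prefill(inputs, /*prefill_assistant=*/true);   // true: continue the message
    bool with_flag    = should_prefill(inputs, /*prefill_assistant=*/false);  // false: treat it as complete
    return (with_default && !with_flag) ? 0 : 1;
}
```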
