server: streaming of tool calls and thoughts when --jinja is on #12379
Conversation
I'm experiencing this issue consistently across all models tested on open-webui, which leverages langchain as its backend. I've documented reproduction steps below using langchain directly in Python. For context, this PR would be transformative for the local-LLM community. While open-webui recently implemented native tool calling, it currently functions only when streaming is enabled. If llama.cpp could support native tool calls during streaming, this would finally enable proper tool utilization with local models. Steps to Reproduce:
Let me know if I can help in any way.

Create a langchain-toolcall.py file as such (in case it matters, I used Python 3.11):

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain.globals import set_debug
set_debug(True)
@tool
def add_two_numbers(x: float, y: float) -> float:
"""Add 'x' and 'y'."""
return x + y
prompt = ChatPromptTemplate.from_messages([
("system", "you're a helpful assistant"),
("human", "{input}"),
("placeholder", "{agent_scratchpad}"),
])
tools = [add_two_numbers]
llm = ChatOpenAI(
model="llama-cpp-model",
api_key="sk-null",
base_url="http://localhost:8080/v1",
disable_streaming=False,
)
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
print(agent_executor.invoke({"input": "what's 3 plus 5"}))
```

Create a docker-compose.yaml file as such and run `docker compose up --build -d`:

```yaml
services:
  llama-cpp-server:
    build:
      context: https://github.com/ochafik/llama.cpp.git#tool-diffs
      dockerfile: .devops/cpu.Dockerfile
      target: full
    ports:
      - "0.0.0.0:8080:8080"
    command: --jinja --alias llama-cpp-model --host 0.0.0.0 --verbose -fa -hf bartowski/microsoft_Phi-4-mini-instruct-GGUF
    entrypoint: ./llama-server
```
@ochafik Fantastic work here. Do you have an ETA for getting this PR ready to merge? I'm experimenting with the Continue.dev VSCode extension in OpenShift Dev Spaces (Eclipse Che), and using llama.cpp to serve models from the OpenShift cluster. This PR is a breakout feature for llama.cpp IMO. Cheers.
Pull Request Overview
This PR introduces support for streaming tool calls and thinking outputs in OpenAI delta format while improving handling for various thinking models and truncated outputs. Key changes include updates to chat response handling (via a new common_chat_syntax structure), modifications in the test suite to support both streamed and non‐streamed modes, and enhancements to JSON and regex partial parsing for resilient output healing.
Reviewed Changes
Copilot reviewed 26 out of 28 changed files in this pull request and generated 4 comments.
Show a summary per file
File | Description |
---|---|
models/templates/README.md | Added a new template command for Qwen/QwQ-32B, updating the available templates list. |
examples/server/utils.hpp | Removed outdated inline utility functions and refactored tool/grammar handling logic. |
examples/server/tests/utils.py | Added a new helper (make_any_request) to better support streamed and non‐streamed test requests. |
examples/server/tests/unit/test_tool_call.py | Updated tests to parameterize streaming mode and adjust assertions for tool call IDs. |
examples/server/server.cpp | Refactored OAI-compatible responses using the new common_chat_syntax and improved delta/diff handling. |
docs/function-calling.md | Added cautionary documentation regarding extreme KV quantizations. |
common/sampling.cpp | Simplified trigger pattern handling by removing legacy support for pattern start triggers. |
common/regex-partial.* | New partial regex parser implementation, including a reversed partial-matching engine. |
common/json-partial.* | Introduced new JSON healing and partial parsing functions for resilient parsing of streaming outputs. |
common/chat*.{h,cpp} and common/chat-parser*.{h,cpp} | Overhauled chat message parsing and conversion to support partial, streaming, and healing of outputs. |
Files not reviewed (2)
- common/CMakeLists.txt: Language not supported
- models/templates/Qwen-QwQ-32B.jinja: Language not supported
Comments suppressed due to low confidence (1)
common/chat-parser.cpp:180
- [nitpick] The parser frequently throws common_chat_msg_partial_exception for incomplete inputs. Ensure that comprehensive unit tests cover these partial parsing scenarios and that additional error logging is provided to aid debugging in streaming contexts.
throw common_chat_msg_partial_exception(regex.str());
```python
assert res.status_code == 200, f"Expected status code 200, got {res.status_code}"
choice = res.body["choices"][0]
# assert res.status_code == 200, f"Expected status code 200, got {res.status_code}"
choice = body["choices"][0]
tool_calls = choice["message"].get("tool_calls")
assert tool_calls and len(tool_calls) == 1, f'Expected 1 tool call in {choice["message"]}'
tool_call = tool_calls[0]
assert choice["message"].get("content") in (None, ""), f'Expected no content in {choice["message"]}'
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Consider adding a comment explaining why the check for a non-empty tool call id is currently disabled. This will help future maintainers understand the rationale behind omitting this assertion.
```python
assert choice["message"].get("content") in (None, ""), f'Expected no content in {choice["message"]}'
# The following assertion is disabled because tool call IDs are currently optional
# in the system's response. If this changes in the future, this assertion can be
# re-enabled to enforce the presence of a non-empty tool call ID.
```
examples/server/server.cpp
Outdated
```diff
@@ -349,11 +353,14 @@ struct server_task {
         {
             auto it = data.find("chat_format");
             if (it != data.end()) {
-                params.oaicompat_chat_format = static_cast<common_chat_format>(it->get<int>());
-                SRV_INF("Chat format: %s\n", common_chat_format_name(params.oaicompat_chat_format).c_str());
+                params.oaicompat_chat_syntax.format = static_cast<common_chat_format>(it->get<int>());
```
The transition from a simple chat format enum to a full common_chat_syntax struct enhances flexibility, but consider adding inline documentation or comments on the new fields (reasoning_format, reasoning_in_content, thinking_forced_open) to aid readability and backward compatibility.
common/regex-partial.cpp
Outdated
```cpp
    auto it = pattern.begin();
    const auto end = pattern.end();

    std::function<std::string()> process = [&]() {
        std::vector<std::vector<std::string>> alternatives(1);
        std::vector<std::string> * sequence = &alternatives.back();

        while (it != end) {
            if (*it == '[') {
                auto start = it;
                ++it;
                while (it != end) {
                    if (*it == '\\' && (++it != end)) {
                        ++it;
                    } else if (*it == ']') {
                        break;
                    } else {
                        ++it;
                    }
                }
                if (it == end) {
                    throw std::runtime_error("Unmatched '[' in pattern");
                }
                ++it;
                sequence->push_back(std::string(start, it));
            } else if (*it == '*' || *it == '?' || *it == '+') {
```
[nitpick] The logic in regex_to_reversed_partial_regex is rather complex. Consider adding more detailed inline comments or function-level documentation describing the steps and edge cases to improve maintainability and ease future debugging.
Suggested change (same code as above, with explanatory comments):

```cpp
    // Initialize iterators for traversing the regex pattern.
    auto it = pattern.begin();
    const auto end = pattern.end();

    // Lambda function to process the regex pattern and construct reversed partial regex components.
    std::function<std::string()> process = [&]() {
        // `alternatives` stores groups of alternative sequences (e.g., for patterns separated by '|').
        std::vector<std::vector<std::string>> alternatives(1);
        // `sequence` points to the current sequence being constructed.
        std::vector<std::string> * sequence = &alternatives.back();

        // Main loop to parse the regex pattern character by character.
        while (it != end) {
            if (*it == '[') { // Handle character classes (e.g., [a-z]).
                auto start = it; // Mark the start of the character class.
                ++it;
                while (it != end) {
                    if (*it == '\\' && (++it != end)) { // Skip escaped characters.
                        ++it;
                    } else if (*it == ']') { // End of character class.
                        break;
                    } else {
                        ++it; // Continue parsing the character class.
                    }
                }
                if (it == end) { // Error: unmatched '['.
                    throw std::runtime_error("Unmatched '[' in pattern");
                }
                ++it; // Include the closing ']' in the character class.
                sequence->push_back(std::string(start, it)); // Add the character class to the sequence.
            } else if (*it == '*' || *it == '?' || *it == '+') { // Handle quantifiers.
```
```cpp
    try {
        auto _ = json::parse(str); // NOLINT
```
[nitpick] The JSON healing mechanism iteratively tests parseability on potentially large substrings. It may be worthwhile to benchmark this logic under heavy load and consider potential optimizations if performance becomes an issue.
Suggested change (binary search for the longest parseable prefix):

```cpp
    size_t left = 0, right = str.size();
    while (left < right) {
        size_t mid = left + (right - left) / 2;
        try {
            auto _ = json::parse(str.substr(0, mid)); // NOLINT
            left = mid + 1; // Move right if parse succeeds
        } catch (const std::exception &) {
            right = mid; // Move left if parse fails
        }
    }
    // Final check to confirm parseability of the largest substring
    try {
        auto _ = json::parse(str.substr(0, left)); // NOLINT
```
@ochafik @ericcurtin PTAL - ochafik#3
Hope this PR won't be forgotten or dropped; without it, some interesting recent tools don't work with LCPP (in particular, https://github.com/bytedance/deer-flow fails with an error).
I'm stoked and waiting for this as well, but sadly many MCP tools currently seem to have compatibility problems with llama.cpp from what I've seen: Dive, Aider's MCP PR, and others I've tried. Streaming support would make a big difference, but I'm honestly not sure it's the only issue. Perhaps I just resolved the conflicts incorrectly when pulling and merging this PR, or perhaps it's not far enough along yet. (Roo-Code's MCP calls work great with or without streaming, as it seems to work around tool calling, likely in the normal prompts.)
Sorry everyone for the lack of activity. Perfect storm of job change and life events (all good!). Will try and push this (and related PRs) through in the next week, as I'm unsure how much I'll be able to do afterwards 😅.
@ochafik Congrats and good luck. Obviously I think folks just want to encourage, not demand. Life always comes first.
@ochafik @strawberrymelonpanda absolutely right, not in any way demanding! Best wishes!
Best of luck in the new role!
This merged PR on Minja a couple of hours ago (which I believe llama.cpp uses) might just solve the problem I was having above.
For those who are following this PR, I am trying to maintain a merge from this branch and the master branch of llama.cpp here: https://github.com/cgruver/llama.cpp/tree/tools
This PR is still WIP (see todos at the bottom) but welcoming early feedback / testing
- Streams `<think>` reasoning content inside the content (same output for all thinking models when using the default `--reasoning-format deepseek`, even for those not using the `<think>` syntax like Command R7B), and even if the `<think>` tag was added at the end of the prompt by the template (as for DeepSeek R1 & QwQ).
- Streams raw Python tool calls (JSON-encoded as {"code": "json-encoded code"} for multiline programs).

This fixes #12107, #10920, #11861
Follow up to #9639
How to test / use
- Get and build this PR's branch
- Run `llama-server` w/ any model (see more details in the tool calling docs; note that some GGUFs require a chat template override!)
- Call the chat completions endpoint in streamed mode with any OpenAI-compatible library, or plain curl (a client sketch follows below):
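For instance, here is a minimal streaming client sketch using the `openai` Python package. Assumptions: the server is listening on localhost:8080, the model alias `llama-cpp-model` and the `add_two_numbers` tool mirror the langchain/docker-compose example earlier in the thread, and the dummy API key is arbitrary.

```python
# Minimal sketch: stream a chat completion with one tool and print the
# content / tool-call deltas as they arrive (assumes llama-server --jinja
# from this PR is running on localhost:8080; tool is illustrative only).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-null")

tools = [{
    "type": "function",
    "function": {
        "name": "add_two_numbers",
        "description": "Add 'x' and 'y'.",
        "parameters": {
            "type": "object",
            "properties": {"x": {"type": "number"}, "y": {"type": "number"}},
            "required": ["x", "y"],
        },
    },
}]

stream = client.chat.completions.create(
    model="llama-cpp-model",
    messages=[{"role": "user", "content": "What's 3 plus 5?"}],
    tools=tools,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:                  # streamed (possibly <think>...) content
        print(delta.content, end="", flush=True)
    for tc in delta.tool_calls or []:  # streamed tool-call name/argument deltas
        if tc.function.name:
            print(f"\n[tool call] {tc.function.name}(", end="", flush=True)
        if tc.function.arguments:
            print(tc.function.arguments, end="", flush=True)
```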
You can also open http://localhost:8080/ to see thoughts being streamed back properly, even for models whose template adds an opening `<think>` tag to the end of the prompt (QwQ, now DeepSeek R1 too, although most GGUFs have their initial version) and for models like Cohere Command R7B that natively use a different thinking-tags syntax (now normalized, since `--reasoning-format deepseek` is the default).

Context
Supporting OpenAI's streaming delta format was a bit tricky, as it returns chunks of JSON-encoded arguments for each function call, but that's not necessarily what models give us.
While tool calls are returned in a standard format, each w/ a function name, tool call id and JSON encoded arguments, model outputs vary greatly in their syntax. That syntax mostly uses JSON for arguments but not always.
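For illustration, the OpenAI-style deltas for a single call to special_function(arg1=1) might look roughly like this (shape only; the id value is made up):

```python
# Rough shape of the streamed deltas for one tool call (illustrative only):
# the "arguments" fragments concatenate to the full JSON string '{"arg1": 1}'.
deltas = [
    {"tool_calls": [{"index": 0, "id": "call_123", "type": "function",
                     "function": {"name": "special_function", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '{"arg1":'}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": " 1}"}}]},
]
full_arguments = "".join(d["tool_calls"][0]["function"].get("arguments", "") for d in deltas)
assert full_arguments == '{"arg1": 1}'
```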
Function calls and their arguments can be at various levels:

- [TOOL_CALLS][{"name": "special_function", "arguments": {"arg1": 1}, "id": "123456789"}]
- <tool_call>{"name": "special_function", "arguments": {"arg1": 1}}</tool_call> (note that some models use other keys here, e.g. tool_name, parameters, and may have the tool call id too)
- DeepSeek R1's <|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>special_function\n```json\n{"arg1": 1}\n```<|tool▁call▁end|><|tool▁calls▁end|>, or functionary v3.2: special_function\n{"arg1": 1}
- {"tool_call": {"name": "special_function", "arguments": {"arg1": 1}}} (or inside a tool_calls array if parallel_tool_calls is on)
- a raw python tool call, with two variants: <|python_tag|>multiline python code here (functionary v3.1), python\nmultiline python code here (functionary v3.2; w/ prefix >>> if after a textual response), or <|python_tag|>python.call(code="multiline\npython\ncode\nhere")
Side note about raw python code: <|python_tag|>foo.call(bar="baz") in Llama 3.x style will return "tool_calls": [{"name": "foo", "arguments": "{\"bar\": \"baz\"}"}], while the same output from Functionary would be parsed as "tool_calls": [{"name": "python", "arguments": "{\"code\": \"foo.call(bar=\\\"baz\\\")\"}"}].
Now when streaming, we may have sampled only a prefix of the aforementioned output, and ideally want to parse what can be parsed out of it, and send a JSON-encoded arguments object that is cut at a safe place, so that the sum of all the deltas adds up to the full arguments JSON string.
(A primary use case for partial JSON arguments streaming is streaming large multiline diff tool arguments in tools such as RooCode / Cline / Cursor)
The cleanest option would have been to create a unified parser / state machine that can be drip-fed tokens, and preserve its state in the server slot. But I figured the complexity was too high for now (see notes on speeding up below), and instead I've implemented something definitely inefficient but relatively simple (chat.cpp is still roughly the same size): for every token coming in, I try and parse the entire output so far, with partial regex & json parsing support, which allows recovering cleanly cut-off JSON-encoded function arguments (regardless of the original format of said arguments). I then compare the full common_chat_msg against the last one we sent back, and compute OpenAI-compatible deltas out of this.
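As an illustration of that reparse-and-diff idea, here is a Python sketch only (the real code works on common_chat_msg structs in C++); it assumes the parsed content and healed arguments strings only ever grow, which is what makes the deltas sum up to the full JSON:

```python
# Illustrative sketch: after each sampled token the full output is re-parsed
# into `cur`, then diffed against the last message we sent (`prev`) to produce
# OpenAI-style deltas.
def compute_deltas(prev: dict, cur: dict) -> list[dict]:
    deltas = []
    # Content only ever grows in this sketch, so the delta is the new suffix.
    new_content = cur.get("content", "")[len(prev.get("content", "")):]
    if new_content:
        deltas.append({"content": new_content})
    for i, call in enumerate(cur.get("tool_calls", [])):
        prev_calls = prev.get("tool_calls", [])
        if i >= len(prev_calls):
            # First time we see this call: send its name once, arguments empty.
            deltas.append({"tool_calls": [{"index": i, "function": {"name": call["name"], "arguments": ""}}]})
            prev_args = ""
        else:
            prev_args = prev_calls[i]["arguments"]
        # The healed JSON arguments string only ever grows, so the sum of all
        # argument deltas adds up to the full arguments JSON string.
        arg_delta = call["arguments"][len(prev_args):]
        if arg_delta:
            deltas.append({"tool_calls": [{"index": i, "function": {"arguments": arg_delta}}]})
    return deltas

prev = {"content": "", "tool_calls": [{"name": "special_function", "arguments": '{"arg1'}]}
cur  = {"content": "", "tool_calls": [{"name": "special_function", "arguments": '{"arg1": 1}'}]}
print(compute_deltas(prev, cur))  # [{'tool_calls': [{'index': 0, 'function': {'arguments': '": 1}'}}]}]
```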
Location, location, location 🏡

Note that the output of the model may be truncated (max token output length reached or streaming in progress), and that may fall inside an expected literal (e.g. <think> isn't a single token on QwQ-32B), inside a regex (used for some matchers), or inside some JSON. But more interesting is where it happens, esp. for partial JSON:

tests/test-chat-parser.cpp should make this a bit clearer, and I'm in the process of adding partial examples w/ the actual formats in tests/test-chat.cpp (look out for /* is_partial= */ true).

See examples of streamed tool call deltas
Implementation notes
Partial parsing utils
I added a common_chat_msg_parser utility with syntax reminiscent of @ngxson's suggestions in #11607 (comment), but relying on control flow to allow more flexibility:

- common_regex (see common/regex-partial.cpp): /abc/ gives /((?:(?:c)?b)?a)[\s\S]*/, with a single capturing group whose end indicates - in reverse - where the partial match started.
- nlohmann/json's SAX interface to build location awareness / a stack to know how to heal a JSON that fails to parse (consume_json accepts a list of json paths under which to expect arguments objects; could be from the root = empty path if the entire json object is an arguments object).
- try_* parsing methods. This makes the code relatively easy to read and debug. No exotic syntax (apart from optionals, they really help here imho), which should make it easier to convert to coroutines when we wanna make it all incremental.

This allows parsing of partial model outputs, whether in streaming mode or when reaching the token limit (currently, tool calls give ugly unparsed outputs when finish_reason != tool_call).
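A rough Python sketch of the two ideas above, greatly simplified: the partial-regex helper handles literal patterns only (the real common/regex-partial.cpp also deals with character classes, quantifiers and alternation), and the healer naively appends closers, which breaks if the cut lands right after an object key (exactly why the real code tracks where in the JSON it is):

```python
import json
import re

def partial_match_start(literal: str, text: str) -> int | None:
    """Index where a truncated occurrence of `literal` begins at the very end
    of `text`, found by matching a reversed nested-optional pattern against the
    reversed text (literal patterns only)."""
    pattern = ""
    for ch in reversed(literal):  # "abc" -> "(?:(?:c)?b)?a"
        pattern = f"(?:{pattern})?{re.escape(ch)}" if pattern else re.escape(ch)
    m = re.match(f"({pattern})", text[::-1])
    return len(text) - len(m.group(1)) if m else None

def heal_json(partial: str) -> str:
    """Naively close unterminated strings, objects and arrays so the partial
    JSON parses (no location awareness, unlike the real json-partial code)."""
    stack, in_str, escape = [], False, False
    for ch in partial:
        if in_str:
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    return partial + ('"' if in_str else "") + "".join(reversed(stack))

print(partial_match_start("<think>", "Okay... <thi"))  # 8: "<thi" is a partial <think>
print(json.loads(heal_json('{"name": "special_function", "arguments": {"arg1": 1')))
# {'name': 'special_function', 'arguments': {'arg1': 1}}
```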
To think or not to think... what is the prompt?
I've also introduced common_chat_syntax, which wraps common_reasoning_format and common_chat_format together with:

- thinking_forced_open: whether the prompt was detected to end w/ a (model-specific) <think> tag to force thinking mode
- reasoning_in_content: whether the thinking tags should be left in the content, which is currently the case in streaming mode, as the DeepSeek API does.

This allows streaming back a standard <think>... syntax even for models that use a different set of tags (e.g. Command R7B). And of course, --reasoning-format none is still allowed to get the raw output.

Note: Ideally, we'd stream the thoughts as a reasoning_content delta (now trivial to implement), but for now we are just aiming for compatibility w/ DeepSeek's API (if --reasoning-format deepseek, which is the default).

Triggering thoughts 😓
I noticed DeepSeek R1 Qwen 7B sometimes obsesses over the tool call syntax and "thinks" about how it's gonna call it... which triggers the lazy grammars for said calls before the thoughts are closed.
To address this, I made it possible for common_chat_templates_apply to create trigger regexes that match on the entire output (this was already the case in the sampler). COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL (renamed from _START) is now expected to have a single capturing group from the start of which the grammar sampler will be activated.

Functionary v3.2 w/ raw python
Ask bartowski/functionary-small-v3.2-GGUF:Q4_K_M to write a hello world in Python and it outputs python\n{"code": "print('hey')"}.

But ask it to print a hello world in python w/ matplotlib, and it uses its raw multiline python syntax python\nprint('hey')\n# many other lines. This is now supported.
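To keep the streamed deltas well formed, that raw code ends up JSON-encoded into the arguments string; a tiny illustration of the wrapping (illustrative only, mirroring the parsed output shown in the side note above):

```python
import json

# The raw multiline program is wrapped into a JSON-encoded arguments object,
# e.g. {"code": "..."}.
raw = "print('hey')\n# many other lines"
print(json.dumps({"code": raw}))  # {"code": "print('hey')\n# many other lines"}
```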
TODOs

- tool-call: ensure there's always a non-empty tool call id #12292
- logprobs for tools mode (right now, forbidden; we don't return diffs for every token, for instance if a function name is in multiple tokens we don't want to send its name in chunks)
- (llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L)
- Extract partial regex support (common_regex) as a separate PR: common: add partial regex support #12808
- Extract partial JSON support (common_json) as a separate PR(?) or fold into chat-parser.cpp
- Handle templates that add <|START_RESPONSE|> at the end of the prompt: output will contain an <|END_RESPONSE|> that needs handling (would fit nicely in the new common_chat_syntax struct). Maybe combine w/ forced/disabled thinking modes as a follow up PR
- Run scripts/tool_bench.sh to compare against master (+ compare timings)

Future follow ups:
cc/ @jpohhhh