server: streaming of tool calls and thoughts when --jinja is on #12379

Draft: ochafik wants to merge 100 commits into master from ochafik:tool-diffs

Conversation

@ochafik (Collaborator) commented Mar 14, 2025

This PR is still WIP (see TODOs at the bottom), but early feedback / testing is welcome.

  • Support streaming of tool calls in OpenAI format
  • Improve handling of thinking models (DeepSeek R1 Distills, QwQ, Command R7B):
    • Stream <think> reasoning content inside the content (same output for all thinking models when using the default --reasoning-format deepseek, even for those not using the <think> syntax like Command R7B), and even if the <think> tag was added at the end of the prompt by the template (as for DeepSeek R1 & QwQ).
    • Avoid spurious lazy (tool call) grammar triggers from "thoughts about tool calls" (only trigger after closing any unclosed thoughts)
  • Improve Functionary v3.2 support (allow raw Python code, which models prefer over {"code": "json-encoded code"} for multiline programs)
  • Support truncated outputs incl. reasoning_content & tool_calls (returns salvageable fields when finish_reason = length)

This fixes #12107, #10920, #11861

Follow up to #9639

How to test / use

  • Get and build this PR's branch
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    git remote add ochafik https://github.com/ochafik/llama.cpp
    git fetch ochafik
    git checkout ochafik/tool-diffs
    cmake -B build -DLLAMA_CURL=1 # -DGGML_CUDA=1 ...
    cmake --build build -t llama-server --parallel --config Release
    alias llama-server=./build/bin/llama-server
  • Run llama-server w/ any model (see more details in the tool calling docs; note that some GGUFs require a chat template override!):

    # Thoughts of Command R7B / DeepSeek R1 / QwQ will be streamed in the content inside <think> tags
    llama-server --jinja -fa -hf bartowski/Qwen_QwQ-32B-GGUF
    
    # Models w/ generic tool call support now return clean interrupted output when hitting token limit
    llama-server --jinja -fa -hf bartowski/microsoft_Phi-4-mini-instruct-GGUF
    
  • Call the chat completions endpoint in streamed mode with any OpenAI-compatible library, or plain curl (a Python client sketch follows this list):

    curl http://localhost:8080/v1/chat/completions -d '{
      "model": "gpt-3.5-turbo",
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "python",
            "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters": {
              "type": "object",
              "properties": {
                "code": {
                  "type": "string",
                  "description": "The code to run in the ipython interpreter."
                }
              },
              "required": ["code"]
            }
          }
        }
      ],
      "messages": [
        {
          "role": "user",
          "content": "Print a hello world message with python."
        }
      ],
      "stream": true
    }'
  • You can also open http://localhost:8080/ to see thoughts being streamed back properly, even for models whose template adds an opening <think> tag to the end of the prompt (QwQ, and now DeepSeek R1 too, although most GGUFs still carry their initial template version), and for models like Cohere Command R7B that natively use a different thinking-tag syntax (now normalized, since --reasoning-format deepseek is the default)
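
As referenced above, here is a minimal Python client sketch for consuming the stream (assuming the openai package >= 1.0 pointed at the local server; the model name just mirrors the curl example, llama-server serves whatever model it was started with):

# Minimal sketch (not part of this PR): consume the streamed deltas with the
# `openai` Python package (>= 1.0) pointed at the local llama-server instance.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-needed")

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",  # name mirrors the curl example; llama-server serves a single model
    tools=[{
        "type": "function",
        "function": {
            "name": "python",
            "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    }],
    messages=[{"role": "user", "content": "Print a hello world message with python."}],
    stream=True,
)

# Tool calls arrive as deltas keyed by index; concatenate the argument fragments.
calls = {}
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)  # includes streamed <think>...</think> content
    for tc in delta.tool_calls or []:
        call = calls.setdefault(tc.index, {"name": "", "arguments": ""})
        if tc.function and tc.function.name:
            call["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            call["arguments"] += tc.function.arguments

print()
print(calls)  # e.g. {0: {'name': 'python', 'arguments': '{"code": "print(...)"}'}}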

Context

Supporting OpenAI's streaming delta format was a bit tricky, as it returns chunks of JSON-encoded arguments for each function call, but that's not necessarily what models give us.

While tool calls are returned in a standard format, each w/ a function name, tool call id and JSON encoded arguments, model outputs vary greatly in their syntax. That syntax mostly uses JSON for arguments but not always.

Function calls and their arguments can be at various levels:

  • JSON array of tool calls (e.g. Mistral Nemo: [TOOL_CALLS][{"name": "special_function", "arguments": {"arg1": 1}, "id": "123456789"}])
  • Standalone JSON tool call (e.g. Hermes syntax: <tool_call>{"name": "special_function", "arguments": {"arg1": 1}}</tool_call>; note that some models use other keys here, e.g. tool_name, parameters, and may have the tool call id too)
  • JSON arguments object w/ name in some prefix (e.g. Deepseek: <|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>special_function\n```json\n{"arg1": 1}\n```<|tool▁call▁end|><|tool▁calls▁end|>, or functionary v3.2: special_function\n{"arg1": 1})
  • Nested JSON for the generic mode {"tool_call": {"name": "special_function", "arguments": {"arg1": 1}}} (or inside tool_calls array if parallel_tool_calls is on)
  • No JSON / raw code string for python tool call, with two variants:
    • Unconstrained verbatim code: <|python_tag|>multiline python code here (functionary v3.1), python\nmultiline python code here (functionary v3.2; w/ prefix >>> if after textual response)
    • Constrained pythonish syntax for "builtin tools" (Llama 3.x, quite widespread): <|python_tag|>python.call(code="multiline\npython\ncode\nhere")

Side note about raw python code: <|python_tag|>foo.call(bar="baz") in Llama 3.x style will return "tool_calls": [{"name": "foo", "arguments": "{\"bar\": \"baz\"}"}], while the same output from Functionary would be parsed as "tool_calls": [{"name": "python", "arguments": "{\"code\": \"foo.call(bar=\\\"baz\\\")\"}"}].

Now when streaming, we may have sampled only a prefix of the aforementioned output, and ideally we want to parse what can be parsed out of it and send a JSON-encoded arguments object cut at a safe place, so that the sum of all the deltas adds up to the full arguments JSON string.

(A primary use case for partial JSON arguments streaming is streaming large multiline diff tool arguments in tools such as RooCode / Cline / Cursor)

The cleanest option would have been to create a unified parser / state machine that can be drip-fed tokens, and preserve its state in the server slot. But I figured the complexity was too high for now (see notes on speeding up below), and instead I've implemented something definitely inefficient but relatively simple (chat.cpp is still the same size): for every token coming in, I try to parse the entire output so far, with partial regex & JSON parsing support, which allows recovering cleanly cut-off JSON-encoded function arguments (regardless of the original format of said arguments). I then compare the full common_chat_msg against the last one we sent back, and compute OpenAI-compatible deltas out of this.
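
To make the diffing idea concrete, here is a toy Python sketch of the principle (the real logic is the C++ in common/chat.cpp; the dict fields below are simplified stand-ins for common_chat_msg):

# Toy sketch of the delta computation: re-parse the whole output on every token,
# then emit only what changed since the last message we sent, so that concatenating
# all deltas reproduces the full message.
def compute_deltas(prev: dict, cur: dict) -> list:
    deltas = []
    # content streams as a plain text suffix
    prev_content, cur_content = prev.get("content", ""), cur.get("content", "")
    if cur_content.startswith(prev_content) and len(cur_content) > len(prev_content):
        deltas.append({"content": cur_content[len(prev_content):]})
    # tool call arguments are JSON-encoded strings, diffed per call index
    prev_calls = prev.get("tool_calls", [])
    for i, call in enumerate(cur.get("tool_calls", [])):
        if i >= len(prev_calls):  # first time we see this call: send its (complete) name
            deltas.append({"tool_calls": [{"index": i, "function": call.copy()}]})
        elif call["arguments"] != prev_calls[i]["arguments"]:
            suffix = call["arguments"][len(prev_calls[i]["arguments"]):]
            deltas.append({"tool_calls": [{"index": i, "function": {"arguments": suffix}}]})
    return deltas

prev = {"tool_calls": [{"name": "python", "arguments": '{"code": "pri'}]}
cur  = {"tool_calls": [{"name": "python", "arguments": '{"code": "print(1)"}'}]}
print(compute_deltas(prev, cur))
# -> [{'tool_calls': [{'index': 0, 'function': {'arguments': 'nt(1)"}'}}]}]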

Location, location, location 🏡

Note that the output of the model may be truncated (max token output length reached or streaming in progress), and that may fall inside an expected literal (e.g. <think> isn't a single token on QwQ-32B), inside a regex (used for some matchers), or inside some JSON.

But more interesting is where it happens, esp. for partial JSON:

  • If it happens inside an arguments object or a contents string (for generic mode), we should return it partial / truncated (and json-dumped in the case of the arguments), and diffed from the last parsed value for the streamed case
  • If it happens inside the wrapper of the arguments, then it depends. We don't want to emit half a function name, but as soon as we have a complete function name we can send a diff. So we try to heal the JSON (we identify which JSON paths can be partially healed, because they're inside the arguments, and which ones must be dropped), and only populate a tool call once we have at least a name (see the sketch below). Likewise, if there is an array of function calls with the first complete and the next partial, we want to make sure the client can start calling the first function.
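
A toy Python sketch of that decision (the healing marker and the salvaging rule below are simplified stand-ins for what common/json-partial and consume_json do in C++):

import json

MARKER = "$HEAL$"  # hypothetical healing marker, standing in for the one in common/json-partial

def salvage_tool_call(healed: dict):
    """Toy version of the rule above: only emit a tool call once its name is complete;
    arguments may be partial, in which case the JSON-encoded arguments string is cut
    right before the healing marker so the client receives a clean prefix."""
    name = healed.get("name")
    if not isinstance(name, str) or MARKER in name:
        return None  # half a function name is useless, emit nothing yet
    dumped = json.dumps(healed.get("arguments", {}))
    cut = dumped.find(MARKER)
    return {"name": name, "arguments": dumped if cut == -1 else dumped[:cut]}

# Truncation inside the arguments: keep the name, stream a prefix of the arguments.
print(salvage_tool_call({"name": "python", "arguments": {"code": "print('he" + MARKER}}))
# -> {'name': 'python', 'arguments': '{"code": "print(\'he'}

# Truncation inside the function name: nothing to emit yet.
print(salvage_tool_call({"name": "pyth" + MARKER, "arguments": {}}))
# -> None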

tests/test-chat-parser.cpp should make this a bit clearer, and I'm in the process of adding partial examples w/ the actual formats in tests/test-chat.cpp (look out for /* is_partial= */ true)

See examples of streamed tool call deltas
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [
        {
        "type":"function",
        "function":{
            "name":"python",
            "description":"Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters":{
            "type":"object",
            "properties":{
                "code":{
                "type":"string",
                "description":"The code to run in the ipython interpreter."
                }
            },
            "required":["code"]
            }
        }
        }
    ],
    "messages": [
        {
        "role": "user",
        "content": "Print a hello world message with python."
        }
    ], "stream": true
}'
data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":null,"tool_calls":[{"index":0,"id":"call_aqwOReHDKPnqiF7NbRxzDTY1","type":"function","function":{"name":"python","arguments":""}}],"refusal":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"code"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\":\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"print"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"('"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"Hello"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":","}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":" World"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"!"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"')"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"}"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"tool_calls"}]}

data: [DONE]

Implementation notes

Partial parsing utils

I added a common_chat_msg_parser utility with syntax reminiscent of @ngxson's suggestions in #11607 (comment), but relying on control flow to allow more flexibility:

  • Supports partial regex parsing
    • Since the STL still doesn't have partial matching support (unlike Boost), I had to implement my own in common_regex (see common/regex-partial.cpp).
    • The trick: transform the original regex into one that matches in reverse from the end of the string (e.g. /abc/ gives /((?:(?:c)?b)?a)[\s\S]*/, with a single capturing group whose end indicates, in reverse, where the partial match started); see the sketch after this list
  • Supports partial JSON parsing:
    • Used nlohmann/json's SAX interface to build location awareness / stack to know how to heal a JSON that fails to parse
    • Healing the JSON w/ a healing marker that can then be found when visiting the resulting JSON (to remove things we don't want to heal, e.g. the function name, and to cut any JSON-encoded result at the "right" place, which must be somewhere inside the function arguments: consume_json accepts a list of JSON paths under which to expect arguments objects; this could be from the root = empty path if the entire JSON object is an arguments object)
  • Supports control flow w/ try_* parsing methods. This makes the code relatively easy to read and debug. No exotic syntax (apart from optionals, they really help here imho), which should make it easier to convert to coroutines when we want to make it all incremental.
  • Supports full or partial parsing w/ same code (throws partial exceptions to interrupt the control flow without making parsing code more complex)
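
As a toy illustration of the reversed-regex trick for /abc/ (the actual transformation and matching live in common/regex-partial.cpp; a complete match anywhere in the string would be found by an ordinary regex search first):

import re

# /abc/ transformed as described above: group 1 captures the (possibly partial)
# occurrence, read in reverse from the end of the string.
REVERSED_PARTIAL = re.compile(r"((?:(?:c)?b)?a)[\s\S]*")

def partial_match_start(text: str):
    """Return the index at which a partial match of /abc/ begins at the end of `text`,
    or None if the text cannot end with a prefix of "abc"."""
    m = REVERSED_PARTIAL.match(text[::-1])  # anchored at the start of the reversed text
    return len(text) - m.end(1) if m else None

print(partial_match_start("I was about to say a"))  # 19: the trailing "a" could still grow into "abc"
print(partial_match_start("almost there: ab"))      # 14: trailing "ab"
print(partial_match_start("no match here"))         # None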

This allows parsing of partial model outputs, whether in streaming mode or when reaching the token limit (currently, tool calls give ugly unparsed outputs when finish_reason != tool_call).

To think or not to think... what is the prompt?

I've also introduced common_chat_syntax, which wraps common_reasoning_format and common_chat_format together with:

  • thinking_forced_open: whether the prompt was detected to end w/ a (model-specific) <think> tag to force thinking mode
  • reasoning_in_content: whether the thinking tags should be left in the content, which is currently the case in streaming mode, matching what the DeepSeek API does.

This allows streaming back a standard <think>... syntax even for models that use a different set of tags (e.g. Command R7B). And of course, --reasoning-format none is still allowed to get the raw output.

Note: Ideally, we'd stream the thoughts as a reasoning_content delta (now trivial to implement), but for now we are just aiming for compatibility w/ DeepSeek's API (if --reasoning-format deepseek, which is the default).
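
As a rough illustration of what that normalization amounts to (the Command R7B tag names come from its template; the function below is a toy sketch, not the actual C++):

def normalize_thoughts(output: str, open_tag: str, close_tag: str, thinking_forced_open: bool) -> str:
    """Toy illustration: map model-specific thinking tags onto the DeepSeek-style
    <think>...</think> syntax, re-adding the opening tag when the chat template
    already appended it to the prompt (thinking_forced_open)."""
    if thinking_forced_open and not output.lstrip().startswith(open_tag):
        output = open_tag + output
    return output.replace(open_tag, "<think>").replace(close_tag, "</think>")

# QwQ / DeepSeek R1: the template ends the prompt with <think>, so the output starts mid-thought.
print(normalize_thoughts("The user wants a greeting...</think>Hello!", "<think>", "</think>", True))
# -> <think>The user wants a greeting...</think>Hello!

# Command R7B: different tag vocabulary, normalized to the same syntax.
print(normalize_thoughts("<|START_THINKING|>plan...<|END_THINKING|>Done.",
                         "<|START_THINKING|>", "<|END_THINKING|>", False))
# -> <think>plan...</think>Done.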

Triggering thoughts 😓

I noticed DeepSeek R1 Qwen 7B sometimes obsesses over the tool call syntax and "thinks" about how it's gonna call it... which triggers the lazy grammars for said calls before the thoughts are closed.

To address this, I made it possible for common_chat_templates_apply to create trigger regexes that match on the entire output (this was already the case in the sampler). COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL (renamed from _START) is now expected to have a single capturing group from the start of which the grammar sampler will be activated.
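
For intuition, here is a hypothetical full-output trigger pattern in that spirit (the real triggers are model-specific and built by common_chat_templates_apply; the Hermes-style <tool_call> tag is just an example):

import re

# Hypothetical trigger in the spirit of COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL:
# a <tool_call> only triggers if any leading <think> block has been closed, and the
# grammar sampler is activated from the start of the single capturing group.
TRIGGER = re.compile(r"^(?:<think>[\s\S]*?</think>)?(?:(?!<think>)[\s\S])*?(<tool_call>)")

def grammar_activation_offset(output: str):
    m = TRIGGER.match(output)
    return m.start(1) if m else None

print(grammar_activation_offset("<think>I could call <tool_call> here..."))       # None (still thinking)
print(grammar_activation_offset("<think>plan</think><tool_call>{\"name\": ..."))  # 19 (thoughts closed)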

Functionary v3.2 w/ raw python

Ask bartowski/functionary-small-v3.2-GGUF:Q4_K_M to write a hello world in Python and it outputs python\n{"code": "print('hey')"}.

But ask it to print a hello world in python w/ matplotlib, and it uses its raw multiline python syntax python\nprint('hey')\n# many other lines. This is now supported.

TODOs

  • Fix tool call id attribution logic (disabled for now) from tool-call: ensure there's always a non-empty tool call id #12292
  • Might need one last diff in the final response after a stream, say, to close any raw python code
  • Decide what to do about logprobs for tools mode (right now, forbidden; we don't return diffs for every token, for instance if a function name is in multiple tokens we don't want to send its name in chunks)
    • Edit: OpenAI returns null logprobs in tool call mode. Just need to ensure normal mode doesn't regress (test failing atm)
  • Fix Mistral Nemo crash (llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L)
  • Send partial regex (common_regex) as separate PR: common: add partial regex support #12808
  • Send partial JSON (common_json) as separate PR(?) or fold into chat-parser.cpp
  • Command R7B's non-tool-calling template (they have 3 templates) forces <|START_RESPONSE|> at the end of the prompt. Output will contain an <|END_RESPONSE|> that needs handling (would fit nicely in new common_chat_syntax struct). Maybe combine w/ forced/disabled thinking modes as a follow up PR
  • Add some docs
  • Add more tests
  • Run scripts/tool_bench.sh to compare against master (+ compare timings)

Future follow ups:

  • To make this faster, I suggest two options:
    • Wait for the project to switch to C++20 & turn all the parser functions into resumable coroutines (feed them tokens and persist their state in the slot)
    • Only compute and send deltas after N milliseconds

cc/ @jpohhhh

@github-actions bot added labels: documentation (Improvements or additions to documentation), testing (Everything test related), examples, python (python script changes), server — Mar 14, 2025
@colout commented Apr 27, 2025

I've been using this successfully with Llama 3.1 8B-Instruct, but have encountered a compatibility issue between this implementation and PydanticAI's OpenAI API compatible client, and I don't know which side is correct.

I'm experiencing this issue consistently across all models tested on open-webui, which leverages langchain as its backend. I've documented reproduction steps below using langchain directly in Python.

For context, this PR would be transformative for the local-LLM community. While open-webui recently implemented native tool calling, it currently functions only when streaming is enabled. If llama.cpp could support native tool calls during streaming, this would finally enable proper tool utilization with local models.

Steps to Reproduce:

  1. Using the CPU build of this branch, launch the server with the bartowski/microsoft_Phi-4-mini-instruct-GGUF model as specified in the PR instructions. My docker-compose.yaml is available in the foldout below if you wish to replicate my exact setup.

  2. Execute a langchain tool calling agent in streaming mode. To replicate my exact setup, run pip install langchain langchain-openai and run the Python code provided in the foldout below.

  3. The Python logs show repeated tool call names: "add_two_numbersadd_two_numbersadd_two_numbersadd_two_numbersadd_two_numbersadd_two_numbersadd_two_numbersadd_two_numbersadd_two_numbersadd_two_numbersadd_two_numbers". In llama.cpp server logs, JSON parsing errors appear (possibly expected in this build) and the complete tool name is sent each time instead of deltas (data stream, to_send: data: {"choices":[{ ... "delta":{"tool_calls":[{ ... "function":{"name":"add_two_numbers", ... }}]}}]}).

Let me know if I can help in any way.

langchain-toolcall.py (in case it matters, I used Python 3.11). Create a langchain-toolcall.py file as such:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain.globals import set_debug
set_debug(True)

@tool
def add_two_numbers(x: float, y: float) -> float:
    """Add 'x' and 'y'."""
    return x + y

prompt = ChatPromptTemplate.from_messages([
    ("system", "you're a helpful assistant"),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

tools = [add_two_numbers]


llm = ChatOpenAI(
    model="llama-cpp-model",
    api_key="sk-null",
    base_url="http://localhost:8080/v1",
    disable_streaming=False,
)

agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

print(agent_executor.invoke({"input": "what's 3 plus 5"}))
docker-compose.yaml: create a docker-compose.yaml file as such and run `docker compose up --build -d`.
services:
  llama-cpp-server:
    build:
      context: https://github.com/ochafik/llama.cpp.git#tool-diffs
      dockerfile: .devops/cpu.Dockerfile
      target: full
    ports:
      - "0.0.0.0:8080:8080"
    command: --jinja --alias llama-cpp-model --host 0.0.0.0 --verbose -fa -hf bartowski/microsoft_Phi-4-mini-instruct-GGUF
    entrypoint: ./llama-server

@cgruver commented Apr 28, 2025

@ochafik Fantastic work here. Do you have an ETA for getting this PR ready to merge?

I'm experimenting with the Continue.dev VSCode extension in OpenShift Dev Spaces (Eclipse Che), and using Llama.cpp to serve models from the OpenShift cluster.

This PR is a breakout feature for Llama.cpp IMO.

Cheers.

@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR introduces support for streaming tool calls and thinking outputs in OpenAI delta format while improving handling for various thinking models and truncated outputs. Key changes include updates to chat response handling (via a new common_chat_syntax structure), modifications in the test suite to support both streamed and non‐streamed modes, and enhancements to JSON and regex partial parsing for resilient output healing.

Reviewed Changes

Copilot reviewed 26 out of 28 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
models/templates/README.md Added a new template command for Qwen/QwQ-32B, updating the available templates list.
examples/server/utils.hpp Removed outdated inline utility functions and refactored tool/grammar handling logic.
examples/server/tests/utils.py Added a new helper (make_any_request) to better support streamed and non‐streamed test requests.
examples/server/tests/unit/test_tool_call.py Updated tests to parameterize streaming mode and adjust assertions for tool call IDs.
examples/server/server.cpp Refactored OAI-compatible responses using the new common_chat_syntax and improved delta/diff handling.
docs/function-calling.md Added cautionary documentation regarding extreme KV quantizations.
common/sampling.cpp Simplified trigger pattern handling by removing legacy support for pattern start triggers.
common/regex-partial.* New partial regex parser implementation, including a reversed partial-matching engine.
common/json-partial.* Introduced new JSON healing and partial parsing functions for resilient parsing of streaming outputs.
common/chat*.{h,cpp} and common/chat-parser*.{h,cpp} Overhauled chat message parsing and conversion to support partial, streaming, and healing of outputs.
Files not reviewed (2)
  • common/CMakeLists.txt: Language not supported
  • models/templates/Qwen-QwQ-32B.jinja: Language not supported
Comments suppressed due to low confidence (1)

common/chat-parser.cpp:180

  • [nitpick] The parser frequently throws common_chat_msg_partial_exception for incomplete inputs. Ensure that comprehensive unit tests cover these partial parsing scenarios and that additional error logging is provided to aid debugging in streaming contexts.
throw common_chat_msg_partial_exception(regex.str());

assert res.status_code == 200, f"Expected status code 200, got {res.status_code}"
choice = res.body["choices"][0]
# assert res.status_code == 200, f"Expected status code 200, got {res.status_code}"
choice = body["choices"][0]
tool_calls = choice["message"].get("tool_calls")
assert tool_calls and len(tool_calls) == 1, f'Expected 1 tool call in {choice["message"]}'
tool_call = tool_calls[0]
assert choice["message"].get("content") in (None, ""), f'Expected no content in {choice["message"]}'

Copilot AI Apr 30, 2025

[nitpick] Consider adding a comment explaining why the check for a non-empty tool call id is currently disabled. This will help future maintainers understand the rationale behind omitting this assertion.

Suggested change
assert choice["message"].get("content") in (None, ""), f'Expected no content in {choice["message"]}'
assert choice["message"].get("content") in (None, ""), f'Expected no content in {choice["message"]}'
# The following assertion is disabled because tool call IDs are currently optional
# in the system's response. If this changes in the future, this assertion can be
# re-enabled to enforce the presence of a non-empty tool call ID.

@@ -349,11 +353,14 @@ struct server_task {
{
auto it = data.find("chat_format");
if (it != data.end()) {
params.oaicompat_chat_format = static_cast<common_chat_format>(it->get<int>());
SRV_INF("Chat format: %s\n", common_chat_format_name(params.oaicompat_chat_format).c_str());
params.oaicompat_chat_syntax.format = static_cast<common_chat_format>(it->get<int>());

Copilot AI Apr 30, 2025

The transition from using a simple chat format enum to a full common_chat_syntax struct enhances flexibility but consider adding inline documentation or comments on the new fields (reasoning_format, reasoning_in_content, thinking_forced_open) to aid readability and backward compatibility.

Comment on lines 71 to 96
auto it = pattern.begin();
const auto end = pattern.end();

std::function<std::string()> process = [&]() {
    std::vector<std::vector<std::string>> alternatives(1);
    std::vector<std::string> * sequence = &alternatives.back();

    while (it != end) {
        if (*it == '[') {
            auto start = it;
            ++it;
            while (it != end) {
                if (*it == '\\' && (++it != end)) {
                    ++it;
                } else if (*it == ']') {
                    break;
                } else {
                    ++it;
                }
            }
            if (it == end) {
                throw std::runtime_error("Unmatched '[' in pattern");
            }
            ++it;
            sequence->push_back(std::string(start, it));
        } else if (*it == '*' || *it == '?' || *it == '+') {

Copilot AI Apr 30, 2025

[nitpick] The logic in regex_to_reversed_partial_regex is rather complex. Consider adding more detailed inline comments or function-level documentation describing the steps and edge cases to improve maintainability and ease future debugging.

Suggested change
auto it = pattern.begin();
const auto end = pattern.end();
std::function<std::string()> process = [&]() {
    std::vector<std::vector<std::string>> alternatives(1);
    std::vector<std::string> * sequence = &alternatives.back();
    while (it != end) {
        if (*it == '[') {
            auto start = it;
            ++it;
            while (it != end) {
                if (*it == '\\' && (++it != end)) {
                    ++it;
                } else if (*it == ']') {
                    break;
                } else {
                    ++it;
                }
            }
            if (it == end) {
                throw std::runtime_error("Unmatched '[' in pattern");
            }
            ++it;
            sequence->push_back(std::string(start, it));
        } else if (*it == '*' || *it == '?' || *it == '+') {

// Initialize iterators for traversing the regex pattern.
auto it = pattern.begin();
const auto end = pattern.end();
// Lambda function to process the regex pattern and construct reversed partial regex components.
std::function<std::string()> process = [&]() {
    // `alternatives` stores groups of alternative sequences (e.g., for patterns separated by '|').
    std::vector<std::vector<std::string>> alternatives(1);
    // `sequence` points to the current sequence being constructed.
    std::vector<std::string> * sequence = &alternatives.back();
    // Main loop to parse the regex pattern character by character.
    while (it != end) {
        if (*it == '[') { // Handle character classes (e.g., [a-z]).
            auto start = it; // Mark the start of the character class.
            ++it;
            while (it != end) {
                if (*it == '\\' && (++it != end)) { // Skip escaped characters.
                    ++it;
                } else if (*it == ']') { // End of character class.
                    break;
                } else {
                    ++it; // Continue parsing the character class.
                }
            }
            if (it == end) { // Error: unmatched '['.
                throw std::runtime_error("Unmatched '[' in pattern");
            }
            ++it; // Include the closing ']' in the character class.
            sequence->push_back(std::string(start, it)); // Add the character class to the sequence.
        } else if (*it == '*' || *it == '?' || *it == '+') { // Handle quantifiers.

Comment on lines +133 to +134
try {
    auto _ = json::parse(str); // NOLINT

Copilot AI Apr 30, 2025

[nitpick] The JSON healing mechanism iteratively tests parseability on potentially large substrings. It may be worthwhile to benchmark this logic under heavy load and consider potential optimizations if performance becomes an issue.

Suggested change
try {
    auto _ = json::parse(str); // NOLINT

size_t left = 0, right = str.size();
while (left < right) {
    size_t mid = left + (right - left) / 2;
    try {
        auto _ = json::parse(str.substr(0, mid)); // NOLINT
        left = mid + 1; // Move right if parse succeeds
    } catch (const std::exception &) {
        right = mid; // Move left if parse fails
    }
}
// Final check to confirm parseability of the largest substring
try {
    auto _ = json::parse(str.substr(0, left)); // NOLINT

@cgruver commented Apr 30, 2025

@ochafik @ericcurtin PTAL - ochafik#3

@drrros commented May 13, 2025

Hope this PR won't be forgotten or dropped; without it, some interesting recent tools don't work with LCPP (in particular https://github.com/bytedance/deer-flow gives an error: openai.InternalServerError: Error code: 500 - {'error': {'code': 500, 'message': 'Cannot use tools with stream', 'type': 'server_error'}})

@strawberrymelonpanda (Contributor) commented May 13, 2025

I'm stoked and waiting for this as well, but sadly many MCP tools currently seem to have some compatibility problems with Llama.cpp from what I've seen. Dive, Aider's MCP PR, and others I've tried. Streaming support would make a big difference, but I'm honestly not sure it's the only issue.

Perhaps I just resolved the conflicts incorrectly when pulling and merging this PR, or perhaps it's not far enough along yet.

(Roo-Code's MCP calls works great with or without streaming as it seems to work around tool calling, likely in the normal prompts.)

@ochafik (Collaborator, Author) commented May 14, 2025

Sorry everyone for the lack of activity. Perfect storm of job change and life events (all good!). Will try and push this (and related PRs) through in the next week, as I'm unsure how much I'll be able to do afterwards 😅.

@strawberrymelonpanda (Contributor)
@ochafik Congrats and good luck. Obviously I think folks just want to encourage, not demand. Life always comes first.

@drrros commented May 14, 2025

@ochafik @strawberrymelonpanda absolutely right — not in any way demanding! Best wishes!

@ericcurtin (Collaborator)
> Sorry everyone for the lack of activity. Perfect storm of job change and life events (all good!). Will try and push this (and related PRs) through in the next week, as I'm unsure how much I'll be able to do afterwards 😅.

Best of luck in the new role!

@strawberrymelonpanda (Contributor) commented May 15, 2025

> Streaming support would make a big difference, but I'm honestly not sure it's the only issue.

This merged PR on Minja a couple of hours ago (which I believe Llama.CPP uses) might just solve the problem I was having above.

@cgruver commented May 16, 2025

For those who are following this PR, I am trying to maintain a merge from this branch and the master branch of llama.cpp here - https://github.com/cgruver/llama.cpp/tree/tools

@ochafik ochafik marked this pull request as ready for review May 16, 2025 23:03
@ochafik ochafik requested a review from ngxson as a code owner May 16, 2025 23:03
@ochafik ochafik marked this pull request as draft May 16, 2025 23:41
Labels
documentation (Improvements or additions to documentation), examples, python (python script changes), script (Script related), server, testing (Everything test related), tool calling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Eval bug: llama-cpp-deepseek-r1.jinja template will miss the <think> tag