server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) #13771


Merged
ochafik merged 13 commits into ggml-org:master on May 25, 2025

Conversation

ochafik
Collaborator

@ochafik ochafik commented May 25, 2025

This allows disabling thinking for all supported thinking models (QwQ, DeepSeek R1 distills, Qwen3, Command R7B) when the flag --reasoning-budget 0 is set.

For per-request behaviour, see #13272 (discussion on upcoming reasoning budget request param) and #13196 (support passing generic kvs).

cc/ @matteoserva
cc/ @ngxson Not sure about the slight alteration of the semantics of the CLI flag (updated docs + inline help), but doesn't feel worth adding a separate flag at this stage, wdyt?

@github-actions github-actions bot added the testing (Everything test related), examples, python (python script changes), and server labels May 25, 2025
@ngxson
Collaborator

ngxson commented May 25, 2025

yes this can be useful, I thought about it in #13272 , which is part of my idea about implementing the thinking budget.

just to avoid confusion between none and disabled, I think it's better to call this flag nothink instead. In the future, we may also want to add a hidden mode which still allows the model to generate thoughts, but hides them from the response

@CISC
Collaborator

CISC commented May 25, 2025

Consider adding Granite's thinking option in its chat template, which changes the system prompt. Basically the inverse of Qwen3's option.

@ochafik
Collaborator Author

ochafik commented May 25, 2025

Consider adding Granite's thinking option in its chat template, which changes the system prompt. Basically the inverse of Qwen3's option.

@CISC I hadn't seen that one, thanks for bringing this up! Strong case for support through @ngxson's #13272, the request param could override the flag then, or something.

@ochafik ochafik changed the title from "server: add --reasoning-format=disabled to disable thinking (incl. qwen3 w/ enable_thinking:false)" to "server: add --reasoning-format=nothink to disable thinking (incl. qwen3 w/ enable_thinking:false)" May 25, 2025
@ochafik ochafik marked this pull request as ready for review May 25, 2025 09:44
@ochafik ochafik requested a review from ngxson as a code owner May 25, 2025 09:44
common/arg.cpp Outdated
"controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:\n"
"- none: leaves thoughts unparsed in `message.content`\n"
"- deepseek: puts thoughts in `message.reasoning_content` (except in streaming mode, which behaves as `none`)\n"
"- nothink: prevents generation of thoughts (forcibly closing thoughts tag or setting template-specific variables such as `enable_thinking: false` for Qwen3)\n"
Collaborator


doesn't feel worth adding a separate flag at this stage, wdyt?

Tbh I think we should still separate it into another flag. The format flag should only format the response, not change the behavior, but here nothink changes the generation behavior
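
To make the "format only" side of that distinction concrete, here is a minimal Python sketch of what the none and deepseek formats described in the help text do to an already generated response. The helper name `format_message` and the regular expression are assumptions made for illustration; this is not the llama.cpp implementation.

```python
import re

# Hypothetical helper for illustration only; not the llama.cpp parser.
# It assumes thoughts are wrapped in a single <think>...</think> block.
THINK_RE = re.compile(r"<think>\n?(.*?)\n?</think>\s*", re.DOTALL)


def format_message(content: str, reasoning_format: str) -> dict:
    """Shape an assistant message according to the reasoning format."""
    message = {"role": "assistant", "content": content}
    if reasoning_format == "deepseek":
        match = THINK_RE.search(content)
        if match:
            # Move the thoughts out of `content` into `reasoning_content`.
            message["reasoning_content"] = match.group(1).strip()
            message["content"] = content[:match.start()] + content[match.end():]
    # "none" leaves the thoughts unparsed in `content`.
    return message


raw = "<think>\nThe user greets me.\n</think>\n\nHello!"
print(format_message(raw, "none"))
print(format_message(raw, "deepseek"))
```

Both formats post-process text the model has already produced, whereas nothink has to act before generation starts, which is the argument for a separate flag.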

Collaborator


I think it's ok to just add a flag called --reasoning-budget and only support either -1 (unlimited budget) or 0 (no think) for now
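
A rough Python sketch of the two-value semantics proposed here, under stated assumptions: `parse_reasoning_budget` and `generation_prefix` are hypothetical names, and the exact tag text and whitespace used to close the thought block are guesses for illustration, not what the server emits.

```python
def parse_reasoning_budget(value: str) -> int:
    """Accept only the two values proposed for now: -1 and 0."""
    budget = int(value)
    if budget not in (-1, 0):
        raise ValueError(
            "reasoning budget must be -1 (unlimited) or 0 (no thinking)"
        )
    return budget


def generation_prefix(budget: int,
                      open_tag: str = "<think>",
                      close_tag: str = "</think>") -> str:
    """Text to prefill at the start of the assistant turn.

    Assumption: with a budget of 0 the thought block is opened and closed
    immediately, so the model continues straight into its answer.
    """
    if budget == 0:
        return f"{open_tag}\n\n{close_tag}\n\n"
    return ""


print(repr(generation_prefix(parse_reasoning_budget("0"))))
print(repr(generation_prefix(parse_reasoning_budget("-1"))))
```

Keeping the value numeric leaves room for positive budgets later without renaming the flag.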

@ngxson ngxson changed the title from "server: add --reasoning-format=nothink to disable thinking (incl. qwen3 w/ enable_thinking:false)" to "server: add --reasoning-budget to disable thinking (incl. qwen3 w/ enable_thinking:false)" May 25, 2025
@ngxson ngxson changed the title from "server: add --reasoning-budget to disable thinking (incl. qwen3 w/ enable_thinking:false)" to "server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false)" May 25, 2025
@ochafik ochafik merged commit e121edc into ggml-org:master May 25, 2025
48 checks passed
@countzero

@ngxson & @ochafik I have a question regarding the usage. Simply adding --reasoning-budget 0 does not stop Qwen3 from outputting <think> tags and reasoning before answering. Am I missing something?

llama-server `
    --model 'D:\AI\LLM\gguf\Qwen3-30B-A3B\Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --alias 'Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 99 `
    --reasoning-budget 0 `
    --flash-attn

This request:

curl.exe http://127.0.0.1:8080/v1/chat/completions `
    --silent `
    --header "Content-Type: application/json" `
    --data '{
        \"model\": \"Qwen3-30B-A3B.IQ3_XXS.gguf\",
        \"messages\": [
            {
                \"role\": \"user\",
                \"content\": \"How are you?\"
            }
        ],
        \"temperature\": 0.6,
        \"max_tokens\": 1024
    }'

Returns the following:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>\nOkay, the user asked, \"How are you?\" I need to respond appropriately. Since I'm an AI, I don't have feelings, but I should keep the response friendly and helpful. Maybe say something like, \"I'm just a bunch of code, but I'm doing great! How can I assist you today?\" That's positive and shifts the focus back to the user. Let me make sure it's concise and friendly. Yep, that works.\n</think>\n\nI'm just a bunch of code, but I'm doing great! How can I assist you today? 😊"
      }
    }
  ],
  "created": 1748251147,
  "model": "Qwen3-30B-A3B.IQ3_XXS.gguf",
  "system_fingerprint": "b5490-fef693dc",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 121,
    "prompt_tokens": 12,
    "total_tokens": 133
  },
  "id": "chatcmpl-Ihg3Q1yUsY6rFGKOnOXr6hbRtTR42v2e",
  "timings": {
    "prompt_n": 12,
    "prompt_ms": 69.177,
    "prompt_per_token_ms": 5.76475,
    "prompt_per_second": 173.46806019341687,
    "predicted_n": 121,
    "predicted_ms": 893.017,
    "predicted_per_token_ms": 7.3803057851239675,
    "predicted_per_second": 135.49574084255954
  }
}
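
For anyone who wants to script this check instead of reading the JSON by eye, a small Python sketch follows. It mirrors the curl request above and only looks for a literal <think> tag in `message.content`; the function names are made up for illustration, and the endpoint and field names are taken from the request and response shown in this comment.

```python
import json
import urllib.request


def build_request(model: str, prompt: str,
                  temperature: float = 0.6, max_tokens: int = 1024) -> dict:
    """Build the same chat completion payload as the curl call above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def contains_thoughts(response: dict) -> bool:
    """True if the first choice still carries a <think> block in content."""
    content = response["choices"][0]["message"].get("content") or ""
    return "<think>" in content


def chat(base_url: str, payload: dict) -> dict:
    """POST the payload to /v1/chat/completions and decode the JSON reply."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Usage against a locally running server (not executed here):
# reply = chat("http://127.0.0.1:8080",
#              build_request("Qwen3-30B-A3B.IQ3_XXS.gguf", "How are you?"))
# print("still thinking" if contains_thoughts(reply) else "thinking disabled")
```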

@kth8

kth8 commented May 26, 2025

@countzero You need to start the server with --jinja in addition to --reasoning-budget 0

@countzero

@kth8 Thank you for the hint. That indeed works now:

llama-server `
    --model 'D:\AI\LLM\gguf\Qwen3-30B-A3B\Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --alias 'Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 99 `
    --reasoning-budget 0 `
    --jinja `
    --flash-attn

@ngxson & @ochafik As a developer I would like to use the --reasoning-budget argument without having to know about the --jinja flag, so that I can simply use what I read in the usage documentation directly.

Suggestion: Activate --jinja automatically if --reasoning-budget needs it. I think a similar mechanism is already implemented for other flags.
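
Until something like that exists, the behaviour can be approximated outside the server. The Python sketch below is a hypothetical launcher helper, not llama.cpp code and not a description of any existing PR: it appends --jinja whenever --reasoning-budget 0 is passed without it.

```python
def with_implied_flags(argv: list[str]) -> list[str]:
    """Return a copy of argv with --jinja added when thinking is disabled.

    Assumption: only the space-separated form `--reasoning-budget 0`
    is handled, as used in the commands in this thread.
    """
    args = list(argv)
    if "--jinja" in args:
        return args
    for i, arg in enumerate(args[:-1]):
        if arg == "--reasoning-budget" and args[i + 1] == "0":
            args.append("--jinja")
            break
    return args


print(with_implied_flags(["llama-server", "--reasoning-budget", "0"]))
```

The resulting list can be handed to `subprocess.run` to start the server.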

@characharm
Contributor

Please take a look: #13877

@jacekpoplawski
Copy link

I am not able to get reasoning-budget to work

CUDA_VISIBLE_DEVICES=0,1 llama-cli --reasoning-budget 0 -ngl 99 -fa -ctv q8_0 -ctk q8_0 -m /mnt/models3/Qwen_Qwen3-4B-Q8_0.gguf -p "what is the capital of Poland?" 2>/dev/null
user
what is the capital of Poland?
assistant
<think>
Okay, the user is asking for the capital of Poland. I know that Poland is a country in Europe, and I remember that its capital is Warsaw. But wait, I should make sure I'm not mixing it up with another city. Let me think. Poland's capital is indeed Warsaw. I think that's correct. But maybe I should double-check. Let me recall some facts. Poland has several cities, like Kraków, Wrocław, and Gdańsk, but the capital is Warsaw. Yes, I'm pretty sure that's right. I think Warsaw is the capital. So the answer should be Warsaw. But wait, maybe the user is testing if I know that Warsaw is the capital. I should confirm. Let me think of other countries. For example, the capital of France is Paris, Germany is Berlin, and so on. Poland's capital is Warsaw. Yeah, that's right. I don't think there's any confusion here. So the answer is Warsaw.
</think>

The capital of Poland is **Warsaw**. It is the largest city in the country and serves as its political, cultural, and economic center.

>
CUDA_VISIBLE_DEVICES=0,1 llama-cli --reasoning-budget 0 -ngl 99 -fa -ctv q8_0 -ctk q8_0 -m /mnt/models3/Qwen_QwQ-32B-Q8_0.gguf -p "what is the capital of Poland?" 2>/dev/null
user
what is the capital of Poland?
assistant
<think>
Okay, the user is asking for the capital of Poland. Let me think... I remember that Warsaw is the capital, but I should make sure I'm not confusing it with another city. Let me verify. Poland is a country in Central Europe. The capital cities can sometimes be tricky because some countries have capitals that aren't their most famous cities, like Brazil's capital is Brasília, not Rio de Janeiro. But in Poland's case, Warsaw is definitely the largest city and the capital.

Wait, maybe I should recall some historical context. Warsaw was the site of the Warsaw Uprising during World War II, which I think was the largest single military effort by any European resistance movement during the war. That reinforces that it's an important city. Also, the Palace of Culture and Science is a landmark there, which is a gift from the Soviet Union. Yeah, so that's in Warsaw.

I don't think there's been any recent changes to the capital. Poland has been independent since the end of Communist rule, but the capital hasn't changed in modern times. Kraków is another major city in Poland, but it's not the capital. Sometimes people might confuse them because Kraków was the historical capital before Warsaw became the capital in 1596. So historically, there was a shift, but currently, Warsaw is definitely the capital.

Let me think if there's any other possible confusion. Maybe some might think about Gdańsk because of the Solidarity movement, but that's a different city on the coast. So no, the answer should be Warsaw. I can't think of any recent news where the capital would have changed. Therefore, I'm confident that the capital of Poland is Warsaw.
</think>

The capital of Poland is **Warsaw** (Warszawa in Polish). It has been the country's political, cultural, and economic center since the 16th century. Warsaw is known for its rich history, including its role in World War II and its subsequent reconstruction. Key landmarks include the Royal Castle, Old Town (Stare Miasto), and the Palace of Culture and Science.

@kth8

kth8 commented Jun 1, 2025

@jacekpoplawski you didn't run with --jinja like mentioned in previous comments

@jacekpoplawski

jacekpoplawski commented Jun 1, 2025

does it work for you with --jinja?
UPDATE: it works, but only with server, not with cli

jacek@AI-SuperComputer:~$ CUDA_VISIBLE_DEVICES=0,1 llama-cli --jinja --reasoning-budget 0 -ngl 99 -fa -ctv q8_0 -ctk q8_0 -m /mnt/models3/Qwen_Qwen3-4B-Q8_0.gguf -p "what is the capital of Poland?" 2>/dev/null
user
what is the capital of Poland?
assistant
<think>
Okay, the user is asking for the capital of Poland. I need to make sure I give the correct answer. First, I recall that Poland is a country in Central Europe. The capital is a city that's well-known for its history and culture. I think it's Warsaw. But wait, I should double-check that to be sure.

Let me think. Poland's major cities include Warsaw, Kraków, Wrocław, and others. But the capital is usually the most important city, especially in terms of government and politics. I remember that Warsaw was the capital during the interwar period, and even after World War II, it remained the capital. There was a period when the capital was moved to Kraków, but that was during the time when the country was under German occupation. However, after the war, Warsaw was restored as the capital. So yes, Warsaw is the capital of Poland. I should confirm that there's no other city that's more commonly referred to as the capital. Maybe some people confuse it with other cities, but I'm pretty sure it's Warsaw. The official name is Warszawa. So the answer is Warsaw, and the official name is Warszawa. I need to present that clearly.
</think>

The capital of Poland is **Warsaw** (in Polish: **Warszawa**). It is the country's political, cultural, and economic center, home to the Polish parliament (Sejm) and government institutions. Warsaw has a rich history, including being the capital during the Polish–Lithuanian Commonwealth and after World War II, when it was rebuilt as the heart of a unified Poland.

@jacekpoplawski

server works "kind of", but I think this is a problem with QwQ itself
reasoning
command was:
CUDA_VISIBLE_DEVICES=0,1 llama-server --reasoning-budget 0 --jinja -ngl 99 -fa -ctv q8_0 -ctk q8_0 -m /mnt/models3/Qwen_QwQ-32B-Q8_0.gguf --host 0.0.0.0

Labels
examples python python script changes server testing Everything test related
7 participants