[TRTLLM-11357][feat] Support interleaved thinking for trtllm-serve#12199
JunyiXu-nv merged 5 commits into NVIDIA:main from JunyiXu-nv:user/junyix/fix-trtllm-11357
Conversation
📝 Walkthrough

The PR introduces support for MiniMax-M2 and Kimi K2 models through new reasoning and tool parsers.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant ReasoningParser as KimiK2ReasoningParser
    participant ToolParser as MiniMaxM2ToolParser
    participant Output

    Client->>ReasoningParser: parse_delta(token_stream)
    activate ReasoningParser
    ReasoningParser->>ReasoningParser: Buffer delta text
    alt Detect <think> tag
        ReasoningParser->>ReasoningParser: Mark reasoning started
    end
    alt Detect </think> or tool_calls marker
        ReasoningParser->>ReasoningParser: Mark reasoning ended
        ReasoningParser->>Output: ReasoningParserResult(reasoning_content)
    end
    ReasoningParser->>Output: ReasoningParserResult(content)
    deactivate ReasoningParser
    Output->>ToolParser: parse_streaming_increment(content)
    activate ToolParser
    alt Detect <minimax:tool_call> start
        ToolParser->>ToolParser: Initialize tool invocation buffer
    end
    ToolParser->>ToolParser: Extract function name, parameters
    ToolParser->>ToolParser: Convert parameters with type inference
    alt Detect </minimax:tool_call> end
        ToolParser->>Output: StreamingParseResult(ToolCallItem)
    end
    deactivate ToolParser
    Output-->>Client: Combined reasoning + tool results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/llmapi/reasoning_parser.py`:
- Around line 75-76: The MiniMax parser registrations ("minimax_m2" and
"minimax_m2_append_think") reuse DeepSeekR1Parser.parse_delta(), which fails
when a single delta contains both the post-reasoning tail and the next "<think>"
opener (e.g. "reason1</think>text1<think>reason2") because it emits the entire
tail as plain content instead of splitting and reopening a reasoning block;
update the parse_delta implementation used by these registrations (or add an
overriding wrapper) to detect a "</think>...<think>" pattern in the incoming
delta, split the tail into post-reasoning content and the reopened reasoning
segment, emit the post-reasoning text as content, then emit a token/event to
reopen a reasoning block before passing the remaining text back into the
existing parsing flow (use the register_reasoning_parser handlers and
DeepSeekR1Parser.parse_delta as reference points when inserting the
split-and-reopen logic).
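The split-and-reopen behavior this comment asks for can be sketched as a standalone helper. This is a minimal illustration, not the actual fix: the helper name `split_reopened_reasoning` and the tuple return shape are hypothetical, and the real change would wire this into `DeepSeekR1Parser.parse_delta` via the registration handlers.

```python
END_TAG, START_TAG = "</think>", "<think>"

def split_reopened_reasoning(delta: str):
    """Split a delta that may contain '</think>...<think>' into
    (reasoning, content, reopened_reasoning) segments.

    Assumes the parser is currently inside a reasoning block, so text
    before '</think>' (or the whole delta, absent the tag) is reasoning.
    """
    reasoning, content, reopened = delta, "", ""
    if END_TAG in delta:
        reasoning, tail = delta.split(END_TAG, 1)
        if START_TAG in tail:
            # The model reopened a reasoning block inside the same delta:
            # the tail before '<think>' is plain content, the remainder
            # starts a new reasoning segment.
            content, reopened = tail.split(START_TAG, 1)
        else:
            content = tail
    return reasoning, content, reopened
```

With the failing example from the comment, `"reason1</think>text1<think>reason2"` splits into reasoning `"reason1"`, content `"text1"`, and a reopened reasoning segment `"reason2"` that can be fed back into the normal parsing flow.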
- Around line 267-322: The parser currently discards any text before a found
<think> (and leaks partial start-tags) when self.in_reasoning is False; fix by
preserving the prefix as content and buffering partial start-tag suffixes.
Specifically, in the branch that computes begin_idx from self.reasoning_start,
when begin_idx != -1 set content = delta_text[:begin_idx] and set
reasoning_content = delta_text[begin_idx + len(self.reasoning_start):] (and set
self.in_reasoning True); when begin_idx == -1 do not always clear self._buffer —
detect a trailing partial prefix of self.reasoning_start or
self.tool_section_start (e.g. last '<' suffix) and set self._buffer to that
suffix while returning content=delta_text up to that suffix, otherwise clear
self._buffer and return content=delta_text; update uses of self.in_reasoning,
self._buffer, reasoning_content, begin_idx and self.reasoning_start accordingly.
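The partial-tag buffering described above boils down to "never emit a trailing fragment that could still grow into a special tag." A minimal sketch (the helper name `hold_partial_tag` is hypothetical; the real parser tracks this state in `self._buffer`):

```python
def hold_partial_tag(text, tags=("<think>", "<|tool_calls_section_begin|>")):
    """Return (emit, buffer): 'emit' is safe to surface as content now,
    'buffer' is a trailing fragment that may still complete one of 'tags'
    in a later delta."""
    for start in range(len(text)):
        suffix = text[start:]
        # A proper prefix of a tag (not the full tag) must be held back.
        if any(tag.startswith(suffix) and suffix != tag for tag in tags):
            return text[:start], suffix
    return text, ""
```

So a chunk ending in `"<|tool"` surfaces only the text before it and buffers `"<|tool"` until the next delta resolves whether it is a real marker.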
In `@tensorrt_llm/serve/tool_parser/minimax_m2_parser.py`:
- Around line 104-110: The detect_and_parse method currently drops text after or
around <minimax:tool_call> by returning only the prefix or empty normal_text
when an opener exists; update detect_and_parse to preserve prefix (text before
the opener) as normal_text and also detect and include any suffix after the
closing tag when present, parsing tool call content between opener and closer
into calls; when an opener exists but no closer yet, keep the prefix in
normal_text (do not return ""), buffer the remainder for streaming updates, and
only remove the tool block once its closing tag is seen. Apply the same
preservation logic to the corresponding streaming/partial-parse handlers
referenced at the other ranges (the functions handling streaming deltas) so they
similarly retain text before the opener and surface suffix text after the closer
instead of dropping it.
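The prefix/suffix preservation this comment asks for can be sketched in isolation. The function name `split_tool_block` and the three-tuple return are illustrative assumptions, not the parser's actual API:

```python
OPEN, CLOSE = "<minimax:tool_call>", "</minimax:tool_call>"

def split_tool_block(text):
    """Return (normal_text, tool_body, pending).

    normal_text keeps both the prefix before the opener and the suffix
    after the closer; pending buffers an unterminated tool block for the
    next streaming chunk instead of dropping the prefix.
    """
    start = text.find(OPEN)
    if start == -1:
        return text, "", ""
    prefix = text[:start]
    end = text.find(CLOSE, start)
    if end == -1:
        # Opener without closer yet: surface the prefix now, buffer the rest.
        return prefix, "", text[start:]
    body = text[start + len(OPEN):end]
    suffix = text[end + len(CLOSE):]
    return prefix + suffix, body, ""
```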
- Around line 33-61: The function _parse_param_value currently JSON-parses and
coerces values before checking the declared param_type, causing values declared
as "string" in the schema to be mutated (e.g., "42" -> 42, "true" -> True). Fix
by short-circuiting when param_type == "string" (return the stripped value_str
unchanged) before any json.loads or numeric/boolean conversions; otherwise keep
the existing logic (JSON parse first, then numeric/boolean fallbacks) and
preserve the original behavior for non-string types.
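The string short-circuit is simple to state in code. This is a self-contained sketch of the described fix, not the actual `_parse_param_value` implementation (whose surrounding logic and fallbacks may differ):

```python
import json

def parse_param_value(value_str, param_type):
    """Coerce a raw parameter value according to its declared type.

    Params declared 'string' are returned verbatim (stripped) so that
    json.loads cannot mutate "42" -> 42 or "true" -> True.
    """
    value_str = value_str.strip()
    if param_type == "string":
        return value_str
    try:
        return json.loads(value_str)
    except (json.JSONDecodeError, ValueError):
        pass
    # Fallback for bare boolean literals that are not valid JSON casing.
    if value_str.lower() in ("true", "false"):
        return value_str.lower() == "true"
    return value_str
```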
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6efa071d-da90-46ac-916d-81f3f4e3635a
📒 Files selected for processing (5)
- tensorrt_llm/llmapi/reasoning_parser.py
- tensorrt_llm/serve/tool_parser/minimax_m2_parser.py
- tensorrt_llm/serve/tool_parser/tool_parser_factory.py
- tests/unittest/llmapi/apps/test_tool_parsers.py
- tests/unittest/llmapi/test_reasoning_parser.py
Add support for interleaved thinking (reasoning between tool calls) for MiniMax-M2 and GLM-4.7 model families in trtllm-serve.

- Add MiniMaxM2ToolParser for <minimax:tool_call> XML format with single/parallel tool calls and streaming support
- Add Glm47ToolParser extending Glm4ToolParser for GLM-4.7 models with optional arguments support
- Register new reasoning parsers (glm45, minimax_m2, minimax_m2_append_think) using existing DeepSeekR1Parser with reasoning_at_start=True for <think>...</think> format
- Register new tool parsers (glm47, minimax_m2) in ToolParserFactory
- Add comprehensive unit tests for new parsers including streaming, parallel tool calls, and interleaved thinking integration tests

Signed-off-by: Junyi Xi <junyix@nvidia.com>
Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
- Add docstrings to all methods in Glm47ToolParser and MiniMaxM2ToolParser to meet the 80% docstring coverage threshold required by CI pre-merge checks

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
- Remove glm47_parser.py (GLM-4.7 is not in ticket scope)
- Remove glm45 reasoning parser registration
- Remove GLM-4.7 related tests
- Keep only Kimi-K2 and MiniMax-M2 as specified in ticket

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
…nking support

- Add KimiK2ReasoningParser that extends DeepSeekR1Parser to handle reasoning content implicitly ended by tool call sections (<|tool_calls_section_begin|>) without explicit </think> tags
- Support standard <think>...</think>, tool-call-interrupted reasoning, and no-reasoning patterns in both streaming and non-streaming modes
- Add comprehensive unit tests for kimi_k2 parser (non-streaming, streaming, and interleaved thinking scenarios)
- Adapted from vLLM kimi_k2_reasoning_parser.py and sglang reasoning parser implementations

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
- Fix KimiK2 streaming parser to buffer partial special tags (e.g. partial <think> or <|tool_calls_section_begin|>) instead of leaking them as content text
- Fix _parse_param_value to short-circuit for string-typed params, preventing json.loads from coercing values like "42" or "true"
- Fix MiniMaxM2ToolParser.detect_and_parse to preserve text after the closing </minimax:tool_call> tag
- Fix MiniMaxM2ToolParser.parse_streaming_increment to preserve prefix text before <minimax:tool_call> when both arrive in the same chunk
- Add tests covering partial tag buffering, string param preservation, suffix preservation, and streaming prefix handling

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
5fececb to 4cf3e6d
/bot run

PR_Github #39431 [ run ] triggered by Bot. Commit:

PR_Github #39431 [ run ] completed with state

/bot run

PR_Github #39543 [ run ] triggered by Bot. Commit:

PR_Github #39543 [ run ] completed with state
…VIDIA#12199) Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
Description
Add support for interleaved thinking in trtllm-serve, specifically for the Kimi-K2-Thinking model. This addresses the case where reasoning content may be implicitly ended by a tool call section (`<|tool_calls_section_begin|>`) without an explicit `</think>` tag.

Changes:
- `KimiK2ReasoningParser`: Extends `DeepSeekR1Parser` to detect both `</think>` and `<|tool_calls_section_begin|>` as reasoning end markers. When a tool call section starts during reasoning, the reasoning is implicitly ended and the tool call section is passed through as content.
- Handles standard `<think>...</think>` patterns, tool-call-interrupted reasoning, and no-reasoning content in both `parse()` and `parse_delta()` modes.
- `kimi_k2`: Uses `reasoning_at_start=False` (matching sglang's Qwen3Detector mapping), so the model must explicitly start reasoning with `<think>`.
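The end-marker logic described above can be sketched for the non-streaming case. This is an illustrative simplification, assuming `reasoning_at_start=False` semantics; the helper name `parse_kimi_k2` and its `(reasoning, content)` return are hypothetical, not the actual `KimiK2ReasoningParser` API:

```python
THINK_START, THINK_END = "<think>", "</think>"
TOOL_SECTION = "<|tool_calls_section_begin|>"

def parse_kimi_k2(text):
    """Split text into (reasoning, content).

    Reasoning ends at '</think>', or implicitly at the tool-call section
    marker; in the implicit case the marker stays in the content so the
    tool parser downstream can see it.
    """
    if not text.startswith(THINK_START):
        return "", text  # no reasoning block at all
    body = text[len(THINK_START):]
    candidates = [(i, m) for i, m in ((body.find(THINK_END), THINK_END),
                                      (body.find(TOOL_SECTION), TOOL_SECTION))
                  if i != -1]
    if not candidates:
        return body, ""  # reasoning still open
    idx, marker = min(candidates)
    if marker == THINK_END:
        return body[:idx], body[idx + len(marker):]
    # Implicit end: pass the tool-call section through as content.
    return body[:idx], body[idx:]
```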
Supported patterns:

- `<think>reasoning</think>content` – standard thinking
- `<think>reasoning<|tool_calls_section_begin|>...` – interleaved thinking (reasoning interrupted by tool call)
- `content` (no `<think>`) – no reasoning

Adapted from:

- `vllm/reasoning/kimi_k2_reasoning_parser.py`
- `sglang/srt/parser/reasoning_parser.py`

Test Coverage
- `test_kimi_k2_reasoning_parser` – 8 parametrized non-streaming cases including tool call interruption
- `test_kimi_k2_reasoning_parser_stream` – 7 parametrized streaming cases including buffered tool token handling
- `test_interleaved_thinking_stream` – Cross-parser interleaved thinking tests for minimax_m2, deepseek-r1, qwen3, and kimi_k2 (including tool-call-interrupted reasoning)
Please review the following before submitting your PR:
PR description clearly explains what and why.
PR Follows TRT-LLM CODING GUIDELINES.
Test cases are provided for new code paths.
Please check this after reviewing the above items as appropriate for this PR.