[TRTLLM-11357][feat] Support interleaved thinking for trtllm-serve #12199

Merged: JunyiXu-nv merged 5 commits into NVIDIA:main from JunyiXu-nv:user/junyix/fix-trtllm-11357 on Mar 20, 2026
Conversation

@JunyiXu-nv (Collaborator) commented Mar 13, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for MiniMax-M2 and Kimi K2 models with interleaved reasoning capabilities.
    • Introduced streaming tool parsing for improved real-time tool invocation handling.
  • Tests

    • Added comprehensive test coverage for new model parsers and streaming scenarios.

Description

Add support for interleaved thinking in trtllm-serve, specifically for the Kimi-K2-Thinking model. This addresses the case where reasoning content may be implicitly ended by a tool call section (<|tool_calls_section_begin|>) without an explicit </think> tag.

Changes:

  • New KimiK2ReasoningParser: Extends DeepSeekR1Parser to detect both </think> and <|tool_calls_section_begin|> as reasoning end markers. When a tool call section starts during reasoning, the reasoning is implicitly ended and the tool call section is passed through as content.
  • Both streaming and non-streaming support: The parser handles standard <think>...</think> patterns, tool-call-interrupted reasoning, and no-reasoning content in both parse() and parse_delta() modes.
  • Registered as kimi_k2: Uses reasoning_at_start=False (matching sglang's Qwen3Detector mapping), so the model must explicitly start reasoning with <think>.

Supported patterns:

  • <think>reasoning</think>content – standard thinking
  • <think>reasoning<|tool_calls_section_begin|>... – interleaved thinking (reasoning interrupted by tool call)
  • content (no <think>) – no reasoning

Adapted from:

  • vLLM vllm/reasoning/kimi_k2_reasoning_parser.py
  • sglang sglang/srt/parser/reasoning_parser.py

Test Coverage

  • test_kimi_k2_reasoning_parser – 8 parametrized non-streaming cases including tool call interruption
  • test_kimi_k2_reasoning_parser_stream – 7 parametrized streaming cases including buffered tool token handling
  • test_interleaved_thinking_stream – Cross-parser interleaved thinking tests for minimax_m2, deepseek-r1, qwen3, and kimi_k2 (including tool-call-interrupted reasoning)
  • All 69 reasoning parser tests pass (existing + new)

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why.

  • PR Follows TRT-LLM CODING GUIDELINES.

  • Test cases are provided for new code paths.

  • Please check this after reviewing the above items as appropriate for this PR.

@JunyiXu-nv JunyiXu-nv requested a review from a team as a code owner March 13, 2026 11:18
@JunyiXu-nv JunyiXu-nv requested a review from hchings March 13, 2026 11:18
@coderabbitai (Bot, Contributor) commented Mar 13, 2026

📝 Walkthrough

The PR introduces support for MiniMax-M2 and Kimi K2 models through a new KimiK2ReasoningParser that handles interleaved thinking (where tool-call sections implicitly end reasoning), a corresponding MiniMaxM2ToolParser for parsing tool calls in an XML-like structure, factory registration, and comprehensive test coverage spanning reasoning and tool parsing scenarios.

Changes

  • Reasoning parser enhancements (tensorrt_llm/llmapi/reasoning_parser.py): Adds decorator registrations for minimax_m2 and minimax_m2_append_think on DeepSeekR1Parser with reasoning_at_start=True. Introduces the new KimiK2ReasoningParser class extending DeepSeekR1Parser to support interleaved thinking with explicit (</think>) and implicit (<|tool_calls_section_begin|>) reasoning end markers. Implements parse for full-text parsing and parse_delta for incremental streaming with buffer management and partial-token handling. Implementation appears duplicated later in the file for the NemotronV3 context.
  • Tool parser implementation (tensorrt_llm/serve/tool_parser/minimax_m2_parser.py): Introduces MiniMaxM2ToolParser extending BaseToolParser to parse MiniMax-M2 XML-like tool calls. Includes helper functions _get_param_types and _parse_param_value for type inference and value conversion. Implements detect_and_parse for full parsing and parse_streaming_increment for incremental streaming with JSON parameter buffering, partial-invocation handling, and error recovery.
  • Tool parser registration (tensorrt_llm/serve/tool_parser/tool_parser_factory.py): Adds an import and a factory registration mapping "minimax_m2" to MiniMaxM2ToolParser in the ToolParserFactory.parsers dictionary.
  • Test coverage (tests/unittest/llmapi/apps/test_tool_parsers.py, tests/unittest/llmapi/test_reasoning_parser.py): Comprehensive test suites for MiniMaxM2ToolParser (initialization, detection, single/multiple tool parsing, streaming, parameter types, structural-tag support) and reasoning parsers (minimax_m2, minimax_m2_append_think, kimi_k2 across streaming and non-streaming modes with interleaved thinking and tool-call scenarios). Verifies factory registrations and parser coordination.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant ReasoningParser as KimiK2ReasoningParser
    participant ToolParser as MiniMaxM2ToolParser
    participant Output

    Client->>ReasoningParser: parse_delta(token_stream)
    activate ReasoningParser
    ReasoningParser->>ReasoningParser: Buffer delta text
    alt Detect <think> tag
        ReasoningParser->>ReasoningParser: Mark reasoning started
    end
    alt Detect </think> or tool_calls marker
        ReasoningParser->>ReasoningParser: Mark reasoning ended
        ReasoningParser->>Output: ReasoningParserResult(reasoning_content)
    end
    ReasoningParser->>Output: ReasoningParserResult(content)
    deactivate ReasoningParser

    Output->>ToolParser: parse_streaming_increment(content)
    activate ToolParser
    alt Detect <minimax:tool_call> start
        ToolParser->>ToolParser: Initialize tool invocation buffer
    end
    ToolParser->>ToolParser: Extract function name, parameters
    ToolParser->>ToolParser: Convert parameters with type inference
    alt Detect </minimax:tool_call> end
        ToolParser->>Output: StreamingParseResult(ToolCallItem)
    end
    deactivate ToolParser

    Output-->>Client: Combined reasoning + tool results
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks — ✅ 3 passed

  • Title check — Passed: The title clearly and specifically describes the main change: adding support for interleaved thinking in trtllm-serve, which directly corresponds to the PR's primary objective of implementing the KimiK2ReasoningParser.
  • Docstring coverage — Passed: No functions found in the changed files to evaluate docstring coverage; check skipped.
  • Description check — Passed: The PR description clearly explains the purpose, changes, supported patterns, and test coverage for interleaved thinking support in trtllm-serve.



@coderabbitai (Bot) left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/llmapi/reasoning_parser.py`:
- Around line 75-76: The MiniMax parser registrations ("minimax_m2" and
"minimax_m2_append_think") reuse DeepSeekR1Parser.parse_delta(), which fails
when a single delta contains both the post-reasoning tail and the next "<think>"
opener (e.g. "reason1</think>text1<think>reason2") because it emits the entire
tail as plain content instead of splitting and reopening a reasoning block;
update the parse_delta implementation used by these registrations (or add an
overriding wrapper) to detect a "</think>...<think>" pattern in the incoming
delta, split the tail into post-reasoning content and the reopened reasoning
segment, emit the post-reasoning text as content, then emit a token/event to
reopen a reasoning block before passing the remaining text back into the
existing parsing flow (use the register_reasoning_parser handlers and
DeepSeekR1Parser.parse_delta as reference points when inserting the
split-and-reopen logic).
- Around line 267-322: The parser currently discards any text before a found
<think> (and leaks partial start-tags) when self.in_reasoning is False; fix by
preserving the prefix as content and buffering partial start-tag suffixes.
Specifically, in the branch that computes begin_idx from self.reasoning_start,
when begin_idx != -1 set content = delta_text[:begin_idx] and set
reasoning_content = delta_text[begin_idx + len(self.reasoning_start):] (and set
self.in_reasoning True); when begin_idx == -1 do not always clear self._buffer —
detect a trailing partial prefix of self.reasoning_start or
self.tool_section_start (e.g. last '<' suffix) and set self._buffer to that
suffix while returning content=delta_text up to that suffix, otherwise clear
self._buffer and return content=delta_text; update uses of self.in_reasoning,
self._buffer, reasoning_content, begin_idx and self.reasoning_start accordingly.

In `@tensorrt_llm/serve/tool_parser/minimax_m2_parser.py`:
- Around line 104-110: The detect_and_parse method currently drops text after or
around <minimax:tool_call> by returning only the prefix or empty normal_text
when an opener exists; update detect_and_parse to preserve prefix (text before
the opener) as normal_text and also detect and include any suffix after the
closing tag when present, parsing tool call content between opener and closer
into calls; when an opener exists but no closer yet, keep the prefix in
normal_text (do not return ""), buffer the remainder for streaming updates, and
only remove the tool block once its closing tag is seen. Apply the same
preservation logic to the corresponding streaming/partial-parse handlers
referenced at the other ranges (the functions handling streaming deltas) so they
similarly retain text before the opener and surface suffix text after the closer
instead of dropping it.
- Around line 33-61: The function _parse_param_value currently JSON-parses and
coerces values before checking the declared param_type, causing values declared
as "string" in the schema to be mutated (e.g., "42" -> 42, "true" -> True). Fix
by short-circuiting when param_type == "string" (return the stripped value_str
unchanged) before any json.loads or numeric/boolean conversions; otherwise keep
the existing logic (JSON parse first, then numeric/boolean fallbacks) and
preserve the original behavior for non-string types.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6efa071d-da90-46ac-916d-81f3f4e3635a

📥 Commits

Reviewing files that changed from the base of the PR and between 3fb931a and 1c8a086.

📒 Files selected for processing (5)
  • tensorrt_llm/llmapi/reasoning_parser.py
  • tensorrt_llm/serve/tool_parser/minimax_m2_parser.py
  • tensorrt_llm/serve/tool_parser/tool_parser_factory.py
  • tests/unittest/llmapi/apps/test_tool_parsers.py
  • tests/unittest/llmapi/test_reasoning_parser.py

@JunyiXu-nv JunyiXu-nv removed the request for review from hchings March 13, 2026 12:53
Add support for interleaved thinking (reasoning between tool calls)
for MiniMax-M2 and GLM-4.7 model families in trtllm-serve.

- Add MiniMaxM2ToolParser for <minimax:tool_call> XML format with
  single/parallel tool calls and streaming support
- Add Glm47ToolParser extending Glm4ToolParser for GLM-4.7 models
  with optional arguments support
- Register new reasoning parsers (glm45, minimax_m2,
  minimax_m2_append_think) using existing DeepSeekR1Parser with
  reasoning_at_start=True for <think>...</think> format
- Register new tool parsers (glm47, minimax_m2) in ToolParserFactory
- Add comprehensive unit tests for new parsers including streaming,
  parallel tool calls, and interleaved thinking integration tests

Signed-off-by: Junyi Xi <junyix@nvidia.com>
Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
- Add docstrings to all methods in Glm47ToolParser and MiniMaxM2ToolParser
  to meet the 80% docstring coverage threshold required by CI pre-merge checks

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
- Remove glm47_parser.py (GLM-4.7 is not in ticket scope)
- Remove glm45 reasoning parser registration
- Remove GLM-4.7 related tests
- Keep only Kimi-K2 and MiniMax-M2 as specified in ticket

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
…nking support

- Add KimiK2ReasoningParser that extends DeepSeekR1Parser to handle
  reasoning content implicitly ended by tool call sections
  (<|tool_calls_section_begin|>) without explicit </think> tags
- Support standard <think>...</think>, tool-call-interrupted reasoning,
  and no-reasoning patterns in both streaming and non-streaming modes
- Add comprehensive unit tests for kimi_k2 parser (non-streaming,
  streaming, and interleaved thinking scenarios)
- Adapted from vLLM kimi_k2_reasoning_parser.py and sglang reasoning
  parser implementations

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
- Fix KimiK2 streaming parser to buffer partial special tags (e.g.
  partial <think> or <|tool_calls_section_begin|>) instead of leaking
  them as content text
- Fix _parse_param_value to short-circuit for string-typed params,
  preventing json.loads from coercing values like "42" or "true"
- Fix MiniMaxM2ToolParser.detect_and_parse to preserve text after
  the closing </minimax:tool_call> tag
- Fix MiniMaxM2ToolParser.parse_streaming_increment to preserve
  prefix text before <minimax:tool_call> when both arrive in the
  same chunk
- Add tests covering partial tag buffering, string param preservation,
  suffix preservation, and streaming prefix handling

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
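
The partial-tag buffering in the commit above can be sketched as a hold-back check on each streamed chunk: if the buffer ends with a proper prefix of any special tag, that suffix is withheld until the next chunk resolves it. The helper name and tag list are illustrative, under the assumption that only these three markers need protection.

```python
SPECIAL_TAGS = ("<think>", "</think>", "<|tool_calls_section_begin|>")

def split_safe_emit(buffered: str) -> tuple[str, str]:
    """Return (emit_now, keep_buffered) for a streaming chunk.

    The longest trailing substring that could still grow into one of
    SPECIAL_TAGS is held back instead of being leaked as content text.
    """
    for start in range(len(buffered)):
        suffix = buffered[start:]
        # A proper prefix of a tag might complete in the next chunk.
        if any(tag.startswith(suffix) and len(suffix) < len(tag)
               for tag in SPECIAL_TAGS):
            return buffered[:start], suffix
    return buffered, ""
```

A complete tag is not held back by this check; detecting and acting on full tags is the job of the surrounding parse_delta logic.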
@JunyiXu-nv JunyiXu-nv force-pushed the user/junyix/fix-trtllm-11357 branch from 5fececb to 4cf3e6d Compare March 18, 2026 09:09
@JunyiXu-nv (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #39431 [ run ] triggered by Bot. Commit: 4cf3e6d

@tensorrt-cicd (Collaborator):

PR_Github #39431 [ run ] completed with state SUCCESS. Commit: 4cf3e6d
/LLM/main/L0_MergeRequest_PR pipeline #30658 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


@JunyiXu-nv (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #39543 [ run ] triggered by Bot. Commit: 4cf3e6d

@tensorrt-cicd (Collaborator):

PR_Github #39543 [ run ] completed with state SUCCESS. Commit: 4cf3e6d
/LLM/main/L0_MergeRequest_PR pipeline #30761 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

CI Report


@JunyiXu-nv JunyiXu-nv requested review from LinPoly and syuoni March 19, 2026 13:22
@LinPoly (Collaborator) left a comment:

LGTM

@JunyiXu-nv JunyiXu-nv merged commit 7dd0865 into NVIDIA:main Mar 20, 2026
5 checks passed
longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026
…VIDIA#12199)

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>