[TRTLLM-11357][feat] Support interleaved thinking for trtllm-serve#12199
JunyiXu-nv merged 5 commits into NVIDIA:main from JunyiXu-nv:user/junyix/fix-trtllm-11357
Conversation
📝 Walkthrough

The PR introduces support for MiniMax-M2 and Kimi K2 models through new reasoning and tool parsers.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant ReasoningParser as KimiK2ReasoningParser
    participant ToolParser as MiniMaxM2ToolParser
    participant Output

    Client->>ReasoningParser: parse_delta(token_stream)
    activate ReasoningParser
    ReasoningParser->>ReasoningParser: Buffer delta text
    alt Detect <think> tag
        ReasoningParser->>ReasoningParser: Mark reasoning started
    end
    alt Detect </think> or tool_calls marker
        ReasoningParser->>ReasoningParser: Mark reasoning ended
        ReasoningParser->>Output: ReasoningParserResult(reasoning_content)
    end
    ReasoningParser->>Output: ReasoningParserResult(content)
    deactivate ReasoningParser
    Output->>ToolParser: parse_streaming_increment(content)
    activate ToolParser
    alt Detect <minimax:tool_call> start
        ToolParser->>ToolParser: Initialize tool invocation buffer
    end
    ToolParser->>ToolParser: Extract function name, parameters
    ToolParser->>ToolParser: Convert parameters with type inference
    alt Detect </minimax:tool_call> end
        ToolParser->>Output: StreamingParseResult(ToolCallItem)
    end
    deactivate ToolParser
    Output-->>Client: Combined reasoning + tool results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/llmapi/reasoning_parser.py`:
- Around line 75-76: The MiniMax parser registrations ("minimax_m2" and
"minimax_m2_append_think") reuse DeepSeekR1Parser.parse_delta(), which fails
when a single delta contains both the post-reasoning tail and the next "<think>"
opener (e.g. "reason1</think>text1<think>reason2") because it emits the entire
tail as plain content instead of splitting and reopening a reasoning block;
update the parse_delta implementation used by these registrations (or add an
overriding wrapper) to detect a "</think>...<think>" pattern in the incoming
delta, split the tail into post-reasoning content and the reopened reasoning
segment, emit the post-reasoning text as content, then emit a token/event to
reopen a reasoning block before passing the remaining text back into the
existing parsing flow (use the register_reasoning_parser handlers and
DeepSeekR1Parser.parse_delta as reference points when inserting the
split-and-reopen logic).
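The split-and-reopen behavior this comment asks for can be sketched as a standalone helper. This is a minimal illustration, not the actual fix: the helper name `split_reopened_reasoning` and the tuple return shape are hypothetical, and the real change would wire this into `DeepSeekR1Parser.parse_delta` via the registration handlers.

```python
END_TAG, START_TAG = "</think>", "<think>"

def split_reopened_reasoning(delta: str):
    """Split a delta that may contain '</think>...<think>' into
    (reasoning, content, reopened_reasoning) segments.

    Assumes the parser is currently inside a reasoning block, so text
    before '</think>' (or the whole delta, absent the tag) is reasoning.
    """
    reasoning, content, reopened = delta, "", ""
    if END_TAG in delta:
        reasoning, tail = delta.split(END_TAG, 1)
        if START_TAG in tail:
            # The model reopened a reasoning block inside the same delta:
            # the tail before '<think>' is plain content, the remainder
            # starts a new reasoning segment.
            content, reopened = tail.split(START_TAG, 1)
        else:
            content = tail
    return reasoning, content, reopened
```

With the failing example from the comment, `"reason1</think>text1<think>reason2"` splits into reasoning `"reason1"`, content `"text1"`, and a reopened reasoning segment `"reason2"` that can be fed back into the normal parsing flow.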
- Around line 267-322: The parser currently discards any text before a found
<think> (and leaks partial start-tags) when self.in_reasoning is False; fix by
preserving the prefix as content and buffering partial start-tag suffixes.
Specifically, in the branch that computes begin_idx from self.reasoning_start,
when begin_idx != -1 set content = delta_text[:begin_idx] and set
reasoning_content = delta_text[begin_idx + len(self.reasoning_start):] (and set
self.in_reasoning True); when begin_idx == -1 do not always clear self._buffer —
detect a trailing partial prefix of self.reasoning_start or
self.tool_section_start (e.g. last '<' suffix) and set self._buffer to that
suffix while returning content=delta_text up to that suffix, otherwise clear
self._buffer and return content=delta_text; update uses of self.in_reasoning,
self._buffer, reasoning_content, begin_idx and self.reasoning_start accordingly.
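The partial-tag buffering described above boils down to "never emit a trailing fragment that could still grow into a special tag." A minimal sketch (the helper name `hold_partial_tag` is hypothetical; the real parser tracks this state in `self._buffer`):

```python
def hold_partial_tag(text, tags=("<think>", "<|tool_calls_section_begin|>")):
    """Return (emit, buffer): 'emit' is safe to surface as content now,
    'buffer' is a trailing fragment that may still complete one of 'tags'
    in a later delta."""
    for start in range(len(text)):
        suffix = text[start:]
        # A proper prefix of a tag (not the full tag) must be held back.
        if any(tag.startswith(suffix) and suffix != tag for tag in tags):
            return text[:start], suffix
    return text, ""
```

So a chunk ending in `"<|tool"` surfaces only the text before it and buffers `"<|tool"` until the next delta resolves whether it is a real marker.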
In `@tensorrt_llm/serve/tool_parser/minimax_m2_parser.py`:
- Around line 104-110: The detect_and_parse method currently drops text after or
around <minimax:tool_call> by returning only the prefix or empty normal_text
when an opener exists; update detect_and_parse to preserve prefix (text before
the opener) as normal_text and also detect and include any suffix after the
closing tag when present, parsing tool call content between opener and closer
into calls; when an opener exists but no closer yet, keep the prefix in
normal_text (do not return ""), buffer the remainder for streaming updates, and
only remove the tool block once its closing tag is seen. Apply the same
preservation logic to the corresponding streaming/partial-parse handlers
referenced at the other ranges (the functions handling streaming deltas) so they
similarly retain text before the opener and surface suffix text after the closer
instead of dropping it.
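The prefix/suffix preservation this comment asks for can be sketched in isolation. The function name `split_tool_block` and the three-tuple return are illustrative assumptions, not the parser's actual API:

```python
OPEN, CLOSE = "<minimax:tool_call>", "</minimax:tool_call>"

def split_tool_block(text):
    """Return (normal_text, tool_body, pending).

    normal_text keeps both the prefix before the opener and the suffix
    after the closer; pending buffers an unterminated tool block for the
    next streaming chunk instead of dropping the prefix.
    """
    start = text.find(OPEN)
    if start == -1:
        return text, "", ""
    prefix = text[:start]
    end = text.find(CLOSE, start)
    if end == -1:
        # Opener without closer yet: surface the prefix now, buffer the rest.
        return prefix, "", text[start:]
    body = text[start + len(OPEN):end]
    suffix = text[end + len(CLOSE):]
    return prefix + suffix, body, ""
```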
- Around line 33-61: The function _parse_param_value currently JSON-parses and
coerces values before checking the declared param_type, causing values declared
as "string" in the schema to be mutated (e.g., "42" -> 42, "true" -> True). Fix
by short-circuiting when param_type == "string" (return the stripped value_str
unchanged) before any json.loads or numeric/boolean conversions; otherwise keep
the existing logic (JSON parse first, then numeric/boolean fallbacks) and
preserve the original behavior for non-string types.
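The string short-circuit is simple to state in code. This is a self-contained sketch of the described fix, not the actual `_parse_param_value` implementation (whose surrounding logic and fallbacks may differ):

```python
import json

def parse_param_value(value_str, param_type):
    """Coerce a raw parameter value according to its declared type.

    Params declared 'string' are returned verbatim (stripped) so that
    json.loads cannot mutate "42" -> 42 or "true" -> True.
    """
    value_str = value_str.strip()
    if param_type == "string":
        return value_str
    try:
        return json.loads(value_str)
    except (json.JSONDecodeError, ValueError):
        pass
    # Fallback for bare boolean literals that are not valid JSON casing.
    if value_str.lower() in ("true", "false"):
        return value_str.lower() == "true"
    return value_str
```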
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6efa071d-da90-46ac-916d-81f3f4e3635a
📒 Files selected for processing (5)
- tensorrt_llm/llmapi/reasoning_parser.py
- tensorrt_llm/serve/tool_parser/minimax_m2_parser.py
- tensorrt_llm/serve/tool_parser/tool_parser_factory.py
- tests/unittest/llmapi/apps/test_tool_parsers.py
- tests/unittest/llmapi/test_reasoning_parser.py
Add support for interleaved thinking (reasoning between tool calls) for MiniMax-M2 and GLM-4.7 model families in trtllm-serve.

- Add MiniMaxM2ToolParser for <minimax:tool_call> XML format with single/parallel tool calls and streaming support
- Add Glm47ToolParser extending Glm4ToolParser for GLM-4.7 models with optional arguments support
- Register new reasoning parsers (glm45, minimax_m2, minimax_m2_append_think) using existing DeepSeekR1Parser with reasoning_at_start=True for <think>...</think> format
- Register new tool parsers (glm47, minimax_m2) in ToolParserFactory
- Add comprehensive unit tests for new parsers including streaming, parallel tool calls, and interleaved thinking integration tests

Signed-off-by: Junyi Xi <junyix@nvidia.com>
Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
- Add docstrings to all methods in Glm47ToolParser and MiniMaxM2ToolParser to meet the 80% docstring coverage threshold required by CI pre-merge checks

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
- Remove glm47_parser.py (GLM-4.7 is not in ticket scope)
- Remove glm45 reasoning parser registration
- Remove GLM-4.7 related tests
- Keep only Kimi-K2 and MiniMax-M2 as specified in ticket

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
…nking support

- Add KimiK2ReasoningParser that extends DeepSeekR1Parser to handle reasoning content implicitly ended by tool call sections (<|tool_calls_section_begin|>) without explicit </think> tags
- Support standard <think>...</think>, tool-call-interrupted reasoning, and no-reasoning patterns in both streaming and non-streaming modes
- Add comprehensive unit tests for kimi_k2 parser (non-streaming, streaming, and interleaved thinking scenarios)
- Adapted from vLLM kimi_k2_reasoning_parser.py and sglang reasoning parser implementations

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
- Fix KimiK2 streaming parser to buffer partial special tags (e.g. partial <think> or <|tool_calls_section_begin|>) instead of leaking them as content text
- Fix _parse_param_value to short-circuit for string-typed params, preventing json.loads from coercing values like "42" or "true"
- Fix MiniMaxM2ToolParser.detect_and_parse to preserve text after the closing </minimax:tool_call> tag
- Fix MiniMaxM2ToolParser.parse_streaming_increment to preserve prefix text before <minimax:tool_call> when both arrive in the same chunk
- Add tests covering partial tag buffering, string param preservation, suffix preservation, and streaming prefix handling

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
5fececb to 4cf3e6d
/bot run

PR_Github #39431 [ run ] triggered by Bot. Commit:

PR_Github #39431 [ run ] completed with state

/bot run

PR_Github #39543 [ run ] triggered by Bot. Commit:

PR_Github #39543 [ run ] completed with state
…VIDIA#12199) Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
Description
Add support for interleaved thinking in trtllm-serve, specifically for the Kimi-K2-Thinking model. This addresses the case where reasoning content may be implicitly ended by a tool call section (`<|tool_calls_section_begin|>`) without an explicit `</think>` tag.

Changes:
- `KimiK2ReasoningParser`: Extends `DeepSeekR1Parser` to detect both `</think>` and `<|tool_calls_section_begin|>` as reasoning end markers. When a tool call section starts during reasoning, the reasoning is implicitly ended and the tool call section is passed through as content.
- Handles standard `<think>...</think>` patterns, tool-call-interrupted reasoning, and no-reasoning content in both `parse()` and `parse_delta()` modes.
- `kimi_k2`: Uses `reasoning_at_start=False` (matching sglang's Qwen3Detector mapping), so the model must explicitly start reasoning with `<think>`.
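The end-marker logic described above can be sketched for the non-streaming case. This is an illustrative simplification, assuming `reasoning_at_start=False` semantics; the helper name `parse_kimi_k2` and its `(reasoning, content)` return are hypothetical, not the actual `KimiK2ReasoningParser` API:

```python
THINK_START, THINK_END = "<think>", "</think>"
TOOL_SECTION = "<|tool_calls_section_begin|>"

def parse_kimi_k2(text):
    """Split text into (reasoning, content).

    Reasoning ends at '</think>', or implicitly at the tool-call section
    marker; in the implicit case the marker stays in the content so the
    tool parser downstream can see it.
    """
    if not text.startswith(THINK_START):
        return "", text  # no reasoning block at all
    body = text[len(THINK_START):]
    candidates = [(i, m) for i, m in ((body.find(THINK_END), THINK_END),
                                      (body.find(TOOL_SECTION), TOOL_SECTION))
                  if i != -1]
    if not candidates:
        return body, ""  # reasoning still open
    idx, marker = min(candidates)
    if marker == THINK_END:
        return body[:idx], body[idx + len(marker):]
    # Implicit end: pass the tool-call section through as content.
    return body[:idx], body[idx:]
```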
Supported patterns:

- `<think>reasoning</think>content` – standard thinking
- `<think>reasoning<|tool_calls_section_begin|>...` – interleaved thinking (reasoning interrupted by tool call)
- `content` (no `<think>`) – no reasoning

Adapted from:

- `vllm/reasoning/kimi_k2_reasoning_parser.py`
- `sglang/srt/parser/reasoning_parser.py`

Test Coverage
- `test_kimi_k2_reasoning_parser` – 8 parametrized non-streaming cases including tool call interruption
- `test_kimi_k2_reasoning_parser_stream` – 7 parametrized streaming cases including buffered tool token handling
- `test_interleaved_thinking_stream` – Cross-parser interleaved thinking tests for minimax_m2, deepseek-r1, qwen3, and kimi_k2 (including tool-call-interrupted reasoning)
Please review the following before submitting your PR:
PR description clearly explains what and why.
PR Follows TRT-LLM CODING GUIDELINES.
Test cases are provided for new code paths.
Please check this after reviewing the above items as appropriate for this PR.