[https://nvbugs/5983390][fix] Remove redundant D2H sync to optimize perf #12445
hyukn merged 1 commit into NVIDIA/TensorRT-LLM:main from hyukn:fix/5983390_remove_sync
Conversation
…erf. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
/bot run --disable-fail-fast --add-multi-gpu-test

PR_Github #39905 [ run ] triggered by Bot. Commit:

PR_Github #39905 [ run ] completed with state

/bot run --disable-fail-fast --add-multi-gpu-test

/bot --help
GitHub Bot Help

Provide a user friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.

run
Launch build/test pipelines. All previously running jobs will be killed.

kill
Kill all running builds associated with pull request.

skip
Skip testing for latest commit on pull request.

reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
PR_Github #40011 [ run ] triggered by Bot. Commit:

PR_Github #40011 [ run ] completed with state

/bot run --disable-fail-fast --add-multi-gpu-test

PR_Github #40047 [ run ] triggered by Bot. Commit:
PR_Github #40047 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #40132 [ run ] triggered by Bot. Commit:

PR_Github #40132 [ run ] completed with state
Description
`_compute_slot_mappings()` guards a debug assert with `not is_current_stream_capturing()`, which only skips it during CUDA graph capture. During eager execution, `on_update_kv_lens()` calls this function with GPU tensors from `_preprocess_inputs()` (model_engine.py:1507, 1543), and evaluating `.all()` in the assert triggers a `reduce_kernel<bool>` + 1-byte D2H memcpy + `cudaStreamSynchronize` that stalls 12-15 ms waiting for the GPU queue to drain. This happens twice per context forward step (once unconditionally, once for overlap scheduler + MTP), adding ~25 ms of GPU bubble per iteration with context requests. Decode-only steps are unaffected.
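The mechanics can be sketched as follows (a minimal illustration, not the actual TensorRT-LLM code; the function signature and `num_kv_blocks` parameter are hypothetical). `.all()` on a CUDA tensor only launches a reduction kernel; the 1-byte D2H copy and stream sync happen when `assert` converts the 0-dim result to a Python bool. Guarding on `.is_cuda` keeps the assert on the CPU path only, where that conversion is free:

```python
import torch

def compute_slot_mappings(block_indices_in_seq: torch.Tensor,
                          num_kv_blocks: int) -> None:
    # Before the fix (for reference): the assert was guarded only by
    #   not torch.cuda.is_current_stream_capturing()
    # so it still ran during eager execution on GPU tensors, and the
    # assert's bool conversion forced a reduce kernel, a 1-byte D2H
    # copy, and a cudaStreamSynchronize.

    # After the fix: assert only on the CPU path, where .all() is cheap
    # and synchronous anyway. GPU-path offsets were already validated
    # and clamped during prepare().
    if not block_indices_in_seq.is_cuda:
        assert (block_indices_in_seq < num_kv_blocks).all(), \
            "block index out of range"
    # ... build slot mappings (omitted) ...
```

On the GPU path the guard now skips the assert entirely, so no reduction kernel is launched and the host never blocks on the stream.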
Fix: guard with `not block_indices_in_seq.is_cuda` instead. The assert still fires on the CPU path (`Indexer.prepare()`) at zero cost, but is skipped on the GPU path (`on_update_kv_lens()`), where block offsets were already validated and clamped during `prepare()`.

nsys evidence (DeepSeek-V3.2, TP8, piecewise CUDA graph, c16):
Test Coverage
Existing DSA accuracy tests cover correctness. The change only removes a redundant assert on the GPU path; the CPU-path assert is unchanged.
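As a complementary check for regressions of this kind, PyTorch's sync debug mode can surface implicit D2H synchronizations like the one removed here. A sketch, assuming a CUDA device is available (`torch.cuda.set_sync_debug_mode` is a real PyTorch API; the tensor workload is illustrative):

```python
import torch

if torch.cuda.is_available():
    # "warn" prints a warning on each blocking host<->device sync;
    # "error" raises a RuntimeError instead.
    torch.cuda.set_sync_debug_mode("warn")
    x = torch.randint(0, 10, (1024,), device="cuda")
    ok = (x < 10).all()   # launches reduce_kernel<bool>, still asynchronous
    assert bool(ok)       # bool() copies 1 byte D2H and syncs -> flagged here
    torch.cuda.set_sync_debug_mode(0)  # restore default (no checking)
```

Running the model under `"error"` mode in CI would turn any newly introduced implicit sync on the hot path into a hard failure.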
PR Checklist