[None][feat] Add --use-3rdparty-cache to accelerate cmake configuration of clean build by yuantailing · Pull Request #13942 · NVIDIA/TensorRT-LLM

yuantailing · May 9, 2026

Summary by CodeRabbit

New Features
- Added optional caching mechanism to accelerate third-party dependency retrieval in builds.
- Added --use-3rdparty-cache flag to enable local cache reuse during the build process.
- Added --configure-only flag to run CMake configuration without building or packaging.
Documentation
- Added detailed documentation describing the new dependency caching feature.
Tests
- Added comprehensive test suite for the caching mechanism.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

- CMakeLists.txt: resolve real git via separate find_program() to avoid GIT_EXECUTABLE self-recursion on re-configure. - fetch_cache_wrapper.py: complete Apache-2.0 header; pin cwd=top on per-submodule subprocess calls; turn _strict_parse error into an actionable message that names the parser to extend and shows examples. - fetch_cache.py: add per-repo fcntl.flock around _ensure_cache so concurrent updates of the same bare cache serialize; reapply SAFETY_CONFIG to existing caches; replace os.walk with directed scandir+recursion under .git/modules/. - build_wheel.py: drop dead try/except around best-effort run(); warn via warnings.warn when the cache update exits non-zero. Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

All git operations in _ensure_cache (init --bare, config, fetch) are already idempotent. The only non-idempotent step was rmtree-on-failure, which was also the source of the "rmtree races fetch" hazard that the flock locking was added to mitigate. Removing the rmtree lets us drop the lock entirely: a partial bare survives across retries and gets healed by the next update via the existing-cache path. Also drop the redundant `git submodule init` call in handle_submodule since `git submodule update --init -- <path>` inits each submodule implicitly before updating it. Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

_safe_fetch_into_cache builds an ephemeral .fc-standin-<random>/ per update. On a shared cache (Lustre/NFS/GPFS), the ~20 small files git init --bare writes, plus the packed-refs/alternates/shallow we add, plus the final rmtree, all pay a per-op metadata round-trip (~10 ms on Lustre vs. ~100 us on tmpfs). Measured on a warm NVTX shallow update on Lustre: 104 ms -> 62 ms per dep, from ~20 ms saved on init --bare and ~11 ms on the closing rmtree. Switch tempfile.mkdtemp() to the system temp dir by default; fall back to the cache_parent path byte-for-byte when the system temp is not writable (a Landlock-sandboxed caller whose ruleset covers the cache but not /tmp). mkdtemp's 0700 mode + unpredictable random suffix is what I3 actually relies on, not the parent directory, so moving off $CACHE_DIR changes no security invariant. Update fetch-cache.md: I3 standin bullet, performance-model table, new "Shared-filesystem optimizations" subsection, and the residual SIGKILL-cleanup note (tmpfs-on-reboot eviction now handles leaks). Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

The FetchContent_MakeAvailable override now invokes the writer with `--log-prefix <prefix> update --cache-dir <dir> --src <src>` in both the default and TRTLLM_FETCHCONTENT_UPDATE_CMD modes. fetch_cache.py itself becomes a valid TRTLLM_FETCHCONTENT_UPDATE_CMD value. Fold the cmake-side bookkeeping into one CACHE INTERNAL list: _FC_WRITER_CMD replaces _FC_UPDATE_CMD + _FC_CACHE_PY + _FC_PYTHON, which previously existed only to be reassembled inside _trtllm_fc_update_one. Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

`git clone --reference` only shares git objects through alternates; LFS-managed payloads live in a separate content store consulted by the smudge filter, so a `--reference` clone alone still refetches every LFS object over the network. Each bare now keeps a parallel pool at `<bare>/lfs/objects/<2>/<2>/<oid>`. Sandboxed consumers can pre-symlink the pool entries into their own gitdir; smudge follows symlinks transparently, and unlike hardlinks symlinks survive the cross-device case that git-LFS's `CopyOrLinkFile` turns into a full copy on EXDEV when the bare and the consumer gitdir live on different st_devs. Writer-side: `_walk_src_lfs_objects` walks the src lfs tree under the LFS-standard <2>/<2>/<64-hex> layout, rejecting symlinks at every level (I3). `_ingest_lfs_object` hardlinks (or copies on EXDEV) into a sibling tempfile, validates SHA-256 of the staged file, and atomic- renames into place (I2). Validating after staging keeps the bytes that get hashed equal to the bytes that land in the cache, even under a concurrent dentry swap on the src side. `_ensure_cache` also gains an up-to-date short-circuit: when src has no objects of its own (only `info/alternates`, the shape of every `--reference` consumer worktree), the standin/fetch/replicate pass would move zero git bytes. Skip the entire pass and run only the LFS pool walk — a consumer that ran `git lfs pull` after a `--reference` clone may carry new LFS payloads without new git objects. fetch-cache.md gains an LFS pool section under cache layout, threat- model coverage (SHA-256-CR on the trusted side, src lfs tree on the untrusted side), I2/I3/I4/I5 LFS extensions, the native git mechanisms table extended for SHA-256, and two new residual risks (TOCTOU size- cap bypass on the src dentry, .lfs-tmp leak after SIGKILL mid-ingest). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

The previous _ingest_lfs_object hardlinked src into the staged tempfile (falling back to copy on EXDEV), validated SHA-256, and renamed into the bare's lfs/objects/<2>/<2>/<oid>. After the rename, dst and src share an inode. An attacker who keeps writing to src under the same oid filename after ingest mutates the cache entry's bytes through that shared inode, breaking the oid-name <-> content-hash binding the LFS pool relies on. Hash validation at ingest does not survive subsequent src writes. Switch _ingest_lfs_object to a single copy path: mkstemp + shutil.copyfileobj into the tempfile, then SHA-256 reverify, then atomic rename. The cache entry's inode is independent of src's, so post-ingest writes to src cannot reach cached bytes. Cost is O(filesize) on every ingest, accepted: LFS payloads are usually hundreds of MB and ingest happens once per oid (already-present oids are skipped). The errno import and the EXDEV branch go away with the hardlink path. fetch-cache.md follows the code: - LFS pool writer-side: copy-only, with the inode-independence rationale moved here from the docstring (repeated design ideas live in the doc, not the code). - I2 LFS subitem: "after the link/copy" -> "after the copy". - TOCTOU residual risk: the dual hardlink/EXDEV-copy branching collapses to a single copy path; same fstat-based mitigation pointer. - I3 LFS symlink defense: os.link is gone; the only remaining src-side I/O is open(src_path, "rb"), so the symlink rejection walker now guards that open instead of os.link. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

- Drop past-version framing ("identical to before", "historically observed", "pre-change", "older version of this script") in favor of present-tense statements of current behavior. - Generalize Lustre/NFS/GPFS-specific perf claims and the Lustre-hosted measurement table to an abstract "network filesystem with high metadata latency" framing; remove the "Claude Code" brand reference from the threat model. - Trim duplication between fetch-cache.md and code comments: standin tmpfs/Landlock/mkdtemp rationale, submodule routing, and refname quoting rationale now live only in the doc; the comments cross-reference instead of restating. Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Adds tests/unittest/scripts/test_fetch_cache.py — four pytest cases covering the FetchContent cache writer end-to-end without network or GitHub access: - top-level dep, full and shallow modes: two update cycles each (initial + new commit), each verified by `git clone --reference <bare>` landing with zero git objects of its own; - parent dep with nested submodule: bare materializes for both parent and submodule, both consumer git dirs object-less; - LFS-tracked binary: oid lands in the bare's LFS pool with matching bytes; second update is idempotent. Picked up automatically by the existing `unittest/scripts` directory entry in l0_a10 (CPU-only, pre-merge). Updated the inline comment to reflect the bucket's actual contents — the suite is no longer just lint helpers, and it now runs in ~3s rather than the original ~0.3s. Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

yuantailing · May 9, 2026

/bot run --disable-fail-fast

coderabbitai · May 9, 2026

📝 Walkthrough

Walkthrough

This PR introduces an opt-in FetchContent caching mechanism that accelerates repeated third-party dependency checkouts in TensorRT-LLM. The implementation adds a Python cache manager, a CMake-injected git wrapper, CMake configuration to wire them together, and comprehensive end-to-end tests validating correct behavior across shallow/full modes, submodules, and LFS scenarios.

Changes

FetchContent Cache Acceleration

Layer / File(s)	Summary
Configuration & Public Contracts `.gitignore`, `scripts/build_wheel.py`	New `--use-3rdparty-cache` and `--configure-only` CLI flags added to `build_wheel.py`; cache directory `.cache_3rdparty/` added to `.gitignore`.
Design & Architecture Documentation `3rdparty/fetch-cache.md`	Comprehensive 628-line design doc covering cache layout (shallow/ vs full/), threat model, LFS pool mirroring, git safety hardening, concurrency guarantees, and performance model.
Cache Writer Implementation `3rdparty/fetch_cache.py`	New 935-line Python cache manager deriving cache names from source URLs, probing git mode (shallow/full), initializing/updating bare repos, validating LFS objects via SHA-256, safely fetching via standin-bare + alternates, replicating append-only ref anchors, and discovering nested submodules.
Cache Reader Wrapper `3rdparty/fetch_cache_wrapper.py`	New 306-line CMake-injected git wrapper detecting `git clone` and `git submodule update` commands, rewriting them to use cached bare repos via `--reference` (shallow/full selection based on `--depth`), and falling back transparently on cache misses.
CMake Integration `3rdparty/CMakeLists.txt`	Setup for non-Windows opt-in caching: configures wrapper script paths, locates real git/python3, injects `GIT_EXECUTABLE`, sets environment variables; overrides `FetchContent_MakeAvailable` macro to log per-dependency phases and trigger per-dependency cache update via `_trtllm_fc_update_one` helper.
Build System Wiring `scripts/build_wheel.py`	Forwards `--use-3rdparty-cache` flag into CMake variables (`TRTLLM_FETCHCONTENT_CACHE` and optional `TRTLLM_FETCHCONTENT_UPDATE_CMD`); adds `--configure-only` mode to exit after CMake configuration.
User Documentation `3rdparty/README.md`	Updated header/intro and added "FetchContent cache (`--use-3rdparty-cache`)" section explaining opt-in behavior, git clone --reference acceleration, and reference to design rationale.
End-to-End Tests `tests/unittest/scripts/test_fetch_cache.py`, `tests/integration/test_lists/test-db/l0_a10.yml`	New 357-line hermetic test suite validating cache writer behavior: top-level deps (full vs shallow), submodules, and LFS pool scenarios, all asserting object-less consumer clones; minor test-list comment update.

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description contains only the repository template boilerplate with no substantive content in the required sections (Description, Test Coverage).	Fill in the Description section explaining what the feature does and why it's needed, and document the Test Coverage section listing relevant tests.
Docstring Coverage	⚠️ Warning	Docstring coverage is 58.62% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: adding --use-3rdparty-cache to accelerate cmake configuration.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (2)

3rdparty/fetch_cache_wrapper.py (1)

262-268: 💤 Low value

Consider logging when fallback path is taken.

When the --reference clone fails and falls back to plain submodule update --init, the failure is silently recovered. While correct for resilience, a log message would help debugging cache misses.

♻️ Optional: Add logging for fallback path

         ret = subprocess.run(cmd, cwd=top).returncode
         # Fallback without --reference if it failed
         if ret != 0 and ref:
+            print(f"{TAG} submodule {sub_path}: --reference failed, falling back",
+                  file=sys.stderr)
             subprocess.run(
                 [REAL_GIT, "submodule", "update", "--init", "--", sub_path],
                 cwd=top, check=False,
             )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@3rdparty/fetch_cache_wrapper.py` around lines 262 - 268, Add a log entry when
the fallback branch is taken after the `subprocess.run(cmd, cwd=top)` returns a
non-zero `ret` and `ref` is truthy; specifically, before calling
`subprocess.run([REAL_GIT, "submodule", "update", "--init", "--", sub_path],
cwd=top, check=False)` emit a warning (e.g., via the module logger or Python's
logging) that the `--reference` clone failed and you're falling back to plain
`submodule update --init`, including context like `ret`, the original `cmd` and
`sub_path`/`top` to aid debugging.

3rdparty/fetch-cache.md (1)

99-108: 💤 Low value

Consider adding language specifiers to fenced code blocks.

For consistency with markdown best practices, the directory structure examples could use a language specifier (e.g., text or plaintext).

♻️ Optional: Add language specifiers

-```
+```text
 $CACHE_DIR/
 ├── shallow/

-```
+```text
 <consumer-gitdir>/.git/lfs/objects/<2>/<2>/<oid>

Also applies to: 152-155

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@3rdparty/fetch-cache.md` around lines 99 - 108, Add a language specifier
(e.g., "text" or "plaintext") to the fenced code blocks that show directory
trees and git-lfs paths so they follow Markdown best practices; update the
fences around the $CACHE_DIR/ example (the block that starts with "$CACHE_DIR/")
and the block containing "<consumer-gitdir>/.git/lfs/objects/<2>/<2>/<oid>" to
use ```text instead of ``` so syntax highlighters and linters treat them as
plain text.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@3rdparty/fetch_cache_wrapper.py`:
- Around line 262-268: Add a log entry when the fallback branch is taken after
the `subprocess.run(cmd, cwd=top)` returns a non-zero `ret` and `ref` is truthy;
specifically, before calling `subprocess.run([REAL_GIT, "submodule", "update",
"--init", "--", sub_path], cwd=top, check=False)` emit a warning (e.g., via the
module logger or Python's logging) that the `--reference` clone failed and
you're falling back to plain `submodule update --init`, including context like
`ret`, the original `cmd` and `sub_path`/`top` to aid debugging.

In `@3rdparty/fetch-cache.md`:
- Around line 99-108: Add a language specifier (e.g., "text" or "plaintext") to
the fenced code blocks that show directory trees and git-lfs paths so they
follow Markdown best practices; update the fences around the $CACHE_DIR/ example
(the block that starts with "$CACHE_DIR/") and the block containing
"<consumer-gitdir>/.git/lfs/objects/<2>/<2>/<oid>" to use ```text instead of ```
so syntax highlighters and linters treat them as plain text.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2100e2ae-26a4-4813-8969-e2d3589f0f1d

📥 Commits

Reviewing files that changed from the base of the PR and between ded5e98 and 0cb6d1a.

📒 Files selected for processing (9)

.gitignore
3rdparty/CMakeLists.txt
3rdparty/README.md
3rdparty/fetch-cache.md
3rdparty/fetch_cache.py
3rdparty/fetch_cache_wrapper.py
scripts/build_wheel.py
tests/integration/test_lists/test-db/l0_a10.yml
tests/unittest/scripts/test_fetch_cache.py

tensorrt-cicd · May 9, 2026

PR_Github #47519 [ run ] triggered by Bot. Commit: 0cb6d1a Link to invocation

tensorrt-cicd · May 9, 2026

PR_Github #47519 [ run ] completed with state SUCCESS. Commit: 0cb6d1a
/LLM/main/L0_MergeRequest_PR pipeline #37436 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yuantailing and others added 30 commits May 9, 2026 12:39

Add cmake FetchContent cache

a344263

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Support submodules

2c47028

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Implement git command parser with Python

1043ab0

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Remove init

2f99c3a

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Fix for shallow

7073a18

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Log on finish

8af0591

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Fix for pre-commit

e4bd4fe

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Update comment

124346c

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Fix for shallow repo

8bfaa57

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Make updating earlier

ba9c024

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Add --use-3rdparty-cache and TRTLLM_FETCHCONTENT_UPDATE_CMD

0dd8be2

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Add logs to cmake FetchContent

68b4983

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Fix for shallow

11b6948

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Add --configure-only

812a358

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Fix according to the threat model

5d6f7df

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Polish logs

4ae0fee

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Avoid code-exec from <src_dir>/.git/config

5bc4dff

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Separate full repo and shallow repo

a0d7ad9

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Not verify refs in full repo if not updated

564ec6a

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Update docs

f0f2b96

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Update docs to reflect residual risks

350eefd

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

Update docs

c9975a7

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>

yuantailing requested a review from a team as a code owner May 9, 2026 12:50

github-actions Bot assigned yuantailing May 9, 2026

coderabbitai Bot reviewed May 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[None][feat] Add --use-3rdparty-cache to accelerate cmake configuration of clean build#13942

[None][feat] Add --use-3rdparty-cache to accelerate cmake configuration of clean build#13942
yuantailing wants to merge 30 commits intoNVIDIA:mainNVIDIA/TensorRT-LLM:mainfrom
yuantailing:fetch-cacheyuantailing/TensorRT-LLM:fetch-cacheCopy head branch name to clipboard

yuantailing commented May 9, 2026 •

edited

Loading

Uh oh!

yuantailing commented May 9, 2026

Uh oh!

coderabbitai Bot commented May 9, 2026

Walkthrough

Changes

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

tensorrt-cicd commented May 9, 2026

Uh oh!

tensorrt-cicd commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Search code, repositories, users, issues, pull requests...

Conversation

yuantailing commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

yuantailing commented May 9, 2026

Uh oh!

coderabbitai Bot commented May 9, 2026

Walkthrough

Changes

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

tensorrt-cicd commented May 9, 2026

Uh oh!

tensorrt-cicd commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yuantailing commented May 9, 2026 •

edited

Loading