Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

[None][feat] Add --use-3rdparty-cache to accelerate cmake configuration of clean build#13942

Open
yuantailing wants to merge 30 commits intoNVIDIA:mainNVIDIA/TensorRT-LLM:mainfrom
yuantailing:fetch-cacheyuantailing/TensorRT-LLM:fetch-cacheCopy head branch name to clipboard
Open

[None][feat] Add --use-3rdparty-cache to accelerate cmake configuration of clean build#13942
yuantailing wants to merge 30 commits intoNVIDIA:mainNVIDIA/TensorRT-LLM:mainfrom
yuantailing:fetch-cacheyuantailing/TensorRT-LLM:fetch-cacheCopy head branch name to clipboard

Conversation

@yuantailing
Copy link
Copy Markdown
Member

@yuantailing yuantailing commented May 9, 2026

Summary by CodeRabbit

  • New Features

    • Added optional caching mechanism to accelerate third-party dependency retrieval in builds.
    • Added --use-3rdparty-cache flag to enable local cache reuse during the build process.
    • Added --configure-only flag to run CMake configuration without building or packaging.
  • Documentation

    • Added detailed documentation describing the new dependency caching feature.
  • Tests

    • Added comprehensive test suite for the caching mechanism.

Review Change Stack

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

yuantailing and others added 30 commits May 9, 2026 12:39
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
- CMakeLists.txt: resolve real git via separate find_program() to avoid
  GIT_EXECUTABLE self-recursion on re-configure.
- fetch_cache_wrapper.py: complete Apache-2.0 header; pin cwd=top on
  per-submodule subprocess calls; turn _strict_parse error into an
  actionable message that names the parser to extend and shows examples.
- fetch_cache.py: add per-repo fcntl.flock around _ensure_cache so
  concurrent updates of the same bare cache serialize; reapply
  SAFETY_CONFIG to existing caches; replace os.walk with directed
  scandir+recursion under .git/modules/.
- build_wheel.py: drop dead try/except around best-effort run(); warn
  via warnings.warn when the cache update exits non-zero.

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
All git operations in _ensure_cache (init --bare, config, fetch) are
already idempotent. The only non-idempotent step was rmtree-on-failure,
which was also the source of the "rmtree races fetch" hazard that the
flock locking was added to mitigate. Removing the rmtree lets us drop
the lock entirely: a partial bare survives across retries and gets
healed by the next update via the existing-cache path.

Also drop the redundant `git submodule init` call in handle_submodule
since `git submodule update --init -- <path>` inits each submodule
implicitly before updating it.

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
_safe_fetch_into_cache builds an ephemeral .fc-standin-<random>/ per
update. On a shared cache (Lustre/NFS/GPFS), the ~20 small files
git init --bare writes, plus the packed-refs/alternates/shallow
we add, plus the final rmtree, all pay a per-op metadata round-trip
(~10 ms on Lustre vs. ~100 us on tmpfs). Measured on a warm NVTX
shallow update on Lustre: 104 ms -> 62 ms per dep, from ~20 ms
saved on init --bare and ~11 ms on the closing rmtree.

Switch tempfile.mkdtemp() to the system temp dir by default; fall
back to the cache_parent path byte-for-byte when the system temp
is not writable (a Landlock-sandboxed caller whose ruleset covers
the cache but not /tmp). mkdtemp's 0700 mode + unpredictable random
suffix is what I3 actually relies on, not the parent directory, so
moving off $CACHE_DIR changes no security invariant.

Update fetch-cache.md: I3 standin bullet, performance-model table,
new "Shared-filesystem optimizations" subsection, and the residual
SIGKILL-cleanup note (tmpfs-on-reboot eviction now handles leaks).

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
The FetchContent_MakeAvailable override now invokes the writer with
`--log-prefix <prefix> update --cache-dir <dir> --src <src>` in both
the default and TRTLLM_FETCHCONTENT_UPDATE_CMD modes. fetch_cache.py
itself becomes a valid TRTLLM_FETCHCONTENT_UPDATE_CMD value.

Fold the cmake-side bookkeeping into one CACHE INTERNAL list:
_FC_WRITER_CMD replaces _FC_UPDATE_CMD + _FC_CACHE_PY + _FC_PYTHON,
which previously existed only to be reassembled inside
_trtllm_fc_update_one.

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
`git clone --reference` only shares git objects through alternates;
LFS-managed payloads live in a separate content store consulted by
the smudge filter, so a `--reference` clone alone still refetches
every LFS object over the network. Each bare now keeps a parallel
pool at `<bare>/lfs/objects/<2>/<2>/<oid>`. Sandboxed consumers can
pre-symlink the pool entries into their own gitdir; smudge follows
symlinks transparently, and unlike hardlinks symlinks survive the
cross-device case that git-LFS's `CopyOrLinkFile` turns into a full
copy on EXDEV when the bare and the consumer gitdir live on different
st_devs.

Writer-side: `_walk_src_lfs_objects` walks the src lfs tree under
the LFS-standard <2>/<2>/<64-hex> layout, rejecting symlinks at every
level (I3). `_ingest_lfs_object` hardlinks (or copies on EXDEV) into
a sibling tempfile, validates SHA-256 of the staged file, and atomic-
renames into place (I2). Validating after staging keeps the bytes
that get hashed equal to the bytes that land in the cache, even
under a concurrent dentry swap on the src side.

`_ensure_cache` also gains an up-to-date short-circuit: when src has
no objects of its own (only `info/alternates`, the shape of every
`--reference` consumer worktree), the standin/fetch/replicate pass
would move zero git bytes. Skip the entire pass and run only the LFS
pool walk — a consumer that ran `git lfs pull` after a `--reference`
clone may carry new LFS payloads without new git objects.

fetch-cache.md gains an LFS pool section under cache layout, threat-
model coverage (SHA-256-CR on the trusted side, src lfs tree on the
untrusted side), I2/I3/I4/I5 LFS extensions, the native git mechanisms
table extended for SHA-256, and two new residual risks (TOCTOU size-
cap bypass on the src dentry, .lfs-tmp leak after SIGKILL mid-ingest).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
The previous _ingest_lfs_object hardlinked src into the staged
tempfile (falling back to copy on EXDEV), validated SHA-256, and
renamed into the bare's lfs/objects/<2>/<2>/<oid>. After the rename,
dst and src share an inode. An attacker who keeps writing to src
under the same oid filename after ingest mutates the cache entry's
bytes through that shared inode, breaking the oid-name <->
content-hash binding the LFS pool relies on. Hash validation at
ingest does not survive subsequent src writes.

Switch _ingest_lfs_object to a single copy path: mkstemp +
shutil.copyfileobj into the tempfile, then SHA-256 reverify, then
atomic rename. The cache entry's inode is independent of src's, so
post-ingest writes to src cannot reach cached bytes. Cost is
O(filesize) on every ingest, accepted: LFS payloads are usually
hundreds of MB and ingest happens once per oid (already-present oids
are skipped). The errno import and the EXDEV branch go away with the
hardlink path.

fetch-cache.md follows the code:

- LFS pool writer-side: copy-only, with the inode-independence
  rationale moved here from the docstring (repeated design ideas
  live in the doc, not the code).
- I2 LFS subitem: "after the link/copy" -> "after the copy".
- TOCTOU residual risk: the dual hardlink/EXDEV-copy branching
  collapses to a single copy path; same fstat-based mitigation
  pointer.
- I3 LFS symlink defense: os.link is gone; the only remaining
  src-side I/O is open(src_path, "rb"), so the symlink rejection
  walker now guards that open instead of os.link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
- Drop past-version framing ("identical to before", "historically
  observed", "pre-change", "older version of this script") in favor of
  present-tense statements of current behavior.
- Generalize Lustre/NFS/GPFS-specific perf claims and the
  Lustre-hosted measurement table to an abstract "network filesystem
  with high metadata latency" framing; remove the "Claude Code" brand
  reference from the threat model.
- Trim duplication between fetch-cache.md and code comments: standin
  tmpfs/Landlock/mkdtemp rationale, submodule routing, and refname
  quoting rationale now live only in the doc; the comments
  cross-reference instead of restating.

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Adds tests/unittest/scripts/test_fetch_cache.py — four pytest cases
covering the FetchContent cache writer end-to-end without network or
GitHub access:

- top-level dep, full and shallow modes: two update cycles each
  (initial + new commit), each verified by `git clone --reference
  <bare>` landing with zero git objects of its own;
- parent dep with nested submodule: bare materializes for both
  parent and submodule, both consumer git dirs object-less;
- LFS-tracked binary: oid lands in the bare's LFS pool with
  matching bytes; second update is idempotent.

Picked up automatically by the existing `unittest/scripts` directory
entry in l0_a10 (CPU-only, pre-merge).  Updated the inline comment to
reflect the bucket's actual contents — the suite is no longer just lint
helpers, and it now runs in ~3s rather than the original ~0.3s.

Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
@yuantailing yuantailing requested a review from a team as a code owner May 9, 2026 12:50
@yuantailing
Copy link
Copy Markdown
Member Author

/bot run --disable-fail-fast

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 9, 2026

📝 Walkthrough

Walkthrough

This PR introduces an opt-in FetchContent caching mechanism that accelerates repeated third-party dependency checkouts in TensorRT-LLM. The implementation adds a Python cache manager, a CMake-injected git wrapper, CMake configuration to wire them together, and comprehensive end-to-end tests validating correct behavior across shallow/full modes, submodules, and LFS scenarios.

Changes

FetchContent Cache Acceleration

Layer / File(s) Summary
Configuration & Public Contracts
.gitignore, scripts/build_wheel.py
New --use-3rdparty-cache and --configure-only CLI flags added to build_wheel.py; cache directory .cache_3rdparty/ added to .gitignore.
Design & Architecture Documentation
3rdparty/fetch-cache.md
Comprehensive 628-line design doc covering cache layout (shallow/ vs full/), threat model, LFS pool mirroring, git safety hardening, concurrency guarantees, and performance model.
Cache Writer Implementation
3rdparty/fetch_cache.py
New 935-line Python cache manager deriving cache names from source URLs, probing git mode (shallow/full), initializing/updating bare repos, validating LFS objects via SHA-256, safely fetching via standin-bare + alternates, replicating append-only ref anchors, and discovering nested submodules.
Cache Reader Wrapper
3rdparty/fetch_cache_wrapper.py
New 306-line CMake-injected git wrapper detecting git clone and git submodule update commands, rewriting them to use cached bare repos via --reference (shallow/full selection based on --depth), and falling back transparently on cache misses.
CMake Integration
3rdparty/CMakeLists.txt
Setup for non-Windows opt-in caching: configures wrapper script paths, locates real git/python3, injects GIT_EXECUTABLE, sets environment variables; overrides FetchContent_MakeAvailable macro to log per-dependency phases and trigger per-dependency cache update via _trtllm_fc_update_one helper.
Build System Wiring
scripts/build_wheel.py
Forwards --use-3rdparty-cache flag into CMake variables (TRTLLM_FETCHCONTENT_CACHE and optional TRTLLM_FETCHCONTENT_UPDATE_CMD); adds --configure-only mode to exit after CMake configuration.
User Documentation
3rdparty/README.md
Updated header/intro and added "FetchContent cache (--use-3rdparty-cache)" section explaining opt-in behavior, git clone --reference acceleration, and reference to design rationale.
End-to-End Tests
tests/unittest/scripts/test_fetch_cache.py, tests/integration/test_lists/test-db/l0_a10.yml
New 357-line hermetic test suite validating cache writer behavior: top-level deps (full vs shallow), submodules, and LFS pool scenarios, all asserting object-less consumer clones; minor test-list comment update.

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description contains only the repository template boilerplate with no substantive content in the required sections (Description, Test Coverage). Fill in the Description section explaining what the feature does and why it's needed, and document the Test Coverage section listing relevant tests.
Docstring Coverage ⚠️ Warning Docstring coverage is 58.62% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding --use-3rdparty-cache to accelerate cmake configuration.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
3rdparty/fetch_cache_wrapper.py (1)

262-268: 💤 Low value

Consider logging when fallback path is taken.

When the --reference clone fails and falls back to plain submodule update --init, the failure is silently recovered. While correct for resilience, a log message would help debugging cache misses.

♻️ Optional: Add logging for fallback path
         ret = subprocess.run(cmd, cwd=top).returncode
         # Fallback without --reference if it failed
         if ret != 0 and ref:
+            print(f"{TAG} submodule {sub_path}: --reference failed, falling back",
+                  file=sys.stderr)
             subprocess.run(
                 [REAL_GIT, "submodule", "update", "--init", "--", sub_path],
                 cwd=top, check=False,
             )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@3rdparty/fetch_cache_wrapper.py` around lines 262 - 268, Add a log entry when
the fallback branch is taken after the `subprocess.run(cmd, cwd=top)` returns a
non-zero `ret` and `ref` is truthy; specifically, before calling
`subprocess.run([REAL_GIT, "submodule", "update", "--init", "--", sub_path],
cwd=top, check=False)` emit a warning (e.g., via the module logger or Python's
logging) that the `--reference` clone failed and you're falling back to plain
`submodule update --init`, including context like `ret`, the original `cmd` and
`sub_path`/`top` to aid debugging.
3rdparty/fetch-cache.md (1)

99-108: 💤 Low value

Consider adding language specifiers to fenced code blocks.

For consistency with markdown best practices, the directory structure examples could use a language specifier (e.g., text or plaintext).

♻️ Optional: Add language specifiers
-```
+```text
 $CACHE_DIR/
 ├── shallow/
-```
+```text
 <consumer-gitdir>/.git/lfs/objects/<2>/<2>/<oid>

Also applies to: 152-155

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@3rdparty/fetch-cache.md` around lines 99 - 108, Add a language specifier
(e.g., "text" or "plaintext") to the fenced code blocks that show directory
trees and git-lfs paths so they follow Markdown best practices; update the
fences around the $CACHE_DIR/ example (the block that starts with "$CACHE_DIR/")
and the block containing "<consumer-gitdir>/.git/lfs/objects/<2>/<2>/<oid>" to
use ```text instead of ``` so syntax highlighters and linters treat them as
plain text.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@3rdparty/fetch_cache_wrapper.py`:
- Around line 262-268: Add a log entry when the fallback branch is taken after
the `subprocess.run(cmd, cwd=top)` returns a non-zero `ret` and `ref` is truthy;
specifically, before calling `subprocess.run([REAL_GIT, "submodule", "update",
"--init", "--", sub_path], cwd=top, check=False)` emit a warning (e.g., via the
module logger or Python's logging) that the `--reference` clone failed and
you're falling back to plain `submodule update --init`, including context like
`ret`, the original `cmd` and `sub_path`/`top` to aid debugging.

In `@3rdparty/fetch-cache.md`:
- Around line 99-108: Add a language specifier (e.g., "text" or "plaintext") to
the fenced code blocks that show directory trees and git-lfs paths so they
follow Markdown best practices; update the fences around the $CACHE_DIR/ example
(the block that starts with "$CACHE_DIR/") and the block containing
"<consumer-gitdir>/.git/lfs/objects/<2>/<2>/<oid>" to use ```text instead of ```
so syntax highlighters and linters treat them as plain text.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2100e2ae-26a4-4813-8969-e2d3589f0f1d

📥 Commits

Reviewing files that changed from the base of the PR and between ded5e98 and 0cb6d1a.

📒 Files selected for processing (9)
  • .gitignore
  • 3rdparty/CMakeLists.txt
  • 3rdparty/README.md
  • 3rdparty/fetch-cache.md
  • 3rdparty/fetch_cache.py
  • 3rdparty/fetch_cache_wrapper.py
  • scripts/build_wheel.py
  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/unittest/scripts/test_fetch_cache.py

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47519 [ run ] triggered by Bot. Commit: 0cb6d1a Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #47519 [ run ] completed with state SUCCESS. Commit: 0cb6d1a
/LLM/main/L0_MergeRequest_PR pipeline #37436 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.