[None][feat] Add --use-3rdparty-cache to accelerate cmake configuration of clean build#13942
[None][feat] Add --use-3rdparty-cache to accelerate cmake configuration of clean build#13942yuantailing wants to merge 30 commits intoNVIDIA:mainNVIDIA/TensorRT-LLM:mainfrom
Conversation
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
- CMakeLists.txt: resolve real git via separate find_program() to avoid GIT_EXECUTABLE self-recursion on re-configure. - fetch_cache_wrapper.py: complete Apache-2.0 header; pin cwd=top on per-submodule subprocess calls; turn _strict_parse error into an actionable message that names the parser to extend and shows examples. - fetch_cache.py: add per-repo fcntl.flock around _ensure_cache so concurrent updates of the same bare cache serialize; reapply SAFETY_CONFIG to existing caches; replace os.walk with directed scandir+recursion under .git/modules/. - build_wheel.py: drop dead try/except around best-effort run(); warn via warnings.warn when the cache update exits non-zero. Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
All git operations in _ensure_cache (init --bare, config, fetch) are already idempotent. The only non-idempotent step was rmtree-on-failure, which was also the source of the "rmtree races fetch" hazard that the flock locking was added to mitigate. Removing the rmtree lets us drop the lock entirely: a partial bare survives across retries and gets healed by the next update via the existing-cache path. Also drop the redundant `git submodule init` call in handle_submodule since `git submodule update --init -- <path>` inits each submodule implicitly before updating it. Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
_safe_fetch_into_cache builds an ephemeral .fc-standin-<random>/ per update. On a shared cache (Lustre/NFS/GPFS), the ~20 small files git init --bare writes, plus the packed-refs/alternates/shallow we add, plus the final rmtree, all pay a per-op metadata round-trip (~10 ms on Lustre vs. ~100 us on tmpfs). Measured on a warm NVTX shallow update on Lustre: 104 ms -> 62 ms per dep, from ~20 ms saved on init --bare and ~11 ms on the closing rmtree. Switch tempfile.mkdtemp() to the system temp dir by default; fall back to the cache_parent path byte-for-byte when the system temp is not writable (a Landlock-sandboxed caller whose ruleset covers the cache but not /tmp). mkdtemp's 0700 mode + unpredictable random suffix is what I3 actually relies on, not the parent directory, so moving off $CACHE_DIR changes no security invariant. Update fetch-cache.md: I3 standin bullet, performance-model table, new "Shared-filesystem optimizations" subsection, and the residual SIGKILL-cleanup note (tmpfs-on-reboot eviction now handles leaks). Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
The FetchContent_MakeAvailable override now invokes the writer with `--log-prefix <prefix> update --cache-dir <dir> --src <src>` in both the default and TRTLLM_FETCHCONTENT_UPDATE_CMD modes. fetch_cache.py itself becomes a valid TRTLLM_FETCHCONTENT_UPDATE_CMD value. Fold the cmake-side bookkeeping into one CACHE INTERNAL list: _FC_WRITER_CMD replaces _FC_UPDATE_CMD + _FC_CACHE_PY + _FC_PYTHON, which previously existed only to be reassembled inside _trtllm_fc_update_one. Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
`git clone --reference` only shares git objects through alternates; LFS-managed payloads live in a separate content store consulted by the smudge filter, so a `--reference` clone alone still refetches every LFS object over the network. Each bare now keeps a parallel pool at `<bare>/lfs/objects/<2>/<2>/<oid>`. Sandboxed consumers can pre-symlink the pool entries into their own gitdir; smudge follows symlinks transparently, and unlike hardlinks symlinks survive the cross-device case that git-LFS's `CopyOrLinkFile` turns into a full copy on EXDEV when the bare and the consumer gitdir live on different st_devs. Writer-side: `_walk_src_lfs_objects` walks the src lfs tree under the LFS-standard <2>/<2>/<64-hex> layout, rejecting symlinks at every level (I3). `_ingest_lfs_object` hardlinks (or copies on EXDEV) into a sibling tempfile, validates SHA-256 of the staged file, and atomic- renames into place (I2). Validating after staging keeps the bytes that get hashed equal to the bytes that land in the cache, even under a concurrent dentry swap on the src side. `_ensure_cache` also gains an up-to-date short-circuit: when src has no objects of its own (only `info/alternates`, the shape of every `--reference` consumer worktree), the standin/fetch/replicate pass would move zero git bytes. Skip the entire pass and run only the LFS pool walk — a consumer that ran `git lfs pull` after a `--reference` clone may carry new LFS payloads without new git objects. fetch-cache.md gains an LFS pool section under cache layout, threat- model coverage (SHA-256-CR on the trusted side, src lfs tree on the untrusted side), I2/I3/I4/I5 LFS extensions, the native git mechanisms table extended for SHA-256, and two new residual risks (TOCTOU size- cap bypass on the src dentry, .lfs-tmp leak after SIGKILL mid-ingest). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
The previous _ingest_lfs_object hardlinked src into the staged tempfile (falling back to copy on EXDEV), validated SHA-256, and renamed into the bare's lfs/objects/<2>/<2>/<oid>. After the rename, dst and src share an inode. An attacker who keeps writing to src under the same oid filename after ingest mutates the cache entry's bytes through that shared inode, breaking the oid-name <-> content-hash binding the LFS pool relies on. Hash validation at ingest does not survive subsequent src writes. Switch _ingest_lfs_object to a single copy path: mkstemp + shutil.copyfileobj into the tempfile, then SHA-256 reverify, then atomic rename. The cache entry's inode is independent of src's, so post-ingest writes to src cannot reach cached bytes. Cost is O(filesize) on every ingest, accepted: LFS payloads are usually hundreds of MB and ingest happens once per oid (already-present oids are skipped). The errno import and the EXDEV branch go away with the hardlink path. fetch-cache.md follows the code: - LFS pool writer-side: copy-only, with the inode-independence rationale moved here from the docstring (repeated design ideas live in the doc, not the code). - I2 LFS subitem: "after the link/copy" -> "after the copy". - TOCTOU residual risk: the dual hardlink/EXDEV-copy branching collapses to a single copy path; same fstat-based mitigation pointer. - I3 LFS symlink defense: os.link is gone; the only remaining src-side I/O is open(src_path, "rb"), so the symlink rejection walker now guards that open instead of os.link. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
- Drop past-version framing ("identical to before", "historically
observed", "pre-change", "older version of this script") in favor of
present-tense statements of current behavior.
- Generalize Lustre/NFS/GPFS-specific perf claims and the
Lustre-hosted measurement table to an abstract "network filesystem
with high metadata latency" framing; remove the "Claude Code" brand
reference from the threat model.
- Trim duplication between fetch-cache.md and code comments: standin
tmpfs/Landlock/mkdtemp rationale, submodule routing, and refname
quoting rationale now live only in the doc; the comments
cross-reference instead of restating.
Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
Adds tests/unittest/scripts/test_fetch_cache.py — four pytest cases covering the FetchContent cache writer end-to-end without network or GitHub access: - top-level dep, full and shallow modes: two update cycles each (initial + new commit), each verified by `git clone --reference <bare>` landing with zero git objects of its own; - parent dep with nested submodule: bare materializes for both parent and submodule, both consumer git dirs object-less; - LFS-tracked binary: oid lands in the bare's LFS pool with matching bytes; second update is idempotent. Picked up automatically by the existing `unittest/scripts` directory entry in l0_a10 (CPU-only, pre-merge). Updated the inline comment to reflect the bucket's actual contents — the suite is no longer just lint helpers, and it now runs in ~3s rather than the original ~0.3s. Signed-off-by: Tailing Yuan <yuantailing@gmail.com>
|
/bot run --disable-fail-fast |
📝 WalkthroughWalkthroughThis PR introduces an opt-in FetchContent caching mechanism that accelerates repeated third-party dependency checkouts in TensorRT-LLM. The implementation adds a Python cache manager, a CMake-injected git wrapper, CMake configuration to wire them together, and comprehensive end-to-end tests validating correct behavior across shallow/full modes, submodules, and LFS scenarios. ChangesFetchContent Cache Acceleration
🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
3rdparty/fetch_cache_wrapper.py (1)
262-268: 💤 Low valueConsider logging when fallback path is taken.
When the
--referenceclone fails and falls back to plainsubmodule update --init, the failure is silently recovered. While correct for resilience, a log message would help debugging cache misses.♻️ Optional: Add logging for fallback path
ret = subprocess.run(cmd, cwd=top).returncode # Fallback without --reference if it failed if ret != 0 and ref: + print(f"{TAG} submodule {sub_path}: --reference failed, falling back", + file=sys.stderr) subprocess.run( [REAL_GIT, "submodule", "update", "--init", "--", sub_path], cwd=top, check=False, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@3rdparty/fetch_cache_wrapper.py` around lines 262 - 268, Add a log entry when the fallback branch is taken after the `subprocess.run(cmd, cwd=top)` returns a non-zero `ret` and `ref` is truthy; specifically, before calling `subprocess.run([REAL_GIT, "submodule", "update", "--init", "--", sub_path], cwd=top, check=False)` emit a warning (e.g., via the module logger or Python's logging) that the `--reference` clone failed and you're falling back to plain `submodule update --init`, including context like `ret`, the original `cmd` and `sub_path`/`top` to aid debugging.3rdparty/fetch-cache.md (1)
99-108: 💤 Low valueConsider adding language specifiers to fenced code blocks.
For consistency with markdown best practices, the directory structure examples could use a language specifier (e.g.,
textorplaintext).♻️ Optional: Add language specifiers
-``` +```text $CACHE_DIR/ ├── shallow/-``` +```text <consumer-gitdir>/.git/lfs/objects/<2>/<2>/<oid>Also applies to: 152-155
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@3rdparty/fetch-cache.md` around lines 99 - 108, Add a language specifier (e.g., "text" or "plaintext") to the fenced code blocks that show directory trees and git-lfs paths so they follow Markdown best practices; update the fences around the $CACHE_DIR/ example (the block that starts with "$CACHE_DIR/") and the block containing "<consumer-gitdir>/.git/lfs/objects/<2>/<2>/<oid>" to use ```text instead of ``` so syntax highlighters and linters treat them as plain text.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@3rdparty/fetch_cache_wrapper.py`:
- Around line 262-268: Add a log entry when the fallback branch is taken after
the `subprocess.run(cmd, cwd=top)` returns a non-zero `ret` and `ref` is truthy;
specifically, before calling `subprocess.run([REAL_GIT, "submodule", "update",
"--init", "--", sub_path], cwd=top, check=False)` emit a warning (e.g., via the
module logger or Python's logging) that the `--reference` clone failed and
you're falling back to plain `submodule update --init`, including context like
`ret`, the original `cmd` and `sub_path`/`top` to aid debugging.
In `@3rdparty/fetch-cache.md`:
- Around line 99-108: Add a language specifier (e.g., "text" or "plaintext") to
the fenced code blocks that show directory trees and git-lfs paths so they
follow Markdown best practices; update the fences around the $CACHE_DIR/ example
(the block that starts with "$CACHE_DIR/") and the block containing
"<consumer-gitdir>/.git/lfs/objects/<2>/<2>/<oid>" to use ```text instead of ```
so syntax highlighters and linters treat them as plain text.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2100e2ae-26a4-4813-8969-e2d3589f0f1d
📒 Files selected for processing (9)
.gitignore3rdparty/CMakeLists.txt3rdparty/README.md3rdparty/fetch-cache.md3rdparty/fetch_cache.py3rdparty/fetch_cache_wrapper.pyscripts/build_wheel.pytests/integration/test_lists/test-db/l0_a10.ymltests/unittest/scripts/test_fetch_cache.py
|
PR_Github #47519 [ run ] triggered by Bot. Commit: |
|
PR_Github #47519 [ run ] completed with state
|
Summary by CodeRabbit
New Features
--use-3rdparty-cacheflag to enable local cache reuse during the build process.--configure-onlyflag to run CMake configuration without building or packaging.Documentation
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.