GH-3394: Cache FileStatus in Footer to reduce redundant NameNode RPC calls #3395

Open
wangyum wants to merge 1 commit into apache:master from wangyum:getFileStatus

wangyum (Member) commented Feb 16, 2026

Rationale for this change

When reading Parquet files from HDFS, getFileStatus() is called twice for each file:

  1. During footer reading in ParquetFileReader.readAllFootersInParallel()
  2. During split generation in ParquetInputFormat.getSplits()

This creates redundant NameNode RPC calls. For workloads processing thousands of files, the redundancy significantly increases NameNode load and job startup time.
This PR caches the FileStatus in the Footer object so that split generation can reuse it instead of issuing a second RPC for the same file.
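
To make the double lookup concrete, here is a minimal, self-contained sketch. It does not use the real Hadoop or parquet-java classes: `Status`, `CountingFs`, `FooterSketch`, and the helper methods are hypothetical stand-ins, and each `getFileStatus()` call on the stub is assumed to model one NameNode RPC.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins only; none of these are the real Hadoop or parquet-java types.
class RpcCountSketch {

    /** Stand-in for a file status: just a path and a length. */
    static final class Status {
        final String path;
        final long length;

        Status(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Stub file system that counts how often getFileStatus() is called. */
    static final class CountingFs {
        int statusCalls = 0;

        Status getFileStatus(String path) {
            statusCalls++; // each call models one NameNode RPC
            return new Status(path, 1024L);
        }
    }

    /** Stand-in footer that keeps the status obtained during footer reading. */
    static final class FooterSketch {
        final String path;
        final Status cachedStatus;

        FooterSketch(String path, Status cachedStatus) {
            this.path = path;
            this.cachedStatus = cachedStatus;
        }
    }

    /** Footer reading: one status lookup per file, stored in the footer. */
    static List<FooterSketch> readFooters(CountingFs fs, List<String> paths) {
        List<FooterSketch> footers = new ArrayList<>();
        for (String p : paths) {
            footers.add(new FooterSketch(p, fs.getFileStatus(p)));
        }
        return footers;
    }

    /** Split generation that looks up every file a second time. */
    static long splitsWithoutCache(CountingFs fs, List<FooterSketch> footers) {
        long totalBytes = 0;
        for (FooterSketch f : footers) {
            totalBytes += fs.getFileStatus(f.path).length; // redundant lookup
        }
        return totalBytes;
    }

    /** Split generation that reuses the status cached in the footer. */
    static long splitsWithCache(List<FooterSketch> footers) {
        long totalBytes = 0;
        for (FooterSketch f : footers) {
            totalBytes += f.cachedStatus.length; // no extra lookup
        }
        return totalBytes;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("a.parquet", "b.parquet", "c.parquet");

        CountingFs uncached = new CountingFs();
        splitsWithoutCache(uncached, readFooters(uncached, paths));

        CountingFs cached = new CountingFs();
        splitsWithCache(readFooters(cached, paths));

        System.out.println("lookups without cache: " + uncached.statusCalls);
        System.out.println("lookups with cache:    " + cached.statusCalls);
    }
}
```

In this toy model the lookup count drops from two per file to one per file, which is the effect the description above is aiming for.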

What changes are included in this PR?

  1. Footer.java: Added FileStatus field with backward-compatible constructors
  2. ParquetFileReader.java: Pass FileStatus when creating Footer objects
  3. ParquetInputFormat.java: Reuse cached FileStatus instead of calling fs.getFileStatus() again
  4. TestFooterFileStatusCaching.java: New test suite with 5 tests proving RPC reduction
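
As a rough illustration of what a backward-compatible constructor plus reuse at the call site could look like, here is a sketch under stated assumptions. It is not the code in this PR: `FooterSketch`, `Status`, `StatusLookup`, and `resolveStatus` are hypothetical names, and the fallback to a fresh lookup when no status was cached is an assumed behavior.

```java
// Hypothetical stand-ins only; this is not the Footer class from parquet-java.
class BackwardCompatSketch {

    /** Stand-in for a file status. */
    static final class Status {
        final long length;

        Status(long length) {
            this.length = length;
        }
    }

    /** Stand-in for the single file system call that matters here. */
    interface StatusLookup {
        Status getFileStatus(String path);
    }

    static final class FooterSketch {
        private final String file;
        private final Status fileStatus; // null when the caller had no status to pass

        /** Existing-style constructor: old callers keep compiling unchanged. */
        FooterSketch(String file) {
            this(file, null);
        }

        /** New overload that also carries the status seen during footer reading. */
        FooterSketch(String file, Status fileStatus) {
            this.file = file;
            this.fileStatus = fileStatus;
        }

        String getFile() {
            return file;
        }

        Status getFileStatus() {
            return fileStatus;
        }
    }

    /** Prefer the cached status; fall back to a lookup only when none was cached. */
    static Status resolveStatus(FooterSketch footer, StatusLookup fs) {
        Status cached = footer.getFileStatus();
        return cached != null ? cached : fs.getFileStatus(footer.getFile());
    }

    public static void main(String[] args) {
        int[] lookups = {0};
        StatusLookup fs = path -> {
            lookups[0]++;
            return new Status(42L);
        };

        FooterSketch legacy = new FooterSketch("a.parquet");
        FooterSketch withStatus = new FooterSketch("b.parquet", new Status(7L));

        resolveStatus(legacy, fs);     // triggers one lookup
        resolveStatus(withStatus, fs); // reuses the cached status
        System.out.println("lookups: " + lookups[0]);
    }
}
```

Keeping the old constructor as a thin delegate means callers that never supply a status behave exactly as before, while callers that do supply one skip the extra lookup.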

Are these changes tested?

Yes. Added comprehensive test suite TestFooterFileStatusCaching with 5 test cases:

  • ✅ Footer stores and returns FileStatus correctly
  • ✅ ParquetFileReader passes FileStatus to Footer
  • ✅ Cached FileStatus is reused (saves 3 RPCs in test)
  • ✅ Complete workflow verification (saves 5 RPCs in test)
  • ✅ Backward compatibility verified

Are there any user-facing changes?

No.

Closes #3394

wangyum (Member, Author) commented Feb 16, 2026

cc @wgtmac

wgtmac (Member) commented Feb 25, 2026

@steveloughran What do you think of this?

wgtmac changed the title from "HG-3394: Cache FileStatus in Footer to reduce redundant NameNode RPC calls" to "GH-3394: Cache FileStatus in Footer to reduce redundant NameNode RPC calls" on Feb 25, 2026