@soumya-ghosh
Contributor

@soumya-ghosh soumya-ghosh commented Oct 20, 2024

Implements all_manifests metadata table - #1053

I have refactored the code to reuse the logic of the manifests metadata table.

Compared with the manifests table, the schema of all_manifests contains one additional column, reference_snapshot_id, which indicates the snapshot ID that each manifest belongs to.
Ref - Iceberg implementation - here and here

@kevinjqliu kevinjqliu left a comment

Thanks for the PR! I've added some comments!

pyiceberg/table/inspect.py (outdated)
import pyarrow as pa

all_manifests_schema = get_manifests_schema()
all_manifests_schema = all_manifests_schema.append(pa.field("reference_snapshot_id", pa.int64(), nullable=False))
Contributor

Contributor Author

Yes, it's not present in iceberg docs.

pyiceberg/table/inspect.py
schema=get_all_manifests_schema() if is_all_manifests_table else get_manifests_schema(),
)

def manifests(self) -> "pa.Table":
Contributor

What do you think about adding an optional snapshot_id here? That would let users look at the manifests for a specific snapshot, with the added benefit that all_manifests could simply iterate over all snapshot IDs.

Contributor Author

Yeah, I am aligned with this.
However, I'm passing two parameters to the _generate_manifests_table method: snapshot_id, and a boolean flag indicating whether the output is for the all_manifests table, in which case the additional column is added.
So I'll need to add this second parameter to the manifests method as well. Thoughts?

Contributor

Yeah, I think that's fine, since _generate_manifests_table is internal.
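
The design discussed in this thread can be sketched roughly as follows. This is a simplified, pure-Python model and not the PyIceberg implementation: `Snapshot`, `InspectSketch`, the `path` key, and the list-of-dicts return type are all hypothetical stand-ins (the real methods return a `pa.Table`). It shows one possible reading of the exchange, in which the public `manifests` method takes an optional `snapshot_id` while the boolean flag stays on the internal helper.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Snapshot:
    """Hypothetical stand-in for a table snapshot (not PyIceberg's class)."""

    snapshot_id: int
    manifest_paths: List[str]


class InspectSketch:
    """Simplified model of the inspect API shape discussed above (an assumption)."""

    def __init__(self, snapshots: List[Snapshot], current_snapshot_id: int) -> None:
        self._snapshots = {s.snapshot_id: s for s in snapshots}
        self._current_snapshot_id = current_snapshot_id

    def _generate_manifests_table(
        self, snapshot: Snapshot, is_all_manifests_table: bool = False
    ) -> List[Dict[str, Any]]:
        # Internal helper shared by both tables; the flag only controls
        # whether the extra reference_snapshot_id column is emitted.
        rows: List[Dict[str, Any]] = []
        for path in snapshot.manifest_paths:
            row: Dict[str, Any] = {"path": path}
            if is_all_manifests_table:
                row["reference_snapshot_id"] = snapshot.snapshot_id
            rows.append(row)
        return rows

    def manifests(self, snapshot_id: Optional[int] = None) -> List[Dict[str, Any]]:
        # Default to the current snapshot when no ID is given.
        target = self._current_snapshot_id if snapshot_id is None else snapshot_id
        return self._generate_manifests_table(self._snapshots[target])

    def all_manifests(self) -> List[Dict[str, Any]]:
        # Iterate over every snapshot and tag each row with its snapshot ID.
        rows: List[Dict[str, Any]] = []
        for snapshot in self._snapshots.values():
            rows.extend(
                self._generate_manifests_table(snapshot, is_all_manifests_table=True)
            )
        return rows
```

In this sketch, callers of `manifests` never see the flag, so the public signature stays small while `all_manifests` reuses the same row-building logic.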

Comment on lines 938 to 940
for column in df.column_names:
    for left, right in zip(lhs[column].to_list(), rhs[column].to_list()):
        assert left == right, f"Difference in column {column}: {left} != {right}"
Contributor

nit: is it possible to use assert_frame_equal here?

Contributor Author

Yes, making the change.
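
For reference, a minimal sketch of what the suggested change might look like, assuming `lhs` and `rhs` are pandas DataFrames (the thread does not show their types). The frames and the `frames_differ` helper below are made up for illustration and are not taken from the PR's test code.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Illustrative data only; the column names are assumptions.
lhs = pd.DataFrame({"path": ["a.avro", "b.avro"], "length": [10, 20]})
rhs = pd.DataFrame({"path": ["a.avro", "b.avro"], "length": [10, 20]})

# Replaces the nested per-column, per-value loop with a single call.
# Raises AssertionError describing the first mismatch it finds.
assert_frame_equal(lhs, rhs)


def frames_differ(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Hypothetical helper: True if assert_frame_equal reports a difference."""
    try:
        assert_frame_equal(left, right)
    except AssertionError:
        return True
    return False
```

Note that `assert_frame_equal` also compares dtypes, index, and column order by default, so it is stricter than the value-by-value loop it replaces.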

@kevinjqliu kevinjqliu left a comment

LGTM! Thanks for adding this metadata table!

@kevinjqliu kevinjqliu requested review from Fokko and HonahX November 5, 2024 22:46
@Fokko
Contributor

Fokko commented Nov 19, 2024

@soumya-ghosh I see this one is still pending. Are you still interested in getting this in?

@soumya-ghosh
Contributor Author

Hey @Fokko nothing major is pending on my side, awaiting your approval. I will resolve the conflicts shortly.

@kevinjqliu kevinjqliu left a comment

LGTM! thank you for following up

@soumya-ghosh
Contributor Author

@Fokko bumping this up for review

@kevinjqliu
Contributor

@soumya-ghosh do you mind resolving the conflict?

@soumya-ghosh
Contributor Author

@kevinjqliu the conflict is resolved. Since the PR has been approved but not merged for over a month now, merge conflicts come up occasionally.

@kevinjqliu kevinjqliu merged commit cad0ad7 into apache:main Jan 11, 2025
7 checks passed
@kevinjqliu
Contributor

kevinjqliu commented Jan 11, 2025

Sorry about the delay, this dropped off my radar. Thanks again for the contribution @soumya-ghosh !
