-
Notifications
You must be signed in to change notification settings - Fork 412
Add all_manifests metadata table with tests
#1241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I've added some comments!
pyiceberg/table/inspect.py
Outdated
| import pyarrow as pa | ||
|
|
||
| all_manifests_schema = get_manifests_schema() | ||
| all_manifests_schema = all_manifests_schema.append(pa.field("reference_snapshot_id", pa.int64(), nullable=False)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interestingly, this isnt in the documentation https://iceberg.apache.org/docs/latest/spark-queries/#all-manifests
but only in the code https://github.com/apache/iceberg/blame/2b55fef7cc2a249d864ac26d85a4923313d96a59/core/src/main/java/org/apache/iceberg/AllManifestsTable.java#L53-L54
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's not present in iceberg docs.
| schema=get_all_manifests_schema() if is_all_manifests_table else get_manifests_schema(), | ||
| ) | ||
|
|
||
| def manifests(self) -> "pa.Table": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wdyt about adding an optional snapshot_id here? To allow users to look at the manifest for a specific snapshot, with the added benefit to iterate over all snapshot ids for all_manifests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I am aligned with this.
But there are two parameters that I'm passing to _generate_manifests_table method - snapshot_id and a boolean flag whether the output is for all_manifests table which add the additional column to all_manifests table.
So I'll need to add this second parameter for manifests method as well. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea i think thats fine since _generate_manifests_table is internal
| for column in df.column_names: | ||
| for left, right in zip(lhs[column].to_list(), rhs[column].to_list()): | ||
| assert left == right, f"Difference in column {column}: {left} != {right}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is it possible to use assert_frame_equal here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, making the change.
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for adding this metadata table!
|
@soumya-ghosh I see this one is still pending, are you still interested to get this in? |
|
Hey @Fokko nothing major is pending on my side, awaiting your approval. I will resolve the conflicts shortly. |
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! thank you for following up
| schema=get_all_manifests_schema() if is_all_manifests_table else get_manifests_schema(), | ||
| ) | ||
|
|
||
| def manifests(self) -> "pa.Table": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea i think thats fine since _generate_manifests_table is internal
|
@Fokko bumping this up for review |
|
@soumya-ghosh do you mind resolving the conflict? |
|
@kevinjqliu conflict is resolved. As the PR is approved but not merged for over a month now, hence merge conflicts happen occasionally. |
|
Sorry about the delay, this dropped off my radar. Thanks again for the contribution @soumya-ghosh ! |
Implements
all_manifestsmetadata table - #1053Have refactored the code tor re-use logic of
manifestsmetadata table.The schema of
all_manifestscontains an additional column as compared tomanifeststable - reference_snapshot_id which indicates the snapshot id those manifests are contained in.Ref - Iceberg implementation - here and here