Add `all_manifests` metadata table with tests #1241

soumya-ghosh · Oct 20, 2024

Implements all_manifests metadata table - #1053

Have refactored the code to reuse the logic of the manifests metadata table.

Compared to the manifests table, the schema of all_manifests contains one additional column, reference_snapshot_id, which indicates the snapshot id those manifests are contained in.
Ref - Iceberg implementation - here and here

kevinjqliu

Thanks for the PR! I've added some comments!

kevinjqliu · Oct 28, 2024

pyiceberg/table/inspect.py

+    import pyarrow as pa
+
+    all_manifests_schema = get_manifests_schema()
+    all_manifests_schema = all_manifests_schema.append(pa.field("reference_snapshot_id", pa.int64(), nullable=False))


interestingly, this isn't in the documentation https://iceberg.apache.org/docs/latest/spark-queries/#all-manifests
but only in the code https://github.com/apache/iceberg/blame/2b55fef7cc2a249d864ac26d85a4923313d96a59/core/src/main/java/org/apache/iceberg/AllManifestsTable.java#L53-L54

Yes, it's not present in iceberg docs.
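
For readers following this thread, the relationship between the two schemas can be sketched in plain Python. This is only an illustration under stated assumptions: the schema is modeled as a list of (name, type, nullable) tuples instead of a real pyarrow schema, and the two base columns are stand-ins, not the full manifests schema. Only the function names and the reference_snapshot_id column come from the diff above.

```python
from typing import List, Tuple

# Hypothetical stand-in for a schema field: (name, type, nullable).
Field = Tuple[str, str, bool]


def get_manifests_schema() -> List[Field]:
    # Illustrative subset only; the real manifests schema has more columns.
    return [("path", "string", False), ("length", "int64", False)]


def get_all_manifests_schema() -> List[Field]:
    # all_manifests = manifests schema + one required extra column.
    # Building a new list leaves the base schema untouched.
    return get_manifests_schema() + [("reference_snapshot_id", "int64", False)]
```

Deriving the second schema from the first keeps the two from drifting apart if columns are later added to the manifests table.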

kevinjqliu · Oct 28, 2024

pyiceberg/table/inspect.py

+            schema=get_all_manifests_schema() if is_all_manifests_table else get_manifests_schema(),
        )

+    def manifests(self) -> "pa.Table":


wdyt about adding an optional snapshot_id here? That would let users look at the manifests for a specific snapshot, with the added benefit that all_manifests can iterate over all snapshot ids.

Yeah, I am aligned with this.
But there are two parameters that I'm passing to the _generate_manifests_table method: snapshot_id and a boolean flag indicating whether the output is for the all_manifests table, which adds the additional column.
So I'll need to add this second parameter to the manifests method as well. Thoughts?

yea i think thats fine since _generate_manifests_table is internal
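
The shape discussed in this thread can be sketched roughly as follows. This is not the merged code: the class name InspectSketch, the dict-based snapshot store, and the row layout are all hypothetical, and rows are plain dicts rather than a pyarrow table. Only the names _generate_manifests_table, is_all_manifests_table, snapshot_id, and reference_snapshot_id come from the conversation.

```python
from typing import Dict, List, Optional


class InspectSketch:
    """Hypothetical stand-in for the inspect class; not PyIceberg's API."""

    def __init__(self, snapshots: Dict[int, List[str]], current_snapshot_id: Optional[int]):
        # Maps snapshot id -> manifest paths referenced by that snapshot.
        self._snapshots = snapshots
        self._current = current_snapshot_id

    def _generate_manifests_table(self, snapshot_id: int, is_all_manifests_table: bool = False) -> List[dict]:
        # Internal helper shared by both public methods.
        rows = []
        for path in self._snapshots.get(snapshot_id, []):
            row = {"path": path}
            if is_all_manifests_table:
                # Extra column that only all_manifests carries.
                row["reference_snapshot_id"] = snapshot_id
            rows.append(row)
        return rows

    def manifests(self, snapshot_id: Optional[int] = None) -> List[dict]:
        # Fall back to the current snapshot when none is given.
        sid = snapshot_id if snapshot_id is not None else self._current
        return [] if sid is None else self._generate_manifests_table(sid)

    def all_manifests(self) -> List[dict]:
        # Iterate over every snapshot and tag each row with its snapshot id.
        rows: List[dict] = []
        for sid in self._snapshots:
            rows.extend(self._generate_manifests_table(sid, is_all_manifests_table=True))
        return rows
```

Because the boolean flag lives only on the internal helper, the public manifests method needs just the optional snapshot_id.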

kevinjqliu · Oct 28, 2024

tests/integration/test_inspect_table.py

+    for column in df.column_names:
+        for left, right in zip(lhs[column].to_list(), rhs[column].to_list()):
+            assert left == right, f"Difference in column {column}: {left} != {right}"


nit: is it possible to use assert_frame_equal here?

Yes, making the change.

kevinjqliu

LGTM! Thanks for adding this metadata table!

Fokko · Nov 19, 2024

@soumya-ghosh I see this one is still pending, are you still interested to get this in?

soumya-ghosh · Nov 20, 2024

Hey @Fokko nothing major is pending on my side, awaiting your approval. I will resolve the conflicts shortly.

kevinjqliu

LGTM! thank you for following up

soumya-ghosh · Dec 11, 2024

@Fokko bumping this up for review

kevinjqliu · Jan 8, 2025

@soumya-ghosh do you mind resolving the conflict?

soumya-ghosh · Jan 8, 2025

@kevinjqliu the conflict is resolved. Since the PR has been approved but not merged for over a month now, merge conflicts come up occasionally.

kevinjqliu · Jan 11, 2025

Sorry about the delay, this dropped off my radar. Thanks again for the contribution @soumya-ghosh !

Add all_manifests metadata table with tests

73e9bc8

soumya-ghosh mentioned this pull request Oct 20, 2024

[feat] add missing metadata tables #1053

Open

16 tasks

kevinjqliu reviewed Oct 28, 2024

soumya-ghosh added 3 commits October 28, 2024 18:28

Move get_manifests_schema and get_all_manifests_schema to InspectTable class

044512e

Merge branch 'main' into feat/metadata_tables

952fa1c

Update tests for all_manifests table

ef991e6

kevinjqliu added this to the PyIceberg 0.9.0 release milestone Oct 30, 2024

soumya-ghosh requested a review from kevinjqliu November 5, 2024 14:29

kevinjqliu approved these changes Nov 5, 2024

kevinjqliu requested review from Fokko and HonahX November 5, 2024 22:46

soumya-ghosh added 2 commits November 20, 2024 14:13

Merge branch 'main' into feat/metadata_tables

3a920f9

Added linter changes in inspect.py

0b7747e

kevinjqliu approved these changes Nov 20, 2024

Merge branch 'main' into feat/metadata_tables

6be1a3c

kevinjqliu merged commit cad0ad7 into apache:main Jan 11, 2025
7 checks passed
