-
Notifications
You must be signed in to change notification settings - Fork 412
Add table statistics #1285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add table statistics #1285
Conversation
6f0bee0 to
a70edb2
Compare
a70edb2 to
384e229
Compare
|
I plan to move the set/remove statistics methods from the Transaction class to another class, such as ManageSnapshot. In the meantime, I’d like to confirm with everyone if I’m heading in the right direction with the current implementation. |
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Added a few comments. I think it would also be helpful to include integration tests
9b15c86 to
d16ef47
Compare
|
@kevinjqliu could you please review it once more? |
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few comments.
Do you know which engine currently can generate puffin files? would be great to add an integration with a spark generated puffin file
@kevinjqliu As far as I know, only Trino can generate them. What kind of test would you like to have? I believe we are covering all relevant cases for this PR. If PyIceberg could generate or read puffin files, then I agree it would be useful to add tests to check compatibility between engines. However, I think it only makes sense to test puffin files during reading, as testing generation would mean verifying the implementation of something that isn’t our responsibility. In this case, it’s just a metadata update. What do you think? |
d16ef47 to
11120bf
Compare
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for working on this!
Regarding the integration tests, since we're manipulating table metadata to add/remove table stats, it would be great to verify that another source can interact with these stats. Not a hard blocker
|
@ndrluis do you mind resolving the merge conflict here? |
11120bf to
adfc971
Compare
|
@kevinjqliu Done! |
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ndrluis Thanks for working on this, and sorry for leaving this hanging for so long. I have some small comments, but it looks good to me 👍
| if update.snapshot_id != update.statistics.snapshot_id: | ||
| raise ValueError("Snapshot id in statistics does not match the snapshot id in the update") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit of an awkward check, but something that we have to live with I guess.
2034836 to
ba64764
Compare
8fe9992 to
217bb95
Compare
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
217bb95 to
45c60cb
Compare
45c60cb to
fb9b2a2
Compare
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, thanks @ndrluis 🙌
The Java expire snapshot process expires table statistics and partition statistics. I am implementing a statistics table to make our expire snapshot compatible with the Java implementation.