Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Conversation

@ndrluis
Copy link
Collaborator

@ndrluis ndrluis commented Aug 8, 2024

Fixes #1020

@ndrluis ndrluis added this to the PyIceberg 0.7.1 release milestone Aug 8, 2024
@ndrluis ndrluis requested review from Fokko, HonahX and kevinjqliu August 8, 2024 13:23
@Minfante377
Copy link
Contributor

Just wanted to confirm that I tried this out with the table that caused my issue #1020 and it works as expected

tests/integration/test_writes/test_writes.py Outdated Show resolved Hide resolved
@sungwy
Copy link
Collaborator

sungwy commented Aug 8, 2024

Hi @ndrluis - thanks for testing and fixing this tricky issue.

@ndrluis ndrluis requested a review from sungwy August 8, 2024 14:35
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting catch! This is when we cannot delete the file through Iceberg metadata, but we drop all the content of the Parquet file anyway. Thanks for fixing this @ndrluis 🙌

@ndrluis ndrluis force-pushed the fix-overwrite-with-filter branch 2 times, most recently from 1d1f987 to ac7b4db Compare August 8, 2024 15:23
Comment on lines +596 to +598
if len(filtered_df) == 0:
replaced_files.append((original_file.file, []))
elif len(df) != len(filtered_df):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is it more readable if inlined?

if filtered_df and len(df) != len(filtered_df):

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I don't see much of a difference.

tbl = _create_table(session_catalog, identifier, data=[data], schema=schema)
tbl.overwrite(data, In("id", ["1", "2", "3"]))

assert len(tbl.scan().to_arrow()) == 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since all data match the filter, the overwrite operation is a no-op, right? if so, can we assert that in the test? maybe show that the files are the same

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a no-op, it's deleting the whole file. The change is in the delete method, not in the overwrite method.

I believe that testing the behavior is enough.

Copy link
Contributor

@kevinjqliu kevinjqliu Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I see. The change is to make delete a no-op.
Sequence of operation

  • pass in `overwrite_filter which matches the entire table
  • in delete, the overwrite_filter is inversed, preserve_row_filter
  • use preserve_row_filter on data files.
  • if the result is empty, then we don't include this data file in deletion

Previously, we end up trying to write an empty data file.

Copy link
Contributor

@Fokko Fokko Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly. One thing to note is that it would be even more correct to add this to a DELETE snapshot, it is not replaced, but just dropped. Please note that most engines just use OVERWRITE.

tests/integration/test_writes/test_writes.py Show resolved Hide resolved
@ndrluis ndrluis force-pushed the fix-overwrite-with-filter branch from ac7b4db to 486dd61 Compare August 8, 2024 15:50
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@Fokko Fokko merged commit 8f2e787 into apache:main Aug 9, 2024
sungwy pushed a commit that referenced this pull request Aug 9, 2024
sungwy pushed a commit that referenced this pull request Aug 9, 2024
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 7, 2024
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Overwrite with filter division by zero error

5 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.