-
Notifications
You must be signed in to change notification settings - Fork 412
Implement write.metadata.delete-after-commit.enabled to clean up old metadata files
#1607
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this @kaushiksrini 🙌 This looks great.
I have one minor style comment. Could you also add this to the docs? Thanks!
tests/catalog/test_sql.py
Outdated
| transaction = table.transaction() | ||
| update = transaction.update_schema() | ||
| update.add_column(path=f"new_column_{i}", field_type=IntegerType()) | ||
| update.commit() | ||
| transaction.commit_transaction() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use the context managers? I think that's easier to read
| transaction = table.transaction() | |
| update = transaction.update_schema() | |
| update.add_column(path=f"new_column_{i}", field_type=IntegerType()) | |
| update.commit() | |
| transaction.commit_transaction() | |
| with table.transaction() as transaction: | |
| with transaction.update_schema() as update: | |
| update.add_column(path=f"new_column_{i}", field_type=IntegerType()) |
|
@Fokko thanks! used context managers and added to the documentation |
| | `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk | | ||
| | `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | | ||
| | `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | | ||
| | `write.metadata.delete-after-commit.enabled` | Boolean | False | Whether to automatically delete old *tracked* metadata files after each table commit. It will retain a number of the most recent metadata files, which can be set using property `write.metadata.previous-versions-max`. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pyiceberg/catalog/__init__.py
Outdated
| if current_table is not None: | ||
| self._delete_old_metadata(io, current_table.metadata, updated_metadata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the right place for this operation?
looking at the java implementation, changes are committed to the table before old metadata files are deleted
https://github.com/apache/iceberg/blob/d935460bbed5da0ac205a67912b70fa682ee84df/core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java#L125-L126
At this point, changes are only applied to the table metadata but is not yet committed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the java implementation uses deleteAfterCommit
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_delete_old_metadata should be after table is committed
|
@kevinjqliu thanks for the review! I moved the logic after the table is committed |
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM! Thanks for the PR. I think _do_commit is the right place to add this.
pyiceberg/table/__init__.py
Outdated
|
|
||
| # https://github.com/apache/iceberg/blob/f6faa58/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L527 | ||
| # delete old metadata if METADATA_DELETE_AFTER_COMMIT_ENABLED is set to true | ||
| self.catalog._delete_old_metadata(self.io, self.metadata, response.metadata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a comment here explaining how METADATA_PREVIOUS_VERSIONS_MAX is taken into account?
| TableProperties.METADATA_PREVIOUS_VERSIONS_MAX, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also maybe we want to wrap this in try/catch and throw a warning as to not block the commit process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a try-catch, but the deleteFiles already has a warning for actually deleting the files. I could only see an exception arise from _delete_old_metadata's operations.
iceberg-python/pyiceberg/catalog/__init__.py
Lines 254 to 269 in dd175aa
| def delete_files(io: FileIO, files_to_delete: Set[str], file_type: str) -> None: | |
| """Delete files. | |
| Log warnings if failing to delete any file. | |
| Args: | |
| io: The FileIO used to delete the object. | |
| files_to_delete: A set of file paths to be deleted. | |
| file_type: The type of the file. | |
| """ | |
| for file in files_to_delete: | |
| try: | |
| io.delete(file) | |
| except OSError as exc: | |
| logger.warning(msg=f"Failed to delete {file_type} file {file}", exc_info=exc) | |
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for following up
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested this locally, and it works great. Thanks @kaushiksrini for working on this, and thanks @kevinjqliu for the review 👍
|
@Fokko thanks for the review! and @kevinjqliu thanks for making the changes + review! |
|
Thanks @kaushiksrini I applied the simple test changes via github. Thanks @Fokko for the review |
write.metadata.delete-after-commit.enabled to clean up old metadata files
write.metadata.delete-after-commit.enabled to clean up old metadata fileswrite.metadata.delete-after-commit.enabled to clean up old metadata files
Implements property
write.metadata.delete-after-commit.enabledfrom https://iceberg.apache.org/docs/1.5.1/maintenance/#remove-old-metadata-files.Closes #1199