Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Conversation

@chinmay-bhat
Copy link
Contributor

@chinmay-bhat chinmay-bhat commented May 21, 2024

Creates ManageSnapshots() rollback and set snapshot APIs.
Relevant issue - #737

@chinmay-bhat chinmay-bhat force-pushed the rollback_set_current_snapshot_op branch from c60938f to 1af604a Compare June 16, 2024 04:09
@chinmay-bhat chinmay-bhat marked this pull request as ready for review June 16, 2024 04:11

def _commit_if_ref_updates_exist(self) -> None:
self.commit()
self._updates, self._requirements = (), ()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to Java implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only issue here is that self.commit will commit the transaction if the ManageSnapshot object comes from

def manage_snapshots(self) -> ManageSnapshots:
"""
Shorthand to run snapshot management operations like create branch, create tag, etc.
Use table.manage_snapshots().<operation>().commit() to run a specific operation.
Use table.manage_snapshots().<operation-one>().<operation-two>().commit() to run multiple operations.
Pending changes are applied on commit.
We can also use context managers to make more changes. For example,
with table.manage_snapshots() as ms:
ms.create_tag(snapshot_id1, "Tag_A").create_tag(snapshot_id2, "Tag_B")
"""
return ManageSnapshots(transaction=Transaction(self, autocommit=True))

where autocommit is set to true.

One possible way to fix this is that we can add additional parameters in transaction._apply to override the autocommit behavior and call that directly here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated! Now there's an extra parameter commit_transaction_now that defaults to True, and we override it to False when staged refs need to be applied without commiting the transaction.

Copy link
Contributor Author

@chinmay-bhat chinmay-bhat Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I'm re-opening this resolved conversation, since I don't think adding the additional parameter is enough.

Say, in the future, we have more APIs like:

branch_name, min_snapshots_to_keep = "test_branch_min_snapshots_to_keep", 2
with tbl.manage_snapshots() as ms:
        ms.create_branch(branch_name=branch_name, snapshot_id=snapshot_id)
        ms.set_min_snapshots_to_keep(branch_name=branch_name, min_snapshots_to_keep=min_snapshots_to_keep)

The updates and requirements would be :
(SetSnapshotRefUpdate(action='set-snapshot-ref', ref_name='test_branch_min_snapshots_to_keep', type='branch', snapshot_id=71191752302974125, max_ref_age_ms=None, max_snapshot_age_ms=None, min_snapshots_to_keep=None), SetSnapshotRefUpdate(action='set-snapshot-ref', ref_name='test_branch_min_snapshots_to_keep', type='branch', snapshot_id=71191752302974125, max_ref_age_ms=None, max_snapshot_age_ms=None, min_snapshots_to_keep=2))

(AssertRefSnapshotId(type='assert-ref-snapshot-id', ref='test_branch_min_snapshots_to_keep', snapshot_id=None), AssertRefSnapshotId(type='assert-ref-snapshot-id', ref='test_branch_min_snapshots_to_keep', snapshot_id=71191752302974125))

The 2nd requirement will fail with a CommitFailedException as the branch would be missing.
With _commit_if_ref_updates_exist() , the transaction.table_metadata would get updated, but when the transaction exits, it will try to commit_transaction() which runs _do_commit() which runs _commit_table().

In _commit_table(), for non-REST catalogs, we _update_and_stage_table() where we check the requirements with current table metadata, here the 2nd requirement fails.

To fix this, we might consider one of the following solutions:

  1. in transaction._apply identify the differences between current table metadata and staged metadata, and only pass those differences in self._updates, while not sending the ref updates requirements (since we've already validated them once in transaction._apply) OR
  2. improve _update_and_stage_table() to iteratively apply the update with corresponding requirement and always check the requirements with updated_metadata. This is easier than (1), but only serves non-REST catalogs. OR
  3. continue the original implementation, i.e. for every commit_if_ref_exists(), the Transaction commits to the table. This would be expensive IMO, but the result would remain atomic and correct, with minimal changes in the PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chinmay-bhat Thank you so much for digging into this issue! I think you've made a great point. I am thinking of a similar solution like your first point: to derive a list of requirements when we commit the transaction: https://github.com/apache/iceberg/blob/d69ba0568a2e07dfb5af233350ad5668d9aef134/core/src/main/java/org/apache/iceberg/UpdateRequirements.java#L50-L58

This will save us from manually specifying requirements for every UpdateTableMetadata definition and also prevent the problems described above.

Let me research more on this and get back to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @HonahX, should I make a new issue for this? Since changing how we specify requirements is not strictly in the scope of this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @chinmay-bhat. Sorry for the long wait🙏. I was distracted by other stuff and some blocking issues for 0.7.0 release. Yes, please feel free to create an issue to further discuss it. I can reply to that when I get something.

pyiceberg/table/__init__.py Show resolved Hide resolved
@HonahX HonahX self-requested a review June 29, 2024 23:19
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @chinmay-bhat. Sorry for the long wait...(again, my bad).. Thanks for the patience and the great work! I left some comments but I think we are close.


def _commit_if_ref_updates_exist(self) -> None:
self.commit()
self._updates, self._requirements = (), ()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only issue here is that self.commit will commit the transaction if the ManageSnapshot object comes from

def manage_snapshots(self) -> ManageSnapshots:
"""
Shorthand to run snapshot management operations like create branch, create tag, etc.
Use table.manage_snapshots().<operation>().commit() to run a specific operation.
Use table.manage_snapshots().<operation-one>().<operation-two>().commit() to run multiple operations.
Pending changes are applied on commit.
We can also use context managers to make more changes. For example,
with table.manage_snapshots() as ms:
ms.create_tag(snapshot_id1, "Tag_A").create_tag(snapshot_id2, "Tag_B")
"""
return ManageSnapshots(transaction=Transaction(self, autocommit=True))

where autocommit is set to true.

One possible way to fix this is that we can add additional parameters in transaction._apply to override the autocommit behavior and call that directly here.

pyiceberg/table/__init__.py Outdated Show resolved Hide resolved
pyiceberg/table/__init__.py Outdated Show resolved Hide resolved
pyiceberg/table/__init__.py Outdated Show resolved Hide resolved
We don't need to find all the ancestors, we only need to validate that the snapshot is an ancestor, i.e if it was ever current.
we cannot use snapshot_as_of_timestamp() as it finds previously current snapshots but not necessarily an ancestor.
An example is here: https://iceberg.apache.org/docs/nightly/spark-queries/?h=ancestor#history
@chinmay-bhat chinmay-bhat force-pushed the rollback_set_current_snapshot_op branch from d7cee84 to 59f1626 Compare July 2, 2024 18:05
pyiceberg/table/__init__.py Outdated Show resolved Hide resolved
pyiceberg/table/__init__.py Outdated Show resolved Hide resolved
@goober
Copy link

goober commented Jul 4, 2025

Great contribution @chinmay-bhat what would it take to get this out of the door? Any help that you need? We would love to be able to use this feature instead of relying on spark for this at the moment

@dyami-andrews-e3
Copy link

Thanks for all the great work on this! Similar follow up here @chinmay-bhat @kevinjqliu, would love to see this prioritized so that we can avoid rolling out spark in my org.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.