Create rollback and set snapshot APIs #758

chinmay-bhat · May 21, 2024

Adds rollback and set-current-snapshot APIs to ManageSnapshots.
Relevant issue: #737

chinmay-bhat · Jun 16, 2024

pyiceberg/table/__init__.py


+    def _commit_if_ref_updates_exist(self) -> None:
+        self.commit()
+        self._updates, self._requirements = (), ()


Similar to the Java implementation.

HonahX · Jun 29, 2024

The only issue here is that self.commit() will commit the whole transaction if the ManageSnapshots object comes from

iceberg-python/pyiceberg/table/__init__.py

Lines 1508 to 1521 in 2252e71

    def manage_snapshots(self) -> ManageSnapshots:
        """
        Shorthand to run snapshot management operations like create branch, create tag, etc.

        Use table.manage_snapshots().<operation>().commit() to run a specific operation.
        Use table.manage_snapshots().<operation-one>().<operation-two>().commit() to run multiple operations.
        Pending changes are applied on commit.

        We can also use context managers to make more changes. For example,

        with table.manage_snapshots() as ms:
           ms.create_tag(snapshot_id1, "Tag_A").create_tag(snapshot_id2, "Tag_B")
        """
        return ManageSnapshots(transaction=Transaction(self, autocommit=True))

where autocommit is set to True.

One possible fix is to add a parameter to transaction._apply that overrides the autocommit behavior, and to call _apply directly here.

chinmay-bhat · Jul 2, 2024

Updated! There is now an extra parameter, commit_transaction_now, that defaults to True; we override it to False when staged refs need to be applied without committing the transaction.
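To make the idea concrete, here is a minimal toy model of an autocommit transaction whose apply step can defer the commit. ToyTransaction and its fields are hypothetical stand-ins, not the PyIceberg Transaction class; only the parameter name commit_transaction_now comes from the comment above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyTransaction:
    """Hypothetical stand-in for a transaction; not the PyIceberg class."""

    autocommit: bool = False
    # Updates applied locally but not yet committed.
    staged: List[str] = field(default_factory=list)
    # One entry per commit, each holding the updates sent in that commit.
    committed: List[List[str]] = field(default_factory=list)

    def _apply(self, update: str, commit_transaction_now: bool = True) -> "ToyTransaction":
        self.staged.append(update)
        # Autocommit fires only when the caller has not asked to defer it.
        if self.autocommit and commit_transaction_now:
            self.commit_transaction()
        return self

    def commit_transaction(self) -> None:
        if self.staged:
            self.committed.append(list(self.staged))
            self.staged = []


txn = ToyTransaction(autocommit=True)
txn._apply("create-branch", commit_transaction_now=False)  # staged only
txn._apply("set-min-snapshots-to-keep")  # autocommit sends both updates together
```

In this sketch the deferred update travels with the next commit, so the two operations reach the catalog as a single commit rather than two.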

chinmay-bhat · Jul 7, 2024

Hi, I'm re-opening this resolved conversation, since I don't think adding the additional parameter is enough.

Say, in the future, we have more APIs like:

    branch_name, min_snapshots_to_keep = "test_branch_min_snapshots_to_keep", 2
    with tbl.manage_snapshots() as ms:
        ms.create_branch(branch_name=branch_name, snapshot_id=snapshot_id)
        ms.set_min_snapshots_to_keep(branch_name=branch_name, min_snapshots_to_keep=min_snapshots_to_keep)

The updates and requirements would be:

    (
        SetSnapshotRefUpdate(action='set-snapshot-ref', ref_name='test_branch_min_snapshots_to_keep', type='branch', snapshot_id=71191752302974125, max_ref_age_ms=None, max_snapshot_age_ms=None, min_snapshots_to_keep=None),
        SetSnapshotRefUpdate(action='set-snapshot-ref', ref_name='test_branch_min_snapshots_to_keep', type='branch', snapshot_id=71191752302974125, max_ref_age_ms=None, max_snapshot_age_ms=None, min_snapshots_to_keep=2)
    )
    (
        AssertRefSnapshotId(type='assert-ref-snapshot-id', ref='test_branch_min_snapshots_to_keep', snapshot_id=None),
        AssertRefSnapshotId(type='assert-ref-snapshot-id', ref='test_branch_min_snapshots_to_keep', snapshot_id=71191752302974125)
    )

The second requirement will fail with a CommitFailedException because the branch is missing from the metadata it is checked against.
With _commit_if_ref_updates_exist(), transaction.table_metadata gets updated, but when the transaction exits it calls commit_transaction(), which runs _do_commit(), which in turn runs _commit_table().

In _commit_table(), non-REST catalogs call _update_and_stage_table(), which checks every requirement against the current table metadata. That metadata does not contain the branch yet, so the second requirement fails.
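The failure can be reproduced with a small toy model. The classes below are hypothetical stand-ins (ToyAssertRefSnapshotId and ToyCommitFailed are not PyIceberg types), and refs are reduced to a plain dict from ref name to snapshot id; the point is only that both requirements are checked against the same base metadata.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


class ToyCommitFailed(Exception):
    """Hypothetical stand-in for CommitFailedException."""


@dataclass(frozen=True)
class ToyAssertRefSnapshotId:
    ref: str
    snapshot_id: Optional[int]

    def validate(self, refs: Dict[str, int]) -> None:
        current = refs.get(self.ref)
        if current is None and self.snapshot_id is not None:
            raise ToyCommitFailed(f"branch or tag {self.ref} is missing, expected {self.snapshot_id}")
        if current is not None and current != self.snapshot_id:
            raise ToyCommitFailed(f"{self.ref} has changed: expected {self.snapshot_id}, found {current}")


def validate_all_against_base(requirements: Sequence[ToyAssertRefSnapshotId], base_refs: Dict[str, int]) -> None:
    # Every requirement sees the same, unmodified base metadata.
    for requirement in requirements:
        requirement.validate(base_refs)


BRANCH = "test_branch_min_snapshots_to_keep"
SNAPSHOT_ID = 71191752302974125

base_refs: Dict[str, int] = {}  # the branch does not exist in the catalog yet
requirements = (
    ToyAssertRefSnapshotId(BRANCH, None),         # create_branch: the ref must not exist
    ToyAssertRefSnapshotId(BRANCH, SNAPSHOT_ID),  # follow-up update: the ref must exist
)

try:
    validate_all_against_base(requirements, base_refs)
    outcome = "committed"
except ToyCommitFailed as exc:
    outcome = f"failed: {exc}"
```

The first requirement passes and the second one fails, because nothing applies the first update to the metadata before the second requirement is checked.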

To fix this, we might consider one of the following solutions:

1. In transaction._apply, identify the differences between the current table metadata and the staged metadata, and pass only those differences in self._updates, without sending the ref update requirements (since we've already validated them once in transaction._apply), OR

2. Improve _update_and_stage_table() to iteratively apply each update with its corresponding requirement, always checking the requirements against updated_metadata. This is easier than (1), but only serves non-REST catalogs, OR

3. Continue with the original implementation, i.e. for every _commit_if_ref_updates_exist(), the Transaction commits to the table. This would be expensive IMO, but the result would remain atomic and correct, with minimal changes in the PR.
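The second option (iterative validation) could look roughly like the sketch below. It is a toy model under stated assumptions: ToySetRef, ToyAssertRef, ToyCommitFailed and apply_iteratively are hypothetical names, refs are a plain dict, and each update is paired with exactly one requirement.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


class ToyCommitFailed(Exception):
    """Hypothetical stand-in for CommitFailedException."""


@dataclass(frozen=True)
class ToySetRef:
    ref: str
    snapshot_id: int


@dataclass(frozen=True)
class ToyAssertRef:
    ref: str
    snapshot_id: Optional[int]


def apply_iteratively(
    base_refs: Dict[str, int],
    steps: Sequence[Tuple[ToyAssertRef, ToySetRef]],
) -> Dict[str, int]:
    staged = dict(base_refs)  # never mutate the caller's metadata
    for requirement, update in steps:
        # Check each requirement against the metadata as updated so far.
        if staged.get(requirement.ref) != requirement.snapshot_id:
            raise ToyCommitFailed(f"requirement on {requirement.ref} failed")
        staged[update.ref] = update.snapshot_id
    return staged


steps = [
    (ToyAssertRef("test_branch", None), ToySetRef("test_branch", 1)),  # create the branch
    (ToyAssertRef("test_branch", 1), ToySetRef("test_branch", 1)),     # then modify it
]
result = apply_iteratively({}, steps)
```

Because the staged copy is only returned when every step succeeds, a failing requirement leaves the base metadata untouched. As the list above notes, this only helps catalogs that validate requirements on the client side.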

HonahX · Jul 11, 2024

@chinmay-bhat Thank you so much for digging into this issue! I think you've made a great point. I am thinking of a solution similar to your first point, deriving the list of requirements when we commit the transaction: https://github.com/apache/iceberg/blob/d69ba0568a2e07dfb5af233350ad5668d9aef134/core/src/main/java/org/apache/iceberg/UpdateRequirements.java#L50-L58

This will save us from manually specifying requirements for every UpdateTableMetadata definition and also prevent the problems described above.

Let me research more on this and get back to you.
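A rough sketch of what deriving requirements from the updates could look like, loosely inspired by the idea behind the linked Java class but not a port of it. All names here (ToySetRef, ToyAssertRef, derive_requirements) are hypothetical, and refs are modeled as a plain dict. The assumption is that each ref touched by the updates yields one requirement, based on its state in the base metadata.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Set


@dataclass(frozen=True)
class ToySetRef:
    ref: str
    snapshot_id: int


@dataclass(frozen=True)
class ToyAssertRef:
    ref: str
    snapshot_id: Optional[int]


def derive_requirements(base_refs: Dict[str, int], updates: Sequence[ToySetRef]) -> List[ToyAssertRef]:
    seen: Set[str] = set()
    requirements: List[ToyAssertRef] = []
    for update in updates:
        if update.ref in seen:
            continue  # one requirement per ref, however many updates touch it
        seen.add(update.ref)
        # Require that the ref is unchanged from the base (None means "must not exist").
        requirements.append(ToyAssertRef(update.ref, base_refs.get(update.ref)))
    return requirements


updates = [ToySetRef("test_branch", 1), ToySetRef("test_branch", 1)]
derived = derive_requirements({}, updates)
```

With requirements derived this way, the two ref updates from the example above produce a single "ref must not exist" requirement, so nothing is asserted against an intermediate state the catalog never saw.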

chinmay-bhat · Jul 23, 2024

Hi @HonahX, should I make a new issue for this? Changing how we specify requirements is not strictly in the scope of this PR.

HonahX · Jul 24, 2024

Hi @chinmay-bhat. Sorry for the long wait 🙏. I was distracted by other work and some blocking issues for the 0.7.0 release. Yes, please feel free to create an issue to discuss it further. I can reply there when I have something.

pyiceberg/table/__init__.py

HonahX

Hi @chinmay-bhat. Sorry for the long wait (again, my bad). Thanks for the patience and the great work! I left some comments, but I think we are close.

HonahX · Jun 29, 2024

pyiceberg/table/__init__.py


goober · Jul 4, 2025

Great contribution, @chinmay-bhat! What would it take to get this out the door? Is there any help you need? We would love to use this feature instead of relying on Spark for this, as we do at the moment.

dyami-andrews-e3 · Dec 4, 2025

Thanks for all the great work on this! Similar follow-up here, @chinmay-bhat @kevinjqliu: I would love to see this prioritized so that we can avoid rolling out Spark in my org.

HonahX mentioned this pull request May 22, 2024

Support Snapshot Management Operations #737

Open

chinmay-bhat force-pushed the rollback_set_current_snapshot_op branch from c60938f to 1af604a on June 16, 2024 04:09

chinmay-bhat marked this pull request as ready for review June 16, 2024 04:11

chinmay-bhat commented Jun 16, 2024

chinmay-bhat mentioned this pull request Jun 18, 2024

Support Remove Branch or Tag APIs #822

Closed

HonahX self-requested a review June 29, 2024 23:19

HonahX reviewed Jul 2, 2024

chinmay-bhat added 10 commits July 2, 2024 22:53

support rollback and set current snapshot operations

f22d46d

add tests

45c25db

use tbl.history() instead of ancestors_of()

ca63831

We don't need to find all the ancestors, we only need to validate that the snapshot is an ancestor, i.e if it was ever current.

improve docstrings

f7e192a

Revert "use tbl.history() instead of ancestors_of()"

6859fa4

This reverts commit f5d489c.

find ancestor before timestamp

ea0e645

we cannot use snapshot_as_of_timestamp() as it finds previously current snapshots but not necessarily an ancestor. An example is here: https://iceberg.apache.org/docs/nightly/spark-queries/?h=ancestor#history

update tests

dc4028b

small fix

7fba98b

fix test error

1f4a404

fixes based on review

59f1626
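The distinction drawn in the "find ancestor before timestamp" commit above can be illustrated with a toy lineage. This is a sketch only: ToySnapshot and latest_ancestor_before are hypothetical names, not PyIceberg APIs. The helper walks parent pointers from the current snapshot, so a snapshot that was once current but is no longer on the current lineage is never returned.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ToySnapshot:
    snapshot_id: int
    parent_id: Optional[int]
    timestamp_ms: int


def latest_ancestor_before(
    snapshots: Dict[int, ToySnapshot], current_id: int, timestamp_ms: int
) -> Optional[ToySnapshot]:
    """Return the newest snapshot on the current lineage that is older than timestamp_ms."""
    best: Optional[ToySnapshot] = None
    node = snapshots.get(current_id)
    while node is not None:
        if node.timestamp_ms < timestamp_ms and (best is None or node.timestamp_ms > best.timestamp_ms):
            best = node
        # Follow the parent pointer; stop at the root.
        node = snapshots.get(node.parent_id) if node.parent_id is not None else None
    return best


# Lineage of the current snapshot: 1 <- 2 <- 4.
# Snapshot 3 was current at t=300 but was later rolled back, so it is not an ancestor of 4.
snapshots = {
    1: ToySnapshot(1, None, 100),
    2: ToySnapshot(2, 1, 200),
    3: ToySnapshot(3, 2, 300),
    4: ToySnapshot(4, 2, 400),
}
found = latest_ancestor_before(snapshots, current_id=4, timestamp_ms=350)
```

A lookup based purely on "what was current at t=350" would pick snapshot 3 in this toy data, whereas the ancestor walk picks snapshot 2.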

chinmay-bhat force-pushed the rollback_set_current_snapshot_op branch from d7cee84 to 59f1626 on July 2, 2024 18:05

chinmay-bhat added 2 commits July 3, 2024 00:31

add parameter to control when transaction is committed

7c7907b

move _set_ref_snapshot

e563b7e

HonahX reviewed Jul 4, 2024

chinmay-bhat added 3 commits July 4, 2024 12:15

changes after review

386496f

move test and use constants

5adccb9

fix docstring and returns

8885f78

chinmay-bhat mentioned this pull request Jul 23, 2024

Remove initial_change when dealing with table updates #950

Closed

kevinjqliu added this to the PyIceberg 0.9.0 release milestone Oct 30, 2024

kevinjqliu removed this from the PyIceberg 0.9.0 release milestone Feb 1, 2025
