Conversation

ieQu1 commented Nov 17, 2025

Fixes

Release version:

Summary

There are several challenges to taking a backup of the DS:

  1. Data consistency
  2. Sharding
  3. Side effects

Data consistency

Problem: information in different durable storages can be implicitly related: for example, the sessions DB stores iterators pointing into the messages DB.
This observation leads to the first constraint imposed on the design: backups must include all DBs.
Partial backups are useless.
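
As a hedged illustration of such an implicit reference (the record below is a placeholder, not the actual session-state schema): a persisted session holds an opaque DS iterator that is only meaningful against the data stored in the messages DB.

    %% Placeholder record, for illustration only (not the real session-state schema).
    %% The iterator is an opaque handle into the messages DB; restoring `sessions`
    %% without the matching `messages` data leaves it dangling.
    -record(session_state, {
        id       :: binary(),
        iterator :: term()   %% opaque emqx_ds iterator pointing into the messages DB
    }).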

In addition, the backup must be taken in a specific order.
Let us again consider the sessions–messages DB pair:
if the backup of the messages DB doesn't contain data that has already been read by sessions, then the restored state is inconsistent.
There's no general solution to this problem, as data is sharded in multiple dimensions and we don't have the ability to take a consistent snapshot of all shards.
But if we make the reasonable assumption that there are no circular logical references between the DBs (i.e. messages never contain iterators of sessions), then this problem is solved by imposing a deterministic order of taking the backup, defined at the business-logic level.
In our example the order is the following:

  1. sessions, shared_subs
  2. messages

There are two possible APIs that can do that:

  1. Add an integer backup_priority option to the DB settings.
  2. Create a backup API that lets the user pass the list of DBs to back up.
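
A rough sketch of what the two options could look like; `backup_priority` and the `emqx_ds_backup` module are hypothetical names used here purely for illustration, not existing APIs (the `open_db` options shown are likewise illustrative):

    %% Option 1 (hypothetical): a per-DB `backup_priority` in the DB settings;
    %% DBs with a lower priority are backed up first.
    ok = emqx_ds:open_db(sessions, #{backend => builtin_raft, backup_priority => 10}),
    ok = emqx_ds:open_db(messages, #{backend => builtin_raft, backup_priority => 20}).

    %% Option 2 (hypothetical): the caller passes an explicitly ordered list of DBs.
    ok = emqx_ds_backup:take([sessions, shared_subs, messages], #{dir => "/mnt/backup"}).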

The procedure of taking the backup should look like this at a high level:


              shard1                               shard1
  take       /      \    DB1          take        /      \   DB2
  DB1--->-----shard2--->-snapshot---->DB2------>---shard2--->snapshot
  snapshot   \      /    ready        snapshot    \      /   ready
              shard3      \                        shard3      \
                           "---> backup ---------------.        \
                                snapshots               \        \
                                                         \        "---> backup -------.
                                                          \            snapshots       \
                                                           "---------------------------->backup ready

The sequence must be enforced when taking snapshots:
the backup must not proceed to the next DB until snapshots of all shards of the current DB have been taken.

Taking backups of the shard snapshots (read: copying SST files) may be done asynchronously.
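
A minimal sketch of that loop, assuming hypothetical helper functions (none of the names below are an existing API): each DB is fully snapshotted before the next one starts, while copying the resulting SST files runs in the background.

    %% Hypothetical sketch: snapshot DBs strictly in the given order,
    %% copy the resulting files asynchronously.
    backup(DBs, BackupDir) ->
        Copiers =
            lists:map(
              fun(DB) ->
                      Snapshots = [take_shard_snapshot(DB, Shard) ||
                                      Shard <- list_shards(DB)],        %% placeholder helpers
                      %% Do not proceed to the next DB until snapshots of all
                      %% of its shards are ready.
                      ok = wait_snapshots_ready(Snapshots),
                      %% Copying SST files into the backup directory may run
                      %% asynchronously, in parallel with snapshotting the next DB.
                      spawn_copier(DB, Snapshots, BackupDir)
              end,
              DBs),
        wait_copiers(Copiers).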

Sharding

Problem: the Raft backend is designed to shard data across different sites.
This helps with horizontal scalability: to handle more data, one can add more sites.
However, it creates a challenge for the backup, as no single site holds the entire dataset.

This can be addressed by having two types of backups: local and remote.

Local backups rely on a distributed file system (NFS, SMB, HDFS, ...) to gather data from all sites in one place.
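
A hedged sketch of the local-backup flavour (placeholder helpers again): every site copies only the shard snapshots it hosts into the same shared mount, so the union over all sites forms the complete backup.

    %% Hypothetical sketch: each site writes its part of the backup to a directory
    %% on a distributed file system (NFS, SMB, HDFS, ...) that all sites can see.
    local_backup(DB, SharedDir) ->
        Site = this_site(),                                  %% placeholder: this site's ID
        lists:foreach(
          fun(Shard) ->
                  Dest = filename:join([SharedDir, atom_to_list(DB), binary_to_list(Shard)]),
                  ok = filelib:ensure_path(Dest),
                  copy_sst_files(DB, Shard, Dest)            %% placeholder helper
          end,
          shards_hosted_here(DB, Site)).                     %% placeholder helper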

Remote backups may involve an additional step of transferring (e.g. via rsync) the local backup to a remote host.
We do not consider this type of backup for now.

Side effects

Problem: sometimes it's not sufficient to simply restore the DS.
For example, durable sessions make changes to the routing table.

Restoring the backup must involve user code that performs all necessary side effects.
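
One way this could look (a sketch only; no such behaviour exists today): the owner of each DB implements a restore hook that the backup subsystem calls after that DB's data has been restored.

    %% Hypothetical behaviour: called once per DB after its data has been restored,
    %% so the owning application can replay its side effects.
    -callback on_restore(DB :: atom()) -> ok.

    %% E.g. for durable sessions the hook would walk the restored sessions and
    %% re-create their entries in the routing table (placeholder helpers below).
    on_restore(sessions) ->
        lists:foreach(fun restore_session_routes/1, list_restored_sessions()),
        ok.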

Restoration with sites

Problem: site IDs must be taken into consideration when restoring a backup.

Solution: TODO

PR Checklist

  • For internal contributor: there is a jira ticket to track this change
  • The changes are covered with new or existing tests
  • Change log for changes visible by users has been added to changes/ee/(feat|perf|fix|breaking)-<PR-id>.en.md files
  • Schema changes are backward compatible or intentionally breaking (describe the changes and the reasoning in the summary)

ieQu1 changed the base branch from master to release-60 on November 17, 2025 at 11:57

ieQu1 commented Dec 3, 2025

Solution 2: create a new generation in the messages DB after restoring the backup. The biggest problem is message "overwrite": messages that went missing during the backup could later be silently overwritten by new writes. But if we create a new generation, the session can detect the missing messages, and new messages will go to the new generation.
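
A sketch of that step, assuming the existing emqx_ds:add_generation/1 call is the right entry point (the surrounding restore hook is hypothetical):

    %% After the messages DB has been restored, start a fresh generation so that
    %% new writes land in the new generation instead of overwriting restored data,
    %% and sessions can detect the gap.
    post_restore(messages = DB) ->
        ok = emqx_ds:add_generation(DB).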
