Conversation

ieQu1 commented Nov 17, 2025

Fixes

Release version:

Summary

There are several challenges to taking a backup of the DS:

  1. Data consistency
  2. Sharding
  3. Side effects

Data consistency

Problem: information in different durable storages can be implicitly related: for example, the sessions DB stores iterators pointing into the messages DB.
This observation leads to the first constraint imposed on the design: backups must include all DBs.
Partial backups are useless.
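
As a hedged illustration of such an implicit reference (the record below is a placeholder, not the actual session-state schema): a persisted session holds an opaque DS iterator that is only meaningful against the data stored in the messages DB.

    %% Placeholder record, for illustration only (not the real session-state schema).
    %% The iterator is an opaque handle into the messages DB; restoring `sessions`
    %% without the matching `messages` data leaves it dangling.
    -record(session_state, {
        id       :: binary(),
        iterator :: term()   %% opaque emqx_ds iterator pointing into the messages DB
    }).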

In addition, the backup must be taken in a specific order.
Let us again consider the sessions–messages DB pair:
if the backup of the messages DB doesn't contain data that has already been read by sessions, then the restored state is inconsistent.
There's no general solution to this problem, as data is sharded in multiple dimensions and we don't have the ability to take a consistent snapshot of all shards.
But if we make the reasonable assumption that there are no circular logical references between the DBs (i.e. messages never contain iterators of sessions), then this problem is solved by imposing a deterministic order of taking the backup, defined at the business-logic level.
In our example the order is the following:

  1. sessions, shared_subs
  2. messages

There are two possible APIs that can do that:

  1. Add an integer backup_priority option to the DB settings.
  2. Create a backup API that lets the user pass the list of DBs to back up.
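
A rough sketch of what the two options could look like; `backup_priority` and the `emqx_ds_backup` module are hypothetical names used here purely for illustration, not existing APIs (the `open_db` options shown are likewise illustrative):

    %% Option 1 (hypothetical): a per-DB `backup_priority` in the DB settings;
    %% DBs with a lower priority are backed up first.
    ok = emqx_ds:open_db(sessions, #{backend => builtin_raft, backup_priority => 10}),
    ok = emqx_ds:open_db(messages, #{backend => builtin_raft, backup_priority => 20}).

    %% Option 2 (hypothetical): the caller passes an explicitly ordered list of DBs.
    ok = emqx_ds_backup:take([sessions, shared_subs, messages], #{dir => "/mnt/backup"}).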

The procedure of taking the backup should look like this at a high level:


              shard1                               shard1
  take       /      \    DB1          take        /      \   DB2
  DB1--->-----shard2--->-snapshot---->DB2------>---shard2--->snapshot
  snapshot   \      /    ready        snapshot    \      /   ready
              shard3      \                        shard3      \
                           "---> backup ---------------.        \
                                snapshots               \        \
                                                         \        "---> backup -------.
                                                          \            snapshots       \
                                                           "---------------------------->backup ready

The sequence must be enforced when taking snapshots:
the backup must not proceed to the next DB until snapshots of all shards of the current DB have been taken.

Taking backups of the shard snapshots (read: copying SST files) may be done asynchronously.
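
A minimal sketch of that loop, assuming hypothetical helper functions (none of the names below are an existing API): each DB is fully snapshotted before the next one starts, while copying the resulting SST files runs in the background.

    %% Hypothetical sketch: snapshot DBs strictly in the given order,
    %% copy the resulting files asynchronously.
    backup(DBs, BackupDir) ->
        Copiers =
            lists:map(
              fun(DB) ->
                      Snapshots = [take_shard_snapshot(DB, Shard) ||
                                      Shard <- list_shards(DB)],        %% placeholder helpers
                      %% Do not proceed to the next DB until snapshots of all
                      %% of its shards are ready.
                      ok = wait_snapshots_ready(Snapshots),
                      %% Copying SST files into the backup directory may run
                      %% asynchronously, in parallel with snapshotting the next DB.
                      spawn_copier(DB, Snapshots, BackupDir)
              end,
              DBs),
        wait_copiers(Copiers).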

Sharding

Problem: the Raft backend is designed to shard data across different sites.
This helps with horizontal scalability: to handle more data, one can add more sites.
However, it creates a challenge for the backup, as no single site holds the entire dataset.

This can be addressed by having two types of backups: local and remote.

Local backups rely on a distributed file system (NFS, SMB, HDFS, ...) to gather data from all sites in one place.
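
A hedged sketch of the local-backup flavour (placeholder helpers again): every site copies only the shard snapshots it hosts into the same shared mount, so the union over all sites forms the complete backup.

    %% Hypothetical sketch: each site writes its part of the backup to a directory
    %% on a distributed file system (NFS, SMB, HDFS, ...) that all sites can see.
    local_backup(DB, SharedDir) ->
        Site = this_site(),                                  %% placeholder: this site's ID
        lists:foreach(
          fun(Shard) ->
                  Dest = filename:join([SharedDir, atom_to_list(DB), binary_to_list(Shard)]),
                  ok = filelib:ensure_path(Dest),
                  copy_sst_files(DB, Shard, Dest)            %% placeholder helper
          end,
          shards_hosted_here(DB, Site)).                     %% placeholder helper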

Remote backups may involve an additional step of transferring (e.g. via rsync) the local backup to a remote host.
We do not consider this type of backup for now.

Side effects

Problem: sometimes it's not sufficient to simply restore the DS.
For example, durable sessions make changes to the routing table.

Restoring the backup must involve user code that performs all necessary side effects.
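
One way this could look (a sketch only; no such behaviour exists today): the owner of each DB implements a restore hook that the backup subsystem calls after that DB's data has been restored.

    %% Hypothetical behaviour: called once per DB after its data has been restored,
    %% so the owning application can replay its side effects.
    -callback on_restore(DB :: atom()) -> ok.

    %% E.g. for durable sessions the hook would walk the restored sessions and
    %% re-create their entries in the routing table (placeholder helpers below).
    on_restore(sessions) ->
        lists:foreach(fun restore_session_routes/1, list_restored_sessions()),
        ok.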

Restoration with sites

Problem: site IDs must be taken into consideration when restoring a backup.

Solution: TODO

PR Checklist

  • For internal contributor: there is a jira ticket to track this change
  • The changes are covered with new or existing tests
  • Change log for changes visible by users has been added to changes/ee/(feat|perf|fix|breaking)-<PR-id>.en.md files
  • Schema changes are backward compatible or intentionally breaking (describe the changes and the reasoning in the summary)

ieQu1 changed the base branch from master to release-60 on November 17, 2025 at 11:57

ieQu1 commented Dec 3, 2025

Solution 2: create a new generation in the messages DB after restoring the backup. The biggest problem is message "overwrite": messages that went missing during the backup could later be silently overwritten by new writes. But if we create a new generation, the session can detect the missing messages, and new messages will go to the new generation.
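
A sketch of that step, assuming the existing emqx_ds:add_generation/1 call is the right entry point (the surrounding restore hook is hypothetical):

    %% After the messages DB has been restored, start a fresh generation so that
    %% new writes land in the new generation instead of overwriting restored data,
    %% and sessions can detect the gap.
    post_restore(messages = DB) ->
        ok = emqx_ds:add_generation(DB).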
