Fixes
Release version:
Summary
There are several challenges to taking a backup of the DS:
Data consistency
Problem: information in different durable storages can be implicitly related: for example, `sessions` DB stores iterators pointing at `messages` DB. This observation leads to the first constraint imposed on the design: backups must include all DBs. Partial backups are useless.
In addition, the order of taking the backup must be followed. Let us again consider the `sessions`-`messages` DB pair: if the backup of `messages` DB doesn't contain data that has already been read by `sessions`, then the restored state is inconsistent. There's no general solution to this problem, as data is sharded in multiple dimensions and we don't have the ability to take a consistent snapshot of all shards. But if we make a reasonable assumption that there are no circular logical references between the DBs (i.e. `messages` never contains iterators of `sessions`), then this problem is solved by imposing a deterministic order of taking the backup, defined at the business logic level. In our example the order is the following: `sessions`, `shared_subs`, `messages`.

There are two possible APIs that can do that:

- `backup_priority` option in the DB settings.

The procedure of taking the backup should look like this on the high level:
- The sequence must be enforced when taking snapshots: the backup must not proceed to the next DB until snapshots of all shards of the current DB are taken.
- Taking backups of the shard snapshots (read: copying SST files) may be done asynchronously.
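The procedure above can be sketched as follows. This is a minimal illustration, not the actual implementation: the DB list, `shards_of`, and the snapshot/copy callbacks are all hypothetical names standing in for the real API.

```python
import concurrent.futures

# Hypothetical priority order, as would be derived from `backup_priority`.
DBS_BY_PRIORITY = ["sessions", "shared_subs", "messages"]

def backup(shards_of, take_snapshot, copy_snapshot):
    """Snapshot DBs strictly in priority order; copy SST files asynchronously."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        pending = []
        for db in DBS_BY_PRIORITY:
            # Barrier: snapshots of ALL shards of the current DB must be
            # taken before the backup proceeds to the next DB.
            for shard in shards_of(db):
                take_snapshot(db, shard)
            # Copying the snapshot files may overlap with later DBs.
            pending += [pool.submit(copy_snapshot, db, s)
                        for s in shards_of(db)]
        for task in pending:
            task.result()

# Example usage: record the snapshot order to observe the invariant.
log = []
backup(shards_of=lambda db: [0, 1],
       take_snapshot=lambda db, s: log.append((db, s)),
       copy_snapshot=lambda db, s: None)
print(log[:2])  # snapshots of all `sessions` shards come first
```

Note that only the snapshot step is ordered; the expensive file-copy step runs concurrently, so the consistency barrier costs little wall-clock time.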
Sharding
Problem: Raft backend is designed to shard data across different sites.
This helps with horizontal scalability: to handle more data one can add more sites,
but it creates a challenge for the backup,
as no single site can have the entire dataset.
This can be addressed by having two types of backups: local and remote.
Local backups rely on a distributed file system (NFS, SMB, HDFS, ...) to gather data from all sites in one place.
Remote backups may involve an additional step of transferring (e.g. via `rsync`) the local backup to a remote host. We do not consider this type of backup for now.
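For a local backup, one possible layout is for each site to write its shard snapshots under a per-site subdirectory of a shared mount, so the union of all sites' output is the full dataset. The root path and naming scheme below are illustrative assumptions, not the actual layout.

```python
import os

# Hypothetical layout: <backup_root>/<site_id>/<db>/<shard> on a
# distributed FS (NFS, SMB, HDFS, ...) mounted on every site.
def local_backup_path(backup_root, site_id, db, shard):
    return os.path.join(backup_root, site_id, db, str(shard))

print(local_backup_path("/mnt/nfs/backup", "site1", "messages", 3))
```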
Side effects
Problem: sometimes it's not sufficient to simply restore the DS.
For example, durable sessions make changes in the routing table.
Restoring the backup must involve user code that performs all necessary side effects.
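One way to involve user code is a per-DB hook registry: after a DB's data is restored, registered callbacks replay the side effects. This is a sketch under assumed names (`register_restore_hook`, `restore_db`); the real API may differ.

```python
# Hypothetical post-restore hook registry. After each DB is restored,
# user callbacks run once per restored record to replay side effects
# (e.g. re-inserting routes derived from durable sessions).
RESTORE_HOOKS = {}

def register_restore_hook(db, callback):
    RESTORE_HOOKS.setdefault(db, []).append(callback)

def restore_db(db, records):
    restored = list(records)  # stand-in for the actual data restore
    for hook in RESTORE_HOOKS.get(db, []):
        for record in restored:
            hook(record)
    return restored

# Example: rebuild a routing table from restored sessions.
routes = {}
register_restore_hook("sessions",
                      lambda s: routes.update({s["client_id"]: s["topics"]}))
restore_db("sessions", [{"client_id": "c1", "topics": ["t/1"]}])
print(routes)  # → {'c1': ['t/1']}
```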
Restoration with sites
Problem: site IDs must be taken into consideration when restoring a backup.
Solution: TODO
PR Checklist
`changes/ee/(feat|perf|fix|breaking)-<PR-id>.en.md` files