Xuanwo
Member

@Xuanwo Xuanwo commented Feb 21, 2025

Signed-off-by: Xuanwo [email protected]

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR adds bendsave, the disaster recovery (DR) tool for Databend that can back up and restore Databend data.

The RFC is available at: https://docs.databend.com/guides/community/rfcs/disaster-recovery

This PR was tested as follows:

  • Build a new databend query node with tpch data.
  • Perform tpch sqllogictests.
  • Run backup.
  • Destroy all data entirely, and remove the metasrv's data.
  • Run restore.
  • Perform tpch sqllogictests again.
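
The steps above can be modeled as a small round-trip check. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for a Databend deployment and `Connection.backup` as a stand-in for the backup tool, so nothing here is bendsave's actual interface. The `lineitem` table and `QUERIES` are hypothetical placeholders for the TPC-H data and sqllogictests.

```python
import sqlite3

# Hypothetical stand-ins for the read-only TPC-H queries.
QUERIES = [
    "SELECT COUNT(*) FROM lineitem",
    "SELECT SUM(quantity) FROM lineitem",
    "SELECT orderkey, quantity FROM lineitem ORDER BY orderkey",
]


def run_queries(conn):
    """Run every read-only query and collect the result rows."""
    return [conn.execute(q).fetchall() for q in QUERIES]


def round_trip():
    # 1. Build a node with some data (stand-in for loading TPC-H).
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE lineitem (orderkey INTEGER, quantity INTEGER)")
    source.executemany(
        "INSERT INTO lineitem VALUES (?, ?)", [(1, 17), (2, 36), (3, 8)]
    )
    source.commit()

    # 2. First test pass.
    before = run_queries(source)

    # 3. Run backup (sqlite3's online backup API stands in for the tool).
    backup = sqlite3.connect(":memory:")
    source.backup(backup)

    # 4. Destroy the original data entirely.
    source.close()

    # 5. Run restore into a fresh, empty instance.
    restored = sqlite3.connect(":memory:")
    backup.backup(restored)
    backup.close()

    # 6. Second test pass, identical to the first.
    after = run_queries(restored)
    restored.close()
    return before, after


before, after = round_trip()
print("restored data matches:", before == after)
```

If the restore is faithful, both passes return identical rows, which is the property the manual procedure checks.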

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Feb 21, 2025
@Xuanwo Xuanwo marked this pull request as draft February 21, 2025 10:47
Signed-off-by: Xuanwo <[email protected]>
@Xuanwo Xuanwo marked this pull request as ready for review March 12, 2025 10:20
@drmingdrmer
Member

  • Build a new databend query node with tpch data.
  • Perform tpch sqllogictests.
  • Run backup.
  • Destroy all data entirely, and remove the metasrv's data.
  • Run restore.
  • Perform tpch sqllogictests again.

Before conducting a thorough review, I have a question:
Is the second pass of SQL logic tests identical to the first pass?
Does it focus on reading the restored data or executing a new test on a recovered metadata/data set?

@Xuanwo
Member Author

Xuanwo commented Mar 12, 2025

Is the second pass of SQL logic tests identical to the first pass?

Yes, exactly the same.

Does it focus on reading the restored data or executing a new test on a recovered metadata/data set?

I'm trying to run a new test on a recovered metadata/data set to ensure that all the data has been restored correctly.

I chose the TPC-H test for this because it only executes queries. It does not insert new data or perform DML operations, which do not need to be tested again.
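
Because the second pass only reads, its results can be compared directly with the first pass. One way to picture that comparison, as a sketch and not a description of how sqllogictest works internally: reduce each query's rows to a digest before the backup and again after the restore, then compare the digests. `result_digest` and the sample rows are hypothetical.

```python
import hashlib


def result_digest(rows):
    """Return a SHA-256 digest of result rows, independent of row order."""
    h = hashlib.sha256()
    for line in sorted(repr(tuple(row)) for row in rows):
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


rows_before = [(1, "BUILDING", 10.5), (2, "MACHINERY", 7.25)]
# Same rows returned in a different order after the restore.
rows_after_restore = [(2, "MACHINERY", 7.25), (1, "BUILDING", 10.5)]
# One value lost during a faulty restore.
rows_corrupted = [(1, "BUILDING", 10.5), (2, "MACHINERY", 0.0)]

same = result_digest(rows_before) == result_digest(rows_after_restore)
damaged = result_digest(rows_before) == result_digest(rows_corrupted)
print(same, damaged)
```

Sorting the rows makes the digest order-insensitive, which suits queries without an `ORDER BY`; for ordered queries the sort should be dropped so that ordering differences are also detected.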

Xuanwo added 2 commits March 13, 2025 20:32
Member

@drmingdrmer drmingdrmer left a comment

Reviewed 24 of 24 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @dantengsky, @everpcpc, and @sundy-li)

@drmingdrmer
Member

An excellent PR! Clean and concise!

@Xuanwo
Member Author

Xuanwo commented Mar 13, 2025

An excellent PR! Clean and concise!

70% of me thanks you, and the remaining 30% (mostly on docs) comes from Claude 3.5 Sonnet.

@dantengsky
Member

Looks good to me.

As we continue improving this feature, here are a few considerations:

Given that the table data could grow quite large, we should probably give some extra thought to the efficiency of backup and restore operations. If I understand correctly, the current approach involves periodically backing up data from A to B, then restoring it to C, with B potentially storing multiple versions.

Even if we avoid redundant data objects across those versions at B, this could still lead to significant additional storage costs. The efficiency of restoring from B to C might also negatively impact the RTO.

Additionally, since the backed-up data already includes time travel data, this might raise questions about the necessity of multi-version backups. The vacuum operation on backups could introduce complexity in both implementation and maintenance.
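
To make the storage-cost concern concrete, here is a toy model, not bendsave's design: each backup version at B is a manifest mapping object paths to content hashes, and object bodies are stored once per hash. All names (`BackupStore`, `add_version`, `restore`) and sizes are hypothetical.

```python
import hashlib


class BackupStore:
    """Toy backup target: object bodies are stored once per content hash."""

    def __init__(self):
        self.objects = {}    # content hash -> bytes
        self.manifests = {}  # version name -> {path: content hash}

    def add_version(self, name, files):
        manifest = {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self.objects.setdefault(digest, data)  # skip bodies already stored
            manifest[path] = digest
        self.manifests[name] = manifest

    def restore(self, name):
        """Rebuild the full file set of one version from its manifest."""
        manifest = self.manifests[name]
        return {path: self.objects[d] for path, d in manifest.items()}

    def stored_bytes(self):
        return sum(len(data) for data in self.objects.values())


v1 = {"seg/a": b"A" * 100, "seg/b": b"B" * 100}
v2 = {"seg/a": b"A" * 100, "seg/b": b"B" * 100, "seg/c": b"C" * 100}

store = BackupStore()
store.add_version("v1", v1)
store.add_version("v2", v2)

# Cost of copying every version in full, without deduplication.
naive = sum(len(d) for v in (v1, v2) for d in v.values())
print("naive:", naive, "bytes; deduplicated:", store.stored_bytes(), "bytes")
```

Even with deduplication, B still holds a full extra copy of the live data, and a restore has to copy every object of the chosen version, which is where the RTO concern comes from. Removing an old version also requires checking which objects other manifests still reference, which hints at the vacuum complexity mentioned above.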

Xuanwo added 15 commits March 18, 2025 16:08
The commit switches from using epochfs to a new direct storage-to-storage
backup implementation for bendsave, removing the checkpoint-based approach.

The main changes:
- Remove epochfs dependency and checkpoint-based backup/restore
- Add direct storage copy util function
- Update command line interface to remove checkpoint flag
- Rename storage functions for clarity
Signed-off-by: Xuanwo <[email protected]>
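
The "direct storage copy util function" mentioned in the commit message can be pictured roughly as follows. This is a hypothetical sketch that uses local directories in place of object storage; `copy_storage` is an invented name, not the function added in the PR, and the file names are made up for the example.

```python
import shutil
import tempfile
from pathlib import Path


def copy_storage(src: Path, dst: Path) -> int:
    """Copy every file under src to the same relative path under dst.

    Returns the number of files copied.
    """
    copied = 0
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(path, target)
        copied += 1
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "src", Path(tmp) / "dst"
    (src / "db" / "table").mkdir(parents=True)
    (src / "db" / "table" / "block_0.parquet").write_bytes(b"block-0")
    (src / "db" / "table" / "block_1.parquet").write_bytes(b"block-1")

    # Backup and restore are the same operation with the roles swapped.
    copied = copy_storage(src, dst)
    restored = (dst / "db" / "table" / "block_0.parquet").read_bytes()
    print(copied, restored)
```

In this model there is no intermediate checkpoint format: backup copies from the live storage to the backup location, and restore copies in the opposite direction.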
@Xuanwo
Member Author

Xuanwo commented Mar 22, 2025

All CI passed. I'm going to merge this PR now.

@Xuanwo Xuanwo merged commit b83cb6e into databendlabs:main Mar 22, 2025
77 of 78 checks passed
@Xuanwo Xuanwo deleted the bendsave branch March 22, 2025 09:25
loloxwg pushed a commit to loloxwg/databend that referenced this pull request Apr 3, 2025
…ndlabs#17503)

* squash commits

Signed-off-by: Xuanwo <[email protected]>

* Fix force load

Signed-off-by: Xuanwo <[email protected]>

* backup works!

Signed-off-by: Xuanwo <[email protected]>

* Fully test

Signed-off-by: Xuanwo <[email protected]>

* Fix test

Signed-off-by: Xuanwo <[email protected]>

* Fix actions

Signed-off-by: Xuanwo <[email protected]>

* Try fix ci

Signed-off-by: Xuanwo <[email protected]>

* Fix insert in epochfs

Signed-off-by: Xuanwo <[email protected]>

* allow more time

Signed-off-by: Xuanwo <[email protected]>

* Add readme

Signed-off-by: Xuanwo <[email protected]>

* fix typo

Signed-off-by: Xuanwo <[email protected]>

* Update cargo.lock

Signed-off-by: Xuanwo <[email protected]>

* remove unneeded changes

Signed-off-by: Xuanwo <[email protected]>

* Add license check to backup tool bendsave

* Replace epochfs with new backup implementation

The commit switches from using epochfs to a new direct storage-to-storage
backup implementation for bendsave, removing the checkpoint-based approach.

The main changes:
- Remove epochfs dependency and checkpoint-based backup/restore
- Add direct storage copy util function
- Update command line interface to remove checkpoint flag
- Rename storage functions for clarity

* Remove checkpoint flag from bendsave restore

* Fix typo

Signed-off-by: Xuanwo <[email protected]>

* Avoid check license while restore

Signed-off-by: Xuanwo <[email protected]>

* Init query first

Signed-off-by: Xuanwo <[email protected]>

* Fix build

Signed-off-by: Xuanwo <[email protected]>

* Fix init query

Signed-off-by: Xuanwo <[email protected]>

* Fix clippy

Signed-off-by: Xuanwo <[email protected]>

---------

Signed-off-by: Xuanwo <[email protected]>