OCPBUGS-60588: podman-etcd: enhance etcd data backup with snapshots and retention #2095
base: main
Conversation
Replace basic data directory backup with proper etcd database snapshot functionality. The new implementation:
- Creates timestamped snapshot files instead of moving the entire data directory
- Stores backups in a non-volatile location (backup_location parameter) instead of the previous volatile HA_RSCTMP directory
- Validates backup file existence and size after creation
- Implements configurable retention policy via max_backup_snapshots parameter
- Automatically cleans up old snapshots to control storage usage

Default retention is set to 3 snapshots, with backups stored in /var/lib/etcd by default. This provides better backup reliability, persistence across reboots, and storage management for etcd databases.
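For illustration, the retention side could be implemented along these lines. This is only a sketch: the helper name cleanup_old_snapshots, the snapshot-*.db naming pattern, and the OCF_RESKEY_* variable names are assumptions, not necessarily the PR's exact code.

# Hypothetical helper: keep only the newest $OCF_RESKEY_max_backup_snapshots
# snapshot files in $OCF_RESKEY_backup_location and delete the rest.
cleanup_old_snapshots()
{
    local backup_dir="$OCF_RESKEY_backup_location"
    local keep="$OCF_RESKEY_max_backup_snapshots"
    local old_snapshot

    # List snapshots newest first, skip the first $keep, remove the remainder
    ls -1t "$backup_dir"/snapshot-*.db 2>/dev/null | tail -n "+$((keep + 1))" |
    while read -r old_snapshot; do
        ocf_log info "removing old etcd snapshot $old_snapshot"
        rm -f -- "$old_snapshot"
    done
}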
Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2095/1/input
How do these interface with the existing etcd backup / restore docs? I'm not worried about us competing with the existing automated backup instructions (although it would be good to verify that we don't completely break the existing instructions). What I'm more interested in is: are the steps to restore the same? If not, could we achieve parity in how we restore so that our instructions are (at least mostly) reusable? The restore instructions are here: https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html-single/backup_and_restore/index#dr-restoring-cluster-state
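For comparison with those docs, restoring a member from one of the snapshots this agent produces would, in principle, be a standard etcd snapshot restore. A rough sketch follows; the snapshot file name, paths, and ETCD_* variables are placeholders (etcdutl ships with etcd 3.5+, older releases use "etcdctl snapshot restore"):

# Illustrative restore of a member data directory from a saved snapshot
etcdutl snapshot restore /var/lib/etcd/snapshot-20250101120000.db \
    --data-dir /var/lib/etcd/restored \
    --name "$ETCD_NAME" \
    --initial-cluster "$ETCD_INITIAL_CLUSTER" \
    --initial-advertise-peer-urls "$ETCD_INITIAL_ADVERTISE_PEER_URLS"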
jaypoulz
left a comment
Looks good to me, but I'm wondering if we need to also save the static kuberesources in order to be compatible with the cluster-restore script.
https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html-single/backup_and_restore/index#dr-scenario-2-restoring-cluster-state_dr-restoring-cluster-state
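If we wanted the on-disk layout to line up with what that documented restore flow consumes (a snapshot_*.db alongside a static_kuberesources_*.tar.gz), an untested sketch of the extra step could look like this; backup_dir stands in for the agent's backup_location and the archive naming is an assumption:

# Untested sketch: archive the static pod resources next to the snapshot so
# the backup directory resembles what the documented backup script produces
timestamp=$(date +%Y%m%d%H%M%S)
tar -czf "$backup_dir/static_kuberesources_${timestamp}.tar.gz" \
    -C /etc/kubernetes static-pod-resources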
jaypoulz
left a comment
I probably shouldn't be the main approver. We should get Allen or Thomas to sanity check this one.
# Validate max_snapshots is a positive integer
if ! echo "$max_snapshots" | grep -q '^[1-9][0-9]*$'; then
    ocf_log warn "invalid max_backup_snapshots value: $max_snapshots, skipping cleanup"
We might want to update the error message to give more info about the regex validation. Something like "invalid max_backup_snapshots value: $max_snapshots: it should be a number between 0 and 99. Skipping cleanup"
Actually, I wonder if we want to allow 0 snapshots, to disable the feature 🤔
As long as it's not the default and we document it, I think that's a good option to give, and easy to add at this point :)
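A sketch of what the relaxed validation could look like, with 0 meaning "retention cleanup disabled" (this is illustrative only, not the PR's actual diff):

# Sketch only: accept 0 as "retention disabled" in addition to positive values
if ! echo "$max_snapshots" | grep -q '^[0-9][0-9]*$'; then
    ocf_log warn "invalid max_backup_snapshots value: $max_snapshots: it should be a non-negative integer. Skipping cleanup"
    return 0
fi
if [ "$max_snapshots" -eq 0 ]; then
    ocf_log info "max_backup_snapshots is 0, snapshot retention cleanup disabled"
    return 0
fi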
yeah, we must take into account that "configurability" now means we need to change CEO to configure the agent differently 😁
In the future, there might be some real configurability and this option will be useful
dest_dir_name="members-snapshot-$(date +%Y%M%d%H%M%S)"
if [ ! -d $data_dir ]; then
    ocf_log info "no data dir to backup"
# Ensure backup directory exists
Do we want to make sure there is enough disk space before committing to the backup? I worry about users setting a high rotation threshold and filling the disk
Good point!
I wonder how to estimate the threshold. From one side, I can use the other backups (if any, of course) + some guard, but if 1 single new backup can fill up the disk, it is probably too late already 😁. I wonder if we want to use a percentage of the disk available instead 🤔
I'm torn. The simple "will we fill the disk" check is very straightforward, but as you say, might be useless and doesn't give any options to the user.
Adding a config parameter for the threshold would be the more complete option, but it's a lot more code. Just the check function might be something like this (Claude dixit), and you still need to call it, document it, etc.
# Check if filesystem has sufficient space for backup based on usage threshold
#
# Args:
#   $1 - Directory path to check
#
# Returns:
#   0 - Sufficient space available
#   1 - Insufficient space (usage above threshold)
check_backup_disk_space()
{
    local backup_dir="$1"
    local threshold="$OCF_RESKEY_backup_disk_usage_threshold"
    local usage_percent
    local fs_path

    # Validate threshold is a number between 0-100
    if ! echo "$threshold" | grep -q '^[0-9]\{1,3\}$' || [ "$threshold" -gt 100 ]; then
        ocf_log warn "invalid backup_disk_usage_threshold value: $threshold (must be 0-100), using default"
        threshold="$OCF_RESKEY_backup_disk_usage_threshold_default"
    fi

    # Get filesystem usage percentage (without % sign)
    # df output: Filesystem Size Used Avail Use% Mounted
    usage_percent=$(df -P "$backup_dir" 2>/dev/null | awk 'NR==2 {sub(/%/, "", $5); print $5}')

    if [ -z "$usage_percent" ]; then
        ocf_log warn "could not determine disk usage for $backup_dir, skipping space check"
        return 0
    fi

    if [ "$usage_percent" -ge "$threshold" ]; then
        ocf_log warn "backup filesystem usage ${usage_percent}% exceeds threshold ${threshold}%, skipping backup"
        return 1
    fi

    ocf_log debug "backup filesystem usage ${usage_percent}% is below threshold ${threshold}%"
    return 0
}
I can use the other backups (if any, of course) + some guard
of course I was hallucinating here... we are backing up 1 specific file, I know what size it has 🤦
The simple "will we fill the disk" check is very straightforward, but as you say, might be useless and doesn't give any options to the user.
Considering that this is going to go on a kubernetes node that has better ways to detect disk pressure, I don't think we need to complicate this change further. What do you think?
Yeah, what about just making sure we're not the ones filling the disk, with something like:
# Check available disk space before creating backup
local available_kb
available_kb=$(df -P "$backup_dir" | awk 'NR==2 {print $4}')
local db_size_kb
db_size_kb=$(($(stat -c %s "$etcd_db_path") / 1024))
if [ "$available_kb" -lt "$((db_size_kb * 2))" ]; then
    ocf_log warn "insufficient disk space in $backup_dir (available: ${available_kb}KB, needed: ~$((db_size_kb * 2))KB), skipping backup"
    return $OCF_SUCCESS
fi

and it's out of our hands!
I came up with a similar code 😁 🤖
TL;DR: yes 😁
Longer answer: Our documentation suggests using the [...]. The copy might indeed be broken. It depends on how the agent stopped etcd in the previous cycle (gracefully or not), but it is kind of a safety net.