Conversation

@clobrano (Collaborator) commented Nov 4, 2025

Replace basic data directory backup with proper etcd database snapshot functionality. The new implementation:

  • Creates timestamped snapshot files instead of moving the entire data directory
  • Stores backups in a non-volatile location (backup_location parameter) instead of the previous volatile HA_RSCTMP directory
  • Validates backup file existence and size after creation
  • Implements configurable retention policy via max_backup_snapshots parameter
  • Automatically cleans up old snapshots to control storage usage

Default retention is set to 3 snapshots, with backups stored in /var/lib/etcd by default. This provides better backup reliability, persistence across reboots, and storage management for etcd databases.

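For illustration, here is a minimal sketch of that snapshot-plus-retention flow (parameter names assume the usual OCF_RESKEY_ convention; $data_dir is the etcd data directory as used elsewhere in the diff, and the exact file naming and pruning logic in the agent may differ):

# Copy the etcd backend database to a timestamped snapshot in the backup
# location, then prune the oldest snapshots beyond the retention limit.
backup_dir="${OCF_RESKEY_backup_location:-/var/lib/etcd}"
max_snapshots="${OCF_RESKEY_max_backup_snapshots:-3}"
snapshot="$backup_dir/members-snapshot-$(date +%Y%m%d%H%M%S)"

cp "$data_dir/member/snap/db" "$snapshot"

# Validate that the snapshot exists and is non-empty before trusting it
if [ ! -s "$snapshot" ]; then
	ocf_log err "backup snapshot $snapshot is missing or empty"
fi

# Keep only the newest $max_snapshots snapshots (ls -t lists newest first)
ls -t "$backup_dir"/members-snapshot-* 2>/dev/null \
	| tail -n +$((max_snapshots + 1)) \
	| xargs -r rm -f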

knet-jenkins bot commented Nov 4, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2095/1/input

@jaypoulz commented Nov 10, 2025

How do these interface with the existing etcd backup/restore docs?
https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/etcd/backing-up-and-restoring-etcd-data#backing-up-etcd

I'm not worried about us competing with the existing automated backup instructions (although it would be good to verify that we don't completely break the existing instructions).

What I'm more interested in is: are the steps to restore the same? If not, could we achieve parity in how we restore so that our instructions are (at least mostly) reusable?

The restoring instructions are here: https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html-single/backup_and_restore/index#dr-restoring-cluster-state

@jaypoulz left a comment:

Looks good to me, but I'm wondering if we need to also save the static kuberesources in order to be compatible with the cluster-restore script.
https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html-single/backup_and_restore/index#dr-scenario-2-restoring-cluster-state_dr-restoring-cluster-state

@jaypoulz left a comment:

I probably shouldn't be the main approver. We should get Allen or Thomas to sanity check this one.

@clobrano marked this pull request as ready for review November 11, 2025 09:18

# Validate max_snapshots is a positive integer
if ! echo "$max_snapshots" | grep -q '^[1-9][0-9]*$'; then
	ocf_log warn "invalid max_backup_snapshots value: $max_snapshots, skipping cleanup"

Contributor:

We might want to update the error message to give more info about the regex validation. Something like "invalid max_backup_snapshots value: $max_snapshots: it should be a positive integer. Skipping cleanup"

Collaborator (Author):

Actually, I wonder if we want to allow 0 snapshots, to disable the feature 🤔

Contributor:

As long as it's not the default and we document it, I think that's a good option to give, and easy to add at this point :)

@clobrano (Collaborator, Author) commented Nov 11, 2025

yeah, we must take into account that "configurability" now means we need to change CEO to configure the agent differently 😁
In the future, there might be some real configurability and this option will be useful
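
A minimal sketch of what allowing 0 could look like (variable names as in the diff; treating 0 as "cleanup disabled" is the proposal above, and the return convention is an assumption):

# Validate max_snapshots is a non-negative integer; 0 disables cleanup
if ! echo "$max_snapshots" | grep -Eq '^(0|[1-9][0-9]*)$'; then
	ocf_log warn "invalid max_backup_snapshots value: $max_snapshots: it should be a non-negative integer. Skipping cleanup"
	return $OCF_SUCCESS
fi
if [ "$max_snapshots" -eq 0 ]; then
	ocf_log info "max_backup_snapshots is 0, snapshot cleanup disabled"
	return $OCF_SUCCESS
fi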

dest_dir_name="members-snapshot-$(date +%Y%m%d%H%M%S)"
if [ ! -d "$data_dir" ]; then
	ocf_log info "no data dir to backup"
# Ensure backup directory exists

Contributor:

Do we want to make sure there is enough disk space before committing to the backup? I worry about users setting a high rotation threshold and filling the disk

Collaborator (Author):

Good point!
I wonder how to estimate the threshold. On one hand, I could use the size of the other backups (if any, of course) plus some guard, but if a single new backup can fill up the disk, it is probably too late already 😁. I wonder if we want to use a percentage of the available disk instead 🤔

@fonta-rh (Contributor) commented Nov 11, 2025

I'm torn. The simple "will we fill the disk" check is very straightforward, but as you say, might be useless and doesn't give any options to the user.

Adding a config parameter for the threshold would be the more complete option, but it's a lot more code. Just the check function might be something like this (Claude dixit), and you still need to call it, document it, etc.

# Check if filesystem has sufficient space for backup based on usage threshold
#
# Args:
#   $1 - Directory path to check
#
# Returns:
#   0 - Sufficient space available
#   1 - Insufficient space (usage above threshold)
check_backup_disk_space()
{
	local backup_dir="$1"
	local threshold="$OCF_RESKEY_backup_disk_usage_threshold"
	local usage_percent
	local fs_path

	# Validate threshold is a number between 0-100
	if ! echo "$threshold" | grep -q '^[0-9]\{1,3\}$' || [ "$threshold" -gt 100 ]; then
		ocf_log warn "invalid backup_disk_usage_threshold value: $threshold (must be 0-100), using default"
		threshold="$OCF_RESKEY_backup_disk_usage_threshold_default"
	fi

	# Get filesystem usage percentage (without % sign)
	# df output: Filesystem Size Used Avail Use% Mounted
	usage_percent=$(df -P "$backup_dir" 2>/dev/null | awk 'NR==2 {sub(/%/, "", $5); print $5}')

	if [ -z "$usage_percent" ]; then
		ocf_log warn "could not determine disk usage for $backup_dir, skipping space check"
		return 0
	fi

	if [ "$usage_percent" -ge "$threshold" ]; then
		ocf_log warn "backup filesystem usage ${usage_percent}% exceeds threshold ${threshold}%, skipping backup"
		return 1
	fi

	ocf_log debug "backup filesystem usage ${usage_percent}% is below threshold ${threshold}%"
	return 0
}
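
The call site before taking the snapshot would then be a one-liner (hypothetical; assumes skipping the backup is treated as success):

if ! check_backup_disk_space "$backup_dir"; then
	return $OCF_SUCCESS
fi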

Collaborator (Author):

> I can use the other backups (if any, of course) + some guard

of course I was hallucinating here... we are backing up 1 specific file, I know what size it has 🤦

Collaborator (Author):

> The simple "will we fill the disk" check is very straightforward, but as you say, might be useless and doesn't give any options to the user.

Considering that this is going to run on a Kubernetes node, which has better ways to detect disk pressure, I don't think we need to complicate this change further. What do you think?

@fonta-rh (Contributor) commented Nov 11, 2025

Yeah, what about just making sure we're not the ones filling the disk, with something like:

# Check available disk space before creating backup
local available_kb
available_kb=$(df -P "$backup_dir" | awk 'NR==2 {print $4}')
local db_size_kb
db_size_kb=$(($(stat -c %s "$etcd_db_path") / 1024))

if [ "$available_kb" -lt "$((db_size_kb * 2))" ]; then
    ocf_log warn "insufficient disk space in $backup_dir (available: ${available_kb}KB, needed: ~$((db_size_kb * 2))KB), skipping backup"
    return $OCF_SUCCESS
fi

and it's out of our hands!

Collaborator (Author):

I came up with a similar code 😁 🤖

@clobrano (Collaborator, Author) commented Nov 11, 2025

> are the steps to restore the same?

TL;DR: yes 😁

Longer answer:
To avoid misunderstandings, this podman-etcd backup feature exists not to take regular backups, but to prevent total data loss before we wipe the current etcd data, which is necessary for the learner to join the existing cluster.

Our documentation suggests using the cluster-backup.sh script, which backs up live etcd data using the etcdctl snapshot save utility (https://etcd.io/docs/v3.6/tasks/operator/how-to-save-database/). However, podman-etcd saves the database before etcd even starts on the node. In this situation, since there is no etcd server running, podman-etcd uses the alternative approach of "copying the member/snap/db file from an etcd data directory" (see https://etcd.io/docs/v3.5/op-guide/recovery/, section "Snapshotting the keyspace").

The copy might indeed be broken. It depends on how the agent stopped etcd in the previous cycle (gracefully or not), but it is kind of a safety net.

EDIT:
What I want to say is that this data is not intended to be restored, but it can be restored if something very bad happens on the node, forcing the new cluster to impose its own data.
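
For reference, the two approaches side by side (a sketch; the endpoint and paths are illustrative):

# Live server: consistent snapshot through the etcd API, as cluster-backup.sh does
etcdctl --endpoints=https://127.0.0.1:2379 snapshot save /backup/etcd-snapshot.db

# No server running: copy the backend database file out of the data directory,
# which is the safety-net copy podman-etcd takes before wiping the data dir
cp /var/lib/etcd/member/snap/db /backup/etcd-db-copy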

