Conversation

@clobrano (Collaborator) commented Nov 4, 2025

Replace basic data directory backup with proper etcd database snapshot functionality. The new implementation:

  • Creates timestamped snapshot files instead of moving the entire data directory
  • Stores backups in a non-volatile location (backup_location parameter) instead of the previous volatile HA_RSCTMP directory
  • Validates backup file existence and size after creation
  • Implements configurable retention policy via max_backup_snapshots parameter
  • Automatically cleans up old snapshots to control storage usage

Default retention is set to 3 snapshots, with backups stored in /var/lib/etcd by default. This provides better backup reliability, persistence across reboots, and storage management for etcd databases.

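For illustration, here is a minimal sketch of that snapshot-plus-retention flow (parameter names assume the usual OCF_RESKEY_ convention; $data_dir is the etcd data directory as used elsewhere in the diff, and the exact file naming and pruning logic in the agent may differ):

# Copy the etcd backend database to a timestamped snapshot in the backup
# location, then prune the oldest snapshots beyond the retention limit.
backup_dir="${OCF_RESKEY_backup_location:-/var/lib/etcd}"
max_snapshots="${OCF_RESKEY_max_backup_snapshots:-3}"
snapshot="$backup_dir/members-snapshot-$(date +%Y%m%d%H%M%S)"

cp "$data_dir/member/snap/db" "$snapshot"

# Validate that the snapshot exists and is non-empty before trusting it
if [ ! -s "$snapshot" ]; then
	ocf_log err "backup snapshot $snapshot is missing or empty"
fi

# Keep only the newest $max_snapshots snapshots (ls -t lists newest first)
ls -t "$backup_dir"/members-snapshot-* 2>/dev/null \
	| tail -n +$((max_snapshots + 1)) \
	| xargs -r rm -f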

knet-jenkins bot commented Nov 4, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2095/1/input

@jaypoulz commented Nov 10, 2025

How do these interface with the existing etcd backup/restore docs?
https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/etcd/backing-up-and-restoring-etcd-data#backing-up-etcd

I'm not worried about us competing with the existing automated backup instructions (although it would be good to verify that we don't completely break the existing instructions).

What I'm more interested in is: are the steps to restore the same? If not, could we achieve parity in how we restore so that our instructions are (at least mostly) reusable?

The restoring instructions are here: https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html-single/backup_and_restore/index#dr-restoring-cluster-state

@jaypoulz left a comment:

Looks good to me, but I'm wondering if we need to also save the static kuberesources in order to be compatible with the cluster-restore script.
https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html-single/backup_and_restore/index#dr-scenario-2-restoring-cluster-state_dr-restoring-cluster-state

@jaypoulz left a comment:

I probably shouldn't be the main approver. We should get Allen or Thomas to sanity check this one.

@clobrano marked this pull request as ready for review November 11, 2025 09:18

# Validate max_snapshots is a positive integer
if ! echo "$max_snapshots" | grep -q '^[1-9][0-9]*$'; then
	ocf_log warn "invalid max_backup_snapshots value: $max_snapshots, skipping cleanup"

Contributor:

We might want to update the error message to give more info about the regex validation. Something like "invalid max_backup_snapshots value: $max_snapshots: it should be a positive integer. Skipping cleanup"

Collaborator (Author):

Actually, I wonder if we want to allow 0 snapshots, to disable the feature 🤔

Contributor:

As long as it's not the default and we document it, I think that's a good option to give, and easy to add at this point :)

@clobrano (Collaborator, Author) commented Nov 11, 2025

yeah, we must take into account that "configurability" now means we need to change CEO to configure the agent differently 😁
In the future, there might be some real configurability and this option will be useful
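
A minimal sketch of what allowing 0 could look like (variable names as in the diff; treating 0 as "cleanup disabled" is the proposal above, and the return convention is an assumption):

# Validate max_snapshots is a non-negative integer; 0 disables cleanup
if ! echo "$max_snapshots" | grep -Eq '^(0|[1-9][0-9]*)$'; then
	ocf_log warn "invalid max_backup_snapshots value: $max_snapshots: it should be a non-negative integer. Skipping cleanup"
	return $OCF_SUCCESS
fi
if [ "$max_snapshots" -eq 0 ]; then
	ocf_log info "max_backup_snapshots is 0, snapshot cleanup disabled"
	return $OCF_SUCCESS
fi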

dest_dir_name="members-snapshot-$(date +%Y%m%d%H%M%S)"
if [ ! -d "$data_dir" ]; then
	ocf_log info "no data dir to backup"
# Ensure backup directory exists

Contributor:

Do we want to make sure there is enough disk space before committing to the backup? I worry about users setting a high rotation threshold and filling the disk

Collaborator (Author):

Good point!
I wonder how to estimate the threshold. On one hand, I could use the size of the other backups (if any, of course) plus some guard, but if a single new backup can fill up the disk, it is probably too late already 😁. I wonder if we want to use a percentage of the available disk instead 🤔

@fonta-rh (Contributor) commented Nov 11, 2025

I'm torn. The simple "will we fill the disk" check is very straightforward, but as you say, might be useless and doesn't give any options to the user.

Adding a config parameter for the threshold would be the more complete option, but it's a lot more code. Just the check function might be something like this (Claude dixit), and you still need to call it, document it, etc.

# Check if filesystem has sufficient space for backup based on usage threshold
#
# Args:
#   $1 - Directory path to check
#
# Returns:
#   0 - Sufficient space available
#   1 - Insufficient space (usage above threshold)
check_backup_disk_space()
{
	local backup_dir="$1"
	local threshold="$OCF_RESKEY_backup_disk_usage_threshold"
	local usage_percent
	local fs_path

	# Validate threshold is a number between 0-100
	if ! echo "$threshold" | grep -q '^[0-9]\{1,3\}$' || [ "$threshold" -gt 100 ]; then
		ocf_log warn "invalid backup_disk_usage_threshold value: $threshold (must be 0-100), using default"
		threshold="$OCF_RESKEY_backup_disk_usage_threshold_default"
	fi

	# Get filesystem usage percentage (without % sign)
	# df output: Filesystem Size Used Avail Use% Mounted
	usage_percent=$(df -P "$backup_dir" 2>/dev/null | awk 'NR==2 {sub(/%/, "", $5); print $5}')

	if [ -z "$usage_percent" ]; then
		ocf_log warn "could not determine disk usage for $backup_dir, skipping space check"
		return 0
	fi

	if [ "$usage_percent" -ge "$threshold" ]; then
		ocf_log warn "backup filesystem usage ${usage_percent}% exceeds threshold ${threshold}%, skipping backup"
		return 1
	fi

	ocf_log debug "backup filesystem usage ${usage_percent}% is below threshold ${threshold}%"
	return 0
}
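
The call site before taking the snapshot would then be a one-liner (hypothetical; assumes skipping the backup is treated as success):

if ! check_backup_disk_space "$backup_dir"; then
	return $OCF_SUCCESS
fi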

Collaborator (Author):

> I can use the other backups (if any, of course) + some guard

of course I was hallucinating here... we are backing up 1 specific file, I know what size it has 🤦

Collaborator (Author):

> The simple "will we fill the disk" check is very straightforward, but as you say, might be useless and doesn't give any options to the user.

Considering that this is going to run on a Kubernetes node, which has better ways to detect disk pressure, I don't think we need to complicate this change further. What do you think?

@fonta-rh (Contributor) commented Nov 11, 2025

Yeah, what about just making sure we're not the ones filling the disk, with something like:

# Check available disk space before creating backup
local available_kb
available_kb=$(df -P "$backup_dir" | awk 'NR==2 {print $4}')
local db_size_kb
db_size_kb=$(($(stat -c %s "$etcd_db_path") / 1024))

if [ "$available_kb" -lt "$((db_size_kb * 2))" ]; then
    ocf_log warn "insufficient disk space in $backup_dir (available: ${available_kb}KB, needed: ~$((db_size_kb * 2))KB), skipping backup"
    return $OCF_SUCCESS
fi

and it's out of our hands!

Collaborator (Author):

I came up with a similar code 😁 🤖

@clobrano (Collaborator, Author) commented Nov 11, 2025

> are the steps to restore the same?

TL;DR: yes 😁

Longer answer:
To avoid misunderstandings, this podman-etcd backup feature exists not to take regular backups, but to prevent total data loss before we wipe the current etcd data, which is necessary for the learner to join the existing cluster.

Our documentation suggests using the cluster-backup.sh script, which backs up live etcd data using the etcdctl snapshot save utility (https://etcd.io/docs/v3.6/tasks/operator/how-to-save-database/). However, podman-etcd saves the database before etcd even starts on the node. In this situation, since there is no etcd server running, podman-etcd uses the alternative approach of "copying the member/snap/db file from an etcd data directory" (see https://etcd.io/docs/v3.5/op-guide/recovery/, section "Snapshotting the keyspace").

The copy might indeed be broken. It depends on how the agent stopped etcd in the previous cycle (gracefully or not), but it is kind of a safety net.

EDIT:
What I want to say is that this data is not intended to be restored, but it can be restored if something very bad happens on the node, forcing the new cluster to impose its own data.
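
For reference, the two approaches side by side (a sketch; the endpoint and paths are illustrative):

# Live server: consistent snapshot through the etcd API, as cluster-backup.sh does
etcdctl --endpoints=https://127.0.0.1:2379 snapshot save /backup/etcd-snapshot.db

# No server running: copy the backend database file out of the data directory,
# which is the safety-net copy podman-etcd takes before wiping the data dir
cp /var/lib/etcd/member/snap/db /backup/etcd-db-copy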

