:page-categories: Management, High Availability, Disaster Recovery, Emergency Response
// tag::single-source[]
ifndef::env-cloud[]
[NOTE]
====
include::shared:partial$enterprise-license.adoc[]
====
endif::[]
This guide provides step-by-step procedures for emergency failover when your primary Redpanda cluster becomes unavailable. Follow these procedures only during active disasters when immediate failover is required.
// TODO: All command output examples in this guide need verification by running actual commands in test environment
[IMPORTANT]
====
This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:./failover.adoc[]. Ensure you have completed the xref:manage:disaster-recovery/shadowing/overview.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.
====
ifdef::env-cloud[]
NOTE: Shadowing is supported on BYOC and Dedicated clusters running Redpanda version 25.3 and later.
endif::[]
== Emergency failover procedure
Follow these steps during an active disaster:
Verify that the following conditions exist before proceeding with failover:

* Topics should be in `ACTIVE` state (not `FAULTED`).
* Replication lag should be reasonable for your RPO requirements.
==== Understanding replication lag
Use xref:reference:rpk/rpk-shadow/rpk-shadow-status.adoc[`rpk shadow status`] to check lag, which shows the message count difference between source and shadow partitions:
[,bash]
----
rpk shadow status <shadow-link-name>
----

Example output (truncated):

----
Name: <topic-name>, State: ACTIVE
1 2345 2579 2568 11
----

The partition information shows the following:
* **SRC_LSO**: Source partition last stable offset
* **SRC_HWM**: Source partition high watermark
* **DST_HWM**: Shadow (destination) partition high watermark
* **Lag**: Message count difference between source and shadow partitions
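
Using the sample values for partition 1 above, lag is simply the source high watermark minus the shadow high watermark. A minimal illustrative sketch of that arithmetic in Python (not part of `rpk`):

[,python]
----
def replication_lag(src_hwm: int, dst_hwm: int) -> int:
    """Message-count lag between a source partition and its shadow."""
    return src_hwm - dst_hwm

# Values for partition 1 in the sample output: SRC_HWM=2579, DST_HWM=2568
print(replication_lag(src_hwm=2579, dst_hwm=2568))  # 11
----

Compare this number against your RPO budget, expressed as a message count, to decide whether failing over at this moment meets your recovery objectives.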
After successful failover, focus on recovery planning and process improvement.

1. **Document the incident**: Record timeline, impact, and lessons learned
2. **Update runbooks**: Improve procedures based on what you learned
Failover is the process of converting shadow topics or an entire shadow cluster from read-only replicas to fully writable resources, and ceasing replication from the source cluster. You can fail over individual topics for selective workload migration or fail over the entire cluster for comprehensive disaster recovery. This critical operation transforms your shadow resources into operational production assets, allowing you to redirect application traffic when the source cluster becomes unavailable.
ifdef::env-cloud[]
You can fail over a shadow link using the Redpanda Cloud UI, `rpk`, or the Data Plane API.
endif::[]
ifndef::env-cloud[]
You can fail over a shadow link using Redpanda Console, `rpk`, or the Admin API.
endif::[]
== Failover behavior
When you initiate failover, Redpanda performs the following operations:
Topic failover is irreversible. Once failed over, topics cannot return to shadow mode, and automatic fallback to the original source cluster is not supported.
NOTE: To avoid a split-brain scenario after failover, ensure that all clients are reconfigured to point to the shadow cluster before resuming write activity.
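
One way to enforce this is a pre-flight check over your client configurations before producers are re-enabled. A hypothetical sketch (the config shape and cluster addresses below are assumptions, not Redpanda APIs):

[,python]
----
SHADOW_BOOTSTRAP = "shadow-cluster:9092"  # hypothetical address of the promoted cluster

def safe_to_resume_writes(client_configs: list[dict]) -> bool:
    """True only when every client already targets the shadow cluster."""
    return all(cfg.get("bootstrap.servers") == SHADOW_BOOTSTRAP for cfg in client_configs)

clients = [
    {"name": "orders-producer", "bootstrap.servers": "shadow-cluster:9092"},
    {"name": "billing-producer", "bootstrap.servers": "primary-cluster:9092"},  # stale
]
print(safe_to_resume_writes(clients))  # False: one producer still points at the old primary
----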
== Failover commands
You can perform failover at different levels of granularity to match your disaster recovery needs:
=== Individual topic failover
To fail over a specific shadow topic while leaving other topics in the shadow link still replicating, use the topic-level `rpk shadow` failover command.

Use this approach when you need to selectively fail over specific workloads.

=== Complete shadow link failover (cluster failover)
To fail over all shadow topics associated with the shadow link simultaneously, use the `rpk shadow` failover command at the link level.

Force deleting a shadow link is irreversible and immediately fails over all topics.

The shadow link itself has a simple state model:
* **`ACTIVE`**: Shadow link is operating normally, replicating data
* **`PAUSED`**: Shadow link replication is temporarily halted by user action
Shadow links do not have dedicated failover states. Instead, the link's operational status is determined by the collective state of its shadow topics.
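
Because the link's status is derived from its topics, you can reason about it as a simple aggregation. The following sketch illustrates that logic only; it is not Redpanda's implementation:

[,python]
----
def link_operational_status(topic_states: dict[str, str]) -> str:
    """Summarize a shadow link from its topics' states (simplified illustration)."""
    states = set(topic_states.values())
    if "FAULTED" in states:
        return "degraded"
    if states <= {"FAILED_OVER"}:
        return "fully failed over"
    if "FAILING_OVER" in states:
        return "failover in progress"
    return "replicating"

print(link_operational_status({"orders": "ACTIVE", "billing": "ACTIVE"}))  # replicating
----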
Individual shadow topics progress through specific states during failover:

* **`FAULTED`**: Shadow topic has encountered an error and is not replicating
* **`PAUSED`**: Replication temporarily halted by user action
== Monitor failover progress
To monitor failover progress using the status command, run:
[,bash]
----
rpk shadow status <shadow-link-name>
----

The output shows individual topic states and any issues encountered during the failover process. For detailed command options, see xref:reference:rpk/rpk-shadow/rpk-shadow-status.adoc[`rpk shadow status`].
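
If you script the monitoring step, you can parse the `Name: ..., State: ...` lines in the status output to surface topics that need attention. A minimal sketch, assuming that line format:

[,python]
----
import re

def faulted_topics(status_output: str) -> list[str]:
    """Topic names whose state is FAULTED in `rpk shadow status` text."""
    pattern = re.compile(r"Name:\s*(\S+),\s*State:\s*(\w+)")
    return [name for name, state in pattern.findall(status_output) if state == "FAULTED"]

sample = "Name: orders, State: ACTIVE\nName: billing, State: FAULTED"
print(faulted_topics(sample))  # ['billing']
----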
Task states during monitoring:
* **`ACTIVE`**: Task is operating normally and replicating data
* **`FAULTED`**: Task encountered an error and requires attention
== Failover considerations and limitations
Before implementing failover procedures, understand these key considerations that affect your disaster recovery strategy and operational planning.
**Data consistency:**
* Some data loss may occur due to replication lag at the time of failover.
After completing failover:

* Verify that applications can produce and consume messages normally
* Consider deleting the shadow link if failover was successful and permanent
For emergency situations, see xref:./failover-runbook.adoc[Failover Runbook].
Monitor your shadow links to ensure proper replication performance and understand your disaster recovery readiness. Use `rpk` commands, metrics, and status information to track shadow link health and troubleshoot issues.
For detailed command options, see xref:reference:rpk/rpk-shadow/rpk-shadow-list.adoc[`rpk shadow list`] and xref:reference:rpk/rpk-shadow/rpk-shadow-describe.adoc[`rpk shadow describe`]. This command shows the complete configuration of the shadow link, including connection settings, filters, and synchronization options.
To check your shadow link status and ensure proper operation, run:
[,bash]
----
rpk shadow status <shadow-link-name>
----
For troubleshooting specific issues, you can use command options to show individual status sections. See xref:reference:rpk/rpk-shadow/rpk-shadow-status.adoc[`rpk shadow status`] for available status options. The status output includes the following:
* **Shadow link state**: Overall operational state (`ACTIVE`, `PAUSED`).
* **Individual topic states**: Current state of each replicated topic (`ACTIVE`, `FAULTED`, `FAILING_OVER`, `FAILED_OVER`, `PAUSED`).
* **Task status**: Health of replication tasks across brokers (`ACTIVE`, `FAULTED`, `NOT_RUNNING`, `LINK_UNAVAILABLE`). For details about shadow link tasks, see xref:manage:disaster-recovery/shadowing/setup.adoc#shadow-link-tasks[Shadow link tasks].
* **Lag information**: Replication lag per partition showing source vs shadow high watermarks (HWM).
[[shadow-link-metrics]]
== Metrics
Shadowing provides comprehensive metrics to track replication performance and health with the xref:reference:public-metrics-reference.adoc[`public_metrics`] endpoint.
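
Because the metrics are exposed in Prometheus text format, you can filter the shadow link series out of a scrape with a few lines. An illustrative sketch; the scrape excerpt below is hypothetical, and only the `redpanda_shadow_link` metric prefix comes from this page:

[,python]
----
def shadow_metric_lines(text: str) -> list[str]:
    """Keep only sample lines for shadow link metrics (comment lines start with '#')."""
    return [line for line in text.splitlines() if line.startswith("redpanda_shadow_link")]

# Hypothetical scrape excerpt; fetch the real text from your metrics endpoint.
sample = (
    "# HELP redpanda_shadow_link_shadow_lag messages behind the source\n"
    'redpanda_shadow_link_shadow_lag{link="dr-link"} 11\n'
    "redpanda_uptime_seconds_total 100\n"
)
print(shadow_metric_lines(sample))  # ['redpanda_shadow_link_shadow_lag{link="dr-link"} 11']
----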
To check shadow link health and replication lag quickly from the command line:

[,bash]
----
rpk shadow list | grep -v "ACTIVE" || echo "All shadow links healthy"
rpk shadow status <shadow-link-name> | grep -E "LAG|Lag"
----
=== Alert conditions
Configure monitoring alerts for the following conditions, which indicate problems with Shadowing:
* **High replication lag**: When `redpanda_shadow_link_shadow_lag` exceeds your RPO requirements
* **Connection errors**: When `redpanda_shadow_link_client_errors` increases rapidly
* **Link unavailability**: When tasks show `LINK_UNAVAILABLE` indicating source cluster connectivity issues
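
These conditions map directly onto threshold checks. A sketch of the evaluation logic, where the RPO budget and error-growth limit are placeholder values you would tune:

[,python]
----
def lag_alert(shadow_lag: int, rpo_max_messages: int) -> bool:
    """Fire when redpanda_shadow_link_shadow_lag exceeds the RPO budget."""
    return shadow_lag > rpo_max_messages

def error_rate_alert(error_counts: list[int], max_increase: int) -> bool:
    """Fire when redpanda_shadow_link_client_errors grows too fast between samples."""
    return any(b - a > max_increase for a, b in zip(error_counts, error_counts[1:]))

print(lag_alert(shadow_lag=11, rpo_max_messages=100))  # False: within budget
print(error_rate_alert([3, 4, 40], max_increase=10))   # True: jump of 36 errors
----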
For more information about shadow link tasks and their states, see xref:manage:disaster-recovery/shadowing/setup.adoc#shadow-link-tasks[Shadow link tasks].