Commit 2a5e33a

micheleRP and paulohtb6 authored
DOC-1667 Document Shadow Link in Cloud (#1491)
Co-authored-by: Paulo Borges <[email protected]>
1 parent 328b1f2 commit 2a5e33a

10 files changed: +370 -278 lines changed
modules/ROOT/nav.adoc

Lines changed: 1 addition & 1 deletion
@@ -200,7 +200,7 @@
200200
**** xref:manage:disaster-recovery/shadowing/overview.adoc[Overview]
201201
**** xref:manage:disaster-recovery/shadowing/setup.adoc[Configure Shadowing]
202202
**** xref:manage:disaster-recovery/shadowing/monitor.adoc[Monitor Shadowing]
203-
**** xref:manage:disaster-recovery/shadowing/failover.adoc[Configure Failover]
203+
**** xref:manage:disaster-recovery/shadowing/failover.adoc[Failover]
204204
**** xref:manage:disaster-recovery/shadowing/failover-runbook.adoc[Failover Runbook]
205205
*** xref:manage:disaster-recovery/whole-cluster-restore.adoc[Whole Cluster Restore]
206206
*** xref:manage:disaster-recovery/topic-recovery.adoc[Topic Recovery]

modules/manage/pages/disaster-recovery/shadowing/failover-runbook.adoc

Lines changed: 16 additions & 7 deletions
@@ -3,21 +3,28 @@
33
:page-aliases: deploy:redpanda/manual/resilience/shadowing-guide.adoc, deploy:redpanda/manual/disaster-recovery/shadowing/failover-runbook.adoc
44
:env-linux: true
55
:page-categories: Management, High Availability, Disaster Recovery, Emergency Response
6+
// tag::single-source[]
67

8+
9+
ifndef::env-cloud[]
710
[NOTE]
811
====
912
include::shared:partial$enterprise-license.adoc[]
1013
====
14+
endif::[]
1115

1216
This guide provides step-by-step procedures for emergency failover when your primary Redpanda cluster becomes unavailable. Follow these procedures only during active disasters when immediate failover is required.
13-
1417
// TODO: All command output examples in this guide need verification by running actual commands in test environment
1518

1619
[IMPORTANT]
1720
====
1821
This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:./failover.adoc[]. Ensure you have completed the xref:manage:disaster-recovery/shadowing/overview.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.
1922
====
2023

24+
ifdef::env-cloud[]
25+
NOTE: Shadowing is supported on BYOC and Dedicated clusters running Redpanda version 25.3 and later.
26+
endif::[]
27+
2128
== Emergency failover procedure
2229

2330
Follow these steps during an active disaster:
@@ -86,7 +93,7 @@ Verify that the following conditions exist before proceeding with failover:
8693
* Topics should be in `ACTIVE` state (not `FAULTED`).
8794
* Replication lag should be reasonable for your RPO requirements.
8895

89-
**Understanding replication lag:**
96+
==== Understanding replication lag
9097

9198
Use xref:reference:rpk/rpk-shadow/rpk-shadow-status.adoc[`rpk shadow status`] to check lag, which shows the message count difference between source and shadow partitions:
9299

@@ -128,11 +135,11 @@ Name: <topic-name>, State: ACTIVE
128135
1 2345 2579 2568 11
129136
----
130137

131-
The partition information shows:
138+
The partition information shows the following:
132139

133-
* **SRC_LSO**: Source partition Last Stable Offset
134-
* **SRC_HWM**: Source partition High Watermark
135-
* **DST_HWM**: Shadow (destination) partition High Watermark
140+
* **SRC_LSO**: Source partition last stable offset
141+
* **SRC_HWM**: Source partition high watermark
142+
* **DST_HWM**: Shadow (destination) partition high watermark
136143
* **Lag**: Message count difference between source and shadow partitions
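
The sample row above is consistent with lag being the source high watermark minus the shadow high watermark (2579 - 2568 = 11). As a minimal sketch, assuming partition rows keep the five-column layout shown in the sample (partition, SRC_LSO, SRC_HWM, DST_HWM, lag), the following recomputes lag from saved `rpk shadow status` output. The `partition_lag` function name and the column order are assumptions for illustration and are not part of `rpk`.

[,bash]
----
#!/usr/bin/env bash
# Hypothetical helper: recompute per-partition lag from status output on stdin.
# Assumes each partition row has exactly five columns:
#   partition  SRC_LSO  SRC_HWM  DST_HWM  lag
# Rows whose first, third, or fourth column is not numeric (headers) are skipped.
partition_lag() {
  awk 'NF == 5 && $1 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ {
    printf "partition=%s lag=%d\n", $1, $3 - $4
  }'
}

# Example with the sample row from this guide:
printf '1 2345 2579 2568 11\n' | partition_lag
----

If the computed value differs from the reported `Lag` column, treat the column layout assumption as suspect before drawing conclusions about replication.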
137144

138145
[IMPORTANT]
@@ -290,4 +297,6 @@ After successful failover, focus on recovery planning and process improvement. B
290297
1. **Document the incident**: Record timeline, impact, and lessons learned
291298
2. **Update runbooks**: Improve procedures based on what you learned
292299
3. **Test regularly**: Schedule regular disaster recovery drills
293-
4. **Review monitoring**: Ensure monitoring caught the issue appropriately
300+
4. **Review monitoring**: Ensure monitoring caught the issue appropriately
301+
302+
// end::single-source[]

modules/manage/pages/disaster-recovery/shadowing/failover.adoc

Lines changed: 29 additions & 6 deletions
@@ -1,17 +1,32 @@
1-
= Configure Failover
1+
= Failover
22
:description: Learn how failover can transform shadow topics into fully writable resources during disasters.
33
:page-categories: Management, High Availability, Disaster Recovery
44
:page-aliases: deploy:redpanda/manual/disaster-recovery/shadowing/failover.adoc
5+
// tag::single-source[]
56

7+
ifndef::env-cloud[]
68
[NOTE]
79
====
810
include::shared:partial$enterprise-license.adoc[]
911
====
12+
endif::[]
1013

1114
Failover is the process of modifying shadow topics or an entire shadow cluster from read-only replicas to fully writable resources, and ceasing replication from the source cluster. You can fail over individual topics for selective workload migration or fail over the entire cluster for comprehensive disaster recovery. This critical operation transforms your shadow resources into operational production assets, allowing you to redirect application traffic when the source cluster becomes unavailable.
1215

16+
ifdef::env-cloud[]
17+
You can fail over a shadow link using the Redpanda Cloud UI, `rpk`, or the Data Plane API.
18+
endif::[]
19+
20+
ifndef::env-cloud[]
21+
You can fail over a shadow link using Redpanda Console, `rpk`, or the Admin API.
22+
endif::[]
23+
1324
include::shared:partial$emergency-shadowing-callout.adoc[]
1425

26+
ifdef::env-cloud[]
27+
NOTE: Shadowing is supported on BYOC and Dedicated clusters running Redpanda version 25.3 and later.
28+
endif::[]
29+
1530
== Failover behavior
1631

1732
When you initiate failover, Redpanda performs the following operations:
@@ -22,13 +37,15 @@ When you initiate failover, Redpanda performs the following operations:
2237

2338
Topic failover is irreversible. Once failed over, topics cannot return to shadow mode, and automatic fallback to the original source cluster is not supported.
2439

40+
NOTE: To avoid a split-brain scenario after failover, ensure that all clients are reconfigured to point to the shadow cluster before resuming write activity.
41+
2542
== Failover commands
2643

2744
You can perform failover at different levels of granularity to match your disaster recovery needs:
2845

2946
=== Individual topic failover
3047

31-
To fail over a specific shadow topic while leaving other topics in the shadow link still replicating:
48+
To fail over a specific shadow topic while leaving other topics in the shadow link still replicating, run:
3249

3350
[,bash]
3451
----
@@ -39,7 +56,7 @@ Use this approach when you need to selectively failover specific workloads or wh
3956

4057
=== Complete shadow link failover (cluster failover)
4158

42-
To fail over all shadow topics associated with the shadow link simultaneously:
59+
To fail over all shadow topics associated with the shadow link simultaneously, run:
4360

4461
[,bash]
4562
----
@@ -67,6 +84,7 @@ Force deleting a shadow link is irreversible and immediately fails over all topi
6784
The shadow link itself has a simple state model:
6885

6986
* **`ACTIVE`**: Shadow link is operating normally, replicating data
87+
* **`PAUSED`**: Shadow link replication is temporarily halted by user action
7088

7189
Shadow links do not have dedicated failover states. Instead, the link's operational status is determined by the collective state of its shadow topics.
7290

@@ -78,10 +96,11 @@ Individual shadow topics progress through specific states during failover:
7896
* **`FAULTED`**: Shadow topic has encountered an error and is not replicating
7997
* **`FAILING_OVER`**: Failover initiated, replication stopping
8098
* **`FAILED_OVER`**: Failover completed successfully, topic fully writable
99+
* **`PAUSED`**: Replication temporarily halted by user action
81100

82101
== Monitor failover progress
83102

84-
Monitor failover progress using the status command:
103+
To monitor failover progress, run the status command:
85104

86105
[,bash]
87106
----
@@ -90,7 +109,7 @@ rpk shadow status <shadow-link-name>
90109

91110
The output shows individual topic states and any issues encountered during the failover process. For detailed command options, see xref:reference:rpk/rpk-shadow/rpk-shadow-status.adoc[`rpk shadow status`].
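
For scripted failovers, the status check can be wrapped in a polling loop. The following is a rough sketch, assuming that topic lines in the status output contain the text `State: <STATE>` as in the `Name: <topic-name>, State: ACTIVE` sample from the runbook. The function names `failover_done` and `wait_for_failover` and the 5-second interval are hypothetical.

[,bash]
----
#!/usr/bin/env bash
# Hypothetical helper: succeeds when no topic on stdin reports FAILING_OVER.
# Assumes topic lines contain "State: <STATE>".
failover_done() {
  ! grep -q 'State: FAILING_OVER'
}

# Hypothetical polling loop: re-check status until no topic is FAILING_OVER.
wait_for_failover() {
  local link="$1" status
  while true; do
    # Stop with an error if rpk itself fails, rather than reporting success.
    status="$(rpk shadow status "$link")" || return 1
    if printf '%s\n' "$status" | failover_done; then
      break
    fi
    sleep 5
  done
  echo "No topics in FAILING_OVER state for $link"
}
----

This loop only detects that no topic is still transitioning. It does not distinguish `FAILED_OVER` from `FAULTED`, so review the final status output before redirecting traffic.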
92111

93-
**Task states during monitoring:**
112+
Task states during monitoring:
94113

95114
* **`ACTIVE`**: Task is operating normally and replicating data
96115
* **`FAULTED`**: Task encountered an error and requires attention
@@ -125,6 +144,8 @@ After successful failover, your shadow cluster exhibits the following characteri
125144

126145
== Failover considerations and limitations
127146

147+
Before implementing failover procedures, understand these key considerations that affect your disaster recovery strategy and operational planning.
148+
128149
**Data consistency:**
129150

130151
* Some data loss may occur due to replication lag at the time of failover.
@@ -151,4 +172,6 @@ After completing failover:
151172
* Verify that applications can produce and consume messages normally
152173
* Consider deleting the shadow link if failover was successful and permanent
153174

154-
For emergency situations, see xref:./failover-runbook.adoc[Failover Runbook].
175+
For emergency situations, see xref:./failover-runbook.adoc[Failover Runbook].
176+
177+
// end::single-source[]

modules/manage/pages/disaster-recovery/shadowing/monitor.adoc

Lines changed: 16 additions & 22 deletions
@@ -2,63 +2,55 @@
22
:description: Monitor Shadowing health with status commands, metrics, and best practices for tracking replication performance.
33
:page-categories: Management, Monitoring, Disaster Recovery
44
:page-aliases: deploy:redpanda/manual/disaster-recovery/shadowing/monitor.adoc
5+
// tag::single-source[]
56

7+
ifndef::env-cloud[]
68
[NOTE]
79
====
810
include::shared:partial$enterprise-license.adoc[]
911
====
12+
endif::[]
1013

1114
Monitor your shadow links to ensure proper replication performance and understand your disaster recovery readiness. Use `rpk` commands, metrics, and status information to track shadow link health and troubleshoot issues.
1215

1316
include::shared:partial$emergency-shadowing-callout.adoc[]
1417

1518
== Status commands
1619

17-
List existing shadow links:
20+
To list existing shadow links, run:
1821

1922
[,bash]
2023
----
2124
rpk shadow list
2225
----
2326

24-
View shadow link configuration details:
27+
To view shadow link configuration details, run:
2528

2629
[,bash]
2730
----
2831
rpk shadow describe <my-disaster-recovery-link>
2932
----
3033

31-
For detailed command options, see xref:reference:rpk/rpk-shadow/rpk-shadow-list.adoc[`rpk shadow list`] and xref:reference:rpk/rpk-shadow/rpk-shadow-describe.adoc[`rpk shadow describe`].
34+
For detailed command options, see xref:reference:rpk/rpk-shadow/rpk-shadow-list.adoc[`rpk shadow list`] and xref:reference:rpk/rpk-shadow/rpk-shadow-describe.adoc[`rpk shadow describe`]. The `rpk shadow describe` command shows the complete configuration of the shadow link, including connection settings, filters, and synchronization options.
3235

33-
This command shows the complete configuration of the shadow link, including connection settings, filters, and synchronization options.
34-
35-
Check your shadow link status to ensure proper operation:
36+
To check your shadow link status and ensure proper operation, run:
3637

3738
[,bash]
3839
----
3940
rpk shadow status <shadow-link-name>
4041
----
4142

42-
**Status command options:**
43+
For troubleshooting specific issues, you can use command options to show individual status sections. See xref:reference:rpk/rpk-shadow/rpk-shadow-status.adoc[`rpk shadow status`] for available status options. The status output includes the following:
4344

44-
[,bash]
45-
----
46-
rpk shadow status <shadow-link-name>
47-
----
48-
49-
For troubleshooting specific issues, you can use command options to show individual status sections. See xref:reference:rpk/rpk-shadow/rpk-shadow-status.adoc[`rpk shadow status`] for available status options.
50-
51-
The status output includes:
52-
53-
* **Shadow link state**: Overall operational state (`ACTIVE`)
54-
* **Individual topic states**: Current state of each replicated topic (`ACTIVE`, `FAULTED`, `FAILING_OVER`, `FAILED_OVER`)
45+
* **Shadow link state**: Overall operational state (`ACTIVE`, `PAUSED`).
46+
* **Individual topic states**: Current state of each replicated topic (`ACTIVE`, `FAULTED`, `FAILING_OVER`, `FAILED_OVER`, `PAUSED`).
5547
* **Task status**: Health of replication tasks across brokers (`ACTIVE`, `FAULTED`, `NOT_RUNNING`, `LINK_UNAVAILABLE`). For details about shadow link tasks, see xref:manage:disaster-recovery/shadowing/setup.adoc#shadow-link-tasks[Shadow link tasks].
56-
* **Lag information**: Replication lag per partition showing source vs shadow high watermarks (HWM)
48+
* **Lag information**: Replication lag per partition showing source vs shadow high watermarks (HWM).
5749

5850
[[shadow-link-metrics]]
5951
== Metrics
6052

61-
Shadowing provides comprehensive metrics to track replication performance and health:
53+
Shadowing exposes metrics for tracking replication performance and health through the xref:reference:public-metrics-reference.adoc[`public_metrics`] endpoint.
6254

6355
[cols="1,1,2"]
6456
|===
@@ -110,9 +102,9 @@ rpk shadow list | grep -v "ACTIVE" || echo "All shadow links healthy"
110102
rpk shadow status <shadow-link-name> | grep -E "LAG|Lag"
111103
----
112104

113-
=== Alert thresholds
105+
=== Alert conditions
114106

115-
Configure monitoring alerts for:
107+
Configure monitoring alerts for the following conditions, which indicate problems with Shadowing:
116108

117109
* **High replication lag**: When `redpanda_shadow_link_shadow_lag` exceeds your RPO requirements
118110
* **Connection errors**: When `redpanda_shadow_link_client_errors` increases rapidly
@@ -122,3 +114,5 @@ Configure monitoring alerts for:
122114
* **Link unavailability**: When tasks show `LINK_UNAVAILABLE` indicating source cluster connectivity issues
123115
+
124116
For more information about shadow link tasks and their states, see xref:manage:disaster-recovery/shadowing/setup.adoc#shadow-link-tasks[Shadow link tasks].
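
As an illustration of the high replication lag condition, the following sketch filters Prometheus text-format metrics for `redpanda_shadow_link_shadow_lag` samples above a threshold. It assumes the sample value is the last field on each line (no trailing timestamp). The `lag_over_threshold` function name, the threshold value, and the host and port in the usage comment are placeholders, and in practice this check usually belongs in your monitoring system's alert rules rather than a script.

[,bash]
----
#!/usr/bin/env bash
# Hypothetical helper: print shadow lag samples whose value exceeds a threshold.
# Reads Prometheus text-format metrics on stdin.
# Assumes the sample value is the last whitespace-separated field on the line.
lag_over_threshold() {
  local threshold="$1"
  awk -v max="$threshold" '
    /^redpanda_shadow_link_shadow_lag[{ ]/ && $NF + 0 > max + 0 { print }
  '
}

# Example usage (host, port, and threshold are placeholders):
# curl -s http://<broker-host>:9644/public_metrics | lag_over_threshold 1000
----

Choose the threshold from your RPO: the metric reports a message count, so translate your acceptable data loss window into messages using your expected produce rate.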
117+
118+
// end::single-source[]
