2 changes: 1 addition & 1 deletion modules/ROOT/nav.adoc
@@ -200,7 +200,7 @@
**** xref:manage:disaster-recovery/shadowing/overview.adoc[Overview]
**** xref:manage:disaster-recovery/shadowing/setup.adoc[Configure Shadowing]
**** xref:manage:disaster-recovery/shadowing/monitor.adoc[Monitor Shadowing]
**** xref:manage:disaster-recovery/shadowing/failover.adoc[Configure Failover]
**** xref:manage:disaster-recovery/shadowing/failover.adoc[Failover]
**** xref:manage:disaster-recovery/shadowing/failover-runbook.adoc[Failover Runbook]
*** xref:manage:disaster-recovery/whole-cluster-restore.adoc[Whole Cluster Restore]
*** xref:manage:disaster-recovery/topic-recovery.adoc[Topic Recovery]
@@ -3,21 +3,28 @@
:page-aliases: deploy:redpanda/manual/resilience/shadowing-guide.adoc, deploy:redpanda/manual/disaster-recovery/shadowing/failover-runbook.adoc
:env-linux: true
:page-categories: Management, High Availability, Disaster Recovery, Emergency Response
// tag::single-source[]


ifndef::env-cloud[]
[NOTE]
====
include::shared:partial$enterprise-license.adoc[]
====
endif::[]

This guide provides step-by-step procedures for emergency failover when your primary Redpanda cluster becomes unavailable. Follow these procedures only during active disasters when immediate failover is required.

// TODO: All command output examples in this guide need verification by running actual commands in test environment

[IMPORTANT]
====
This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:./failover.adoc[]. Ensure you have completed the xref:manage:disaster-recovery/shadowing/overview.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.
====

ifdef::env-cloud[]
NOTE: Shadowing is supported on BYOC and Dedicated clusters running Redpanda version 25.3 and later.
endif::[]

== Emergency failover procedure

Follow these steps during an active disaster:
@@ -86,7 +93,7 @@ Verify that the following conditions exist before proceeding with failover:
* Topics should be in `ACTIVE` state (not `FAULTED`).
* Replication lag should be reasonable for your RPO requirements.

**Understanding replication lag:**
==== Understanding replication lag

Use xref:reference:rpk/rpk-shadow/rpk-shadow-status.adoc[`rpk shadow status`] to check lag, which is reported as the message count difference between source and shadow partitions:

@@ -128,11 +135,11 @@ Name: <topic-name>, State: ACTIVE
1 2345 2579 2568 11
----

The partition information shows:
The partition information shows the following:

* **SRC_LSO**: Source partition Last Stable Offset
* **SRC_HWM**: Source partition High Watermark
* **DST_HWM**: Shadow (destination) partition High Watermark
* **SRC_LSO**: Source partition last stable offset
* **SRC_HWM**: Source partition high watermark
* **DST_HWM**: Shadow (destination) partition high watermark
* **Lag**: Message count difference between source and shadow partitions
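
As a cross-check on the `Lag` column, the following is a minimal sketch that recomputes lag as `SRC_HWM` minus `DST_HWM` and prints any partition that exceeds a message-count budget. It assumes each partition row has five whitespace-separated integers in the order partition, `SRC_LSO`, `SRC_HWM`, `DST_HWM`, `Lag`, as in the sample row in this guide (whose output this guide's own TODO marks as unverified). The function name `lag_over_budget` and the budget value are hypothetical.

```bash
# Hypothetical helper: print partitions whose recomputed lag exceeds a budget.
# Assumed row layout: <partition> <SRC_LSO> <SRC_HWM> <DST_HWM> <Lag>
lag_over_budget() {
  awk -v max="$1" '
    # Only consider rows made of exactly five non-negative integers.
    NF == 5 && $0 ~ /^[ \t]*[0-9]+([ \t]+[0-9]+)+[ \t]*$/ {
      lag = $3 - $4            # SRC_HWM minus DST_HWM
      if (lag > max) printf "partition %s lag %d\n", $1, lag
    }'
}

# Sample row copied from the status output in this guide. In practice, pipe in
# the output of: rpk shadow status <shadow-link-name>
printf '1 2345 2579 2568 11\n' | lag_over_budget 10
# prints: partition 1 lag 11
```

The budget is expressed in messages, not time, so translating it into an RPO depends on your produce rate.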

[IMPORTANT]
@@ -290,4 +297,6 @@ After successful failover, focus on recovery planning and process improvement. B
1. **Document the incident**: Record timeline, impact, and lessons learned
2. **Update runbooks**: Improve procedures based on what you learned
3. **Test regularly**: Schedule regular disaster recovery drills
4. **Review monitoring**: Ensure monitoring caught the issue appropriately
4. **Review monitoring**: Ensure monitoring caught the issue appropriately

// end::single-source[]
35 changes: 29 additions & 6 deletions modules/manage/pages/disaster-recovery/shadowing/failover.adoc
@@ -1,17 +1,32 @@
= Configure Failover
= Failover
:description: Learn how failover can transform shadow topics into fully writable resources during disasters.
:page-categories: Management, High Availability, Disaster Recovery
:page-aliases: deploy:redpanda/manual/disaster-recovery/shadowing/failover.adoc
// tag::single-source[]

ifndef::env-cloud[]
[NOTE]
====
include::shared:partial$enterprise-license.adoc[]
====
endif::[]

Failover converts shadow topics or an entire shadow cluster from read-only replicas into fully writable resources and stops replication from the source cluster. You can fail over individual topics for selective workload migration or fail over the entire cluster for comprehensive disaster recovery. This critical operation turns your shadow resources into operational production assets, allowing you to redirect application traffic when the source cluster becomes unavailable.

ifdef::env-cloud[]
You can fail over a shadow link using the Redpanda Cloud UI, `rpk`, or the Data Plane API.
endif::[]

ifndef::env-cloud[]
You can fail over a shadow link using Redpanda Console, `rpk`, or the Admin API.
endif::[]

include::shared:partial$emergency-shadowing-callout.adoc[]

ifdef::env-cloud[]
NOTE: Shadowing is supported on BYOC and Dedicated clusters running Redpanda version 25.3 and later.
endif::[]

== Failover behavior

When you initiate failover, Redpanda performs the following operations:
@@ -22,13 +37,15 @@ When you initiate failover, Redpanda performs the following operations:

Topic failover is irreversible. Once failed over, topics cannot return to shadow mode, and automatic fallback to the original source cluster is not supported.

NOTE: To avoid a split-brain scenario after failover, ensure that all clients are reconfigured to point to the shadow cluster before resuming write activity.

== Failover commands

You can perform failover at different levels of granularity to match your disaster recovery needs:

=== Individual topic failover

To fail over a specific shadow topic while leaving other topics in the shadow link still replicating:
To fail over a specific shadow topic while leaving other topics in the shadow link still replicating, run:

[,bash]
----
@@ -39,7 +56,7 @@ Use this approach when you need to selectively failover specific workloads or wh

=== Complete shadow link failover (cluster failover)

To fail over all shadow topics associated with the shadow link simultaneously:
To fail over all shadow topics associated with the shadow link simultaneously, run:

[,bash]
----
@@ -67,6 +84,7 @@ Force deleting a shadow link is irreversible and immediately fails over all topi
The shadow link itself has a simple state model:

* **`ACTIVE`**: Shadow link is operating normally, replicating data
* **`PAUSED`**: Shadow link replication is temporarily halted by user action

Shadow links do not have dedicated failover states. Instead, the link's operational status is determined by the collective state of its shadow topics.

@@ -78,10 +96,11 @@ Individual shadow topics progress through specific states during failover:
* **`FAULTED`**: Shadow topic has encountered an error and is not replicating
* **`FAILING_OVER`**: Failover initiated, replication stopping
* **`FAILED_OVER`**: Failover completed successfully, topic fully writable
* **`PAUSED`**: Replication temporarily halted by user action
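
For scripting, the topic states listed here can be mapped to a short operator hint. This is an illustrative sketch only: the function name `topic_action` and the hint wording are hypothetical, and the state names are taken from this page.

```bash
# Hypothetical helper: map a shadow topic state to a short operator hint.
topic_action() {
  case "$1" in
    ACTIVE)       echo "replicating; no action" ;;
    PAUSED)       echo "replication halted by user action" ;;
    FAULTED)      echo "not replicating; investigate the error" ;;
    FAILING_OVER) echo "failover in progress; wait" ;;
    FAILED_OVER)  echo "fully writable; clients can be redirected" ;;
    *)            echo "unknown state: $1" ;;
  esac
}

topic_action FAILING_OVER
# prints: failover in progress; wait
```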

== Monitor failover progress

Monitor failover progress using the status command:
To monitor failover progress, run the status command:

[,bash]
----
@@ -90,7 +109,7 @@ rpk shadow status <shadow-link-name>

The output shows individual topic states and any issues encountered during the failover process. For detailed command options, see xref:reference:rpk/rpk-shadow/rpk-shadow-status.adoc[`rpk shadow status`].
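
To watch for completion from a script, one option is to filter the status output for topics that have not yet reached `FAILED_OVER`. The following is a minimal sketch, assuming each topic appears on a line of the form `Name: <topic-name>, State: <STATE>` (the layout shown in the Failover Runbook sample, which has not been verified against a live cluster). The function name `pending_topics` and the sample topic names are hypothetical.

```bash
# Hypothetical helper: list topics that have not reached FAILED_OVER.
# Assumed line layout: Name: <topic-name>, State: <STATE>
pending_topics() {
  awk -F', State: ' '
    /^Name: / {
      topic = $1; state = $2
      sub(/^Name: /, "", topic)       # keep only the topic name
      sub(/[ \t\r]+$/, "", state)     # drop trailing whitespace
      if (state != "FAILED_OVER") print topic, state
    }'
}

# Sample input with made-up topic names. In practice, pipe in the output of:
#   rpk shadow status <shadow-link-name>
printf '%s\n' \
  'Name: orders, State: FAILED_OVER' \
  'Name: payments, State: FAILING_OVER' | pending_topics
# prints: payments FAILING_OVER
```

An empty result means every listed topic has completed failover, so a polling loop can stop as soon as `pending_topics` prints nothing.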

**Task states during monitoring:**
Task states during monitoring:

* **`ACTIVE`**: Task is operating normally and replicating data
* **`FAULTED`**: Task encountered an error and requires attention
@@ -125,6 +144,8 @@ After successful failover, your shadow cluster exhibits the following characteri

== Failover considerations and limitations

Before implementing failover procedures, understand these key considerations that affect your disaster recovery strategy and operational planning.

**Data consistency:**

* Some data loss may occur due to replication lag at the time of failover.
@@ -151,4 +172,6 @@ After completing failover:
* Verify that applications can produce and consume messages normally
* Consider deleting the shadow link if failover was successful and permanent

For emergency situations, see xref:./failover-runbook.adoc[Failover Runbook].
For emergency situations, see xref:./failover-runbook.adoc[Failover Runbook].

// end::single-source[]
38 changes: 16 additions & 22 deletions modules/manage/pages/disaster-recovery/shadowing/monitor.adoc
@@ -2,63 +2,55 @@
:description: Monitor Shadowing health with status commands, metrics, and best practices for tracking replication performance.
:page-categories: Management, Monitoring, Disaster Recovery
:page-aliases: deploy:redpanda/manual/disaster-recovery/shadowing/monitor.adoc
// tag::single-source[]

ifndef::env-cloud[]
[NOTE]
====
include::shared:partial$enterprise-license.adoc[]
====
endif::[]

Monitor your shadow links to ensure proper replication performance and understand your disaster recovery readiness. Use `rpk` commands, metrics, and status information to track shadow link health and troubleshoot issues.

include::shared:partial$emergency-shadowing-callout.adoc[]

== Status commands

List existing shadow links:
To list existing shadow links, run:

[,bash]
----
rpk shadow list
----

View shadow link configuration details:
To view shadow link configuration details, run:

[,bash]
----
rpk shadow describe <my-disaster-recovery-link>
----

For detailed command options, see xref:reference:rpk/rpk-shadow/rpk-shadow-list.adoc[`rpk shadow list`] and xref:reference:rpk/rpk-shadow/rpk-shadow-describe.adoc[`rpk shadow describe`].
For detailed command options, see xref:reference:rpk/rpk-shadow/rpk-shadow-list.adoc[`rpk shadow list`] and xref:reference:rpk/rpk-shadow/rpk-shadow-describe.adoc[`rpk shadow describe`]. The `rpk shadow describe` command shows the complete configuration of the shadow link, including connection settings, filters, and synchronization options.

This command shows the complete configuration of the shadow link, including connection settings, filters, and synchronization options.

Check your shadow link status to ensure proper operation:
To check your shadow link status and ensure proper operation, run:

[,bash]
----
rpk shadow status <shadow-link-name>
----

**Status command options:**
For troubleshooting specific issues, you can use command options to show individual status sections. See xref:reference:rpk/rpk-shadow/rpk-shadow-status.adoc[`rpk shadow status`] for available status options. The status output includes the following:

[,bash]
----
rpk shadow status <shadow-link-name>
----

For troubleshooting specific issues, you can use command options to show individual status sections. See xref:reference:rpk/rpk-shadow/rpk-shadow-status.adoc[`rpk shadow status`] for available status options.

The status output includes:

* **Shadow link state**: Overall operational state (`ACTIVE`)
* **Individual topic states**: Current state of each replicated topic (`ACTIVE`, `FAULTED`, `FAILING_OVER`, `FAILED_OVER`)
* **Shadow link state**: Overall operational state (`ACTIVE`, `PAUSED`).
* **Individual topic states**: Current state of each replicated topic (`ACTIVE`, `FAULTED`, `FAILING_OVER`, `FAILED_OVER`, `PAUSED`).
* **Task status**: Health of replication tasks across brokers (`ACTIVE`, `FAULTED`, `NOT_RUNNING`, `LINK_UNAVAILABLE`). For details about shadow link tasks, see xref:manage:disaster-recovery/shadowing/setup.adoc#shadow-link-tasks[Shadow link tasks].
* **Lag information**: Replication lag per partition showing source vs shadow high watermarks (HWM)
* **Lag information**: Replication lag per partition showing source vs shadow high watermarks (HWM).

[[shadow-link-metrics]]
== Metrics

Shadowing provides comprehensive metrics to track replication performance and health:
Shadowing exposes metrics for tracking replication performance and health through the xref:reference:public-metrics-reference.adoc[`public_metrics`] endpoint.

[cols="1,1,2"]
|===
@@ -110,9 +102,9 @@ rpk shadow list | grep -v "ACTIVE" || echo "All shadow links healthy"
rpk shadow status <shadow-link-name> | grep -E "LAG|Lag"
----

=== Alert thresholds
=== Alert conditions

Configure monitoring alerts for:
Configure monitoring alerts for the following conditions, which indicate problems with Shadowing:

* **High replication lag**: When `redpanda_shadow_link_shadow_lag` exceeds your RPO requirements
* **Connection errors**: When `redpanda_shadow_link_client_errors` increases rapidly
@@ -122,3 +114,5 @@ Configure monitoring alerts for:
* **Link unavailability**: When tasks show `LINK_UNAVAILABLE` indicating source cluster connectivity issues
+
For more information about shadow link tasks and their states, see xref:manage:disaster-recovery/shadowing/setup.adoc#shadow-link-tasks[Shadow link tasks].
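
The high-lag condition can be prototyped against scraped metrics text before you wire it into an alerting system. The following is a minimal sketch, assuming the metrics are available in Prometheus text exposition format with the sample value as the last field on each line (no trailing timestamp). The function name `lag_alerts`, the label names, and the threshold are hypothetical; only the metric names come from this page.

```bash
# Hypothetical helper: flag redpanda_shadow_link_shadow_lag series above a threshold.
# Assumes Prometheus text format with the value as the last field on each line.
lag_alerts() {
  awk -v max="$1" '
    /^redpanda_shadow_link_shadow_lag[{ ]/ {
      value = $NF + 0                 # force numeric comparison
      if (value > max) print "ALERT", $0
    }'
}

# Sample scrape with made-up labels and values.
printf '%s\n' \
  'redpanda_shadow_link_shadow_lag{topic="orders",partition="0"} 4' \
  'redpanda_shadow_link_shadow_lag{topic="orders",partition="1"} 1200' \
  'redpanda_shadow_link_client_errors{link="dr"} 3' | lag_alerts 1000
# prints: ALERT redpanda_shadow_link_shadow_lag{topic="orders",partition="1"} 1200
```

Choose the threshold from your RPO requirements. The value here is arbitrary.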

// end::single-source[]