
Commit ed4c77b

Docs for PCR on Cloud cluster v25.2
1 parent 1181a54 commit ed4c77b

File tree

4 files changed (+273 -1 lines)

src/current/_includes/v25.1/sidebar-data/cloud-deployments.json

Lines changed: 6 additions & 0 deletions
@@ -574,6 +574,12 @@
  }
  ]
  },
+ {
+   "title": "Physical Cluster Replication",
+   "urls": [
+     "/cockroachcloud/physical-cluster-replication.html"
+   ]
+ },
  {
    "title": "Billing Management",
    "urls": [
Lines changed: 4 additions & 1 deletion
@@ -1,3 +1,6 @@
- - Physical cluster replication is supported in CockroachDB {{ site.data.products.core }} clusters on v23.2 or later. The primary cluster can be a [new]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#step-1-create-the-primary-cluster) or [existing]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#set-up-pcr-from-an-existing-cluster) cluster. The standby cluster must be a [new cluster started with the `--virtualized-empty` flag]({% link {{ page.version.version }}/set-up-physical-cluster-replication.md %}#step-2-create-the-standby-cluster).
+ - Physical cluster replication is supported in:
+     - CockroachDB self-hosted in new clusters on v23.2 or above. Physical Cluster Replication cannot be enabled on clusters that have been upgraded from a previous version of CockroachDB.
+     - CockroachDB {{ site.data.products.advanced }} in new clusters on v25.1 with the `"support_physical_cluster_replication"` field enabled.
+ - The primary and standby clusters must have the same [zone configurations]({% link {{ page.version.version }}/configure-replication-zones.md %}) in CockroachDB self-hosted.
  - The primary and standby clusters must have the same [zone configurations]({% link {{ page.version.version }}/configure-replication-zones.md %}).
  - Before failover to the standby, the standby cluster does not support running [backups]({% link {{ page.version.version }}/backup-and-restore-overview.md %}) or [changefeeds]({% link {{ page.version.version }}/change-data-capture-overview.md %}).
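
The `"support_physical_cluster_replication"` field mentioned above is set when the CockroachDB Advanced cluster is created through the Cloud API. The following is a minimal sketch of such a create request; the placement of the field inside the `dedicated` spec (and the surrounding spec fields) is an assumption, so verify it against the [CockroachDB Cloud API reference](https://www.cockroachlabs.com/docs/api/cloud/v1) before use:

~~~ shell
# Hypothetical sketch: create an Advanced (dedicated) cluster with PCR support
# enabled at creation time. The placement of "support_physical_cluster_replication"
# is an assumption; check the Cloud API reference.
curl --request POST \
  --url 'https://cockroachlabs.cloud/api/v1/clusters' \
  --header "Authorization: Bearer api_secret_key" \
  --json '{
    "name": "pcr-primary",
    "provider": "AWS",
    "spec": {
      "dedicated": {
        "region_nodes": {"us-east-1": 3},
        "hardware": {"machine_spec": {"num_virtual_cpus": 4}},
        "support_physical_cluster_replication": true
      }
    }
  }'
~~~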

src/current/_includes/v25.2/sidebar-data/cloud-deployments.json

Lines changed: 6 additions & 0 deletions
@@ -569,6 +569,12 @@
  }
  ]
  },
+ {
+   "title": "Physical Cluster Replication",
+   "urls": [
+     "/cockroachcloud/physical-cluster-replication.html"
+   ]
+ },
  {
    "title": "Billing Management",
    "urls": [
Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@
---
title: Physical Cluster Replication
summary: Set up physical cluster replication (PCR) in a Cloud deployment.
toc: true
---

{{site.data.alerts.callout_info}}
{% include feature-phases/preview.md %}
{{site.data.alerts.end}}

CockroachDB **physical cluster replication (PCR)** continuously sends all data at the byte level from a _primary_ cluster to an independent _standby_ cluster. Existing data and ongoing changes on the active primary cluster, which is serving application data, replicate asynchronously to the passive standby cluster.

In a disaster recovery scenario, you can [fail over](#step-3-fail-over-to-the-standby-cluster) from the unavailable primary cluster to the standby cluster. This will stop the PCR stream, reset the standby cluster to a point in time where all ingested data is consistent, and mark the standby as ready to accept application traffic.

## Set up PCR on CockroachDB {{ site.data.products.advanced }}

In this guide, you'll use the [{{ site.data.products.cloud }} API]({% link cockroachcloud/cloud-api.md %}) to set up PCR from a primary cluster to a standby cluster, monitor the PCR stream, and fail over from the primary to the standby cluster.

{{site.data.alerts.callout_info}}
PCR is supported on CockroachDB {{ site.data.products.advanced }} and CockroachDB self-hosted clusters. For a guide to setting up PCR on CockroachDB self-hosted, refer to the [Set Up Physical Cluster Replication]({% link {{ site.current_cloud_version }}/set-up-physical-cluster-replication.md %}) tutorial.
{{site.data.alerts.end}}

### Before you begin

You'll need the following:

- **Two CockroachDB {{ site.data.products.advanced }} clusters.** To set up PCR successfully, configure your clusters as follows:
    - Clusters must be in the same cloud (AWS, GCP, or Azure).
    - Clusters must be single [region]({% link cockroachcloud/regions.md %}) (multiple availability zones per cluster are supported).
    - In AWS and Azure, the primary and standby clusters must be in different regions.
    - In GCP, the primary and standby clusters can be in the same region, but they must not have overlapping CIDR ranges.
    - Clusters can have different [node topology]({% link cockroachcloud/plan-your-cluster-advanced.md %}#cluster-topology) and [hardware configurations]({% link cockroachcloud/plan-your-cluster-advanced.md %}#cluster-sizing-and-scaling). To avoid performance constraints during failover and while redirecting application traffic to the standby, we recommend configuring the primary and standby clusters with similar hardware.

    {{site.data.alerts.callout_success}}
    We recommend [enabling Prometheus metrics export]({% link cockroachcloud/export-metrics.md %}) on your cluster before starting a PCR stream. For details on metrics to track, refer to [Monitor the PCR stream](#step-2-monitor-the-pcr-stream).
    {{site.data.alerts.end}}
- **[Cloud API Access]({% link cockroachcloud/managing-access.md %}#api-access).**

    To set up and manage PCR on CockroachDB {{ site.data.products.advanced }} clusters, you'll use the `https://cockroachlabs.cloud/api/v1/replication-streams` endpoint. Access to the `replication-streams` endpoint requires a valid CockroachDB {{ site.data.products.cloud }} [service account]({% link cockroachcloud/managing-access.md %}#manage-service-accounts) with the correct permissions.

    The following describes the required roles for the `replication-streams` endpoint methods. These can be assigned at the [organization]({% link cockroachcloud/authorization.md %}#organization-user-roles), [folder]({% link cockroachcloud/folders.md %}), or cluster scope:

    Method | Required roles | Description
    -------+----------------+------------
    `POST` | [Cluster Administrator]({% link cockroachcloud/authorization.md %}#cluster-administrator) | Create a PCR stream. Required on the primary and standby clusters.
    `GET` | [Cluster Administrator]({% link cockroachcloud/authorization.md %}#cluster-administrator), [Cluster Operator]({% link cockroachcloud/authorization.md %}#cluster-operator), [Cluster Developer]({% link cockroachcloud/authorization.md %}#cluster-developer) | Retrieve information for the PCR stream. Required on either the primary or standby cluster.
    `PATCH` | [Cluster Administrator]({% link cockroachcloud/authorization.md %}#cluster-administrator) | Update the PCR stream to fail over. Required on either the primary or standby cluster.

    {{site.data.alerts.callout_success}}
    We recommend creating service accounts with the [principle of least privilege](https://wikipedia.org/wiki/Principle_of_least_privilege), and giving each application that accesses the API its own service account and API key. This allows fine-grained access to the cluster and PCR streams.
    {{site.data.alerts.end}}
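
Before moving on, you can check that the service account's API key can reach the `replication-streams` endpoint. A minimal sketch, assuming the key is exported in an `API_KEY` environment variable (the variable name is illustrative):

~~~ shell
# Export the service account's API secret key (illustrative variable name).
export API_KEY="api_secret_key"

# List existing PCR streams; a 200 response (an empty list is fine) indicates
# that the service account has a role that can read replication streams.
curl --request GET \
  --url 'https://cockroachlabs.cloud/api/v1/replication-streams' \
  --header "Authorization: Bearer ${API_KEY}"
~~~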

### Step 1. Start the PCR stream

{{site.data.alerts.callout_info}}
We recommend using an empty standby cluster when starting PCR. When you initiate the PCR stream, CockroachDB {{ site.data.products.cloud }} will take a full cluster backup of the standby cluster, delete all data from the standby, and then start the PCR stream, which ensures that the standby will be fully consistent with the primary during PCR.
{{site.data.alerts.end}}

With the primary and standby clusters set up, you can now start a PCR stream.

1. Send a `POST` request to the `/v1/replication-streams` endpoint to start the PCR stream:

{% include_cached copy-clipboard.html %}
~~~ shell
curl --request POST --url 'https://cockroachlabs.cloud/api/v1/replication-streams' --header "Authorization: Bearer api_secret_key" --json '{"source_cluster_id": "primary_cluster_id","target_cluster_id": "standby_cluster_id"}'
~~~

Replace:

- `api_secret_key` with your API secret key.
- `primary_cluster_id` with the cluster ID returned after creating the primary cluster.
- `standby_cluster_id` with the cluster ID returned after creating the standby cluster.

You can find the cluster IDs in the cluster creation output, or in the URL of the single cluster overview page: `https://cockroachlabs.cloud/cluster/{your_cluster_id}/overview`. The ID will resemble `ad1e8630-729a-40f3-87e4-9f72eb3347a0`.
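
If you prefer to look up the IDs programmatically, the Cloud API's cluster list endpoint returns them as well. A minimal sketch, assuming `jq` is installed; the response field names (`clusters[].name`, `clusters[].id`) are assumptions to verify against the Cloud API reference:

~~~ shell
# List clusters in the organization and print name/ID pairs.
curl --silent --request GET \
  --url 'https://cockroachlabs.cloud/api/v1/clusters' \
  --header "Authorization: Bearer api_secret_key" \
  | jq -r '.clusters[] | "\(.name)\t\(.id)"'
~~~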

{{site.data.alerts.callout_info}}
Once you have started PCR, the standby cluster cannot accept reads or writes; therefore, the [Cloud Console]({% link cockroachcloud/cluster-overview-page.md %}) and SQL shell will be unavailable prior to failover.
{{site.data.alerts.end}}

You will receive the response:

~~~ json
{
  "id": "c3d35c84-a4ea-41b3-8452-553c5ded3b85",
  "status": "STARTING",
  "source_cluster_id": "3fabc29e-5ced-48d9-b31e-000000000000",
  "target_cluster_id": "f9b1d580-9be3-47f8-ac28-000000000000",
  "created_at": "2025-05-01T18:57:50.038137Z"
}
~~~

- `"id"`: The PCR stream's job ID.
- `"status"`: The status of the PCR stream. For descriptions, refer to [Status](#status).
- `"source_cluster_id"`, `"target_cluster_id"`: The cluster IDs of the primary and standby clusters.
- `"created_at"`: The time at which the PCR stream was created.

To start PCR between clusters, CockroachDB {{ site.data.products.cloud }} sets up VPC peering between the clusters and validates the connectivity. As a result, it may take around 5 minutes to initialize the PCR job, during which the status will be `STARTING`.
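
If you are scripting the setup, you can poll the stream until it leaves the `STARTING` status, using the same `GET` endpoint described in [Step 2](#step-2-monitor-the-pcr-stream). A minimal sketch, assuming `jq` is installed and the same placeholder values as above:

~~~ shell
# Poll the PCR stream every 30 seconds until it is no longer STARTING.
while true; do
  status=$(curl --silent --request GET \
    "https://cockroachlabs.cloud/api/v1/replication-streams/job_id" \
    --header "Authorization: Bearer api_secret_key" | jq -r '.status')
  echo "PCR stream status: ${status}"
  [ "${status}" != "STARTING" ] && break
  sleep 30
done
~~~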

### Step 2. Monitor the PCR stream

To monitor the current status of the PCR stream, send a `GET` request to the `/v1/replication-streams` endpoint along with the ID of the PCR stream:

{% include_cached copy-clipboard.html %}
~~~ shell
curl --request GET "https://cockroachlabs.cloud/api/v1/replication-streams/job_id" --header "Authorization: Bearer api_secret_key"
~~~

Replace:

- `api_secret_key` with your API secret key.
- `job_id` with the PCR job's ID. You can find this in the response from when you created the PCR stream.

This will return a response similar to:

~~~json
{
  "id": "c3d35c84-a4ea-41b3-8452-553c5ded3b85",
  "status": "REPLICATING",
  "source_cluster_id": "3fabc29e-5ced-48d9-b31e-000000000000",
  "target_cluster_id": "f9b1d580-9be3-47f8-ac28-000000000000",
  "created_at": "2025-05-01T18:57:50.038137Z",
  "retained_time": "2025-05-01T19:02:36.462825Z",
  "replicated_time": "2025-05-01T19:05:25Z",
  "replication_lag_seconds": 9
}
~~~

- `"id"`: The ID of the PCR stream.
- `"status"`: The status of the PCR stream. For descriptions, refer to [Status](#status).
- `"source_cluster_id"`, `"target_cluster_id"`: The cluster IDs of the primary and standby clusters.
- `"created_at"`: The time at which the PCR stream was created.
- `"retained_time"`: The timestamp indicating the lower bound that the PCR stream can fail over to. The tracked replicated time and the advancing [protected timestamp]({% link {{ site.current_cloud_version }}/architecture/storage-layer.md %}#protected-timestamps) allow PCR to also track [_retained time_](#technical-reference).
- `"replicated_time"`: The latest time at which the standby cluster has consistent data. This field will be present when the PCR stream is in the `REPLICATING` [state](#status).
- `"replication_lag_seconds"`: The [_replication lag_](#technical-reference) in seconds. This field will be present when the PCR stream is in the `REPLICATING` [state](#status).

You can also list PCR streams and query using different parameters. For more details, refer to the [CockroachDB Cloud API Reference](https://www.cockroachlabs.com/docs/api/cloud/v1.html#get-/api/v1/replication-streams).
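
Because `replication_lag_seconds` is returned by the same `GET` call, a lightweight script can flag excessive lag without waiting for Prometheus metrics to be set up. A minimal sketch, assuming `jq` and an arbitrary 60-second threshold:

~~~ shell
# Warn if replication lag exceeds a chosen threshold (60 seconds here).
lag=$(curl --silent --request GET \
  "https://cockroachlabs.cloud/api/v1/replication-streams/job_id" \
  --header "Authorization: Bearer api_secret_key" | jq -r '.replication_lag_seconds // 0')
if [ "${lag}" -gt 60 ]; then
  echo "WARNING: PCR replication lag is ${lag}s"
fi
~~~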

#### Status

Status | Description
-------+------------
`STARTING` | Setting up VPC peering between clusters and validating the connectivity.
`REPLICATING` | Completing an initial scan and then continuing ongoing replication between the primary and standby clusters.
`FAILING_OVER` | Initiating the failover from the primary to the standby cluster.
`COMPLETED` | The failover is complete and the standby cluster is now independent from the primary cluster.

#### Metrics

For continual monitoring of PCR, track the following metrics with [Prometheus]({% link cockroachcloud/export-metrics.md %}):

- `physical_replication.logical_bytes`: The logical bytes (the sum of all keys and values) ingested by all PCR streams.
- `physical_replication.sst_bytes`: The SST bytes (compressed) sent to the [KV layer]({% link {{ site.current_cloud_version }}/architecture/storage-layer.md %}) by all PCR streams.
- `physical_replication.replicated_time_seconds`: The replicated time of the PCR stream in seconds since the Unix epoch.

### Step 3. Fail over to the standby cluster

Failing over from the primary cluster to the standby cluster will stop the PCR stream, reset the standby cluster to a point in time where all ingested data is consistent, and mark the standby as ready to accept application traffic. You can schedule the failover to:

- The latest consistent time.
- A time in the past within the [`retained_time`](#technical-reference).
- A time up to 1 hour in the future.

#### Fail over to the latest consistent time

To fail over to the latest consistent time, you only need to include `"status": "FAILING_OVER"` in your request with the PCR stream ID:

{% include_cached copy-clipboard.html %}
~~~ shell
curl --request PATCH --url "https://cockroachlabs.cloud/api/v1/replication-streams/7487d7a6-868b-4c6f-aa60-cc306cc525fe" --header "Authorization: Bearer api_secret_key" --json '{"status": "FAILING_OVER"}'
~~~

~~~json
{
  "id": "c3d35c84-a4ea-41b3-8452-553c5ded3b85",
  "status": "FAILING_OVER",
  "source_cluster_id": "3fabc29e-5ced-48d9-b31e-8cc02e8da594",
  "target_cluster_id": "f9b1d580-9be3-47f8-ac28-ed2f943ad5e9",
  "created_at": "2025-05-01T18:57:50.038137Z"
}
~~~

#### Fail over to a specific time

To specify a timestamp, send a `PATCH` request to the `/v1/replication-streams` endpoint along with the ID of the primary cluster, the standby cluster, or the PCR stream. Include the `failover_at` field with your required timestamp:

{% include_cached copy-clipboard.html %}
~~~ shell
curl --request PATCH "https://cockroachlabs.cloud/api/v1/replication-streams/job_id" --header "Authorization: Bearer api_secret_key" --json '{"status": "FAILING_OVER", "failover_at": "2025-05-01T19:39:39.731939Z"}'
~~~

~~~json
{
  "id": "30cb4c91-9a46-4b62-865f-d0a035278ef8",
  "status": "FAILING_OVER",
  "source_cluster_id": "3fabc29e-5ced-48d9-b31e-8cc02e8da594",
  "target_cluster_id": "f9b1d580-9be3-47f8-ac28-ed2f943ad5e9",
  "created_at": "2025-05-01T19:39:31.306821Z",
  "failover_at": "2025-05-01T19:39:39.731939Z"
}
~~~

- `failover_at`: The requested timestamp for failover. If you used `"status": "FAILING_OVER"` to initiate the failover and omitted `failover_at`, the failover time will default to the latest consistent replicated time.

After the failover is complete, both clusters can receive traffic and operate as separate clusters. You must redirect application traffic manually.

Run a `GET` request to check when the failover is complete:

{% include_cached copy-clipboard.html %}
~~~ shell
curl --request GET "https://cockroachlabs.cloud/api/v1/replication-streams/job_id" --header "Authorization: Bearer api_secret_key"
~~~

~~~json
{
  "id": "c3d35c84-a4ea-41b3-8452-553c5ded3b85",
  "status": "COMPLETED",
  "source_cluster_id": "3fabc29e-5ced-48d9-b31e-8cc02e8da594",
  "target_cluster_id": "f9b1d580-9be3-47f8-ac28-ed2f943ad5e9",
  "created_at": "2025-05-01T18:57:50.038137Z",
  "activation_at": "2025-05-01T19:28:10Z"
}
~~~

- `activation_at`: The CockroachDB system time at which the failover is finalized, which could be different from the time that failover was requested. This field is returned when the PCR stream is in the [`COMPLETED` status](#status).

{{site.data.alerts.callout_info}}
PCR replicates at the cluster level, which means that the job also replicates all system tables. Users who need to access the standby cluster after failover should use the user roles from the primary cluster, because the standby cluster is a copy of the primary cluster. PCR overwrites all previous system tables on the standby cluster.
{{site.data.alerts.end}}
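
To redirect application traffic after failover, you need the standby cluster's connection details. A minimal sketch that looks them up through the single-cluster endpoint; the `sql_dns` field name is an assumption, so verify it against the Cloud API reference or copy the connection string from the Cloud Console:

~~~ shell
# Fetch the (former) standby cluster's details and print its SQL DNS name.
# The sql_dns field name is an assumption; verify against the Cloud API reference.
curl --silent --request GET \
  --url "https://cockroachlabs.cloud/api/v1/clusters/standby_cluster_id" \
  --header "Authorization: Bearer api_secret_key" | jq -r '.sql_dns'
~~~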

### Fail back to the primary cluster

To fail back from the standby to the primary cluster, start another PCR stream with the standby cluster as the `source_cluster_id` and the original primary cluster as the `target_cluster_id`.
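
For example, a failback stream reuses the `POST` request from [Step 1](#step-1-start-the-pcr-stream) with the cluster IDs swapped (a sketch using the placeholder IDs from this guide):

~~~ shell
# Start a reverse PCR stream: the former standby is now the source and the
# original primary is the target.
curl --request POST \
  --url 'https://cockroachlabs.cloud/api/v1/replication-streams' \
  --header "Authorization: Bearer api_secret_key" \
  --json '{"source_cluster_id": "standby_cluster_id", "target_cluster_id": "primary_cluster_id"}'
~~~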

## Technical reference

The replication happens at the byte level, which means that the job is unaware of databases, tables, row boundaries, and so on. However, when a [failover](#step-3-fail-over-to-the-standby-cluster) to the standby cluster is initiated, the PCR job ensures that the cluster is in a transactionally consistent state as of a certain point in time. Beyond the application data, the job will also replicate users, privileges, and schema changes.

At startup, the PCR job will set up VPC peering between the primary and standby {{ site.data.products.advanced }} clusters and validate the connectivity.

During the job, [rangefeeds]({% link {{ site.current_cloud_version }}/create-and-configure-changefeeds.md %}#enable-rangefeeds) periodically emit _resolved timestamps_, which are the times at which the ingested data is known to be consistent. Resolved timestamps provide a guarantee that there are no new writes from before that timestamp. This allows the [protected timestamp]({% link {{ site.current_cloud_version }}/architecture/storage-layer.md %}#protected-timestamps) to move forward as the replicated timestamp advances, which permits [garbage collection]({% link {{ site.current_cloud_version }}/architecture/storage-layer.md %}#garbage-collection) to continue as the PCR stream on the standby cluster advances.

{{site.data.alerts.callout_info}}
If the primary cluster does not receive replicated time information from the standby after 24 hours, it cancels the replication job. This ensures that an inactive replication job will not prevent garbage collection.
{{site.data.alerts.end}}

The tracked replicated time and the advancing protected timestamp provide the PCR job with enough information to track _retained time_, which is a timestamp in the past indicating the lower bound that the PCR stream could fail over to. Therefore, the _failover window_ for a PCR stream falls between the retained time and the replicated time.

<img src="{{ 'images/v25.1/failover.svg' | relative_url }}" alt="Timeline showing how the failover window is between the retained time and replicated time." style="border:0px solid #eee;width:100%" />

_Replication lag_ is the time between the most up-to-date replicated time and the actual time. While the PCR stream stays as current as possible with the actual time, this replication lag window is where there is potential for data loss.
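
To see the current failover window and lag for a running stream, you can read the bounds directly from the monitoring response described in [Step 2](#step-2-monitor-the-pcr-stream). A minimal sketch, assuming `jq`:

~~~ shell
# Print the failover window (retained_time to replicated_time) and current lag.
curl --silent --request GET \
  "https://cockroachlabs.cloud/api/v1/replication-streams/job_id" \
  --header "Authorization: Bearer api_secret_key" \
  | jq -r '"failover window: \(.retained_time) -> \(.replicated_time) (lag: \(.replication_lag_seconds)s)"'
~~~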

For the [failover process](#step-3-fail-over-to-the-standby-cluster), the standby cluster waits until it has reached the specified failover time, which can be in the past (up to the retained time), the latest consistent timestamp, or in the future (up to 1 hour). Once that timestamp has been reached, the PCR stream stops and any data in the standby cluster that is **above** the failover time is removed. Depending on how much data the standby needs to revert, this can affect the duration of [RTO (recovery time objective)]({% link {{ site.current_cloud_version }}/disaster-recovery-overview.md %}).

After reverting any necessary data, the standby cluster is promoted as available to serve traffic.

## See also

- [Physical Cluster Replication Overview]({% link {{ site.current_cloud_version }}/physical-cluster-replication-overview.md %})
- [CockroachDB {{ site.data.products.cloud }} API reference](https://www.cockroachlabs.com/docs/api/cloud/v1)
- [Disaster Recovery Overview]({% link {{ site.current_cloud_version }}/disaster-recovery-overview.md %})
