
Commit d5db77e

(DOCSP-46410) Address tech review comments for Disaster Recovery docs (#104)
* Periodic commit
* Periodic commit
* periodic commit
* commit
* (DOCSP-46610) Disaster recovery review comments added
1 parent 9dbd855 commit d5db77e

1 file changed
source/disaster-recovery.txt

Lines changed: 89 additions & 35 deletions
@@ -52,16 +52,21 @@ Use the following proactive configuration recommendations to configure your
 Members of the Same Replica Sets Should Not Share Resources
 ````````````````````````````````````````````````````````````

-MongoDB provides high availability by having multiple copies of data in replica sets. Members of the same replica set should not share the same resources. For example, members of the same replica set should not
-share the same physical hosts and disks. You can ensure that replica
-sets don't share resources by :ref:`distributing data across data centers <arch-center-distribute-data>`.
+MongoDB provides high availability by having multiple copies of data in replica sets.
+Members of the same replica set don't share the same resources. For example, members
+of the same replica set don't share the same physical hosts and disks.
+|service| satisfies this requirement by default: it deploys nodes in
+different availability zones, on different physical hosts and disks.
+
+Ensure that replica sets don't share resources by :ref:`distributing data across data centers <arch-center-distribute-data>`.

 Use an Odd Number of Replica Set Members
 ````````````````````````````````````````

 To elect a :manual:`primary </core/replica-set-members>`, you need a majority of :manual:`voting </core/replica-set-elections>` replica set members available. We recommend that you create replica sets with an
 odd number of voting replica set members. There is no benefit in having
-an even number of voting replica set members.
+an even number of voting replica set members. |service| satisfies this
+requirement by default, as |service| requires having 3, 5, or 7 nodes.

 Fault tolerance is the number of replica set members that can become
 unavailable with enough members still available for a primary election.
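To make the majority and fault-tolerance reasoning above concrete, the following is a minimal sketch of the arithmetic for the replica set sizes discussed in this hunk; it is illustrative only and not code from the documentation.

.. code-block:: python

   # Illustrative arithmetic: election majority and fault tolerance for the
   # replica set sizes discussed above (3, 5, or 7 voting members).
   def election_majority(voting_members: int) -> int:
       return voting_members // 2 + 1

   def fault_tolerance(voting_members: int) -> int:
       # Members that can become unavailable while a primary can still be elected.
       return voting_members - election_majority(voting_members)

   for n in (3, 4, 5, 7):
       print(n, election_majority(n), fault_tolerance(n))

   # 3 voting members -> majority 2, fault tolerance 1
   # 4 voting members -> majority 3, fault tolerance 1 (no benefit over 3)
   # 5 voting members -> majority 3, fault tolerance 2
   # 7 voting members -> majority 4, fault tolerance 3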
@@ -73,12 +78,20 @@ voting, see :manual:`Replica Set Elections </core/replica-set-elections>`.

 .. _arch-center-distribute-data:

-Distribute Data Across At Least Three Data Centers
-``````````````````````````````````````````````````
+Distribute Data Across At Least Three Data Centers in Different Availability Zones
+`````````````````````````````````````````````````````````````````````````````````````

 To guarantee that a replica set can elect a primary if a data center
 becomes unavailable, you must distribute nodes across at least three
-data centers.
+data centers, but we recommend that you use five data centers.
+
+If you choose a region for your data centers that supports
+availability zones, you can distribute nodes in data centers in different
+availability zones. This way, you can have multiple separate physical data
+centers, each in its own availability zone and in the same region.
+
+This section illustrates the need for a deployment with five data centers.
+To begin, consider deployments with two and three data centers.

 Consider the following diagram, which shows data distributed across
 two data centers:
@@ -101,10 +114,23 @@ When you distribute nodes across three data centers, if one data
 center becomes unavailable, you still have two out of three replica set
 members available, which maintains a majority to elect a primary.

-You can distribute data across at least three data centers within the same region by choosing a region with at least three availability zones. Availability zones consist of one or more discrete data centers, each with redundant power, networking and connectivity, housed in separate facilities.
+In addition to ensuring high availability, we recommend that you ensure the continuity of write
+operations. For this reason, we recommend that you deploy five data centers
+to achieve the 2-2-1 topology required for the majority write concern.
+See the following section on :ref:`majority write concern <arch-center-majority-write-concern>` in this topic for a detailed explanation of this
+requirement.
+
+You can distribute data across at least three data centers within the same region by choosing a region with at least three availability zones. Each
+availability zone contains one or more discrete data centers, each with redundant power, networking, and connectivity, often housed in separate
+facilities.

 {+service+} uses availability zones for all cloud providers
-automatically when you deploy a dedicated cluster to a region that supports availability zones. Atlas splits the cluster's nodes across availability zones. For example, for a three-node replica set {+cluster+} deployed to a three-availability-zone region, {+service+} deploys one node in each zone. A local failure in the data center hosting one node doesn't impact the operation of data centers hosting the other nodes.
+automatically when you deploy a dedicated cluster to a region that supports availability zones. |service| splits the cluster's nodes across
+availability zones. For example, for a three-node replica set {+cluster+} deployed to a three-availability-zone region, {+service+} deploys one node
+in each zone. A local failure in the data center hosting one node doesn't
+impact the operation of data centers hosting the other nodes because MongoDB
+performs automatic failover and leader election. Applications
+automatically recover in the event of local failures.

 We recommend that you deploy replica sets to the following regions because they support at least three availability zones:

@@ -127,8 +153,9 @@ Use ``mongos`` Redundancy for Sharded {+Clusters+}
 ```````````````````````````````````````````````````

 When a client connects to a sharded {+cluster+}, we recommend that you include multiple :manual:`mongos </reference/program/mongos/>`
-processes in the connection URI. This allows
-operations to route to different ``mongos`` instances for load
+processes, separated by commas, in the connection URI. To learn more,
+see :manual:`MongoDB Connection String Examples </reference/connection-string-examples/#self-hosted-replica-set-with-members-on-different-machines>`.
+This allows operations to route to different ``mongos`` instances for load
 balancing, but it is also important for disaster recovery.

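A minimal sketch of the comma-separated ``mongos`` list described above, assuming PyMongo; the hostnames and credentials are placeholders rather than values from the documentation.

.. code-block:: python

   from pymongo import MongoClient

   # Placeholder hostnames: one mongos per data center, separated by commas,
   # so the driver can route operations to any reachable mongos.
   uri = (
       "mongodb://app_user:app_password@"
       "mongos-dc1.example.net:27017,"
       "mongos-dc2.example.net:27017,"
       "mongos-dc3.example.net:27017/"
       "?authSource=admin"
   )
   client = MongoClient(uri)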
@@ -141,9 +168,11 @@ processes in the other data centers.

 You can use
 :manual:`retryable reads </core/retryable-reads/>` and
-:manual:`retryable writes</core/retryable-writes/>` to simplify the required error handling for the previous
+:manual:`retryable writes</core/retryable-writes/>` to simplify the required error handling for the ``mongos``
 configuration.

+.. _arch-center-majority-write-concern:
+
 Use ``majority`` Write Concern
 ``````````````````````````````

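The retryable reads and writes mentioned in the hunk above can be made explicit in the driver configuration. A minimal sketch, assuming PyMongo and placeholder hosts; recent drivers already enable both options by default.

.. code-block:: python

   from pymongo import MongoClient

   # Placeholder hosts; retryable reads and writes are on by default in
   # recent drivers, but setting the options documents the intent.
   client = MongoClient(
       "mongodb://mongos-dc1.example.net:27017,mongos-dc2.example.net:27017",
       retryWrites=True,
       retryReads=True,
   )

   # A retried write runs at most once more, so this insert either succeeds
   # against a reachable mongos or surfaces an error the application handles.
   client.inventory.parts.insert_one({"sku": "P-100", "qty": 25})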
@@ -153,16 +182,20 @@ for write operations by using :manual:`write concern
 you had a three-node replica set and had a write concern of
 ``majority``, every write operation would need to be persisted on
 two nodes before an acknowledgment of completion sends to the driver
-that issued said write operation. For protection during a regional
-outage, we recommend that you set the write concern to ``majority``.
+that issued said write operation. For the best protection from a regional
+node outage, we recommend that you set the write concern to ``majority``.
+
+Even though using ``majority`` write concern increases latency compared
+with a write concern of ``1``, we recommend that you use ``majority`` write
+concern because it allows write operations to continue
+even if a replica set loses the primary.

-To understand the importance of ``majorty`` write concern, imagine a
+To understand the importance of ``majority`` write concern, consider a
 five-node replica set spread across three separate regions with a 2-2-1
 topology (two regions with two nodes and one region with one node),
-with a write concern of ``4``. If one of the regions
-with two nodes becomes unavailable due to an outage and only three
-nodes are available, no write operations complete and the
-operation hangs because it is unable to persist data on four nodes.
+with a write concern of ``4``. If one of the regions with two nodes becomes
+unavailable due to an outage and only three nodes are available, no write
+operations complete and the operation hangs because it is unable to persist data on four nodes.
 In this scenario, despite the availability of the majority of nodes in the replica set, the database
 behaves the same as if a majority of the nodes in the replica set were
 unavailable. If you use ``majority`` write concern rather than a numeric value, it prevents this scenario.
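To show the recommendation above in driver terms, here is a minimal sketch, assuming PyMongo and a placeholder connection string, that applies ``majority`` write concern to a collection instead of a numeric value such as ``4``.

.. code-block:: python

   from pymongo import MongoClient
   from pymongo.write_concern import WriteConcern

   # Placeholder connection string.
   client = MongoClient("mongodb+srv://cluster0.example.mongodb.net")

   # w="majority" tracks the replica set configuration, so acknowledged writes
   # continue after one two-node region of a 2-2-1 deployment becomes
   # unavailable, whereas w=4 would hang with only three nodes reachable.
   orders = client.sales.get_collection(
       "orders",
       write_concern=WriteConcern(w="majority", wtimeout=5000),
   )
   orders.insert_one({"order_id": 1, "status": "created"})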
@@ -180,7 +213,7 @@ We recommend that you:
 source data.
 - Test your backup recovery process to ensure that you can restore
 backups in a repeatable and timely manner.
-- Ensure that your {+clusters+} run the same MongoDB versions for
+- Confirm that your {+clusters+} run the same MongoDB versions for
 compatibility during restore.
 - Configure a :atlas:`backup compliance policy
 </backup/cloud-backup/backup-compliance-policy/#std-label-backup-compliance-policy>` to prevent deleting backup
@@ -195,6 +228,10 @@ To avoid resource capacity issues, we recommend that you monitor
 resource utilization and hold regular capacity planning sessions.
 MongoDB Professional Services offers these sessions.

+Over-utilized clusters could fail, causing a disaster.
+Scale up clusters to higher tiers if your utilization regularly triggers alerts at a steady state,
+such as utilization above 60% for system CPU and system memory.
+
 To view your resource utilization, see :atlas:`Monitor Real-Time Performance </real-time-performance-panel>`. To view metrics with the {+atlas-admin-api+}, see :oas-atlas-tag:`Monitoring and Logs </Monitoring-and-Logs>`.

 To learn best practices for alerts and monitoring for resource
@@ -205,11 +242,17 @@ If you encounter resource capacity issues, see :ref:`arch-center-resource-capaci
 Plan Your MongoDB Version Changes
 `````````````````````````````````

-Ensure that you perform MongoDB major version upgrades far before
-your current version reaches `end of life <https://www.mongodb.com/legal/support-policy/lifecycles>`__.
+We recommend that you run the latest MongoDB version because it allows you to
+take advantage of new features and provides improved security guarantees
+compared with previous versions.

-You can't downgrade your MongoDB version using the {+atlas-ui+}, so we recommend that you work directly with MongoDB Professional or Technical
-Services when planning and executing a major version upgrade to avoid any issues that might occur during the upgrade process.
+Ensure that you perform MongoDB major version upgrades far before your
+current version reaches `end of life <https://www.mongodb.com/legal/support-policy/lifecycles>`__.
+
+You can't downgrade your MongoDB version using the {+atlas-ui+}. Because of this,
+we recommend that you work directly with MongoDB Professional or Technical Services
+when planning and executing a major version upgrade. This helps you avoid
+any issues that might occur during the upgrade process.

 Disaster Recovery Recommendations
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -259,7 +302,7 @@ If a single region outage or multi-region outage degrades the state of your {+cl

 .. step:: Identify how many nodes are still online

-You can find info about {+cluster+} health in the {+cluster+}\'s :guilabel:`Overview` tab of the
+You can find information about {+cluster+} health in the {+cluster+}\'s :guilabel:`Overview` tab of the
 {+atlas-ui+}.

 .. step:: Determine how many nodes you require
@@ -303,7 +346,9 @@ unavailable, follow these steps to bring your deployment back online:
 .. procedure::
 :style: normal

-.. step:: Determine when the cloud provider outage began
+.. step:: Determine when the cloud provider outage began.
+
+You will need this information later in this procedure to restore your deployment.

 .. step:: Identify the alternative cloud provider you would like to deploy your new {+cluster+} on

@@ -319,25 +364,28 @@ unavailable, follow these steps to bring your deployment back online:

 Your new {+cluster+} must have an identical topology of the original cluster.

+Alternatively, instead of creating a full new cluster, you can add new
+nodes hosted by an alternative cloud provider to the existing cluster.
+
 .. step:: Restore the most recent snapshot from the previous step into the new {+cluster+}

 To learn how to restore your snapshot, see :atlas:`Restore Your Cluster </backup/cloud-backup/restore-overview/>`.

 .. step:: Switch any applications that connect to the old {+cluster+} to the newly-created {+cluster+}

 To find the new connection string, see :atlas:`Connect via Drivers </driver-connection>`.
+Review your application stack, because you will likely need to redeploy it on the new cloud provider.

 .. _arch-center-atlas-outage:

 {+service+} Outage
 ``````````````````

-In the highly unlikely event that the {+service+} Control Plane is
-unavailable, open a high-priority :atlas:`support ticket </support/#request-support>`.
-
-Your {+cluster+}
-might still be available and accessible even if the {+atlas-ui+} is
-unavailable.
+In the highly unlikely event that the {+service+} Control Plane and
+the {+atlas-ui+} are unavailable, your {+cluster+} is still available and accessible.
+To learn more, see `Platform Reliability <https://www.mongodb.com/products/platform/trust#reliability>`__.
+Open a high-priority :atlas:`support ticket </support/#request-support>`
+to investigate this further.

 .. _arch-center-resource-capacity:

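One way to keep the application cutover described in the step above to a configuration change is to read the connection string from the environment. A minimal sketch with a hypothetical ``MONGODB_URI`` variable; redeploying onto the alternative cloud provider still requires reviewing the rest of the application stack.

.. code-block:: python

   import os
   from pymongo import MongoClient

   # Hypothetical environment variable: point it at the newly created cluster's
   # connection string so the cutover does not require a code change.
   client = MongoClient(os.environ["MONGODB_URI"])

   # A cheap liveness check after the cutover.
   client.admin.command("ping")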
@@ -402,6 +450,9 @@ Deletion of Production Data

 Production data might be accidentally deleted due to human error or a bug
 in the application built on top of the database.
+If the cluster itself was accidentally deleted, |service| might have retained
+the volume temporarily.
+

 If the contents of a collection or database have been deleted, follow these steps to restore your data:

@@ -416,8 +467,9 @@ If the contents of a collection or database have been deleted, follow these step

 .. step:: Restore your data

-If the deletion occurred within the last 72 hours, use Point in Time (PIT) restore to restore from the point in time right before the deletion occurred.
-
+If the deletion occurred within the last 72 hours and you configured continuous backup,
+use Point in Time (PIT) restore to restore from the point in time right before the deletion occurred.
+
 If the deletion did not occur in the past 72 hours, restore the most recent backup from before the deletion occurred into the cluster.

 To learn more, see :atlas:`Restore Your Cluster </backup/cloud-backup/restore-overview/>`.
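Because a point-in-time restore targets the moment just before the deletion, it can help to convert the deletion time identified earlier into an epoch timestamp. A minimal sketch; the deletion time is hypothetical, and the resulting value is used with whichever restore interface you choose (UI, CLI, or Admin API).

.. code-block:: python

   from datetime import datetime, timedelta, timezone

   # Hypothetical deletion time recorded earlier in the procedure.
   deletion_time = datetime(2025, 1, 14, 9, 30, tzinfo=timezone.utc)

   # Restore to one minute before the deletion, expressed as epoch seconds.
   restore_point = deletion_time - timedelta(minutes=1)
   print(int(restore_point.timestamp()))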
@@ -447,7 +499,9 @@ If a driver fails, follow these steps:

 .. step:: Evaluate if any other changes are required to move to an earlier driver version

-This might include application code or query changes.
+This might include application code or query changes. For example,
+there might be breaking changes if you are moving between
+major or minor versions.

 .. step:: Test the changes in a non-production environment
