MongoDB provides high availability by having multiple copies of data in replica sets. Members of the same replica set don't share the same resources. For example, members of the same replica set don't share the same physical hosts and disks. |service| satisfies this requirement by default: it deploys nodes in different availability zones, on different physical hosts and disks.

Ensure that replica sets don't share resources by :ref:`distributing data across data centers <arch-center-distribute-data>`.

Use an Odd Number of Replica Set Members
````````````````````````````````````````

To elect a :manual:`primary </core/replica-set-members>`, you need a majority of :manual:`voting </core/replica-set-elections>` replica set members available. We recommend that you create replica sets with an odd number of voting replica set members. There is no benefit in having an even number of voting replica set members. |service| satisfies this requirement by default, as |service| requires replica sets to have 3, 5, or 7 nodes.

Fault tolerance is the number of replica set members that can become
unavailable with enough members still available for a primary election.
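
The following sketch (plain Python, not driver code) illustrates the arithmetic behind this guidance: electing a primary requires a strict majority of voting members, so a fourth voting member raises the required majority without improving fault tolerance.

.. code-block:: python

   def majority(voting_members: int) -> int:
       """Votes required to elect a primary."""
       return voting_members // 2 + 1

   def fault_tolerance(voting_members: int) -> int:
       """Members that can fail while a primary can still be elected."""
       return voting_members - majority(voting_members)

   for n in (3, 4, 5, 7):
       print(n, majority(n), fault_tolerance(n))
   # 3 -> majority 2, fault tolerance 1
   # 4 -> majority 3, fault tolerance 1 (no better than 3 members)
   # 5 -> majority 3, fault tolerance 2
   # 7 -> majority 4, fault tolerance 3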

To learn more about voting, see :manual:`Replica Set Elections </core/replica-set-elections>`.

.. _arch-center-distribute-data:

Distribute Data Across At Least Three Data Centers
``````````````````````````````````````````````````

To guarantee that a replica set can elect a primary if a data center
becomes unavailable, you must distribute nodes across at least three
data centers, but we recommend that you use five data centers.

If you choose a region for your data centers that supports availability zones, you can distribute nodes in data centers in different availability zones. This way, you can have multiple separate physical data centers, each in its own availability zone and in the same region.

This section illustrates why we recommend a deployment with five data centers. To begin, consider deployments with two and three data centers.

Consider the following diagram, which shows data distributed across two data centers:
When you distribute nodes across three data centers, if one data center becomes unavailable, you still have two out of three replica set members available, which maintains a majority to elect a primary.

In addition to ensuring high availability, we recommend that you ensure the continuity of write operations. For this reason, deploy five data centers to achieve the 2+2+1 topology required for ``majority`` write concern. See :ref:`majority write concern <arch-center-majority-write-concern>` in this topic for a detailed explanation of this requirement.

You can distribute data across at least three data centers within the same region by choosing a region with at least three availability zones. Each availability zone contains one or more discrete data centers, each with redundant power, networking, and connectivity, often housed in separate facilities.

{+service+} uses availability zones for all cloud providers automatically when you deploy a dedicated cluster to a region that supports availability zones. |service| splits the cluster's nodes across availability zones. For example, for a three-node replica set {+cluster+} deployed to a three-availability-zone region, {+service+} deploys one node in each zone. A local failure in the data center hosting one node doesn't impact the operation of data centers hosting the other nodes because MongoDB performs automatic failover and leader election. Applications automatically recover from local failures.

We recommend that you deploy replica sets to the following regions because they support at least three availability zones:

Use ``mongos`` Redundancy for Sharded {+Clusters+}
``````````````````````````````````````````````````

When a client connects to a sharded {+cluster+}, we recommend that you include multiple :manual:`mongos </reference/program/mongos/>` processes, separated by commas, in the connection URI. To learn more, see :manual:`MongoDB Connection String Examples </reference/connection-string-examples/#self-hosted-replica-set-with-members-on-different-machines>`. This allows operations to route to different ``mongos`` instances for load
balancing, but it is also important for disaster recovery.
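
As a sketch only, the following connection (PyMongo, with hypothetical hostnames and credentials) lists three ``mongos`` hosts in one URI; the driver can then route operations to any of them and fail over if one data center's ``mongos`` becomes unreachable.

.. code-block:: python

   from pymongo import MongoClient

   # Hypothetical mongos hostnames; replace with your own hosts and credentials.
   uri = (
       "mongodb://appUser:appPassword@"
       "mongos-dc1.example.net:27017,"
       "mongos-dc2.example.net:27017,"
       "mongos-dc3.example.net:27017/"
       "?tls=true"
   )

   client = MongoClient(uri)
   print(client.admin.command("ping"))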

Consider the following diagram, which shows a sharded {+cluster+} deployed across multiple data centers. If one data center becomes unavailable, applications can still reach the ``mongos`` processes in the other data centers.

You can use :manual:`retryable reads </core/retryable-reads/>` and :manual:`retryable writes </core/retryable-writes/>` to simplify the required error handling for the ``mongos``
configuration.
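
As a minimal sketch (PyMongo with a placeholder URI; recent drivers enable both options by default), you can set the retry options explicitly on the client:

.. code-block:: python

   from pymongo import MongoClient

   # retryWrites and retryReads default to True in recent driver versions;
   # they are shown here only for clarity.
   client = MongoClient(
       "mongodb+srv://cluster0.example.mongodb.net/",
       retryWrites=True,
       retryReads=True,
   )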

.. _arch-center-majority-write-concern:

Use ``majority`` Write Concern
``````````````````````````````

If you had a three-node replica set and a write concern of ``majority``, every write operation would need to be persisted on two nodes before an acknowledgment of completion is sent to the driver that issued the write operation. For the best protection from a regional node outage, we recommend that you set the write concern to ``majority``.

Even though ``majority`` write concern increases latency compared with write concern ``1``, we recommend it because it allows your data centers to continue to accept write operations even if a replica set loses its primary.

To understand the importance of ``majority`` write concern, consider a five-node replica set spread across three separate regions with a 2-2-1 topology (two regions with two nodes and one region with one node), with a write concern of ``4``. If one of the regions with two nodes becomes unavailable due to an outage and only three nodes are available, no write operations complete and the operation hangs because it is unable to persist data on four nodes. In this scenario, despite the availability of the majority of nodes in the replica set, the database behaves the same as if a majority of the nodes in the replica set were
unavailable. If you use ``majority`` write concern rather than a numeric value, it prevents this scenario.
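
For illustration, the following sketch (PyMongo, with placeholder database and collection names) applies ``majority`` write concern at the collection level, which keeps writes available in the 2-2-1 scenario above where a numeric write concern of ``4`` would hang:

.. code-block:: python

   from pymongo import MongoClient, WriteConcern

   client = MongoClient("mongodb+srv://cluster0.example.mongodb.net/")  # placeholder URI

   # Writes are acknowledged once a majority of voting, data-bearing
   # members have persisted them.
   orders = client["inventory"].get_collection(
       "orders", write_concern=WriteConcern(w="majority")
   )
   orders.insert_one({"sku": "abc123", "qty": 1})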

We recommend that you:

- Test your backup recovery process to ensure that you can restore backups in a repeatable and timely manner.
- Confirm that your {+clusters+} run the same MongoDB versions for compatibility during restore.
- Configure a :atlas:`backup compliance policy </backup/cloud-backup/backup-compliance-policy/#std-label-backup-compliance-policy>` to prevent deleting backup snapshots.

To avoid resource capacity issues, we recommend that you monitor resource utilization and hold regular capacity planning sessions. MongoDB Professional Services offers these sessions.

Over-utilized {+clusters+} can fail, causing a disaster. Scale up {+clusters+} to higher tiers if your utilization regularly triggers alerts at a steady state, such as above 60% utilization for system CPU and system memory.
To view your resource utilization, see :atlas:`Monitor Real-Time Performance </real-time-performance-panel>`. To view metrics with the {+atlas-admin-api+}, see :oas-atlas-tag:`Monitoring and Logs </Monitoring-and-Logs>`.
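
The following sketch shows one way to pull process measurements from Python with the {+atlas-admin-api+}. The resource path, query parameters, and versioned ``Accept`` header are assumptions based on the Monitoring and Logs reference and may differ for your API version; the project ID, process name, and API keys are placeholders.

.. code-block:: python

   import requests
   from requests.auth import HTTPDigestAuth

   # Placeholders: substitute your project (group) ID, process host:port,
   # and programmatic API keys.
   GROUP_ID = "<project-id>"
   PROCESS = "<hostname>:27017"
   PUBLIC_KEY, PRIVATE_KEY = "<public-key>", "<private-key>"

   url = (
       "https://cloud.mongodb.com/api/atlas/v2"
       f"/groups/{GROUP_ID}/processes/{PROCESS}/measurements"
   )
   resp = requests.get(
       url,
       auth=HTTPDigestAuth(PUBLIC_KEY, PRIVATE_KEY),
       params={"granularity": "PT1M", "period": "PT1H"},
       headers={"Accept": "application/vnd.atlas.2023-01-01+json"},
   )
   resp.raise_for_status()
   for measurement in resp.json().get("measurements", []):
       print(measurement["name"])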

To learn best practices for alerts and monitoring for resource

If you encounter resource capacity issues, see :ref:`arch-center-resource-capacity`.

Plan Your MongoDB Version Changes
`````````````````````````````````

We recommend that you run the latest MongoDB version because it allows you to take advantage of new features and provides improved security guarantees compared with previous versions.

Ensure that you perform MongoDB major version upgrades well before your current version reaches `end of life <https://www.mongodb.com/legal/support-policy/lifecycles>`__.

You can't downgrade your MongoDB version using the {+atlas-ui+}. Because of this, we recommend that you work directly with MongoDB Professional or Technical Services when planning and executing a major version upgrade. This helps you avoid
any issues that might occur during the upgrade process.
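
Before an upgrade, it can help to confirm the server version and feature compatibility version that a {+cluster+} reports. A minimal sketch with PyMongo and a placeholder URI, using the standard ``buildInfo`` and ``getParameter`` commands:

.. code-block:: python

   from pymongo import MongoClient

   client = MongoClient("mongodb+srv://cluster0.example.mongodb.net/")  # placeholder URI

   # Server binary version.
   print(client.admin.command("buildInfo")["version"])

   # Feature compatibility version, which often trails the binary version
   # during a staged major-version upgrade.
   fcv = client.admin.command(
       {"getParameter": 1, "featureCompatibilityVersion": 1}
   )
   print(fcv["featureCompatibilityVersion"])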

Disaster Recovery Recommendations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a single region outage or multi-region outage degrades the state of your {+cluster+}, follow these steps:

.. step:: Identify how many nodes are still online

You can find information about {+cluster+} health in the {+cluster+}'s :guilabel:`Overview` tab of the
{+atlas-ui+}.
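
As a complement to the {+atlas-ui+}, you can also count healthy members from an application host, assuming your database user has privileges to run the ``replSetGetStatus`` command. A sketch with PyMongo and a placeholder URI:

.. code-block:: python

   from pymongo import MongoClient

   client = MongoClient("mongodb+srv://cluster0.example.mongodb.net/")  # placeholder URI

   status = client.admin.command("replSetGetStatus")
   healthy = [m["name"] for m in status["members"] if m.get("health") == 1]
   print(f"{len(healthy)} of {len(status['members'])} members online: {healthy}")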

.. step:: Determine how many nodes you require

If an entire cloud provider becomes unavailable, follow these steps to bring your deployment back online:

.. procedure::
:style: normal

.. step:: Determine when the cloud provider outage began

You will need this information later in this procedure to restore your deployment.

.. step:: Identify the alternative cloud provider you would like to deploy your new {+cluster+} on

Your new {+cluster+} must have an identical topology to the original {+cluster+}.

Alternatively, instead of creating an entirely new {+cluster+}, you can add new nodes hosted by an alternative cloud provider to the existing {+cluster+}.

.. step:: Restore the most recent snapshot from the previous step into the new {+cluster+}

To learn how to restore your snapshot, see :atlas:`Restore Your Cluster </backup/cloud-backup/restore-overview/>`.

.. step:: Switch any applications that connect to the old {+cluster+} to the newly created {+cluster+}

To find the new connection string, see :atlas:`Connect via Drivers </driver-connection>`.
Review your application stack as you will likely need to redeploy it onto the new cloud provider.
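
One way to keep this cutover simple is to avoid hard-coding the connection string in application code and read it from configuration instead, for example from an environment variable (``MONGODB_URI`` here is an assumed name, not a required one):

.. code-block:: python

   import os
   from pymongo import MongoClient

   # Switching to the newly created cluster then only requires updating
   # this configuration value and restarting the application.
   client = MongoClient(os.environ["MONGODB_URI"])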

.. _arch-center-atlas-outage:

{+service+} Outage
``````````````````

In the highly unlikely event that the {+service+} Control Plane and the {+atlas-ui+} are unavailable, your {+cluster+} is still available and accessible. To learn more, see `Platform Reliability <https://www.mongodb.com/products/platform/trust#reliability>`__. Open a high-priority :atlas:`support ticket </support/#request-support>` to investigate the outage further.

.. _arch-center-resource-capacity:

Deletion of Production Data
```````````````````````````

Production data might be accidentally deleted due to human error or a bug in the application built on top of the database. If the {+cluster+} itself was accidentally deleted, |service| might have retained the volume temporarily.

If the contents of a collection or database have been deleted, follow these steps to restore your data:

.. step:: Restore your data

If the deletion occurred within the last 72 hours and you configured continuous backup, use Point in Time (PIT) restore to restore from the point in time right before the deletion occurred.

If the deletion did not occur in the past 72 hours, restore the most recent backup from before the deletion occurred into the {+cluster+}.

To learn more, see :atlas:`Restore Your Cluster </backup/cloud-backup/restore-overview/>`.

If a driver fails, follow these steps:

.. step:: Evaluate if any other changes are required to move to an earlier driver version

This might include application code or query changes. For example, there might be a breaking change if you move across major or minor driver versions.

.. step:: Test the changes in a non-production environment