diff --git a/source/administration/replica-sets.txt b/source/administration/replica-sets.txt index 15b97e587ff..2fc66619a6e 100644 --- a/source/administration/replica-sets.txt +++ b/source/administration/replica-sets.txt @@ -46,11 +46,13 @@ configurations. .. warning:: The :method:`rs.reconfig()` shell command can force the current - primary to step down, which causes an election. When the primary + primary to step down, which causes an :ref:`election `. When the primary steps down, the :program:`mongod` closes all client connections. While, this typically takes 10-20 seconds, attempt to make these changes during scheduled maintenance periods. +.. include:: /includes/seealso-elections.rst + .. index:: replica set members; secondary only .. _replica-set-secondary-only-members: .. _replica-set-secondary-only-configuration: @@ -69,9 +71,10 @@ these members from ever becoming primary. To configure a member as secondary-only, set its :data:`members[n].priority` value to ``0``. Any member with a -:data:`members[n].priority` equal to ``0`` will never seek election and -cannot become primary in any situation. For more information on priority -levels, see :ref:`replica-set-node-priority`. +:data:`members[n].priority` equal to ``0`` will never seek +:ref:`election ` and cannot become primary in any +situation. For more information on priority levels, see +:ref:`replica-set-node-priority`. As an example of modifying member priorities, assume a four-member replica set with member ``_id`` values of: ``0``, ``1``, ``2``, and @@ -107,7 +110,7 @@ This sets the following: If your replica set has an even number of members, add an :ref:`arbiter ` to ensure that members can quickly obtain a majority of votes in an - :ref:`election ` for primary. + election for primary. .. seealso:: :data:`members[n].priority` and :ref:`Replica Set Reconfiguration `. @@ -155,7 +158,7 @@ other members in the set will not advertise the hidden member in the of ``0``, the operation fails. .. 
seealso:: :ref:`Replica Set Read Preference ` - and :ref:`Replica Set Reconfiguration ` + and :ref:`Replica Set Reconfiguration `. .. index:: replica set members; delayed .. _replica-set-delayed-members: @@ -183,8 +186,8 @@ the amount of slave delay to apply: - The size of the oplog is sufficient to capture *more than* the number of operations that typically occur in that period of - time. See the section on :ref:`oplog sizing - ` for more information. + time. For more information on oplog size, see the + :ref:`replica-set-oplog-sizing` topic in the :doc:`/core/replication` document. Delayed members must have a :term:`priority` set to ``0`` to prevent them from becoming primary in their replica sets. Also these members @@ -233,7 +236,7 @@ Arbiters Arbiters are special :program:`mongod` instances that do not hold a copy of the data and thus cannot become primary. Arbiters exist solely -participate in :term:`elections `. +participate in :ref:`elections `. .. note:: @@ -290,15 +293,14 @@ Non-Voting ~~~~~~~~~~ You may choose to change the number of votes that each member has in -:term:`elections ` for :term:`primary`. In general, all +:ref:`elections ` for :term:`primary`. In general, all members should have only 1 vote to prevent intermittent ties, deadlock, or the wrong members from becoming :term:`primary`. Use :ref:`replica set priorities ` to control which members are more likely to become primary. -To disable a member's ability to vote in :ref:`elections -` use the following command sequence in the -:program:`mongo` shell. +To disable a member's ability to vote in elections, use the following +command sequence in the :program:`mongo` shell. .. code-block:: javascript @@ -394,7 +396,7 @@ you specify a full configuration object with :method:`rs.add()`, you must declare the ``_id`` field, which is not automatically populated in this case. -.. seealso:: :doc:`/tutorial/expand-replica-set` +.. seealso:: :doc:`/tutorial/expand-replica-set`. .. 
_replica-set-admin-procedure-remove-members: @@ -454,7 +456,7 @@ number. :method:`rs.reconfig()` will not change the value of .. warning:: Any replica set configuration change can trigger the current - :term:`primary` to step down, which forces an :term:`election`. This + :term:`primary` to step down, which forces an :ref:`election `. This causes the current shell session, and clients connected to this replica set, to produce an error even when the operation succeeds. @@ -486,7 +488,7 @@ the new configuration. If a member has :data:`members[n].priority` set to ``0``, it is ineligible to become :term:`primary` and will not seek -elections. :ref:`Hidden members `, +election. :ref:`Hidden members `, :ref:`delayed members `, and :ref:`arbiters ` all have :data:`members[n].priority` set to ``0``. @@ -741,4 +743,5 @@ data to a :term:`BSON` file that you can view using You can prevent rollbacks by ensuring safe writes by using the appropriate :term:`write concern`. -.. seealso:: :ref:`Replica Set Elections ` +.. include:: /includes/seealso-elections.rst + diff --git a/source/applications/replication.txt b/source/applications/replication.txt index 214c6cc7915..c6c46254df1 100644 --- a/source/applications/replication.txt +++ b/source/applications/replication.txt @@ -17,6 +17,36 @@ This document describes those options and their implications. shards are also replica sets provide the same configuration options with regards to write and read operations. +.. TODO Is any of the following missing from this document: + +.. Writes committed at the primary may be visible before the + cluster-wide commit completes. The read uncommitted semantics (an + option on many databases) are more relaxed and make theoretically + achievable performance and availability higher (for example we never + have an object locked in the server where the locking is dependent on + network performance). + +.. 
On a failover, if there are writes which have not replicated from the + primary, the writes are rolled back. To confirm replica-set-wide + commits, use the getLastError command. On a failover, data is backed + up to files in the rollback directory. To recover this data, use + mongorestore. + +.. Merging back old operations later, after another member has accepted + writes, is a hard problem. One then has multi-master replication, + with potential for conflicting writes. Typically that is handled in + other products by manual version reconciliation code by developers. + That is too much work. Multi-master also can make atomic operation + semantics problematic. It is possible (as mentioned above) to + manually recover these events, via manual DBA effort, but in a large + system with many, many members such efforts become impractical. + +.. Calling getLastError causes the client to wait for a response from + the server. This can slow the client's throughput on writes if large + numbers are made because of the client/server network turnaround + times. Thus for "non-critical" writes it often makes sense to make no + getLastError check at all, or only a single check after many writes. + .. _write-concern: .. _replica-set-write-concern: diff --git a/source/core/replication-internals.txt b/source/core/replication-internals.txt index 2d016c65ea6..b73c6802a0c 100644 --- a/source/core/replication-internals.txt +++ b/source/core/replication-internals.txt @@ -18,17 +18,17 @@ troubleshooting and for further understanding MongoDB's behavior and approach. Oplog ----- -Replication itself works by way of a special :term:`capped collection` -called the :term:`oplog`. This collection keeps a rolling record of -all operations applied to the :term:`primary`. Secondary members then -replicate this log by applying the operations to themselves in an -asynchronous process. Under normal operation, :term:`secondary` members -reflect writes within one second of the primary. 
However, various -exceptional situations may cause secondaries to lag behind further. See +For an explanation of the oplog, see the :ref:`oplog ` +topic in the :doc:`/core/replication` document. + +Under various exceptional +situations, updates to a :term:`secondary's ` oplog might +lag behind the desired performance time. See :ref:`Replication Lag ` for details. -All members send heartbeats (pings) to all other members in the set and can -import operations to the local oplog from any other member in the set. +All members of a :term:`replica set` send heartbeats (pings) to all +other members in the set and can import operations to the local oplog +from any other member in the set. Replica set oplog operations are :term:`idempotent`. The following operations require idempotency: @@ -37,20 +37,21 @@ operations require idempotency: - post-rollback catch-up - sharding chunk migrations -.. seealso:: The :ref:`replica-set-oplog-sizing` topic in - :doc:`/core/replication`. - .. TODO Verify that "sharding chunk migrations" (above) requires idempotency. The wiki was unclear on the subject. .. In 2.0, replicas would import entries from the member lowest .. "ping," This wasn't true in 1.8 and will likely change in 2.2. +.. _replica-set-data-integrity: .. _replica-set-implementation: -Implementation +Data Integrity -------------- +Read Preferences +~~~~~~~~~~~~~~~~ + MongoDB uses :term:`single-master replication` to ensure that the database remains consistent. However, clients may modify the :ref:`read preferences ` on a @@ -59,10 +60,9 @@ per-connection basis in order to distribute read operations to the greater query throughput by distributing reads to secondary members. But keep in mind that replication is asynchronous; therefore, reads from secondaries may not always reflect the latest writes to the -:term:`primary`. See the :ref:`consistency ` -section for more about :ref:`read preference -` and :ref:`write concern -`. +:term:`primary`. + +.. 
seealso:: :ref:`replica-set-consistency` .. note:: @@ -71,16 +71,12 @@ section for more about :ref:`read preference output to asses the current state of replication and determine if there is any unintended replication delay. -In the default configuration, all members have an equal chance of -becoming primary; however, it's possible to set :data:`priority ` values that -weight the election. In some architectures, there may be operational -reasons for increasing the likelihood of a specific replica set member -becoming primary. For instance, a member located in a remote data -center should *not* become primary. See: :ref:`node -priority ` for more background on this -concept. - -Replica sets can also include members with the following four special +.. _replica-set-member-configurations-internals: + +Member Configurations +--------------------- + +Replica sets can include members with the following four special configurations that affect membership behavior: - :ref:`Secondary-only ` members have @@ -106,6 +102,12 @@ unique set of administrative requirements and concerns. Choosing the right :doc:`system architecture ` for your data set is crucial. +.. seealso:: The :ref:`replica-set-member-configurations` topic in the + :doc:`/administration/replica-sets` document. + +Security +-------- + Administrators of replica sets also have unique :ref:`monitoring ` and :ref:`security ` concerns. The :ref:`replica set functions ` in @@ -122,35 +124,46 @@ modify the configuration of an existing replica set. Elections --------- -When you initialize a :term:`replica set` for the first time, or when any -failover occurs, an election takes place to decide which member should +Elections are the process :term:`replica set` members use to select which member should become :term:`primary`. A primary is the only member in the replica set that can accept write operations, including :method:`insert() `, :method:`update() `, and :method:`remove() `. 
-Elections are the process replica set members use to -select the primary in a set. Two types of events can trigger an election: -a primary steps down or a :term:`secondary` member -loses contact with a primary. All members have one vote -in an election, and any :program:`mongod` can veto an election. A -single veto invalidates the election. - -An existing primary will step down in response to the -:dbcommand:`replSetStepDown` command or if it sees that one of -the current secondaries is eligible for election *and* has a higher -priority. A secondary will call for an election if it cannot -establish a connection to a primary. A primary will also step -down when it cannot contact a majority of the members of the replica -set. When the current primary steps down, it closes all open client -connections to prevent clients from unknowingly writing data to a -non-primary member. - -In an election, every member, including :ref:`hidden -` members, :ref:`arbiters -`, and even recovering members, get a single -vote. Members will give votes to every eligible member that calls an -election. +The following events can trigger an election: + +- You initialize a replica set for the first time. + +- A primary steps down. A primary will step down in response to the + :dbcommand:`replSetStepDown` command or if it sees that one of the + current secondaries is eligible for election *and* has a higher + priority. A primary also will step down when it cannot contact a + majority of the members of the replica set. When the current primary + steps down, it closes all open client connections to prevent clients + from unknowingly writing data to a non-primary member. + +- A :term:`secondary` member loses contact with a primary. A secondary + will call for an election if it cannot establish a connection to a + primary. + +- A :term:`failover` occurs. + +In an election, all members have one vote, +including :ref:`hidden ` members, :ref:`arbiters +`, and even recovering members. 
+Any :program:`mongod` can veto an election. + +In the default configuration, all members have an equal chance of +becoming primary; however, it's possible to set :data:`priority +` values that weight the election. In some +architectures, there may be operational reasons for increasing the +likelihood of a specific replica set member becoming primary. For +instance, a member located in a remote data center should *not* become +primary. See: :ref:`replica-set-node-priority` for more +information. + +Any member of a replica set can veto an election, even if the +member is a :ref:`non-voting member `. A member of the set will veto an election under the following conditions: @@ -167,15 +180,10 @@ conditions: (i.e. a higher "optime") than the member seeking election, from the perspective of the voting member. -- The current primary will also veto an election if it has the same or +- The current primary will veto an election if it has the same or more recent operations (i.e. a "higher or equal optime") than the member seeking election. -.. note:: - - Any member of a replica set *can* veto an election, even if the - member is a :ref:`non-voting member `. - The first member to receive votes from a majority of members in a set becomes the next primary until the next election. Be aware of the following conditions and possible situations: @@ -186,15 +194,9 @@ aware of the following conditions and possible situations: - Replica set members compare priorities only with other members of the set. The absolute value of priorities does not have any impact on - the outcome of replica set elections. - - .. note:: - - The only exception is that members with :data:`priority - ` values of ``0`` - cannot become primary and will not seek election. See - :ref:`replica-set-node-priority-configuration` for more - information. + the outcome of replica set elections, with the exception of the value ``0``, + which indicates the member cannot become primary and cannot seek election. 
+ For details, see :ref:`replica-set-node-priority-configuration`. - A replica set member cannot become primary *unless* it has the highest "optime" of any visible member in the set. @@ -204,12 +206,24 @@ aware of the following conditions and possible situations: primary until the member with the highest priority catches up to the latest operation. - .. seealso:: :ref:`Non-voting members in a replica set `, :ref:`replica-set-node-priority-configuration`, and :data:`replica configuration `. +Elections and Network Partitions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. TODO The following two paragraphs need review -BG + +Members on either side of a network partition cannot see each other when +determining whether a majority is available to hold an election. + +That means that if a primary steps down and neither side of the +partition has a majority on its own, the set will not elect a new +primary and the set will become read-only. The best practice is to have +a majority of servers in one data center and one server in another. + Syncing ------- diff --git a/source/core/replication.txt b/source/core/replication.txt index 132d2257e60..2f68448a420 100644 --- a/source/core/replication.txt +++ b/source/core/replication.txt @@ -68,7 +68,7 @@ You can configure a member as any of the following: a specified delay. See :ref:`replica-set-delayed-members`. - **Arbiters**: These members do not hold data and exist solely to - participate in :term:`elections `. See :ref:`replica-set-arbiters`. + participate in :ref:`elections `. See :ref:`replica-set-arbiters`. - **Non-Voting**: These members cannot vote in elections. See :ref:`replica-set-non-voting-members`. @@ -97,7 +97,7 @@ administrator intervention. The election allows replica sets to recover from failover situations very quickly and robustly. Whenever the primary becomes unreachable, the secondary members -trigger an :ref:`election `. The first member to +trigger an election. 
The first member to receive votes from a majority of the set will become primary. The most important feature of replica set elections is that a majority of the original number of members in the replica set must be present for @@ -114,7 +114,7 @@ remain a secondary. view of the :term:`replica set` and helps prevent :term:`rollbacks `. -.. seealso:: :ref:`Replica Set Election Internals ` +.. seealso:: The :ref:`replica-set-election-internals` topic in the :doc:`/core/replication-internals` document. .. index:: replica set; priority .. _replica-set-node-priority: @@ -127,7 +127,7 @@ In a replica set, every member has a "priority," that helps determine eligibility for :ref:`election ` to :term:`primary`. By default, all members have a priority of ``1``, unless you modify the :data:`members[n].priority` value. All members -have a single vote in :ref:`elections `. +have a single vote in elections. .. warning:: @@ -267,12 +267,16 @@ Oplog 64 bit = larger of 5% of disk or ~1 gigabyte 64 bit OS X = ~183 megabytes -The operation log (i.e. :term:`oplog`) is a :term:`capped collection` -that stores all operations that modify the data stored in MongoDB. All -members of the replica set have oplogs that allow them to maintain the -current state of the database. Unless you modify the size of your -oplog with the :setting:`oplogSize` option, the *default* size of the -oplog will be as follows: +The :term:`oplog` (operations log) is a special :term:`capped +collection` that keeps a rolling record of all operations that modify +the data stored in your databases. MongoDB applies database operations +on the :term:`primary` and then records the operations on the primary's +oplog. The :term:`secondary` members then replicate this log and apply +the operations to themselves in an asynchronous process. All replica set +members contain a copy of the oplog, allowing them to maintain the +current state of the database. 
+ +By default, the size of the oplog is as follows: - For 64-bit Linux, Solaris, and FreeBSD systems, MongoDB will allocate 5% of the available free disk space to the oplog. @@ -286,11 +290,12 @@ oplog will be as follows: - For 32-bit systems, MongoDB allocates about 48 megabytes of space to the oplog. -.. note:: +Before oplog creation, you can specify the size of your oplog with the +:setting:`oplogSize` option. Once the oplog is created, you can only +change the size of the oplog by using the :ref:`oplog resizing procedure +`. - Once created, you cannot change the size of the oplog without using - the :ref:`oplog resizing procedure ` - outlined in the :doc:`/tutorial/change-oplog-size` guide. +.. QUESTION: SK, Can the next graph be rewritten? It's unclear to me. BG For example, if an oplog fits 24 hours of operations, then members can stop copying entries from the oplog for 24 hours before they require @@ -324,6 +329,9 @@ activity of your MongoDB-based application are reads and you are writing a small amount of data, you may find that you need a much smaller oplog. +.. seealso:: The :ref:`replica-set-oplog` topic in + the :doc:`/core/replication-internals` document. + Deployment ~~~~~~~~~~ diff --git a/source/includes/seealso-elections.rst b/source/includes/seealso-elections.rst new file mode 100644 index 00000000000..3f935aaf992 --- /dev/null +++ b/source/includes/seealso-elections.rst @@ -0,0 +1,4 @@ +.. seealso:: The :ref:`replica-set-elections` topic in the + :doc:`/core/replication` document, and the + :ref:`replica-set-election-internals` topic in the + :doc:`/core/replication-internals` document.