
DOCS-467 new info on replica set troubleshooting #278

Merged 2 commits on Oct 4, 2012
158 changes: 136 additions & 22 deletions source/administration/replica-sets.txt
@@ -385,7 +385,7 @@

Removing Members
~~~~~~~~~~~~~~~~

You may remove a member of a replica set at any time. Use the
:method:`rs.remove()` method in the :program:`mongo` shell while
connected to the current :term:`primary`. Issue the
:method:`db.isMaster()` command when connected to *any* member of the
set to determine the current primary. Use a command in either
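
For example, a minimal sketch using hypothetical hostnames:

.. code-block:: javascript

// Run against any member to learn which host is currently primary.
db.isMaster().primary // e.g. "m1.example.net:27017"

// Then, while connected to that primary, remove the departing member.
rs.remove("m3.example.net:27017")
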
@@ -561,38 +561,65 @@ OpenSSL package to generate "random" content for use in a key file:

Key file permissions are not checked on Windows systems.

Troubleshooting Replica Sets
----------------------------

This section describes common strategies for troubleshooting
:term:`replica sets <replica set>`.

.. seealso:: :doc:`/administration/monitoring`.

.. _replica-set-troubleshooting-check-replication-status:

Check Replica Set Status
~~~~~~~~~~~~~~~~~~~~~~~~

To display the current state of the replica set and the current state of
each member, run the :method:`rs.status()` method in a :program:`mongo`
shell connected to the replica set's :term:`primary`. For descriptions
of the information displayed by :method:`rs.status()`, see
:doc:`/reference/replica-status`.

.. note:: The :method:`rs.status()` method is a wrapper that runs the
:dbcommand:`replSetGetStatus` database command.
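
For example, the following minimal sketch assumes a primary at the
hypothetical host ``m1.example.net``:

.. code-block:: javascript

// From a shell started with: mongo --host m1.example.net --port 27017
// rs.status() reports each member's name, state, health, and optime.
rs.status()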

.. _replica-set-replication-lag:

Check the Replication Lag
~~~~~~~~~~~~~~~~~~~~~~~~~

Replication lag is a delay between an operation on the :term:`primary`
and the application of that operation from the :term:`oplog` to the
:term:`secondary`. Replication lag can be a significant issue and can
seriously affect MongoDB :term:`replica set` deployments. Excessive
replication lag makes "lagged" members ineligible to quickly become
primary and increases the possibility that distributed
read operations will be inconsistent.

To check the current length of replication lag:

- In a :program:`mongo` shell connected to the primary, call the
:method:`db.printSlaveReplicationInfo()` method.

The output displays the ``syncedTo`` value for each member, which shows
when each member last read from the oplog. (You can also estimate lag
from :method:`rs.status()` output, as sketched after this list.) For
example:

.. code-block:: javascript

source: m1.example.net:30001
syncedTo: Tue Oct 02 2012 11:33:40 GMT-0400 (EDT)
= 7475 secs ago (2.08hrs)
source: m2.example.net:30002
syncedTo: Tue Oct 02 2012 11:33:40 GMT-0400 (EDT)
= 7475 secs ago (2.08hrs)

- Monitor the rate of replication by watching the oplog time in the
"replica" graph in the `MongoDB Monitoring Service`_. For more
information, see the `documentation for MMS`_.

.. _`MongoDB Monitoring Service`: http://mms.10gen.com/
.. _`documentation for MMS`: http://mms.10gen.com/help/
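
The following is a minimal sketch of a third approach: comparing each
member's :data:`members[n].optimeDate`, as reported by
:method:`rs.status()`, against the primary's. Run it in a
:program:`mongo` shell connected to any member:

.. code-block:: javascript

// Estimate lag by comparing each member's optimeDate (the time of the
// last operation it applied) against the primary's.
var status = rs.status();
var primaryOptime = null;
status.members.forEach(function (m) {
    if (m.state === 1) { primaryOptime = m.optimeDate; } // state 1 == PRIMARY
});
status.members.forEach(function (m) {
    var lagSecs = (primaryOptime - m.optimeDate) / 1000;
    print(m.name + ": " + lagSecs + " seconds behind the primary");
});
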
@@ -635,9 +662,9 @@ Possible causes of replication lag include:

If you are performing a large data ingestion or bulk load operation
that requires a large number of writes to the primary, the
secondaries will not be able to read the oplog fast enough to keep
up with changes. Setting some level of write concern can
slow the overall progress of the batch but will prevent the
secondary from falling too far behind.

To prevent this, use write concern so that MongoDB will perform
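
A minimal sketch of this pattern follows; the collection name and the
checkpoint interval are assumptions for illustration:

.. code-block:: javascript

// Bulk load with a periodic replication checkpoint: every 1000
// inserts, block until at least one secondary has the writes.
for (var i = 0; i < 100000; i++) {
    db.records.insert({ n: i }); // "records" is a hypothetical collection
    if (i % 1000 === 0) {
        // w: 2 means the primary plus one secondary must acknowledge
        db.runCommand({ getLastError: 1, w: 2 });
    }
}
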
@@ -653,17 +680,104 @@ Possible causes of replication lag include:
- :ref:`replica-set-write-concern`.
- The :ref:`replica-set-oplog-sizing` topic in the :doc:`/core/replication` document.
- The :ref:`replica-set-oplog` topic in the :doc:`/core/replication-internals` document.
- The :ref:`replica-set-procedure-change-oplog-size` topic in this document.
- The :doc:`/tutorial/change-oplog-size` tutorial.

.. _replica-set-troubleshooting-check-oplog-size:

Check the Size of the Oplog
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :term:`oplog` size can be the difference between a :term:`secondary`
staying up-to-date and becoming stale.

Contributor note: Longer oplogs aren't really a solution to replication
lag, just a way to either delay the need to resync or, for sets that
have a "bursty" usage pattern, make it possible to support longer
bursts of high activity.

To check the size of the oplog for a given :term:`replica set` member,
connect to the member in a :program:`mongo` shell and run the
:method:`db.printReplicationInfo()` method.

The output displays the size of the oplog and the date ranges of the
operations contained in the oplog. In the following example, the oplog
is about 10MB and is able to fit about 26 hours (94400 seconds) of
operations:

.. code-block:: javascript

configured oplog size: 10.10546875MB
log length start to end: 94400 (26.22hrs)
oplog first event time: Mon Mar 19 2012 13:50:38 GMT-0400 (EDT)
oplog last event time: Wed Oct 03 2012 14:59:10 GMT-0400 (EDT)
now: Wed Oct 03 2012 15:00:21 GMT-0400 (EDT)
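
You can also compute the window directly from the oplog itself. The
following is a minimal sketch; it assumes shell access to the member's
``local`` database, where replica set members store the oplog in the
``oplog.rs`` collection:

.. code-block:: javascript

// Compute the oplog window, in hours, from the first and last entries.
var local = db.getSiblingDB("local");
var first = local.oplog.rs.find().sort({ $natural: 1 }).limit(1).next();
var last = local.oplog.rs.find().sort({ $natural: -1 }).limit(1).next();
// ts is a BSON Timestamp; its "t" field is seconds since the epoch.
print("oplog window: " + ((last.ts.t - first.ts.t) / 3600) + " hours");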

The oplog should be long enough to hold all transactions for the longest
downtime you expect on a secondary. At a minimum, an oplog should be able
to hold 24 hours of operations; 72 hours is often preferred, and it is
not uncommon for an oplog to hold a week's worth of operations.
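
As a rough worked example, with an assumed write rate, sizing for a
72-hour window looks like this:

.. code-block:: javascript

// Assumed rate; measure your own with db.printReplicationInfo().
var oplogMBPerHour = 500;
var targetWindowHours = 72;
// 500 MB/hr * 72 hr = 36000 MB, or roughly 35 GB of oplog.
print(oplogMBPerHour * targetWindowHours / 1024 + " GB");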

For more information on how oplog size affects operations, see:

- The :ref:`replica-set-oplog-sizing` topic in the :doc:`/core/replication` document.
- The :ref:`replica-set-delayed-members` topic in this document.
- The :ref:`replica-set-replication-lag` topic in this document.

.. note:: You normally want the oplog to be the same size on all
members. If you resize the oplog, resize it on all members.

To change oplog size, see :ref:`replica-set-procedure-change-oplog-size`
in this document or see the :doc:`/tutorial/change-oplog-size` tutorial.

.. _replica-set-troubleshooting-check-connection:

Test the Connection Between Each Member
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There must be connectivity from every :term:`replica set` member to
every other member in order for replication to work. Problems with
network or firewall rules can prevent this connectivity and prevent
replication from working. To test the connection from every member to
every other member, in both directions, consider the following example:

.. example:: Given a replica set with three members running on three separate
hosts:

- ``m1.example.net``
- ``m2.example.net``
- ``m3.example.net``

1. Test the connection from ``m1.example.net`` to the other hosts by running
the following operations from ``m1.example.net``:

.. code-block:: sh

mongo --host m2.example.net --port 27017

mongo --host m3.example.net --port 27017

#. Test the connection from ``m2.example.net`` to the other two
hosts by running similar operations from ``m2.example.net``.

You have now tested the connection between ``m2.example.net`` and
``m1.example.net`` twice, once in each direction. Testing both
directions is important: network topologies and firewall
configurations might allow a connection in one direction but not the
other, so you must verify that each connection works both ways.

#. Test the connection from ``m3.example.net`` to the other two
hosts by running the operations from ``m3.example.net``.

If a connection in any direction fails, there's a networking or
firewall issue that needs to be diagnosed separately.
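
As an alternative to running :program:`mongo` repeatedly from the
command line, you can script the same checks from inside a single shell
session. A minimal sketch, using the hypothetical hostnames above:

.. code-block:: javascript

// From a mongo shell on m1.example.net, try to reach each other member.
["m2.example.net:27017", "m3.example.net:27017"].forEach(function (host) {
    try {
        new Mongo(host); // throws if the host is unreachable
        print("connected to " + host);
    } catch (e) {
        print("FAILED to reach " + host + ": " + e);
    }
});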

.. index:: pair: replica set; failover
.. _replica-set-failover-administration:
.. _failover:

Failover and Recovery
~~~~~~~~~~~~~~~~~~~~~

.. TODO Revisit whether this belongs in troubleshooting. Perhaps this
should be an H2 before troubleshooting.

Replica sets feature automated failover. If the :term:`primary`
goes offline or becomes unresponsive and a majority of the original