Replication lag #147

Merged
merged 4 commits into from
Aug 28, 2012
24 changes: 13 additions & 11 deletions source/administration/monitoring.txt
@@ -339,11 +339,11 @@ This returns all operations that lasted longer than 100 milliseconds.
Ensure that the value specified here (i.e. ``100``) is above the
:setting:`slowms` threshold.
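
The query that this refers to appears above this hunk; as a sketch, a filter
on the ``system.profile`` collection along these lines returns the profiled
operations that took longer than 100 milliseconds:

.. code-block:: javascript

   // Assumes profiling is enabled on the current database; ``millis``
   // holds each profiled operation's duration in milliseconds.
   db.system.profile.find( { millis : { $gt : 100 } } )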

.. seealso:: The ":wiki:`Optimization`" wiki page addresses strategies
.. seealso:: The :wiki:`Optimization` wiki page addresses strategies
that may improve the performance of your database queries and
operations.

.. STUB ":doc:`/applications/optimization`"
.. STUB :doc:`/applications/optimization`

.. _replica-set-monitoring:

@@ -355,30 +355,32 @@ replica sets, beyond the requirements for any MongoDB instance is
"replication lag." This refers to the amount of time that it takes a
write operation on the :term:`primary` to replicate to a
:term:`secondary`. Some very small delay period may be acceptable;
however, as replication lag grows two significant problems emerge:
however, as replication lag grows, two significant problems emerge:

- First, operations that have occurred in the period of lag are not
replicated to one or more secondaries. If you're using replication
to ensure data persistence, exceptionally long delays may impact the
integrity of your data set.

- Second, if the replication lag exceeds the length of the operation
log (":term:`oplog`") then the secondary will have to resync all data
log (:term:`oplog`) then the secondary will have to resync all data
from the :term:`primary` and rebuild all indexes. In normal
circumstances this is uncommon given the typical size of the oplog,
but presents a major problem.
but it's an issue to be aware of.

For causes of replication lag, see :ref:`Replication Lag <replica-set-replication-lag>`.

Replication issues are most often the result of network connectivity
issues between members or a :term:`primary` instance that does not
issues between members or the result of a :term:`primary` that does not
have the resources to support application and replication traffic. To
check the status of a replica use the :dbcommand:`replSetGetStatus` or
check the status of a replica, use the :dbcommand:`replSetGetStatus` or
the following helper in the shell:

.. code-block:: javascript

rs.status()

See the ":doc:`/reference/replica-status`" document for a more in
See the :doc:`/reference/replica-status` document for a more in
depth overview of this output. In general, watch the value of
:status:`optimeDate`. Pay particular attention to the difference in
time between the :term:`primary` and the :term:`secondary` members.
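
As a rough sketch, assuming the ``members[n].state`` and
``members[n].optimeDate`` fields described in that document, you can compare
each member's ``optimeDate`` against the primary's directly in the shell:

.. code-block:: javascript

   // Sketch: print each member's approximate lag behind the primary.
   var status = rs.status();
   var primary = null;
   for (var i = 0; i < status.members.length; i++) {
       // state 1 identifies the current primary.
       if (status.members[i].state === 1) {
           primary = status.members[i];
       }
   }
   for (var j = 0; j < status.members.length; j++) {
       // Subtracting two dates yields the difference in milliseconds.
       var lag = (primary.optimeDate - status.members[j].optimeDate) / 1000;
       print(status.members[j].name + ": " + lag + " seconds behind the primary");
   }
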
@@ -393,7 +395,7 @@ option, :program:`mongod` will create a default sized oplog.
By default the oplog is 5% of total available disk space on 64-bit
systems.
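
To check the current oplog allocation and the window of time it covers, the
``db.printReplicationInfo()`` helper in the :program:`mongo` shell gives a
quick summary; run it while connected to a member of the set:

.. code-block:: javascript

   // Prints the configured oplog size and the time range of the
   // operations it currently holds.
   db.printReplicationInfo()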

.. seealso:: ":doc:`/tutorial/change-oplog-size`"
.. seealso:: :doc:`/tutorial/change-oplog-size`

Sharding and Monitoring
-----------------------
@@ -404,10 +406,10 @@ instances. Additionally, shard clusters require monitoring to ensure
that data is effectively distributed among nodes and that sharding
operations are functioning appropriately.
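
For a quick, shell-level spot check of that distribution, one starting point
is the ``db.printShardingStatus()`` helper, run against a :program:`mongos`
instance; treat this as a sketch rather than a complete monitoring strategy:

.. code-block:: javascript

   // From a mongos, prints the shards, the sharded databases and
   // collections, and the chunk distribution across the cluster.
   db.printShardingStatus()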

.. seealso:: See the ":wiki:`Sharding`" wiki page for more
.. seealso:: See the :wiki:`Sharding` wiki page for more
information.

.. STUB ":doc:`/core/sharding`"
.. STUB :doc:`/core/sharding`

Config Servers
~~~~~~~~~~~~~~
44 changes: 36 additions & 8 deletions source/administration/replica-sets.txt
@@ -528,21 +528,24 @@ Replication Lag
~~~~~~~~~~~~~~~

Replication lag is a delay between an operation on the :term:`primary`
and the application of that operation from :term:`oplog` to the
and the application of that operation from the :term:`oplog` to the
:term:`secondary`. Such lag can be a significant issue and can
seriously affect MongoDB :term:`replica set` deployments. Excessive
replication lag makes "lagged" members ineligible to quickly become
primary and increases the possibility that distributed
read operations will be inconsistent.

Identify replication lag by checking the values of
Identify replication lag by checking the value of
:data:`members[n].optimeDate` for each member of the replica set
using the :method:`rs.status()` function in the :program:`mongo`
shell.

Also, you can monitor how fast replication occurs by watching the oplog
time in the "replica" graph in MMS.
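
The shell also provides a helper that summarizes how far each secondary
trails the primary; its exact output varies by version, but a minimal check
looks like this:

.. code-block:: javascript

   // Prints, for each secondary, the time of the last oplog entry it
   // has applied and how long ago that was.
   db.printSlaveReplicationInfo()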

Possible causes of replication lag include:

- **Network Latency.**
- **Network Latency**

Check the network routes between the members of your set to ensure
that there is no packet loss or network routing issue.
@@ -551,7 +554,7 @@ Possible causes of replication lag include:
members and ``traceroute`` to expose the routing of packets between
network endpoints.

- **Disk Throughput.**
- **Disk Throughput**

If the file system and disk device on the secondary are
unable to flush data to disk as quickly as the primary, then
@@ -564,16 +567,41 @@ Possible causes of replication lag include:
Use system-level tools to assess disk status, including
``iostat`` or ``vmstat``.

- **Concurrency.**
- **Concurrency**

In some cases, long-running operations on the primary can block
replication on secondaries. You can use
:term:`write concern` to prevent write operations from returning
if replication cannot keep up with the write load.
replication on secondaries. You can use :term:`write concern` to
prevent write operations from returning if replication cannot keep up
with the write load.

Use the :term:`database profiler` to see if there are slow queries
or long-running operations that correspond to the incidences of lag.

- **Appropriate Write Concern**

If you are performing a large data load that requires a very high
number of writes to the primary, and if you have not set the
appropriate write concern, the secondaries will not be able to read
the oplog fast enough to keep up with changes. Write requests take
precedence over read requests, and a very large number of writes will
significantly reduce the number of reads the secondaries can make
from the oplog in order to update themselves.

The replication lag can grow to the point that the oplog overwrites
commands that the secondaries have not yet read. The oplog is a capped
collection, and when full, it erases the oldest commands in order to
write new ones. If the secondaries get too far behind in their reads,
they reach a point where they no longer have access to certain
updates, and they become stale.

To prevent this, use :term:`write concern` to tell MongoDB to always perform
a safe write after a designated number of inserts, such as after every
1,000 inserts, as in the sketch following this list. This gives the
secondaries room to perform reads and catch up with the primary. Using safe
writes slightly slows down the data load but keeps your secondaries from
going stale.

See :ref:`replica-set-write-concern` for more information.
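
A minimal sketch of this pattern in the :program:`mongo` shell follows; the
``records`` array and the ``data`` collection are hypothetical placeholders,
and ``w: 2`` assumes a set where acknowledgement from one secondary is
sufficient:

.. code-block:: javascript

   // Sketch: pause for replication after every 1,000 inserts so the
   // secondaries can catch up before the load continues.
   for (var i = 0; i < records.length; i++) {
       db.data.insert(records[i]);
       if ((i + 1) % 1000 === 0) {
           // Block until at least one secondary has the write (w: 2),
           // waiting up to 30 seconds.
           db.getLastError(2, 30000);
       }
   }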

Failover and Recovery
~~~~~~~~~~~~~~~~~~~~~

2 changes: 1 addition & 1 deletion source/core/replication-internals.txt
@@ -25,7 +25,7 @@ replicate this log by applying the operations to themselves in an
asynchronous process. Under normal operation, :term:`secondary` members
reflect writes within one second of the primary. However, various
exceptional situations may cause secondaries to lag behind further. See
:term:`replication lag` for details.
:ref:`Replication Lag <replica-set-replication-lag>` for details.

All members send heartbeats (pings) to all other members in the set and can
import operations to the local oplog from any other member in the set.