DOCSP-17952 documents Health Observers (#824) (#932)

jmd-mongo · web-flow · commit 9270cee27cae · 2022-04-07T17:28:41.000-04:00
* adds Health Observer params section

* adds Health Observers params section

* concept

* nav

* format fix

* adds progressMonitor

* active fault

* subsections

* double backticks

* intro

* typo

* overview examples

* title update

* rename

* lexicographic ordering

* tidy

* tidy

* extra i

* review feedback

* remove Params section from overview

* backtick

* include-ify notes re ``values`` arrays

* one more time

* sets up variables for more consistent usage

* update toc

* address fact-progress-monitor-fields.rst build error

* partial

* incorporates tech review feedback

* updates setParameter config file examples
diff --git a/snooty.toml b/snooty.toml
@@ -19,6 +19,7 @@ toc_landing_pages = [
    "/administration/backup-sharded-clusters",
    "/administration/configuration-and-maintenance",
    "/administration/connection-pool-overview",
+   "/administration/health-managers",
    "/administration/install-community",
    "/administration/install-enterprise-linux",
    "/administration/install-enterprise",
diff --git a/source/administration/analyzing-mongodb-performance.txt b/source/administration/analyzing-mongodb-performance.txt
@@ -262,4 +262,5 @@ analyzing or debugging issues with support from MongoDB Inc. engineers.
    /administration/connection-pool-overview
    /tutorial/manage-the-database-profiler
    /tutorial/transparent-huge-pages
+   /administration/health-managers
    /reference/ulimit
diff --git a/source/administration/health-managers.txt b/source/administration/health-managers.txt
@@ -0,0 +1,109 @@
+.. _health-managers-overview:
+
+.. include:: /includes/health-manager-short-names.rst
+
+==================================================
+Manage Sharded Cluster Health with Health Managers
+==================================================
+
+.. default-domain:: mongodb
+
+.. contents:: On this page
+   :local:
+   :backlinks: none
+   :depth: 1
+   :class: singlecol
+
+This document describes how to use |HMS| to monitor and manage sharded 
+cluster health issues.
+
+Overview
+--------
+
+A |HM| runs health checks on a :term:`health manager facet`
+at a specified :ref:`intensity level 
+<health-managers-intensity-levels>`. |HM| checks
+run at specified time intervals. A |HM| can be configured to 
+move a failing :ref:`mongos <mongos>` out of a cluster automatically. 
+:ref:`Progress Monitor <health-managers-progress-monitor>` ensures 
+that |HM| checks do not become stuck or unresponsive.
+
+.. _health-managers-facets:
+
+Health Manager Facets
+~~~~~~~~~~~~~~~~~~~~~
+
+The following table shows the available |HM| facets:
+
+.. include:: /includes/fact-health-manager-facets.rst
+
+.. _health-managers-intensity-levels:
+
+Health Manager Intensity Levels
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following table shows the available |HM| intensity levels:
+
+.. include:: /includes/fact-health-manager-intensities.rst
+
+.. _health-managers-active-fault:
+
+Active Fault Duration
+---------------------
+
+When a failure is detected and the |HM| intensity level
+is set to ``critical``, the |HM| waits the amount of time specified by 
+:parameter:`activeFaultDurationSecs` before stopping and moving the 
+:ref:`mongos <mongos>` out of the cluster automatically.
+
+.. _health-managers-progress-monitor:
+
+Progress Monitor
+----------------
+
+.. include:: /includes/fact-progressMonitor.rst
+
+``progressMonitor`` Fields
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. include:: /includes/fact-progress-monitor-fields.rst
+
+Examples
+--------
+
+The following examples show how |HMS| can be configured. For
+information on |HM| parameters, see :ref:`health-manager-parameters`.
+
+Intensity
+~~~~~~~~~
+
+.. include:: /includes/example-healthMonitoringIntensities.rst
+
+.. include:: /includes/fact-healthMonitoringIntensities-values-array.rst
+
+See :parameter:`healthMonitoringIntensities` for details.
+
+Intervals
+~~~~~~~~~
+
+.. include:: /includes/example-healthMonitoringIntervals.rst
+
+.. include:: /includes/fact-healthMonitoringIntervals-values-array.rst
+
+See :parameter:`healthMonitoringIntervals` for details.
+
+Active Fault Duration
+~~~~~~~~~~~~~~~~~~~~~
+
+.. include:: /includes/example-activeFaultDurationSecs.rst
+
+See :parameter:`activeFaultDurationSecs` for details.
+
+Progress Monitor
+~~~~~~~~~~~~~~~~
+
+.. include:: /includes/fact-progressMonitor.rst
+
+.. include:: /includes/example-progress-monitor.rst
+
+See :parameter:`progressMonitor` for details.
diff --git a/source/includes/example-activeFaultDurationSecs.rst b/source/includes/example-activeFaultDurationSecs.rst
@@ -0,0 +1,33 @@
+For example, to set the duration from failure to crash to five
+minutes, issue the following at startup:
+
+.. code-block:: bash
+
+  mongos --setParameter activeFaultDurationSecs=300
+
+Or if using the :dbcommand:`setParameter` command in a
+:binary:`~bin.mongosh` session that is connected to a running
+:binary:`~bin.mongos`:
+
+.. code-block:: javascript
+
+  db.adminCommand( 
+    {
+        setParameter: 1, 
+        activeFaultDurationSecs: 300 
+    }
+  )
+
+
+Parameters set with :dbcommand:`setParameter` do not persist across
+restarts. See the :ref:`setParameter page 
+<setParameter-commands-not-persistent>` for details.
+
+To make this setting persistent, set ``activeFaultDurationSecs``
+in your :ref:`mongos config file <configuration-options>` using the
+:setting:`setParameter` option as in the following example:
+
+.. code-block:: yaml
+
+  setParameter:
+     activeFaultDurationSecs: 300
diff --git a/source/includes/example-healthMonitoringIntensities.rst b/source/includes/example-healthMonitoringIntensities.rst
@@ -0,0 +1,32 @@
+For example, to set the ``dns`` |HM| facet to the 
+``critical`` intensity level, issue the following at startup:
+
+.. code-block:: bash
+
+  mongos --setParameter 'healthMonitoringIntensities={ values:[ { type:"dns", intensity: "critical"} ] }'
+
+Or if using the :dbcommand:`setParameter` command in a
+:binary:`~bin.mongosh` session that is connected to a running
+:binary:`~bin.mongos`:
+
+.. code-block:: javascript
+
+  db.adminCommand( 
+    {
+        setParameter: 1, 
+        healthMonitoringIntensities: { values: [ { type: "dns", intensity: "critical" } ] }
+    }
+  )
+
+Parameters set with :dbcommand:`setParameter` do not persist across
+restarts. See the :ref:`setParameter page 
+<setParameter-commands-not-persistent>` for details.
+
+To make this setting persistent, set ``healthMonitoringIntensities``
+in your :ref:`mongos config file <configuration-options>` using the
+:setting:`setParameter` option as in the following example:
+
+.. code-block:: yaml
+
+  setParameter:
+     healthMonitoringIntensities: "{ values:[ { type:\"dns\", intensity: \"critical\"} ] }"
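The ``values`` document used in the examples above can also be assembled programmatically before it is passed to ``db.adminCommand()``. The following is a minimal sketch in plain JavaScript. The helper name ``buildIntensitiesCommand`` is hypothetical and not part of mongosh or any MongoDB driver. The facet names and intensity levels are copied from the tables added in this commit.

```javascript
// Hypothetical helper; not part of mongosh or any MongoDB driver.
// Facet names and intensity levels are taken from the tables in this commit.
const FACETS = ["configServer", "dns", "ldap"];
const INTENSITIES = ["critical", "non-critical", "off"];

// Build a setParameter command document from a { facet: intensity } map.
function buildIntensitiesCommand(settings) {
  const values = Object.entries(settings).map(([type, intensity]) => {
    if (!FACETS.includes(type)) {
      throw new Error(`unknown facet: ${type}`);
    }
    if (!INTENSITIES.includes(intensity)) {
      throw new Error(`unknown intensity: ${intensity}`);
    }
    return { type, intensity };
  });
  return { setParameter: 1, healthMonitoringIntensities: { values } };
}

const intensitiesCommand = buildIntensitiesCommand({ dns: "critical" });
console.log(JSON.stringify(intensitiesCommand));
```

In a mongosh session, the resulting document could then be passed to ``db.adminCommand()`` in place of the literal shown in the example.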
diff --git a/source/includes/example-healthMonitoringIntervals.rst b/source/includes/example-healthMonitoringIntervals.rst
@@ -0,0 +1,32 @@
+For example, to set the ``ldap`` |HM| facet to run health checks
+every 30 seconds, issue the following at startup:
+
+.. code-block:: bash
+
+  mongos --setParameter 'healthMonitoringIntervals={ values:[ { type:"ldap", interval: "30000"} ] }'
+
+Or if using the :dbcommand:`setParameter` command in a
+:binary:`~bin.mongosh` session that is connected to a running
+:binary:`~bin.mongos`:
+
+.. code-block:: javascript
+
+  db.adminCommand( 
+    {
+        setParameter: 1, 
+        healthMonitoringIntervals: { values: [ { type: "ldap", interval: "30000" } ] }
+    }
+  )
+
+Parameters set with :dbcommand:`setParameter` do not persist across
+restarts. See the :ref:`setParameter page 
+<setParameter-commands-not-persistent>` for details.
+
+To make this setting persistent, set ``healthMonitoringIntervals``
+in your :ref:`mongos config file <configuration-options>` using the
+:setting:`setParameter` option as in the following example:
+
+.. code-block:: yaml
+
+  setParameter:
+     healthMonitoringIntervals: "{ values: [{type: \"ldap\", interval: 30000}] }"
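Because ``interval`` is given in milliseconds while the prose describes it in seconds, a small conversion step can help avoid unit mistakes. The following is a minimal sketch in plain JavaScript, assuming a hypothetical helper named ``buildIntervalsCommand``. It emits the interval as a number, as the config file example does, whereas the command-line example in this commit quotes it as a string.

```javascript
// Hypothetical helper; the document shape ({ values: [ { type, interval } ] })
// follows the examples in this commit, but the function itself is illustrative.
function buildIntervalsCommand(secondsByFacet) {
  const values = Object.entries(secondsByFacet).map(([type, seconds]) => {
    if (!Number.isFinite(seconds) || seconds <= 0) {
      throw new Error(`interval for ${type} must be a positive number of seconds`);
    }
    // healthMonitoringIntervals takes the interval in milliseconds.
    return { type, interval: Math.round(seconds * 1000) };
  });
  return { setParameter: 1, healthMonitoringIntervals: { values } };
}

// 30 seconds becomes an interval of 30000 milliseconds.
console.log(JSON.stringify(buildIntervalsCommand({ ldap: 30 })));
```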
diff --git a/source/includes/example-progress-monitor.rst b/source/includes/example-progress-monitor.rst
@@ -0,0 +1,32 @@
+To set the ``interval`` to 1000 milliseconds and the ``deadline`` 
+to 300 seconds, issue the following at startup:
+
+.. code-block:: bash
+
+  mongos --setParameter 'progressMonitor={"interval": 1000, "deadline": 300}'
+
+Or if using the :dbcommand:`setParameter` command in a
+:binary:`~bin.mongosh` session that is connected to a running
+:binary:`~bin.mongos`:
+
+.. code-block:: javascript
+
+  db.adminCommand( 
+    {
+        setParameter: 1, 
+        progressMonitor: { interval: 1000, deadline: 300 }
+    }
+  )
+
+Parameters set with :dbcommand:`setParameter` do not persist across
+restarts. See the :ref:`setParameter page 
+<setParameter-commands-not-persistent>` for details.
+
+To make this setting persistent, set ``progressMonitor``
+in your :ref:`mongos config file <configuration-options>` using the
+:setting:`setParameter` option as in the following example:
+
+.. code-block:: yaml
+
+  setParameter:
+     progressMonitor: "{ interval: 1000, deadline: 300 }"
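The two ``progressMonitor`` fields use different units (``interval`` in milliseconds, ``deadline`` in seconds), which is easy to overlook when writing the startup argument by hand. The following is a minimal sketch in plain JavaScript, assuming a hypothetical helper named ``progressMonitorArg`` that names the units explicitly and returns the value for ``--setParameter``.

```javascript
// Hypothetical helper; not part of mongosh or the mongos binary.
// Returns the string that would follow --setParameter on the command line.
function progressMonitorArg({ intervalMs, deadlineSecs }) {
  for (const [name, value] of Object.entries({ intervalMs, deadlineSecs })) {
    if (!Number.isInteger(value) || value <= 0) {
      throw new Error(`${name} must be a positive integer`);
    }
  }
  // interval is in milliseconds, deadline is in seconds.
  const doc = { interval: intervalMs, deadline: deadlineSecs };
  return `progressMonitor=${JSON.stringify(doc)}`;
}

console.log(progressMonitorArg({ intervalMs: 1000, deadlineSecs: 300 }));
```

The returned string would still need to be quoted for the shell, as in the startup example above.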
diff --git a/source/includes/fact-health-manager-facets.rst b/source/includes/fact-health-manager-facets.rst
@@ -0,0 +1,19 @@
+.. list-table::
+  :header-rows: 1
+  :widths: 25 75
+
+  * - Facet
+
+    - What the Health Manager Checks
+
+  * - ``configServer``
+
+    - Cluster health issues related to connectivity to the config server.
+
+  * - ``dns``
+
+    - Cluster health issues related to DNS availability and functionality.
+
+  * - ``ldap``
+
+    - Cluster health issues related to LDAP availability and functionality.
diff --git a/source/includes/fact-health-manager-intensities.rst b/source/includes/fact-health-manager-intensities.rst
@@ -0,0 +1,27 @@
+.. list-table::
+  :header-rows: 1
+  :widths: 25 75
+
+  * - Intensity Level
+
+    - Description
+
+  * - ``critical``
+
+    - The |HM| on this facet is enabled and has the ability to move the 
+      failing :ref:`mongos <mongos>` out of the cluster if an error 
+      occurs. The |HM| waits the amount of time specified by 
+      :parameter:`activeFaultDurationSecs` before stopping and moving 
+      the :ref:`mongos <mongos>` out of the cluster automatically.
+
+  * - ``non-critical``
+
+    - The |HM| on this facet is enabled and logs
+      errors, but the :ref:`mongos <mongos>` remains in the cluster if 
+      errors are encountered. 
+
+  * - ``off``
+
+    - The |HM| on this facet is disabled. The :ref:`mongos 
+      <mongos>` does not perform any health checks on this facet. This
+      is the default intensity level.
diff --git a/source/includes/fact-healthMonitoringIntensities-values-array.rst b/source/includes/fact-healthMonitoringIntensities-values-array.rst
@@ -0,0 +1,5 @@
+``healthMonitoringIntensities`` accepts an array of documents,
+``values``. Each document in ``values`` takes two fields: 
+
+- ``type``, the |HM| facet 
+- ``intensity``, the intensity level
diff --git a/source/includes/fact-healthMonitoringIntervals-values-array.rst b/source/includes/fact-healthMonitoringIntervals-values-array.rst
@@ -0,0 +1,5 @@
+``healthMonitoringIntervals`` accepts an array of documents,
+``values``. Each document in ``values`` takes two fields:
+
+- ``type``, the |HM| facet 
+- ``interval``, how often the health check for that facet runs, in milliseconds
diff --git a/source/includes/fact-progress-monitor-fields.rst b/source/includes/fact-progress-monitor-fields.rst
@@ -0,0 +1,22 @@
+.. list-table::
+  :header-rows: 1
+  :widths: 25 50 25
+
+  * - Field
+
+    - Description
+
+    - Units
+
+  * - ``interval``
+
+    - How often to ensure |HMS| are not stuck or unresponsive.
+
+    - Milliseconds
+
+  * - ``deadline``
+
+    - Timeout before automatically failing the :ref:`mongos <mongos>` 
+      if a |HM| check is not making progress.
+    
+    - Seconds
diff --git a/source/includes/fact-progressMonitor.rst b/source/includes/fact-progressMonitor.rst
@@ -0,0 +1,6 @@
+:ref:`Progress Monitor <health-managers-progress-monitor>` runs tests 
+to ensure that |HM| checks do not become stuck or 
+unresponsive. Progress Monitor runs these tests in intervals specified 
+by ``interval``. If a health check begins but does not complete within
+the timeout given by ``deadline``, Progress Monitor stops the 
+:ref:`mongos <mongos>` and removes it from the cluster.
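The deadline rule described above can be modeled in a few lines. The following is an illustrative sketch in plain JavaScript and is not MongoDB's implementation. The function name ``findStuckChecks`` and the ``startedAtMs`` field are assumptions made for this example only.

```javascript
// Illustrative model only; this is not how mongos implements Progress Monitor.
// Returns the facets whose running health check has exceeded the deadline.
function findStuckChecks(runningChecks, nowMs, deadlineSecs) {
  const deadlineMs = deadlineSecs * 1000; // deadline is given in seconds
  return runningChecks
    .filter((check) => nowMs - check.startedAtMs > deadlineMs)
    .map((check) => check.facet);
}

// Two checks in flight, evaluated at t = 301 s with a 300 s deadline.
const running = [
  { facet: "dns", startedAtMs: 0 },       // running for 301 s
  { facet: "ldap", startedAtMs: 290000 }, // running for 11 s
];
console.log(findStuckChecks(running, 301000, 300));
```

In this model only the ``dns`` check is reported, since it is the only one that has been running longer than the deadline.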
diff --git a/source/includes/health-manager-short-names.rst b/source/includes/health-manager-short-names.rst
@@ -0,0 +1,3 @@
+.. |HM| replace:: Health Manager
+.. |HMS| replace:: Health Managers
+.. |HMREF| replace:: :ref:`health-managers-overview`
diff --git a/source/reference/command/setParameter.txt b/source/reference/command/setParameter.txt
@@ -28,6 +28,8 @@ Definition
    For the available parameters, including examples, see
    :doc:`/reference/parameters`.
 
+.. _setParameter-commands-not-persistent:
+
 Behavior
 --------
 
diff --git a/source/reference/glossary.txt b/source/reference/glossary.txt
@@ -393,6 +393,21 @@ Glossary
       "buckets" of objects grouped by a second criterion. See
       :doc:`/core/geohaystack`.
 
+   health manager
+      A health manager runs health checks on a :term:`health manager
+      facet` at a specified :ref:`intensity level
+      <health-managers-intensity-levels>`. Health manager checks run at
+      specified time intervals. A health manager can be configured to
+      move a failing :ref:`mongos <mongos>` out of a cluster
+      automatically.
+
+   health manager facet
+      A specific set of features and functionality that a :term:`health
+      manager` can be configured to run health checks against. For 
+      example, you can configure a health manager to monitor and
+      manage DNS or LDAP cluster health issues automatically. See 
+      :ref:`health-managers-facets` for details.
+
    hidden member
       A :term:`replica set` member that cannot become :term:`primary`
       and are invisible to client applications. See
diff --git a/source/reference/parameters.txt b/source/reference/parameters.txt