(DOCSP-44989) Revises per second-draft edits. (#91)

erabil-mdb · web-flow · commit 7222dec5c5cb · 2025-01-29T14:01:25.000-07:00
* (DOCSP-44989) Revises per second-draft edits.
* Adds shared includes and rewrites leftovers to match.
diff --git a/source/includes/cloud-docs/shared-metric-description-cpu.rst b/source/includes/cloud-docs/shared-metric-description-cpu.rst
@@ -0,0 +1 @@
+Monitor CPU usage to determine whether data is retrieved from disk instead of memory.
diff --git a/source/includes/cloud-docs/shared-metric-description-iops.rst b/source/includes/cloud-docs/shared-metric-description-iops.rst
@@ -0,0 +1,2 @@
+Monitor whether disk IOPS approaches the maximum provisioned IOPS. 
+Determine whether the cluster can handle future workloads.
diff --git a/source/includes/cloud-docs/shared-metric-description-latency.rst b/source/includes/cloud-docs/shared-metric-description-latency.rst
@@ -0,0 +1,2 @@
+Monitor disk latency to track the efficiency of reading from and 
+writing to disk.
diff --git a/source/includes/cloud-docs/shared-metric-description-memory.rst b/source/includes/cloud-docs/shared-metric-description-memory.rst
@@ -0,0 +1 @@
+Monitor memory to determine whether to upgrade to a higher cluster tier. This metric represents the average value over the time period specified by the metric granularity.
diff --git a/source/includes/cloud-docs/shared-metric-description-oplog-window.rst b/source/includes/cloud-docs/shared-metric-description-oplog-window.rst
@@ -0,0 +1,5 @@
+Monitor the replication oplog window, together with replication 
+headroom, to determine whether the secondary may soon require a 
+full resync. The replication oplog window often helps to 
+determine in advance the resilience of secondaries to planned 
+and unplanned outages.
diff --git a/source/includes/cloud-docs/shared-metric-description-page-faults.rst b/source/includes/cloud-docs/shared-metric-description-page-faults.rst
@@ -0,0 +1,4 @@
+Monitor page faults to determine whether to increase your memory.
+This metric displays the average rate of page faults per second on this process
+over the selected sample period. In non-Windows environments,
+this applies to hard page faults only.
diff --git a/source/includes/cloud-docs/shared-metric-description-replication-lag.rst b/source/includes/cloud-docs/shared-metric-description-replication-lag.rst
@@ -0,0 +1 @@
+Monitor replication lag to determine whether the secondary might fall off the oplog.
diff --git a/source/monitoring-alerts.txt b/source/monitoring-alerts.txt
@@ -170,124 +170,167 @@ same condition, one for a low priority / "warning" level of severity, and one fo
 
 .. list-table:: 
    :header-rows: 1
-   :widths: 30 35 35
+   :widths: 20 25 25 30
    :stub-columns: 1 
 
    * - Condition
      - Recommended Alert Threshold: Low Priority
      - Recommended Alert Threshold: High Priority
+     - Key Insights
 
    * - Oplog Window
      - < 24h for 5 minutes 
      - < 1h for 10 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-oplog-window.rst
     
    * - :manual:`Election </core/replica-set-elections/>` events
      - > 3 for 5 minutes 
      - > 30 for 5 minutes
+     - Monitor election events, which occur when a primary node steps down and a 
+       secondary node is elected as the new primary. Frequent election events can 
+       disrupt operations and impact availability, causing temporary write 
+       unavailability and possible rollback of data. Keeping election events to 
+       a minimum ensures consistent write operations and stable {+cluster+} performance.
 
    * - Read :atlas:`IOPS </reference/alert-resolutions/disk-io-utilization/>`
      - > 4000 for 2 minutes
      - > 9000 for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-iops.rst
 
    * - Write :atlas:`IOPS </reference/alert-resolutions/disk-io-utilization/>`
      - > 4000 for 2 minutes
      - > 9000 for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-iops.rst
 
    * - Read Latency
      - > 20ms for 5 minutes
      - > 50ms for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-latency.rst
 
    * - Write Latency
      - > 20ms for 5 minutes 
      - > 50ms for more than 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-latency.rst
 
    * - Swap use
      - > 2GB for 15 minutes
      - > 2GB for 15 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-memory.rst
 
    * - Host down
      - 15 minutes
      - 24 hours
+     - Monitor your hosts to detect downtime promptly. A host down for more than 
+       15 minutes can impact availability, while downtime exceeding 24 hours is 
+       critical, risking data accessibility and application performance.
 
    * - No primary
      - 5 minutes
      - 5 minutes
+     - Monitor the status of your replica sets to identify instances where there 
+       is no primary node. A lack of a primary for more than 5 minutes can halt 
+       write operations and impact application functionality.
 
    * - Missing active ``mongos``
      - 15 minutes
      - 15 minutes
+     - Monitor the status of active ``mongos`` processes to ensure effective query 
+       routing in sharded {+clusters+}. A missing ``mongos`` can disrupt query routing.
 
    * - Page faults
      - > 50/second for 5 minutes
      - > 100/second for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-page-faults.rst
 
    * - Replication lag
      - > 240 seconds for 5 minutes
      - > 1 hour for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-replication-lag.rst
 
    * - Failed backup
      - Any occurrence
      - None
+     - Track backup operations to ensure data integrity. A failed backup can compromise 
+       data availability.
 
    * - Restored backup
      - Any occurrence
      - None
+     - Verify restored backups to ensure data integrity and system functionality.
 
    * - Fallback snapshot failed
      - Any occurrence
      - None
+     - Monitor fallback snapshot operations to ensure data redundancy and recovery 
+       capability.
 
    * - Backup schedule behind
      - > 12 hours
      - > 12 hours
-
-   * - Available write tickets
-     - < 75 for 5 minutes
-     - < 25 for 5 minutes
-
-   * - Available read tickets
-     - < 75 for 5 minutes
-     - < 25 for 5 minutes
+     - Check backup schedules to ensure they are on track. Falling behind can 
+       risk data loss and compromise recovery plans.
+
+   * - Queued Reads    
+     - > 0, up to 10
+     - > 10
+     - Monitor queued reads to ensure efficient data retrieval. High levels of 
+       queued reads may indicate resource constraints or performance bottlenecks, 
+       requiring optimization to maintain system responsiveness.
+
+   * - Queued Writes    
+     - > 0, up to 10
+     - > 10
+     - Monitor queued writes to maintain efficient data processing. High levels 
+       of queued writes may signal resource constraints or performance bottlenecks, requiring optimization to maintain system responsiveness.
 
    * - Restarts last hour
      - > 2
      - > 2
+     - Track the number of restarts in the last hour to detect instability or 
+       configuration issues. Frequent restarts can indicate underlying problems 
+       that require immediate investigation to maintain system reliability and uptime.
 
    * - :manual:`Primary election </core/replica-set-elections/>`
      - Any occurrence
      - None
+     - Monitor primary elections to ensure stable {+cluster+} operations. Frequent 
+       elections can indicate network issues or resource constraints, potentially 
+       impacting the availability and performance of the database.
 
    * - Maintenance no longer needed
      - Any occurrence
      - None
+     - Review unnecessary maintenance tasks to optimize resources and minimize disruptions.
 
    * - Maintenance started
      - Any occurrence
      - None
+     - Track the start of maintenance tasks to ensure planned activities proceed smoothly.
+       Proper oversight helps maintain system performance and minimize downtime during maintenance.
 
    * - Maintenance scheduled
      - Any occurrence
      - None
+     - Monitor scheduled maintenance to prepare for potential system impacts. 
 
    * - :atlas:`Steal </alert-basics/#cpu-steal>`
      - > 5% for 5 minutes
      - > 20% for 5 minutes
+     - Monitor CPU steal on AWS EC2 {+clusters+} with Burstable Performance 
+       to identify when CPU usage exceeds the guaranteed baseline due to shared 
+       cores. High steal percentages indicate the CPU credit balance is depleted, 
+       affecting performance.
 
    * - CPU
      - > 75% for 5 minutes
      - > 75% for 5 minutes
+     - .. include:: /includes/cloud-docs/shared-metric-description-cpu.rst
 
    * - Disk partition usage
      - > 90%
      - > 95% for 5 minutes
-
-   * - Index partition usage
-     - > 90%
-     - > 95% for 5 minutes
-
-   * - Journal partition usage
-     - > 90%
-     - > 95% for 5 minutes
+     - Monitor disk partition usage to ensure sufficient storage availability. 
+       High usage levels can lead to performance degradation and potential system outages. 
 
 To learn more, see :atlas:`Configure and Resolve Alerts </alerts>`. 
 

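Each row of the table above pairs a low-priority and a high-priority threshold, and most of them read as "alert when the metric stays past a limit for a sustained period." The following is a minimal sketch of that reading in Python, assuming one metric sample per minute with the most recent sample last. The `Rule` class and `severity` function are hypothetical names used for illustration only; they are not part of Atlas or any MongoDB API, and Atlas may evaluate alert conditions differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    """One threshold cell from the table (hypothetical helper, not an Atlas API)."""
    limit: float          # threshold value, in the metric's own unit
    minutes: int          # how long the condition must hold
    below: bool = False   # True for "<" conditions such as Oplog Window

    def breached(self, samples: Sequence[float]) -> bool:
        """Return True if every one of the last `minutes` samples is past the limit."""
        if len(samples) < self.minutes:
            return False  # not enough history to decide
        recent = samples[-self.minutes:]
        if self.below:
            return all(value < self.limit for value in recent)
        return all(value > self.limit for value in recent)


def severity(samples: Sequence[float], low: Rule, high: Rule) -> Optional[str]:
    """Classify per-minute samples as "high", "low", or None (no alert)."""
    if high.breached(samples):
        return "high"
    if low.breached(samples):
        return "low"
    return None


# Values copied from the table: page faults per second and oplog window in hours.
PAGE_FAULTS_LOW = Rule(limit=50, minutes=5)
PAGE_FAULTS_HIGH = Rule(limit=100, minutes=5)
OPLOG_LOW = Rule(limit=24, minutes=5, below=True)
OPLOG_HIGH = Rule(limit=1, minutes=10, below=True)
```

For example, `severity([60] * 5, PAGE_FAULTS_LOW, PAGE_FAULTS_HIGH)` classifies five minutes at 60 page faults per second as a low-priority condition, because every sample exceeds 50 but none exceeds 100. Checking the high-priority rule first keeps a single condition from producing both severities at once.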