@@ -170,124 +170,167 @@ same condition, one for a low priority / "warning" level of severity, and one fo
170170
171171.. list-table::
172172 :header-rows: 1
173- :widths: 30 35 35
173+ :widths: 20 25 25 30
174174 :stub-columns: 1
175175
176176 * - Condition
177177 - Recommended Alert Threshold: Low Priority
178178 - Recommended Alert Threshold: High Priority
179+ - Key Insights
179180
180181 * - Oplog Window
181182 - < 24h for 5 minutes
182183 - < 1h for 10 minutes
184+ - .. include:: /includes/cloud-docs/shared-metric-description-oplog-window.rst
183185
184186 * - :manual:`Election </core/replica-set-elections/>` events
185187 - > 3 for 5 minutes
186188 - > 30 for 5 minutes
189+ - Monitor election events, which occur when a primary node steps down and a
190+ secondary node is elected as the new primary. Frequent election events can
191+ disrupt operations and impact availability, causing temporary write
192+ unavailability and possible rollback of data. Keeping election events to
193+ a minimum ensures consistent write operations and stable {+cluster+} performance.
187194
188195 * - Read :atlas:`IOPS </reference/alert-resolutions/disk-io-utilization/>`
189196 - > 4000 for 2 minutes
190197 - > 9000 for 5 minutes
198+ - .. include:: /includes/cloud-docs/shared-metric-description-iops.rst
191199
192200 * - Write :atlas:`IOPS </reference/alert-resolutions/disk-io-utilization/>`
193201 - > 4000 for 2 minutes
194202 - > 9000 for 5 minutes
203+ - .. include:: /includes/cloud-docs/shared-metric-description-iops.rst
195204
196205 * - Read Latency
197206 - > 20ms for 5 minutes
197207 - > 50ms for 5 minutes
208+ - .. include:: /includes/cloud-docs/shared-metric-description-latency.rst
199209
200210 * - Write Latency
201211 - > 20ms for 5 minutes
201212 - > 50ms for 5 minutes
213+ - .. include:: /includes/cloud-docs/shared-metric-description-latency.rst
203214
204215 * - Swap use
205216 - > 2GB for 15 minutes
206217 - > 2GB for 15 minutes
218+ - .. include:: /includes/cloud-docs/shared-metric-description-memory.rst
207219
208220 * - Host down
209221 - 15 minutes
210222 - 24 hours
223+ - Monitor your hosts to detect downtime promptly. A host down for more than
224+ 15 minutes can impact availability, while downtime exceeding 24 hours is
225+ critical, risking data accessibility and application performance.
211226
212227 * - No primary
213228 - 5 minutes
214229 - 5 minutes
230+ - Monitor the status of your replica sets to identify instances where there
231+ is no primary node. A lack of a primary for more than 5 minutes can halt
232+ write operations and impact application functionality.
215233
216234 * - Missing active ``mongos``
217235 - 15 minutes
218236 - 15 minutes
237+ - Monitor the status of active ``mongos`` processes to ensure effective query
238+ routing in sharded {+clusters+}. A missing ``mongos`` can disrupt query routing.
219239
220240 * - Page faults
221241 - > 50/second for 5 minutes
222242 - > 100/second for 5 minutes
243+ - .. include:: /includes/cloud-docs/shared-metric-description-page-faults.rst
223244
224245 * - Replication lag
225246 - > 240 seconds for 5 minutes
226247 - > 1 hour for 5 minutes
248+ - .. include:: /includes/cloud-docs/shared-metric-description-replication-lag.rst
227249
228250 * - Failed backup
229251 - Any occurrence
230252 - None
253+ - Track backup operations to ensure data integrity. A failed backup can compromise
254+ data availability.
231255
232256 * - Restored backup
233257 - Any occurrence
234258 - None
259+ - Verify restored backups to ensure data integrity and system functionality.
235260
236261 * - Fallback snapshot failed
237262 - Any occurrence
238263 - None
264+ - Monitor fallback snapshot operations to ensure data redundancy and recovery
265+ capability.
239266
240267 * - Backup schedule behind
241268 - > 12 hours
242269 - > 12 hours
243-
244- * - Available write tickets
245- - < 75 for 5 minutes
246- - < 25 for 5 minutes
247-
248- * - Available read tickets
249- - < 75 for 5 minutes
250- - < 25 for 5 minutes
270+ - Check backup schedules to ensure they are on track. Falling behind can
271+ risk data loss and compromise recovery plans.
272+
273+ * - Queued Reads
274+ - > 0, up to 10
275+ - > 10
276+ - Monitor queued reads to ensure efficient data retrieval. High levels of
277+ queued reads may indicate resource constraints or performance bottlenecks,
278+ requiring optimization to maintain system responsiveness.
279+
280+ * - Queued Writes
281+ - > 0, up to 10
282+ - > 10
283+ - Monitor queued writes to maintain efficient data processing. High levels
284+ of queued writes may signal resource constraints or performance bottlenecks, requiring optimization to maintain system responsiveness.
251285
252286 * - Restarts last hour
253287 - > 2
254288 - > 2
289+ - Track the number of restarts in the last hour to detect instability or
290+ configuration issues. Frequent restarts can indicate underlying problems
291+ that require immediate investigation to maintain system reliability and uptime.
255292
256293 * - :manual:`Primary election </core/replica-set-elections/>`
257294 - Any occurrence
258295 - None
296+ - Monitor primary elections to ensure stable {+cluster+} operations. Frequent
297+ elections can indicate network issues or resource constraints, potentially
298+ impacting the availability and performance of the database.
259299
260300 * - Maintenance no longer needed
261301 - Any occurrence
262302 - None
303+ - Review unnecessary maintenance tasks to optimize resources and minimize disruptions.
263304
264305 * - Maintenance started
265306 - Any occurrence
266307 - None
308+ - Track the start of maintenance tasks to ensure planned activities proceed smoothly.
309+ Proper oversight helps maintain system performance and minimize downtime during maintenance.
267310
268311 * - Maintenance scheduled
269312 - Any occurrence
270313 - None
314+ - Monitor scheduled maintenance to prepare for potential system impacts.
271315
272316 * - :atlas:`Steal </alert-basics/#cpu-steal>`
273317 - > 5% for 5 minutes
274318 - > 20% for 5 minutes
319+ - Monitor CPU steal on AWS EC2 {+clusters+} with Burstable Performance
320+ to identify when CPU usage exceeds the guaranteed baseline due to shared
321+ cores. High steal percentages indicate the CPU credit balance is depleted,
322+ affecting performance.
275323
276324 * - CPU
277325 - > 75% for 5 minutes
278326 - > 75% for 5 minutes
327+ - .. include:: /includes/cloud-docs/shared-metric-description-cpu.rst
279328
280329 * - Disk partition usage
281330 - > 90%
282331 - > 95% for 5 minutes
283-
284- * - Index partition usage
285- - > 90%
286- - > 95% for 5 minutes
287-
288- * - Journal partition usage
289- - > 90%
290- - > 95% for 5 minutes
332+ - Monitor disk partition usage to ensure sufficient storage availability.
333+ High usage levels can lead to performance degradation and potential system outages.
291334
292335To learn more, see :atlas:`Configure and Resolve Alerts </alerts>`.
293336
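The "value above a limit for N minutes" pattern used throughout the table can be sketched in a few lines of Python. This is only an illustration of how the low and high priority thresholds relate, not how Atlas evaluates alerts: the metric keys, the `Threshold` and `AlertRule` types, and the one-sample-per-minute assumption are all hypothetical, and only the numeric limits are taken from the table above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Threshold:
    limit: float   # value the metric must exceed
    minutes: int   # how long the condition must hold


@dataclass(frozen=True)
class AlertRule:
    condition: str
    low: Threshold
    high: Threshold


# Limits copied from the table; the dictionary keys are hypothetical names.
RULES = {
    "replication_lag_seconds": AlertRule(
        "Replication lag", Threshold(240, 5), Threshold(3600, 5)
    ),
    "page_faults_per_second": AlertRule(
        "Page faults", Threshold(50, 5), Threshold(100, 5)
    ),
    "cpu_steal_percent": AlertRule(
        "Steal", Threshold(5, 5), Threshold(20, 5)
    ),
}


def sustained(samples: Sequence[float], threshold: Threshold) -> bool:
    """Return True if the most recent `threshold.minutes` samples all exceed the limit.

    Assumes one sample per minute, oldest first.
    """
    window = samples[-threshold.minutes:]
    return len(window) == threshold.minutes and all(
        value > threshold.limit for value in window
    )


def classify(metric: str, samples: Sequence[float]) -> Optional[str]:
    """Return "high", "low", or None for the given metric history."""
    rule = RULES[metric]
    if sustained(samples, rule.high):
        return "high"
    if sustained(samples, rule.low):
        return "low"
    return None


if __name__ == "__main__":
    print(classify("replication_lag_seconds", [300, 310, 320, 330, 340]))
    print(classify("cpu_steal_percent", [25, 30, 22, 21, 26]))
```

The high priority rule is checked first so that a metric exceeding both limits is reported only once, at the more severe level. A single sample that drops back under the limit resets the window, which mirrors the "for 5 minutes" wording in the table.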