Commit 0173eb8

initial draft.
1 parent 968158b commit 0173eb8

9 files changed: +307 -0 lines changed

src/current/v25.2/detect_hotspots.md

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
---
title: Detect Hotspots
summary: Learn how to detect hotspots in CockroachDB.
toc: true
---

This page provides guidance on identifying common hotspots in CockroachDB clusters, using real-time monitoring and historical logs.

## Real-time Detection

### Read hotspots

To detect a [read hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}#read-hotspot), such as an [index hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}#index-hotspot) or [hot key]({% link {{ page.version.version }}/understand-hotspots.md %}#row-hotspot), monitor the following graphs on the [DB Console **Metrics** page]({% link {{ page.version.version }}/ui-overview.md %}#metrics).

#### CPU check

- On the DB Console **Metrics** page **Hardware** dashboard, monitor the [**CPU Percent** graph]({% link {{ page.version.version }}/ui-hardware-dashboard.md %}#cpu-percent).
- If the CPU usage of the hottest node is 20% or more above the cluster average, it may indicate a hotspot.
- For example, at 17:35, node `n5` (the green line in the following graph) hovers around 87%, while the other nodes hover around 20% to 25%:

<img src="{{ 'images/v25.2/detect-hotspots-1.png' | relative_url }}" alt="graph of CPU Percent utilization per node showing hot key" style="border:1px solid #eee;max-width:100%" />
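
The CPU check above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the readings, and the reading of "20% or more above" as 20 percentage points are hypothetical, and the cluster average here includes the hot node itself.

```python
from statistics import mean

def find_cpu_hotspot(cpu_percent_by_node, threshold_points=20.0):
    """Return (node, gap) if the hottest node is at least threshold_points
    above the cluster average, otherwise None.

    Assumption: the threshold is in percentage points, not a relative increase.
    """
    average = mean(cpu_percent_by_node.values())
    hottest = max(cpu_percent_by_node, key=cpu_percent_by_node.get)
    gap = cpu_percent_by_node[hottest] - average
    return (hottest, gap) if gap >= threshold_points else None

# Hypothetical readings modeled on the example: n5 near 87%, the rest near 20-25%.
readings = {"n1": 22.0, "n2": 25.0, "n3": 20.0, "n4": 23.0, "n5": 87.0}
print(find_cpu_hotspot(readings))
```

With these readings the function flags `n5`. A cluster whose nodes all sit within a few points of each other returns `None`.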

#### Node overload check

- On the DB Console **Metrics** page **Runtime** dashboard, monitor the [**Runnable Goroutines Per CPU** graph]({% link {{ page.version.version }}/ui-runtime-dashboard.md %}#runnable-goroutines-per-cpu).
- A significant difference between the average and maximum values may indicate a hotspot.
- Nodes typically hover near `0.0`, unless a node is at or near its system-configured limit of 32.
- The **Runnable Goroutines Per CPU** graph increases faster than the **CPU Percent** graph only when a node approaches its limit.
- For example, at 17:35, node `n5` (the green line in the following graph) hovers above 3, while the other nodes hover around 0.5:

<img src="{{ 'images/v25.2/detect-hotspots-2.png' | relative_url }}" alt="graph of Runnable Goroutines per CPU per node showing node overload" style="border:1px solid #eee;max-width:100%" />

{{site.data.alerts.callout_success}}
Compare the **Runnable Goroutines Per CPU** graph and the **CPU Percent** graph at the same timestamp to spot sharp increases.
{{site.data.alerts.end}}
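
The following Python sketch illustrates comparing the two graphs at the same timestamp. It assumes both series have been exported for one node at matching sample times. The function name, sample values, and jump thresholds are all hypothetical.

```python
def sharp_increases(timestamps, series, min_jump):
    """Return the timestamps at which the series rises by at least
    min_jump compared with the previous sample."""
    return [
        t
        for t, prev, cur in zip(timestamps[1:], series, series[1:])
        if cur - prev >= min_jump
    ]

# Hypothetical samples for a single node.
timestamps = ["17:25", "17:30", "17:35", "17:40"]
cpu_percent = [24.0, 26.0, 87.0, 85.0]
runnable_per_cpu = [0.4, 0.5, 3.2, 3.1]

cpu_jumps = set(sharp_increases(timestamps, cpu_percent, min_jump=20.0))
goroutine_jumps = set(sharp_increases(timestamps, runnable_per_cpu, min_jump=1.0))

# Timestamps at which both graphs jump together.
print(sorted(cpu_jumps & goroutine_jumps))  # ['17:35']
```

A timestamp that appears in both sets is a candidate for closer inspection. A jump in only one series is a weaker signal.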

#### I/O check

- On the DB Console **Metrics** page **Overload** dashboard, monitor the [**IO Overload** graph]({% link {{ page.version.version }}/ui-overload-dashboard.md %}#io-overload).
- If the CPU is not the bottleneck, check the **IO Overload** graph for potential I/O issues.

<img src="{{ 'images/v25.2/detect-hotspots-3.png' | relative_url }}" alt="graph of IO Overload score per node" style="border:1px solid #eee;max-width:100%" />

### Read or write hotspots

To detect a [read or write hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}#read-hotspot), such as an [index hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}#index-hotspot), monitor the following metric on the [DB Console **Advanced Debug Custom Chart** page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).

#### Clear direction check

To optimize your cluster's performance, CockroachDB can split frequently accessed keys into smaller ranges. In conjunction with load-based rebalancing, [load-based splitting]({% link {{ page.version.version }}/load-based-splitting.md %}) distributes load evenly across your cluster.

- On the [DB Console **Advanced Debug Custom Chart** page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}), select the `system` virtual cluster and create a chart to monitor the metric `kv.loadsplitter.cleardirection`. This metric tracks whether the [load-based splitter]({% link {{ page.version.version }}/load-based-splitting.md %}) observed more than 80% of accesses going in one direction, to the left or to the right, in the samples. A clear direction suggests that the keys accessed on a replica are steadily increasing or decreasing.
- If this metric increases significantly, it may indicate an index hotspot.
- TODO: READ Example, look at last 3 messages

<img src="{{ 'images/v25.2/detect-hotspots-4.png' | relative_url }}" alt="graph of kv.loadsplitter.cleardirection" style="border:1px solid #eee;max-width:100%" />
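
To make the 80% rule concrete, here is a simplified Python sketch. It is not the load-based splitter's actual implementation: the `(left, right)` counters per sampled key, the function name, and the sample data are assumptions made for illustration.

```python
def clear_direction(samples, threshold=0.8):
    """Return "left" or "right" if more than `threshold` of the observed
    accesses fall on one side of the sampled keys, otherwise None.

    samples: list of (left, right) access counts, one pair per sampled key.
    """
    left = sum(l for l, _ in samples)
    right = sum(r for _, r in samples)
    total = left + right
    if total == 0:
        return None
    if left / total > threshold:
        return "left"
    if right / total > threshold:
        return "right"
    return None

# Hypothetical: sequential inserts land almost entirely to the right of each sample.
append_heavy = [(1, 19), (2, 18), (0, 20)]
# Hypothetical: evenly spread accesses show no clear direction.
balanced = [(10, 10), (9, 11), (12, 8)]

print(clear_direction(append_heavy))  # right
print(clear_direction(balanced))      # None
```

An append-heavy workload, such as inserts on a steadily increasing key, produces a clear direction. A uniformly distributed workload does not.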

### Write hotspots

To detect a [write hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}#write-hotspot), such as a [hot key]({% link {{ page.version.version }}/understand-hotspots.md %}#row-hotspot), monitor the following graph on the [DB Console **Metrics** page]({% link {{ page.version.version }}/ui-overview.md %}#metrics), as well as the following metrics on the [DB Console **Advanced Debug Custom Chart** page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).

#### KV Execution Latency check

- On the DB Console **Metrics** page **SQL** dashboard, monitor the [**KV Execution Latency: 90th percentile** graph]({% link {{ page.version.version }}/ui-sql-dashboard.md %}#kv-execution-latency-90th-percentile).
- If the maximum value is a clear outlier in the cluster, it may indicate a hotspot.

<img src="{{ 'images/v25.2/detect-hotspots-5.png' | relative_url }}" alt="graph of KV Execution Latency: 90th percentile" style="border:1px solid #eee;max-width:100%" />
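
This page does not define "clear outlier" numerically. One possible rule of thumb, sketched below in Python, compares the maximum per-node value with the cluster median. The factor of 3, the function name, and the latency values are hypothetical.

```python
from statistics import median

def clear_outlier(values_by_node, factor=3.0):
    """Return the node with the maximum value if that value exceeds
    `factor` times the cluster median, otherwise None."""
    typical = median(values_by_node.values())
    worst = max(values_by_node, key=values_by_node.get)
    if typical > 0 and values_by_node[worst] > factor * typical:
        return worst
    return None

# Hypothetical 90th percentile KV execution latencies, in milliseconds.
p90_latency_ms = {"n1": 4.1, "n2": 3.8, "n3": 4.4, "n4": 19.5, "n5": 4.0}
print(clear_outlier(p90_latency_ms))  # n4
```

The median is used rather than the mean so that the hot node does not pull the baseline upward.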

#### Latch conflict wait durations check

- On the [DB Console **Advanced Debug Custom Chart** page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}), select the `system` virtual cluster and create a chart to monitor the metric `kv.concurrency.latch_conflict_wait_durations`. This metric tracks the time, in nanoseconds, that [latch acquisition]({% link {{ page.version.version }}/architecture/transaction-layer.md %}#latch-manager) spends waiting on conflicts with other latches. For example, a [sequence]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-sequence) that writes to the same row must wait for the latch.
- If the maximum value is a clear outlier in the cluster, it may indicate a hotspot.

- ?? metric is a histogram so which should be chosen?

<img src="{{ 'images/v25.2/detect-hotspots-6.png' | relative_url }}" alt="kv.concurrency.latch_conflict_wait_durations" style="border:1px solid #eee;max-width:100%" />

#### Popular key check

- On the [DB Console **Advanced Debug Custom Chart** page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}), select the `system` virtual cluster and create a chart to monitor the metric `kv.loadsplitter.popularkey`. This metric tracks whether the [load-based splitter]({% link {{ page.version.version }}/load-based-splitting.md %}) could not find a split key, and the most popular sampled split key appears in more than 25% of the samples. This indicates that, in a given replica, one key is receiving most of the traffic.
- If this metric increases significantly, it may indicate a hotspot.
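
The 25% condition can be illustrated with a short Python sketch. It simplifies the description above and is not the splitter's actual code. The function name and the sampled keys are hypothetical.

```python
from collections import Counter

def popular_key(sampled_keys, threshold=0.25):
    """Return (key, share) if the most frequent sampled key accounts for
    more than `threshold` of all samples, otherwise None."""
    if not sampled_keys:
        return None
    key, count = Counter(sampled_keys).most_common(1)[0]
    share = count / len(sampled_keys)
    return (key, share) if share > threshold else None

# Hypothetical: 12 of 20 samples hit the same key.
hot = ["k7"] * 12 + ["k1", "k2", "k3", "k4", "k5", "k6", "k8", "k9"]
# Hypothetical: 20 distinct keys, each sampled once.
spread = [f"k{i}" for i in range(20)]

print(popular_key(hot))
print(popular_key(spread))  # None
```

In the first case a single key makes up 60% of the samples, so no split point can spread its load. In the second case no key dominates.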

### Finding the hot index

#### Hot Ranges page

If the graphs on the **Metrics** page or the **Advanced Debug Custom Chart** page indicate a potential [node hotspot]({% link {{ page.version.version }}/understand-hotspots.md %}#node-hotspot), navigate to the DB Console [**Hot Ranges** page]({% link {{ page.version.version }}/ui-hot-ranges-page.md %}) to determine the corresponding [hot index]({% link {{ page.version.version }}/understand-hotspots.md %}#index-hotspot). The **Hot Ranges** page results are averaged over 30 minutes.

1. Sort the results.
    1. By the **CPU** column descending. The values are CPU time in milliseconds used per second, whereas the **CPU Percent** graph shows percent usage.
    1. By the **Write (bytes)** column descending.
1. Check the **Leaseholder** column to correlate the hot range results with the suspected node from the preceding graph checks on the **Metrics** or **Advanced Debug Custom Chart** pages.
1. If a correlation is found:
    1. Note the range in the **Range ID** column.
    1. Scroll to the right of the page to the **Table** and **Index** columns to retrieve the table and index associated with that range.

{{site.data.alerts.callout_success}}
Focus on correlating spikes in **CPU Percent** or **Runnable Goroutines Per CPU** with specific index usage to confirm the hotspot.
{{site.data.alerts.end}}

Use the SQL statement [`SHOW CREATE TABLE`]({% link {{ page.version.version }}/show-create.md %}) to inspect the schema for the involved table and index.
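
The sorting and correlation steps above can be sketched in Python, assuming the **Hot Ranges** rows have been copied into a list of dictionaries. The field names, table and index names, and values are hypothetical. The conversion relies only on arithmetic: 1,000 ms of CPU time per second equals 100% of one CPU core.

```python
def rank_hot_ranges(rows, suspected_node):
    """Keep ranges whose leaseholder is the suspected node,
    sorted by CPU time descending."""
    on_node = [r for r in rows if r["leaseholder"] == suspected_node]
    return sorted(on_node, key=lambda r: r["cpu_ms_per_sec"], reverse=True)

def cpu_ms_per_sec_to_core_percent(cpu_ms_per_sec):
    """Convert CPU milliseconds used per second to percent of one core."""
    return cpu_ms_per_sec / 10.0

# Hypothetical rows; the node ID matches the suspected node n5.
rows = [
    {"range_id": 88, "leaseholder": 2, "cpu_ms_per_sec": 35.0, "table": "users", "index": "users_pkey"},
    {"range_id": 101, "leaseholder": 5, "cpu_ms_per_sec": 640.0, "table": "orders", "index": "orders_pkey"},
    {"range_id": 97, "leaseholder": 5, "cpu_ms_per_sec": 42.0, "table": "orders", "index": "orders_by_date"},
]

top = rank_hot_ranges(rows, suspected_node=5)[0]
print(top["range_id"], top["table"], top["index"])            # 101 orders orders_pkey
print(cpu_ms_per_sec_to_core_percent(top["cpu_ms_per_sec"]))  # 64.0
```

The converted value is a share of a single core, so it is not directly comparable to the node-wide **CPU Percent** graph on a multi-core machine.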

## Historical Detection

- Check **Hot Range Logs**:
    - Logs are emitted under certain conditions.
    - Log entries contain details such as the replica ID and CPU usage.
    - Logs are found in the **HEALTH** channel (moved from the **TELEMETRY** channel).

## Performance Benchmarks and Limitations

During internal testing (row sizes 256–512 bytes) on an **N2-standard-16** machine:

| Category               | Performance Limit           |
|------------------------|-----------------------------|
| Index Hotspot          | ~22,000 inserts per second  |
| Row Hotspot (Writes)   | ~1,000 writes per second    |
| Row Hotspot (Reads)    | ~70,000 reads per second    |

{{site.data.alerts.callout_info}}
The larger the cluster, the easier it is to detect hotspots due to clearer outliers.
{{site.data.alerts.end}}
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
---
title: Detect Hotspots 1
summary: Learn how to detect hotspots in CockroachDB.
toc: true
---

This document provides guidance on identifying common hotspots in CockroachDB clusters, using real-time monitoring and historical logs.

## Real-time Detection

### Indicators of Hotspots

#### Index Hotspot or Hot Key (Read)

- **Hotspot "Eyeball Test"**
    - Use the **DB Console** and **Hardware** dashboard.
    - If the CPU usage of the hottest node is 20% or more above the cluster average, it indicates a potential hotspot.

- **Node Overload**
    - Use the **DB Console** and **Runtime** dashboard.
    - Monitor **Runnable Goroutines Per CPU**:
        - Nodes typically hover near 0.0.
        - A stark difference between the average and maximum values indicates that a node is at or near its limit.
        - Runnable goroutines increase faster than CPU usage when nearing the system-configured maximum of 32.

**Tip:** Compare runnable goroutine graphs and CPU graphs at the same timestamp to spot sharp increases.

- **IO Overload**
    - If CPU is not the bottleneck, review the **Overload** dashboard for IO issues.

### Index Hotspot (Read/Write)

- **Clear Direction Metric** (`kv.loadsplitter.cleardirection`)
    - Indicates keys for a replica are increasing or decreasing uniformly.
    - Check via **Advanced Debug**.
    - Examine the last three messages for signs.

- **KV Execution Latency**
    - Found on the **SQL Dashboard** in the **DB Console**.
    - Look at the 90th percentile latency; an outlier indicates a hotspot.

- **Latch Conflict Wait Durations** (`kv.concurrency.latch_conflict_wait_durations`)
    - Available in **Advanced Debug**.
    - A clear outlier signals contention, often caused by writes to the same row.

- **Popular Key Metric** (`kv.loadsplitter.popularkey`)
    - Indicates one key is receiving most of the traffic.
    - Available only via **Advanced Debug**.

### Finding the Hot Index

- Use the **Hot Ranges** page:
    - Sort by **CPU usage** and **Write (bytes)**.
    - Correlate metrics with the suspected node.
    - Metrics are averaged over 30 minutes.
    - CPU time is measured in milliseconds used per second.

- If correlation is found:
    1. Identify the range.
    2. Retrieve the index associated with the range.

- **Final Step:**
    - Use `SHOW CREATE TABLE` to inspect the table structure for the involved index.

## Historical Detection

- Check **Hot Range Logs**:
    - Logs are emitted under certain conditions.
    - Contain details like replica ID and CPU usage.
    - Found in the **HEALTH** channel (moved from **TELEMETRY**).

## Performance Benchmarks and Limitations

During internal testing (row sizes 256–512 bytes) on an **N2-standard-16** machine:

- **Index Hotspot Limit:** ~22,000 inserts per second.
- **Row Hotspot Limits:**
    - ~1,000 writes per second.
    - ~70,000 reads per second.

**Note:** The larger the cluster (number of nodes), the clearer and easier it is to detect hotspots.

---

**Deadline:** First minor release `v25.2.1` scheduled for June.
Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
---
title: Detect Hotspots 2
summary: Learn how to detect hotspots in CockroachDB.
toc: true
---

This document provides guidance on identifying common hotspots in CockroachDB clusters, using real-time monitoring and historical logs.

{{site.data.alerts.callout_info}}
This guide covers both real-time and historical hotspot detection techniques.
{{site.data.alerts.end}}

## Real-time Detection

### Indicators of Hotspots

#### Index Hotspot or Hot Key (Read)

- **Hotspot "Eyeball Test"**
    - Use the **DB Console** and **Hardware** dashboard.
    - If the CPU usage of the hottest node is 20% or more above the cluster average, it indicates a potential hotspot.

- **Node Overload**
    - Use the **DB Console** and **Runtime** dashboard.
    - Monitor **Runnable Goroutines Per CPU**:
        - Nodes typically hover near 0.0.
        - A stark difference between the average and maximum values indicates that a node is at or near its limit.
        - Runnable goroutines increase faster than CPU usage when nearing the system-configured maximum of 32.

{{site.data.alerts.callout_success}}
Compare runnable goroutine graphs and CPU graphs at the same timestamp to easily spot sharp increases.
{{site.data.alerts.end}}

- **IO Overload**
    - If CPU is not the bottleneck, review the **Overload** dashboard for IO issues.

### Index Hotspot (Read/Write)

- **Clear Direction Metric** (`kv.loadsplitter.cleardirection`)
    - Indicates keys for a replica are increasing or decreasing uniformly.
    - Check via **Advanced Debug**.
    - Examine the last three messages for signs.

- **KV Execution Latency**
    - Found on the **SQL Dashboard** in the **DB Console**.
    - Look at the 90th percentile latency; an outlier indicates a hotspot.

- **Latch Conflict Wait Durations** (`kv.concurrency.latch_conflict_wait_durations`)
    - Available in **Advanced Debug**.
    - A clear outlier signals contention, often caused by writes to the same row.

- **Popular Key Metric** (`kv.loadsplitter.popularkey`)
    - Indicates one key is receiving most of the traffic.
    - Available only via **Advanced Debug**.

### Finding the Hot Index

- Use the **Hot Ranges** page:
    - Sort by **CPU usage** and **Write (bytes)**.
    - Correlate metrics with the suspected node.
    - Metrics are averaged over 30 minutes.
    - CPU time is measured in milliseconds used per second.

- If correlation is found:
    1. Identify the range.
    2. Retrieve the index associated with the range.

- **Final Step:**
    - Use `SHOW CREATE TABLE` to inspect the table structure for the involved index.

{{site.data.alerts.callout_success}}
Focus on correlating spikes in CPU or goroutines with specific index usage to confirm the hotspot.
{{site.data.alerts.end}}

## Historical Detection

- Check **Hot Range Logs**:
    - Logs are emitted under certain conditions.
    - Contain details like replica ID and CPU usage.
    - Found in the **HEALTH** channel (moved from **TELEMETRY**).

## Performance Benchmarks and Limitations

During internal testing (row sizes 256–512 bytes) on an **N2-standard-16** machine:

| Category               | Performance Limit           |
|------------------------|-----------------------------|
| Index Hotspot          | ~22,000 inserts per second  |
| Row Hotspot (Writes)   | ~1,000 writes per second    |
| Row Hotspot (Reads)    | ~70,000 reads per second    |

{{site.data.alerts.callout_info}}
The larger the cluster, the easier it is to detect hotspots due to clearer outliers.
{{site.data.alerts.end}}

---

{{site.data.alerts.callout_danger}}
**Deadline:** First minor release `v25.2.1` scheduled for June.
{{site.data.alerts.end}}
