fix(dispatch): unblock processing in maintenance #4551

siavashs · 2025-09-08T11:39:21Z

This change uses thread-safe types so that the dispatcher can keep processing alerts while maintenance is running.
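As a rough illustration of the idea (not the code from this PR), the sketch below keeps aggregation groups in a `sync.Map` so that alert processing and maintenance never contend on one store-wide lock. The names `groupStore`, `aggrGroup`, `process`, `maintain` and `count` are hypothetical and chosen only for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// aggrGroup is a hypothetical stand-in for a dispatcher aggregation group.
type aggrGroup struct {
	mu     sync.Mutex
	alerts int
}

func (g *aggrGroup) insert() {
	g.mu.Lock()
	g.alerts++
	g.mu.Unlock()
}

func (g *aggrGroup) empty() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.alerts == 0
}

// groupStore holds groups in a sync.Map, so lookups, inserts and deletes
// are safe for concurrent use without a store-wide mutex.
type groupStore struct {
	groups sync.Map // fingerprint (string) -> *aggrGroup
}

// process routes one alert to its group, creating the group if needed.
func (s *groupStore) process(fp string) {
	g, _ := s.groups.LoadOrStore(fp, &aggrGroup{})
	g.(*aggrGroup).insert()
}

// maintain deletes empty groups and returns how many it deleted.
// Caveat: a group that process has just created but not yet filled can
// look empty here; closing that window needs extra synchronization.
func (s *groupStore) maintain() int {
	removed := 0
	s.groups.Range(func(k, v any) bool {
		if v.(*aggrGroup).empty() {
			s.groups.Delete(k)
			removed++
		}
		return true
	})
	return removed
}

// count returns the number of groups currently stored.
func (s *groupStore) count() int {
	n := 0
	s.groups.Range(func(_, _ any) bool { n++; return true })
	return n
}

func main() {
	s := &groupStore{}
	s.groups.Store("stale", &aggrGroup{}) // an empty group for maintenance to find

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // alert processing
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			s.process(fmt.Sprintf("fp-%d", i%10))
		}
	}()
	go func() { // maintenance runs at the same time
		defer wg.Done()
		s.maintain()
	}()
	wg.Wait()
	fmt.Println("groups after run:", s.count())
}
```

Each individual map operation is safe here, but the check-then-delete in `maintain` is not atomic as a whole, which is the kind of gap that needs extra care in a real dispatcher.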

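The included benchmark shows improvements at every tested size. By geomean, time per operation drops by 24.47%, throughput rises by 62.63%, maintenance overhead falls by 86.93%, and time per maintenance run falls by 22.23%: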

goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/dispatch
cpu: Apple M3 Pro
                                                │ bench-dispatch-main.txt │     bench-dispatch-no-lock.txt      │
                                                │         sec/op          │   sec/op     vs base                │
Dispatch_100k_AggregationGroups_10k_Empty-12               1.240µ ± 2%   1.196µ ± 2%   -3.59% (p=0.007 n=10)
Dispatch_100k_AggregationGroups_20k_Empty-12               1.332µ ± 1%   1.163µ ± 6%  -12.69% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_30k_Empty-12               1.441µ ± 2%   1.180µ ± 5%  -18.12% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_40k_Empty-12               1.648µ ± 3%   1.128µ ± 6%  -31.50% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_50k_Empty-12               2.304µ ± 8%   1.200µ ± 6%  -47.94% (p=0.000 n=10)
geomean                                                    1.553µ        1.173µ       -24.47%

                                                │ bench-dispatch-main.txt │       bench-dispatch-no-lock.txt       │
                                                │       alerts/sec        │  alerts/sec    vs base                 │
Dispatch_100k_AggregationGroups_10k_Empty-12              1.615M ±  1%    1.800M ±  6%   +11.51% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_20k_Empty-12              1.418M ±  2%    1.835M ±  3%   +29.44% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_30k_Empty-12              1.223M ±  4%    1.811M ±  5%   +48.15% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_40k_Empty-12              1.011M ± 13%    1.864M ±  3%   +84.32% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_50k_Empty-12              623.3k ± 15%   1798.9k ± 10%  +188.62% (p=0.000 n=10)
geomean                                                   1.120M          1.822M         +62.63%

                                                │ bench-dispatch-main.txt │           bench-dispatch-no-lock.txt           │
                                                │ maintenance_overhead_%  │ maintenance_overhead_%  vs base                │
Dispatch_100k_AggregationGroups_10k_Empty-12              16.980 ±  9%             4.069 ±  90%  -76.04% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_20k_Empty-12              35.255 ±  5%             4.588 ±    ?  -86.99% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_30k_Empty-12              59.865 ±  9%             8.347 ±  71%  -86.06% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_40k_Empty-12              93.145 ± 78%             8.848 ± 147%  -90.50% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_50k_Empty-12              194.35 ± 19%             17.97 ± 105%  -90.75% (p=0.000 n=10)
geomean                                                    57.86                   7.564         -86.93%

                                                │ bench-dispatch-main.txt │       bench-dispatch-no-lock.txt       │
                                                │     ms/maintenance      │ ms/maintenance  vs base                │
Dispatch_100k_AggregationGroups_10k_Empty-12               670.5 ±  3%      615.5 ±  7%   -8.20% (p=0.001 n=10)
Dispatch_100k_AggregationGroups_20k_Empty-12               717.0 ±  6%      624.5 ±  6%  -12.90% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_30k_Empty-12               783.5 ±  7%      597.0 ±  8%  -23.80% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_40k_Empty-12               788.0 ± 12%      623.0 ±  8%  -20.94% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_50k_Empty-12              1090.5 ± 17%      644.0 ± 12%  -40.94% (p=0.000 n=10)
geomean                                                    798.0            620.6        -22.23%

Changes from #4541 are also included.

Signed-off-by: Siavash Safi <[email protected]>


siavashs · 2025-09-10T10:03:23Z

If all operations are made atomic, sync.Map actually performs worse.
Closing in favour of #4552.

siavashs added 2 commits September 8, 2025 13:37

feat(dispatcher): add maintenance interval config

7d7119c

- make dispatcher maintenance interval configurable
- bump the default maintenance interval from 30s to 15m

Signed-off-by: Siavash Safi <[email protected]>

siavashs force-pushed the dispatch-no-lock branch from 5a6ea0e to e887293 Compare September 8, 2025 11:45

siavashs closed this Sep 10, 2025
