@siavashs (Contributor) commented on Sep 8, 2025

This change uses thread-safe types so that the dispatcher can keep processing alerts while maintenance is running.
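
To make the idea concrete, here is a minimal sketch of a group store built on `sync.Map`, where dispatch and maintenance share no lock. It is not the code from this PR: `store`, `group`, `dispatch`, `maintain`, and `run` are hypothetical names, and the real dispatcher holds much more state per aggregation group.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// group is a hypothetical stand-in for an aggregation group.
type group struct {
	alerts atomic.Int64 // number of alerts currently held
}

// store keeps groups in a sync.Map, so dispatch and maintenance
// do not contend on a single mutex.
type store struct {
	groups sync.Map // key (string) -> *group
}

// dispatch adds one alert to the group for key, creating the group if needed.
func (s *store) dispatch(key string) {
	g, _ := s.groups.LoadOrStore(key, &group{})
	g.(*group).alerts.Add(1)
}

// maintain deletes groups that hold no alerts and reports how many it removed.
// Note: the emptiness check and the Delete are two separate steps, so a group
// that becomes non-empty in between could still be deleted.
func (s *store) maintain() int {
	removed := 0
	s.groups.Range(func(k, v any) bool {
		if v.(*group).alerts.Load() == 0 {
			s.groups.Delete(k)
			removed++
		}
		return true
	})
	return removed
}

// run seeds nEmpty empty groups and nLive groups with one alert each, then
// dispatches perGroup more alerts per live group while maintenance runs.
func run(nEmpty, nLive, perGroup int) (removed int, total int64) {
	s := &store{}
	for i := 0; i < nEmpty; i++ {
		s.groups.Store(fmt.Sprintf("empty-%d", i), &group{})
	}
	for i := 0; i < nLive; i++ {
		s.dispatch(fmt.Sprintf("live-%d", i)) // never empty from here on
	}
	var wg sync.WaitGroup
	for i := 0; i < nLive; i++ {
		wg.Add(1)
		go func(key string) {
			defer wg.Done()
			for j := 0; j < perGroup; j++ {
				s.dispatch(key)
			}
		}(fmt.Sprintf("live-%d", i))
	}
	removed = s.maintain() // overlaps with the dispatch goroutines
	wg.Wait()
	s.groups.Range(func(_, v any) bool {
		total += v.(*group).alerts.Load()
		return true
	})
	return removed, total
}

func main() {
	removed, total := run(1000, 8, 1000)
	fmt.Println("empty groups removed:", removed, "alerts kept:", total)
}
```

The sketch stays deterministic only because the live groups are seeded before maintenance starts. In general the check-then-delete in `maintain` is not atomic, which is the kind of compound operation that needs extra coordination.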

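The included benchmark shows significant improvements, with the gap widening from the 10k to the 50k case (geomean: 24.47% less time per op, 62.63% more alerts/sec, 86.93% less maintenance overhead):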

goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/dispatch
cpu: Apple M3 Pro
                                                │ bench-dispatch-main.txt │     bench-dispatch-no-lock.txt      │
                                                │         sec/op          │   sec/op     vs base                │
Dispatch_100k_AggregationGroups_10k_Empty-12               1.240µ ± 2%   1.196µ ± 2%   -3.59% (p=0.007 n=10)
Dispatch_100k_AggregationGroups_20k_Empty-12               1.332µ ± 1%   1.163µ ± 6%  -12.69% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_30k_Empty-12               1.441µ ± 2%   1.180µ ± 5%  -18.12% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_40k_Empty-12               1.648µ ± 3%   1.128µ ± 6%  -31.50% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_50k_Empty-12               2.304µ ± 8%   1.200µ ± 6%  -47.94% (p=0.000 n=10)
geomean                                                    1.553µ        1.173µ       -24.47%

                                                │ bench-dispatch-main.txt │       bench-dispatch-no-lock.txt       │
                                                │       alerts/sec        │  alerts/sec    vs base                 │
Dispatch_100k_AggregationGroups_10k_Empty-12              1.615M ±  1%    1.800M ±  6%   +11.51% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_20k_Empty-12              1.418M ±  2%    1.835M ±  3%   +29.44% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_30k_Empty-12              1.223M ±  4%    1.811M ±  5%   +48.15% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_40k_Empty-12              1.011M ± 13%    1.864M ±  3%   +84.32% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_50k_Empty-12              623.3k ± 15%   1798.9k ± 10%  +188.62% (p=0.000 n=10)
geomean                                                   1.120M          1.822M         +62.63%

                                                │ bench-dispatch-main.txt │           bench-dispatch-no-lock.txt           │
                                                │ maintenance_overhead_%  │ maintenance_overhead_%  vs base                │
Dispatch_100k_AggregationGroups_10k_Empty-12              16.980 ±  9%             4.069 ±  90%  -76.04% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_20k_Empty-12              35.255 ±  5%             4.588 ±    ?  -86.99% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_30k_Empty-12              59.865 ±  9%             8.347 ±  71%  -86.06% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_40k_Empty-12              93.145 ± 78%             8.848 ± 147%  -90.50% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_50k_Empty-12              194.35 ± 19%             17.97 ± 105%  -90.75% (p=0.000 n=10)
geomean                                                    57.86                   7.564         -86.93%

                                                │ bench-dispatch-main.txt │       bench-dispatch-no-lock.txt       │
                                                │     ms/maintenance      │ ms/maintenance  vs base                │
Dispatch_100k_AggregationGroups_10k_Empty-12               670.5 ±  3%      615.5 ±  7%   -8.20% (p=0.001 n=10)
Dispatch_100k_AggregationGroups_20k_Empty-12               717.0 ±  6%      624.5 ±  6%  -12.90% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_30k_Empty-12               783.5 ±  7%      597.0 ±  8%  -23.80% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_40k_Empty-12               788.0 ± 12%      623.0 ±  8%  -20.94% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_50k_Empty-12              1090.5 ± 17%      644.0 ± 12%  -40.94% (p=0.000 n=10)
geomean                                                    798.0            620.6        -22.23%

Changes from #4541 are also included.

Signed-off-by: Siavash Safi <[email protected]>

- make dispatcher maintenance interval configurable
- bump the default maintenance interval from 30s to 15m

Signed-off-by: Siavash Safi <[email protected]>
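
As an illustration of the commit above, a configurable maintenance interval could look roughly like the sketch below. Only the 15m default (previously 30s) comes from the commit message. `config`, `Option`, `WithMaintenanceInterval`, `resolveInterval`, and `runMaintenance` are hypothetical names, not Alertmanager's API, and falling back to the default for non-positive values is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// defaultMaintenanceInterval mirrors the new default from the commit
// message (15m, previously 30s).
const defaultMaintenanceInterval = 15 * time.Minute

// config and Option are hypothetical; they are not Alertmanager types.
type config struct {
	maintenanceInterval time.Duration
}

// Option mutates the hypothetical config.
type Option func(*config)

// WithMaintenanceInterval overrides the default interval.
func WithMaintenanceInterval(d time.Duration) Option {
	return func(c *config) { c.maintenanceInterval = d }
}

// resolveInterval applies the options and falls back to the default for
// non-positive values (an assumption; time.NewTicker panics on those).
func resolveInterval(opts ...Option) time.Duration {
	c := config{maintenanceInterval: defaultMaintenanceInterval}
	for _, o := range opts {
		o(&c)
	}
	if c.maintenanceInterval <= 0 {
		return defaultMaintenanceInterval
	}
	return c.maintenanceInterval
}

// runMaintenance calls maintain once per interval until stop is closed.
func runMaintenance(interval time.Duration, stop <-chan struct{}, maintain func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			maintain()
		case <-stop:
			return
		}
	}
}

func main() {
	fmt.Println("default:", resolveInterval())
	fmt.Println("override:", resolveInterval(WithMaintenanceInterval(30*time.Second)))

	// Short demo run with a tiny interval so the loop is observable.
	stop := make(chan struct{})
	done := make(chan struct{})
	runs := 0
	go func() {
		defer close(done)
		runMaintenance(10*time.Millisecond, stop, func() { runs++ })
	}()
	time.Sleep(55 * time.Millisecond)
	close(stop)
	<-done
	fmt.Println("maintenance runs:", runs)
}
```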
@siavashs (Contributor, Author)

If all operations are made atomic, sync.Map actually performs worse.
Closing in favour of #4552.
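
The comment does not say which operations had to become atomic or what #4552 does instead. As a generic illustration only, the sketch below shows a compound "remove the group if it is empty" step made atomic with a plain map and one mutex. A `sync.Map` cannot express that step in a single call, so it would need a similar lock on top of its own overhead. All names (`lockedStore`, `dispatch`, `removeIfEmpty`, `stress`) are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedStore guards a plain map with one mutex, so every operation,
// including check-then-delete, is atomic with respect to the others.
type lockedStore struct {
	mu     sync.Mutex
	alerts map[string]int // group key -> alert count
}

func newLockedStore() *lockedStore {
	return &lockedStore{alerts: make(map[string]int)}
}

// dispatch adds one alert to the group for key, creating it if needed.
func (s *lockedStore) dispatch(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.alerts[key]++
}

// removeIfEmpty deletes the group only if it exists and holds no alerts.
// The check and the delete happen under the same lock.
func (s *lockedStore) removeIfEmpty(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if n, ok := s.alerts[key]; ok && n == 0 {
		delete(s.alerts, key)
		return true
	}
	return false
}

// total returns the number of alerts across all groups.
func (s *lockedStore) total() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	sum := 0
	for _, n := range s.alerts {
		sum += n
	}
	return sum
}

// stress seeds empty groups, then dispatches to and tries to remove the
// same keys concurrently. It returns the number of alerts that survived.
func stress(keys, perKey int) int {
	s := newLockedStore()
	names := make([]string, keys)
	for i := range names {
		names[i] = fmt.Sprintf("group-%d", i)
		s.alerts[names[i]] = 0
	}
	var wg sync.WaitGroup
	for _, name := range names {
		wg.Add(2)
		go func(k string) {
			defer wg.Done()
			for j := 0; j < perKey; j++ {
				s.dispatch(k)
			}
		}(name)
		go func(k string) {
			defer wg.Done()
			for j := 0; j < perKey; j++ {
				s.removeIfEmpty(k)
			}
		}(name)
	}
	wg.Wait()
	return s.total()
}

func main() {
	fmt.Println("alerts kept:", stress(100, 1000), "of", 100*1000)
}
```

Because a group is only deleted while its count is zero under the lock, no dispatched alert is lost regardless of interleaving. The cost is that dispatch and maintenance serialize on the mutex again.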

@siavashs closed this on Sep 10, 2025