Conversation

@siavashs siavashs commented Sep 8, 2025

Reduce the amount of time spent holding locks in the dispatcher by using
the new data structures:

  • doMaintenance() snapshots empty groups and deletes them afterwards
  • Groups() snapshots routes and queries groups per route on demand
  • processAlert() only locks the group it is processing an alert for

None of the above 3 methods hold any locks directly.

This results in a 68% reduction in maintenance overhead and a +12991% improvement in alert processing rate (geomeans):

```
goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/dispatch
cpu: Apple M3 Pro
                                                 │ bench-dispatch-main.txt │  bench-dispatch-nested-map-locks.txt   │
                                                 │         sec/op          │    sec/op      vs base                 │
Dispatch_100k_AggregationGroups_10k_Empty-12                  1.242µ ±  1%   1.193µ ±   5%    -3.91% (p=0.037 n=10)
Dispatch_100k_AggregationGroups_20k_Empty-12                  1.329µ ±  2%   1.188µ ±   3%   -10.65% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_30k_Empty-12                  1.437µ ±  3%   1.288µ ±   7%   -10.37% (p=0.001 n=10)
Dispatch_100k_AggregationGroups_40k_Empty-12                  1.695µ ±  8%   1.236µ ±   6%   -27.06% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_50k_Empty-12                  2.185µ ± 11%   1.286µ ±  11%   -41.17% (p=0.000 n=10)
Dispatch_20k_AggregationGroups_Groups_Impact-12          189007.708µ ±  6%   1.143µ ±  18%  -100.00% (p=0.000 n=10)
Dispatch_50k_AggregationGroups_Groups_Impact-12              602.17m ± 69%   28.09m ± 605%   -95.33% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_Groups_Impact-12               1.543 ± 29%    1.272 ±  96%         ~ (p=0.247 n=10)
geomean                                                       187.7µ         24.22µ          -87.10%

                                                 │ bench-dispatch-main.txt │       bench-dispatch-nested-map-locks.txt        │
                                                 │       alerts/sec        │     alerts/sec      vs base                      │
Dispatch_100k_AggregationGroups_10k_Empty-12             1.616M ±       1%        1.767M ±   3%         +9.36% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_20k_Empty-12             1.412M ±       2%        1.715M ±   1%        +21.45% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_30k_Empty-12             1.235M ±       4%        1.637M ±   4%        +32.58% (p=0.002 n=10)
Dispatch_100k_AggregationGroups_40k_Empty-12             950.9k ±      14%       1614.0k ±   8%        +69.73% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_50k_Empty-12             693.5k ±      18%       1370.7k ±  17%        +97.65% (p=0.000 n=10)
Dispatch_20k_AggregationGroups_Groups_Impact-12           5.586 ±      14%   2027690.000 ±   5%  +36299398.75% (p=0.000 n=10)
Dispatch_50k_AggregationGroups_Groups_Impact-12           3.277 ±      40%    742918.500 ± 167%  +22667131.12% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_Groups_Impact-12          1.424 ± 4255939%    252756.500 ±  73%  +17749654.21% (p=0.000 n=10)
geomean                                                  9.134k                   1.196M            +12991.42%

                                             │ bench-dispatch-main.txt │      bench-dispatch-nested-map-locks.txt       │
                                             │ maintenance_overhead_%  │ maintenance_overhead_%  vs base                │
Dispatch_100k_AggregationGroups_10k_Empty-12              17.185 ±  7%              5.905 ±   ?  -65.64% (p=0.002 n=10)
Dispatch_100k_AggregationGroups_20k_Empty-12               36.50 ± 11%              13.66 ± 26%  -62.58% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_30k_Empty-12               55.44 ± 13%              17.45 ± 26%  -68.52% (p=0.002 n=10)
Dispatch_100k_AggregationGroups_40k_Empty-12              125.65 ± 27%              27.03 ± 34%  -78.49% (p=0.000 n=10)
Dispatch_100k_AggregationGroups_50k_Empty-12              172.40 ± 36%              64.64 ± 60%  -62.51% (p=0.000 n=10)
geomean                                                    59.62                    18.97        -68.17%
```
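
For illustration, a minimal sketch of the Groups() snapshot idea from the bullets above, assuming the routeGroups type quoted later in the review; snapshotRoutes is a hypothetical helper name, not part of the diff:

```
// Sketch: take the root read lock only long enough to copy the route
// pointers, then query each route's branch under that branch's own
// lock, so no lock is held across the whole traversal.
func (rg *routeGroups) snapshotRoutes() []*Route {
	rg.mu.RLock()
	defer rg.mu.RUnlock()

	routes := make([]*Route, 0, len(rg.routeGroups))
	for r := range rg.routeGroups {
		routes = append(routes, r)
	}
	return routes
}
```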

Other changes:

feat(dispatch): add nested map implementation

Add a custom nested map implementation with locks at the root and branches.
This allows read and write operations to avoid always blocking each other.
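
For context, a minimal sketch of that shape. The struct fields match the excerpts quoted in the review below; the branch helper and the go.uber.org/atomic import are assumptions, not the actual diff:

```
package dispatch

import (
	"sync"

	"github.com/prometheus/common/model"
	"go.uber.org/atomic"
)

// routeGroups is the root of the nested map. Its lock only guards the
// route map itself; each branch guards its own fingerprint map, so
// operations on different routes rarely block each other.
type routeGroups struct {
	mu          sync.RWMutex
	routeGroups map[*Route]*fingerprintGroups
	groupsNum   *atomic.Int64
}

// fingerprintGroups is one branch: the aggregation groups of a single route.
type fingerprintGroups struct {
	mu         sync.RWMutex
	aggrGroups map[model.Fingerprint]*aggrGroup
	groupsNum  *atomic.Int64
}

// branch returns the per-route branch, creating it if missing. The fast
// path takes only the root read lock; the slow path re-checks under the
// write lock before inserting (classic double-checked locking).
func (rg *routeGroups) branch(route *Route) *fingerprintGroups {
	rg.mu.RLock()
	fg, ok := rg.routeGroups[route]
	rg.mu.RUnlock()
	if ok {
		return fg
	}

	rg.mu.Lock()
	defer rg.mu.Unlock()
	if fg, ok = rg.routeGroups[route]; !ok {
		fg = &fingerprintGroups{
			aggrGroups: make(map[model.Fingerprint]*aggrGroup),
			groupsNum:  rg.groupsNum, // shared atomic counter, readable without locks
		}
		rg.routeGroups[route] = fg
	}
	return fg
}
```

With this shape, processAlert() can resolve its branch first and then lock only that branch (and ultimately only the one aggregation group) while inserting an alert.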

chore(dispatch): split dispatch into multiple files

  • split dispatch.go into multiple files
  • update copyright

Signed-off-by: Siavash Safi [email protected]

@siavashs siavashs force-pushed the fix-dispatch-locking branch from d60b6fe to f7685e1 on September 9, 2025 08:12
@siavashs siavashs changed the title from "fix(dispatch): improve locking contention" to "fix(dispatch): reduce locking contention" on Sep 9, 2025
OGKevin commented Sep 9, 2025

The thing I like about

```
mtx.Lock()
defer mtx.Unlock()
```

is that it makes the management of the lock very clear and explicit.

Your changes make sense to me. However, I wonder if, for improved maintainability, it would be worth extracting the logic that needs to hold locks into smaller functions where we can still use defer for unlocking.

For example, instead of:

```
func (d *Dispatcher) doMaintenance() {
    d.mtx.RLock()
    empty := make(map[*Route]map[model.Fingerprint]*aggrGroup)
    for route, groups := range d.aggrGroupsPerRoute {
        for fp, ag := range groups {
            if ag.empty() {
                if empty[route] == nil {
                    empty[route] = make(map[model.Fingerprint]*aggrGroup)
                }
                empty[route][fp] = ag
            }
        }
    }
    d.mtx.RUnlock()
    // ... then delete the collected groups ...
}
```

it could become:

```
func (d *Dispatcher) makeEmptyAggrMap() map[*Route]map[model.Fingerprint]*aggrGroup {
    d.mtx.RLock()
    defer d.mtx.RUnlock()

    empty := make(map[*Route]map[model.Fingerprint]*aggrGroup)
    for route, groups := range d.aggrGroupsPerRoute {
        for fp, ag := range groups {
            if ag.empty() {
                if empty[route] == nil {
                    empty[route] = make(map[model.Fingerprint]*aggrGroup)
                }
                empty[route][fp] = ag
            }
        }
    }

    return empty
}

func (d *Dispatcher) doMaintenance() {
    empty := d.makeEmptyAggrMap()
    // ... then delete the collected groups ...
}
```

This approach could be applied to all the places where lock holding is involved, essentially making the code easier to maintain and reason about.

wdyt?

@siavashs siavashs force-pushed the fix-dispatch-locking branch from f7685e1 to bf4e9c5 on September 9, 2025 17:07
siavashs commented Sep 9, 2025

I reimplemented the change using custom nested structures.
This allows the dispatcher to hold r/w locks at different branches of the data structure, so it is less blocking for parallel operations.
It also addresses your comment @OGKevin since we don't hold any locks in dispatcher methods.

I will update the PR description and benchmarks tomorrow.

@siavashs siavashs force-pushed the fix-dispatch-locking branch from bf4e9c5 to 00dd27c on September 9, 2025 17:30
@OGKevin OGKevin left a comment

Looks 👌🏾. Just a few comments on wrapping test scenarios.

Would be nice to make metrics an interface so that we can also test if metrics are being set correctly, but I would say that is for a different MR.

```
return as
}

// Test regular situation where we wait for group_wait to send out alerts.
```
Contributor

This test function covers several scenarios. Would it be better to wrap each scenario in a t.Run() block? That would allow a single failing test to be run repeatedly for easier debugging.
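
For example, a hedged sketch of such wrapping (the test and scenario names are hypothetical, not from the actual file):

```
package dispatch

import "testing"

// Hypothetical wrapping of the existing scenarios into subtests.
func TestAggrGroupScenarios(t *testing.T) {
	t.Run("wait_for_group_wait", func(t *testing.T) {
		// ... first scenario: wait for group_wait before sending alerts ...
	})
	t.Run("alert_older_than_group_interval", func(t *testing.T) {
		// ... second scenario: flush without waiting for group_wait ...
	})
	// A single failing scenario can then be re-run in isolation, e.g.:
	//   go test -run 'TestAggrGroupScenarios/wait_for_group_wait' -count=100
}
```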

Contributor Author

I think we can do this in a separate MR.

The group_test.go was split from dispatch_test.go without modifications.
In dispatch_test.go there is TestGroupsWithLimits which checks the alertmanager_dispatcher_aggregation_group_limit_reached_total metric.

Note: Limits are currently not used in Alertmanager; it's a feature imported by Grafana Mimir.


```
ag.stop()

// Add an alert that started more than group_interval in the past. We expect
```
Contributor

Same idea here

```
}
}

// Resolve an alert, and it should be removed after the next batch was sent.
```
Contributor

and here

```
}
}

// Resolve all remaining alerts, they should be removed after the next batch was sent.
```
Contributor

And this last scenario

- split dispatch.go into multiple files
- update copyright

Signed-off-by: Siavash Safi <[email protected]>
@siavashs siavashs force-pushed the fix-dispatch-locking branch from 00dd27c to 46b464c on September 10, 2025 09:25
Add a custom nested map implementation with locks at the root and branches.
This allows read and write operations to avoid always blocking each other.

Signed-off-by: Siavash Safi <[email protected]>
@siavashs siavashs force-pushed the fix-dispatch-locking branch from 46b464c to 2a27e78 on September 10, 2025 09:56
@siavashs siavashs marked this pull request as ready for review September 10, 2025 09:59
@siavashs siavashs force-pushed the fix-dispatch-locking branch from 2a27e78 to 4520036 on September 10, 2025 10:05
@sysadmind sysadmind left a comment

It took me a while to review, mostly because of how much code was moved. At this point we should leave it as it is, since I've already gone through the review, but in the future I would suggest moving code in a separate PR; I spent a lot of time reviewing lines that had simply moved to a different file.


```
// notifyFunc is a function that performs notification for the alert
// with the given fingerprint. It aborts on context cancelation.
// Returns false iff notifying failed.
```
Contributor

Suggested change:

```diff
-// Returns false iff notifying failed.
+// Returns false if notifying failed.
```

```
// Each branch of the map can hold its own R(W)Lock.
type routeGroups struct {
	mu          sync.RWMutex
	routeGroups map[*Route]*fingerprintGroups
```
Contributor

This seems like a weird name conflict with the struct name. The comment is also slightly misleading. This struct isn't the map, but a structure that holds the map. Should this be called something like fingerprints because that's what it is pointing to?

```
type routeGroups struct {
	mu          sync.RWMutex
	routeGroups map[*Route]*fingerprintGroups
	groupsNum   *atomic.Int64
```
Contributor

This seems like it would be better named in reverse: numGroups.

```
}
}

// fingerprintGroups is a map of fingerprints to aggregation groups.
```
Contributor

This comment is also technically wrong.

```
type fingerprintGroups struct {
	mu         sync.RWMutex
	aggrGroups map[model.Fingerprint]*aggrGroup
	groupsNum  *atomic.Int64
```
Contributor

I think this would also be better renamed to numGroups.

```
	fg.aggrGroups = make(map[model.Fingerprint]*aggrGroup)
}
// Check if we've reached the rate limit before creating a new group.
if fg.LimitReached() {
```
Contributor

Would this not be a good time to return an error?

```
func (ag AlertGroups) Less(i, j int) bool {
	if ag[i].Labels.Equal(ag[j].Labels) {
		return ag[i].Receiver < ag[j].Receiver

// Second pass: remove collected groups
```
Contributor

The trouble with not holding the lock this whole time is that between defining what to remove above and actually removing it below, the state could have changed, because we give up the read lock and then try to get a write lock. Who knows what happens in between. Is this maintenance function the one that takes the significant amount of time?

If so, we could try to range over it to find empty groups, and when we do, instead of putting them in a slice, we could decide to do the expensive operation of taking a write lock, re-reading to make sure each is still empty, then deleting it. That would keep write locks down to only when we have something to do, even if we pay that price twice in the less likely scenario that we have cleanup to do.

@siavashs siavashs Sep 13, 2025

> Is this maintenance function the one that takes the significant amount of time?

Yes, if you check the benchmarks there is a huge performance penalty from locking the dispatcher every 30s.

> Who knows what happens in between. Is this maintenance function the one that takes the significant amount of time?

I think the code can be improved, but we don't want to keep the lock at the root while looping over branches.
When trying to remove a group, we can double-check whether it's still empty.

Contributor

I think for correctness we have to re-check for emptiness inside an RWLock.
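
A minimal sketch of that re-check, using the fingerprintGroups type quoted above; removeIfEmpty is a hypothetical helper, and the Dec() call assumes a go.uber.org/atomic-style counter:

```
// removeIfEmpty re-verifies emptiness under the branch's write lock
// before deleting, so a group that received an alert between the
// read-locked scan and this call is left alone.
func (fg *fingerprintGroups) removeIfEmpty(fp model.Fingerprint) bool {
	fg.mu.Lock()
	defer fg.mu.Unlock()

	ag, ok := fg.aggrGroups[fp]
	if !ok || !ag.empty() {
		return false // state changed since the snapshot; keep the group
	}
	ag.stop()
	delete(fg.aggrGroups, fp)
	fg.groupsNum.Dec()
	return true
}
```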

@siavashs (Contributor Author)

> It took me a while to review, mostly because of how much code was moved. At this point we should leave it as it is, since I've already gone through the review, but in the future I would suggest moving code in a separate PR; I spent a lot of time reviewing lines that had simply moved to a different file.

There are 3 atomic commits which can be reviewed separately, if that helps.

@sysadmind (Contributor)

Looks like there's a merge conflict.

@siavashs (Contributor Author)

So #4607 turned out to be the major bottleneck in dispatch.
But I will simplify this PR and update it since there are still minor improvements we can add.

@siavashs siavashs marked this pull request as draft October 19, 2025 15:56