AM Metric: Add tenant label to valid/invalid configs #2960
When we delete a config: https://github.com/cortexproject/cortex/blob/9e8374f8dfd5eaac8084d50fc9d5936b3657b431/pkg/alertmanager/multitenant.go#L335
I think it would make sense to unregister the metrics for that user? WDYT?
Originally, I thought it might look strange that all your other metrics are registered except this one. But maybe it makes sense? I have added the commit; if we decide not to do it, I can simply remove it.
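To make the suggestion above concrete, here is a minimal sketch of removing a tenant's series when its config is deleted. It uses only the standard library: `tenantGauge` and its methods are hypothetical stand-ins, not Cortex code. With the Prometheus Go client, the equivalent step would be calling `DeleteLabelValues` on a `GaugeVec`.

```go
package main

import "sync"

// tenantGauge is a hypothetical stand-in for a gauge with a single
// per-tenant label. It is not the Cortex or Prometheus implementation.
type tenantGauge struct {
	mu     sync.Mutex
	values map[string]float64
}

func newTenantGauge() *tenantGauge {
	return &tenantGauge{values: make(map[string]float64)}
}

// Set records the value for one tenant, creating the series if needed.
func (g *tenantGauge) Set(user string, v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[user] = v
}

// Get returns the tenant's value and whether a series exists for it.
func (g *tenantGauge) Get(user string) (float64, bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	v, ok := g.values[user]
	return v, ok
}

// Delete drops the tenant's series and reports whether one existed.
// This is the step that would run when a tenant's config is deleted.
func (g *tenantGauge) Delete(user string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	_, ok := g.values[user]
	delete(g.values, user)
	return ok
}

// Series returns the number of tenants that currently have a series.
func (g *tenantGauge) Series() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.values)
}
```

Without the `Delete` call, a removed tenant would keep exporting its last value indefinitely, which is the concern behind unregistering on config deletion.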
Gives us a way to know whenever a tenant has an invalid configuration in place. Signed-off-by: gotjosh <[email protected]>
f0d9580 to a72a5e1
Given that the per-tenant metric can only take the value 1 or 0, I'm wondering if we could keep the existing metric as is (no change) and add a new metric (with a better name) to cover the new use case.
There's an extra scenario for which this serves as a precedent. If you provide an invalid config while a valid config is in place, Alertmanager reverts to the last working configuration (see cortex/pkg/alertmanager/multitenant.go, lines 416 to 423 in 6db67a4).
With the extra label, adding a way to signal this becomes trivial and uniform from a UX perspective (anything other than …). With that said, I'm not fussed (as the above would just have to be a new metric then?). Happy to go with what you think aligns best with what we do in other parts of Cortex.
Discussed offline, where Marco suggested that …
We've settled on …
Signed-off-by: gotjosh <[email protected]>
4ec50cf to c5cc415
@@ -79,7 +80,8 @@ func TestAlertmanagerStoreAPI(t *testing.T) {
	err = c.SetAlertmanagerConfig(context.Background(), cortexAlertmanagerUserConfigYaml, map[string]string{})
	require.NoError(t, err)

	require.NoError(t, am.WaitSumMetrics(e2e.Equals(1), "cortex_alertmanager_configs"))
	time.Sleep(2 * time.Second)
I solved the CI failure using a quick hack, but I dislike Sleep in tests, as it is a general source of flaky tests. The problem here is that the metric does not get registered initially (given it is per tenant) until much later.
A solution I can see would be to create a new method (or an argument to WaitSumMetrics) that waits for a metric to appear instead of bailing out if we don't see the metric on the initial metrics page load. I can see this being useful in other areas.
Please let me know and I'll be happy to make the change.
Let's work on it in a separate PR. Could you open an issue, please? I have this pending PR #2522 which could be used as a baseline to add a functional option to enable the logic you suggest.
Thanks Marco, filed in #2975
LGTM, thanks!
LGTM with a suggestion!
Signed-off-by: gotjosh <[email protected]>
LGTM
Gives us a way to know whenever a tenant has an invalid configuration in place.
Signed-off-by: gotjosh [email protected]
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]