Skip to content

Conversation

gotjosh
Copy link
Contributor

@gotjosh gotjosh commented Jul 30, 2020

Gives us a way to know whenever a tenant has an invalid configuration in
place.

Signed-off-by: gotjosh [email protected]

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Copy link
Contributor

@jtlisi jtlisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we delete a config: https://github.com/cortexproject/cortex/blob/9e8374f8dfd5eaac8084d50fc9d5936b3657b431/pkg/alertmanager/multitenant.go#L335

I think it would make sense to unregister the metrics for that user? WDYT?

@gotjosh
Copy link
Contributor Author

gotjosh commented Jul 30, 2020

I think it would make sense to unregister the metrics for that user? WDYT?

Originally, I thought that it might look strange that all your other metrics are registered except this one. But maybe it makes sense?

I have added the commit if we decide to not do it I can simply remove it.

gotjosh added 3 commits August 3, 2020 09:34
Gives us a way to know whenever a tenant has an invalid configuration in
place.

Signed-off-by: gotjosh <[email protected]>
@gotjosh gotjosh force-pushed the invalid-configs-metrics branch from f0d9580 to a72a5e1 Compare August 3, 2020 12:26
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the value the per-tenant metric can assume would be only 1 or 0, I'm wondering if we could keep the metric as is (no change) and add a new metric (with a better naming) to cover the new use case.

@gotjosh
Copy link
Contributor Author

gotjosh commented Aug 3, 2020

Given the value the per-tenant metric can assume would be only 1 or 0

There's an extra scenario that this serves as a precedent. If you provide an invalid config whilst you have a valid config, Alertmanager will revert back to the last working configuration.

userAmConfig, err = amconfig.Load(cfg.RawConfig)
if err != nil && hasExisting {
// XXX: This means that if a user has a working configuration and
// they submit a broken one, we'll keep processing the last known
// working configuration, and they'll never know.
// TODO: Provide a way of communicating this to the user and for removing
// Alertmanager instances.
return fmt.Errorf("invalid Cortex configuration for %v: %v", cfg.User, err)

By having the extra-label adding a way to signal this becomes trivial and uniform from a UX perspective (anything other than cortex_alertmanager_configs{status="valid"} == 1 is a failure).

With that said, I'm not fuzzed (as the above would just have to be a new metric then?) - Happy to go with what you think would align best with what we do in other parts of Cortex.

@gotjosh
Copy link
Contributor Author

gotjosh commented Aug 3, 2020

Discussed offline, where Marco suggested that prometheus_config_last_reload_successful looks pretty similar to what we want to do and thus should be a separate metric.

@gotjosh
Copy link
Contributor Author

gotjosh commented Aug 3, 2020

We've settled on cortex_alertmanager_invalid_config

@gotjosh gotjosh force-pushed the invalid-configs-metrics branch from 4ec50cf to c5cc415 Compare August 3, 2020 18:41
@@ -79,7 +80,8 @@ func TestAlertmanagerStoreAPI(t *testing.T) {
err = c.SetAlertmanagerConfig(context.Background(), cortexAlertmanagerUserConfigYaml, map[string]string{})
require.NoError(t, err)

require.NoError(t, am.WaitSumMetrics(e2e.Equals(1), "cortex_alertmanager_configs"))
time.Sleep(2 * time.Second)
Copy link
Contributor Author

@gotjosh gotjosh Aug 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I solved the CI failure using a quick hack, but I dislike Sleep is tests as it a general source of flaky tests. The problem here is the metric does not get initially registered (given it is per tenant) until much later.

A solution I can see would be to create a new method (or argument to WaitSumMetrics) that waits for a metric to appear instead of bailing out if we don't see the metric on the initial metrics page load. As I can see this being useful in other areas.

Please let me know and I'll be happy to make the change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's work on it in a separate PR. Could you open an issue, please? I have this pending PR #2522 which could be used as a baseline to add a functional option to enable the logic you suggest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks marco, filed in #2975

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Copy link
Contributor

@gouthamve gouthamve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a suggestion!

Copy link
Contributor

@jtlisi jtlisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gouthamve gouthamve merged commit b404a95 into cortexproject:master Aug 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants