Contributor

@gotjosh gotjosh commented Sep 21, 2020

What this PR does:

Adds a new metric that tracks when the last uploaded configuration was successfully applied to the Alertmanager. Also improves the documentation.

Still needs a test, but I wanted to see whether this was an acceptable solution. An alternative is to stop the Alertmanager completely if we fail to load the last configuration and signal to the user that the configuration did not work.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Contributor

@pracucci pracucci left a comment

[...] wanted to see whether this was an acceptable solution. An alternative here is that we stop the Alertmanager completely if we fail to load the last configuration and signal to the user that the configuration did not work.

I have a question: how is the new metric different from what cortex_alertmanager_config_invalid already does?

[open question] What if we keep the metric name consistent with the one in Prometheus Alertmanager, i.e. cortex_alertmanager_config_last_reload_success_timestamp_seconds?

Below we do am.multitenantMetrics.invalidConfig.DeleteLabelValues(user). Shouldn't we also do it for this new metric?
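
To make the cleanup concern concrete, here is a minimal, stdlib-only sketch. Everything in it is an assumption for illustration: `tenantGauge` is a hypothetical stand-in for a per-tenant Prometheus gauge vector, and `tenantMetrics`, `configApplied`, and `tenantRemoved` are invented names, not the Cortex code. The point is only that every per-user series set on a successful apply should also be dropped when the user goes away.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tenantGauge is a hypothetical, stdlib-only stand-in for a gauge vector
// labelled by user. It is not the Prometheus client type.
type tenantGauge struct {
	mu     sync.Mutex
	values map[string]float64
}

func newTenantGauge() *tenantGauge {
	return &tenantGauge{values: map[string]float64{}}
}

// Set records the value of the series for one user.
func (g *tenantGauge) Set(user string, v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[user] = v
}

// Delete drops the user's series and reports whether it existed.
func (g *tenantGauge) Delete(user string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	_, ok := g.values[user]
	delete(g.values, user)
	return ok
}

// Has reports whether a series exists for the user.
func (g *tenantGauge) Has(user string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	_, ok := g.values[user]
	return ok
}

// tenantMetrics groups the two per-user series discussed in this thread.
type tenantMetrics struct {
	invalidConfig     *tenantGauge // 1 if the last uploaded config was invalid (assumed semantics)
	lastReloadSuccess *tenantGauge // unix seconds of the last successful apply
}

func newTenantMetrics() *tenantMetrics {
	return &tenantMetrics{
		invalidConfig:     newTenantGauge(),
		lastReloadSuccess: newTenantGauge(),
	}
}

// configApplied records a successful apply for the user.
func (m *tenantMetrics) configApplied(user string, unixSeconds int64) {
	m.invalidConfig.Set(user, 0)
	m.lastReloadSuccess.Set(user, float64(unixSeconds))
}

// tenantRemoved deletes every per-user series so that no stale series is
// left behind once the user no longer has an Alertmanager.
func (m *tenantMetrics) tenantRemoved(user string) {
	m.invalidConfig.Delete(user)
	m.lastReloadSuccess.Delete(user)
}

func main() {
	m := newTenantMetrics()
	m.configApplied("user-1", time.Now().Unix())
	m.tenantRemoved("user-1")
	fmt.Println(m.invalidConfig.Has("user-1"), m.lastReloadSuccess.Has("user-1"))
}
```

Keeping both deletions in a single helper is one way to avoid forgetting a series when a new per-user metric is added later.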

@pracucci
Contributor

@gotjosh Is there still interest in this PR?

@gotjosh
Contributor Author

gotjosh commented Nov 18, 2020

This was superseded by #3289, but the review comments still apply. I'll modify the PR, as I think they are very relevant for upcoming work.

@gotjosh gotjosh marked this pull request as ready for review November 18, 2020 17:14
@gotjosh gotjosh changed the title from "Alertmanager: Introduce new metric for successful reloads" to "Alertmanager: Remove outdated comment" Nov 19, 2020
Contributor

@jtlisi jtlisi left a comment

LGTM

// 1) the user had a previous alertmanager
// 2) then, submitted a non-working configuration (and we kept running the prev working config)
// 3) finally, the cortex AM instance is restarted and the running version is no longer present
if userAmConfig == nil {

suggestion(non-blocking,if-minor): Since this check has been moved out of transformConfig, consider removing the transformConfig function entirely and moving its for loop logic here. transformConfig is only called from here, so it should be safe to move, and I think it would still be quite readable as part of a single function.
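
For readers following the three-step scenario in the quoted code comment, here is a rough sketch of a nil check of that kind. It is not the code from this PR: `amConfig`, `chooseConfig`, `parseNonEmpty`, and `errNoUsableConfig` are hypothetical names, and the fallback logic is an assumption based only on the comment text.

```go
package main

import (
	"errors"
	"fmt"
)

// amConfig is a hypothetical stand-in for a user's Alertmanager configuration.
type amConfig struct {
	Raw string
}

// errNoUsableConfig is a hypothetical sentinel error for this sketch.
var errNoUsableConfig = errors.New("no usable Alertmanager configuration")

// parseNonEmpty is a toy validator: only an empty upload is rejected.
func parseNonEmpty(raw string) (*amConfig, error) {
	if raw == "" {
		return nil, errors.New("empty configuration")
	}
	return &amConfig{Raw: raw}, nil
}

// chooseConfig prefers the uploaded configuration and falls back to the one
// that is already running when the upload does not parse.
func chooseConfig(uploaded string, running *amConfig, parse func(string) (*amConfig, error)) (*amConfig, error) {
	cfg, err := parse(uploaded)
	if err != nil {
		// Step 2: the upload is broken, so keep the previous working config.
		cfg = running
	}
	if cfg == nil {
		// Step 3: after a restart there is no running config to fall back to.
		if err == nil {
			err = errors.New("parser returned no configuration")
		}
		return nil, fmt.Errorf("%w: %v", errNoUsableConfig, err)
	}
	return cfg, nil
}

func main() {
	// A restarted instance has no running config, so a broken upload fails.
	_, err := chooseConfig("", nil, parseNonEmpty)
	fmt.Println(errors.Is(err, errNoUsableConfig))
}
```

Wrapping the parse error with `%w` on the sentinel lets callers test for the "nothing to run" case with `errors.Is` while the underlying cause stays in the message.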

Contributor

@pracucci pracucci left a comment

LGTM!

@pull-request-size pull-request-size bot added size/M and removed size/S labels Nov 20, 2020
@gotjosh
Contributor Author

gotjosh commented Nov 20, 2020

@pracucci This should be good to go!

@pracucci pracucci merged commit e3dd1d6 into cortexproject:master Nov 20, 2020