Skip to content

Conversation

simonswine
Copy link
Contributor

@simonswine simonswine commented Jul 14, 2020

What this PR does:

This adds metrics for exposing the the sha256 hash of the config files used (before expansion). This allows to monitor the roll-out of a new config file across the cluster and can help to detect a mismatch in active configuration files.

The metric would look like this:

cortex_config_info{hash="sha256:86ec4c62f1a0ecc2eb111b752d6aa06db0d361c9f8d8fa6fc703ce6b85532458"} 1

As no expansion is going on this matches with the local file:

86ec4c62f1a0ecc2eb111b752d6aa06db0d361c9f8d8fa6fc703ce6b85532458  ./docs/configuration/single-process-config.yaml

Which issue(s) this PR fixes:

No issue yet

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@simonswine
Copy link
Contributor Author

@annanay25 if you get the chance can you take a look at this?

@pstibrany
Copy link
Contributor

Thanks for your first PR!

This looks like low-risk change, although I'm not sure what it's trying to solve.

For admin trying to see if his Cortex instances use the same config file, I think pre-expansion hash would make tiny bit more sense.

You make a good point about -runtime-config.file – I think it would be nice to extend the support for that one too.

@simonswine
Copy link
Contributor Author

This looks like low-risk change, although I'm not sure what it's trying to solve.

I think I have buried that in the commit message: This allows to monitor the roll-out of a new config file across the
cluster and can help to detect a mismatch in configuration files.

For admin trying to see if his Cortex instances use the same config file, I think pre-expansion hash would make tiny bit more sense.

That was my feeling, so I am assuming it's generally the node specific that's expanded using environment variables

You make a good point about -runtime-config.file – I think it would be nice to extend the support for that one too.

I will have a look how this can be achieved, will come back to you on this

Name: "cortex_overrides_last_reload_successful",
Help: "Whether the last config reload attempt was successful.",
Name: "cortex_runtime_config_last_reload_successful",
Help: "Whether the last runtime-config reload attempt was successful.",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed that because i found this mismatch between CLI argument and metric name quite confusing.

Not too sure if this is subject to a deprecation process or if we just should use the cortex_overrides for both

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK to change (if you mention it in CHANGELOG.md and update existing alerts too in https://github.com/grafana/cortex-jsonnet repo) but I'm not sure if metric names are actually part of our V1 guarantees. It's not mentioned in the docs/configuration/v1-guarantees.md doc. @gouthamve what do you say?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked with Goutham, we don't cover metrics in guarantees, so it's ok to rename it. (with changelog and alerts update)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, following this lets make the changes in the cortex-mixin as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonswine As Annanay correctly mentioned, please don't forget to open a PR to update the mixin once this PR will be merged, thanks! 🙏 🙏 🙏

@simonswine
Copy link
Contributor Author

I have updated the code to also support the runtime config.

A couple of notes:

  • I still need to figure a nice way to observe the metric values for testing this. Are there any good examples how to do this?
  • I went for having separate metrics for config and runtime-config. Is that ok, should we use a shared metric with a label type=startup and type=runtime?
  • Also look at my inline comments

Would be great if I can get some input from one of you @annanay25 / @pstibrany

@pstibrany
Copy link
Contributor

pstibrany commented Jul 15, 2020

  • I still need to figure a nice way to observe the metric values for testing this. Are there any good examples how to do this?

Try to look for usage of github.com/prometheus/client_golang/prometheus/testutil package, there are many tests in Cortex codebase, either for individual metrics or entire registry.

  • I went for having separate metrics for config and runtime-config. Is that ok, should we use a shared metric with a label type=startup and type=runtime?

I think separate metrics is preferable in this case, since it's different components of the system.

  • Also look at my inline comments

Will try.

Name: "cortex_overrides_last_reload_successful",
Help: "Whether the last config reload attempt was successful.",
Name: "cortex_runtime_config_last_reload_successful",
Help: "Whether the last runtime-config reload attempt was successful.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK to change (if you mention it in CHANGELOG.md and update existing alerts too in https://github.com/grafana/cortex-jsonnet repo) but I'm not sure if metric names are actually part of our V1 guarantees. It's not mentioned in the docs/configuration/v1-guarantees.md doc. @gouthamve what do you say?

@simonswine simonswine force-pushed the expose-config-hash branch 2 times, most recently from 24e42d9 to e83f01e Compare July 15, 2020 13:32
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 @simonswine! First time contributio? Amazing! Looking forward for many more ❤️

Please remember to add a CHANGELOG entry. The cortex_overrides_last_reload_successful rename should be a [CHANGE], while other changes an [ENHANCEMENT] (so two different entries). Also don't forget to add the PR number (look at other entries as an example).

@simonswine simonswine force-pushed the expose-config-hash branch from e83f01e to 0d43f84 Compare July 15, 2020 16:29
@simonswine
Copy link
Contributor Author

I think I have addressed most of the feedback. Github will not show my pushes for a while, as they have some problems, but once you are on 696b9bfe9, I should have addressed most concerns

@simonswine simonswine force-pushed the expose-config-hash branch 2 times, most recently from 0d43f84 to 9a9d77b Compare July 15, 2020 17:01
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job on your 🥇 time contribution! LGTM and looking forward to many more 🚀

Name: "cortex_overrides_last_reload_successful",
Help: "Whether the last config reload attempt was successful.",
Name: "cortex_runtime_config_last_reload_successful",
Help: "Whether the last runtime-config reload attempt was successful.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonswine As Annanay correctly mentioned, please don't forget to open a PR to update the mixin once this PR will be merged, thanks! 🙏 🙏 🙏

simonswine added a commit to simonswine/cortex-jsonnet that referenced this pull request Jul 17, 2020
As part of cortexproject/cortex#2874 the metric
was renamed, this PR renames the alert accordingly and supports both the
old and the new metric name
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you! (I've left some non-blocking nits.)

This allows to monitor the roll-out of a new config file across the
cluster and can helps to detect a mismatch in active config files. It
supports both start-up and runtime config.

Signed-off-by: Christian Simon <[email protected]>
@simonswine simonswine force-pushed the expose-config-hash branch from e0b8dd8 to 9c44169 Compare July 17, 2020 15:49
@simonswine
Copy link
Contributor Author

I think most if not all feedback should be addressed now 🎉

The change in the jsonnet alerts has happened here: grafana/cortex-jsonnet#139

@simonswine simonswine changed the title Expose hashsum of the config used at startup Expose hashsum of the config files Jul 17, 2020
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you very much for your work and addressing our feedback!

@pstibrany pstibrany merged commit cb7c60c into cortexproject:master Jul 17, 2020
@simonswine
Copy link
Contributor Author

Thanks for your responsive feedback. Have a great weekend 👋

simonswine added a commit to simonswine/cortex-jsonnet that referenced this pull request Jul 21, 2020
This allow to monitor the roll out of new config file versions to the
various nodes of a cluster. The metric was added as part of
cortexproject/cortex#2874.
simonswine added a commit to simonswine/cortex-jsonnet that referenced this pull request Jul 21, 2020
This allow to monitor the roll out of new config file versions to the
various nodes of a cluster. The metric was added as part of
cortexproject/cortex#2874.
simonswine added a commit to simonswine/cortex-jsonnet that referenced this pull request Jul 28, 2020
This allow to monitor the roll out of new config file versions to the
various nodes of a cluster. The metric was added as part of
cortexproject/cortex#2874.
simonswine added a commit to grafana/mimir that referenced this pull request Oct 18, 2021
As part of cortexproject/cortex#2874 the metric
was renamed, this PR renames the alert accordingly and supports both the
old and the new metric name
simonswine added a commit to grafana/mimir that referenced this pull request Oct 18, 2021
This allow to monitor the roll out of new config file versions to the
various nodes of a cluster. The metric was added as part of
cortexproject/cortex#2874.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants