
bugfix: refactor alerts to accomodate for non-HA clusters #1010


Merged
5 commits merged from KubeCPUOvercommit-SNO into kubernetes-monitoring:master on Jul 24, 2025

Conversation


@rexagod rexagod commented Jan 6, 2025

For the sake of brevity, let:

Q:  kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"} (allocable)
QQ: namespace_cpu:kube_pod_container_resource_requests:sum{} (requested)

Thus, both quota alert expressions relevant here (KubeCPUOvercommit and KubeMemoryOvercommit) exist in the form sum(QQ) by (cluster) - (sum(Q) by (cluster) - max(Q) by (cluster)) > 0 and (sum(Q) by (cluster) - max(Q) by (cluster)) > 0, which, in the case of a single-node cluster (where sum(Q) by (cluster) = max(Q) by (cluster)), reduces to sum(QQ) by (cluster) > 0, i.e., the alert will fire if any resource requests exist at all.

To address this, drop the max(Q) by (cluster) buffer (a spare node's worth of headroom assumed in non-SNO clusters) for SNO, reducing the expression to sum(QQ) by (cluster) - sum(Q) by (cluster) > 0 (total requested - total allocatable > 0 triggers the alert): since there is only a single node, a buffer of that sort does not make sense.
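
For concreteness, a minimal PromQL restatement of the two forms, using the concrete selectors from the definitions above (the actual rules are jsonnet-templated, so selectors and the cluster label may differ):

# Multi-node form: fire when requests exceed allocatable minus one node's worth of headroom.
sum(namespace_cpu:kube_pod_container_resource_requests:sum{}) by (cluster)
  - (sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster)
     - max(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster)) > 0
and
(sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster)
  - max(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster)) > 0

# Proposed single-node form: with sum(Q) = max(Q), drop the headroom and compare requests to allocatable directly.
sum(namespace_cpu:kube_pod_container_resource_requests:sum{}) by (cluster)
  - sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster) > 0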

@rexagod rexagod force-pushed the KubeCPUOvercommit-SNO branch from 5b96fb5 to 5cd53d6 on January 6, 2025 07:07
@rexagod rexagod marked this pull request as ready for review January 6, 2025 07:08
@rexagod rexagod requested review from povilasv and skl as code owners January 6, 2025 07:08
@rexagod rexagod marked this pull request as draft January 6, 2025 07:17
@rexagod rexagod marked this pull request as ready for review January 6, 2025 08:13
and
(sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0
sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) -
sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)
rexagod (Collaborator Author) left a comment

Suggested change:
- sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)
+ 0.95 * sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)

Since a max(Q) buffer is not applicable in SNO, how about a numeric buffer of 5% (or more?)? That should help alert before things go out of budget.
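
Spelled out, the suggestion would make the single-node branch read roughly as follows (a restatement of the suggested change above, not the final rule):

sum(namespace_cpu:kube_pod_container_resource_requests:sum{}) by (cluster)
  - 0.95 * sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster) > 0

i.e., the alert would fire once requests exceed 95% of the node's allocatable CPU, leaving some margin before the node is fully committed.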

@skl skl self-assigned this Jan 7, 2025
@skl skl removed their assignment Jan 13, 2025

@rexagod rexagod left a comment


Plus, in that case we can suggest that users add worker nodes or increase resources on the given node.

LMK if I should append the above to the existing descriptions to make them more relevant in case of SNO.

@rexagod rexagod requested a review from simonpasquier January 21, 2025 13:39
@rexagod rexagod force-pushed the KubeCPUOvercommit-SNO branch from 984a0a3 to 3cd8f3c on January 21, 2025 19:33

@skl skl left a comment


@rexagod I added tests to your PR in eca85fe, hope you don't mind.

I also tested in a dev environment and compared the old rules with the new; I observed effectively the same behaviour, with the addition of support for single-node clusters.

I think the existing alert descriptions are self-explanatory, and imply action is required (reduce requests or increase capacity) regardless of number of nodes.

lgtm!


rexagod commented Jan 27, 2025

Ah, sorry I completely missed adding the tests in my last commit. Thanks a bunch for the contribution! ❤️

I'll ask Simon to take a look here since, with this patch, we are essentially setting the trend for adapting such alerts for SNO instead of dropping them, as we have done internally until now.

@skl skl added the keepalive Use to prevent automatic closing label Feb 3, 2025
@rexagod rexagod force-pushed the KubeCPUOvercommit-SNO branch from 77dc076 to fc9fe6a on February 10, 2025 13:55

rexagod commented Feb 10, 2025

@simonpasquier I've updated the PR to reflect only what upstream is concerned with, i.e., HA and non-HA, leaving logic for any OpenShift-specific variants outside of this patch.

I believe following this trend upstream will keep things vendor-neutral and future-proof.


rexagod commented Feb 10, 2025

I noticed the tests are failing, will fix.

@rexagod rexagod force-pushed the KubeCPUOvercommit-SNO branch 3 times, most recently from 899e369 to 725fe6b, on July 21, 2025 14:14
@rexagod rexagod requested a review from skl July 21, 2025 16:13

rexagod commented Jul 21, 2025

Hello again, @skl! 👋🏼

Apologies for the delay here; it seems this went under the radar after the internal discussions. In the latest revision of this patch, I've addressed the failing tests.

PTAL, thank you!

@rexagod rexagod force-pushed the KubeCPUOvercommit-SNO branch 2 times, most recently from 75fcc1f to 4aea62b, on July 22, 2025 19:07
@rexagod rexagod requested a review from skl July 23, 2025 10:18

@skl skl left a comment


lgtm, thanks for all the work here!

rexagod and others added 4 commits July 23, 2025 22:20
For the sake of brevity, let:
Q:  kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"} (allocable), and,
QQ: namespace_cpu:kube_pod_container_resource_requests:sum{} (requested), thus,
both quota alerts relevant here exist in the form: sum(QQ) by (cluster) - (sum(Q) by (cluster) - max(Q) by (cluster)) > 0 and (sum(Q) by (cluster) - max(Q) by (cluster)) > 0,
which, in case of a single-node cluster (sum(Q) by (cluster) = max(Q) by (cluster)), is reduced to,
sum(QQ) by (cluster) > 0, i.e., the alert will fire if *any* request limits exist.

To address this, drop the "max(Q) by (cluster)" buffer assumed in
non-SNO clusters from SNO, reducing the expression to: sum(QQ) by (cluster) - sum(Q) by (cluster) > 0 (total requested - total allocatable > 0 to trigger the alert),
since there is only a single node, so a buffer of the same sort does not
make sense.

Signed-off-by: Pranshu Srivastava <[email protected]>
Use kube_node_role{role="control-plane"} (see [1] and [2]) to estimate if the cluster is HA or not.
* [1]: https://github.com/search?q=repo%3Akubernetes%2Fkube-state-metrics%20kube_node_role&type=code
* [2]: https://kubernetes.io/docs/reference/labels-annotations-taints/#node-role-kubernetes-io-control-plane

Also drop any thresholds as they would lead to false positives.
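
As a rough sketch of how kube_node_role could gate this (assuming the metric is scraped by kube-state-metrics; the exact wiring in the rule files may differ):

# Treat a cluster as non-HA when it has a single control-plane node.
count(kube_node_role{job="kube-state-metrics",role="control-plane"}) by (cluster) == 1

# The reduced (no-headroom) form of the alert can then be restricted to non-HA clusters, e.g.:
(sum(namespace_cpu:kube_pod_container_resource_requests:sum{}) by (cluster)
  - sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster) > 0)
and on (cluster)
count(kube_node_role{job="kube-state-metrics",role="control-plane"}) by (cluster) == 1
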
@rexagod rexagod changed the title from "bugfix: refactor alerts to accomodate for single-node clusters" to "bugfix: refactor alerts to accomodate for non-HA clusters" on Jul 23, 2025
@rexagod rexagod force-pushed the KubeCPUOvercommit-SNO branch from 4aea62b to 4ae726c on July 23, 2025 19:45

@skl skl left a comment


Thanks, lgtm!

@skl skl merged commit ab4cb2b into kubernetes-monitoring:master Jul 24, 2025
9 checks passed