bugfix: refactor alerts to accomodate for non-HA clusters #1010
5b96fb5 to 5cd53d6
alerts/resource_alerts.libsonnet
Outdated
and
(sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0
sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) -
sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)
sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)
0.95 * sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)
Since a max(Q) buffer is not applicable in SNO, how about a numeric buffer of 5% (or more?)? That should help alert before things go out of budget.
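To make the suggested 5% buffer concrete, here is a minimal Python sketch of the arithmetic behind it. The function name and the sample figures are hypothetical; only the `0.95` factor comes from the suggestion above.

```python
def cpu_overcommitted(requested: float, allocatable: float, buffer: float = 0.05) -> bool:
    """Mirror `sum(requests) - (1 - buffer) * sum(allocatable) > 0` for one cluster."""
    return requested - (1.0 - buffer) * allocatable > 0


# Hypothetical single node with 8 allocatable cores (threshold is 7.6 cores).
print(cpu_overcommitted(7.5, 8.0))              # below the buffered threshold
print(cpu_overcommitted(7.7, 8.0))              # above the buffered threshold
print(cpu_overcommitted(7.7, 8.0, buffer=0.0))  # no buffer: only fires past 8.0
```

With the buffer, the condition becomes true while requests are still below the node's full allocatable capacity, which is the early warning the comment is asking for.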
Plus, in that case we can suggest to the users to add worker nodes or increase resources on the given node.
LMK if I should append the above to the existing descriptions to make them more relevant in case of SNO.
984a0a3 to 3cd8f3c
@rexagod I added tests to your PR in eca85fe, hope you don't mind.
I also tested in a dev environment and compared the old rules with the new ones: I observed effectively the same behaviour, with the addition of support for single-node clusters.
I think the existing alert descriptions are self-explanatory, and imply action is required (reduce requests or increase capacity) regardless of the number of nodes.
lgtm!
Ah, sorry I completely missed adding the tests in my last commit. Thanks a bunch for the contribution! ❤️ I'll ask Simon to take a look here since with this patch we are essentially setting the trend for adapting such alerts for SNO, instead of dropping them as we have done internally (till now).
77dc076 to fc9fe6a
@simonpasquier I've updated the PR to reflect only what upstream is concerned with, i.e., HA and non-HA, leaving logic for any OpenShift-specific variants outside of this patch. I believe following this trend upstream will be vendor-neutral and future-safe.
I noticed the tests are failing, will fix.
899e369 to 725fe6b
Hello again, @skl! 👋🏼 Apologies for the delay here, it seems this went under the radar after the discussions internally. In the latest revision for this patch, I've addressed the failing tests. PTAL, thank you!
f899f9b to e892fd2
75fcc1f to 4aea62b
lgtm, thanks for all the work here!
For the sake of brevity, let:
* Q: kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"} (allocable), and,
* QQ: namespace_cpu:kube_pod_container_resource_requests:sum{} (requested),

thus, both quota alerts relevant here exist in the form:
sum(QQ) by (cluster) - (sum(Q) by (cluster) - max(Q) by (cluster)) > 0 and (sum(Q) by (cluster) - max(Q) by (cluster)) > 0,
which, in case of a single-node cluster (sum(Q) by (cluster) = max(Q) by (cluster)), is reduced to,
sum(QQ) by (cluster) > 0,
i.e., the alert will fire if *any* resource requests exist.

To address this, drop the "max(Q) by (cluster)" buffer assumed in non-SNO clusters from SNO, reducing the expression to:
sum(QQ) by (cluster) - sum(Q) by (cluster) > 0 (total requested - total allocable > 0 to trigger alert),
since there is only a single node, so a buffer of the same sort does not make sense.

Signed-off-by: Pranshu Srivastava <[email protected]>
Use kube_node_role{role="control-plane"} (see [1] and [2]) to estimate if the cluster is HA or not.

* [1]: https://github.com/search?q=repo%3Akubernetes%2Fkube-state-metrics%20kube_node_role&type=code
* [2]: https://kubernetes.io/docs/reference/labels-annotations-taints/#node-role-kubernetes-io-control-plane

Also drop any thresholds as they would lead to false positives.
Signed-off-by: Pranshu Srivastava <[email protected]>
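As a rough illustration of the idea in this commit message, the Python sketch below counts the nodes that carry the control-plane role and treats the cluster as HA once that count reaches a threshold. The function name, the sample node data, and the threshold of three control-plane nodes are assumptions made for illustration; the commit message itself does not state the cutoff.

```python
from typing import Mapping, Set


def looks_highly_available(node_roles: Mapping[str, Set[str]], min_control_plane: int = 3) -> bool:
    """Estimate HA by counting nodes with the control-plane role.

    `min_control_plane` is an assumed cutoff, not a value taken from the patch.
    """
    control_plane_nodes = [node for node, roles in node_roles.items() if "control-plane" in roles]
    return len(control_plane_nodes) >= min_control_plane


# Hypothetical clusters: one single-node, one with three control-plane nodes.
single_node = {"node-a": {"control-plane", "worker"}}
three_masters = {
    "cp-0": {"control-plane"},
    "cp-1": {"control-plane"},
    "cp-2": {"control-plane"},
    "worker-0": {"worker"},
}

print(looks_highly_available(single_node))
print(looks_highly_available(three_masters))
```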
4aea62b to 4ae726c
Thanks, lgtm!
For the sake of brevity, let:
* Q: kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"} (allocable), and,
* QQ: namespace_cpu:kube_pod_container_resource_requests:sum{} (requested),

thus, both quota alert expressions relevant here (KubeCPUOvercommit and KubeMemoryOvercommit) exist in the form:
sum(QQ) by (cluster) - (sum(Q) by (cluster) - max(Q) by (cluster)) > 0 and (sum(Q) by (cluster) - max(Q) by (cluster)) > 0,
which, in case of a single-node cluster (sum(Q) by (cluster) = max(Q) by (cluster)), is reduced to,
sum(QQ) by (cluster) > 0,
i.e., the alert will fire if any resource requests exist.

To address this, drop the max(Q) by (cluster) buffer assumed in non-SNO clusters from SNO, reducing the expression to:
sum(QQ) by (cluster) - sum(Q) by (cluster) > 0 (total requested - total allocable > 0 to trigger alert),
since there is only a single node, so a buffer of the same sort does not make sense.
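The reduction described above can be checked numerically with a small Python sketch. It models only the first comparison of the old expression (the part the reduction is about) and the proposed single-node comparison; the function names and the core counts are hypothetical.

```python
from typing import Sequence


def old_headroom_check(requests: Sequence[float], allocatable: Sequence[float]) -> bool:
    """sum(QQ) - (sum(Q) - max(Q)) > 0, i.e. requests exceed capacity minus the largest node."""
    return sum(requests) - (sum(allocatable) - max(allocatable)) > 0


def single_node_check(requests: Sequence[float], allocatable: Sequence[float]) -> bool:
    """sum(QQ) - sum(Q) > 0, i.e. requests exceed total allocatable capacity."""
    return sum(requests) - sum(allocatable) > 0


one_node = [8.0]               # single node, so sum(Q) == max(Q)
three_nodes = [8.0, 8.0, 8.0]  # sum(Q) - max(Q) == 16.0

print(old_headroom_check([0.5], one_node))            # True: any request at all trips it
print(single_node_check([0.5], one_node))             # False: 0.5 of 8 cores requested
print(single_node_check([8.5], one_node))             # True: requests exceed the node
print(old_headroom_check([10.0, 5.0], three_nodes))   # False: 15 fits in 16
print(old_headroom_check([10.0, 7.0], three_nodes))   # True: 17 does not fit in 16
```

On the single node, the old comparison is true for a 0.5-core request, while the reduced form only becomes true once requests exceed the node's allocatable capacity.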