Skip to content

disable rule groups #5521

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 8, 2023

Conversation

rajagopalanand
Copy link
Contributor

@rajagopalanand rajagopalanand commented Aug 21, 2023

What this PR does:

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@@ -187,6 +197,8 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
f.BoolVar(&cfg.EnableQueryStats, "ruler.query-stats-enabled", false, "Report the wall time for ruler queries to complete as a per user metric and as an info level log message.")
f.BoolVar(&cfg.DisableRuleGroupLabel, "ruler.disable-rule-group-label", false, "Disable the rule_group label on exported metrics")

f.Var(&cfg.DisabledRuleGroups, "ruler.disabled-rule-groups", "Comma separated list of keys of rule groups this ruler cannot evaluate. If specified, a ruler that would normally pick the specified rule groups for processing will ignore them instead.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep the disabled rule groups list in runtime config? It is more flexible

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@rajagopalanand rajagopalanand force-pushed the disable-rule-groups branch 4 times, most recently from 46e7185 to fdbb9cb Compare August 23, 2023 00:55
@pull-request-size pull-request-size bot added size/M and removed size/L labels Aug 23, 2023
@rajagopalanand rajagopalanand force-pushed the disable-rule-groups branch 6 times, most recently from b3f858b to 7075adc Compare August 29, 2023 23:07
@rajagopalanand rajagopalanand force-pushed the disable-rule-groups branch 2 times, most recently from 2c6a3e4 to be3d241 Compare August 30, 2023 19:57
@pull-request-size pull-request-size bot added size/L and removed size/M labels Aug 30, 2023
@rajagopalanand rajagopalanand force-pushed the disable-rule-groups branch 2 times, most recently from 91651bc to 6864bc9 Compare August 30, 2023 20:17
@rajagopalanand rajagopalanand marked this pull request as ready for review August 30, 2023 20:22
@rajagopalanand rajagopalanand force-pushed the disable-rule-groups branch 3 times, most recently from e4eba59 to 80add1d Compare August 30, 2023 21:44
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice pr. I think what I want to see is one more integration test for this feature

if !ruleGroupDisabled(group, disabledRuleGroupsForUser) {
filteredGroupsForUser = append(filteredGroupsForUser, group)
} else {
level.Info(r.logger).Log("msg", "rule group disabled", "rule group name", group.Name, "namespace", group.Namespace, "user", group.User)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we put this as debug level? Wondering if it is really necessary to make info.

Can simplify rule group name to just name

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was thinking that we should do info though, we should have the log to see if we are indeed skip the rule evaluation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see the log with debug level as well as long as you change log level.
Can we tell some rule group is disabled by metrics? We dont have to use logs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am good. Let's keep this as info for now…

continue
switch e := err.(type) {
case *DisabledRuleGroupErr:
level.Info(log).Log("msg", e.Message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it too verbose to log it every time during filtering? And this doesn't feel like an error to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To give visibility, I think it needs to be either Info or Warn

return disabledRuleGroupsForUser
}
}
return DisabledRuleGroups{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return the default limits here? I think someone may wanna set this on the config (instead runtimeconfig) as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and forth on this one. Since user would be required to disable the rule groups to avoid accidentally disabling rule groups from other users that happen to have the same name and namespace, I thought it might be better to allow this config only from runtime config. If we want to allow config, then we need to have user field exposed here so it can be set. If we did that, then someone can configure user under runtime config which is already a map of [user:overrides] which can be confusing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! +1

Comment on lines +564 to +569
if len(disabledRuleGroupsForUser) == 0 {
continue
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do this check right in the beginning of the method and return the original allRuleGroups value if the len is 0, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to loop thru and check if there are overrides for each user, correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh my bad... this is for all users! ok

if len(disabledRuleGroupsForUser) == 0 {
continue
}
filteredGroupsForUser := rulespb.RuleGroupList{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should we use make here to since we have an idea how many elements there will be

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We won't know how many rule groups are disabled until the loop is complete

continue
switch e := err.(type) {
case *DisabledRuleGroupErr:
level.Info(log).Log("msg", e.Message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, all the instance will log this right even if that instance doesn't own the group? I think this will produce too much logs for large clusters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't seems to be an error to me +1

@@ -3093,6 +3093,9 @@ The `limits_config` configures default and per-tenant limits imposed by Cortex s
# alerts will fail with a log message and metric increment. 0 = no limit.
# CLI flag: -alertmanager.max-alerts-size-bytes
[alertmanager_max_alerts_size_bytes: <int> | default = 0]

# list of rule groups to disable
[disabled_rule_groups: <list of rule groups to disable> | default = ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should default be []?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document this format somehow?

We have others "complex configs" like https://cortexmetrics.io/docs/configuration/configuration-file/#memcached_config

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will submit another PR for this. Existing doc generator does not handle slices of structs well

@rajagopalanand rajagopalanand force-pushed the disable-rule-groups branch 2 times, most recently from e327544 to 47cc7c4 Compare September 6, 2023 21:49
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work!

return rlrs.Instances[0].Addr == instanceAddr, nil
ownsRuleGroup := rlrs.Instances[0].Addr == instanceAddr
if ownsRuleGroup && ruleGroupDisabled(g, disabledRuleGroups) {
return false, &DisabledRuleGroupErr{Message: fmt.Sprintf("rule group %s, namespace %s, user %s is disabled", g.Name, g.Namespace, g.User)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Should this return an error? i see why u did that (u wanna catch on the caller layer) but is weird to return an error on something expected.

But im ok with it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree this is a little bit weird... But seems like a workaround here as Go doesn't have a good way to capture such things...

Signed-off-by: Anand Rajagopal <[email protected]>
@alanprot alanprot merged commit 39be93d into cortexproject:master Sep 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants