fix: show actual event kinds in Queue mode #706

trutx · 2022-10-18T09:28:42Z

Issue #704

Description of changes:

When working in Queue processor mode all interruption events have kind SQS_TERMINATE. This is wrong because this value reflects the operation mode or the kind of monitor handling the event and not the actual event kind. Besides SQS_TERMINATE is a pretty inaccurate and confusing term. However, when working in IMDS mode each event Kind field contains the real AWS event kind.

In this PR I'm fixing this inconsistency by making every InterruptionEvent.Kind field show the actual event kind and not the monitor kind that handled it. I'm also adding a new Monitor field that will hold the monitor kind for each event. An AWS rebalance recommendation event (or any other) is the same event regardless of where NTH gets it from (IMDS or SQS), so for the sake of consistency now all events in both operation modes will have the same Kind value and event kinds in SQS mode won't be masked behind a vague SQS_TERMINATE term.

Sample log lines (anonymized):

Before

2022/10/14 08:14:27 INF Adding new event to the event store event={"AutoScalingGroupName":"my-asg","Description":"Rebalance recommendation event received. Instance i-aaaaaaaaaaaaaaaaaa will be cordoned at 2022-10-14 08:14:26 +0000 UTC \n","EndTime":"0001-01-01T00:00:00Z","EventID":"rebalance-recommendation-event-65646564353331632d306265352d396161352d623aaaaaaaa366664313162636632613233","InProgress":false,"InstanceID":"i-aaaaaaaaaaaaaaaaa","IsManaged":true,"Kind":"SQS_TERMINATE","NodeLabels":null,"NodeName":"ip-1-2-3-4.us-east-2.compute.internal","NodeProcessed":false,"Pods":null,"ProviderID":"aws:///us-east-2c/i-aaaaaaaaaaaaaaaaa","StartTime":"2022-10-14T08:14:26Z","State":""}

After

2022/10/18 04:07:57 INF Adding new event to the event store event={"AutoScalingGroupName":"my-asg","Description":"Rebalance recommendation event received. Instance i-xxxxxxxxxxxxxxxxx will be cordoned at 2022-10-18 04:07:56 +0000 UTC \n","EndTime":"0001-01-01T00:00:00Z","EventID":"rebalance-recommendation-event-62386430343663302d323831352d633731312d646232312d373938316562623935633466","InProgress":false,"InstanceID":"i-xxxxxxxxxxxxxxxxx","IsManaged":true,"Kind":"REBALANCE_RECOMMENDATION","Monitor":"SQS_MONITOR","NodeLabels":null,"NodeName":"ip-1-2-3-4.us-east-2.compute.internal","NodeProcessed":false,"Pods":null,"ProviderID":"aws:///us-east-2c/i-xxxxxxxxxxxxxxxxx","StartTime":"2022-10-18T04:07:56Z","State":""}

This PR doesn't change the basic NTH behavior but it does change its output and this might break monitoring systems based on such output. I.e. systems watching SQSTermination Kubernetes events or parsing log lines in search for "Kind":"SQS_TERMINATE" will have to be updated.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

snay2 · 2022-11-01T22:05:07Z

I like the direction of these changes, as they give cluster operators a clearer picture of what NTH is actually doing. However, we will have to be clear about what changes will be made: as you pointed out, customers with monitoring systems configured to look for specific strings would have to update their monitoring after upgrading.

It may be possible to make the logging changes backwards compatible by leaving the existing event_type and kind fields intact and adding new fields for the more specific event types and monitor types. However, the k8s events only have a single "reason" field, so it's not possible to make it entirely backwards compatible, even if we include both the old and the new strings in that field. For these reasons, I'm leaning toward making this change but calling out the differences explicitly as part of a minor version bump.

Potential description of the logging changes to be included in the release notes:

This release modifies several Debug, Info, and Warn logs, as well as Kuberentes events emitted by NTH. These changes primarily affect NTH when running in Queue Processor mode. These changes improve your observability about what NTH is doing when responding to events via SQS. If your monitoring system is configured to look for any of the specific strings in the tables below, you may need to modify your configuration to use the updated strings instead.

Changes to logs when starting up

Remove event_type field from the Info log when starting a monitor; replace with monitor_type field, with new values. See Table 1.
Remove event_type field from the Warn log when a monitor fails to start; replace with monitor_type field, with new values. See Table 1.

Changes to logs when processing an event

New monitor field in the Info log. See Table 1.
Potentially change value of kind field in the Info log, if running Queue Processor mode. See Table 2.
Potentially change the "reason" field in the k8s event if running Queue Processor mode. See Table 3.

Changes to logs when receiving an SQS message

Include the specific event type instead of SQS_TERMINATE in the Debug log if running Queue Processor mode. See Table 2.

Tables of changed values

Table 1: Monitor types

Previous	New
`REBALANCE_RECOMMENDATION`	`REBALANCE_RECOMMENDATION_MONITOR`
`SCHEDULED_EVENT`	`SCHEDULED_EVENT_MONITOR`
`SPOT_ITN`	`SPOT_ITN_MONITOR`
`SQS_TERMINATE`	`SQS_MONITOR`

Table 2: Event types

Previous	New
`REBALANCE_RECOMMENDATION`	`REBALANCE_RECOMMENDATION`
`SCHEDULED_EVENT`	`SCHEDULED_EVENT`
`SPOT_ITN`	`SPOT_ITN`
`SQS_TERMINATE`	`REBALANCE_RECOMMENDATION` `SCHEDULED_EVENT` `SPOT_ITN` `STATE_CHANGE` `ASG_LIFECYCLE`

Table 3: Event reasons

Previous reason	New reason
`RebalanceRecommendation`	`RebalanceRecommendation`
`ScheduledEvent`	`ScheduledEvent`
`SpotInterruption`	`SpotInterruption`
`SQSTermination`	`RebalanceRecommendation` `ScheduledEvent` `SpotInterruption` `StateChange` `ASGLifecycle`

trutx · 2022-11-02T09:49:30Z

Yes, like you said there is no way around a breaking change, so given we're doing it let's just group all breaking changes in one release bump.

The release notes and tables you added look great to me, I would only make the table 2 and 3 titles explicitly clear so there is no confusion about what events we are talking about in each case. Something like Table 2: AWS event types and Table 3: Kubernetes event reasons. For the same reason in the text above I would add an AWS in one sentence, replacing when responding to events via SQS with when responding to AWS events via SQS.

snay2

After discussing with my team, we’ve decided we need to prioritize backwards compatibility as much as possible. We propose the following changes:

Keep any existing fields in debug, info, or warn logs the same
Where the previous information was incorrect or incomplete, add a new field with an appropriate name to convey the new information.

This way, customers who currently rely on specific values in specific fields of the log messages will continue to receive those, but customers who want the new values can add them into their monitoring configurations. We don’t have any way of knowing which customers are using the existing fields for monitoring, and we also don’t have a reliable way to warn customers about the change before they upgrade, especially if they don’t religiously follow our release notes.

So even though your proposal creates more accurate logs, I’m asking you to change it in the following ways:

Keep the Kind property as-is. That is, it will basically indicate the monitor type, not the event type.
Instead of a new property called Monitor, create a new property called AWSEventDetails (open to suggestions on the name, but the idea is that it has a more detailed, more accurate description of the actual AWS event). This stays the same for the IMDS-based monitors, but the SQS-based monitor will specify the actual AWS event type here.
dAssociated logging changes to reflect this (described in inline comments).

All that said, I think we should do logging right in v2 (currently in alpha) so we don’t have this backwards compatibility debt in the future. @cjerad is working on that and has been part of my discussions internally, so I’ll make sure he’s in the loop.

Thank you and your team for all your hard work on this so far! Sorry about the churn.

snay2 · 2022-11-04T21:14:33Z

pkg/monitor/types.go

 	"github.com/aws/aws-node-termination-handler/pkg/node"
 )

+const (


I like that you centralized the event types here--keep this!

snay2 · 2022-11-04T21:21:32Z

pkg/monitor/types.go

 type InterruptionEvent struct {
 	EventID              string
 	Kind                 string
+	Monitor              string


Let's rename this AWSEventDetails (open to suggestions on the name if you disagree). Because the Kind field basically contained the monitor type in the past, let's keep that behavior even though the name is sort of wrong. This new field will contain the accurate description of the actual AWS event (e.g., AWSEventDetails = "SPOT_ITN" and Kind = "SQS_TERMINATE").

Same as above, if it's only meant to contain the AWS event type I'd call this field AWSEventType or AWSEventKind, but that's just a suggestion.

Good idea. I like AWSEventType.

snay2 · 2022-11-04T21:23:03Z

pkg/monitor/sqsevent/sqs-monitor.go


-// Kind denotes the kind of event that is processed
+// Kind denotes the kind of monitor
 func (m SQSMonitor) Kind() string {


I'm opening to renaming this method to MonitorKind() so that it's more clear internally that we are referring to the monitor type, not the event type. Similar change on the IMDS-based monitors.

Yep, sounds good. Will do.

snay2 · 2022-11-04T21:24:10Z

cmd/node-termination-handler.go

+						Str("monitor", event.Monitor).
 						Str("node-name", event.NodeName).
 						Str("instance-id", event.InstanceID).
 						Str("provider-id", event.ProviderID).


On line 252 when we emit the Kubernetes event, we need to keep the "reason" argument the same as it was before (observability.GetReasonForKind(event.Kind)).

However, let's add another argument at the end, after event.Description that includes the AWSEventDetails. I'd suggest we use the "screaming snake case" names without conversion, but I'm open to doing a conversion like GetReasonForKind() does right now if you disagree.

Will do, but I have to respectfully say I'm against this one. In all other cases the old and wrong field, data, log line or whatever can be kept and coexist with a new and fixed one. In this particular case like you pointed out there's only one reason, so either we use the right or the wrong reason. Kubernetes event reasons are the cornerstone of event-based Kubernetes monitoring, and we're choosing to leave it wrong. I understand backwards compatibility but in this case here we're not talking about an enhancement or a new feature but a fix, and I think not fixing things that are known to be wrong is not the best way to go.

That said, I think the best option is to leave the recorder.Emit() call in 252 as is and instead modify the Description field generation in all monitors. I.e., we would change line 93 in spot-itn-monitor.go to something like

Description: fmt.Sprintf("Spot ITN received. Instance will be interrupted at %s. AWS event type: %s \n", instanceAction.Time, monitor.SpotITNKind),

or maybe

Description: fmt.Sprintf("Spot ITN received (%s). Instance will be interrupted at %s \n", monitor.SpotITNKind, instanceAction.Time),

WDYT?

Are you also intending to keep the return values of observability.GetReasonForKind() as they were before, such that the only change to the argument values in recorder.Emit() would be this updated Description field? If so, I am in favor of that plan.

I like your first suggestion (AWS event type: ... at the end of the Description).

Yes, if we are to keep the old and wrong reason there is no other option than reverting the changes in GetReasonForKind() so they return the old values.

snay2 · 2022-11-04T21:25:27Z

cmd/node-termination-handler.go

 					log.Info().
 						Str("event-id", event.EventID).
 						Str("kind", event.Kind).
+						Str("monitor", event.Monitor).


Let's call this aws-event-details or perhaps event-type-details, and have it reflect the underlying AWS event. kind will continue to refer to the monitor type as it did in the past, to preserve backwards compatibility.

I'd just call this aws-event-type or aws-event-kind. IMO adding details here is confusing because the field only contains one single detail.

Agreed. Let's go with aws-event-type.

snay2 · 2022-11-04T21:27:16Z

cmd/node-termination-handler.go

 	for _, fn := range monitoringFns {
 		go func(monitor monitor.Monitor) {
-			log.Info().Str("event_type", monitor.Kind()).Msg("Started monitoring for events")
+			log.Info().Str("monitor_type", monitor.Kind()).Msg("Started monitoring for events")


Let's use the original field name event_type to preserve backwards compatibility, even though that's technically not correct. I'm open to adding another field named monitor_type as well, to make the meaning more clear, but we need to keep the existing field name.

So you want to keep the event_type field but change its value? Old values here were consts like RebalanceRecommendationKind = "REBALANCE_RECOMMENDATION" while new ones are consts like RebalanceRecommendationMonitorKind = "REBALANCE_RECOMMENDATION_MONITOR". With that we are keeping the field name but changing its value so we're effectively introducing breaking changes anyway when it comes to people parsing such fields (I guess they are also parsing the field value, not just the mere presence of the field with any value).

You're right. I meant that we should keep the original value, too. So event_type should be REBALANCE_RECOMMENDATION, SCHEDULED_EVENT, SPOT_ITN, or SQS_TERMINATE, rather than the newer *_MONITOR values returned by MonitorKind().

trutx · 2022-11-09T22:55:10Z

I have started working on the requested changes but after a while I realized they ultimately mean undoing the entire PR. One change leads to another and in the end everything stays almost the same. If we're to keep backwards compatibility and keep all log messages and event reasons like they are today there's really not much to be done here. Centralizing the event kinds in the monitor package makes no sense now and renaming the Kind() method to MonitorKind() doesn't solve much because it is both used to refer to monitor and event kinds and it also returns a const value that applies both to monitors and events, so by fixing the former we're breaking the latter.

The only new thing is the addition of the AWSEventType field and appending it to the event.Description, and even this could be a breaking change if people's monitoring/log parsing is expecting the description as it is today. For example, in this spot ITN description

Description:  fmt.Sprintf("Spot ITN received. Instance will be interrupted at %s \n", instanceAction.Time),

they could be reading the last part of the string with something like

descriptionParts := strings.Split(event.Description, " ")
interruptionTime := descriptionParts[len(descriptionParts)-1]

expecting a time which would be used somewhere else.

All in all, I think this PR should be closed. Even if we introduced the small changes above we wouldn't be fixing #704.

As always, thoughts welcome.

cjerad · 2022-11-14T17:59:28Z

Hi trutx, thanks for your patience and working with us.

snay2 and I think another path is to introduce the concept of a select-able log format version. "Version 1" would correspond to the existing log format and messages, and a new "version 2" would contain your changes. The log format version would be specified via Helm config, with a default of version 1. This would make the change in format "opt-in".

I have implemented this feature, and can add it to this PR if you enable the Allow edits and access to secrets by maintainers checkbox.

Let me know your thoughts.

trutx · 2022-11-16T21:58:09Z

Hey, sorry about the silence. I can't find the checkbox you mention, maybe some repo settings need to be changed?

In any case I'm switching jobs and will be leaving New Relic this Friday, but I am on holidays so it means I effectively left already, hence the silence. I will lose my perms on the newrelic-forks org and the repo this PR comes from, and I don't find comfortable allowing edits to a PR coming from a repo owned by a company I'm just about to leave.

Feel free to use this PR and the changes in it at will. Probably the best move is that you guys open a PR in the upstream repo with all my changes and the feature you just implemented. Or else I can check if I can create a new PR from another fork, either mine or Spotify's.

cjerad · 2022-11-17T19:46:01Z

@trutx Thanks for the update. I've opened a new PR (#725) that incorporates both changes (and includes you as author). So I'll close out this PR.

trutx · 2022-11-17T22:20:19Z

Sounds great, thanks for doing that!

trutx requested a review from a team as a code owner October 18, 2022 09:28

fix: show actual event kinds in Queue mode

8c48270

trutx force-pushed the rtorrents/fix_sqs_kind branch from 789dde6 to 8c48270 Compare October 20, 2022 20:50

trutx mentioned this pull request Oct 21, 2022

chore: add annotations to events in SQS mode #715

Merged

Roger Torrentsgeneros added 6 commits November 2, 2022 10:14

feat: emit pod events on drain

1d949ef

fix: satisfy PR requested change

d1fe29d

feat: emit pod events on drain

2aaf71b

fix: satisfy PR requested change

de453e1

Merge remote-tracking branch 'origin/main' into rtorrents/fix_sqs_kind

38a9d7a

fix: typo

7e74679

snay2 suggested changes Nov 4, 2022

View reviewed changes

cjerad closed this Nov 17, 2022

fix: show actual event kinds in Queue mode #706

fix: show actual event kinds in Queue mode #706

Uh oh!

Conversation

trutx commented Oct 18, 2022

Uh oh!

snay2 commented Nov 1, 2022

Changes to logs when starting up

Changes to logs when processing an event

Changes to logs when receiving an SQS message

Tables of changed values

Table 1: Monitor types

Table 2: Event types

Table 3: Event reasons

Uh oh!

trutx commented Nov 2, 2022

Uh oh!

snay2 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

trutx Nov 8, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

trutx commented Nov 9, 2022

Uh oh!

cjerad commented Nov 14, 2022

Uh oh!

trutx commented Nov 16, 2022

Uh oh!

cjerad commented Nov 17, 2022

Uh oh!

trutx commented Nov 17, 2022

Uh oh!

Uh oh!

trutx Nov 8, 2022 •

edited

Loading