-
Notifications
You must be signed in to change notification settings - Fork 274
fix: show actual event kinds in Queue mode #706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
789dde6
to
8c48270
Compare
I like the direction of these changes, as they give cluster operators a clearer picture of what NTH is actually doing. However, we will have to be clear about what changes will be made: as you pointed out, customers with monitoring systems configured to look for specific strings would have to update their monitoring after upgrading. It may be possible to make the logging changes backwards compatible by leaving the existing Potential description of the logging changes to be included in the release notes: This release modifies several Debug, Info, and Warn logs, as well as Kuberentes events emitted by NTH. These changes primarily affect NTH when running in Queue Processor mode. These changes improve your observability about what NTH is doing when responding to events via SQS. If your monitoring system is configured to look for any of the specific strings in the tables below, you may need to modify your configuration to use the updated strings instead. Changes to logs when starting up
Changes to logs when processing an event
Changes to logs when receiving an SQS message
Tables of changed valuesTable 1: Monitor types
Table 2: Event types
Table 3: Event reasons
|
Yes, like you said there is no way around a breaking change, so given we're doing it let's just group all breaking changes in one release bump. The release notes and tables you added look great to me, I would only make the table 2 and 3 titles explicitly clear so there is no confusion about what events we are talking about in each case. Something like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussing with my team, we’ve decided we need to prioritize backwards compatibility as much as possible. We propose the following changes:
- Keep any existing fields in debug, info, or warn logs the same
- Where the previous information was incorrect or incomplete, add a new field with an appropriate name to convey the new information.
This way, customers who currently rely on specific values in specific fields of the log messages will continue to receive those, but customers who want the new values can add them into their monitoring configurations. We don’t have any way of knowing which customers are using the existing fields for monitoring, and we also don’t have a reliable way to warn customers about the change before they upgrade, especially if they don’t religiously follow our release notes.
So even though your proposal creates more accurate logs, I’m asking you to change it in the following ways:
- Keep the
Kind
property as-is. That is, it will basically indicate the monitor type, not the event type. - Instead of a new property called
Monitor
, create a new property calledAWSEventDetails
(open to suggestions on the name, but the idea is that it has a more detailed, more accurate description of the actual AWS event). This stays the same for the IMDS-based monitors, but the SQS-based monitor will specify the actual AWS event type here. - dAssociated logging changes to reflect this (described in inline comments).
All that said, I think we should do logging right in v2 (currently in alpha) so we don’t have this backwards compatibility debt in the future. @cjerad is working on that and has been part of my discussions internally, so I’ll make sure he’s in the loop.
Thank you and your team for all your hard work on this so far! Sorry about the churn.
"github.com/aws/aws-node-termination-handler/pkg/node" | ||
) | ||
|
||
const ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that you centralized the event types here--keep this!
type InterruptionEvent struct { | ||
EventID string | ||
Kind string | ||
Monitor string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rename this AWSEventDetails
(open to suggestions on the name if you disagree). Because the Kind
field basically contained the monitor type in the past, let's keep that behavior even though the name is sort of wrong. This new field will contain the accurate description of the actual AWS event (e.g., AWSEventDetails = "SPOT_ITN"
and Kind = "SQS_TERMINATE"
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, if it's only meant to contain the AWS event type I'd call this field AWSEventType
or AWSEventKind
, but that's just a suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. I like AWSEventType
.
|
||
// Kind denotes the kind of event that is processed | ||
// Kind denotes the kind of monitor | ||
func (m SQSMonitor) Kind() string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm opening to renaming this method to MonitorKind()
so that it's more clear internally that we are referring to the monitor type, not the event type. Similar change on the IMDS-based monitors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, sounds good. Will do.
Str("monitor", event.Monitor). | ||
Str("node-name", event.NodeName). | ||
Str("instance-id", event.InstanceID). | ||
Str("provider-id", event.ProviderID). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On line 252 when we emit the Kubernetes event, we need to keep the "reason" argument the same as it was before (observability.GetReasonForKind(event.Kind)
).
However, let's add another argument at the end, after event.Description
that includes the AWSEventDetails
. I'd suggest we use the "screaming snake case" names without conversion, but I'm open to doing a conversion like GetReasonForKind()
does right now if you disagree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do, but I have to respectfully say I'm against this one. In all other cases the old and wrong field, data, log line or whatever can be kept and coexist with a new and fixed one. In this particular case like you pointed out there's only one reason, so either we use the right or the wrong reason. Kubernetes event reasons are the cornerstone of event-based Kubernetes monitoring, and we're choosing to leave it wrong. I understand backwards compatibility but in this case here we're not talking about an enhancement or a new feature but a fix, and I think not fixing things that are known to be wrong is not the best way to go.
That said, I think the best option is to leave the recorder.Emit()
call in 252 as is and instead modify the Description
field generation in all monitors. I.e., we would change line 93 in spot-itn-monitor.go
to something like
Description: fmt.Sprintf("Spot ITN received. Instance will be interrupted at %s. AWS event type: %s \n", instanceAction.Time, monitor.SpotITNKind),
or maybe
Description: fmt.Sprintf("Spot ITN received (%s). Instance will be interrupted at %s \n", monitor.SpotITNKind, instanceAction.Time),
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you also intending to keep the return values of observability.GetReasonForKind()
as they were before, such that the only change to the argument values in recorder.Emit()
would be this updated Description
field? If so, I am in favor of that plan.
I like your first suggestion (AWS event type: ...
at the end of the Description
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if we are to keep the old and wrong reason there is no other option than reverting the changes in GetReasonForKind()
so they return the old values.
log.Info(). | ||
Str("event-id", event.EventID). | ||
Str("kind", event.Kind). | ||
Str("monitor", event.Monitor). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's call this aws-event-details
or perhaps event-type-details
, and have it reflect the underlying AWS event. kind
will continue to refer to the monitor type as it did in the past, to preserve backwards compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just call this aws-event-type
or aws-event-kind
. IMO adding details
here is confusing because the field only contains one single detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Let's go with aws-event-type
.
for _, fn := range monitoringFns { | ||
go func(monitor monitor.Monitor) { | ||
log.Info().Str("event_type", monitor.Kind()).Msg("Started monitoring for events") | ||
log.Info().Str("monitor_type", monitor.Kind()).Msg("Started monitoring for events") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use the original field name event_type
to preserve backwards compatibility, even though that's technically not correct. I'm open to adding another field named monitor_type
as well, to make the meaning more clear, but we need to keep the existing field name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you want to keep the event_type
field but change its value? Old values here were consts like RebalanceRecommendationKind = "REBALANCE_RECOMMENDATION"
while new ones are consts like RebalanceRecommendationMonitorKind = "REBALANCE_RECOMMENDATION_MONITOR"
. With that we are keeping the field name but changing its value so we're effectively introducing breaking changes anyway when it comes to people parsing such fields (I guess they are also parsing the field value, not just the mere presence of the field with any value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I meant that we should keep the original value, too. So event_type
should be REBALANCE_RECOMMENDATION
, SCHEDULED_EVENT
, SPOT_ITN
, or SQS_TERMINATE
, rather than the newer *_MONITOR
values returned by MonitorKind()
.
I have started working on the requested changes but after a while I realized they ultimately mean undoing the entire PR. One change leads to another and in the end everything stays almost the same. If we're to keep backwards compatibility and keep all log messages and event reasons like they are today there's really not much to be done here. Centralizing the event kinds in the The only new thing is the addition of the
they could be reading the last part of the string with something like
expecting a time which would be used somewhere else. All in all, I think this PR should be closed. Even if we introduced the small changes above we wouldn't be fixing #704. As always, thoughts welcome. |
Hi trutx, thanks for your patience and working with us. snay2 and I think another path is to introduce the concept of a select-able log format version. "Version 1" would correspond to the existing log format and messages, and a new "version 2" would contain your changes. The log format version would be specified via Helm config, with a default of version 1. This would make the change in format "opt-in". I have implemented this feature, and can add it to this PR if you enable the Allow edits and access to secrets by maintainers checkbox. Let me know your thoughts. |
Hey, sorry about the silence. I can't find the checkbox you mention, maybe some repo settings need to be changed? In any case I'm switching jobs and will be leaving New Relic this Friday, but I am on holidays so it means I effectively left already, hence the silence. I will lose my perms on the newrelic-forks org and the repo this PR comes from, and I don't find comfortable allowing edits to a PR coming from a repo owned by a company I'm just about to leave. Feel free to use this PR and the changes in it at will. Probably the best move is that you guys open a PR in the upstream repo with all my changes and the feature you just implemented. Or else I can check if I can create a new PR from another fork, either mine or Spotify's. |
Sounds great, thanks for doing that! |
Issue #704
Description of changes:
When working in Queue processor mode all interruption events have kind
SQS_TERMINATE
. This is wrong because this value reflects the operation mode or the kind of monitor handling the event and not the actual event kind. BesidesSQS_TERMINATE
is a pretty inaccurate and confusing term. However, when working in IMDS mode each eventKind
field contains the real AWS event kind.In this PR I'm fixing this inconsistency by making every
InterruptionEvent.Kind
field show the actual event kind and not the monitor kind that handled it. I'm also adding a newMonitor
field that will hold the monitor kind for each event. An AWS rebalance recommendation event (or any other) is the same event regardless of where NTH gets it from (IMDS or SQS), so for the sake of consistency now all events in both operation modes will have the sameKind
value and event kinds in SQS mode won't be masked behind a vagueSQS_TERMINATE
term.Sample log lines (anonymized):
Before
After
This PR doesn't change the basic NTH behavior but it does change its output and this might break monitoring systems based on such output. I.e. systems watching
SQSTermination
Kubernetes events or parsing log lines in search for"Kind":"SQS_TERMINATE"
will have to be updated.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.