🐛 fixing crashes on State-change events #313
nice - think I have seen such crashes already, would be happy if this fix gets merged 👍
@universam1 This is great! Left you two small nitpick comments but this is really awesome, thanks!
NTH crashes often from "EC2 Instance State-change Notification" events. The underlying issue was that an empty NodeName, parsed from PrivateDnsName, was forwarded unverified, causing cascading problems. This fixes the root cause: an "EC2 Instance State-change Notification" can arrive while the instance is shutting-down, terminated, or in any other not-online state where the PrivateDnsName metadata is empty. Instead of just ignoring these errors, this implementation looks at the instance state and decides whether the empty name is an error to fail on or one to ignore. Such messages are therefore dropped in the above situation because they are useless.
9e5d9e2 to 93163e0
if nthConfig.CordonOnly || drainEvent.IsRebalanceRecommendation() {
	err := node.Cordon(nodeName)
	if err != nil {
		if errors.IsNotFound(err) {
@haugenj found another situation where events for other nodes that do not belong to this cluster could bubble up an error - I think for this case we can only do best effort to ignore such cases without removing all the safety guards
I think this is the right approach. Originally I was thinking we would totally remove the os.Exit(1) here, so your solution feels like a better middle-ground
93163e0 to 12bde5b
👍 LGTM. Thanks!
NTH crashes often from "EC2 Instance State-change Notification" events.
fixes #307
The underlying issue was that an empty NodeName, parsed from PrivateDnsName, was forwarded unverified, causing cascading problems.
This fixes the root cause: an "EC2 Instance State-change Notification" can arrive while the instance is shutting-down, terminated, or in any other not-online state where the PrivateDnsName metadata is empty.
Instead of just ignoring these errors, this implementation looks at the instance state and decides whether the empty name is an error to fail on or one to ignore.
Such messages are therefore dropped in the above situation because they are useless.
@haugenj
Background: I went all the way back; instead of just removing the os.Exit(1), I found that the nodeName was actually empty but never verified.
The existing unit test was flawed in this regard.
The empty nodeName comes from EC2.DescribeInstances responses in which the PrivateDnsName metadata is empty, simply because the node is not running any more. So I created a custom error that allows handling that case.
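A minimal sketch of the custom-error idea, assuming a simplified view of the instance data. The names `ErrNodeStateNotRunning`, `instanceInfo`, `nodeNameFor`, and `shouldDrop` are hypothetical and not necessarily what the PR uses, and treating every state other than "running" as not-online is an assumption made for illustration. An empty PrivateDnsName on a non-running instance yields a recognizable error that the caller can drop, while an empty name on a running instance remains a real failure.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNodeStateNotRunning is a hypothetical sentinel error marking an
// empty PrivateDnsName that is explained by the instance not running.
var ErrNodeStateNotRunning = errors.New("instance not running, PrivateDnsName is empty")

// instanceInfo is a simplified, hypothetical view of the fields of
// interest in an EC2 DescribeInstances result.
type instanceInfo struct {
	PrivateDnsName string
	State          string // e.g. "running", "shutting-down", "terminated"
}

// nodeNameFor verifies the node name instead of forwarding an empty one.
func nodeNameFor(inst instanceInfo) (string, error) {
	if inst.PrivateDnsName != "" {
		return inst.PrivateDnsName, nil
	}
	if inst.State != "running" {
		// Expected for instances that are going away: safe to drop.
		return "", fmt.Errorf("state %q: %w", inst.State, ErrNodeStateNotRunning)
	}
	// A running instance without a name is unexpected: report it.
	return "", fmt.Errorf("empty PrivateDnsName for instance in state %q", inst.State)
}

// shouldDrop reports whether err marks a message that can be ignored.
func shouldDrop(err error) bool {
	return errors.Is(err, ErrNodeStateNotRunning)
}

func main() {
	instances := []instanceInfo{
		{PrivateDnsName: "ip-10-0-0-1.ec2.internal", State: "running"},
		{State: "terminated"},
		{State: "running"},
	}
	for _, inst := range instances {
		name, err := nodeNameFor(inst)
		switch {
		case err == nil:
			fmt.Println("process event for", name)
		case shouldDrop(err):
			fmt.Println("drop message:", err)
		default:
			fmt.Println("fail:", err)
		}
	}
}
```

Wrapping the sentinel with `%w` keeps the instance state in the message while still letting `errors.Is` recognize the droppable case.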
example event: