Skip to content

Conversation

universam1
Copy link
Contributor

@universam1 universam1 commented Dec 10, 2020

NTH crashes often from "EC2 Instance State-change Notification" events.

fixes #307

The underlying issue was that an empty parsed NodeName derived from PrivateDnsName for forwarded unverified, creating cascading problems.

This solves the root cause, that "EC2 Instance State-change Notification" can arrive at a time where the instance is in shutting-down, terminated or any other not-online situation where the PrivateDnsName metadata is empty!

Instead of just ignoring these errors this implementation gets and decides based on the state of the instance message if it is an error to fail or to ignore.
Therefor such messages are dropped in above situation because they are useless.

@haugenj

Background: went all the way back, instead of just removing the os.exit(1) found that actually the nodeName was empty but not verified.
The existing unit test was flawed in this regard.

The empty nodeName parsing derives from EC2.DescribeInstances that have the metadata of PrivateDnsName empty, simply because the node is not running any more.

So created a custom error that allows to handle that case.

example event:

{
    "AmiLaunchIndex": 1,
    "Architecture": "x86_64",

    "ImageId": "ami-0e44cbb6cd97xxxx",
    "InstanceId": "i-xxxxxxxx",

    "Platform": null,
    "PrivateDnsName": "",
    "PrivateIpAddress": null,
    "ProductCodes": null,
    "PublicDnsName": "",
    "PublicIpAddress": null,

    "State": {
        "Code": 48,
        "Name": "terminated"
    },
    "StateReason": {
        "Code": "Client.UserInitiatedShutdown",
        "Message": "Client.UserInitiatedShutdown: User initiated shutdown"
    },

}

@universam1 universam1 changed the title 🐛 fixing crashes on State-change events 🐛 fixing crashes on State-change events Dec 10, 2020
@dhohengassner
Copy link

nice - think I have seen such crashes already

would be happy if this fix gets merged 👍

@haugenj
Copy link
Contributor

haugenj commented Dec 10, 2020

@universam1 This is great! Left you two small nitpick comments but this is really awesome, thanks!

NTH crashes often from "EC2 Instance State-change Notification" events.

The underlying issue was that an empty parsed NodeName derived from PrivateDnsName for forwarded unverified, creating cascading problems.

This solves the root cause, that "EC2 Instance State-change Notification" can arrive at a time where the instance is in shutting-down, terminated or any other not-online situation where the PrivateDnsName metadata is empty!

Instead of just ignoring these errors this implementation gets and decides based on the state of the instance message if it is an error to fail or to ignore.
Therefor such messages are dropped in above situation because they are useless.
@universam1 universam1 force-pushed the fixCrashOnStateEvents branch from 9e5d9e2 to 93163e0 Compare December 10, 2020 16:09
if nthConfig.CordonOnly || drainEvent.IsRebalanceRecommendation() {
err := node.Cordon(nodeName)
if err != nil {
if errors.IsNotFound(err) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@haugenj found another situation where events for other nodes that do not belong to this cluster could bubble up an error - I think for this case we can only do best effort to ignore such cases without removing all the safety guards

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the right approach. Originally I was thinking we would totally remove the os.Exit(1) here, so your solution feels like a better middle-ground

@universam1 universam1 force-pushed the fixCrashOnStateEvents branch from 93163e0 to 12bde5b Compare December 10, 2020 16:15
Copy link
Contributor

@haugenj haugenj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM. Thanks!

@haugenj haugenj merged commit 9f87a47 into aws:main Dec 10, 2020
haugenj pushed a commit to haugenj/aws-node-termination-handler that referenced this pull request Dec 14, 2020
NTH crashes often from "EC2 Instance State-change Notification" events.

The underlying issue was that an empty parsed NodeName derived from PrivateDnsName was forwarded unverified, creating cascading problems.

This solves the root cause, that "EC2 Instance State-change Notification" can arrive at a time where the instance is in shutting-down, terminated or any other not-online situation where the PrivateDnsName metadata is empty!

Instead of just ignoring these errors this implementation gets and decides based on the state of the instance message if it is an error to fail or to ignore.
Therefor such messages are dropped in above situation because they are useless.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multiple events for the same node cause a crash
3 participants