🐛 fixing crashes on State-change events #313

universam1 · 2020-12-10T14:16:23Z

NTH crashes often from "EC2 Instance State-change Notification" events.

fixes #307

The underlying issue was that an empty NodeName, parsed from PrivateDnsName, was forwarded unverified, which caused cascading problems.

This fixes the root cause: an "EC2 Instance State-change Notification" can arrive while the instance is shutting-down, terminated, or in any other non-running state in which the PrivateDnsName metadata is empty.

Instead of just ignoring these errors, this implementation looks at the instance state and decides whether the error should fail or be ignored.
Such messages are therefore dropped in the situation above, because they are useless.

@haugenj

Background: I went all the way back instead of just removing the os.Exit(1), and found that the nodeName was actually empty but never verified.
The existing unit test was flawed in this regard.

The empty nodeName comes from EC2.DescribeInstances results in which the PrivateDnsName metadata is empty, simply because the node is no longer running.

So I created a custom error that allows handling that case.
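A minimal, standalone sketch of the idea, not the code from this PR: the names `ErrNodeStateNotRunning`, `instance`, `nodeNameFor` and `shouldDrop` are hypothetical, and the `instance` struct is a trimmed-down stand-in for the DescribeInstances result. It only illustrates how a dedicated error lets the caller tell "instance is no longer running, drop the event" apart from a genuine failure.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNodeStateNotRunning is a hypothetical sentinel error for instances
// that have no PrivateDnsName because they are not running any more.
var ErrNodeStateNotRunning = errors.New("EC2 instance is not in running state")

// instance is a hypothetical, reduced view of a DescribeInstances entry.
type instance struct {
	ID             string
	PrivateDNSName string
	State          string
}

// nodeNameFor derives the node name from PrivateDnsName and verifies
// that it is not empty before handing it to the caller.
func nodeNameFor(inst instance) (string, error) {
	if inst.PrivateDNSName != "" {
		return inst.PrivateDNSName, nil
	}
	if inst.State != "running" {
		// Expected for shutting-down or terminated instances.
		return "", fmt.Errorf("instance %s is %q: %w", inst.ID, inst.State, ErrNodeStateNotRunning)
	}
	// A running instance without a name is a real error.
	return "", fmt.Errorf("instance %s is running but has no PrivateDnsName", inst.ID)
}

// shouldDrop reports whether the event can be discarded instead of failing.
func shouldDrop(err error) bool {
	return errors.Is(err, ErrNodeStateNotRunning)
}

func main() {
	_, err := nodeNameFor(instance{ID: "i-xxxxxxxx", State: "terminated"})
	fmt.Println(shouldDrop(err), err)
}
```

Wrapping the sentinel with `%w` keeps the instance ID and state in the message while still letting the caller match on the error with `errors.Is`.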

example event:

{
    "AmiLaunchIndex": 1,
    "Architecture": "x86_64",

    "ImageId": "ami-0e44cbb6cd97xxxx",
    "InstanceId": "i-xxxxxxxx",

    "Platform": null,
    "PrivateDnsName": "",
    "PrivateIpAddress": null,
    "ProductCodes": null,
    "PublicDnsName": "",
    "PublicIpAddress": null,

    "State": {
        "Code": 48,
        "Name": "terminated"
    },
    "StateReason": {
        "Code": "Client.UserInitiatedShutdown",
        "Message": "Client.UserInitiatedShutdown: User initiated shutdown"
    },

}

dhohengassner · 2020-12-10T14:48:46Z

nice - think I have seen such crashes already

would be happy if this fix gets merged 👍

pkg/monitor/sqsevent/sqs-monitor_test.go

pkg/monitor/sqsevent/sqs-monitor.go

haugenj · 2020-12-10T15:29:26Z

@universam1 This is great! Left you two small nitpick comments but this is really awesome, thanks!

universam1 · 2020-12-10T16:10:44Z

cmd/node-termination-handler.go

 		if nthConfig.CordonOnly || drainEvent.IsRebalanceRecommendation() {
 			err := node.Cordon(nodeName)
 			if err != nil {
+				if errors.IsNotFound(err) {


@haugenj I found another situation: events for nodes that do not belong to this cluster could bubble up an error. For this case I think we can only make a best effort to ignore them without removing all the safety guards.

I think this is the right approach. Originally I was thinking we would totally remove the os.Exit(1) here, so your solution feels like a better middle-ground

haugenj

👍 LGTM. Thanks!

universam1 changed the title ~~🐛 fixing crashes on State-change events~~ 🐛 fixing crashes on State-change events Dec 10, 2020

universam1 mentioned this pull request Dec 10, 2020

Multiple events for the same node cause a crash #307

Closed

haugenj reviewed Dec 10, 2020

pkg/monitor/sqsevent/sqs-monitor_test.go (outdated)

pkg/monitor/sqsevent/sqs-monitor.go (outdated)

universam1 force-pushed the fixCrashOnStateEvents branch from 9e5d9e2 to 93163e0 Compare December 10, 2020 16:09

universam1 commented Dec 10, 2020

ignore events for nodes that are not belonging to the cluster

12bde5b

universam1 force-pushed the fixCrashOnStateEvents branch from 93163e0 to 12bde5b Compare December 10, 2020 16:15

haugenj approved these changes Dec 10, 2020

haugenj merged commit 9f87a47 into aws:main Dec 10, 2020
