Description
[Queue processor specific]
Running into an issue here around the behavior of exiting with an error code of 1
when a message is pulled from SQS and then cordoning or draining of that node fails.
In our use case, we have multiple EKS clusters in a single account and region. When approaching how to handle this with NTH, I planned to run per-cluster SQS queues with CloudWatch Events + rules tailored to the proper cluster/ASG combination. This works for ASG events, but EC2 termination events do not have the same filtering capacity (sketched below), meaning that messages end up in a cluster's SQS queue even though the instances may not be in that cluster. When NTH processes one of those, it will exit(1)
because there is no node in the cluster to cordon or drain. After enough of these, Kubernetes marks the pod as CrashLoopBackOff
and you lose all NTH capabilities until the back-off period expires and Kubernetes starts the pod again.
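
To make the filtering gap concrete, here is a rough sketch (using aws-sdk-go; the rule names and ASG name are just illustrative, not anything NTH creates) of the two CloudWatch Events rules: the ASG lifecycle rule can match on AutoScalingGroupName in the event detail, while the spot interruption warning only carries instance-id/instance-action, so there is nothing cluster-specific to match on and every interruption in the account/region lands in the queue.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatchevents"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	cwe := cloudwatchevents.New(sess)

	// ASG lifecycle events can be scoped to one cluster's ASGs because
	// AutoScalingGroupName is present in the event detail.
	asgPattern := `{
	  "source": ["aws.autoscaling"],
	  "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
	  "detail": {"AutoScalingGroupName": ["my-cluster-a-asg"]}
	}`

	// Spot interruption warnings only carry instance-id and instance-action,
	// so there is nothing cluster-specific to filter on -- every interruption
	// in the account/region matches and ends up in the per-cluster queue.
	spotPattern := `{
	  "source": ["aws.ec2"],
	  "detail-type": ["EC2 Spot Instance Interruption Warning"]
	}`

	for name, pattern := range map[string]string{
		"cluster-a-asg-termination": asgPattern,
		"cluster-a-spot-itn":        spotPattern,
	} {
		if _, err := cwe.PutRule(&cloudwatchevents.PutRuleInput{
			Name:         aws.String(name),
			EventPattern: aws.String(pattern),
		}); err != nil {
			log.Fatalf("failed to create rule %s: %v", name, err)
		}
	}
}
```
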
For an NTH pod that I've been running for about a week, it's seen 600+ crashes:
aws-node-termination-handler-86d9c789f4-6c9pz 0/1 CrashLoopBackOff 607 6d2h
2020/10/29 17:29:57 ??? Trying to get token from IMDSv2
2020/10/29 17:29:57 ??? Got token from IMDSv2
2020/10/29 17:29:57 ??? Startup Metadata Retrieved metadata={"accountId":"0xxxxxxxxx","availabilityZone":"us-east-1b","instanceId":"i-0xxxxxxxx","instanceType":"c5.2xlarge","localHostname":"ip-10-xxx-xxx-xxx.ec2.internal","privateIp":"10.xxx.xxx.xxx","publicHostname":"","publicIp":"","region":"us-east-1"}
2020/10/29 17:29:57 ??? aws-node-termination-handler arguments:
dry-run: false,
node-name: ip-10-xxx-xxx-xxx.ec2.internal,
metadata-url: http://169.254.169.254,
kubernetes-service-host: 172.20.0.1,
kubernetes-service-port: 443,
delete-local-data: true,
ignore-daemon-sets: true,
pod-termination-grace-period: -1,
node-termination-grace-period: 120,
enable-scheduled-event-draining: false,
enable-spot-interruption-draining: false,
enable-sqs-termination-draining: true,
metadata-tries: 3,
cordon-only: false,
taint-node: false,
json-logging: false,
log-level: INFO,
webhook-proxy: ,
webhook-headers: <not-displayed>,
webhook-url: ,
webhook-template: <not-displayed>,
uptime-from-file: ,
enable-prometheus-server: false,
prometheus-server-port: 9092,
aws-region: us-east-1,
queue-url: https://sqs.us-east-1.amazonaws.com/0xxxxxxxxxx/node_termination_handler,
check-asg-tag-before-draining: true,
aws-endpoint: ,
2020/10/29 17:29:57 ??? Started watching for interruption events
2020/10/29 17:29:57 ??? Kubernetes AWS Node Termination Handler has started successfully!
2020/10/29 17:29:57 ??? Started watching for event cancellations
2020/10/29 17:29:57 ??? Started monitoring for events event_type=SQS_TERMINATE
2020/10/29 17:30:00 ??? Adding new event to the event store event={"Description":"Spot Interruption event received. Instance will be interrupted at 2020-10-29 17:28:15 +0000 UTC \n","Drained":false,"EndTime":"0001-01-01T00:00:00Z","EventID":"spot-itn-event-12345","InstanceID":"i-0xxxxxxxxxx","Kind":"SQS_TERMINATE","NodeName":"ip-10-xxx-xxx-xxx.ec2.internal","StartTime":"2020-10-29T17:28:15Z","State":""}
2020/10/29 17:30:01 ??? Cordoning the node
2020/10/29 17:30:01 WRN Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
2020/10/29 17:30:01 ??? There was a problem while trying to cordon and drain the node error="nodes \"ip-10-xxx-xxx-xxx.ec2.internal\" not found"
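
What I'd expect (or would like to be able to opt into) is for "node not found" to be treated as a skippable event rather than a fatal one. A minimal sketch of the idea using client-go -- handleEvent is a hypothetical placeholder, not NTH's actual function:

```go
// Hypothetical sketch: skip SQS termination events whose node is not in this
// cluster instead of exiting with code 1.
package nthsketch

import (
	"context"
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func handleEvent(ctx context.Context, clientset kubernetes.Interface, nodeName string) error {
	if _, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{}); err != nil {
		if apierrors.IsNotFound(err) {
			// The instance isn't a node in this cluster (e.g. it belongs to
			// another cluster in the same account/region). Log it, drop the
			// message, and keep the pod alive instead of crashing.
			log.Printf("node %s not found in this cluster, skipping event", nodeName)
			return nil
		}
		return err // genuine API errors should still surface
	}

	// ... cordon and drain as NTH does today ...
	return nil
}
```
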