Description
[Queue processor specific]
Running into an issue here around the behavior of exiting with an error code of 1
when a message is pulled from SQS and then cordoning or draining of that node fails.
In our use case, we have multiple EKS clusters in a single account and region. When approaching how to handle this with NTH, I planned to run per-cluster SQS queues with CloudWatch Events + rules tailored to the proper cluster/ASG combination. This works for ASG events, but EC2 termination events do not have the same filtering capacity (sketched below), meaning that messages end up in a cluster's SQS queue even though the instances may not be in that cluster. When NTH processes one of those, it will exit(1)
because there is no node in the cluster to cordon or drain. After enough of these, Kubernetes marks the pod as CrashLoopBackOff
and you lose all NTH capabilities until the back-off period expires and Kubernetes starts the pod again.
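
To make the filtering gap concrete, here is a rough sketch (using aws-sdk-go; the rule names and ASG name are just illustrative, not anything NTH creates) of the two CloudWatch Events rules: the ASG lifecycle rule can match on AutoScalingGroupName in the event detail, while the spot interruption warning only carries instance-id/instance-action, so there is nothing cluster-specific to match on and every interruption in the account/region lands in the queue.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatchevents"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	cwe := cloudwatchevents.New(sess)

	// ASG lifecycle events can be scoped to one cluster's ASGs because
	// AutoScalingGroupName is present in the event detail.
	asgPattern := `{
	  "source": ["aws.autoscaling"],
	  "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
	  "detail": {"AutoScalingGroupName": ["my-cluster-a-asg"]}
	}`

	// Spot interruption warnings only carry instance-id and instance-action,
	// so there is nothing cluster-specific to filter on -- every interruption
	// in the account/region matches and ends up in the per-cluster queue.
	spotPattern := `{
	  "source": ["aws.ec2"],
	  "detail-type": ["EC2 Spot Instance Interruption Warning"]
	}`

	for name, pattern := range map[string]string{
		"cluster-a-asg-termination": asgPattern,
		"cluster-a-spot-itn":        spotPattern,
	} {
		if _, err := cwe.PutRule(&cloudwatchevents.PutRuleInput{
			Name:         aws.String(name),
			EventPattern: aws.String(pattern),
		}); err != nil {
			log.Fatalf("failed to create rule %s: %v", name, err)
		}
	}
}
```
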
For an NTH pod that I've been running for about a week, it's seen 600+ crashes:
aws-node-termination-handler-86d9c789f4-6c9pz 0/1 CrashLoopBackOff 607 6d2h
2020/10/29 17:29:57 ??? Trying to get token from IMDSv2
2020/10/29 17:29:57 ??? Got token from IMDSv2
2020/10/29 17:29:57 ??? Startup Metadata Retrieved metadata={"accountId":"0xxxxxxxxx","availabilityZone":"us-east-1b","instanceId":"i-0xxxxxxxx","instanceType":"c5.2xlarge","localHostname":"ip-10-xxx-xxx-xxx.ec2.internal","privateIp":"10.xxx.xxx.xxx","publicHostname":"","publicIp":"","region":"us-east-1"}
2020/10/29 17:29:57 ??? aws-node-termination-handler arguments:
dry-run: false,
node-name: ip-10-xxx-xxx-xxx.ec2.internal,
metadata-url: http://169.254.169.254,
kubernetes-service-host: 172.20.0.1,
kubernetes-service-port: 443,
delete-local-data: true,
ignore-daemon-sets: true,
pod-termination-grace-period: -1,
node-termination-grace-period: 120,
enable-scheduled-event-draining: false,
enable-spot-interruption-draining: false,
enable-sqs-termination-draining: true,
metadata-tries: 3,
cordon-only: false,
taint-node: false,
json-logging: false,
log-level: INFO,
webhook-proxy: ,
webhook-headers: <not-displayed>,
webhook-url: ,
webhook-template: <not-displayed>,
uptime-from-file: ,
enable-prometheus-server: false,
prometheus-server-port: 9092,
aws-region: us-east-1,
queue-url: https://sqs.us-east-1.amazonaws.com/0xxxxxxxxxx/node_termination_handler,
check-asg-tag-before-draining: true,
aws-endpoint: ,
2020/10/29 17:29:57 ??? Started watching for interruption events
2020/10/29 17:29:57 ??? Kubernetes AWS Node Termination Handler has started successfully!
2020/10/29 17:29:57 ??? Started watching for event cancellations
2020/10/29 17:29:57 ??? Started monitoring for events event_type=SQS_TERMINATE
2020/10/29 17:30:00 ??? Adding new event to the event store event={"Description":"Spot Interruption event received. Instance will be interrupted at 2020-10-29 17:28:15 +0000 UTC \n","Drained":false,"EndTime":"0001-01-01T00:00:00Z","EventID":"spot-itn-event-12345","InstanceID":"i-0xxxxxxxxxx","Kind":"SQS_TERMINATE","NodeName":"ip-10-xxx-xxx-xxx.ec2.internal","StartTime":"2020-10-29T17:28:15Z","State":""}
2020/10/29 17:30:01 ??? Cordoning the node
2020/10/29 17:30:01 WRN Error when trying to list Nodes w/ label, falling back to direct Get lookup of node
2020/10/29 17:30:01 ??? There was a problem while trying to cordon and drain the node error="nodes \"ip-10-xxx-xxx-xxx.ec2.internal\" not found"
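
What I'd expect (or would like to be able to opt into) is for "node not found" to be treated as a skippable event rather than a fatal one. A minimal sketch of the idea using client-go -- handleEvent is a hypothetical placeholder, not NTH's actual function:

```go
// Hypothetical sketch: skip SQS termination events whose node is not in this
// cluster instead of exiting with code 1.
package nthsketch

import (
	"context"
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func handleEvent(ctx context.Context, clientset kubernetes.Interface, nodeName string) error {
	if _, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{}); err != nil {
		if apierrors.IsNotFound(err) {
			// The instance isn't a node in this cluster (e.g. it belongs to
			// another cluster in the same account/region). Log it, drop the
			// message, and keep the pod alive instead of crashing.
			log.Printf("node %s not found in this cluster, skipping event", nodeName)
			return nil
		}
		return err // genuine API errors should still surface
	}

	// ... cordon and drain as NTH does today ...
	return nil
}
```
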