Description
Describe the feature
If NTH cannot find the node in the cluster for an event received via SQS, it should delete the message. This should be configurable, for example as an on/off switch or as a limit on the number of attempts before the message is deleted.
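To make the request concrete, here is a minimal sketch of the decision I have in mind. It is not NTH code: `UnknownNodePolicy`, `enabled`, `max_receives`, and `should_delete_message` are hypothetical names used only for illustration. The receive count is assumed to come from the SQS `ApproximateReceiveCount` message attribute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnknownNodePolicy:
    """Hypothetical settings; NTH does not expose these options today."""
    enabled: bool = False   # the on/off switch
    max_receives: int = 1   # delete once the message has been received this many times


def should_delete_message(node_found: bool, receive_count: int,
                          policy: UnknownNodePolicy) -> bool:
    """Return True if a message for a node missing from this cluster should be deleted.

    receive_count is assumed to be the message's ApproximateReceiveCount.
    """
    if node_found:
        # The node belongs to this cluster, so normal processing handles the message.
        return False
    if not policy.enabled:
        # Feature switched off: leave the message in the queue, as happens today.
        return False
    return receive_count >= policy.max_receives
```

With `UnknownNodePolicy(enabled=True, max_receives=3)`, a message for an unknown node would be retried twice and deleted on the third receive.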
Is the feature request related to a problem?
We use NTH in conjunction with Karpenter. Both can handle instance interruption events. Karpenter does not recommend running the two together, but because it cannot handle rebalance recommendation events, we still want NTH to handle that event type exclusively.
We have multiple clusters in a single AWS account and region. These clusters use multiple managed node groups (ASGs). Because we only want NTH to handle one event type, we can only use EventBridge to send the events to SQS. These events do not contain any ASG information, so we cannot filter them by ASG and route them to dedicated queues.
So in our current setup, every rebalance recommendation is sent to every SQS queue (we have one queue per cluster). If the node belongs to the cluster, the event is processed correctly. In the other clusters, NTH just reports that it cannot find the node and gives up. However, it does not delete the message from the queue, so when the message reappears after 30 seconds, NTH repeats the lookup, and this continues until the retention period expires (4 days by default).
We run a large cluster with thousands of spot instances, so this causes NTH to do a lot of unnecessary work.
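As a rough estimate of the wasted work, the snippet below computes an upper bound on how often a single undeleted message can be redelivered. It assumes the message is picked up again as soon as each 30-second visibility timeout ends; `max_redeliveries` is a hypothetical helper, not part of NTH.

```python
def max_redeliveries(retention_seconds: int, visibility_timeout_seconds: int) -> int:
    """Upper bound on receives of one message that is never deleted."""
    return retention_seconds // visibility_timeout_seconds


# 4-day retention period, 30-second visibility timeout
print(max_redeliveries(4 * 24 * 60 * 60, 30))  # 11520
```

That is up to 11,520 failed lookups per message in each cluster that does not own the node.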
Describe alternatives you've considered
For now, the only workaround is a short message retention period to limit the excessive retries.