Description
The main() event loop runs once per second and repeatedly calls GetActiveEvent() to find nodes that need to be cordoned or drained. It marks each event InProgress before handling it. There is no code which un-sets InProgress.
GetActiveEvent() returns the first event it finds that isn't marked NodeProcessed. It does not check the InProgress flag.
A run of the event loop stops immediately if GetActiveEvent() returns an InProgress event.
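To make the interaction concrete, here is a minimal Go sketch of the behavior described above. It is not NTH's actual code: the Event and Store types, the RunOnce function, and the handle callback are hypothetical stand-ins, and only the InProgress and NodeProcessed flags and the GetActiveEvent name come from this issue.

```go
package main

import "fmt"

// Event is a simplified, hypothetical stand-in for an NTH event.
// Only the two flags discussed in this issue are modeled.
type Event struct {
	ID            string
	InProgress    bool
	NodeProcessed bool
}

// Store is a hypothetical, minimal event store keyed by event ID.
type Store struct {
	events map[string]*Event
}

// GetActiveEvent returns the first event that is not NodeProcessed.
// As described above, it does not look at InProgress.
func (s *Store) GetActiveEvent() (*Event, bool) {
	for _, e := range s.events { // Go map iteration order is unspecified
		if !e.NodeProcessed {
			return e, true
		}
	}
	return nil, false
}

// RunOnce models one pass of the event loop and returns how many events it
// handled. handle is a hypothetical callback that reports whether handling
// got far enough for the event to be marked NodeProcessed.
func RunOnce(s *Store, handle func(*Event) bool) int {
	handled := 0
	for {
		e, ok := s.GetActiveEvent()
		if !ok || e.InProgress {
			// Nothing left to do, or a stale InProgress event ends the pass.
			return handled
		}
		e.InProgress = true // never cleared afterwards
		if handle(e) {
			e.NodeProcessed = true
		}
		handled++
	}
}

func main() {
	s := &Store{events: map[string]*Event{
		"stale": {ID: "stale", InProgress: true},
	}}
	n := RunOnce(s, func(*Event) bool { return true })
	fmt.Println("handled:", n) // prints "handled: 0"
}
```

In this model, any handler path that returns without the event being marked NodeProcessed leaves a permanently InProgress event behind, and every later pass that happens to pick it first ends without doing any work.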
There are evidently code paths which leave an event marked InProgress without marking it NodeProcessed. This happens slowly but routinely as NTH runs with a reasonably large EKS cluster.
I believe the result is that the event store gradually fills up with InProgress events. Whenever GetActiveEvent() returns one of them, the event loop halts immediately until its next iteration one second later. GetActiveEvent() chooses what to return by iterating over a Go map, whose iteration order is unspecified and effectively randomized, and this helps disguise the problem: if map iteration were deterministic, the event loop would eventually halt forever. Instead, as more and more stale InProgress events accumulate, the odds of getting lucky and handling a new event before it becomes moot grow lower and lower.
I have a PR which has been helpful for us, which I'll link to this issue.