An event store bug slows event handling over time

The `main()` event loop runs once per second and repeatedly calls `GetActiveEvent()` to find nodes that need to be cordoned or drained. It marks each event InProgress before handling it. No code ever clears InProgress.

`GetActiveEvent()` returns the first event it finds that isn't marked NodeProcessed. It does not check the InProgress flag.

A run of the event loop stops immediately if `GetActiveEvent()` returns an InProgress event.

There are evidently code paths which leave an event marked InProgress without marking it NodeProcessed. This happens slowly but routinely as NTH runs with a reasonably large EKS cluster.

I believe the result is that the event store gradually fills up with stale InProgress events. Whenever `GetActiveEvent()` returns one of them, the event loop halts until its next iteration one second later. `GetActiveEvent()` chooses what to return by iterating over a Go map, whose iteration order is unspecified and effectively randomized, and that helps disguise the problem: if map iteration were deterministic, the event loop would eventually halt forever. Instead, as more and more stale InProgress events accumulate, the odds of getting lucky and handling a new event before it becomes moot grow lower and lower.

I have a PR that has been helpful for us, and I'll link it to this issue.


An event store bug slows event handling over time #498
