An event store bug slows event handling over time #498

@mechanical-fish

The main() event loop runs once per second and repeatedly calls GetActiveEvent() to find nodes that need to be cordoned or drained. It marks each event InProgress before handling it. There is no code which un-sets InProgress.

GetActiveEvent() returns the first event it finds that isn't marked NodeProcessed. It does not check the InProgress flag.

A run of the event loop stops immediately if GetActiveEvent() returns an InProgress event.
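The three behaviours described above can be modelled in a few lines. This is only a minimal sketch, not the NTH source: the `Event` and `Store` types, the `runOnce` function, and the `handle` callback are hypothetical stand-ins, and only the names `GetActiveEvent`, `InProgress`, and `NodeProcessed` come from the description above.

```go
package main

import "fmt"

// Event is a simplified, hypothetical stand-in for an event store entry.
type Event struct {
	ID            string
	InProgress    bool
	NodeProcessed bool
}

// Store is a hypothetical minimal model of the event store.
type Store struct {
	events map[string]*Event
}

// GetActiveEvent models the described behaviour: it returns the first
// event that is not NodeProcessed and never looks at InProgress.
func (s *Store) GetActiveEvent() (*Event, bool) {
	for _, e := range s.events {
		if !e.NodeProcessed {
			return e, true
		}
	}
	return nil, false
}

// runOnce models one run of the event loop. It stops as soon as
// GetActiveEvent returns an event that is already InProgress.
// handle reports whether the event was fully processed.
// The return value is the number of events picked up in this run.
func runOnce(s *Store, handle func(*Event) bool) int {
	handled := 0
	for {
		e, ok := s.GetActiveEvent()
		if !ok || e.InProgress {
			return handled
		}
		e.InProgress = true // nothing ever un-sets this
		if handle(e) {
			e.NodeProcessed = true
		}
		handled++
	}
}

func main() {
	s := &Store{events: map[string]*Event{"node-a": {ID: "node-a"}}}
	// A handler that never completes leaves the event InProgress
	// but not NodeProcessed.
	failing := func(*Event) bool { return false }

	fmt.Println("first run picked up:", runOnce(s, failing))  // 1
	fmt.Println("second run picked up:", runOnce(s, failing)) // 0: the stale event halts the run
}
```

In this model, any path through `handle` that returns without marking the event processed leaves an entry that `GetActiveEvent` keeps returning and that stops every run that encounters it.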

There are evidently code paths which leave an event marked InProgress without marking it NodeProcessed. This happens slowly but routinely as NTH runs with a reasonably large EKS cluster.

I believe the result is that the event store gradually fills up with stale InProgress events. Whenever GetActiveEvent() returns one of them, the event loop halts immediately until its next iteration one second later. GetActiveEvent() chooses what to return by iterating over a Go map, which traverses the map in an undefined and semi-random order, and that helps disguise the problem -- if map iteration were deterministic, the event loop would eventually halt forever. Instead, as more and more stale InProgress events accumulate, the odds of getting lucky and handling a new event before it becomes moot grow lower and lower.
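To make the "odds grow lower" effect concrete, here is a small simulation sketch. It is not NTH code: the `event` type and the `getActiveEvent`, `freshHandled`, and `countHandled` functions are hypothetical and only mimic the behaviour described in this issue. Go does not specify map iteration order (and it is not guaranteed to be uniformly random), so the counts it prints are indicative only and will vary between runs.

```go
package main

import "fmt"

// event is a hypothetical, stripped-down event store entry.
type event struct {
	inProgress    bool
	nodeProcessed bool
}

// getActiveEvent returns the first unprocessed event in map iteration
// order, ignoring inProgress, as described in this issue.
func getActiveEvent(events map[string]*event) *event {
	for _, e := range events {
		if !e.nodeProcessed {
			return e
		}
	}
	return nil
}

// freshHandled builds a store holding `stale` stuck InProgress events
// plus one new event, runs a single event loop iteration, and reports
// whether the new event was handled during that iteration.
func freshHandled(stale int) bool {
	events := make(map[string]*event, stale+1)
	for i := 0; i < stale; i++ {
		events[fmt.Sprintf("stale-%d", i)] = &event{inProgress: true}
	}
	fresh := &event{}
	events["fresh"] = fresh

	for {
		e := getActiveEvent(events)
		if e == nil || e.inProgress {
			break // the run halts on the first stale event it sees
		}
		e.inProgress = true
		e.nodeProcessed = true // assume handling succeeds
	}
	return fresh.nodeProcessed
}

// countHandled repeats the experiment and counts the lucky iterations.
func countHandled(stale, trials int) int {
	n := 0
	for i := 0; i < trials; i++ {
		if freshHandled(stale) {
			n++
		}
	}
	return n
}

func main() {
	const trials = 1000
	for _, stale := range []int{0, 1, 10, 100} {
		fmt.Printf("stale=%3d  new event handled in %d/%d iterations\n",
			stale, countHandled(stale, trials), trials)
	}
}
```

With no stale events the new event is always handled. As the stale count rises, the new event is handled only in the iterations where map traversal happens to reach it before any stale entry.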

I have a PR which has been helpful for us, which I'll link to this issue.
