Set scheduler log sizes automatically based on available memory #5570

@gjoseph92

There are frequent reports of scheduler memory growing over time:

They often involve memory graphs that look like:
[image: scheduler memory usage graph]

It's very likely that there is a real bug in the scheduler causing memory to accumulate (#3898 (comment)), but often the steep slope on these graphs is caused by various logs on the scheduler accumulating, such as:

  • transition_log - distributed.scheduler.transition-log-length
  • log - distributed.scheduler.transition-log-length (should maybe be distributed.admin.log-length?)
  • events - distributed.scheduler.events-log-length
  • computations - distributed.diagnostics.computations.max-history
  • Node._deque_handler - distributed.admin.log-length
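
For reference, here is a rough sketch of how the lengths listed above could be capped by hand today. The keys are the ones from the list; the numeric values are made-up placeholders, not recommendations, and I'm assuming the overrides are applied before the scheduler is created so that it picks them up.

```python
# Hypothetical override values (placeholders only); the keys are the
# config options listed above.
log_length_overrides = {
    "distributed.scheduler.transition-log-length": 10_000,
    "distributed.scheduler.events-log-length": 10_000,
    "distributed.diagnostics.computations.max-history": 20,
    "distributed.admin.log-length": 1_000,
}

try:
    import dask
except ImportError:  # keep the sketch runnable without dask installed
    dask = None

if dask is not None:
    # Assumption: this runs before the scheduler starts, so the
    # scheduler reads the overridden values when it sets up its logs.
    dask.config.set(log_length_overrides)
```

Every value here is a count of entries, which says little about how many bytes each log will actually hold.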

I propose two things:

  1. Log lengths should be set as a percentage of available memory, not as a number of entries. This is much easier for users to configure.
    Note that for some/most of these, that may be difficult to do accurately, since the size of the entries is unknown. A rough estimate is probably okay.
  2. A memory-cleanup callback that runs, say, once a second, and clears out excess logs if the scheduler is under memory pressure.
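
A minimal sketch of what both ideas might look like, assuming the logs behave like `collections.deque` objects. All function names, the 0.8 pressure threshold, and the "keep half" policy are hypothetical and only for illustration; how the callback gets scheduled once a second and how memory usage is measured are left out.

```python
import sys
from collections import deque


def estimate_entry_size(sample_entries):
    """Rough average size in bytes of some log entries.

    Uses shallow sys.getsizeof, so nested objects are not counted.
    This is only the "rough estimate" mentioned in item 1.
    """
    if not sample_entries:
        return 0
    return sum(sys.getsizeof(e) for e in sample_entries) / len(sample_entries)


def length_for_memory_fraction(total_memory, fraction, entry_size, minimum=100):
    """Turn a memory budget (fraction of total_memory) into a max length."""
    if entry_size <= 0:
        return minimum
    return max(minimum, int(total_memory * fraction / entry_size))


def trim_logs(logs, used_memory, total_memory,
              pressure_threshold=0.8, keep_fraction=0.5):
    """Drop the oldest entries from each log when under memory pressure.

    Returns the number of entries removed (0 if there is no pressure).
    """
    if used_memory / total_memory < pressure_threshold:
        return 0
    removed = 0
    for log in logs:
        n_drop = len(log) - int(len(log) * keep_fraction)
        for _ in range(n_drop):
            log.popleft()  # oldest entries sit at the left of the deque
        removed += n_drop
    return removed


# Usage: size a log to roughly 1% of 8 GiB, assuming ~200-byte entries.
transition_log = deque(
    maxlen=length_for_memory_fraction(8 * 2**30, 0.01, 200)
)
```

Something like `trim_logs` would be the body of the periodic callback in item 2; the estimate in item 1 only needs to be good enough to pick a sensible `maxlen` up front.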

Labels: diagnostics, documentation, stability
