Open
Labels: diagnostics, documentation, stability

Description
There are frequent reports of scheduler memory growing over time:
- Scheduler memory just keep increasing in idle mode #5509
- Are reference cycles a performance problem? #4987 (comment)
- Scheduler memory leak / large worker footprint on simple workload #3898 (there is a different problem here; memory is leaking even with logs turned off, but turning off logs was necessary to debug)
- What's the best way of diagnosing scheduler memory issues? #4998
They often include memory graphs showing steady growth over time.
It's very likely that there is a real bug in the scheduler causing memory to accumulate (#3898 (comment)), but often the steep slope on these graphs is caused by various logs on the scheduler accumulating, such as:
- `transition_log` - `distributed.scheduler.transition-log-length`
- `log` - `distributed.scheduler.transition-log-length` (should maybe be `distributed.admin.log-length`?)
- `events` - `distributed.scheduler.events-log-length`
- `computations` - `distributed.diagnostics.computations.max-history`
- `Node._deque_handler` - `distributed.admin.log-length`
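As a stopgap, these limits can be lowered today through Dask's configuration system before the scheduler starts. A minimal sketch, using the config keys listed above with arbitrary example values:

```python
import dask

# Shrink the scheduler's log retention; set this before the Scheduler is created.
# The keys come from distributed's config schema; the values are only examples.
dask.config.set({
    "distributed.scheduler.transition-log-length": 10_000,
    "distributed.scheduler.events-log-length": 10_000,
    "distributed.diagnostics.computations.max-history": 10,
    "distributed.admin.log-length": 1_000,
})
```

The same keys can also be set in a Dask YAML config file or via the corresponding `DASK_*` environment variables.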
I propose two things:
- Log lengths should be set as a percentage of available memory, not as a count of entries; this is much easier for users to configure. Note that for some/most of these it may be difficult to do accurately, since the size of each entry is unknown; a rough estimate is probably okay. (See the sketch after this list.)
- A memory-cleanup callback that runs, say, once a second and clears out excess logs if the scheduler is under memory pressure. (Also sketched below.)
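A rough sketch of the first idea, assuming a hypothetical `length_for_budget` helper and an illustrative sample entry (neither exists in distributed today); `sys.getsizeof` undercounts nested objects, so this is only the rough estimate mentioned above:

```python
import sys
from collections import deque

import psutil


def length_for_budget(sample_entry, memory_fraction: float) -> int:
    """Hypothetical helper: turn a fraction of total memory into a deque maxlen."""
    budget_bytes = memory_fraction * psutil.virtual_memory().total
    entry_bytes = max(sys.getsizeof(sample_entry), 1)  # rough per-entry estimate
    return max(int(budget_bytes // entry_bytes), 1)


# Illustrative only: real transition-log entries are richer than this tuple.
sample = ("task-key", "processing", "memory", {}, "stimulus-id", 1.0)
transition_log = deque(maxlen=length_for_budget(sample, 0.01))  # ~1% of RAM
```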
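And a minimal sketch of the second idea, assuming the logs are plain bounded deques and using psutil to detect memory pressure; the threshold and the "keep the newest 10%" policy are arbitrary, and this is not the scheduler's actual API:

```python
import asyncio
from collections import deque

import psutil

MEMORY_PRESSURE_FRACTION = 0.8  # arbitrary: treat >80% of total RAM as "pressure"
CHECK_INTERVAL = 1.0            # run roughly once a second, as proposed


async def trim_logs_under_pressure(logs: dict[str, deque]) -> None:
    """Periodically drop old entries from `logs` while the process is under memory pressure."""
    proc = psutil.Process()
    total = psutil.virtual_memory().total
    while True:
        await asyncio.sleep(CHECK_INTERVAL)
        if proc.memory_info().rss > MEMORY_PRESSURE_FRACTION * total:
            for log in logs.values():
                keep = len(log) // 10  # keep only the newest 10% of entries
                while len(log) > keep:
                    log.popleft()
```

In distributed itself this would more naturally be registered with the scheduler's existing periodic-callback machinery rather than a bare asyncio loop.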