client.restart does not restart connection between scheduler and workers #1946
Comments
First, let me apologize for the long delay in response here. This is an excellent issue. Thank you for putting the time into the explanation, the minimal example, and the logs. By way of apology, let me introduce you to `tc` for adding simulated network latency: http://bencane.com/2012/07/16/tc-adding-simulated-network-latency-to-your-linux-server/

Analysis
Thank you for mentioning that you are setting the worker up on the same port. I tried to reproduce without this and wasn't able to. Once I set the port I start getting the same issues that you do.
That seems like a very sensible approach to me. I'd be happy to do this, though I'd also welcome help with a pull request. I'd like to support more people becoming familiar with scheduler internals if possible.

Pandas
Yeah, this is an issue. I recommend pushing this upstream a bit. If you're interested in constructing a minimal example that leaks memory without dask, that would be welcome on the pandas issue tracker. I've tried to do that here: pandas-dev/pandas#19941, but I think that if this comes from several independent users then the Pandas community will start taking more interest in the problem.
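(For reference, a dask-free reproduction along these lines might start from the sketch below. This is only an illustration, not the script from pandas-dev/pandas#19941; it assumes `psutil` is available for measuring resident memory, and whether the numbers actually grow will depend on the pandas version and workload.)

```python
import gc

import numpy as np
import pandas as pd
import psutil


def rss_mb():
    """Resident set size of the current process, in MiB."""
    return psutil.Process().memory_info().rss / 2**20


print(f"start: {rss_mb():.1f} MiB")
for i in range(10):
    # Build and discard a moderately large DataFrame, mimicking a worker
    # that repeatedly materializes pandas objects between runs.
    df = pd.DataFrame(np.random.random((1_000_000, 10)),
                      columns=[f"c{j}" for j in range(10)])
    df = df.groupby(df["c0"].round(1)).mean()
    del df
    gc.collect()
    print(f"after round {i}: {rss_mb():.1f} MiB")
```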
No problem, thanks for the great, easy-to-use package! Ah, I didn't know about `tc`, thanks. I tried trickle but wasn't able to reproduce the issue. I can take a stab at a PR for this sometime this week.
I'm glad to hear it. May you discover and fix many more bugs :) Let me know if I can help.
Sorry this is such a long issue, but it is hard to replicate so I wanted to provide enough details.
Summary
It appears that after `client.restart()` on a cluster where the scheduler and worker(s) are on different machines, or at least have some networking delay, jobs submitted to the cluster after the restart will fail because the scheduler doesn't think the worker has the promised key. I've tracked this down to a `CommClosedError` in the TCP rpc connection from the scheduler to the worker.
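For illustration only (the actual test script is attached below), the failure shows up with a client-side loop roughly like this sketch; the scheduler address is a placeholder, and the scheduler and worker are assumed to already be running on separate machines with the worker bound to a fixed port:

```python
from distributed import Client

# Placeholder address: the scheduler and worker run on different machines
# (or with simulated network latency between them).
client = Client("tcp://scheduler:8786")

def inc(x):
    return x + 1

for i in range(5):
    futures = client.map(inc, range(100))
    # On the runs after a restart, this gather is where
    # "Workers don't have promised key" appears: the scheduler reuses a
    # pooled comm to the worker that is in fact already dead.
    print(sum(client.gather(futures)))
    client.restart()  # drop worker state between "runs"
```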
Workaround
The issue is resolved when the worker rpc connection (`scheduler.rpc.available[worker]`) from the `ConnectionPool` is closed in `Scheduler.remove_worker`, so that future connections to the worker require a new call to connect. I think somewhere in the underlying network/kernel the connection is lost on a worker restart, but the `ConnectionPool` thinks the connection is still open.
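A minimal sketch of that workaround, assuming the `ConnectionPool` internals referenced above (an `available` mapping of pooled comms keyed by worker address) and that a pooled comm can be hard-closed with `abort()`; these names come from the version I'm running and may differ in other releases:

```python
# Called from (or inlined into) Scheduler.remove_worker once the worker at
# `address` is being dropped: close any pooled comms to it so the next RPC
# has to establish a fresh connection instead of reusing a dead socket.
def close_pooled_comms(scheduler, address):
    for comm in scheduler.rpc.available.pop(address, set()):
        try:
            comm.abort()  # the comm may already be broken; ignore errors
        except Exception:
            pass
```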
Details
I'm using distributed to process NWP data and compute solar power forecasts every five minutes. The scheduler and workers are running on a Kubernetes/OpenShift cluster. I'm making many API requests and using many pandas objects that seem to retain memory on the workers. To avoid excess memory usage, I call `client.restart()` after each run. On the next run after the restart, at least one future is missing on one of the workers, so the scheduler terminates the worker, which eventually triggers pod restart cooldown periods in Kubernetes.
The `Workers don't have promised key` error wasn't very helpful. I tracked the source of the issue from `Scheduler.gather` to `distributed.utils_comm.gather_from_workers` to `distributed.core.rpc.get_data` to `distributed.core.send_recv`. There, the scheduler tries to write a `get_data` message to the rpc comm between the scheduler and the worker. The worker never gets this message and the `comm.read()` raises a `CommClosedError`.
This rpc comm comes from the `ConnectionPool` that is established before the restart. It seems the restart doesn't close or modify the connection at all, but something in the networking stack breaks this connection. Once the scheduler tries to reuse this connection it thinks is still open, it finds it is closed.
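The same behavior can be seen without distributed at all. In this self-contained sketch with plain sockets (standing in for the pooled scheduler-to-worker comm), the client socket still looks perfectly usable after the peer has gone away, and only an actual read/write exposes the dead connection:

```python
import socket

# A throwaway in-process server stands in for the worker.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.create_connection(server.getsockname())  # the "pooled comm"
peer, _ = server.accept()

# Simulate the worker restart: the remote end goes away entirely.
peer.close()
server.close()

# Nothing has marked `client` as closed locally; only using it (like the
# scheduler's get_data round trip) reveals that the connection is gone.
client.send(b"get_data")                 # usually "succeeds" into the kernel buffer
try:
    print("reply:", client.recv(1024))   # b'' (EOF) or ConnectionResetError
except ConnectionError as exc:
    print("connection is actually dead:", exc)
finally:
    client.close()
```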
The solution I've found is to close the `scheduler.rpc.available` comms for the worker that is removed via `scheduler.remove_worker`. Since the worker is removed from the scheduler, it makes sense to me that any connections to it should also be removed from the connection pool.
This may not be an issue when workers are assigned random ports instead of reusing the same address (which we do here for firewall rules, etc.). I'm using `master` (distributed==1.21.6+12.g650118a8) with Python 3.6.4, and I've only been able to replicate the issue when the scheduler and worker are on different machines. Here's a simple test script that illustrates the issue:
Test Script
Scheduler Log
Worker Log
Client Log