Twice now I've seen a slow-running job, one that's taking up a good amount of memory but is still okay, slow down dramatically towards the end and then die with just a few tasks left, with the following error:
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing. Expected: tcp://127.0.0.1:42369, Got: tcp://127.0.0.1:37509, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1647)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing. Expected: tcp://127.0.0.1:40963, Got: tcp://127.0.0.1:37509, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1646)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing. Expected: tcp://127.0.0.1:41049, Got: tcp://127.0.0.1:37509, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1640)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing. Expected: tcp://127.0.0.1:40963, Got: tcp://127.0.0.1:37509, Key: ('getitem-51ca87464cb65fa6caa7647aca61003c', 1613)
distributed.scheduler - ERROR - 'tcp://127.0.0.1:38763'
Traceback (most recent call last):
File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/distributed/scheduler.py", line 2147, in handle_worker
handler(worker=worker, **msg)
File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/distributed/scheduler.py", line 2041, in handle_missing_data
ws = self.workers[errant_worker]
KeyError: 'tcp://127.0.0.1:38763'
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing. Expected: tcp://127.0.0.1:40089, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1705)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing. Expected: tcp://127.0.0.1:42369, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1748)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing. Expected: tcp://127.0.0.1:41049, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1675)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing. Expected: tcp://127.0.0.1:41049, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1625)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing. Expected: tcp://127.0.0.1:40089, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1638)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing. Expected: tcp://127.0.0.1:46085, Got: tcp://127.0.0.1:32821, Key: ('getitem-607bed68c5b7edc5d31ceee3800d4a54', 1349)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing. Expected: tcp://127.0.0.1:40089, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1526)
tornado.application - ERROR - Exception in Future <tornado.concurrent.Future object at 0x7f95485fb470> after timeout
Traceback (most recent call last):
File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/tornado/gen.py", line 910, in error_callback
future.result()
File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
yielded = self.gen.send(value)
File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
assert msg == 'started', msg
AssertionError: {'address': 'tcp://127.0.0.1:35799', 'dir': '/home/bird/Dev/mozilla/sb2018/sandpit/dask-worker-space/worker-aydke_wy'}
distributed.nanny - WARNING - Worker process 13244 was killed by signal 15
distributed.nanny - WARNING - Restarting worker
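For context, the task keys in the log above (read-parquet-*, getitem-*) come from a dask.dataframe job of roughly this shape; the paths, column names, and final aggregation below are placeholders rather than the actual code:

```python
import dask.dataframe as dd
from distributed import Client, LocalCluster

# Rough sketch of the failing workload (placeholders only): a local
# cluster of workers on tcp://127.0.0.1:*, reading parquet and then
# selecting columns, which is what produces the read-parquet-* and
# getitem-* task keys seen in the log.
cluster = LocalCluster()
client = Client(cluster)

df = dd.read_parquet('data/*.parquet')      # read-parquet-* tasks
counts = df['some_column'].value_counts()   # getitem-* tasks
result = counts.compute()
```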
In both cases I was able to get the job to run successfully without distributed, using the vanilla dask scheduler, in less time and with far less memory. I believe the memory requirements were being inflated by the pandas issue pandas-dev/pandas#19941.
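By "without distributed" I mean computing with one of dask's built-in local schedulers instead of a Client. In the dask 0.17 API that is selected with the get= keyword; which local scheduler was actually used is an assumption here:

```python
import dask
import dask.dataframe as dd
import dask.threaded
import dask.multiprocessing  # only needed for the multiprocessing scheduler

# Same workload, but without a distributed Client: in dask 0.17.x a
# built-in local scheduler is chosen via the get= keyword.
df = dd.read_parquet('data/*.parquet')
result = df['some_column'].value_counts().compute(get=dask.threaded.get)

# or set it globally for the session:
dask.set_options(get=dask.multiprocessing.get)
```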
System info:
- Fedora 27
- using spawn, not forkserver (see the note below this list)
- dask & dask-core 0.17.1
- distributed 1.21.3
- python 3.6
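For clarity on the spawn/forkserver point: those are Python multiprocessing start methods for the worker processes launched by the nanny. The distributed config exposes this as a multiprocessing-method setting (key name from memory, so treat it as approximate); in plain multiprocessing terms the distinction is:

```python
import multiprocessing

# 'spawn' starts each child from a fresh Python interpreter, while
# 'forkserver' forks children from a pre-started server process.
# (Illustrative only: distributed launches its worker processes via the
# Nanny, configured through a multiprocessing-method setting in its
# config -- key name approximate.)
if __name__ == '__main__':
    ctx = multiprocessing.get_context('spawn')
    proc = ctx.Process(target=print, args=('started via spawn',))
    proc.start()
    proc.join()
```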