
Slow running process dying at the last hurdle #1836

Open
@birdsarah

Description


Twice now I've seen a slow-running job that's taking up a good amount of memory, but is otherwise fine, slow down dramatically towards the end and then die with just a few tasks left, with the following error:

distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:42369, Got: tcp://127.0.0.1:37509, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1647)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:40963, Got: tcp://127.0.0.1:37509, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1646)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:41049, Got: tcp://127.0.0.1:37509, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1640)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:40963, Got: tcp://127.0.0.1:37509, Key: ('getitem-51ca87464cb65fa6caa7647aca61003c', 1613)
distributed.scheduler - ERROR - 'tcp://127.0.0.1:38763'
Traceback (most recent call last):
  File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/distributed/scheduler.py", line 2147, in handle_worker
    handler(worker=worker, **msg)
  File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/distributed/scheduler.py", line 2041, in handle_missing_data
    ws = self.workers[errant_worker]
KeyError: 'tcp://127.0.0.1:38763'
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:40089, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1705)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:42369, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1748)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:41049, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1675)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:41049, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1625)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:40089, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1638)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:46085, Got: tcp://127.0.0.1:32821, Key: ('getitem-607bed68c5b7edc5d31ceee3800d4a54', 1349)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:40089, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1526)
tornado.application - ERROR - Exception in Future <tornado.concurrent.Future object at 0x7f95485fb470> after timeout
Traceback (most recent call last):
  File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/tornado/gen.py", line 910, in error_callback
    future.result()
  File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://127.0.0.1:35799', 'dir': '/home/bird/Dev/mozilla/sb2018/sandpit/dask-worker-space/worker-aydke_wy'}
distributed.nanny - WARNING - Worker process 13244 was killed by signal 15
distributed.nanny - WARNING - Restarting worker
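
Since all of the warnings above point at work stealing, one thing I plan to try is disabling stealing and seeing whether the tail-end slowdown and worker deaths still happen. A minimal sketch, assuming a dask/distributed recent enough to have `dask.config.set` and the dotted `distributed.scheduler.work-stealing` key (older releases like 1.21.x configured this through a flat `work-stealing` entry in the YAML config, if I remember right):

```python
import dask
from dask.distributed import Client, LocalCluster

# Turn stealing off before the scheduler starts. The dotted key below is the
# modern config name; treat it as an assumption for older distributed releases.
dask.config.set({"distributed.scheduler.work-stealing": False})

cluster = LocalCluster(n_workers=8, threads_per_worker=1)  # worker counts are placeholders
client = Client(cluster)

# ...then re-run the same computation and watch whether the
# "Unexpected worker completed task" warnings go away.
```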

In both cases I was able to successfully run the job without distributed, using the vanilla dask scheduler, in less time and with far less memory. I believe the memory requirements were being inflated by the pandas issue pandas-dev/pandas#19941.
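
Roughly what the non-distributed run looked like; the path and column name below are placeholders, and `scheduler="threads"` is the modern keyword (on dask 0.17.x the equivalent is `compute(get=dask.threaded.get)`):

```python
import dask.dataframe as dd

# Placeholder path and column name, only to show the shape of the fallback.
df = dd.read_parquet("dataset/*.parquet")

out = (
    df.groupby("client_id")              # hypothetical column
      .size()
      .compute(scheduler="threads")      # modern spelling; 0.17.x: get=dask.threaded.get
)
```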

System info:

  • Fedora 27
  • using spawn, not forkserver
  • dask & dask-core 0.17.1
  • distributed 1.21.3
  • python 3.6
