
Slow running process dying at the last hurdle #1836

Open
@birdsarah

Description


Twice now I've seen a slow-running job that's taking up a good amount of memory, but is otherwise fine, slow down dramatically towards the end and then die with just a few tasks left, with the following error:

distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:42369, Got: tcp://127.0.0.1:37509, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1647)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:40963, Got: tcp://127.0.0.1:37509, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1646)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:41049, Got: tcp://127.0.0.1:37509, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1640)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:40963, Got: tcp://127.0.0.1:37509, Key: ('getitem-51ca87464cb65fa6caa7647aca61003c', 1613)
distributed.scheduler - ERROR - 'tcp://127.0.0.1:38763'
Traceback (most recent call last):
  File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/distributed/scheduler.py", line 2147, in handle_worker
    handler(worker=worker, **msg)
  File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/distributed/scheduler.py", line 2041, in handle_missing_data
    ws = self.workers[errant_worker]
KeyError: 'tcp://127.0.0.1:38763'
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:40089, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1705)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:42369, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1748)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:41049, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1675)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:41049, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1625)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:40089, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1638)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:46085, Got: tcp://127.0.0.1:32821, Key: ('getitem-607bed68c5b7edc5d31ceee3800d4a54', 1349)
distributed.scheduler - WARNING - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://127.0.0.1:40089, Got: tcp://127.0.0.1:32821, Key: ('read-parquet-2f32a06e1434bc59edb228d24c886dd2', 1526)
tornado.application - ERROR - Exception in Future <tornado.concurrent.Future object at 0x7f95485fb470> after timeout
Traceback (most recent call last):
  File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/tornado/gen.py", line 910, in error_callback
    future.result()
  File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/home/bird/miniconda3/envs/sb2018/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://127.0.0.1:35799', 'dir': '/home/bird/Dev/mozilla/sb2018/sandpit/dask-worker-space/worker-aydke_wy'}
distributed.nanny - WARNING - Worker process 13244 was killed by signal 15
distributed.nanny - WARNING - Restarting worker
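
Since all of the warnings above point at work stealing, one thing I plan to try is disabling stealing and seeing whether the tail-end slowdown and worker deaths still happen. A minimal sketch, assuming a dask/distributed recent enough to have `dask.config.set` and the dotted `distributed.scheduler.work-stealing` key (older releases like 1.21.x configured this through a flat `work-stealing` entry in the YAML config, if I remember right):

```python
import dask
from dask.distributed import Client, LocalCluster

# Turn stealing off before the scheduler starts. The dotted key below is the
# modern config name; treat it as an assumption for older distributed releases.
dask.config.set({"distributed.scheduler.work-stealing": False})

cluster = LocalCluster(n_workers=8, threads_per_worker=1)  # worker counts are placeholders
client = Client(cluster)

# ...then re-run the same computation and watch whether the
# "Unexpected worker completed task" warnings go away.
```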

In both cases I was able to successfully run the job without distributed, using the vanilla dask scheduler, in less time and with far less memory. I believe the memory requirements were being inflated by the pandas issue pandas-dev/pandas#19941.
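
Roughly what the non-distributed run looked like; the path and column name below are placeholders, and `scheduler="threads"` is the modern keyword (on dask 0.17.x the equivalent is `compute(get=dask.threaded.get)`):

```python
import dask.dataframe as dd

# Placeholder path and column name, only to show the shape of the fallback.
df = dd.read_parquet("dataset/*.parquet")

out = (
    df.groupby("client_id")              # hypothetical column
      .size()
      .compute(scheduler="threads")      # modern spelling; 0.17.x: get=dask.threaded.get
)
```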

System info:

  • Fedora 27
  • using spawn, not forkserver
  • dask & dask-core 0.17.1
  • distributed 1.21.3
  • python 3.6
