I'm still working on getting stuff integrated with pyiron_workflow, and ran into a case where the SlurmClusterExecutor was hanging indefinitely. I was able to reproduce one of my problems locally using LocalFileExecutor:
```python
import time
from concurrent.futures import ThreadPoolExecutor

from executorlib import SingleNodeExecutor
from pyiron_workflow.executors.wrapped_executorlib import LocalFileExecutor


def foo(x):
    time.sleep(x)
    return x


for executor_class, kwargs in (
    (ThreadPoolExecutor, {}),
    (SingleNodeExecutor, {"cache_directory": "cache_sne"}),
    (LocalFileExecutor, {"cache_directory": "cache_lfe"}),
):
    print(executor_class.__name__)
    with executor_class(**kwargs) as executor:
        future = executor.submit(foo, 1)
        print(future)
    # Note: future.result() is called *after* the with-block exits
    print(future.result())
    print(future, "\n")
```

This gives:
```
ThreadPoolExecutor
<Future at 0x108cc0860 state=running>
1
<Future at 0x108cc0860 state=finished returned int>
SingleNodeExecutor
<Future at 0x108c71bb0 state=pending>
1
<Future at 0x108c71bb0 state=finished returned int>
LocalFileExecutor
<Future at 0x10fecc440 state=pending>
```
The LocalFileExecutor case then hangs indefinitely. I can see that cache_lfe/..._i.h5 gets written; the task just never runs. The hang can be avoided if the wrapped function is fast enough, by adding a long enough sleep inside the with block, or (obviously) by moving the future.result() call inside the with block.

I believe what's happening is that the file-based approach takes some extra time somewhere and gets into a race with TaskSchedulerBase.shutdown, such that the clauses checking self._future_queue and self._process do not behave as intended. There is something I'm not understanding deeply enough, though, because the SingleNodeExecutor works fine even when it's using the filesystem cache.
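To make the hypothesis concrete, here is a minimal toy model of the kind of race I suspect. None of these names are executorlib's actual internals; it just shows how a queue-based worker loop that honors a shutdown sentinel can drop a task whose (file-based) hand-off arrives slightly late:

```python
import queue
import threading
import time

# Toy model of the suspected race (illustrative names only, not
# executorlib's real internals): the worker thread exits as soon as it
# sees the shutdown sentinel, so a task whose hand-off is still in
# flight never gets executed and its "future" never resolves.

task_queue = queue.Queue()


def worker():
    while True:
        item = task_queue.get()
        if item is None:  # shutdown sentinel
            return        # exits without checking for late-arriving work
        item["done"].set()  # "run" the task


t = threading.Thread(target=worker)
t.start()

task = {"done": threading.Event()}

# If the file-based submission path takes "some extra time somewhere",
# the shutdown sentinel can reach the queue before the task does.
task_queue.put(None)
time.sleep(0.1)
task_queue.put(task)

t.join()
print("task completed:", task["done"].is_set())  # False: the task was dropped
```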
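For reference, the variant that avoids the hang looks like this (same foo as in the repro above); the only change is that future.result() is called while the executor is still open:

```python
from pyiron_workflow.executors.wrapped_executorlib import LocalFileExecutor

# Workaround: resolve the future before the with-block exits, so that
# shutdown is only triggered after the result has been retrieved.
with LocalFileExecutor(cache_directory="cache_lfe") as executor:
    future = executor.submit(foo, 1)  # foo as defined in the repro above
    print(future.result())
print(future)
```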