Description
I'm running the following notebook cell on MPIsusmat's CM cluster:
```python
import os
import time

from executorlib import SlurmClusterExecutor


def foo(x):
    time.sleep(10)
    return x + 1


with SlurmClusterExecutor(
    cache_directory=os.path.join(os.getcwd(), "foo_dir"),
    resource_dict={
        "partition": "s.cmfe",
        "cores": 1,
    }
) as exe:
    future = exe.submit(foo, 1)
    print(future.result())
```

The initial execution is fine: everything takes a few seconds. Thanks to the sleep, I have a good chance to `squeue | grep $USER` and see my submitted job; the `foo_dir` directory gets created, I can see `run_queue.sh` and `time.out`, I can watch `foo..._i.h5` appear and disappear, and I am nicely left with `foo..._o.h5`.
If I re-execute the cell, everything looks fine from the notebook's perspective: it prints my result (2) rather quickly, certainly much more quickly than `sleep(10)` would allow, so it is definitely using the cache.
But if I watch the directory, I can again see `foo..._i.h5` get written (with the same key), and `squeue | grep $USER` and `sacct -u $USER --start=2025-06-20 | wc -l` show that SLURM is re-running the job.
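For what it's worth, the behavior I see is consistent with a flow like the following sketch (pure speculation on my part, with hypothetical names, and using pickle instead of HDF5 to keep it self-contained): the input file is written and the job launched unconditionally, and an existing output file only pays off when the result is collected.

```python
import os
import pickle


def submit_job(fn, args, cache_directory, launch):
    # Suspected flow: write the input file and launch the job on every
    # call, without checking for a cached output file first.
    key = fn.__name__
    in_file = os.path.join(cache_directory, f"{key}_i.pkl")
    out_file = os.path.join(cache_directory, f"{key}_o.pkl")
    with open(in_file, "wb") as f:   # the _i file reappears on every call
        pickle.dump(args, f)
    launch(fn, args, out_file)       # the SLURM job is re-submitted every time
    return out_file


def collect_result(out_file):
    # Only here does an existing _o file short-circuit anything, which
    # would explain the quick in-notebook return despite the re-submission.
    with open(out_file, "rb") as f:
        return pickle.load(f)
```

With a `launch` that counts its invocations, two identical submissions both return the cached-looking result, yet `launch` runs twice, matching what I observe.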
Since the cache is nicely leveraged in-notebook for my return value, I interpret this re-submission as a bug. The quick return implies to me that there is an effective `if cache_hit()` check somewhere, and the fix is probably as simple as giving some `submit` a safety valve like `if cache_hit: return get_cache(); else: actually_submit()`, but I skimmed through `def submit(` and couldn't see where this happens myself.
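The safety valve I have in mind could look something like this (a minimal sketch with hypothetical names; executorlib actually serializes to HDF5 files like `foo..._i.h5` / `foo..._o.h5`, here I use pickle to keep the example self-contained):

```python
import hashlib
import os
import pickle


def cache_key(fn, args, kwargs):
    # Hypothetical key derivation, similar in spirit to the hash in
    # executorlib's foo<hash>_i.h5 / foo<hash>_o.h5 file names.
    payload = pickle.dumps((fn.__name__, args, sorted(kwargs.items())))
    return hashlib.sha256(payload).hexdigest()[:16]


def submit_with_cache(fn, *args, cache_directory=".", run_job=None, **kwargs):
    # The safety valve: return the cached result if the output file
    # already exists, and only otherwise actually submit the job.
    key = cache_key(fn, args, kwargs)
    out_file = os.path.join(cache_directory, f"{fn.__name__}{key}_o.pkl")
    if os.path.exists(out_file):  # cache hit: no SLURM submission at all
        with open(out_file, "rb") as f:
            return pickle.load(f)
    # cache miss: run the job (stand-in for the sbatch round trip)
    result = run_job(fn, *args, **kwargs) if run_job else fn(*args, **kwargs)
    with open(out_file, "wb") as f:
        pickle.dump(result, f)
    return result
```

With this guard in place, a second identical submission would read the `_o` file directly and neither rewrite the `_i` file nor touch SLURM.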