[Bug] SlurmClusterExecutor resubmitting jobs that have an existing cache #688

@liamhuber

Description

I'm running the following notebook cell on MPIsusmat's CM cluster:

import os
import time

from executorlib import SlurmClusterExecutor

def foo(x):
    time.sleep(10)
    return x + 1

with SlurmClusterExecutor(
    cache_directory=os.path.join(os.getcwd(), "foo_dir"),
    resource_dict={
        "partition": "s.cmfe", 
        "cores": 1,
    }
) as exe:
    future = exe.submit(foo, 1)
    print(future.result())

Initial execution is fine: everything takes a few seconds. Thanks to the sleep call I have a good chance to run squeue | grep $USER and see my submitted job; the foo_dir directory gets created, I can see run_queue.sh and time.out, and I can watch foo..._i.h5 appear and disappear, leaving me nicely with foo..._o.h5.

If I re-execute the cell, everything looks fine from the perspective of my notebook: it prints my result (2) rather quickly. Certainly much more quickly than my sleep(10), so it is definitely using the cache.

But if I am watching my directory, I can again see foo..._i.h5 get written (with the same key), and I can run squeue | grep $USER and sacct -u $USER --start=2025-06-20 | wc -l to confirm that SLURM is re-running the job.

Since the cache is nicely leveraged in-notebook for my return value, I interpret this re-submission as a bug. The quick return implies to me that somewhere there is an effective if cache_hit() check, and the fix is probably as simple as giving some submit a safety valve like if cache_hit: return get_cache(); else: actually_submit(), but I skimmed through def submit( and couldn't see where myself.
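For illustration, the kind of safety valve I have in mind might look like the sketch below. Everything here is a hypothetical stand-in, not executorlib's actual internals: submit_with_cache_guard, fake_slurm_job, and the pickle-based cache files are invented names, and the real code would key the cache the same way it keys the ..._o.h5 files.

```python
import os
import pickle
import tempfile

def submit_with_cache_guard(cache_directory, key, run_job):
    """Hypothetical safety valve (not executorlib's real API): return the
    cached output for this key if it exists, else run the job and cache it."""
    out_file = os.path.join(cache_directory, f"{key}_o.pkl")
    if os.path.exists(out_file):      # cache hit: skip the SLURM submission
        with open(out_file, "rb") as f:
            return pickle.load(f)
    result = run_job()                # cache miss: actually submit
    os.makedirs(cache_directory, exist_ok=True)
    with open(out_file, "wb") as f:
        pickle.dump(result, f)
    return result

# Demo: the second call returns from cache without re-running the job.
submissions = []
def fake_slurm_job():
    submissions.append("submitted")
    return 1 + 1

with tempfile.TemporaryDirectory() as cache_dir:
    first = submit_with_cache_guard(cache_dir, "foo", fake_slurm_job)
    second = submit_with_cache_guard(cache_dir, "foo", fake_slurm_job)
```

With this guard, the observed behavior (in-notebook cache hit but a fresh SLURM submission) would collapse into a single check before anything is written to the cache directory or handed to the scheduler.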

Metadata


Labels

bug: Something isn't working
