Wrap executorlib executors #678
Conversation
To exploit the new caching interface provided in executorlib-1.5.0, so that nodes can rely on their running state and lexical path to access previously executed results. Locally with the SingleNodeExecutor everything is looking good, but that doesn't natively support terminating the process that submits the job. I'd like to play around with this using the SlurmClusterExecutor on the cluster before making further changes. Signed-off-by: liamhuber <[email protected]>
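For orientation, here is a minimal sketch of the wrapping idea (not the code in this PR): derive the executorlib cache location from a node's lexical path so that a later process can recover the previously executed result by path alone. The helper name and directory layout are hypothetical, and it assumes executorlib executors accept a `cache_directory` argument.

```python
# Hedged sketch only: tie the executorlib cache location to a node's lexical path.
# `make_node_cached_executor` and the path layout are hypothetical illustrations.
from executorlib import SingleNodeExecutor


def make_node_cached_executor(lexical_path: str) -> SingleNodeExecutor:
    # One cache directory per node, so a fresh process can find the
    # previously executed result just from the node's lexical path.
    cache_directory = f"./executorlib_cache/{lexical_path.strip('/')}"
    return SingleNodeExecutor(cache_directory=cache_directory)


# Usage sketch: a node at lexical path "/executor_test/n2" would cache under
# ./executorlib_cache/executor_test/n2
exe = make_node_cached_executor("/executor_test/n2")
future = exe.submit(sum, (1, 2, 3))
print(future.result())  # 6
exe.shutdown()
```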
newfadel
left a comment
This is a great step forward, leveraging executorlib-1.5.0's caching interface will be very beneficial for performance and result reproducibility!
Regarding your work with SlurmClusterExecutor, it's definitely the right approach to test process termination in a real cluster environment. Once you've had a chance to experiment, I'd be particularly interested in understanding:
- Process Management and Cleanup: How will SlurmClusterExecutor manage the lifecycle of the submitted jobs on the cluster? Specifically, what mechanisms will be in place to ensure proper termination and cleanup of resources, especially if the submitting Python process exits unexpectedly or is terminated?
- Caching Strategy with Distributed Execution: With the caching now active across potentially multiple nodes, have you considered potential challenges around cache invalidation or consistency if underlying data changes or if a previous computation failed? Are there plans to implement strategies to ensure the cached results remain valid and reliable in a distributed setting?
- Error Handling and Robustness: For production use on a cluster, robust error handling is crucial. How will the wrapped executors handle common cluster-specific issues like job failures, network interruptions, or resource limits?
Looking forward to seeing the progress on this! 😄
Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
Codecov Report

Attention: Patch coverage is 88.88%.

❌ Your patch status has failed because the patch coverage (88.88%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

@@            Coverage Diff             @@
##             main     #678      +/-   ##
==========================================
- Coverage   92.11%   92.05%   -0.07%
==========================================
  Files          33       34       +1
  Lines        3665     3725      +60
==========================================
+ Hits         3376     3429      +53
- Misses        289      296       +7

☔ View full report in Codecov by Sentry.
Since we now depend explicitly on a new feature Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
It causes a weird hang that blocks observability. Signed-off-by: liamhuber <[email protected]>
And make the expected file independently accessible Signed-off-by: liamhuber <[email protected]>
And hide it behind its own boolean flag for testing Signed-off-by: liamhuber <[email protected]>
And make the file name fixed and accessible at the class level Signed-off-by: liamhuber <[email protected]>
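Schematically, the idea in the last few commits might look like this (a sketch with hypothetical names, not the actual class): the cache file name is a fixed class-level constant, and the expected file location is computable without constructing an executor.

```python
import os


class CachingExecutorSketch:
    """Hypothetical stand-in for the wrapped executor described above."""

    # Fixed, class-level file name so tests can locate the expected file
    # without an executor instance.
    CACHE_FILE_NAME = "cache.h5"  # hypothetical name

    @classmethod
    def expected_cache_file(cls, cache_directory: str) -> str:
        # Independently accessible: callers only need the cache directory.
        return os.path.join(cache_directory, cls.CACHE_FILE_NAME)
```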
Signed-off-by: liamhuber <[email protected]>
The slurm executor populates this with a submission script, etc. Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
From @jan-janssen in [this comment](pyiron/executorlib#708 (comment)) Co-authored-by: Jan Janssen Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
This is currently working for all attempted behaviour when run together with pyiron/executorlib#712.
Signed-off-by: liamhuber <[email protected]>
The local file executor got directly included in executorlib as a testing tool. Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
And always with-execute tuples, since there is only ever one instance of this executor. If we have already been assigned an executor _instance_, then we trust the user to be managing its state and submit directly rather than wrapping in a with-clause. Signed-off-by: liamhuber <[email protected]>
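A sketch of that policy with generic names (not the PR's actual code), assuming the executor is specified either as a ready-made instance or as a (class, args, kwargs) tuple:

```python
from concurrent.futures import Executor, Future


def submit_via(executor_spec, fn, *args, **kwargs) -> Future:
    """Hypothetical helper mirroring the policy described above."""
    if isinstance(executor_spec, Executor):
        # Pre-built instance: the user manages its state, so submit directly.
        return executor_spec.submit(fn, *args, **kwargs)
    # (class, args, kwargs) tuple: build a one-shot executor and with-execute
    # it, guaranteeing shutdown once its single future is done.
    exe_class, exe_args, exe_kwargs = executor_spec
    with exe_class(*exe_args, **exe_kwargs) as exe:
        return exe.submit(fn, *args, **kwargs)
```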
Recent changes threw off the balance of times in the first vs second run, so instead compare against what you actually care about: that the second run is bypassing the sleep call. Signed-off-by: liamhuber <[email protected]>
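A sketch of what such a test can look like (illustrative only; the workflow name, node names, and sleep time are placeholders):

```python
import time

import pyiron_workflow as pwf

SLEEP_TIME = 10.0  # placeholder sleep duration in seconds

wf = pwf.Workflow("cache_timing_check")
wf.slow = pwf.std.Sleep(SLEEP_TIME)

wf.run()  # first run actually sleeps

start = time.perf_counter()
wf.run()  # second run should be served from the node cache
elapsed = time.perf_counter() - start
# Don't compare first- vs second-run times; just assert the sleep was bypassed.
assert elapsed < SLEEP_TIME, "cached run should bypass the sleep call"
```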
Instead of a with-clause. This way the executor is still permitted to release the thread before the job is done, but we still guarantee that executors created by bespoke instructions get shut down at the end of their one-future lifetime. Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
There was necessarily only the one future, so don't wait at shutdown. This removes the need for accepting the runtime error and prevents the wrapped executorlib executors from hanging indefinitely. Signed-off-by: liamhuber <[email protected]>
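The resulting pattern, sketched with a stand-in ThreadPoolExecutor in place of the wrapped executorlib executor:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def work(x):
    time.sleep(1)
    return x * 2


# One-shot executor holding a single future: no with-clause, so the submitting
# thread is released immediately; the done-callback then shuts the executor
# down without waiting, since there is nothing else pending.
exe = ThreadPoolExecutor(max_workers=1)
future = exe.submit(work, 21)
future.add_done_callback(lambda f: exe.shutdown(wait=False))

print(future.result())  # 42; the executor is shut down by the callback
```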
Waiting on
Together with pyiron/executorlib#732 this is working very nicely on the cluster now. I can run through each of the cases I'm interested in, and in all cases everything runs perfectly smoothly. I.e. I can start with this:

import pyiron_workflow as pwf
from pyiron_workflow.executors.wrapped_executorlib import CacheSlurmClusterExecutor

wf = pwf.Workflow("executor_test")
wf.n1 = pwf.std.UserInput(20)
wf.n2 = pwf.std.Sleep(wf.n1)
wf.n3 = pwf.std.UserInput(wf.n2)
wf.n2.executor = (CacheSlurmClusterExecutor, (), {"resource_dict": {"partition": "s.cmfe"}})
wf.run()

And then either let it run, or restart the kernel and follow up with this after the appropriate delay for the case I'm interested in:

import pyiron_workflow as pwf
from pyiron_workflow.executors.wrapped_executorlib import CacheSlurmClusterExecutor

wf = pwf.Workflow("executor_test")
wf.load(filename=wf.label + "/recovery.pckl")
wf.failed = False
wf.use_cache = False
wf.run()

Outside the scope of this PR but on the TODO list is:
Including the lower bound Signed-off-by: liamhuber <[email protected]>
And debug the error message Signed-off-by: liamhuber <[email protected]>
Since that's the way users will typically interact with this field. I also had to change the inheritance order to make sure we were dealing with the user-facing executor and not the task scheduler, but this doesn't impact the submit loop. Signed-off-by: liamhuber <[email protected]>
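The inheritance-order point is just standard Python MRO; schematically (all class names hypothetical, not the actual executorlib or pyiron_workflow classes):

```python
# Hypothetical illustration: with the user-facing executor listed first among
# the bases, its attributes win in the method resolution order (MRO).
class TaskScheduler:
    def describe(self):
        return "task scheduler"


class UserFacingExecutor:
    def describe(self):
        return "user-facing executor"


class WrappedExecutor(UserFacingExecutor, TaskScheduler):
    pass


print(WrappedExecutor().describe())  # "user-facing executor"
```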
Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
So we pass through the Runnable._shutdown_executor_callback process Signed-off-by: liamhuber <[email protected]>
Codecov complains that
Signed-off-by: liamhuber <[email protected]>
Actually, I'd be more comfortable with this PR if it included these. I'll still leave exposure in the API for later, but let's take a crack at robust testing right here.
* Test slurm submission
Signed-off-by: liamhuber <[email protected]>
* Don't apply callbacks to cached returns
Signed-off-by: liamhuber <[email protected]>
* Only validate submission-time resources
Otherwise we run into trouble where it loads saved executor instructions (that already have what it would use anyhow)
Signed-off-by: liamhuber <[email protected]>
* Mark module
Signed-off-by: liamhuber <[email protected]>
* Test cached result branch
Signed-off-by: liamhuber <[email protected]>
---------
Signed-off-by: liamhuber <[email protected]>