Wrap executorlib executors #678
Conversation
To exploit the new caching interface provided in executorlib-1.5.0, so that nodes can rely on their running state and lexical path to access previously executed results. Locally with the SingleNodeExecutor everything is looking good, but that doesn't natively support terminating the process that submits the job. I'd like to play around with this using the SlurmClusterExecutor on the cluster before making further changes. Signed-off-by: liamhuber <[email protected]>
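For orientation, here is a minimal sketch of the wrapping idea (not the code in this PR): derive the executorlib cache location from a node's lexical path so that a later process can recover the previously executed result by path alone. The helper name and directory layout are hypothetical, and it assumes executorlib executors accept a `cache_directory` argument.

```python
# Hedged sketch only: tie the executorlib cache location to a node's lexical path.
# `make_node_cached_executor` and the path layout are hypothetical illustrations.
from executorlib import SingleNodeExecutor


def make_node_cached_executor(lexical_path: str) -> SingleNodeExecutor:
    # One cache directory per node, so a fresh process can find the
    # previously executed result just from the node's lexical path.
    cache_directory = f"./executorlib_cache/{lexical_path.strip('/')}"
    return SingleNodeExecutor(cache_directory=cache_directory)


# Usage sketch: a node at lexical path "/executor_test/n2" would cache under
# ./executorlib_cache/executor_test/n2
exe = make_node_cached_executor("/executor_test/n2")
future = exe.submit(sum, (1, 2, 3))
print(future.result())  # 6
exe.shutdown()
```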
newfadel
left a comment
This is a great step forward, leveraging executorlib-1.5.0's caching interface will be very beneficial for performance and result reproducibility!
Regarding your work with SlurmClusterExecutor, it's definitely the right approach to test process termination in a real cluster environment. Once you've had a chance to experiment, I'd be particularly interested in understanding:
- Process Management and Cleanup: How will SlurmClusterExecutor manage the lifecycle of the submitted jobs on the cluster? Specifically, what mechanisms will be in place to ensure proper termination and cleanup of resources, especially if the submitting Python process exits unexpectedly or is terminated?
- Caching Strategy with Distributed Execution: With the caching now active across potentially multiple nodes, have you considered potential challenges around cache invalidation or consistency if underlying data changes or if a previous computation failed? Are there plans to implement strategies to ensure the cached results remain valid and reliable in a distributed setting?
- Error Handling and Robustness: For production use on a cluster, robust error handling is crucial. How will the wrapped executors handle common cluster-specific issues like job failures, network interruptions, or resource limits?
Looking forward to seeing the progress on this! 😄
Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
Codecov Report

Attention: Patch coverage is 88.88%.

❌ Your patch status has failed because the patch coverage (88.88%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

@@            Coverage Diff             @@
##             main     #678      +/-   ##
==========================================
- Coverage   92.11%   92.05%   -0.07%
==========================================
  Files          33       34       +1
  Lines        3665     3725      +60
==========================================
+ Hits         3376     3429      +53
- Misses        289      296       +7

☔ View full report in Codecov by Sentry.
Since we now depend explicitly on a new feature Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
It causes a weird hang that blocks observability. Signed-off-by: liamhuber <[email protected]>
And make the expected file independently accessible Signed-off-by: liamhuber <[email protected]>
And hide it behind its own boolean flag for testing Signed-off-by: liamhuber <[email protected]>
And make the file name fixed and accessible at the class level Signed-off-by: liamhuber <[email protected]>
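Schematically, the idea in the last few commits might look like this (a sketch with hypothetical names, not the actual class): the cache file name is a fixed class-level constant, and the expected file location is computable without constructing an executor.

```python
import os


class CachingExecutorSketch:
    """Hypothetical stand-in for the wrapped executor described above."""

    # Fixed, class-level file name so tests can locate the expected file
    # without an executor instance.
    CACHE_FILE_NAME = "cache.h5"  # hypothetical name

    @classmethod
    def expected_cache_file(cls, cache_directory: str) -> str:
        # Independently accessible: callers only need the cache directory.
        return os.path.join(cache_directory, cls.CACHE_FILE_NAME)
```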
Signed-off-by: liamhuber <[email protected]>
The slurm executor populates this with a submission script, etc. Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
From @jan-janssen in [this comment](pyiron/executorlib#708 (comment)) Co-authored-by: Jan Janssen Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
This is currently working for all attempted behaviour when run together with pyiron/executorlib#712.
Signed-off-by: liamhuber <[email protected]>
The local file executor got directly included in executorlib as a testing tool. Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
And always with-execute tuples, since there is only ever one instance of this executor. If we have already been assigned an executor _instance_, then we trust the user to be managing its state and submit directly rather than wrapping in a with-clause. Signed-off-by: liamhuber <[email protected]>
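A sketch of that policy with generic names (not the PR's actual code), assuming the executor is specified either as a ready-made instance or as a (class, args, kwargs) tuple:

```python
from concurrent.futures import Executor, Future


def submit_via(executor_spec, fn, *args, **kwargs) -> Future:
    """Hypothetical helper mirroring the policy described above."""
    if isinstance(executor_spec, Executor):
        # Pre-built instance: the user manages its state, so submit directly.
        return executor_spec.submit(fn, *args, **kwargs)
    # (class, args, kwargs) tuple: build a one-shot executor and with-execute
    # it, guaranteeing shutdown once its single future is done.
    exe_class, exe_args, exe_kwargs = executor_spec
    with exe_class(*exe_args, **exe_kwargs) as exe:
        return exe.submit(fn, *args, **kwargs)
```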
Recent changes threw off the balance of times in the first vs second run, so instead compare against what you actually care about: that the second run is bypassing the sleep call. Signed-off-by: liamhuber <[email protected]>
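A sketch of what such a test can look like (illustrative only; the workflow name, node names, and sleep time are placeholders):

```python
import time

import pyiron_workflow as pwf

SLEEP_TIME = 10.0  # placeholder sleep duration in seconds

wf = pwf.Workflow("cache_timing_check")
wf.slow = pwf.std.Sleep(SLEEP_TIME)

wf.run()  # first run actually sleeps

start = time.perf_counter()
wf.run()  # second run should be served from the node cache
elapsed = time.perf_counter() - start
# Don't compare first- vs second-run times; just assert the sleep was bypassed.
assert elapsed < SLEEP_TIME, "cached run should bypass the sleep call"
```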
Instead of a with-clause. This way the executor is still permitted to release the thread before the job is done, but we still guarantee that executors created by bespoke instructions get shut down at the end of their one-future lifetime. Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
There was necessarily only the one future, so don't wait at shutdown. This removes the need for accepting the runtime error and prevents the wrapped executorlib executors from hanging indefinitely. Signed-off-by: liamhuber <[email protected]>
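The resulting pattern, sketched with a stand-in ThreadPoolExecutor in place of the wrapped executorlib executor:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def work(x):
    time.sleep(1)
    return x * 2


# One-shot executor holding a single future: no with-clause, so the submitting
# thread is released immediately; the done-callback then shuts the executor
# down without waiting, since there is nothing else pending.
exe = ThreadPoolExecutor(max_workers=1)
future = exe.submit(work, 21)
future.add_done_callback(lambda f: exe.shutdown(wait=False))

print(future.result())  # 42; the executor is shut down by the callback
```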
Waiting on
Together with pyiron/executorlib#732 this is working very nicely on the cluster now. I can run through each of the cases I'm interested in, and in all cases everything runs perfectly smoothly. I.e. I can start with this:

import pyiron_workflow as pwf
from pyiron_workflow.executors.wrapped_executorlib import CacheSlurmClusterExecutor

wf = pwf.Workflow("executor_test")
wf.n1 = pwf.std.UserInput(20)
wf.n2 = pwf.std.Sleep(wf.n1)
wf.n3 = pwf.std.UserInput(wf.n2)
wf.n2.executor = (CacheSlurmClusterExecutor, (), {"resource_dict": {"partition": "s.cmfe"}})
wf.run()

And then either let it run, or restart the kernel and follow up with this after the appropriate delay for the case I'm interested in:

import pyiron_workflow as pwf
from pyiron_workflow.executors.wrapped_executorlib import CacheSlurmClusterExecutor

wf = pwf.Workflow("executor_test")
wf.load(filename=wf.label + "/recovery.pckl")
wf.failed = False
wf.use_cache = False
wf.run()

Outside the scope of this PR but on the TODO list is:
Including the lower bound Signed-off-by: liamhuber <[email protected]>
And debug the error message Signed-off-by: liamhuber <[email protected]>
Since that's the way users will typically interact with this field. I also had to change the inheritance order to make sure we were dealing with the user-facing executor and not the task scheduler, but this doesn't impact the submit loop. Signed-off-by: liamhuber <[email protected]>
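The inheritance-order point is just standard Python MRO; schematically (all class names hypothetical, not the actual executorlib or pyiron_workflow classes):

```python
# Hypothetical illustration: with the user-facing executor listed first among
# the bases, its attributes win in the method resolution order (MRO).
class TaskScheduler:
    def describe(self):
        return "task scheduler"


class UserFacingExecutor:
    def describe(self):
        return "user-facing executor"


class WrappedExecutor(UserFacingExecutor, TaskScheduler):
    pass


print(WrappedExecutor().describe())  # "user-facing executor"
```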
Signed-off-by: liamhuber <[email protected]>
Signed-off-by: liamhuber <[email protected]>
So we pass through the Runnable._shutdown_executor_callback process Signed-off-by: liamhuber <[email protected]>
Codecov complains that
Signed-off-by: liamhuber <[email protected]>
Actually, I'd be more comfortable with this PR if it included these. I'll still leave exposure in the API for later, but let's take a crack at robust testing right here.
* Test slurm submission
Signed-off-by: liamhuber <[email protected]>
* Don't apply callbacks to cached returns
Signed-off-by: liamhuber <[email protected]>
* Only validate submission-time resources
Otherwise we run into trouble where it loads saved executor instructions (that already have what it would use anyhow)
Signed-off-by: liamhuber <[email protected]>
* Mark module
Signed-off-by: liamhuber <[email protected]>
* Test cached result branch
Signed-off-by: liamhuber <[email protected]>
---------
Signed-off-by: liamhuber <[email protected]>