Migrate from MCAD to AppWrapper v1beta2 #521
Conversation
Force-pushed from 7970bed to ad10aef.
I've completed the porting. Ready for review.
Rebased yet again.
Force-pushed from 12ca9b0 to b24ea9a.
/lgtm
While running through a notebook and performing cluster.up() to start the AppWrapper and RayCluster resources, I encounter some inconsistent results. The AppWrapper CR goes into a suspending phase, causing deletion of the RayCluster CR, which is brought back once the AppWrapper is Running again. It keeps cycling through this loop.
I found some logs from Kueue after the AW goes into suspending:
{"level":"Level(-2)","ts":"2024-05-23T14:39:11.000485919Z","caller":"core/workload_controller.go:357","msg":"Start the eviction of the workload due to exceeding the PodsReady timeout","controller":"workload","controllerGroup":"kueue.x-k8s.io","controllerKind":"Workload","Workload":{"name":"appwrapper-jobtest2-6bb70","namespace":"chris"},"namespace":"chris","name":"appwrapper-jobtest2-6bb70","reconcileID":"808f579a-9a04-405b-9d15-52f5b692a344","workload":{"name":"appwrapper-jobtest2-6bb70","namespace":"chris"}}
{"level":"error","ts":"2024-05-23T14:40:18.206413487Z","logger":"workload-reconciler","caller":"core/workload_controller.go:588","msg":"Updating workload in cache","workload":{"name":"raycluster-jobtest2-a8bc6","namespace":"chris"},"queue":"workload.codeflare.dev.admitted","status":"admitted","clusterQueue":"workload.codeflare.dev.admitted","error":"old ClusterQueue doesn't exist","stacktrace":"sigs.k8s.io/kueue/pkg/controller/core.(*WorkloadReconciler).Update\n\t/workspace/pkg/controller/core/workload_controller.go:588\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*EventHandler).OnUpdate\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/event_handler.go:113\nk8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate\n\t/workspace/vendor/k8s.io/client-go/tools/cache/controller.go:246\nk8s.io/client-go/tools/cache.(*processorListener).run.func1\n\t/workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:970\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/…
I see that the AppWrapper is using the specified queue-name, while the RayCluster is using this one:
labels:
  controller-tools.k8s.io: '1.0'
  kueue.x-k8s.io/queue-name: workload.codeflare.dev.admitted
This could be what's causing the issue.
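For illustration, this is roughly what consistent labeling would look like. All names here (the RayCluster name, the chris namespace, and the user-queue LocalQueue) are illustrative assumptions, not values from this PR:

```yaml
# Hypothetical RayCluster metadata: kueue.x-k8s.io/queue-name should reference
# an existing LocalQueue, and it should be the same value the owning AppWrapper
# uses, rather than a string like "workload.codeflare.dev.admitted" that does
# not name any LocalQueue (which matches the "old ClusterQueue doesn't exist"
# error in the logs above).
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: jobtest2          # illustrative name
  namespace: chris
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # must match an existing LocalQueue
```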
…'t get a localqueue
@ChristianZaccaria -- could you add […]
Actually, you must have that already, since you are showing INFO-level logs. My usual debugging trick for observing an appwrapper is to do a […]
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: astefanutti, Srihari1192. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing […]
Requested changes are not needed.