Skip to content

Migrate from MCAD to AppWrapper v1beta2 #521

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jun 5, 2024

Conversation

dgrove-oss
Copy link
Collaborator

@dgrove-oss dgrove-oss commented Apr 23, 2024

  1. Rename the flag from mcad to appwrapper
  2. Drop dispatch_priority and related test (obsolete MCAD feature).
  3. Simplify mocked AppWrappers
  4. Port AppWrappers from v1beta1 to v1beta2

@dgrove-oss dgrove-oss force-pushed the mcad-to-aw branch 7 times, most recently from 7970bed to ad10aef Compare May 1, 2024 15:50
@dgrove-oss dgrove-oss changed the title Start migration from MCAD to AppWrapper v1beta2 Migrate from MCAD to AppWrapper v1beta2 May 3, 2024
@dgrove-oss
Copy link
Collaborator Author

I've completed the porting. Ready for review.

@dgrove-oss
Copy link
Collaborator Author

rebased yet again.

@dgrove-oss dgrove-oss force-pushed the mcad-to-aw branch 2 times, most recently from 12ca9b0 to b24ea9a Compare May 20, 2024 13:56
@astefanutti
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 22, 2024
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label May 22, 2024
@Srihari1192 Srihari1192 self-requested a review May 23, 2024 07:22
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 23, 2024
Copy link
Collaborator

@ChristianZaccaria ChristianZaccaria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While running through a notebook and perform cluster.up() to start the AppWrapper and RayCluster resource I encounter some inconsistent results.

It looks like the AppWrapper CR goes into a suspending phase, causing the deletion of the RayCluster CR, and brought back once the AppWrapper is Running again. - It keeps going through this loop.

I found some logs from Kueue after the AW goes into suspending:

{"level":"Level(-2)","ts":"2024-05-23T14:39:11.000485919Z","caller":"core/workload_controller.go:357","msg":"Start the eviction of the workload due to exceeding the PodsReady timeout","controller":"workload","controllerGroup":"kueue.x-k8s.io","controllerKind":"Workload","Workload":{"name":"appwrapper-jobtest2-6bb70","namespace":"chris"},"namespace":"chris","name":"appwrapper-jobtest2-6bb70","reconcileID":"808f579a-9a04-405b-9d15-52f5b692a344","workload":{"name":"appwrapper-jobtest2-6bb70","namespace":"chris"}} {"level":"error","ts":"2024-05-23T14:40:18.206413487Z","logger":"workload-reconciler","caller":"core/workload_controller.go:588","msg":"Updating workload in cache","workload":{"name":"raycluster-jobtest2-a8bc6","namespace":"chris"},"queue":"workload.codeflare.dev.admitted","status":"admitted","clusterQueue":"workload.codeflare.dev.admitted","error":"old ClusterQueue doesn't exist","stacktrace":"sigs.k8s.io/kueue/pkg/controller/core.(*WorkloadReconciler).Update\n\t/workspace/pkg/controller/core/workload_controller.go:588\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*EventHandler).OnUpdate\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/event_handler.go:113\nk8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate\n\t/workspace/vendor/k8s.io/client-go/tools/cache/controller.go:246\nk8s.io/client-go/tools/cache.(*processorListener).run.func1\n\t/workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:970\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/…

I see that the AppWrapper is using the specified queue-name while the RayCluster is using this one:

labels:
    controller-tools.k8s.io: '1.0'
    kueue.x-k8s.io/queue-name: workload.codeflare.dev.admitted

Probably could be what's causing the issue.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label May 23, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 3, 2024
@dgrove-oss
Copy link
Collaborator Author

@ChristianZaccaria -- could you add - "--zap-log-level=2" to your codeflare operator's command line? When running with info-level logs enabled, the appwrapper controller embedded in the codeflare operator will print a log message on each state transition. That log would provide a reason why the appwrapper transitions from running to suspending.

@dgrove-oss
Copy link
Collaborator Author

dgrove-oss commented Jun 3, 2024

Actually, you must have that already since you are showing INFO level logs.
The detailed information is kept in the AppWrapper's status.Conditions array. In particular, the message/reason fields of the conditions will tell you why the controller went into the Resetting or Suspending state (which should be the INFO log message right before the Suspended one).

My usual debugging trick for observing an appwrapper is to do a kubectl get appwrappers --watch -o yaml and then see what is in the conditions array as it is running.

@astefanutti
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 4, 2024
Copy link
Contributor

openshift-ci bot commented Jun 4, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti, Srihari1192

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 4, 2024
@dimakis dimakis dismissed ChristianZaccaria’s stale review June 5, 2024 09:38

requested changes are not needed

@dimakis dimakis merged commit c0f7d7f into project-codeflare:main Jun 5, 2024
4 checks passed
@dgrove-oss dgrove-oss deleted the mcad-to-aw branch June 5, 2024 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants