
Conversation

@iamemilio

Machines that fail to provision and cannot be deleted cause CAPO to get stuck in a very specific crash loop. We want to allow users to manually delete the machine when this happens.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 17, 2020
@openshift-ci-robot

@iamemilio: This pull request references Bugzilla bug 1856270, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1856270: allow users to manually delete machines stuck in crash loop

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 17, 2020

@elmiko elmiko left a comment


i understand what is happening in this change, but i am curious: what happens if a machine has actually been deleted, will we still try to remove the finalizer through this logic?

and if yes, is there any consequence to trying to update something that has been deleted?

@iamemilio
Author

I don't think we can hit this code block if the machine has been deleted, because before we get to this point we check whether the machine exists multiple times (I'm not sure why it's that redundant) and exit the function if it does not. This should only trigger if we check that the machine exists, and it does, and then the delete call returns a "resource not found" error. I think there must be a bug in OpenStack that causes this case to occur.
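(For context, the patch being discussed boils down to dropping the stuck Machine's finalizer so the pending delete can complete. A minimal sketch of that idea in Go; the finalizer name and the plain string slice are simplified stand-ins for the real Machine object's metadata:)

```go
package main

import "fmt"

// removeFinalizer returns the finalizer list with name dropped.
// This mirrors the effect of the patch on the Machine's metadata;
// the finalizer name used below is assumed for illustration.
func removeFinalizer(finalizers []string, name string) []string {
	out := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	finalizers := []string{"machine.machine.openshift.io"}
	finalizers = removeFinalizer(finalizers, "machine.machine.openshift.io")
	// With no finalizers left, the API server can complete the delete.
	fmt.Println(len(finalizers)) // 0
}
```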

@iamemilio
Author

That being said, I have been unable to create a runnable release image for almost a week, so I cannot really verify any of this.

@elmiko

elmiko commented Aug 17, 2020

ack, thanks for the explanation @iamemilio, i can see that this code won't be reached if the machine has already been deleted. it makes sense now.

/lgtm

@openshift-ci-robot

@elmiko: changing LGTM is restricted to collaborators

In response to this:

ack, thanks for the explanation @iamemilio, i can see that this code won't be reached if the machine has already been deleted. it makes sense now.

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, iamemilio

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@michaelgugino michaelgugino left a comment


/hold

This patch is not the right approach in a variety of ways.

If we remove the finalizer, there will be no indication that there was a problem; the machine will silently go away. There will be nothing left to tell the user to go manually delete the instance from the cloud.

404 on delete needs to be handled by the actuator. General practice is to check whether the instance exists in the actuator.Delete() call before attempting to delete the instance, and handle errors appropriately there.
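(A rough sketch of that practice in Go; deleteInstance and errNotFound are hypothetical stand-ins for the real cloud-client call and the provider's 404 error type:)

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the cloud provider's 404 error.
var errNotFound = errors.New("404 resource not found")

// deleteInstance is a hypothetical cloud call; it can 404 even after
// an existence check succeeded, which is the race described above.
func deleteInstance(cloud map[string]bool, id string) error {
	if !cloud[id] {
		return errNotFound
	}
	delete(cloud, id)
	return nil
}

// actuatorDelete treats "not found" as success: if the instance is
// already gone there is nothing left to clean up in the cloud, so
// the Machine can be finalized instead of crash-looping on delete.
func actuatorDelete(cloud map[string]bool, id string) error {
	if err := deleteInstance(cloud, id); err != nil && !errors.Is(err, errNotFound) {
		return err
	}
	return nil
}

func main() {
	cloud := map[string]bool{"node-0": true}
	fmt.Println(actuatorDelete(cloud, "node-0")) // <nil>
	fmt.Println(actuatorDelete(cloud, "ghost"))  // <nil>: 404 swallowed
}
```

The key design choice is that the "already gone" case is absorbed inside the actuator's Delete, where the error can be classified, rather than by stripping the finalizer from outside.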

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 17, 2020
@elmiko

elmiko commented Aug 17, 2020

solid advice, thanks Mike!

@mandre
Member

mandre commented Aug 18, 2020

/hold

This patch is not the right approach in a variety of ways.

Not to mention it's modifying the vendored MAO directly...

Thanks folks for the feedback and pointers to how GCP handles the same issue.

@elmiko

elmiko commented Aug 18, 2020

Not to mention it's modifying the vendored MAO directly...

ouch. can't believe i missed that =(

@iamemilio
Author

iamemilio commented Aug 18, 2020

Not to mention it's modifying the vendored MAO directly...

   ouch. can't believe i missed that =(

I can't believe I didn't notice that either haha

@iamemilio
Author

@michaelgugino the issue with this bug is that the machine will show up when we check that it exists, but the API returns "404 resource not found" when we try to delete it.

@iamemilio
Author

Considering your comment on the Bugzilla, this is not a viable solution. Closing.

@iamemilio iamemilio closed this Aug 18, 2020
@openshift-ci-robot

@iamemilio: This pull request references Bugzilla bug 1856270. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

Bug 1856270: allow users to manually delete machines stuck in crash loop

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pierreprinetti pushed a commit to shiftstack/cluster-api-provider-openstack that referenced this pull request Apr 22, 2024
Switch yaml package from ghodss repo to sigs.k8s.io fork kubernetes-sigs#572