
Conversation

@iamemilio

Machines that fail to provision and cannot be deleted cause CAPO to get stuck in a very specific crash loop. We want to allow users to manually delete the machine when this happens.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 17, 2020
@openshift-ci-robot

@iamemilio: This pull request references Bugzilla bug 1856270, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1856270: allow users to manually delete machines stuck in crash loop

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 17, 2020

@elmiko elmiko left a comment


i understand what is happening in this change, but i am curious: what happens if a machine has actually been deleted, will we still try to remove the finalizer through this logic?

and if yes, is there any consequence to trying to update something that has been deleted?

@iamemilio
Author

I don't think we can hit this code block if the machine has been deleted, because before we get to this point we check whether the machine exists multiple times (I'm not sure why it's that redundant) and exit the function if it does not. This should only trigger if we check that the machine exists, and it does, and then the delete call returns a "resource not found" error. I think there must be a bug in OpenStack that causes this case to occur.
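(For context, the patch being discussed boils down to dropping the stuck Machine's finalizer so the pending delete can complete. A minimal sketch of that idea in Go; the finalizer name and the plain string slice are simplified stand-ins for the real Machine object's metadata:)

```go
package main

import "fmt"

// removeFinalizer returns the finalizer list with name dropped.
// This mirrors the effect of the patch on the Machine's metadata;
// the finalizer name used below is assumed for illustration.
func removeFinalizer(finalizers []string, name string) []string {
	out := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	finalizers := []string{"machine.machine.openshift.io"}
	finalizers = removeFinalizer(finalizers, "machine.machine.openshift.io")
	// With no finalizers left, the API server can complete the delete.
	fmt.Println(len(finalizers)) // 0
}
```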

@iamemilio
Author

That being said, I have been unable to create a runnable release image for almost a week, so I cannot really verify any of this.

@elmiko

elmiko commented Aug 17, 2020

ack, thanks for the explanation @iamemilio, i can see that this code won't be reached if the machine has already been deleted. it makes sense now.

/lgtm

@openshift-ci-robot

@elmiko: changing LGTM is restricted to collaborators

In response to this:

ack, thanks for the explanation @iamemilio, i can see that this code won't be reached if the machine has already been deleted. it makes sense now.

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, iamemilio

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@michaelgugino michaelgugino left a comment


/hold

This patch is not the right approach in a variety of ways.

If we remove the finalizer, there will be no indication that there was a problem; the machine will silently go away. There will be nothing left to tell the user to go manually delete the instance from the cloud.

404 on delete needs to be handled by the actuator. General practice is to check whether the instance exists in the actuator.Delete() call before attempting to delete the instance, and handle errors appropriately there.
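(A rough sketch of that practice in Go; deleteInstance and errNotFound are hypothetical stand-ins for the real cloud-client call and the provider's 404 error type:)

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the cloud provider's 404 error.
var errNotFound = errors.New("404 resource not found")

// deleteInstance is a hypothetical cloud call; it can 404 even after
// an existence check succeeded, which is the race described above.
func deleteInstance(cloud map[string]bool, id string) error {
	if !cloud[id] {
		return errNotFound
	}
	delete(cloud, id)
	return nil
}

// actuatorDelete treats "not found" as success: if the instance is
// already gone there is nothing left to clean up in the cloud, so
// the Machine can be finalized instead of crash-looping on delete.
func actuatorDelete(cloud map[string]bool, id string) error {
	if err := deleteInstance(cloud, id); err != nil && !errors.Is(err, errNotFound) {
		return err
	}
	return nil
}

func main() {
	cloud := map[string]bool{"node-0": true}
	fmt.Println(actuatorDelete(cloud, "node-0")) // <nil>
	fmt.Println(actuatorDelete(cloud, "ghost"))  // <nil>: 404 swallowed
}
```

The key design choice is that the "already gone" case is absorbed inside the actuator's Delete, where the error can be classified, rather than by stripping the finalizer from outside.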

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 17, 2020
@elmiko

elmiko commented Aug 17, 2020

solid advice, thanks Mike!

@mandre
Member

mandre commented Aug 18, 2020

/hold

This patch is not the right approach in a variety of ways.

Not to mention it's modifying the vendored MAO directly...

Thanks folks for the feedback and pointers to how GCP handles the same issue.

@elmiko

elmiko commented Aug 18, 2020

Not to mention it's modifying the vendored MAO directly...

ouch. can't believe i missed that =(

@iamemilio
Author

iamemilio commented Aug 18, 2020

Not to mention it's modifying the vendored MAO directly...

   ouch. can't believe i missed that =(

I can't believe I didn't notice that either haha

@iamemilio
Author

@michaelgugino the issue with this bug is that the machine will show up when we check that it exists, but the API returns "404 resource not found" when we try to delete it.

@iamemilio
Author

Considering your comment on the Bugzilla, this is not a viable solution. Closing.

@iamemilio iamemilio closed this Aug 18, 2020
@openshift-ci-robot

@iamemilio: This pull request references Bugzilla bug 1856270. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

Bug 1856270: allow users to manually delete machines stuck in crash loop

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pierreprinetti pushed a commit to shiftstack/cluster-api-provider-openstack that referenced this pull request Apr 22, 2024
Switch yaml package from ghodss repo to sigs.k8s.io fork kubernetes-sigs#572