Bug 1856270: allow users to manually delete machines stuck in crash loop #111
Conversation
@iamemilio: This pull request references Bugzilla bug 1856270, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
elmiko
left a comment
i understand what is happening in this change, but i am curious what happens if a machine has actually been deleted: will we still try to delete the finalizer through this logic?
and if yes, is there any consequence to trying to update something that has been deleted?
I don't think we can hit this code block if the machine has been deleted, because before we get to this point we check if the machine exists multiple times (I'm not sure why it's that redundant) and exit the function if it does not. This should only trigger if we check that the machine exists, and it does, then try to delete it and get a "resource not found" error. I think there must be a bug in OpenStack that causes this case to occur.
That being said, I have been unable to create a runnable release image for almost a week, and so I cannot really verify any of this.
ack, thanks for the explanation @iamemilio, i can see that this code won't be reached unless the machine is not deleted. it makes sense now.
/lgtm
@elmiko: changing LGTM is restricted to collaborators In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: elmiko, iamemilio The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
michaelgugino
left a comment
/hold
This patch is not the right approach in a variety of ways.
If we remove the finalizer, there will be no indication that there was a problem, the machine will silently go away. There will be nothing left to tell the user to go manually delete an instance from the cloud.
A 404 on delete needs to be handled by the actuator. General practice is to check whether the instance exists in the actuator.Delete() call before attempting to delete the instance, and to handle errors appropriately there.
For additional reference, take a look at how we handled this in the GCP provider:
solid advice, thanks Mike!
Not to mention it's modifying the vendored MAO directly... Thanks folks for the feedback and pointers to how GCP handles the same issue.
ouch. can't believe i missed that =(
I can't believe I didn't notice that either, haha
@michaelgugino the issue with this bug is that the machine will show up when we check that it exists, but it will return "404 resource not found" when we try to delete it. |
Considering your comment on the bugzilla, this is not a viable solution. Closing |
@iamemilio: This pull request references Bugzilla bug 1856270. The bug has been updated to no longer refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Machines that fail to provision and cannot be deleted cause CAPO to get stuck in a very specific crash loop. We want to allow users to manually delete the machine when this happens.