Contributor

@sagor999 sagor999 commented Jul 19, 2022

Description

This PR fixes two issues:

  1. As described in [ws-manager] workspaces failed to start #10315 (comment), ws-manager sometimes incorrectly sent an empty failed state to the server, even though it might have sent a failure error message before that. This breaks opening that workspace again: the server incorrectly assumes that a backup was created (no failure error) and attempts to restore the second workspace from backup, which fails, completely breaking that workspace for the customer.
  2. We incorrectly marked a completed container as an error. When a container is stopped, it will be in a completed state. That is not an error.

These two fixes need to go hand in hand.
For more info, see the comment linked above, which describes the issue in greater detail.
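
To illustrate the second fix, here is a minimal sketch of the intended classification, not the actual ws-manager code. The type `terminatedState` and the function `failureMessage` are hypothetical stand-ins for the terminated-container fields (reason, message, exit code) that the real code reads from the pod status. The error message wording is also made up for this example.

```go
package main

import "fmt"

// terminatedState is a hypothetical, reduced stand-in for the terminated
// state of a container (reason, message and exit code only).
type terminatedState struct {
	Reason   string
	Message  string
	ExitCode int32
}

// failureMessage returns a non-empty message only when the terminated
// container should mark the workspace as failed.
func failureMessage(containerName string, ts terminatedState) string {
	switch {
	case ts.Reason == "Completed":
		// container terminated successfully - this is not a failure
		return ""
	case ts.Reason == "Error" || ts.ExitCode != 0:
		// assumption: any error reason or non-zero exit code counts as a failure
		return fmt.Sprintf("container %s ran with an error: exit code %d. Reason: %s",
			containerName, ts.ExitCode, ts.Message)
	default:
		return ""
	}
}
```

With this shape, `failureMessage("workspace", terminatedState{Reason: "Completed"})` yields an empty string, so a normally stopped container no longer produces a failed condition.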

Related Issue(s)

Fixes #10315

How to test

Open workspace in preview env
Close it
It should close normally.

Release Notes

none

Documentation

Werft options:

  • /werft with-preview

@werft-gitpod-dev-com

started the job as gitpod-build-pavel-10315-1.1 because the annotations in the pull request description changed
(with .werft/ from main)

@sagor999
Contributor Author

sagor999 commented Jul 19, 2022

/werft run with-clean-slate-deployment=true

👍 started the job as gitpod-build-pavel-10315-1.2
(with .werft/ from main)

@sagor999 sagor999 changed the title from "wip" to "[ws-manager] fix incorrect handling of failure state for workspaces" Jul 20, 2022
@sagor999 sagor999 marked this pull request as ready for review July 20, 2022 20:32
@sagor999 sagor999 requested a review from a team July 20, 2022 20:32
@github-actions github-actions bot added the "team: workspace" (Issue belongs to the Workspace team) label Jul 20, 2022
@sagor999
Contributor Author

/hold
I would like to get input from @csweichel and @aledbf on this change, to verify that it will not break assumptions elsewhere in Gitpod.

  return "", nil
  }
- return fmt.Sprintf("container %s completed; containers of a workspace pod are not supposed to do that. Reason: %s", cs.Name, terminationState.Message), nil
+ // container terminated successfully - this is not a failure

The original intent with this branch was to cover the case where a regular workspace stops without being deleted. An unintentional stop, so to speak. That said, the failure mode is actually nicer if your workspace just stops, compared to getting an unactionable error message ("containers of a workspace pod are not supposed to do that" 🤷).

The failed condition on workspace status indeed isn't stable (but should be). ws-manager-bridge should protect us against that by not resetting that condition if it ever was set. Because we have nothing like an at-least-once delivery semantic, we might just miss the "failed" message though.
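
The "never reset once set" protection described above can be sketched as follows. This is only an illustration under assumptions: `mergeFailed` is a hypothetical helper, not the ws-manager-bridge implementation, and it treats the failed condition as a plain string where empty means "no failure".

```go
package main

// mergeFailed combines the failed condition already stored for a workspace
// instance with the one carried by an incoming status update.
// Once a non-empty failure has been stored, later updates cannot clear it.
func mergeFailed(stored, incoming string) string {
	if stored != "" {
		// a failure was recorded earlier: keep it, even if the update is empty
		return stored
	}
	return incoming
}
```

Note that this only helps if the failing update arrives at least once. Without at-least-once delivery, a missed "failed" message leaves `stored` empty and the sketch has nothing to preserve.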

Contributor

@csweichel csweichel left a comment

I think this will do more good than harm.
We could consider adding a metric to count unintentional pod termination - either in this PR (hence the hold), or in a separate one, or not at all :)

/hold

@sagor999
Contributor Author

Added metric to track unintentional stops
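
As a rough sketch of what counting unintentional stops could look like (not the metric actually added in this PR): the counter below uses the standard library instead of a real metrics client, and the names `unintentionalStops`, `recordStop` and `stopCount` as well as the `podBeingDeleted` flag are hypothetical.

```go
package main

import "sync/atomic"

// unintentionalStops is a hypothetical stand-in for a metrics counter.
var unintentionalStops int64

// recordStop increments the counter when a container completed although
// the pod was not being deleted, i.e. nobody asked the workspace to stop.
func recordStop(reason string, podBeingDeleted bool) {
	if reason == "Completed" && !podBeingDeleted {
		atomic.AddInt64(&unintentionalStops, 1)
	}
}

// stopCount returns the current counter value.
func stopCount() int64 {
	return atomic.LoadInt64(&unintentionalStops)
}
```

The completed container is still not reported as a failure; the counter only makes such stops visible for monitoring.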

@sagor999
Contributor Author

/unhold

@roboquat roboquat merged commit 3f92a73 into main Jul 27, 2022
@roboquat roboquat deleted the pavel/10315-1 branch July 27, 2022 19:14
@roboquat roboquat added the "deployed: workspace" (Workspace team change is running in production) and "deployed" (Change is completely running in production) labels Aug 3, 2022
Labels: deployed: workspace, deployed, release-note-none, size/M, team: workspace