Skip to content

Allow restarting workspaces with forceDefaultImage=true even when another instance is already running #3993

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

jankeromnes
Copy link
Contributor

Fixes #3777

How to test:

@jankeromnes
Copy link
Contributor Author

jankeromnes commented Apr 19, 2021

@csweichel Could you please take a careful look? (I really don't want to introduce new bugs in the workspace start logic.)

Note: This change seems to work well / as expected. However, I get no Docker build logs in core-dev -- not sure why.

@jankeromnes jankeromnes requested a review from csweichel April 19, 2021 08:43
@jankeromnes jankeromnes marked this pull request as ready for review April 19, 2021 08:53
@jankeromnes
Copy link
Contributor Author

jankeromnes commented Apr 21, 2021

/werft run

👍 started the job as gitpod-build-jx-restart-w-default-image.1

@jankeromnes
Copy link
Contributor Author

jankeromnes commented Apr 21, 2021

/werft run

👍 started the job as gitpod-build-jx-restart-w-default-image.2

Copy link
Contributor Author

@jankeromnes jankeromnes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding review comments from a call with @csweichel here for the record (many thanks Chris!)

Comment on lines 438 to 445
// We already have a running workspace. This may happen if we're forcing the default image.
// In that case, we stop the previous first.
await this.internalStopWorkspace({ span }, workspaceId, workspace.ownerId).catch(err => {
log.error(logCtx, "stopWorkspace error: ", err);
});
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call only blocks until the stop request is sent.

In order to completely prevent multiple instances from running at the same time, we should probably also listen for instance updates here (e.g. via messageBusIntegration.listenForWorkspaceInstanceUpdates()) and only move on when the previous instance is definitely stopped.

There is also a case where the previous instance was in a different region, in which case to be fully pedantic we should also poll the DB, but it's probably okay to not do that here.

@jankeromnes jankeromnes removed the request for review from csweichel April 21, 2021 14:11
@jankeromnes jankeromnes force-pushed the jx/restart-w-default-image branch from f477ac3 to 27056fe Compare April 23, 2021 08:15
@jankeromnes
Copy link
Contributor Author

jankeromnes commented Apr 23, 2021

Code review feedback from @csweichel implemented ✅ and still works as expected. 🎉

Two questions:

  1. Shouldn't this access guard fail:

// no matter if the workspace is shared or not, you cannot create a new instance
await this.guardAccess({ kind: "workspaceInstance", subject: undefined, workspaceOwnerID: workspace.ownerId, workspaceIsShared: false }, "create");

given that the previously running instance is only stopped afterwards?

if (runningInstance) {
// We already had a running workspace. This may happen if we're forcing the default image.
// In that case, we first stop the previous workspace, and wait for it to be completely gone.
await this.internalStopWorkspaceAndWaitForInstance({ span }, workspaceId, workspace.ownerId).catch(err => {
log.error(logCtx, "internalStopWorkspaceAndWaitForInstance error: ", err);
});
}

  1. Also, an observed error logged by server:
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","serviceContext":{"service":"server","version":"jx-restart-w-default-image.3"},"stack_trace":"Error: Unknown workspace manager \"\"\n    at WorkspaceManagerClientProvider.<anonymous> (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:123:51)\n    at step (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:58:23)\n    at Object.next (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:39:53)\n    at fulfilled (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:30:58)\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (internal/process/task_queues.js:97:5)","component":"server","severity":"ERROR","time":"2021-04-23T08:46:17.766Z","environment":"devstaging","region":"europe-west1","context":{"userId":"7792a3aa-7416-4dee-a899-e132467125d4","workspaceId":"black-goldfish-rtllylsu"},"message":"internalStopWorkspaceAndWaitForInstance error: ","error":"Error: Unknown workspace manager \"\"\n    at WorkspaceManagerClientProvider.<anonymous> (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:123:51)\n    at step (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:58:23)\n    at Object.next (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:39:53)\n    at fulfilled (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:30:58)\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (internal/process/task_queues.js:97:5)"}

Prettified error stack:

Error: Unknown workspace manager ""
    at WorkspaceManagerClientProvider.<anonymous> (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:123:51)
    at step (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:58:23)
    at Object.next (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:39:53)
    at fulfilled (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:30:58)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)

Unsure if related or serious. (Could it be that's because we have no GCP region information in core-dev?)

@jankeromnes jankeromnes requested review from aledbf and csweichel April 23, 2021 10:22
@csweichel
Copy link
Contributor

csweichel commented Apr 26, 2021

Code review feedback from @csweichel implemented ✅ and still works as expected. 🎉

Two questions:

  1. Shouldn't this access guard fail:

// no matter if the workspace is shared or not, you cannot create a new instance
await this.guardAccess({ kind: "workspaceInstance", subject: undefined, workspaceOwnerID: workspace.ownerId, workspaceIsShared: false }, "create");

No. It only checks if the user has the permission to execute that operation, but not if the operation makes sense.

given that the previously running instance is only stopped afterwards?

if (runningInstance) {
// We already had a running workspace. This may happen if we're forcing the default image.
// In that case, we first stop the previous workspace, and wait for it to be completely gone.
await this.internalStopWorkspaceAndWaitForInstance({ span }, workspaceId, workspace.ownerId).catch(err => {
log.error(logCtx, "internalStopWorkspaceAndWaitForInstance error: ", err);
});
}

  1. Also, an observed error logged by server:

Prettified error stack:

Error: Unknown workspace manager ""
    at WorkspaceManagerClientProvider.<anonymous> (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:123:51)
    at step (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:58:23)
    at Object.next (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:39:53)
    at fulfilled (/app/node_modules/@gitpod/ws-manager/lib/client-provider.js:30:58)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)

Unsure if related or serious. (Could it be that's because we have no GCP region information in core-dev?)

The problem here is that the instance wasn't actually started yet, i.e. it didn't go through actuallyStartWorkspace yet. In internalStopWorkspaceAndWaitForInstance you could check if the region is set, and if not instead of calling ws-manager just wait for some time - you would not know which workspace manager to talk to after all. There's a chance for a race condition here (some other server instance could be starting the instance currently), so in lieu of proper cross-server-instance-locking waiting for say 10 seconds (please add a comment) should prevent that race.

@jankeromnes jankeromnes force-pushed the jx/restart-w-default-image branch from 27056fe to de1eff9 Compare April 26, 2021 19:22
@jankeromnes
Copy link
Contributor Author

jankeromnes commented Apr 26, 2021

No. It only checks if the user has the permission to execute that operation, but not if the operation makes sense.

Aha, that makes sense. Thanks!

The problem here is that the instance wasn't actually started yet, i.e. it didn't go through actuallyStartWorkspace yet.

Thanks for the pointer! Looking into actuallyStartWorkspace, I now think it's expected that instances don't get assigned a region until after their Docker build is complete, as:

// build workspace image
instance = await this.buildWorkspaceImage({ span }, user, workspace, instance);

happens before:

// tell the world we're starting this instance
const { manager, installation } = await this.clientProvider.getStartManager();
instance.status.phase = "pending";
instance.region = installation;

So in our case (long-running Docker build that we want to interrupt/skip), we should expect not to have a region.

How should we stop "preparing" instances that don't have a region yet? Should we make imageBuilder interruptible? Or "detach" the running instance by deleting its workspace ID? 🤔

@jankeromnes
Copy link
Contributor Author

After discussing this further, we've decided to not make Docker builds interruptible just yet, and instead remove the button when builds are in progress (to align with the previous dashboard design).

Superseded by #4104

@jankeromnes jankeromnes deleted the jx/restart-w-default-image branch May 19, 2021 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"Continue with Default Image" button doesn't work when build is still in progress
2 participants