
[ws-manager] gracefully shuts down workspace, leaving behind bound PVCs, avoiding backup of user data, after unknown event #14266

@kylos101

Bug description

We don't experience data loss per se: the pod stops gracefully and the data is still in the PV. However, no backup is taken, so when the user starts the workspace again they would not get their data back, even though we still have it in the PV.

I tried deleting us72, but could not because there were two dangling PVCs:

gitpod /workspace/gitpod (main) $ kubectl get pvc
NAME                                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS             AGE
ws-ccd64f44-d3b6-49eb-9d1e-9275406745ef   Bound    pvc-35a16057-d21c-44e2-9a76-a36f44fb1866   30Gi       RWO            csi-gce-pd-g1-standard   47h
ws-eb6cb985-86f3-435b-9def-d820d2b9060a   Bound    pvc-50708c34-a721-4f3d-855e-f74c94e2c034   30Gi       RWO            csi-gce-pd-g1-standard   45h
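
For reference, item 1 under "Expected behavior" mentions manually snapshotting PVCs before deleting a workspace cluster. Below is a minimal sketch of that workaround for the first PVC; the VolumeSnapshot name and the VolumeSnapshotClass name are placeholders, not necessarily what us72 uses.

kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ws-ccd64f44-manual-backup                        # placeholder name
spec:
  volumeSnapshotClassName: csi-gce-pd-snapshot-class     # placeholder; use the cluster's class
  source:
    persistentVolumeClaimName: ws-ccd64f44-d3b6-49eb-9d1e-9275406745ef
EOF

# wait for readyToUse=true before deleting the cluster
kubectl get volumesnapshot ws-ccd64f44-manual-backup -o jsonpath='{.status.readyToUse}'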

For the first PVC, given the workspace logs and this workspace trace:

  1. StartWorkspace is logged for the workspace
  2. The pod cannot be scheduled (waiting for scale-up; see the sketch after this list)
  3. StartWorkspace is logged again after 7 minutes (the pod is still not scheduled to a node)
    1. Which lines up with us seeing the startWorkspace error at 7m in the traces
    2. We poll for seven minutes to see if the pending pod should be recreated and startWorkspace called again
    3. We force delete the original pod and try starting again using the original context 🤯
  4. The csi provisioner is started for this workspace
  5. Ring 0 stops; we must have landed on node workspace-ws-us72-standard-pvc-pool-2dvw
  6. The workspace cannot connect to ws-daemon
  7. The workspace fails to start; the volume snapshot is empty
  8. Workspace fails to start is logged again
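
To confirm steps 2-3 (the ~7 minutes spent unscheduled), the pod's events should show the scheduling and scale-up history. A sketch, assuming the pod shares the ws-<instance id> name with its PVC and the events have not yet expired:

kubectl get events \
  --field-selector involvedObject.name=ws-ccd64f44-d3b6-49eb-9d1e-9275406745ef \
  --sort-by=.lastTimestamp \
  | grep -Ei 'FailedScheduling|TriggeredScaleUp|Scheduled'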

Steps to reproduce

This could be because of:

  1. Node scale-up that is too slow (given the logs, perhaps).
  2. We restarted ws-manager "during a key moment" and the snapshot did not proceed. We restarted ws-manager a couple of times this week (a quick check is sketched after this list).
  3. The 1h timeout, which is a byproduct of when we persisted workspace content on node storage and backed it up to GCS.
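
For possibility 2, a quick way to correlate ws-manager restarts with the workspace's stop time; the component=ws-manager label selector is an assumption about how the deployment is labelled:

kubectl get pods -l component=ws-manager \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.startTime}{"\n"}{end}'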

So, either:

  1. Create an ephemeral cluster
  2. Start many workspaces with loadgen to force node scale-up; if scale-up takes >7 minutes, you'll hit the code path that was involved for these two workspaces
  3. Stop the workspaces
  4. Check whether they backed up or left behind PVCs (see the sketch after this list).
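
A minimal check for step 4, assuming the ephemeral cluster only runs the loadgen workspaces: any PVCs still Bound after the workspaces have stopped indicate a missed backup, while a ready VolumeSnapshot per workspace indicates the backup happened.

kubectl get pvc
kubectl get volumesnapshot -o custom-columns=NAME:.metadata.name,READY:.status.readyToUse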

or maybe

Stop a bunch of workspaces, and while they're stopping (before, during, and after the snapshot), stop ws-manager.
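
A sketch of that variant, assuming ws-manager runs as a deployment named ws-manager in the current namespace:

# while the workspaces are stopping (before, during, and after the snapshot):
kubectl rollout restart deployment/ws-manager
kubectl rollout status deployment/ws-manager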

or maybe

Let a stopping workspace run into the 1h backup timeout.

Workspace affected

gitpodio-templatetypesc-qxnleu3pzu4

Expected behavior

There are a few things:

  1. We should try to back up for longer than 1h, so that we do not have to manually snapshot PVCs before we delete workspace clusters.
  2. We should have a metric tracking how long a PVC has been bound without a pod, and trigger an alert when one or more exist for too long (a manual version of this check is sketched below).
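
As a sketch of what that check would capture, a manual version that lists Bound PVCs no pod currently mounts; the real metric would presumably be implemented in ws-manager or kube-state-metrics rather than a shell loop:

# PVC names mounted by any pod in the current namespace
mounted=$(kubectl get pods -o json | jq -r '.items[].spec.volumes[]? | .persistentVolumeClaim.claimName // empty')
# Bound PVCs that are not in that list
for pvc in $(kubectl get pvc -o jsonpath='{.items[?(@.status.phase=="Bound")].metadata.name}'); do
  echo "$mounted" | grep -qx "$pvc" || echo "bound PVC with no pod: $pvc"
done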

Questions:

  1. The affected workspace was gracefully Stopped (not Failed or Stopping), which means the user could have restarted their workspace without their files being restored. That would be very confusing, because their uncommitted files would not come back. Is this expected?

Example repository

No response

Anything else?

We currently stop trying to back up after a 1h timeout. This was a design decision for object-storage-based backups, and should be revisited as part of the PVC work.
