Closed as not planned
Description
Summary
Better protect user data
Context
Sometimes a workspace, node, or workspace cluster fails and the user data cannot be backed up to cloud storage, resulting in data loss. There is a related incident for a global outage, and a related RFC where we are discussing solutions.
Value
If we handle user data better, users can trust that even if the Gitpod service is unavailable, they will not lose data once it is back online.
Acceptance criteria
User data is persisted in such a way that even if there is a workspace, node, or cluster failure, the data is accessible to be backed up at a later time.
Tasks
Ops:
- Workspace Preview (should be done first)
- automate GCP service account creation for CSI driver to function
- automate deployment of GCP CSI driver into cluster as part of cluster creation operation
- automate deployment of GCP storageClasses as part of cluster creation operation (specify discard mount option)
- automate deployment of snapshotter CRD and controller as part of cluster creation operation, validate that the snapshotter takes snapshots and that we can create a PVC from a snapshot
- gitpod-io/workspace-preview#58
- Preview Environment
- Jobs for workspace-clusters
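The storage pieces in the Ops tasks above can be sketched as Kubernetes manifests. The following is a minimal sketch, not the actual cluster creation code: it builds a StorageClass with the discard mount option and a VolumeSnapshotClass as Python dictionaries and prints them as a JSON list that could be passed to kubectl apply. The object names (gitpod-workspace-pd, gitpod-workspace-snapshots), the pd-balanced disk type, and the binding, reclaim, and deletion policies are assumptions for illustration and are not taken from this epic.

```python
import json

# Provisioner name registered by the GCP Persistent Disk CSI driver.
GCP_PD_CSI_DRIVER = "pd.csi.storage.gke.io"


def storage_class(name, disk_type="pd-balanced"):
    """Build a StorageClass manifest whose volumes are mounted with `discard`."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": GCP_PD_CSI_DRIVER,
        "parameters": {"type": disk_type},  # assumed disk type
        "mountOptions": ["discard"],
        # Assumed policies: bind the disk once a pod is scheduled,
        # and delete the disk together with its PVC.
        "volumeBindingMode": "WaitForFirstConsumer",
        "reclaimPolicy": "Delete",
    }


def volume_snapshot_class(name):
    """Build a VolumeSnapshotClass manifest served by the same CSI driver."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshotClass",
        "metadata": {"name": name},
        "driver": GCP_PD_CSI_DRIVER,
        "deletionPolicy": "Delete",  # assumed; "Retain" is the other valid value
    }


def render(manifests):
    """Wrap manifests in a v1 List and serialize it as JSON."""
    return json.dumps(
        {"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2
    )


if __name__ == "__main__":
    # Hypothetical object names, for illustration only.
    print(render([
        storage_class("gitpod-workspace-pd"),
        volume_snapshot_class("gitpod-workspace-snapshots"),
    ]))
```

Both objects point at the same driver, which is what allows a snapshot taken from a PVC of this StorageClass to be restored into a new PVC later.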
Design:
- List impacted components and visualize flows in the RFC
- Double check the estimate for cost impacts
- Compare with other DD, to be consistent / fill in gaps
- Investigate if possible to improve PVC mount time #9054
Product changes:
- Functionality:
- add a feature flag (set at user level) into workspace creation to support PVC\Snapshot pathway #9117
- ws-manager: create PVC before creating workspace pod and specify it in pod spec mounted at /workspace #9142
- add support for restoring workspace from backup that was stored in GCS when using PVC #9442
- ws-manager: add support for snapshotVolume creation and PVC restore from it #9469
- webapp: ensure that we can store snapshotVolumeID and relevant snapshot volume data in DB schema #9984
- ws-manager: add support for deleting snapshotVolumes (to clean up old backups) on workspace delete #10259
- ws-manager: ensure it can reconcile orphaned PVCs without VolumeSnapshot is created #10531
- We did not backup content when the node goes to NotReady + Pod goes to Terminating #11336
- Switch gitpod team (eating your own dog food) to using PVC - regular workspaces only #10886
- Update GCP PD CSI driver from v1.4.0 to v1.7.1 to include faster PVC mount time fix #10210
- Find a way to display workspace ID in GCP Snapshots page #10186
- Have labels on GCP Disks and Snapshots #10612
- Prebuilds: add support for using PVC\VolumeSnapshots to prebuilds. #10260
- Use ConfigCat to throttle adoption of PVC #12745
- Switch select customer teams to using PVC that would benefit from this and are willing to help test
- Backups: allow user to download workspace backup from the volume snapshot #13930
- PVC: Deprecate the download feature #14364
- Switch everyone to using PVC
- ws-manager: volume snapshot metric is not accurate for stops #10334
- ws-manager: Add volume snapshot related events to workspace pod event log #10887
- Remove the PVC object if the workspace pod is never been ready #11635
- [PVC] restart the ws-manager, the ws-manager sometimes panic because the concurrent map read and map write #11786
- [PVC] trigger prebuild affected all the users without PVC feature flag enabled #11769
- [PVC] take prebuild, and write data, all subsequent data won't be recovered #11770
- [PVC] the files and folders under .git/ permission is incorrect #12420
- [PVC] Prebuild+PVC, user's account without PVC, open from prebuild, still use PVC #12463
- [PVC] Prebuild without PVC, user A account with PVC and trigger rebuild, user B account without PVC #12718
- [PVC] Open a fresh new workspace, it uses large node pool. Relaunch the workspace, it uses standard node pool #12666
- [PVC] Can't open from prebuilds if prebuilds with large workspace class but user prefers standard workspace class #12494
- [PVC] Open workspace from prebuild PVC, no message This task ran as a workspace prebuild on CLI #12464
- [PVC] Propagate workspace pod labels to PVC and VolumeSnapshot #12507
- [PVC] ws-manager event workers hang forever once over 100 VolumeSnapshots and ws-manager restart #13007
- [PVC] The workspace with PVC keeps in terminating state #13280
- [PVC] orphan PVC left if the ws-manager unable to start workspace pod #13282
- [PVC] loadgen testing Pod can't mount Volume #13353
- [PVC] unable to open the workspace with download initializer #13531
- [PVC] massive workspace stopping, two workspaces report cannot find workspace from ws-daemon #13856
- [PVC] the prebuild workspace is unable to be up and running #13980
- [pvc] /workspace directory is owned by nobody #14003
- Can not create a snapshot when using PVC for workspaces #14159
- Observability:
- Installer/KOTS
- installer: allow to specify storageClass in gitpod.yaml #10613 (Moves to Epic: PVC (Persistent Volume Claims) on Self-Hosted #11476)
- installer: add a test to ensure that CSI is working as expected prior to installing Gitpod #10614 (Moves to Epic: PVC (Persistent Volume Claims) on Self-Hosted #11476)
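The PVC and VolumeSnapshot pathway described in the ws-manager tasks above (create a PVC before the workspace pod, mount it at /workspace, snapshot it, restore a new PVC from the snapshot) can be sketched with plain Kubernetes manifests. This is a minimal sketch under stated assumptions, not the ws-manager implementation: the naming scheme (ws-<id>), the workspaceID label key, the container and volume names, and the default 30Gi size are all hypothetical. Only the Kubernetes field names are standard.

```python
def workspace_pvc(workspace_id, storage_class, size="30Gi", snapshot_name=None):
    """Build a PVC for a workspace, optionally restored from a VolumeSnapshot."""
    spec = {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": storage_class,
        "resources": {"requests": {"storage": size}},
    }
    if snapshot_name is not None:
        # Restoring: the new volume is pre-populated from the snapshot.
        spec["dataSource"] = {
            "apiGroup": "snapshot.storage.k8s.io",
            "kind": "VolumeSnapshot",
            "name": snapshot_name,
        }
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {
            "name": f"ws-{workspace_id}",  # hypothetical naming scheme
            "labels": {"workspaceID": workspace_id},  # hypothetical label key
        },
        "spec": spec,
    }


def workspace_pod(workspace_id, image, pvc_name):
    """Build a pod that mounts the workspace PVC at /workspace."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"ws-{workspace_id}",
            "labels": {"workspaceID": workspace_id},
        },
        "spec": {
            "containers": [{
                "name": "workspace",
                "image": image,
                "volumeMounts": [
                    {"name": "workspace-data", "mountPath": "/workspace"},
                ],
            }],
            "volumes": [{
                "name": "workspace-data",
                "persistentVolumeClaim": {"claimName": pvc_name},
            }],
        },
    }


def volume_snapshot(workspace_id, pvc_name, snapshot_class):
    """Build a VolumeSnapshot that backs up the workspace PVC."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {
            "name": f"ws-{workspace_id}",
            "labels": {"workspaceID": workspace_id},
        },
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }
```

A workspace start would create the PVC first and then the pod that references it; a workspace stop would create the VolumeSnapshot and record its name, so that the next start can pass snapshot_name to workspace_pvc. Carrying the same labels on the pod, the PVC, and the VolumeSnapshot is one way to make orphaned objects easy to find.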
Tests:
- Add tests for createPVCForWorkspacePod #10162
- Fix broken workspace integration test TestMissingBackup(PVC) #9990
- [integration tests] add regular workspace PVC integration test from PVC test plans #12497
- [integration tests] add prebuilds workspace PVC integration test from PVC test plans #12638
- Fixes to support PVC on loadgen and integration test #12560
- Performance test PVC with a single node saturated with workspaces #12744
- Performance test PVC with a cluster saturated with workspaces #12747
- [PVC][integration test] restart control plane components when/after volume snapshot in-progess/done #13146
- Manually test PVC CSI snapshot backup/restore in AWS #10211 (Moves to Epic: PVC (Persistent Volume Claims) on Self-Hosted #11476)
- Manually test PVC CSI snapshot backup/restore in Azure #10212 (Moves to Epic: PVC (Persistent Volume Claims) on Self-Hosted #11476)
- [PVC] integration test for testing .gitpod.yml using incorrect repo #13591
- https://github.com/gitpod-io/ops/issues/6270
Bug
Should solve:
- Prebuild Loading Screen doesn't auto progress when done #7311
- Opening old workspaces stuck on pulling container image #8198
Day 2:
- [PVC] investigate if possible to remove chown #12892
- [ws-manager] cannot restart stopped workspace - "no backup found" is hidden #14451
- [PVC] massive workspace stopping, two workspaces report cannot find workspace from ws-daemon #13856
- [PVC][chore] remove contentDescriptorToLayer in favor of contentDescriptorToLayerPVC #9496