
Commit 9f321c5

1 parent 26118fe

1 file changed: +24 -67

vcluster/configure/vcluster-yaml/control-plane/components/backing-store/etcd/embedded.mdx
@@ -157,96 +157,53 @@ kubectl logs [[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-
 
 <TabItem value="first-replica-failing" label="First replica is failing">
 
-<Flow>
-<Step title="Scale down vCluster to 0 replicas">
-Stop all vCluster instances:
-
-<InterpolatedCodeBlock
-  code={`kubectl scale statefulset [[VAR:VCLUSTER NAME:my-vcluster]] --replicas=0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
-  language="bash"
-/>
-
-<br />
-
-Confirm all pods have terminated:
+:::info
+vCluster automatically recovers from single replica failures, including replica-0, within 10 minutes. Manual intervention is only required if automatic recovery does not complete or if the majority of replicas have failed simultaneously.
+:::
 
-<InterpolatedCodeBlock
-  code={`kubectl get pods -l [[VAR:VCLUSTER LABEL:app=vcluster]] -n [[VAR:NAMESPACE:vcluster-my-team]]`}
-  language="bash"
-/>
-</Step>
+:::note
+This recovery procedure works for both `podManagementPolicy: Parallel` (default) and `podManagementPolicy: OrderedReady` (legacy) configurations.
+:::
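
For readers who want to confirm which policy their StatefulSet actually uses, it can be read straight from the spec. A minimal sketch, assuming the placeholder names used on this page (`my-vcluster`, `vcluster-my-team`):

```bash
# Print the podManagementPolicy of the vCluster StatefulSet.
# "Parallel" is the current default; "OrderedReady" indicates a legacy setup.
kubectl get statefulset my-vcluster -n vcluster-my-team \
  -o jsonpath='{.spec.podManagementPolicy}'
```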
 
-<Step title="Delete the corrupted PersistentVolumeClaim">
-Delete the corrupted PVC for the first replica:
+<Flow>
+<Step title="Wait for automatic recovery">
+Monitor the vCluster pods and wait for automatic recovery to complete:
 
-<InterpolatedCodeBlock
-  code={`kubectl delete pvc [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+<InterpolatedCodeBlock
+  code={`kubectl get pods -l [[VAR:VCLUSTER LABEL:app=vcluster]] -n [[VAR:NAMESPACE:vcluster-my-team]] -w`}
   language="bash"
 />
 
 <br />
 
-Verify the PVC has been deleted:
+Check the logs to monitor recovery progress:
 
-<InterpolatedCodeBlock
-  code={`kubectl get pvc -l [[VAR:VCLUSTER LABEL:app=vcluster]] -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+<InterpolatedCodeBlock
+  code={`kubectl logs -f [[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
   language="bash"
 />
-</Step>
-
-<Step title="Create new PVC from working replica">
-Create a new PVC by [copying from a working replica](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#volume-cloning):
-
-<InterpolatedCodeBlock
-  code={`apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-0
-  namespace: [[VAR:NAMESPACE:vcluster-my-team]]
-spec:
-  accessModes:
-    - ReadWriteOnce
-  resources:
-    requests:
-      storage: [[VAR:STORAGE SIZE:5Gi]]
-  dataSource:
-    name: [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-1
-    kind: PersistentVolumeClaim
-  storageClassName: [[VAR:STORAGE CLASS:gp2]]`}
-  language="yaml"
-  title="pvc-restore.yaml"
-/>
 
 <br />
 
-Apply the PVC:
-
-<InterpolatedCodeBlock
-  code={`kubectl apply -f pvc-restore.yaml`}
-  language="bash"
-/>
+Automatic recovery removes the failed member from the etcd cluster, deletes the pod, and adds it back. The pod restarts with a fresh PVC and rejoins the cluster automatically.
 </Step>
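
Since automatic recovery is expected to finish within 10 minutes, a bounded wait can stand in for watching the pod list by hand. A minimal sketch, again using the placeholder values from this page:

```bash
# Block until replica 0 reports Ready again, or give up once the
# 10-minute window for automatic recovery has elapsed.
kubectl wait pod/my-vcluster-0 -n vcluster-my-team \
  --for=condition=Ready --timeout=10m
```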
 
-<Step title="Scale up vCluster to verify recovery">
-Start with one replica to verify the restored data:
+<Step title="Manual recovery if automatic fails">
+If automatic recovery does not complete after 10 minutes, manually trigger recovery by deleting the failed pod and PVC:
 
-<InterpolatedCodeBlock
-  code={`kubectl scale statefulset [[VAR:VCLUSTER NAME:my-vcluster]] --replicas=1 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+<InterpolatedCodeBlock
+  code={`kubectl delete pod [[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]
+kubectl delete pvc [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
   language="bash"
 />
 
 <br />
 
-Monitor the startup:
+The pod restarts with a new empty PVC, and vCluster's automatic recovery rejoins it to the cluster.
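
One detail worth knowing about the delete order above: a PVC that is still mounted by a running pod is held in Terminating by the `kubernetes.io/pvc-protection` finalizer, which is usually why a PVC delete appears to hang. An alternative sketch that queues the PVC deletion first without waiting (same placeholder names):

```bash
# Queue the PVC deletion without blocking; the pvc-protection finalizer
# holds it in Terminating until the pod releases the volume.
kubectl delete pvc data-my-vcluster-0 -n vcluster-my-team --wait=false
# Deleting the pod lets the queued PVC deletion complete; the
# StatefulSet controller then recreates both the pod and its PVC.
kubectl delete pod my-vcluster-0 -n vcluster-my-team
```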
 
-<InterpolatedCodeBlock
-  code={`kubectl logs -f [[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
-  language="bash"
-/>
-
-<br />
-
-After it's stable, scale up to the desired number of replicas.
+:::warning
+Never clone PVCs from other replicas. Cloning PVCs causes etcd member ID conflicts and results in data loss.
+:::
 </Step>
 </Flow>
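
Whichever path recovery takes, it is cheap to confirm at the end that a fresh PVC was provisioned and every replica is Ready again. A final check, assuming the page's placeholder label, PVC prefix, and namespace:

```bash
# The recreated PVC for replica 0 should show a young AGE, and all
# replicas should be Running and Ready before resuming normal use.
kubectl get pvc -l app=vcluster -n vcluster-my-team
kubectl get pods -l app=vcluster -n vcluster-my-team
```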