
Commit 9f321c5

1 parent 26118fe

1 file changed: +24 -67

vcluster/configure/vcluster-yaml/control-plane/components/backing-store/etcd/embedded.mdx
@@ -157,96 +157,53 @@ kubectl logs [[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-
 
 <TabItem value="first-replica-failing" label="First replica is failing">
 
-<Flow>
-<Step title="Scale down vCluster to 0 replicas">
-Stop all vCluster instances:
-
-<InterpolatedCodeBlock
-  code={`kubectl scale statefulset [[VAR:VCLUSTER NAME:my-vcluster]] --replicas=0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
-  language="bash"
-/>
-
-<br />
-
-Confirm all pods have terminated:
+:::info
+vCluster automatically recovers from single replica failures, including replica-0, within 10 minutes. Manual intervention is only required if automatic recovery does not complete or if the majority of replicas have failed simultaneously.
+:::
 
-<InterpolatedCodeBlock
-  code={`kubectl get pods -l [[VAR:VCLUSTER LABEL:app=vcluster]] -n [[VAR:NAMESPACE:vcluster-my-team]]`}
-  language="bash"
-/>
-</Step>
+:::note
+This recovery procedure works for both `podManagementPolicy: Parallel` (default) and `podManagementPolicy: OrderedReady` (legacy) configurations.
+:::
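
For readers who want to confirm which policy their StatefulSet actually uses, it can be read straight from the spec. A minimal sketch, assuming the placeholder names used on this page (`my-vcluster`, `vcluster-my-team`):

```bash
# Print the podManagementPolicy of the vCluster StatefulSet.
# "Parallel" is the current default; "OrderedReady" indicates a legacy setup.
kubectl get statefulset my-vcluster -n vcluster-my-team \
  -o jsonpath='{.spec.podManagementPolicy}'
```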
 
-<Step title="Delete the corrupted PersistentVolumeClaim">
-Delete the corrupted PVC for the first replica:
+<Flow>
+<Step title="Wait for automatic recovery">
+Monitor the vCluster pods and wait for automatic recovery to complete:
 
-<InterpolatedCodeBlock
-  code={`kubectl delete pvc [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+<InterpolatedCodeBlock
+  code={`kubectl get pods -l [[VAR:VCLUSTER LABEL:app=vcluster]] -n [[VAR:NAMESPACE:vcluster-my-team]] -w`}
   language="bash"
 />
 
 <br />
 
-Verify the PVC has been deleted:
+Check the logs to monitor recovery progress:
 
-<InterpolatedCodeBlock
-  code={`kubectl get pvc -l [[VAR:VCLUSTER LABEL:app=vcluster]] -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+<InterpolatedCodeBlock
+  code={`kubectl logs -f [[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
   language="bash"
 />
-</Step>
-
-<Step title="Create new PVC from working replica">
-Create a new PVC by [copying from a working replica](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#volume-cloning):
-
-<InterpolatedCodeBlock
-  code={`apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-0
-  namespace: [[VAR:NAMESPACE:vcluster-my-team]]
-spec:
-  accessModes:
-    - ReadWriteOnce
-  resources:
-    requests:
-      storage: [[VAR:STORAGE SIZE:5Gi]]
-  dataSource:
-    name: [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-1
-    kind: PersistentVolumeClaim
-  storageClassName: [[VAR:STORAGE CLASS:gp2]]`}
-  language="yaml"
-  title="pvc-restore.yaml"
-/>
 
 <br />
 
-Apply the PVC:
-
-<InterpolatedCodeBlock
-  code={`kubectl apply -f pvc-restore.yaml`}
-  language="bash"
-/>
+Automatic recovery removes the failed member from the etcd cluster, deletes the pod, and adds it back. The pod restarts with a fresh PVC and rejoins the cluster automatically.
 </Step>
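
Since automatic recovery is expected to finish within 10 minutes, a bounded wait can stand in for watching the pod list by hand. A minimal sketch, again using the placeholder values from this page:

```bash
# Block until replica 0 reports Ready again, or give up once the
# 10-minute window for automatic recovery has elapsed.
kubectl wait pod/my-vcluster-0 -n vcluster-my-team \
  --for=condition=Ready --timeout=10m
```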
 
-<Step title="Scale up vCluster to verify recovery">
-Start with one replica to verify the restored data:
+<Step title="Manual recovery if automatic fails">
+If automatic recovery does not complete after 10 minutes, manually trigger recovery by deleting the failed pod and PVC:
 
-<InterpolatedCodeBlock
-  code={`kubectl scale statefulset [[VAR:VCLUSTER NAME:my-vcluster]] --replicas=1 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+<InterpolatedCodeBlock
+  code={`kubectl delete pod [[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]
+kubectl delete pvc [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
   language="bash"
 />
 
 <br />
 
-Monitor the startup:
+The pod restarts with a new empty PVC, and vCluster's automatic recovery rejoins it to the cluster.
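
One detail worth knowing about the delete order above: a PVC that is still mounted by a running pod is held in Terminating by the `kubernetes.io/pvc-protection` finalizer, which is usually why a PVC delete appears to hang. An alternative sketch that queues the PVC deletion first without waiting (same placeholder names):

```bash
# Queue the PVC deletion without blocking; the pvc-protection finalizer
# holds it in Terminating until the pod releases the volume.
kubectl delete pvc data-my-vcluster-0 -n vcluster-my-team --wait=false
# Deleting the pod lets the queued PVC deletion complete; the
# StatefulSet controller then recreates both the pod and its PVC.
kubectl delete pod my-vcluster-0 -n vcluster-my-team
```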
 
-<InterpolatedCodeBlock
-  code={`kubectl logs -f [[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
-  language="bash"
-/>
-
-<br />
-
-After it's stable, scale up to the desired number of replicas.
+:::warning
+Never clone PVCs from other replicas. Cloning PVCs causes etcd member ID conflicts and results in data loss.
+:::
 </Step>
 </Flow>
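
Whichever path recovery takes, it is cheap to confirm at the end that a fresh PVC was provisioned and every replica is Ready again. A final check, assuming the page's placeholder label, PVC prefix, and namespace:

```bash
# The recreated PVC for replica 0 should show a young AGE, and all
# replicas should be Running and Ready before resuming normal use.
kubectl get pvc -l app=vcluster -n vcluster-my-team
kubectl get pods -l app=vcluster -n vcluster-my-team
```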