Commit 4b9c82e

refactor: address pr feedback
1 parent 6b15d54 commit 4b9c82e

1 file changed
vcluster/configure/vcluster-yaml/control-plane/components/backing-store/etcd/embedded.mdx

Lines changed: 109 additions & 40 deletions
@@ -9,6 +9,7 @@ description: Configure an embedded etcd instance as the virtual cluster's backin
 import ConfigReference from '../../../../../../_partials/config/controlPlane/backingStore/etcd/embedded.mdx'
 import ProAdmonition from '../../../../../../_partials/admonitions/pro-admonition.mdx'
 import InterpolatedCodeBlock from "@site/src/components/InterpolatedCodeBlock";
+import PageVariables from "@site/src/components/PageVariables";
 import Flow, { Step } from '@site/src/components/Flow';
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
@@ -95,6 +96,26 @@ Normal pod restarts or terminations do not require manual recovery. These events

 Recovery procedures depend on whether the first replica (the pod ending with `-0`) is among the failing replicas.

+:::note
+The recovery procedure for the first replica also depends on your StatefulSet's `podManagementPolicy` configuration (`Parallel` or `OrderedReady`). See the [first replica recovery section](#migrate-to-parallel) for details on migrating between policies if needed.
+:::
+
+:::info Find your vCluster namespace
+If using a VirtualClusterInstance (platform), the vCluster StatefulSet runs in a different namespace than the VirtualClusterInstance itself. Find the StatefulSet namespace with:
+```bash
+kubectl get virtualclusterinstance <instance-name> -n <vci-namespace> -o jsonpath='{.spec.clusterRef.namespace}'
+```
+For example, if your VirtualClusterInstance is named `my-vcluster` in the `p-default` namespace, the StatefulSet might be in `vcluster-my-vcluster-p-default`.
+
+If using Helm, the namespace is the one you specified during installation (e.g., `vcluster-my-team`).
+:::
+
+<PageVariables
+  VCLUSTER_NAME="my-vcluster"
+  NAMESPACE="vcluster-my-team"
+  VCLUSTER_LABEL="app=vcluster"
+/>
+
 Use the following procedures when some replicas are still functioning:
 <br />

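The note added in the hunk above names `my-vcluster` in `p-default` as its example; a hedged illustration of what that lookup returns, using exactly those example names (the output line is illustrative only):

```bash
# Illustrative only: instance and namespace names are the example values from the note above.
kubectl get virtualclusterinstance my-vcluster -n p-default \
  -o jsonpath='{.spec.clusterRef.namespace}'
# Expected output (illustrative): vcluster-my-vcluster-p-default
```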
@@ -106,7 +127,7 @@ Use the following procedures when some replicas are still functioning:
 Scale the StatefulSet to one replica:

 <InterpolatedCodeBlock
-  code={`kubectl scale statefulset [[VAR:VCLUSTER NAME:my-vcluster]] --replicas=1 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+  code={`kubectl scale statefulset [[GLOBAL:VCLUSTER_NAME]] --replicas=1 -n [[GLOBAL:NAMESPACE]]`}
   language="bash"
 />

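With the `PageVariables` defaults introduced above (`my-vcluster`, `vcluster-my-team`), the interpolated command resolves to plain kubectl; a hedged illustration of how the `[[GLOBAL:...]]` placeholders substitute on the rendered page:

```bash
# [[GLOBAL:VCLUSTER_NAME]] -> my-vcluster, [[GLOBAL:NAMESPACE]] -> vcluster-my-team
kubectl scale statefulset my-vcluster --replicas=1 -n vcluster-my-team
```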
@@ -115,7 +136,7 @@ Scale the StatefulSet to one replica:
 Verify only one pod is running:

 <InterpolatedCodeBlock
-  code={`kubectl get pods -l [[VAR:VCLUSTER LABEL:app=vcluster]] -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+  code={`kubectl get pods -l [[GLOBAL:VCLUSTER_LABEL]] -n [[GLOBAL:NAMESPACE]]`}
   language="bash"
 />
 </Step>
@@ -124,7 +145,7 @@ Verify only one pod is running:
 Monitor the rebuild process:

 <InterpolatedCodeBlock
-  code={`kubectl logs -f [[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+  code={`kubectl logs -f [[GLOBAL:VCLUSTER_NAME]]-0 -n [[GLOBAL:NAMESPACE]]`}
   language="bash"
 />

@@ -137,7 +158,7 @@ Watch for log messages indicating etcd is ready and the cluster is in good condi
 Scale back up to your target replica count:

 <InterpolatedCodeBlock
-  code={`kubectl scale statefulset [[VAR:VCLUSTER NAME:my-vcluster]] --replicas=[[VAR:DESIRED REPLICA COUNT:3]] -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+  code={`kubectl scale statefulset [[GLOBAL:VCLUSTER_NAME]] --replicas=[[VAR:REPLICA COUNT:3]] -n [[GLOBAL:NAMESPACE]]`}
   language="bash"
 />

@@ -146,8 +167,8 @@ Scale back up to your target replica count:
 Verify all replicas are running:

 <InterpolatedCodeBlock
-  code={`kubectl get pods -l [[VAR:VCLUSTER LABEL:app=vcluster]] -n [[VAR:NAMESPACE:vcluster-my-team]]
-kubectl logs [[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]] | grep "cluster is ready"`}
+  code={`kubectl get pods -l [[GLOBAL:VCLUSTER_LABEL]] -n [[GLOBAL:NAMESPACE]]
+kubectl logs [[GLOBAL:VCLUSTER_NAME]]-0 -n [[GLOBAL:NAMESPACE]] | grep "cluster is ready"`}
   language="bash"
 />
 </Step>
@@ -158,15 +179,28 @@ kubectl logs [[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-
 <TabItem value="first-replica-failing" label="First replica is failing">

 :::warning
-Before attempting any recovery procedure, create a backup of the virtual cluster namespace on the host cluster. If using namespace syncing, back up all synced namespaces as well.
+Before attempting any recovery procedure, [create a backup](../../../../../../manage/backup-restore/backup.mdx) of your virtual cluster using `vcluster snapshot create --include-volumes`. This ensures both the virtual cluster's etcd data and persistent volumes are backed up.
+
+If the virtual cluster's etcd is in a bad state and the snapshot command fails, you can still back up from the host cluster (which has its own functioning etcd). Use your preferred backup solution (e.g., Velero, Kasten, or cloud-native backup tools) to back up the host cluster namespace containing the vCluster resources. Ensure the backup includes:
+- All Kubernetes resources in the vCluster namespace (StatefulSet, Services, etc.)
+- PersistentVolumeClaims and their associated volume data (contains the virtual cluster's etcd data)
+- Secrets and ConfigMaps
+
+Once the backup is restored, the vCluster pods restart and the virtual cluster is recreated from the backed-up etcd data.
+
+If using namespace syncing, back up all synced namespaces on the host cluster as well.
 :::

 The recovery procedure depends on your StatefulSet `podManagementPolicy` configuration. vCluster version 0.20 and later use `Parallel` by default. Earlier versions used `OrderedReady`.

+:::info
+If more than one pod is down with `podManagementPolicy: OrderedReady`, you must first [migrate to `Parallel`](#migrate-to-parallel) before attempting recovery.
+:::
+
 Check your configuration:

 <InterpolatedCodeBlock
-  code={`kubectl get statefulset [[VAR:VCLUSTER NAME:my-vcluster]] -n [[VAR:NAMESPACE:vcluster-my-team]] -o jsonpath='{.spec.podManagementPolicy}'`}
+  code={`kubectl get statefulset [[GLOBAL:VCLUSTER_NAME]] -n [[GLOBAL:NAMESPACE]] -o jsonpath='{.spec.podManagementPolicy}'`}
   language="bash"
 />

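The warning in the hunk above lists what a host-cluster fallback backup must cover; as a hedged sanity check before running your backup tool, you can enumerate those resources first. The namespace below is the PageVariables default and is an assumption about your installation:

```bash
# Hedged sketch: list the host-cluster resources the fallback backup should include
# (workload objects, PVCs, Secrets, ConfigMaps). Adjust the namespace to yours.
kubectl get statefulset,service,pvc,secret,configmap -n vcluster-my-team
```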
@@ -175,24 +209,33 @@ Check your configuration:

 <Flow>
 <Step title="Delete the failed pod and PVC">
-Delete the corrupted pod and PVC for replica-0:
+First, identify the PVC for replica-0:
+
+<InterpolatedCodeBlock
+  code={`kubectl get pvc -l [[GLOBAL:VCLUSTER_LABEL]] -n [[GLOBAL:NAMESPACE]]`}
+  language="bash"
+/>
+
+<br />
+
+The PVC name typically follows the pattern `data-<vcluster-name>-0` but may vary if customized in your configuration. Note the exact name from the output above, then delete the corrupted pod and its PVC:

 <InterpolatedCodeBlock
-  code={`kubectl delete pod [[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]
-kubectl delete pvc [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+  code={`kubectl delete pod [[GLOBAL:VCLUSTER_NAME]]-0 -n [[GLOBAL:NAMESPACE]]
+kubectl delete pvc [[VAR:PVC PREFIX:data]]-[[GLOBAL:VCLUSTER_NAME]]-0 -n [[GLOBAL:NAMESPACE]]`}
   language="bash"
 />

 <br />

-The pod restarts with a new empty PVC. After 1-3 pod restarts, the automatic recovery adds it back to the etcd cluster.
+The pod restarts with a new empty PVC. The initial attempts fail because the new member tries to join the existing etcd cluster but lacks the required data. After 1-3 pod restarts, vCluster's automatic recovery detects the empty member and properly adds it as a new learner, allowing it to sync data from healthy members and join the cluster.
 </Step>

 <Step title="Monitor recovery">
 Monitor the recovery process:

 <InterpolatedCodeBlock
-  code={`kubectl get pods -l [[VAR:VCLUSTER LABEL:app=vcluster]] -n [[VAR:NAMESPACE:vcluster-my-team]] -w`}
+  code={`kubectl get pods -l [[GLOBAL:VCLUSTER_LABEL]] -n [[GLOBAL:NAMESPACE]] -w`}
   language="bash"
 />

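Because replica-0 may cycle through a few restarts before the learner sync described above completes, a hedged alternative to watching with `-w` is to block until the pod reports Ready. Pod and namespace names below are the PageVariables defaults, and the timeout is an assumption:

```bash
# Hedged sketch: wait for replica-0 to rejoin after its PVC is recreated.
# Names are the PageVariables defaults; the 10-minute timeout is an assumption.
kubectl wait pod my-vcluster-0 -n vcluster-my-team \
  --for=condition=Ready --timeout=10m
```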
@@ -201,7 +244,7 @@ Monitor the recovery process:
 Check the logs to verify the pod rejoins successfully:

 <InterpolatedCodeBlock
-  code={`kubectl logs -f [[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+  code={`kubectl logs -f [[GLOBAL:VCLUSTER_NAME]]-0 -n [[GLOBAL:NAMESPACE]]`}
   language="bash"
 />
 </Step>
@@ -220,7 +263,7 @@ If more than one pod is down with `podManagementPolicy: OrderedReady`, migrate t
 Check that the StatefulSet retains PVCs on deletion:

 <InterpolatedCodeBlock
-  code={`kubectl get statefulset [[VAR:VCLUSTER NAME:my-vcluster]] -n [[VAR:NAMESPACE:vcluster-my-team]] -o jsonpath='{.spec.persistentVolumeClaimRetentionPolicy}'`}
+  code={`kubectl get statefulset [[GLOBAL:VCLUSTER_NAME]] -n [[GLOBAL:NAMESPACE]] -o jsonpath='{.spec.persistentVolumeClaimRetentionPolicy}'`}
   language="bash"
 />

@@ -233,41 +276,53 @@ The policy should be `Retain`. This is the default but can be overridden by `con
 Delete the StatefulSet without deleting the pods:

 <InterpolatedCodeBlock
-  code={`kubectl delete statefulset [[VAR:VCLUSTER NAME:my-vcluster]] -n [[VAR:NAMESPACE:vcluster-my-team]] --cascade=orphan`}
+  code={`kubectl delete statefulset [[GLOBAL:VCLUSTER_NAME]] -n [[GLOBAL:NAMESPACE]] --cascade=orphan`}
   language="bash"
 />
 </Step>

 <Step title="Update configuration to Parallel">
+<a id="migrate-to-parallel"></a>
+
 Update your virtual cluster configuration to use `Parallel` pod management policy.

-If using a VirtualClusterInstance:
+If using a VirtualClusterInstance, edit the instance and update the `podManagementPolicy`:

 <InterpolatedCodeBlock
-  code={`kubectl edit virtualclusterinstance [[VAR:VCLUSTER NAME:my-vcluster]] -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+  code={`kubectl edit virtualclusterinstance [[GLOBAL:VCLUSTER_NAME]] -n [[GLOBAL:NAMESPACE]]`}
   language="bash"
 />

+Then add or update this section in the spec:
+
+```yaml
+spec:
+  template:
+    helmRelease:
+      values: |
+        controlPlane:
+          statefulSet:
+            scheduling:
+              podManagementPolicy: Parallel
+```
+
 <br />

-Add or update the following configuration:
+If using Helm, update your `values.yaml` to set the pod management policy:

-<InterpolatedCodeBlock
-  code={`controlPlane:
+```yaml title="values.yaml"
+controlPlane:
   statefulSet:
     scheduling:
-      podManagementPolicy: Parallel`}
-  language="yaml"
-/>
-
-<br />
+      podManagementPolicy: Parallel
+```

-If using Helm, update your `values.yaml` and run:
+Then apply the update:

 <InterpolatedCodeBlock
-  code={`helm upgrade [[VAR:VCLUSTER NAME:my-vcluster]] vcluster \
+  code={`helm upgrade [[GLOBAL:VCLUSTER_NAME]] vcluster \
   --repo https://charts.loft.sh \
-  --namespace [[VAR:NAMESPACE:vcluster-my-team]] \
+  --namespace [[GLOBAL:NAMESPACE]] \
   --reuse-values \
   -f values.yaml`}
   language="bash"
279334
</Step>
280335

281336
<Step title="Delete the failed pod and PVC">
282-
Now follow the same procedure as for `Parallel` mode:
337+
Now follow the same procedure as for `Parallel` mode.
338+
339+
First, identify the PVC for replica-0:
340+
341+
<InterpolatedCodeBlock
342+
code={`kubectl get pvc -l [[GLOBAL:VCLUSTER_LABEL]] -n [[GLOBAL:NAMESPACE]]`}
343+
language="bash"
344+
/>
345+
346+
<br />
347+
348+
The PVC name typically follows the pattern `data-<vcluster-name>-0` but may vary if customized in your configuration. Note the exact name from the output above, then delete the corrupted pod and its PVC:
283349

284350
<InterpolatedCodeBlock
285-
code={`kubectl delete pod [[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]
286-
kubectl delete pvc [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
351+
code={`kubectl delete pod [[GLOBAL:VCLUSTER_NAME]]-0 -n [[GLOBAL:NAMESPACE]]
352+
kubectl delete pvc [[VAR:PVC PREFIX:data]]-[[GLOBAL:VCLUSTER_NAME]]-0 -n [[GLOBAL:NAMESPACE]]`}
287353
language="bash"
288354
/>
289355

290356
<br />
291357

292-
The pod restarts with a new empty PVC and automatic recovery adds it back to the cluster after 1-3 pod restarts.
358+
The pod restarts with a new empty PVC. The initial attempts fail because the new member tries to join the existing etcd cluster but lacks the required data. After 1-3 pod restarts, vCluster's automatic recovery detects the empty member and properly adds it as a new learner, allowing it to sync data from healthy members and join the cluster.
293359
</Step>
294360
</Flow>
295361

@@ -316,14 +382,14 @@ When the majority of etcd member replicas become corrupted or deleted simultaneo
 Verify all PVCs are corrupted or inaccessible:

 <InterpolatedCodeBlock
-  code={`kubectl get pvc -l [[VAR:VCLUSTER LABEL:app=vcluster]] -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+  code={`kubectl get pvc -l [[GLOBAL:VCLUSTER_LABEL]] -n [[GLOBAL:NAMESPACE]]`}
   language="bash"
 />

 <br />

 <InterpolatedCodeBlock
-  code={`kubectl describe pvc [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-0 [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-1 [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-2 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+  code={`kubectl describe pvc [[VAR:PVC PREFIX:data]]-[[GLOBAL:VCLUSTER_NAME]]-0 [[VAR:PVC PREFIX:data]]-[[GLOBAL:VCLUSTER_NAME]]-1 [[VAR:PVC PREFIX:data]]-[[GLOBAL:VCLUSTER_NAME]]-2 -n [[GLOBAL:NAMESPACE]]`}
   language="bash"
 />
 </Step>
@@ -332,7 +398,7 @@ Verify all PVCs are corrupted or inaccessible:
 Stop all vCluster instances before beginning recovery:

 <InterpolatedCodeBlock
-  code={`kubectl scale statefulset [[VAR:VCLUSTER NAME:my-vcluster]] --replicas=0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+  code={`kubectl scale statefulset [[GLOBAL:VCLUSTER_NAME]] --replicas=0 -n [[GLOBAL:NAMESPACE]]`}
   language="bash"
 />
 </Step>
@@ -341,13 +407,16 @@ Stop all vCluster instances before beginning recovery:
 Delete all corrupted PVCs:

 <InterpolatedCodeBlock
-  code={`kubectl delete pvc [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-0 [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-1 [[VAR:PVC PREFIX:data]]-[[VAR:VCLUSTER NAME:my-vcluster]]-2 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+  code={`kubectl delete pvc [[VAR:PVC PREFIX:data]]-[[GLOBAL:VCLUSTER_NAME]]-0 [[VAR:PVC PREFIX:data]]-[[GLOBAL:VCLUSTER_NAME]]-1 [[VAR:PVC PREFIX:data]]-[[GLOBAL:VCLUSTER_NAME]]-2 -n [[GLOBAL:NAMESPACE]]`}
   language="bash"
 />
 </Step>

 <Step title="Restore from backup or snapshot">
-Follow a backup restoration procedure. This typically involves restoring PVCs from your backup solution (Velero, CSI snapshots, or similar tools).
+Restore from a previous backup using one of these methods:
+- [vCluster snapshot restore](../../../../../../manage/backup-restore/restore.mdx) - Built-in snapshot restoration
+- [Volume snapshots](../../../../../../manage/backup-restore/volume-snapshots.mdx) - CSI volume snapshot restoration
+- [Velero](../../../../../../manage/backup-restore/velero.mdx) - Velero backup restoration

 <br />

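If Velero is the backup tool in use, a hedged sketch of the restore option listed above (the backup and restore names are placeholders, and the namespace is the PageVariables default; the linked guides cover the other two methods):

```bash
# Hedged sketch: restore the vCluster host namespace from an existing Velero backup.
# Backup/restore names are placeholders; adjust the namespace to your installation.
velero restore create my-vcluster-restore \
  --from-backup my-vcluster-backup \
  --include-namespaces vcluster-my-team
```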
@@ -363,7 +432,7 @@ Restore from snapshot:
 Scale up to a single replica to verify the restoration:

 <InterpolatedCodeBlock
-  code={`kubectl scale statefulset [[VAR:VCLUSTER NAME:my-vcluster]] --replicas=1 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+  code={`kubectl scale statefulset [[GLOBAL:VCLUSTER_NAME]] --replicas=1 -n [[GLOBAL:NAMESPACE]]`}
   language="bash"
 />

@@ -372,7 +441,7 @@ Scale up to a single replica to verify the restoration:
 Monitor logs and verify the cluster starts successfully:

 <InterpolatedCodeBlock
-  code={`kubectl logs -f [[VAR:VCLUSTER NAME:my-vcluster]]-0 -n [[VAR:NAMESPACE:vcluster-my-team]]`}
+  code={`kubectl logs -f [[GLOBAL:VCLUSTER_NAME]]-0 -n [[GLOBAL:NAMESPACE]]`}
   language="bash"
 />
