Recovery procedures depend on whether the first replica (the pod ending with `-0`) is among the failing replicas.

:::note
The recovery procedure for the first replica also depends on your StatefulSet's `podManagementPolicy` configuration (`Parallel` or `OrderedReady`). See the [first replica recovery section](#migrate-to-parallel) for details on migrating between policies if needed.
:::

:::info Find your vCluster namespace
If using VirtualClusterInstance (platform), the vCluster StatefulSet runs in a different namespace than the VirtualClusterInstance itself. Find the StatefulSet namespace with:

```bash
kubectl get virtualclusterinstance <instance-name> -n <vci-namespace> -o jsonpath='{.spec.clusterRef.namespace}'
```

For example, if your VirtualClusterInstance is named `my-vcluster` in the `p-default` namespace, the StatefulSet might be in `vcluster-my-vcluster-p-default`.

If using Helm, the namespace is what you specified during installation (e.g., `vcluster-my-team`).
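
If you are unsure which namespace that was, one way to locate it is to list StatefulSets across all namespaces. This is a sketch that assumes the default `app=vcluster` label used throughout this guide; adjust the selector if you customized labels:

```shell
# List vCluster StatefulSets in every namespace; the NAMESPACE column
# shows where each virtual cluster's control plane runs.
kubectl get statefulsets --all-namespaces -l app=vcluster
```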
:::

<PageVariables
VCLUSTER_NAME="my-vcluster"
NAMESPACE="vcluster-my-team"
VCLUSTER_LABEL="app=vcluster"
/>

Use the following procedures when some replicas are still functioning:

<br />

<TabItem value="first-replica-failing" label="First replica is failing">

:::warning
Before attempting any recovery procedure, [create a backup](../../../../../../manage/backup-restore/backup.mdx) of your virtual cluster using `vcluster snapshot create --include-volumes`. This ensures both the virtual cluster's etcd data and persistent volumes are backed up.

If the virtual cluster's etcd is in a bad state and the snapshot command fails, you can still back up from the host cluster (which has its own functioning etcd). Use your preferred backup solution (e.g., Velero, Kasten, or cloud-native backup tools) to back up the host cluster namespace containing the vCluster resources. Ensure the backup includes:

- All Kubernetes resources in the vCluster namespace (StatefulSet, Services, etc.)
- PersistentVolumeClaims and their associated volume data (contains the virtual cluster's etcd data)
- Secrets and ConfigMaps

When restored, the vCluster pods will restart and the virtual cluster will be recreated from the backed-up etcd data.

If using namespace syncing, back up all synced namespaces on the host cluster as well.
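
As an illustrative sketch only, assuming Velero is installed on the host cluster and the vCluster namespace is `vcluster-my-team`, a host-side backup covering both resources and volume data might look like:

```shell
# Back up the host namespace containing the vCluster resources,
# capturing PV data via volume snapshots.
velero backup create vcluster-my-team-backup \
  --include-namespaces vcluster-my-team \
  --snapshot-volumes
```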
:::

The recovery procedure depends on your StatefulSet `podManagementPolicy` configuration. vCluster versions 0.20 and later use `Parallel` by default; earlier versions used `OrderedReady`.

:::info
If more than one pod is down with `podManagementPolicy: OrderedReady`, you must first [migrate to `Parallel`](#migrate-to-parallel) before attempting recovery.
:::

Check your configuration:

<InterpolatedCodeBlock
code={`kubectl get statefulset [[GLOBAL:VCLUSTER_NAME]] -n [[GLOBAL:NAMESPACE]] -o jsonpath='{.spec.podManagementPolicy}'`}
language="bash"
/>

<Flow>
<Step title="Delete the failed pod and PVC">
First, identify the PVC for replica-0:

<InterpolatedCodeBlock
code={`kubectl get pvc -l [[GLOBAL:VCLUSTER_LABEL]] -n [[GLOBAL:NAMESPACE]]`}
language="bash"
/>

<br />

The PVC name typically follows the pattern `data-<vcluster-name>-0` but may vary if customized in your configuration. Note the exact name from the output above, then delete the corrupted pod and its PVC:

<InterpolatedCodeBlock
code={`kubectl delete pod [[GLOBAL:VCLUSTER_NAME]]-0 -n [[GLOBAL:NAMESPACE]]
kubectl delete pvc data-[[GLOBAL:VCLUSTER_NAME]]-0 -n [[GLOBAL:NAMESPACE]]`}
language="bash"
/>

The pod restarts with a new empty PVC. The initial attempts fail because the new member tries to join the existing etcd cluster but lacks the required data. After 1-3 pod restarts, vCluster's automatic recovery detects the empty member and properly adds it as a new learner, allowing it to sync data from healthy members and join the cluster.
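
To watch the member actually rejoin, you can inspect etcd membership from a healthy replica. This is a sketch that assumes `etcdctl` is available inside the vCluster pod and that any TLS or endpoint flags required by your deployment are added; the exact invocation varies by setup:

```shell
# Ask a healthy member (replica 1 here) for the current member list;
# the rejoining member first appears as a learner before being promoted.
kubectl exec my-vcluster-1 -n vcluster-my-team -- etcdctl member list -w table
```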
</Step>

<Step title="Monitor recovery">
Monitor the recovery process:
193
236
194
237
<InterpolatedCodeBlock
code={`kubectl get pods -l [[GLOBAL:VCLUSTER_LABEL]] -n [[GLOBAL:NAMESPACE]] -w`}
language="bash"
/>

Check the logs to verify the pod rejoins successfully:
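
A sketch of one way to follow the logs, assuming the pod and namespace names used elsewhere in this guide; the exact log lines to look for vary by vCluster version:

```shell
# Stream logs from the restarted replica and surface etcd-related lines.
kubectl logs my-vcluster-0 -n vcluster-my-team -f | grep -i etcd
```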
</Step>

<Step title="Delete the failed pod and PVC">
Now follow the same procedure as for `Parallel` mode.

First, identify the PVC for replica-0:

<InterpolatedCodeBlock
code={`kubectl get pvc -l [[GLOBAL:VCLUSTER_LABEL]] -n [[GLOBAL:NAMESPACE]]`}
language="bash"
/>
345
+
346
+
<br />
347
+
348
+
The PVC name typically follows the pattern `data-<vcluster-name>-0` but may vary if customized in your configuration. Note the exact name from the output above, then delete the corrupted pod and its PVC:

<InterpolatedCodeBlock
code={`kubectl delete pod [[GLOBAL:VCLUSTER_NAME]]-0 -n [[GLOBAL:NAMESPACE]]
kubectl delete pvc data-[[GLOBAL:VCLUSTER_NAME]]-0 -n [[GLOBAL:NAMESPACE]]`}
language="bash"
/>

The pod restarts with a new empty PVC. The initial attempts fail because the new member tries to join the existing etcd cluster but lacks the required data. After 1-3 pod restarts, vCluster's automatic recovery detects the empty member and properly adds it as a new learner, allowing it to sync data from healthy members and join the cluster.
</Step>
</Flow>

Verify all PVCs are corrupted or inaccessible:

<InterpolatedCodeBlock
code={`kubectl get pvc -l [[GLOBAL:VCLUSTER_LABEL]] -n [[GLOBAL:NAMESPACE]]`}
language="bash"
/>