Commit 6538475
CLOUDP-350185 Fix flaky e2e_multi_cluster_sharded_snippets test (#503)
# Fix flaky e2e_multi_cluster_sharded_snippets test
## Problem
The `e2e_multi_cluster_sharded_snippets` test fails intermittently when
the Kubernetes API server times out during resource creation.
Example run:
https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_snippets_12f405afd0f823091430f0be8f4ac21d87a9559c_25_10_05_20_58_10/files?execution=0&sorts=STATUS%3AASC
**What I noticed in my investigation:**
1. Test deploys 5 sharded MongoDB clusters simultaneously (~75-100
services across 3 clusters)
2. Around 7-8 minutes in, K8s API server times out on a service update
operation
3. Operator marks resource as Failed with error: `"the server was unable
to return a response in the time allotted, but may still be processing
the request"`
4. Test immediately fails
5. Minutes later, the resource actually reaches Running (the timeout was
transient)
## Investigation
- Operator issues hundreds of K8s API requests during reconciliation
- This overloads the kind cluster's API server
- K8s API timeouts are transient - services and pods are created
successfully, just slower than expected
- After being marked Failed, resources recover within 4-5 minutes
## This Fix
Add K8s API timeout patterns to the `intermediate_events` list in
`mongodb.py`:
- `"but may still be processing the request"` (server-side timeout)
- `"Client.Timeout exceeded while awaiting headers"` (client-side
timeout)
**Effect:**
- When operator marks resource as Failed with K8s API timeout error,
test skips the failure
- Test continues waiting for resource to reach Running
- Test passes once resource recovers (which it does)
This is the same pattern used for other transient failures like agent
registration timeouts and Ops Manager connection issues.
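As a rough illustration of the matching described above, here is a minimal sketch. Only the two timeout substrings and the `intermediate_events` name come from this change; the helper names `find_intermediate_event` and `should_fail_test` are hypothetical, and the actual code in `mongodb.py` may be structured differently.

```python
from typing import Optional

# The two substrings added by this change. The real list in mongodb.py also
# holds other transient patterns that are not reproduced here.
intermediate_events = [
    "but may still be processing the request",  # server-side timeout
    "Client.Timeout exceeded while awaiting headers",  # client-side timeout
]


def find_intermediate_event(failure_message: str) -> Optional[str]:
    """Return the first transient pattern found in the message, or None."""
    for pattern in intermediate_events:
        if pattern in failure_message:
            return pattern
    return None


def should_fail_test(phase: str, failure_message: str) -> bool:
    """Hypothetical check: fail only on Failed states that are not transient."""
    if phase != "Failed":
        return False
    return find_intermediate_event(failure_message) is None
```

With this shape, a Failed state whose message contains one of the timeout substrings is skipped, and the test keeps waiting for Running.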
## Proper Fix (Future Work)
The operator should not mark resources as Failed on K8s API timeout.
Instead, for example:
1. Detect K8s API timeout errors
2. Retry with exponential backoff
3. Only mark Failed after multiple consecutive timeouts
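The three steps above could look roughly like the following sketch. It is written in Python only to match the test code and is not operator code; `ApiTimeoutError`, `call_with_backoff`, and all parameter values are assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ApiTimeoutError(Exception):
    """Hypothetical marker for a K8s API timeout (step 1: detection)."""


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` on timeouts, doubling the delay after each one."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ApiTimeoutError:
            # Step 3: give up (the caller would mark Failed) only after
            # max_attempts consecutive timeouts.
            if attempt == max_attempts:
                raise
            # Step 2: wait, then retry with an exponentially growing delay.
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry schedule can be checked without real waiting.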
## Proof of work
Ran 4 patches to check for flakiness after the fix:
1. [Patch
1](https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_snippets_patch_c5839ff5b5d5b5b1e338b476c9d299513525f506_68e6597888fa050007a68f9e_25_10_08_12_30_49/logs?execution=0)
2. [Patch
2](https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_snippets_patch_c5839ff5b5d5b5b1e338b476c9d299513525f506_68e6597ca47d640007870576_25_10_08_12_30_53/logs?execution=0)
3. [Patch
3](https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_snippets_patch_c5839ff5b5d5b5b1e338b476c9d299513525f506_68e6598068612800074a09bc_25_10_08_12_30_58/logs?execution=0)
4. [Patch
4](https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_snippets_patch_c5839ff5b5d5b5b1e338b476c9d299513525f506_68e6598bb1a26200071edb2c_25_10_08_12_31_08/logs?execution=0)
All patches reached Running despite intermediate failures like:
```
[2025/10/08 15:03:38.199] DEBUG 2025-10-08 13:03:38,198 [mongodb_utils_state] Found intermediate event in failure: Client.Timeout exceeded while awaiting headers in Failed to create configmap: a-1759927824-grtlr6pj55z/pod-template-shards-0-hostname-override in cluster: kind-e2e-cluster-1, err: Put "https://10.97.0.1/api/v1/namespaces/a-1759927824-grtlr6pj55z/configmaps/pod-template-shards-0-hostname-override?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers). Skipping the failure state
```
The test now properly skips these transient API timeout failures and
waits for resources to recover.
`backup_minio` tests are failing, but they also fail on many other branches.

1 parent c45bc73, commit 6538475
1 file changed: 5 lines added (new lines 92-96), 0 removed.