Fix flakiness in the E2E test e2e_multi_cluster_replica_set_scale_up #231


Merged

Conversation

Contributor

@viveksinghggits viveksinghggits commented Jul 4, 2025

Summary

The E2E test e2e_multi_cluster_replica_set_scale_up has been flaky, and @lucian-tosa suggested that we fix it. It was failing while waiting for the StatefulSets (STSs) of a multi-cluster MongoDB deployment to have the correct number of replicas. The problem was that sometimes, after the MongoDBMultiCluster (mdbmc) resource reached the Running phase (which implies all STSs are ready), some of the STSs transitioned back into a not-ready state.
When the test sees that the mdbmc resource is Running, it verifies that the STSs have the correct number of replicas. Because of the problem above (STSs transitioning from ready to not ready), an STS did not have the correct number of ready replicas at that moment and the test failed.
An STS was transitioning from ready to not ready because one of the pods it manages did the same, i.e., it went from ready to not ready. After looking into it further, we found that the pod behaves like this because its readiness probe sometimes fails momentarily: the pod becomes ready, transitions to not ready (readiness probe failure), and then eventually becomes ready again. This is documented in much more detail in the document here.

The ideal fix would be to figure out why the readiness probe fails and address that. This PR instead contains a workaround that changes the test slightly to wait for the STSs to reach the correct number of ready replicas.
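The workaround described above can be sketched as a simple polling helper. This is a minimal illustration, not the actual test code: the function name, the `get_ready_replicas` callable, and the timeout values are all hypothetical.

```python
# Hedged sketch of the workaround: instead of checking replica counts once
# after the MongoDBMultiCluster resource reports Running, poll until every
# StatefulSet reports the expected number of ready replicas (tolerating the
# brief ready -> not-ready -> ready flapping caused by the readiness probe).
import time


def wait_for_sts_ready_replicas(get_ready_replicas, expected, timeout=300, interval=5):
    """Poll get_ready_replicas() until it equals `expected` or `timeout` expires.

    get_ready_replicas: zero-arg callable returning the current list of ready
    replica counts, one entry per StatefulSet (hypothetical helper).
    Returns True if the expected counts were observed, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_ready_replicas() == expected:
            return True
        time.sleep(interval)
    return False
```

The key difference from the flaky version is that a momentary not-ready reading no longer fails the test; the loop simply re-checks until the counts settle or the timeout is hit.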

Jira ticket: https://jira.mongodb.org/browse/CLOUDP-329422

Proof of Work

Ran the test e2e_multi_cluster_replica_set_scale_up manually and locally to make sure that it passes consistently. I can no longer reproduce the flakiness.

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you checked for release_note changes?

@viveksinghggits viveksinghggits requested a review from a team as a code owner July 4, 2025 17:32
@codeowners-service-app codeowners-service-app bot requested a review from lsierant July 4, 2025 17:32

codeowners-service-app bot commented Jul 4, 2025

Assigned lsierant for team kubernetes-hosted because MaciejKaras is out of office.
Assigned nammn for team kubernetes-hosted because SimonBaeumer is out of office.

@codeowners-service-app codeowners-service-app bot requested a review from nammn July 4, 2025 17:33
Collaborator

@MaciejKaras MaciejKaras left a comment


Great work Vivek! Nice investigation, and I like the description of the issues + the Jira ticket for the root-cause issue 🥇

1. Improve the comment to mention that this change can be reverted once a proper fix is made for the underlying issue that made this test flaky
Contributor

@lucian-tosa lucian-tosa left a comment


Nice work, LGTM

@viveksinghggits viveksinghggits merged commit 5710105 into master Jul 8, 2025
35 checks passed
@viveksinghggits viveksinghggits deleted the fix-flakiness-e2e_multi_cluster_replica_set_scale_up branch July 8, 2025 10:43