Fix flakiness in the E2E test e2e_multi_cluster_replica_set_scale_up #231


Merged

Conversation

Contributor

@viveksinghggits viveksinghggits commented Jul 4, 2025

Summary

The E2E test e2e_multi_cluster_replica_set_scale_up has been flaky, and @lucian-tosa suggested that we fix it. It was failing while waiting for the StatefulSets (STSs) of a multi-cluster MongoDB deployment to have the correct number of replicas. The problem was that sometimes, after the MongoDBMultiCluster (mdbmc) resource reached the Running phase (which implies all STSs are ready), some of the STSs transitioned back into a not-ready state.
When the test sees that the mdbmc resource is Running, it verifies that the STSs have the correct number of replicas. Because of the problem above (STSs transitioning from ready to not ready), an STS did not have the correct number of ready replicas at that moment and the test failed.
An STS was transitioning from ready to not ready because one of the pods it manages did the same, i.e., it went from ready to not ready. After looking into it further, we found that the pod behaves like this because its readiness probe sometimes fails momentarily: the pod becomes ready, transitions to not ready (readiness probe failure), and then eventually becomes ready again. This is documented in much more detail in the document here.

The ideal fix would be to figure out why the readiness probe fails and address that. This PR instead contains a workaround that changes the test slightly to wait for the STSs to reach the correct number of ready replicas.
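The workaround described above can be sketched as a simple polling helper. This is a minimal illustration, not the actual test code: the function name, the `get_ready_replicas` callable, and the timeout values are all hypothetical.

```python
# Hedged sketch of the workaround: instead of checking replica counts once
# after the MongoDBMultiCluster resource reports Running, poll until every
# StatefulSet reports the expected number of ready replicas (tolerating the
# brief ready -> not-ready -> ready flapping caused by the readiness probe).
import time


def wait_for_sts_ready_replicas(get_ready_replicas, expected, timeout=300, interval=5):
    """Poll get_ready_replicas() until it equals `expected` or `timeout` expires.

    get_ready_replicas: zero-arg callable returning the current list of ready
    replica counts, one entry per StatefulSet (hypothetical helper).
    Returns True if the expected counts were observed, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_ready_replicas() == expected:
            return True
        time.sleep(interval)
    return False
```

The key difference from the flaky version is that a momentary not-ready reading no longer fails the test; the loop simply re-checks until the counts settle or the timeout is hit.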

Jira ticket: https://jira.mongodb.org/browse/CLOUDP-329422

Proof of Work

Ran the test e2e_multi_cluster_replica_set_scale_up manually and locally to make sure that it passes consistently. I can no longer reproduce the flakiness.

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you checked for release_note changes?

@viveksinghggits viveksinghggits requested a review from a team as a code owner July 4, 2025 17:32
@codeowners-service-app codeowners-service-app bot requested a review from lsierant July 4, 2025 17:32

codeowners-service-app bot commented Jul 4, 2025

Assigned lsierant for team kubernetes-hosted because MaciejKaras is out of office.
Assigned nammn for team kubernetes-hosted because SimonBaeumer is out of office.

@codeowners-service-app codeowners-service-app bot requested a review from nammn July 4, 2025 17:33
Collaborator

@MaciejKaras MaciejKaras left a comment


Great work Vivek! Nice investigation, and I like the description of the issues + the Jira ticket for the root-cause issue 🥇

1. Improve the comment to mention that this change can be reverted once a proper fix is made for the underlying issue that made this test flaky
Contributor

@lucian-tosa lucian-tosa left a comment


Nice work, LGTM

@viveksinghggits viveksinghggits merged commit 5710105 into master Jul 8, 2025
35 checks passed
@viveksinghggits viveksinghggits deleted the fix-flakiness-e2e_multi_cluster_replica_set_scale_up branch July 8, 2025 10:43