Fix flakiness in the E2E test e2e_multi_cluster_replica_set_scale_up
#231
Conversation
docker/mongodb-kubernetes-tests/tests/multicluster/multi_cluster_replica_set_scale_up.py
Great work Vivek! Nice investigation, and I like the description of the issues plus the Jira ticket for the root-cause issue 🥇
1. Improve the comment to mention that this change can be reverted once a proper fix is made for the underlying issue that made this test flaky
Nice work, LGTM
Summary
The E2E test
e2e_multi_cluster_replica_set_scale_up
has been flaky, and @lucian-tosa suggested that we fix it. It has been failing while waiting for the statefulsets (STSs) of a multi-cluster MongoDB deployment to reach the correct number of ready replicas. The problem was that sometimes, after the MongoDBMultiCluster (mdbmc) resource reached the Running phase (which should mean all STSs are ready), some of the STSs transitioned back into a not-ready state. When the test sees that the mdbmc resource is Running, it verifies that the STSs have the correct number of replicas, but because of the transition above (an STS going from ready back to not ready), an STS did not have the correct count and the test failed.
The reason an STS transitioned from ready to not ready is that one of its pods did the same. Looking into it further, we found that the pod's readiness probe sometimes fails momentarily, so the pod becomes ready, then transitions to not ready (readiness probe failed), and then eventually becomes ready again. This is documented in much more detail in the document here.
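To illustrate why a momentary probe failure is visible to the test: a statefulset's `status.readyReplicas` is effectively a count of its pods whose Ready condition is True, so a single flapping pod briefly drops the count below `spec.replicas`. A minimal illustrative sketch (not operator or test code):

```python
def ready_replicas(pod_ready_flags):
    """Count pods whose Ready condition is True, mirroring how a
    statefulset's status.readyReplicas tracks its pods."""
    return sum(1 for ready in pod_ready_flags if ready)

# With 3 replicas all ready, the count matches spec.replicas;
# one momentary readiness-probe failure drops it to 2, which is
# what the test observed right after mdbmc went Running.
print(ready_replicas([True, True, True]))   # 3
print(ready_replicas([True, False, True]))  # 2
```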
The ideal fix would be to figure out why the readiness probe fails momentarily and then fix that. This PR instead contains a workaround that changes the test slightly to retry waiting for the STSs to reach the correct number of replicas.
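The workaround amounts to a poll-and-retry loop around the replica check instead of a single assertion. A minimal sketch of the pattern, assuming a generic helper (the name, timeout, and interval are illustrative, not the actual test code):

```python
import time

def wait_until(condition, timeout=300, interval=5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Retrying absorbs the momentary not-ready window caused by the
    flaky readiness probe; returns True on success, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)

# Hypothetical usage: retry until every statefulset reports the
# expected number of ready replicas (names are placeholders).
# wait_until(lambda: all(sts.ready_replicas == sts.expected for sts in stss))
```

The injectable `clock` and `sleep` parameters are only there so the loop can be exercised without real waiting; a plain `time.sleep` loop would work the same way in the test.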
Jira ticket: https://jira.mongodb.org/browse/CLOUDP-329422
Proof of Work
Ran the test
e2e_multi_cluster_replica_set_scale_up
manually and locally to make sure it passes consistently. I am not able to reproduce the flakiness now.
Checklist