-
Notifications
You must be signed in to change notification settings - Fork 9.2k
YARN-11357. Fix FederationClientInterceptor#submitApplication Can't Update SubClusterId #5055
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…pdate SubClusterId.
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
cnauroth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 (binding). However, I would like to see one other code review from someone who has spent more time working in the YARN federation code.
@slfan1989 , thank you for the contribution.
|
@goiri Can you help to review the code? Thank you very much! In YARN-11342 (#5005), there is a problem with writing a judgment condition, which will cause the subClusterId to fail to update when retrying. The modified content is as follows: |
|
@goiri Can you help review this pr? Thank you very much! |
...java/org/apache/hadoop/yarn/server/router/clientrm/TestFederationClientInterceptorRetry.java
Show resolved
Hide resolved
...java/org/apache/hadoop/yarn/server/router/clientrm/TestSequentialBroadcastPolicyManager.java
Show resolved
Hide resolved
| * Return to badCluster first. | ||
| */ | ||
| List<SubClusterId> subClusterIds = new ArrayList<>(preSelectSubClusters.keySet()); | ||
| if(subClusterIds.size() > 1){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spaces after if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix it.
...src/main/java/org/apache/hadoop/yarn/server/router/clientrm/FederationClientInterceptor.java
Show resolved
Hide resolved
|
@goiri Can you help review this pr again? Thank you very much! |
| LOG.info("Test submitApplication with two bad, one good SC."); | ||
|
|
||
| // This test must require the TestSequentialRouterPolicy policy | ||
| if (StringUtils.equals(routerPolicyManagerName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may make sense to use something like:
Assume.assumeThat()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestion, I will modify the code.
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
@goiri Can you help to merge this pr into trunk branch? Thank you very much! |
|
@goiri Thank you very much for helping to review the code! |
…pdate SubClusterId (apache#5055)
JIRA: YARN-11357. Fix FederationClientInterceptor#submitApplication Can't Update SubClusterId.
The reason for the test failure:
In YARN-11342 (#5005), I refactored the getNewApplication, submitApplication method, but in the process of implementation, the judgment condition of if was written incorrectly.
If subCluster does not exist, we should add subCluster, if subCluster exists, we should update subCluster.
Q1:Why is the unit test for YARN-11342 passing?
In this test report(#4982 TestReport).
I found that
TestFederationClientInterceptorRetry#testSubmitApplicationOneBadOneGoodwas abnormal.In this test(
testSubmitApplicationOneBadOneGood), we should get good subCluster, but it returns bad SubCluster.When I tested it locally, I found that this unit test sometimes succeeded and sometimes failed.
This is related to the
UniformBroadcastPolicyManager,UniformRandomRouterPolicyis used, this policy will randomlySelect 1 SubCluster from Active SubClusters.
If we get the good subCluster for the first time, then this test can be executed successfully, but if we get the bad subCluster for the first time, then we will retry 3 times, because the judgment of
FederationClientInterceptor#submitApplicationis wrong, As a result, the good subcluster cannot be updated to the stateStore. So We get the wrong subcluster.Q2: How can we verify that this part of the code is accurate after modification.
For unit testing, I redesigned a new RouterPolicy#TestSequentialRouterPolicy, which can return SubClusterId in order.
Example:
We have 3 subClusters, 1 goodSubCluster and 2 badSubClusters.
The scId of goodSubCluster is 0, and the scId of badSubCluster is 1 and 2. We hope Return to badSubCluster first.
We want to return the badSubCluster first, we can sort the subClusterId in reverse order, so we can get [2, 1, 0], which is [bad2,bad1,good]
Test case 1:
We set the number of retries to 1, and then
submitApplicationwill be executed 2 times.The 1st ,
bad2will be selected, The 2nd,bad1will be selected.If we query the StateStore we should get
bad1.Test case 2:
We set the number of retries to 2, and then
submitApplicationwill be executed 3 times.The 1st ,
bad2will be selected, The 2nd,bad1will be selected, The 3rd good cluster will be selected.If we query the StateStore we should get
good.