Conversation

@zhaorongsheng
Contributor

What changes were proposed in this pull request?

The root cause of this issue is that ExecutorAllocationListener receives the speculated task's end event after the stage end event has been handled, which resets numRunningTasks to 0. The listener then executes numRunningTasks -= 1, so numRunningTasks becomes negative. When maxNumExecutorsNeeded() computes maxNeeded, the value may then be 0 or negative, so ExecutorAllocationManager does not request any containers and the job hangs.

This PR changes the onTaskEnd() method in ExecutorAllocationListener so that numRunningTasks is decremented only when stageIdToNumTasks contains the ended task's stageId.
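The guarded decrement can be illustrated with a minimal, self-contained Scala sketch (hypothetical class and field names modeled on ExecutorAllocationListener; this is not the actual diff):

```scala
import scala.collection.mutable

// Minimal model of the listener's task accounting (hypothetical names).
class TaskAccounting {
  val stageIdToNumTasks = mutable.Map[Int, Int]()
  var numRunningTasks = 0

  def onStageSubmitted(stageId: Int, numTasks: Int): Unit =
    stageIdToNumTasks(stageId) = numTasks

  def onStageCompleted(stageId: Int): Unit = {
    stageIdToNumTasks -= stageId
    // When no stages remain, the listener resets the counter to 0.
    if (stageIdToNumTasks.isEmpty) numRunningTasks = 0
  }

  def onTaskStart(stageId: Int): Unit = numRunningTasks += 1

  // The fix: only decrement while the stage is still tracked, so a speculated
  // task that ends after its stage completed cannot drive the counter negative.
  def onTaskEnd(stageId: Int): Unit =
    if (stageIdToNumTasks.contains(stageId)) numRunningTasks -= 1
}
```

With this guard, replaying a stage end followed by the late speculated-task end leaves numRunningTasks at 0 instead of -1, so maxNumExecutorsNeeded() cannot go non-positive for this reason.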

How was this patch tested?

This patch is tested by the new test("SPARK-18981...") in ExecutorAllocationManagerSuite.scala.
It creates two TaskInfos, one of which is a speculated task. After the stage end event, the speculated task's end event is posted to the listener.

Contributor

@mridulm mridulm left a comment


While I don't think the change itself is incorrect: did you see
'No stages are running, but numRunningTasks != 0' in the logs?

val taskInfo = createTaskInfo(1, 1, "executor-1")
val speculatedTaskInfo = createTaskInfo(2, 1, "executor-1")
sc.listenerBus.postToAll(SparkListenerTaskStart(0, 0, taskInfo))
assert(maxNumExecutorsNeeded(manager) === 1)
Contributor


This test looks wrong - taskIndex is higher than numTasks. It would be better for the test to:

  • Launch a stage with 1 task.
  • Launch a normal task and 1 speculative task - with the same taskIndex but different taskIds.
  • Finish the normal task.
  • Ensure the stage is completed.
  • Now finish the speculative task and check that the bug is not reproduced (it should be reproducible without this fix).
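The steps above can be sketched as a self-contained Scala snippet (hypothetical stand-ins, not the real ExecutorAllocationManagerSuite helpers; shown with the guarded decrement applied):

```scala
import scala.collection.mutable

// Both task attempts share taskIndex 0 but have different taskIds.
final case class TaskInfo(taskId: Long, taskIndex: Int, speculative: Boolean)

object SpeculationRepro {
  private val runningStages = mutable.Set[Int]()
  var numRunningTasks = 0

  def stageSubmitted(stageId: Int): Unit = runningStages += stageId
  def taskStart(stageId: Int, info: TaskInfo): Unit = numRunningTasks += 1
  def stageCompleted(stageId: Int): Unit = {
    runningStages -= stageId
    if (runningStages.isEmpty) numRunningTasks = 0
  }
  def taskEnd(stageId: Int, info: TaskInfo): Unit =
    if (runningStages.contains(stageId)) numRunningTasks -= 1 // guarded decrement

  def main(args: Array[String]): Unit = {
    val normal      = TaskInfo(taskId = 1, taskIndex = 0, speculative = false)
    val speculative = TaskInfo(taskId = 2, taskIndex = 0, speculative = true)
    stageSubmitted(0)          // launch stage with 1 task
    taskStart(0, normal)
    taskStart(0, speculative)  // same taskIndex, different taskId
    taskEnd(0, normal)         // finish the normal task
    stageCompleted(0)          // stage is completed; counter reset to 0
    taskEnd(0, speculative)    // late speculated-task end event
    // Without the guard this would be -1, reproducing the bug.
    assert(numRunningTasks == 0)
  }
}
```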

Contributor Author


Yes, the warning 'No stages are running, but numRunningTasks != 0' is printed, and at that point numRunningTasks is reset to 0. But after that the speculated task's end event arrives and numRunningTasks is decremented to -1.
The tests are wrong; I will fix them.

@zhaorongsheng
Contributor Author

Hi @mridulm. I have modified the tests. Please check them.
Thanks~

@zhaorongsheng
Contributor Author

Jenkins, retest this please

@mridulm
Contributor

mridulm commented Dec 24, 2016

Does it fail in master without the fix ?

@zhaorongsheng
Contributor Author

Yes, I have checked it.

@zhaorongsheng
Contributor Author

@mridulm Please check it. Thanks~

@zhaorongsheng
Contributor Author

zhaorongsheng commented Dec 28, 2016

Hi, can anyone review this PR?
Thanks

@zsxwing
Member

zsxwing commented Dec 29, 2016

Can we just avoid resetting numRunningTasks to 0? I think it should include speculative tasks, and we can add a comment about it.

@zhaorongsheng
Contributor Author

@zsxwing I think that may cause other problems.
For example, if we get an ExecutorLostFailure while a speculated task is running on that executor, numRunningTasks will never reach zero.
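The concern can be illustrated with a small standalone sketch (hypothetical names, not Spark code; it assumes task-end events for tasks on a lost executor are never delivered):

```scala
import scala.collection.mutable

// Without the reset-on-last-stage-completed behavior, a speculated task whose
// executor is lost leaves the counter stuck above zero forever.
object LostExecutorRepro {
  private val runningStages = mutable.Set[Int]()
  var numRunningTasks = 0

  def stageSubmitted(stageId: Int): Unit = runningStages += stageId
  def taskStart(stageId: Int): Unit = numRunningTasks += 1
  def taskEnd(stageId: Int): Unit =
    if (runningStages.contains(stageId)) numRunningTasks -= 1
  def stageCompleted(stageId: Int, reset: Boolean): Unit = {
    runningStages -= stageId
    if (reset && runningStages.isEmpty) numRunningTasks = 0
  }

  def main(args: Array[String]): Unit = {
    stageSubmitted(0)
    taskStart(0)                     // normal task
    taskStart(0)                     // speculated task on the executor that is lost
    taskEnd(0)                       // normal task finishes
    // The speculated task's end event is lost along with its executor.
    stageCompleted(0, reset = false) // no reset: counter stays stuck at 1
    assert(numRunningTasks == 1)
  }
}
```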

@zhaorongsheng
Contributor Author

@zsxwing @mridulm
Would you check this PR please?

Thanks~

@jinxing64

@zhaorongsheng
I think it's better to just not reset numRunningTasks to 0. If we get an ExecutorLostFailure, the stage should not be marked as finished.

@HyukjinKwon
Member

@zhaorongsheng, is this still active? Any opinion on the suggestion above?

@asfgit asfgit closed this in 5d2750a May 18, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
## What changes were proposed in this pull request?

This PR proposes to close PRs ...

  • inactive on review comments for more than a month
  • WIP and inactive for more than a month
  • with a Jenkins build failure and inactive for more than a month
  • suggested to be closed, with no comment against that
  • obviously inappropriate (e.g., opened against Branch 0.5)

To make sure, I left a comment on each PR about a week ago and did not receive a response from the author of any of the PRs below:

Closes apache#11129
Closes apache#12085
Closes apache#12162
Closes apache#12419
Closes apache#12420
Closes apache#12491
Closes apache#13762
Closes apache#13837
Closes apache#13851
Closes apache#13881
Closes apache#13891
Closes apache#13959
Closes apache#14091
Closes apache#14481
Closes apache#14547
Closes apache#14557
Closes apache#14686
Closes apache#15594
Closes apache#15652
Closes apache#15850
Closes apache#15914
Closes apache#15918
Closes apache#16285
Closes apache#16389
Closes apache#16652
Closes apache#16743
Closes apache#16893
Closes apache#16975
Closes apache#17001
Closes apache#17088
Closes apache#17119
Closes apache#17272
Closes apache#17971

Added:
Closes apache#17778
Closes apache#17303
Closes apache#17872

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18017 from HyukjinKwon/close-inactive-prs.