[TaskGroup] Fix unlock order, add missing detaches and add more assertions #67590
assert(this->isEmpty() && "Attempted to destroy non-empty task group!"); | ||
// Double check by inspecting the group record, it should contain no children | ||
assert(getTaskRecord()->getFirstChild() == nullptr && "Task group record still has child task!"); |
Additional assertions: not only must the status claim we're clean and down to 0 tasks, the group record must confirm the same.
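To illustrate why checking both sources is useful, here is a minimal sketch. `ToyGroup` and `ChildTask` are hypothetical stand-ins, not the runtime's real types: the counter plays the role of the group status and the linked list plays the role of the group record. A completion that forgets to detach leaves the two out of sync, which only the second check catches.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toy types for illustration only; not the Swift runtime API.
struct ChildTask {
  ChildTask *next = nullptr;
};

struct ToyGroup {
  std::size_t pendingCount = 0;     // what the "status" claims
  ChildTask *firstChild = nullptr;  // what the "record" actually holds

  // Attach a child: update both the record and the status.
  void attach(ChildTask *task) {
    task->next = firstChild;
    firstChild = task;
    ++pendingCount;
  }

  // Status-only completion: models a code path that forgot to detach.
  void completeWithoutDetach() { --pendingCount; }

  // Remove the child from the record, if present.
  void detach(ChildTask *task) {
    for (ChildTask **link = &firstChild; *link; link = &(*link)->next) {
      if (*link == task) {
        *link = task->next;
        task->next = nullptr;
        return;
      }
    }
  }

  bool isEmpty() const { return pendingCount == 0; }
  bool recordIsEmpty() const { return firstChild == nullptr; }

  // Destruction is only safe when status and record agree.
  bool safeToDestroy() const { return isEmpty() && recordIsEmpty(); }
};
```

With one child attached and then completed via `completeWithoutDetach()`, `isEmpty()` is true while `safeToDestroy()` is still false until `detach()` runs.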
@swift-ci please test |
Hm CI failed on that new assertion in
|
filecheck input for posterity in case the run gets GC'd
|
Yeah, not good -- seems we uncovered an issue while at it. Will investigate deeper |
a54bd19
to
9fecd4b
Compare
} | ||
|
||
@main struct Main { | ||
static func main() async { |
These are the same tests, just moved into methods for readability.
return; | ||
// Must unlock before we resume the waiting task | ||
unlock(); | ||
return resumeWaitingTask(completedTask, assumed, hadErrorResult); |
Same issue as the one fixed in #67700, but inside the offer implementation.
@@ -1150,7 +1158,7 @@ void AccumulatingTaskGroup::offer(AsyncTask *completedTask, AsyncContext *contex | |||
// This is wasteful, and the task completion function should be fixed to | |||
// transfer ownership of a retain into this function, in which case we | |||
// will need to release in the other path. | |||
lock(); // TODO: remove fragment lock, and use status for synchronization | |||
lock(); |
By now it's not really helpful to have this TODO on every line where the lock is used.
// the group before we've given up the lock. | ||
_swift_taskGroup_detachChild(asAbstract(this), completedTask); | ||
unlock(); | ||
return resumeWaitingTaskWithError(readyErrorItem.getRawError(this), assumed, |
Same issue as the one fixed in #67700, but inside the offer implementation.
This and related work combined resulted in two important fixes here:
|
@swift-ci please test |
Had to do a larger revamp here to get the locking the way we need it -- don't review yet, need to verify this more. |
@swift-ci please test |
@swift-ci please test |
…rior failure was already stored
7e23308
to
060260e
Compare
"rather than return it for scheduling.") | ||
#endif | ||
if (auto waitingTask = prepared.waitingTask) { | ||
// TODO: allow the caller to suggest an executor |
This is an old TODO that we carry along as we move this code into the helper function.
@@ -831,7 +875,8 @@ class DiscardingTaskGroup: public TaskGroupBase { | |||
|
|||
private: | |||
/// Resume waiting task with specified error | |||
void resumeWaitingTaskWithError(SwiftError *error, | |||
PreparedWaitingTask prepareWaitingTaskWithError(AsyncTask* waitingTask, |
The important change here is that we PREPARE, but don't guarantee that we'll run -- that depends on which model we're running in, task-to-thread or not.
If the returned prepared task is not null, we'll schedule it, but only AFTER unlocking the group.
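A rough sketch of the prepare, unlock, then schedule shape described here. Everything in it is a hypothetical toy (`ToyTaskGroup`, `ToyPreparedWaitingTask`, `lockHeldForDemo`); it does not reproduce the runtime's actual types or scheduling, and it "schedules" by simply calling the waiter inline.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical toy types; names loosely echo the PR but are not the runtime API.
struct ToyPreparedWaitingTask {
  std::function<void()> resume;  // empty when there is nothing to schedule
};

class ToyTaskGroup {
  std::mutex mutex_;
  std::function<void(int)> waiter_;
  bool lockHeld_ = false;  // demo-only flag so callers can observe the lock state

  // Must be called with mutex_ held. Packages the work but runs nothing.
  ToyPreparedWaitingTask prepareWaitingTask(int result) {
    ToyPreparedWaitingTask prepared;
    if (waiter_) {
      auto waiter = std::move(waiter_);
      waiter_ = nullptr;  // the waiter is claimed exactly once
      prepared.resume = [waiter, result] { waiter(result); };
    }
    return prepared;
  }

public:
  void setWaiter(std::function<void(int)> waiter) {
    std::lock_guard<std::mutex> guard(mutex_);
    waiter_ = std::move(waiter);
  }

  bool lockHeldForDemo() const { return lockHeld_; }

  // Returns true if a waiter was resumed.
  bool offer(int result) {
    ToyPreparedWaitingTask prepared;
    {
      std::lock_guard<std::mutex> guard(mutex_);
      lockHeld_ = true;
      prepared = prepareWaitingTask(result);
      lockHeld_ = false;
    }  // the group lock is released here, BEFORE the waiter runs
    if (!prepared.resume) return false;
    prepared.resume();  // the waiter no longer races with our unlock
    return true;
  }
};
```

The point of the split is that once the waiter runs it may tear down the group, so nothing in `offer` touches the lock after `resume()` is invoked.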
assert(waitingTask && "cannot resume 'null' waiting task!"); | ||
SWIFT_TASK_GROUP_DEBUG_LOG(this, "resume waiting task = %p, with error = %p", | ||
waitingTask, error); | ||
while (true) { |
Nowadays this loop is meaningless, because we enter here while we hold the group lock, so we cannot fail the status update
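As a sketch of that reasoning, assuming every writer of the status takes the same lock (an assumption of this toy, not a statement about the whole runtime): a lock-free update needs a retry loop because another thread can change the value between the load and the exchange, whereas under the lock a single strong compare-exchange from the freshly loaded value cannot lose that race. `ToyStatus` and its members are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical toy status word; not the runtime's GroupStatus.
struct ToyStatus {
  std::mutex mutex;
  std::atomic<std::uint64_t> pending{0};

  // Lock-free style: other writers may interleave, so retry until the CAS wins.
  std::uint64_t addPendingLockFree() {
    std::uint64_t assumed = pending.load(std::memory_order_relaxed);
    while (!pending.compare_exchange_weak(assumed, assumed + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
      // 'assumed' was refreshed with the current value; try again.
    }
    return assumed + 1;
  }

  // Locked style: if all writers hold 'mutex', nobody can change 'pending'
  // between the load and the exchange, so one strong CAS is enough.
  bool completeOneLocked() {
    std::lock_guard<std::mutex> guard(mutex);
    std::uint64_t assumed = pending.load(std::memory_order_relaxed);
    return pending.compare_exchange_strong(assumed, assumed - 1,
                                           std::memory_order_acq_rel,
                                           std::memory_order_relaxed);
  }
};
```

If some writer bypasses the lock, the single-attempt version is no longer safe and the loop would be needed again.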
33e15ba
to
7d93bba
Compare
@swift-ci please smoke test |
preset=stdlib_S_standalone_minimal_macho_x86_64,build,test |
preset=stdlib_S_standalone_minimal_macho_x86_64,build,test |
Cool, another round of all passes with the locking issue fixed and all task group tests enabled.
Pushed a cleanup so let's give this another full run |
@swift-ci please test |
LGTM
auto waitingContext = | ||
static_cast<TaskFutureWaitAsyncContext *>( | ||
waitingTask->ResumeContext); | ||
// Run the task. |
Nit: We don't actually run the task here; maybe "Prepare the task to run" would be better?
Good nit, I'll change!
auto prepared = prepareWaitingTaskWithTask( | ||
/*complete=*/waitingTask, /*with=*/completedTask, | ||
assumed, hadErrorResult); | ||
unlock(); // we MUST unlock before running the waiting task |
A brief explanation of why would be good, for anyone who doesn't notice the larger explanation below.
True, it should explain why, not just that we "must". Will do 👍
// We grab the waiting task while holding the group lock, because this | ||
// allows a single task to get the waiting task and attempt to complete it. | ||
// As another offer gets to run, it will have either a different waiting task, or no waiting task at all. | ||
auto waitingTask = waitQueue.load(std::memory_order_acquire); |
The tiniest of tiny whitespace errors.
Whoops, I'll run the formatter, seems I missed it.
Added code comments in latest commit; Thanks for review folks |
@swift-ci please smoke test and merge |
Description: A task group resumes the "waiting task" in numerous situations. Previously, the waiting task was scheduled first and the group was unlocked afterwards. This can lead to a race between the scheduled task and the group unlock, with unpredictable behavior. Instead, we must unlock the group and THEN schedule the waiting task, in order to avoid a potential use-after-free of the lock while the unlock() is happening.
Risk: Medium. The change reorganizes code so that we can unlock and THEN schedule the task. This forced some general refactoring to make that pattern possible.
Reward: Medium. Resolves very rare crashes that could occur only with just the right scheduling timing. These issues are very rare and remained undetected until recently.
Review by: @mikeash @DougGregor
Testing: CI testing, enabled all task group tests for the first time in a long time and all passing consistently on all platforms.
Radar: rdar://113331923 (test reenable rdar://113016918)
Related Radar: The following was the same issue, but in a more crucial code path: rdar://113032582
Picked it over to 5.9 as well: #67819