Skip to content

[TaskGroup] Fix unlock order, add missing detaches and add more assertions #67590

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Aug 10, 2023

Conversation

ktoso
Copy link
Contributor

@ktoso ktoso commented Jul 28, 2023

Description: A task group resumes the "waiting task" in numerous situations. Currently tasks were scheduled and then the group was unlocked -- this can lead to races between the scheduled task and the group unlock and unpredictable behavior. Instead, we must unlock the group and THEN schedule the waiting task on order to avoid potential use-after free of the lock (as the unlock() happens).
Risk: Medium, the change reorganizes code in order to allow us to unlock and THEN schedule the task. This forced some general refactoring in order to be able to get this pattern.
Reward Medium, resolves very rare crashes which could occur when just the right scheduling timing would happen. These issues are very rare, and have remained undetected until recently.
Review by: @mikeash @DougGregor
Testing: CI testing, enabled all task group tests for the first time in a long time and all passing consistently on all platforms.
Radar: rdar://113331923 (test reenable rdar://113016918)
Related Radar: The following was the same issue however in a more crucial code path: rdar://113032582


Picked it over to 5.9 as well: #67819

@ktoso ktoso requested a review from kavon as a code owner July 28, 2023 06:03
assert(this->isEmpty() && "Attempted to destroy non-empty task group!");
// Double check by inspecting the group record, it should contain no children
assert(getTaskRecord()->getFirstChild() == nullptr && "Task group record still has child task!");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

additional assertions that not only the status claims we're clean and down to 0 tasks, but also that the group record confirms the same.

@ktoso ktoso requested a review from al45tair July 28, 2023 06:04
@ktoso
Copy link
Contributor Author

ktoso commented Jul 28, 2023

@swift-ci please test

@kavon
Copy link
Member

kavon commented Jul 31, 2023

Hm CI failed on that new assertion in Swift(macosx-x86_64) :: Concurrency/Runtime/async_taskgroup_throw_rethrow.swift

Assertion failed: (getTaskRecord()->getFirstChild() == nullptr && "Task group record still has child task!"), function destroy, file TaskGroup.cpp, line 1022.

Thread 1 crashed:

 0                  0x00007ff809e4900e __pthread_kill + 10 in libsystem_kernel.dylib
 1 [ra]             0x000000010f4df7b3 (anonymous namespace)::DiscardingTaskGroup::destroy() (.cold.1) + 35 in libswift_Concurrency.dylib
 2 [ra]             0x000000010f4d9e48 (anonymous namespace)::DiscardingTaskGroup::destroy() + 88 in libswift_Concurrency.dylib
 3 [ra]             0x000000010f4a5322 (7) suspend resume partial function for withThrowingDiscardingTaskGroup<A>(returning:body:) + 82 in libswift_Concurrency.dylib
 4 [async]          0x000000010f36a230 (2) await resume partial function for test_discardingTaskGroup_automaticallyRethrows_first_withThrowingBodyFirst() in a.out
 5 [async]          0x000000010f36cb50 (6) await resume partial function for static Main.main() in a.out
 6 [async] [system] 0x000000010f36cd70 (1) await resume partial function for static Main.$main() in a.out
 7 [async] [system] 0x000000010f36ceb0 async_MainTQ0_ in a.out
 8 [async] [thunk]  0x000000010f36cfe0 (1) await resume partial function for thunk for @escaping @convention(thin) @async () -> () in a.out
 9 [async] [thunk]  0x000000010f36d0f0 (1) await resume partial function for partial apply for thunk for @escaping @convention(thin) @async () -> () in a.out
10 [async] [system] 0x000000010f4d5b70 completeTaskWithClosure(swift::AsyncContext*, swift::SwiftError*) in libswift_Concurrency.dylib

@kavon
Copy link
Member

kavon commented Jul 31, 2023

filecheck input for posterity in case the run gets GC'd

Input was:
<<<<<<
           1: ==== test_taskGroup_throws_rethrows() ------ 
           2: next: 1 
           3: next: 2 
           4: error caught and rethrown in group: Boom(id: "main/async_taskgroup_throw_rethrow.swift:34") 
           5: rethrown: Boom(id: "main/async_taskgroup_throw_rethrow.swift:34") 
           6: ==== test_taskGroup_noThrow_ifNotAwaitedThrowingTask() ------ 
           7: Expected no error to be thrown, got: 1 
           8: ==== test_taskGroup_throw_rethrows_waitForAll() ------ 
           9: waitAll rethrown: CancellationError() 
          10: isEmpty: true 
          11: rethrown: CancellationError() 
          12: ==== test_discardingTaskGroup_automaticallyRethrows() ------ 
          13: rethrown: Boom(id: "main/async_taskgroup_throw_rethrow.swift:106") 
          14: ==== test_discardingTaskGroup_automaticallyRethrowsOnlyFirst() ------ 
          15: Throwing: Boom(id: "first, isCancelled:false") 
          16: Awoken, throwing: CancellationError() 
          17: rethrown: Boom(id: "first, isCancelled:false") 
          18: ==== test_discardingTaskGroup_automaticallyRethrows_first_withThrowingBodyFirst() ------ 
label:184                                                                                    X~~~~~~~~~ error: no match found
          19: Throwing: Boom(id: "body, first, isCancelled:false") 
label:184     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          20: Throwing: Boom(id: "task, second, isCancelled:true") 
label:184     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>>

@ktoso
Copy link
Contributor Author

ktoso commented Jul 31, 2023

Yeah, not good -- seems we uncovered an issue while at it. Will investigate deeper

@ktoso ktoso force-pushed the wip-multi-error-group-single-leak branch from a54bd19 to 9fecd4b Compare August 8, 2023 05:02
}

@main struct Main {
static func main() async {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same tests, just moved into methods for readability

return;
// Must unlock before we resume the waiting task
unlock();
return resumeWaitingTask(completedTask, assumed, hadErrorResult);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as #67700 fixed, but inside the offer impl

@@ -1150,7 +1158,7 @@ void AccumulatingTaskGroup::offer(AsyncTask *completedTask, AsyncContext *contex
// This is wasteful, and the task completion function should be fixed to
// transfer ownership of a retain into this function, in which case we
// will need to release in the other path.
lock(); // TODO: remove fragment lock, and use status for synchronization
lock();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By now it's not really helpful to have this TODO on every line where the lock is used.

// the group before we've given up the lock.
_swift_taskGroup_detachChild(asAbstract(this), completedTask);
unlock();
return resumeWaitingTaskWithError(readyErrorItem.getRawError(this), assumed,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as #67700 fixed, but inside the offer impl

@ktoso ktoso changed the title [DiscardingTaskGroup] Properly detach when LAST task is failed, and prior failure was already stored [TaskGroup] Fix unlock order, add missing detaches and add more assertions Aug 8, 2023
@ktoso
Copy link
Contributor Author

ktoso commented Aug 8, 2023

This and related work combined resulted in two important fixes here:

  • [important] Similar to 🍒[5.9][TaskGroup] Reenable test and fix memory issue  #67700 we must not resume the parent waiting task while holding the lock; this can lead to all kinds of weirdness
  • [minor] always detaching child task -- this would not cause "actual" leaks because we always destroy the child tasks and records but could lead to much confusion in the future when a group would be inspected by tools and has child records but the tasks are dead already

@ktoso
Copy link
Contributor Author

ktoso commented Aug 8, 2023

@swift-ci please test

@ktoso
Copy link
Contributor Author

ktoso commented Aug 8, 2023

Had to do a larger revamp here to get the locking the way we need it -- don't review yet, need to verify this more.

@ktoso ktoso marked this pull request as draft August 8, 2023 08:35
@ktoso
Copy link
Contributor Author

ktoso commented Aug 8, 2023

@swift-ci please test

1 similar comment
@ktoso
Copy link
Contributor Author

ktoso commented Aug 8, 2023

@swift-ci please test

@ktoso ktoso force-pushed the wip-multi-error-group-single-leak branch from 7e23308 to 060260e Compare August 8, 2023 23:48
"rather than return it for scheduling.")
#endif
if (auto waitingTask = prepared.waitingTask) {
// TODO: allow the caller to suggest an executor
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an old TODO that we carry around with moving this code into the helper func

@@ -831,7 +875,8 @@ class DiscardingTaskGroup: public TaskGroupBase {

private:
/// Resume waiting task with specified error
void resumeWaitingTaskWithError(SwiftError *error,
PreparedWaitingTask prepareWaitingTaskWithError(AsyncTask* waitingTask,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The important change here is that we PREPARE, but don't guarantee that we'll run -- it depends what model we're running in, task-to-thread or not.

If the returned prepared task is not null, we'll schedule it, but only AFTER unlocking the group

assert(waitingTask && "cannot resume 'null' waiting task!");
SWIFT_TASK_GROUP_DEBUG_LOG(this, "resume waiting task = %p, with error = %p",
waitingTask, error);
while (true) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nowadays this loop is meaningless, because we enter here while we hold the group lock, so we cannot fail the status update

@ktoso ktoso force-pushed the wip-multi-error-group-single-leak branch from 33e15ba to 7d93bba Compare August 9, 2023 03:47
@ktoso
Copy link
Contributor Author

ktoso commented Aug 9, 2023

@swift-ci please smoke test

@ktoso
Copy link
Contributor Author

ktoso commented Aug 9, 2023

preset=stdlib_S_standalone_minimal_macho_x86_64,build,test
@swift-ci please clean test with toolchain and preset

@ktoso
Copy link
Contributor Author

ktoso commented Aug 9, 2023

preset=stdlib_S_standalone_minimal_macho_x86_64,build,test
@swift-ci please clean test with toolchain and preset

@ktoso
Copy link
Contributor Author

ktoso commented Aug 9, 2023

Cool, another round of all passes with the locking issue fixed and all task group tests enabled.

Pushed a cleanup so let's give this another full run

@ktoso ktoso marked this pull request as ready for review August 9, 2023 06:16
@ktoso ktoso requested a review from mikeash August 9, 2023 06:21
@ktoso
Copy link
Contributor Author

ktoso commented Aug 9, 2023

@swift-ci please test

@ktoso ktoso added the concurrency runtime Feature: umbrella label for concurrency runtime features label Aug 9, 2023
Copy link
Contributor

@al45tair al45tair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

auto waitingContext =
static_cast<TaskFutureWaitAsyncContext *>(
waitingTask->ResumeContext);
// Run the task.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We don't actually run the task here; maybe "Prepare the task to run" would be better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good nit, I'll change!

auto prepared = prepareWaitingTaskWithTask(
/*complete=*/waitingTask, /*with=*/completedTask,
assumed, hadErrorResult);
unlock(); // we MUST unlock before running the waiting task
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A brief explanation of why would be good, for anyone who doesn't notice the larger explanation below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, should explain why and not that we "must" will do 👍

// We grab the waiting task while holding the group lock, because this
// allows a single task to get the waiting task and attempt to complete it.
// As another offer gets to run, it will have either a different waiting task, or no waiting task at all.
auto waitingTask = waitQueue.load(std::memory_order_acquire);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tiniest of tiny whitespace errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoop, I'll run formatter, seems I missed it.

@ktoso
Copy link
Contributor Author

ktoso commented Aug 10, 2023

Added code comments in latest commit; Thanks for review folks

@ktoso
Copy link
Contributor Author

ktoso commented Aug 10, 2023

@swift-ci please smoke test and merge

@swift-ci swift-ci merged commit 9668f04 into swiftlang:main Aug 10, 2023
@ktoso ktoso deleted the wip-multi-error-group-single-leak branch August 10, 2023 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
concurrency runtime Feature: umbrella label for concurrency runtime features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants