[TaskGroup] Fix unlock order, add missing detaches and add more assertions #67590

ktoso · 2023-07-28T06:03:29Z

Description: A task group resumes the "waiting task" in numerous situations. Previously the task was scheduled first and the group was unlocked afterwards -- this can lead to races between the scheduled task and the group unlock, and to unpredictable behavior. Instead, we must unlock the group and THEN schedule the waiting task, in order to avoid a potential use-after-free of the lock (while the unlock() is happening).
Risk: Medium, the change reorganizes code in order to allow us to unlock and THEN schedule the task. This forced some general refactoring in order to be able to get this pattern.
Reward: Medium, resolves very rare crashes which could occur only with just the right scheduling timing. These issues have remained undetected until recently.
Review by: @mikeash @DougGregor
Testing: CI testing, enabled all task group tests for the first time in a long time and all passing consistently on all platforms.
Radar: rdar://113331923 (test reenable rdar://113016918)
Related Radar: The following was the same issue however in a more crucial code path: rdar://113032582

Picked it over to 5.9 as well: #67819
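
For readers who want the ordering from the description spelled out, here is a minimal sketch, not the runtime code. `ToyGroup`, `setWaiter`, `lastResult` and `offer` are hypothetical names, and the "waiting task" is modelled as a plain callback. The point it illustrates is that the waiter is taken out of the group under the lock, the lock is released, and only then is the waiter resumed.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-in for a task group. None of these names come from the
// Swift runtime; the "waiting task" is modelled as a plain callback.
class ToyGroup {
  std::mutex mutex_;
  std::function<void()> waiter_;
  int lastResult_ = 0;

public:
  void setWaiter(std::function<void()> waiter) {
    std::lock_guard<std::mutex> guard(mutex_);
    waiter_ = std::move(waiter);
  }

  int lastResult() {
    std::lock_guard<std::mutex> guard(mutex_);
    return lastResult_;
  }

  // A child finished: record its result, then wake the waiter (if any).
  void offer(int result) {
    std::function<void()> toResume;
    {
      std::lock_guard<std::mutex> guard(mutex_);
      lastResult_ = result;
      toResume = std::move(waiter_); // take the waiter while still locked
      waiter_ = nullptr;             // a later offer must not see it again
    } // the lock is released here, BEFORE the waiter runs

    // From here on only the local copy is used, never the group's own state,
    // so the waiter is free to lock the group again.
    if (toResume)
      toResume();
  }
};
```

In this sketch the waiter calls `lastResult()`, which takes the same non-recursive mutex; resuming it while `offer` still held the lock would deadlock, which is a tamer cousin of the race described above.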

ktoso · 2023-07-28T06:04:41Z

stdlib/public/Concurrency/TaskGroup.cpp

  assert(this->isEmpty() && "Attempted to destroy non-empty task group!");
+  // Double check by inspecting the group record, it should contain no children
+  assert(getTaskRecord()->getFirstChild() == nullptr && "Task group record still has child task!");


Additional assertions: not only must the status claim we're clean and down to 0 tasks, the group record must confirm the same.
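
A rough illustration of why checking both places is useful, under the assumption that the group keeps a pending counter in its status and a separate child list on its record. Everything here (`ToyGroup`, `completeChildWithoutDetach`, `safeToDestroy`) is hypothetical and only mimics that shape.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Hypothetical model: the group tracks children twice, once as a counter in
// its status and once as a list hanging off its record.
class ToyGroup {
  std::size_t pendingCount_ = 0;  // what the "status" claims
  std::list<int> recordChildren_; // what the "record" holds

public:
  void addChild(int id) {
    ++pendingCount_;
    recordChildren_.push_back(id);
  }

  // Correct completion path: update the counter AND detach from the record.
  void completeChild(int id) {
    --pendingCount_;
    recordChildren_.remove(id);
  }

  // Buggy completion path: updates the counter but forgets the detach.
  void completeChildWithoutDetach(int) { --pendingCount_; }

  bool isEmpty() const { return pendingCount_ == 0; }
  bool recordHasNoChildren() const { return recordChildren_.empty(); }

  // The destroy-time check, returned as a bool so both halves can be observed.
  bool safeToDestroy() const { return isEmpty() && recordHasNoChildren(); }
};
```

With only the `isEmpty()` check, the buggy path goes unnoticed; the second check is what flags the stale child record.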

ktoso · 2023-07-28T06:04:54Z

@swift-ci please test

kavon · 2023-07-31T21:21:49Z

Hm, CI failed on that new assertion in Swift(macosx-x86_64) :: Concurrency/Runtime/async_taskgroup_throw_rethrow.swift

Assertion failed: (getTaskRecord()->getFirstChild() == nullptr && "Task group record still has child task!"), function destroy, file TaskGroup.cpp, line 1022.

Thread 1 crashed:

 0                  0x00007ff809e4900e __pthread_kill + 10 in libsystem_kernel.dylib
 1 [ra]             0x000000010f4df7b3 (anonymous namespace)::DiscardingTaskGroup::destroy() (.cold.1) + 35 in libswift_Concurrency.dylib
 2 [ra]             0x000000010f4d9e48 (anonymous namespace)::DiscardingTaskGroup::destroy() + 88 in libswift_Concurrency.dylib
 3 [ra]             0x000000010f4a5322 (7) suspend resume partial function for withThrowingDiscardingTaskGroup<A>(returning:body:) + 82 in libswift_Concurrency.dylib
 4 [async]          0x000000010f36a230 (2) await resume partial function for test_discardingTaskGroup_automaticallyRethrows_first_withThrowingBodyFirst() in a.out
 5 [async]          0x000000010f36cb50 (6) await resume partial function for static Main.main() in a.out
 6 [async] [system] 0x000000010f36cd70 (1) await resume partial function for static Main.$main() in a.out
 7 [async] [system] 0x000000010f36ceb0 async_MainTQ0_ in a.out
 8 [async] [thunk]  0x000000010f36cfe0 (1) await resume partial function for thunk for @escaping @convention(thin) @async () -> () in a.out
 9 [async] [thunk]  0x000000010f36d0f0 (1) await resume partial function for partial apply for thunk for @escaping @convention(thin) @async () -> () in a.out
10 [async] [system] 0x000000010f4d5b70 completeTaskWithClosure(swift::AsyncContext*, swift::SwiftError*) in libswift_Concurrency.dylib

kavon · 2023-07-31T21:23:21Z

filecheck input for posterity in case the run gets GC'd

Input was:
<<<<<<
           1: ==== test_taskGroup_throws_rethrows() ------ 
           2: next: 1 
           3: next: 2 
           4: error caught and rethrown in group: Boom(id: "main/async_taskgroup_throw_rethrow.swift:34") 
           5: rethrown: Boom(id: "main/async_taskgroup_throw_rethrow.swift:34") 
           6: ==== test_taskGroup_noThrow_ifNotAwaitedThrowingTask() ------ 
           7: Expected no error to be thrown, got: 1 
           8: ==== test_taskGroup_throw_rethrows_waitForAll() ------ 
           9: waitAll rethrown: CancellationError() 
          10: isEmpty: true 
          11: rethrown: CancellationError() 
          12: ==== test_discardingTaskGroup_automaticallyRethrows() ------ 
          13: rethrown: Boom(id: "main/async_taskgroup_throw_rethrow.swift:106") 
          14: ==== test_discardingTaskGroup_automaticallyRethrowsOnlyFirst() ------ 
          15: Throwing: Boom(id: "first, isCancelled:false") 
          16: Awoken, throwing: CancellationError() 
          17: rethrown: Boom(id: "first, isCancelled:false") 
          18: ==== test_discardingTaskGroup_automaticallyRethrows_first_withThrowingBodyFirst() ------ 
label:184                                                                                    X~~~~~~~~~ error: no match found
          19: Throwing: Boom(id: "body, first, isCancelled:false") 
label:184     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          20: Throwing: Boom(id: "task, second, isCancelled:true") 
label:184     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>>

ktoso · 2023-07-31T22:58:40Z

Yeah, not good -- seems we uncovered an issue while at it. Will investigate deeper

ktoso · 2023-08-08T06:15:43Z

test/Concurrency/Runtime/async_taskgroup_discarding_dontLeak.swift

+}
+
+@main struct Main {
+  static func main() async {


These are the same tests, just moved into methods for readability.

ktoso · 2023-08-08T06:16:45Z

stdlib/public/Concurrency/TaskGroup.cpp

-    return;
+    // Must unlock before we resume the waiting task
+    unlock();
+    return resumeWaitingTask(completedTask, assumed, hadErrorResult);


Same issue as #67700 fixed, but inside the offer impl

ktoso · 2023-08-08T06:17:09Z

stdlib/public/Concurrency/TaskGroup.cpp

@@ -1150,7 +1158,7 @@ void AccumulatingTaskGroup::offer(AsyncTask *completedTask, AsyncContext *contex
  // This is wasteful, and the task completion function should be fixed to
  // transfer ownership of a retain into this function, in which case we
  // will need to release in the other path.
-  lock(); // TODO: remove fragment lock, and use status for synchronization
+  lock();


By now it's not really helpful to have this TODO on every line where the lock is used.

ktoso · 2023-08-08T06:17:27Z

stdlib/public/Concurrency/TaskGroup.cpp

+            // the group before we've given up the lock.
+            _swift_taskGroup_detachChild(asAbstract(this), completedTask);
+            unlock();
+            return resumeWaitingTaskWithError(readyErrorItem.getRawError(this), assumed,


Same issue as #67700 fixed, but inside the offer impl

ktoso · 2023-08-08T06:20:50Z

This and related work combined resulted in two important fixes here:

[important] Similar to 🍒[5.9][TaskGroup] Reenable test and fix memory issue #67700, we must not resume the parent waiting task while holding the lock; this can lead to all kinds of weirdness.
[minor] Always detach the child task -- skipping the detach would not cause "actual" leaks, because we always destroy the child tasks and records, but it could lead to much confusion in the future when a group is inspected by tools and has child records whose tasks are already dead.

ktoso · 2023-08-08T06:21:17Z

@swift-ci please test

ktoso · 2023-08-08T08:35:46Z

Had to do a larger revamp here to get the locking the way we need it -- don't review yet, need to verify this more.

ktoso · 2023-08-08T08:36:17Z

@swift-ci please test

ktoso · 2023-08-08T10:39:43Z

@swift-ci please test

ktoso · 2023-08-09T03:44:03Z

stdlib/public/Concurrency/TaskGroup.cpp

+         "rather than return it for scheduling.")
+#endif
+  if (auto waitingTask = prepared.waitingTask) {
+    // TODO: allow the caller to suggest an executor


This is an old TODO that we carry along as this code moves into the helper func.

ktoso · 2023-08-09T03:44:52Z

stdlib/public/Concurrency/TaskGroup.cpp

@@ -831,7 +875,8 @@ class DiscardingTaskGroup: public TaskGroupBase {

 private:
  /// Resume waiting task with specified error
-  void resumeWaitingTaskWithError(SwiftError *error,
+  PreparedWaitingTask prepareWaitingTaskWithError(AsyncTask* waitingTask,


The important change here is that we PREPARE, but don't guarantee that we'll run -- it depends what model we're running in, task-to-thread or not.

If the returned prepared task is not null, we'll schedule it, but only AFTER unlocking the group.
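
The prepare-then-schedule split described above could look roughly like the following sketch. All types and functions here (`ToyTask`, `PreparedToyTask`, `ToyGroup`, `prepareWaitingTask`, `offer`, the `taskToThread` flag and the `runQueue` vector) are hypothetical stand-ins, and the behaviour in task-to-thread mode is an assumption made for illustration, not a description of the runtime.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// All names below are hypothetical; they only mirror the shape of the change.
struct ToyTask {
  int result = 0;
  bool hasResult = false;
};

// A null waitingTask means "nothing for the caller to schedule".
struct PreparedToyTask {
  ToyTask *waitingTask;
};

struct ToyGroup {
  std::mutex mutex;
  ToyTask *waiting = nullptr; // guarded by mutex
  bool taskToThread = false;  // assumed flag for the execution model
};

// Called with the group lock held. Fills in the result, but never runs or
// enqueues anything itself.
PreparedToyTask prepareWaitingTask(ToyTask *waiting, int result,
                                   bool taskToThread) {
  waiting->result = result;
  waiting->hasResult = true;
  if (taskToThread)
    return PreparedToyTask{nullptr}; // assumption: no scheduling in this model
  return PreparedToyTask{waiting};
}

void offer(ToyGroup &group, int result, std::vector<ToyTask *> &runQueue) {
  PreparedToyTask prepared{nullptr};
  {
    std::lock_guard<std::mutex> guard(group.mutex);
    if (group.waiting) {
      prepared = prepareWaitingTask(group.waiting, result, group.taskToThread);
      group.waiting = nullptr;
    }
  } // the group is unlocked here

  if (prepared.waitingTask)
    runQueue.push_back(prepared.waitingTask); // stand-in for scheduling
}
```

Returning a value instead of scheduling inside the helper keeps the decision about when to enqueue with the caller, which is the only place that knows when the lock has been released.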

ktoso · 2023-08-09T03:46:05Z

stdlib/public/Concurrency/TaskGroup.cpp

  assert(waitingTask && "cannot resume 'null' waiting task!");
-  SWIFT_TASK_GROUP_DEBUG_LOG(this, "resume waiting task = %p, with error = %p",
-                       waitingTask, error);
-  while (true) {


Nowadays this loop is meaningless: we enter here while holding the group lock, so the status update cannot fail.
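
To make the difference concrete, here is a small sketch contrasting a retry loop with a single update. `kWaitingBit` and both function names are invented for illustration, and the second function relies on the stated assumption that every writer of the status word holds the group lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kWaitingBit = 1; // hypothetical status flag

// Lock-free variant: another thread may change `status` between our load and
// our exchange, so the compare-exchange is retried until it wins.
std::uint64_t clearWaitingLockFree(std::atomic<std::uint64_t> &status) {
  std::uint64_t assumed = status.load(std::memory_order_relaxed);
  while (!status.compare_exchange_weak(assumed, assumed & ~kWaitingBit,
                                       std::memory_order_acq_rel,
                                       std::memory_order_relaxed)) {
    // `assumed` now holds the freshly observed value; try again.
  }
  return assumed & ~kWaitingBit;
}

// Variant for callers that already hold the group lock. ASSUMPTION: every
// writer of `status` takes that lock, so the value cannot change between the
// load and the exchange, and one attempt is enough.
std::uint64_t clearWaitingWhileLocked(std::atomic<std::uint64_t> &status) {
  std::uint64_t assumed = status.load(std::memory_order_relaxed);
  bool updated = status.compare_exchange_strong(
      assumed, assumed & ~kWaitingBit, std::memory_order_acq_rel,
      std::memory_order_relaxed);
  assert(updated && "status changed while the group lock was held");
  (void)updated; // keeps release builds free of unused-variable warnings
  return assumed & ~kWaitingBit;
}
```

If the assumption about writers does not hold, the single-attempt variant is wrong and the loop is needed again.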

ktoso · 2023-08-09T03:47:33Z

@swift-ci please smoke test

ktoso · 2023-08-09T03:48:55Z

preset=stdlib_S_standalone_minimal_macho_x86_64,build,test
@swift-ci please clean test with toolchain and preset

ktoso · 2023-08-09T04:08:52Z

preset=stdlib_S_standalone_minimal_macho_x86_64,build,test
@swift-ci please clean test with toolchain and preset

ktoso · 2023-08-09T06:16:06Z

Cool, another round of all passes with the locking issue fixed and all task group tests enabled.

Pushed a cleanup so let's give this another full run

ktoso · 2023-08-09T06:21:53Z

@swift-ci please test

al45tair

LGTM

al45tair · 2023-08-09T08:46:27Z

stdlib/public/Concurrency/TaskGroup.cpp

-        auto waitingContext =
-            static_cast<TaskFutureWaitAsyncContext *>(
-                waitingTask->ResumeContext);
+  // Run the task.


Nit: We don't actually run the task here; maybe "Prepare the task to run" would be better?

Good nit, I'll change!

mikeash · 2023-08-09T14:44:41Z

stdlib/public/Concurrency/TaskGroup.cpp

+    auto prepared = prepareWaitingTaskWithTask(
+        /*complete=*/waitingTask, /*with=*/completedTask,
+        assumed, hadErrorResult);
+    unlock(); // we MUST unlock before running the waiting task


A brief explanation of why would be good, for anyone who doesn't notice the larger explanation below.

True, we should explain why rather than just say that we "must"; will do 👍

mikeash · 2023-08-09T14:47:32Z

stdlib/public/Concurrency/TaskGroup.cpp

+    // We grab the waiting task while holding the group lock, because this
+    // allows a single task to get the waiting task and attempt to complete it.
+    // As another offer gets to run, it will have either a different waiting task, or no waiting task at all.
+     auto waitingTask = waitQueue.load(std::memory_order_acquire);


The tiniest of tiny whitespace errors.

Whoop, I'll run formatter, seems I missed it.

ktoso · 2023-08-10T12:17:54Z

Added code comments in latest commit; Thanks for review folks

ktoso · 2023-08-10T12:18:14Z

@swift-ci please smoke test and merge

ktoso requested a review from kavon as a code owner July 28, 2023 06:03

ktoso requested a review from al45tair July 28, 2023 06:04

ktoso force-pushed the wip-multi-error-group-single-leak branch from a54bd19 to 9fecd4b Compare August 8, 2023 05:02

ktoso changed the title ~~[DiscardingTaskGroup] Properly detach when LAST task is failed, and prior failure was already stored~~ [TaskGroup] Fix unlock order, add missing detaches and add more assertions Aug 8, 2023

ktoso marked this pull request as draft August 8, 2023 08:35

ktoso added 10 commits August 9, 2023 08:44

reenable async_taskgroup_discarding_dontLeak.swift

0116b09

[DiscardingTaskGroup] Properly detach when LAST task is failed, and prior failure was already stored

9d9f8cb

[TaskGroup] Must detach discarded task, THEN unlock before resume waiting

b135ecd

revamping locking scheme, test this a bunch

61d783c

stabilize println based test a bit more against timing

00f674b

re-enable tsan test: async_taskgroup_next

34f8da3

reenable async_taskgroup_next_on_pending

0cc31ca

disable debugging tricks

383c62f

unlock test: async_taskgroup_asynciterator_semantics

376a9a8

make use of unreachable

060260e

ktoso force-pushed the wip-multi-error-group-single-leak branch from 7e23308 to 060260e Compare August 8, 2023 23:48

cleanups

7d93bba

ktoso force-pushed the wip-multi-error-group-single-leak branch from 33e15ba to 7d93bba Compare August 9, 2023 03:47

cleanup for freestanding mode

8aa70dc

ktoso marked this pull request as ready for review August 9, 2023 06:16

ktoso requested a review from mikeash August 9, 2023 06:21

ktoso mentioned this pull request Aug 9, 2023

🍒[5.9][TaskGroup] Fix unlock order, add missing detaches and add more assertions #67819

Merged

ktoso added the concurrency runtime Feature: umbrella label for concurrency runtime features label Aug 9, 2023

al45tair approved these changes Aug 9, 2023

mikeash approved these changes Aug 9, 2023

Review feedback: add better code comments

1195955

swift-ci merged commit 9668f04 into swiftlang:main Aug 10, 2023

ktoso deleted the wip-multi-error-group-single-leak branch August 10, 2023 22:16

ktoso mentioned this pull request Aug 11, 2023

🍒[5.9.0][TaskGroup] Fix unlock order, add missing detaches and add more assertions #67892

Merged
