[bug] clang miscompiles coroutine awaiter, moving write across a critical section #56301
By the way, I should mention that I discovered this because TSan reports it as a data race. And I think it's correct: clang has introduced a data race by putting a write after the call to `Register`.
I'm not sure if this is related to a known bug where coroutines don't cache TLS variables correctly. @jacobsa Could you build clang from source? If yes, could you test it again after applying https://reviews.llvm.org/D125291 and https://reviews.llvm.org/D127383?
@ChuanqiXu9: just saw your comment after writing this. I'll try that shortly, but it may take me some time because I've never done it before. In the meantime, here is some information about the IR; can you tell whether it's related based on that? Here is an IR dump after each optimization pass. You can see that in the version on line 3669 we still have the correct control flow:

```llvm
store i8 1, i8* %44, align 8, !dbg !913
%45 = getelementptr inbounds %struct.Awaiter, %struct.Awaiter* %5, i64 0, i32 0, !dbg !914
%46 = load %struct.SomeAwaitable*, %struct.SomeAwaitable** %45, align 8, !dbg !914
%47 = icmp eq %struct.SomeAwaitable* %46, null, !dbg !915, !nosanitize !82
br i1 %47, label %48, label %49, !dbg !915, !nosanitize !82

48:                                               ; preds = %37
  call void @llvm.ubsantrap(i8 22) #14, !nosanitize !82
  unreachable, !nosanitize !82

49:                                               ; preds = %37
  %50 = call i8* @_ZN13SomeAwaitable8RegisterENSt3__u16coroutine_handleIvEE(%struct.SomeAwaitable* noundef nonnull align 1 dereferenceable(1) %46, i8* %43) #2, !dbg !915
  call void @llvm.dbg.value(metadata i8* %50, metadata !909, metadata !DIExpression()) #2, !dbg !910
  call void @llvm.dbg.value(metadata %"struct.std::__u::coroutine_handle"* undef, metadata !916, metadata !DIExpression()) #2, !dbg !920
  %51 = icmp eq i8* %50, null, !dbg !923
  br i1 %51, label %52, label %53, !dbg !924

52:                                               ; preds = %49
  store i8 0, i8* %44, align 8, !dbg !925
  br label %53, !dbg !927
```

However, in a later version the store is gone, with the value plumbed through phi nodes instead:

```llvm
25:                                               ; preds = %21
  %26 = call i8* @_ZN13SomeAwaitable8RegisterENSt3__u16coroutine_handleIvEE(%struct.SomeAwaitable* noundef nonnull align 1 dereferenceable(1) %19, i8* %11) #2, !dbg !910
  call void @llvm.dbg.value(metadata i8* %26, metadata !907, metadata !DIExpression()) #2, !dbg !908
  call void @llvm.dbg.value(metadata %"struct.std::__u::coroutine_handle"* undef, metadata !911, metadata !DIExpression()) #2, !dbg !915
  %27 = icmp eq i8* %26, null, !dbg !918
  br i1 %27, label %28, label %29, !dbg !919

28:                                               ; preds = %25
  br label %29, !dbg !920

29:                                               ; preds = %25, %28
  %30 = phi i8 [ 0, %28 ], [ 1, %25 ], !dbg !908
  %31 = phi i8* [ %11, %28 ], [ %26, %25 ], !dbg !908
  call void @llvm.dbg.value(metadata %"struct.std::__u::coroutine_handle"* undef, metadata !922, metadata !DIExpression()), !dbg !925
  %32 = call i8* @llvm.coro.subfn.addr(i8* %31, i8 0)
  %33 = bitcast i8* %32 to void (i8*)*
  call fastcc void %33(i8* %31) #2, !dbg !890
  %34 = call i8 @llvm.coro.suspend(token %22, i1 false), !dbg !890
  switch i8 %34, label %52 [
    i8 0, label %35
    i8 1, label %46
  ], !dbg !890

35:                                               ; preds = %29, %15
  %36 = phi i8 [ %20, %15 ], [ %30, %29 ], !dbg !927
  call void @llvm.dbg.value(metadata %struct.Awaiter* undef, metadata !928, metadata !DIExpression()) #2, !dbg !931
  %37 = icmp eq i8 %36, 0, !dbg !933
  br i1 %37, label %38, label %39, !dbg !935

38:                                               ; preds = %35
  call void @_Z12DidntSuspendv() #2, !dbg !936
  br label %39, !dbg !938
```

The lack of a store is preserved up through the version on line 3068:

```llvm
  %14 = call i8* @_ZN13SomeAwaitable8RegisterENSt3__u16coroutine_handleIvEE(%struct.SomeAwaitable* noundef nonnull align 1 dereferenceable(1) %2, i8* %11) #2, !dbg !841
  call void @llvm.dbg.value(metadata i8* %14, metadata !838, metadata !DIExpression()) #2, !dbg !839
  call void @llvm.dbg.value(metadata %"struct.std::__u::coroutine_handle"* undef, metadata !842, metadata !DIExpression()) #2, !dbg !846
  %15 = icmp eq i8* %14, null, !dbg !849
  %16 = select i1 %15, i8* %11, i8* %14, !dbg !850
  %17 = call i8* @llvm.coro.subfn.addr(i8* %16, i8 0)
  %18 = bitcast i8* %17 to void (i8*)*
  call fastcc void %18(i8* %16) #2, !dbg !832
  %19 = call i8 @llvm.coro.suspend(token %13, i1 false), !dbg !832
  switch i8 %19, label %32 [
    i8 0, label %20
    i8 1, label %26
  ], !dbg !832

20:                                               ; preds = %9
  call void @llvm.dbg.value(metadata %struct.Awaiter* undef, metadata !851, metadata !DIExpression()) #2, !dbg !854
  br i1 %15, label %21, label %22, !dbg !856

21:                                               ; preds = %20
  call void @_Z12DidntSuspendv() #2, !dbg !857
  br label %22, !dbg !860
```

But then on line 6111 a store into the coroutine frame appears after the call:

```llvm
  %27 = call i8* @_ZN13SomeAwaitable8RegisterENSt3__u16coroutine_handleIvEE(%struct.SomeAwaitable* noundef nonnull align 1 dereferenceable(1) %19, i8* %13) #2, !dbg !858
  %28 = getelementptr inbounds %_Z6FooBarv.Frame, %_Z6FooBarv.Frame* %14, i32 0, i32 5, !dbg !850
  store i8* %27, i8** %28, align 8, !dbg !850
```

I'd appreciate anybody's thoughts about what could be done to prevent this.
@ChuanqiXu9 okay yes, I can reproduce this:

```
> ./bin/clang++ -std=c++20 -O1 -fno-exceptions -S -mllvm --x86-asm-syntax=intel ~/tmp/foo.cc
> grep -A 5 Register foo.s
	call	_ZN13SomeAwaitable8RegisterENSt7__n486116coroutine_handleIvEE@PLT
	mov	qword ptr [rbx + 24], rax
	test	rax, rax
	cmove	rax, rbx
	mov	rdi, rax
	call	qword ptr [rax]
```

I applied https://reviews.llvm.org/D125291 and https://reviews.llvm.org/D127383 in their current state and rebuilt clang, and still get the same result. I guess that makes sense: there is no TLS here.
Oh, sorry for the confusion. I think I get the problem now. Long story short, your analysis (and TSan's) is correct. This is a (potential) miscompile. Here is the reason:
The key issue here is that:
I think we need to introduce something like CoroutineAA to provide the information. I'll try to look into it.
This is not an option for me. The key reason Clang/LLVM constructs coroutine frames the way it does is performance. In fact, there have been many such coroutine bugs, which could be fixed in one shot if we disabled the optimizations. So our strategy is always to fix the actual issues. As a heavy user and developer of coroutines, I believe that's the right choice, since performance is a key reason we chose C++.
Yeah, I didn't mean disabling optimizations altogether, just recognizing that this particular optimization shouldn't be performed for objects that span an `llvm.coro.suspend`. It's probably more complicated than I realize. Thanks for looking; I look forward to seeing what fix you come up with. :-)
I have recently run into the same issue using clang 14.0.6. My conclusion is that the
GCC has far fewer coroutine bugs than clang, since all the coroutine-related work in GCC is done in the frontend, while in clang the middle end gets involved to optimize coroutines further.
I am aware that support for coroutines is much more limited in gcc; that is why I am experimenting with clang. I love the fact that clang is able to fully inline non-recursive synchronous generators. Here are some code snippets that might help pinpoint the underlying issue (hopefully the same one observed by @jacobsa). This code does not trigger the issue:
This code triggers the issue if the

The issue (according to TSan) is that the local
Bug report 59221 has a different cause than this one, so ignore what I said above. Now I feel this is more like a TSan bug/defect than a compiler/optimizer bug/defect. Here are my reasons:
In other words, in the example, the function

@jacobsa @havardpe Does the explanation make sense to you? If so, I think it may be better to file another issue asking the TSan folks to address this.
Are you saying this hinges on the type erasure of
No. I mean SomeAwaitable::Register isn't allowed to access

So it is clear that users are allowed to access the resume function, the destroy function, and the promise (if they know its type). But they are not allowed to access the

or

the above code is clearly not legal. So I want to say that users (or SomeAwaitable::Register in this example) shouldn't/can't access any other part of the coroutine frame (or
As far as I can see, both issues (write ordering, local variable placement) boil down to this: to work well in a multi-threaded environment, a coroutine needs to support being resumed on another thread before the await_suspend function has returned on the thread handling its suspension.
@ChuanqiXu9 Ah, I see what you mean. I think @havardpe has it right. This issue isn't about the fact that

The generated assembly code gets this wrong. It unconditionally writes to the coroutine frame after the call to `Register`. The source is:

```cpp
const auto to_resume = awaitable.Register(h);
if (!to_resume) {
  suspended = false;
  return h;
}
```

and the generated assembly is:

```asm
call	SomeAwaitable::Register(std::__n4861::coroutine_handle<void>)
mov	qword ptr [rbx + 24], rax
```

This is not correct, because

My proof from the original post about the write of
If my reading is correct, you're saying the coroutine

So these two sentences look like UB to me, if my understanding is right.
This is the problem, yes.
Could you please cite the standard in more detail if you think this is UB? My understanding is that the coroutine is suspended once it enters `await_suspend`.

I don't think there is any alternative to this. There is no way for the implementor to synchronize on
http://eel.is/c++draft/coroutine.handle.resumption says that we can't resume/destroy a non-suspended coroutine. And about the definition of

Luckily we found another defect report. Let me see how to fix it... I feel like it is a pretty tough one...
I got your point. Although we usually do this like:

The reason why we don't use
This sounds like it's the heart of the bug; thanks for finding it!
Also, I think you already agree with me, but I do want to drive home the point that clang's current definition doesn't make sense. Here's your example of more typical code again:

```cpp
void await_suspend(coroutine_handle<> h) {
  get_scheduler()->schedule(h);
  return;
}
```

But this code also doesn't synchronize on
If the definition is "a coroutine is suspended once
Yeah, you're right, and we've thought about this before too, although our previous explanation was "it is OK from the compiler's perspective since there is no code left after

In fact, I am surprised that the coroutine is considered suspended during the execution of
I leave it to you to say how hard it is to fix the implementation, but I'll defend the standard, since I think its definition makes sense:
I got your point. Your words make sense from the users' perspective.
Isn't this exactly why the LLVM model has the `llvm.coro.save` intrinsic? My understanding was that the LLVM code calls
… regressions

The fix we sent for llvm/llvm-project#56301 may bring performance regressions, but we didn't mention this in the release notes, so users may get confused, e.g. llvm/llvm-project#64933. So this patch mentions the possible side effect and the potential solutions in llvm/llvm-project#64945 to avoid misunderstandings.
Got it. I'll try to take a look, but I can't make a promise.

/branch ChuanqiXu9/llvm-project/release/17.x

/pull-request llvm/llvm-project-release-prs#655

/branch ChuanqiXu9/llvm-project/release/17.x

Merged into release/17.x

Removing this from the LLVM 17.x Release milestone since the fix won't be there.
… not empty

Close llvm#56301
Close llvm#64151

See the summary and the discussion of https://reviews.llvm.org/D157070 to get the full context.

As @rjmccall pointed out, the key point of the root cause is that we currently don't implement the semantics of '@llvm.coro.save' ("after await-ready returns false, the coroutine is considered to be suspended") well. Those semantics imply that the compiler shouldn't write spills into the coroutine frame inside await_suspend. But currently it can, due to certain combinations of optimizations, so the semantics are broken; inlining is the root optimization enabling those combinations. So in this patch we add the `noinline` attribute to the await_suspend call.

Also, as an optimization, we don't add the `noinline` attribute to the await_suspend call if the awaiter is an empty class. This should be correct since programmers can't access the local variables in await_suspend if the awaiter is empty. I think this is necessary for performance, since empty awaiters are pretty common.

Another potential optimization is:

```llvm
call @llvm.coro.await_suspend(ptr %awaiter, ptr %handle, ptr @awaitSuspendFn)
```

Then it is much easier to perform the safety analysis in the middle end. If it is safe to inline the call to awaitSuspend, we can replace it in the CoroEarly pass; otherwise we can replace it in the CoroSplit pass.

Reviewed By: rjmccall
Differential Revision: https://reviews.llvm.org/D157833
I've hit what I think is a miscompilation bug in clang, where a write is moved in an illegal way that introduces a data race and/or use of uninitialized memory. Here is a test case reduced from my real codebase (Compiler Explorer):

The idea is that the awaiter is implemented by calling a `Register` function in a foreign translation unit that decides what to do:

- If the coroutine should be resumed immediately, it returns a null handle to indicate this.
- If the coroutine will be resumed later, it returns some other handle to resume now, for symmetric control. (Maybe `std::noop_coroutine()`.)

Further, when we don't actually wind up suspending, we need `await_resume` to do some follow-up work, in this case represented by calling the `DidntSuspend` function. So we use a `suspended` member to track whether we actually suspended. This is written before calling `Register`, and read after resuming.

The bug I see in my codebase is that the write of `true` to `suspended` is delayed until after the call to `Register`. In the reduced test case, we have something similar. Here is what Compiler Explorer gives me for clang with `-std=c++20 -O1 -fno-exceptions`:
:The coroutine frame address is in
rbx
. After callingRegister
, the returned handle is stored into the coroutine frame at offset 24 and then resumed (or the original handle resumed if it's empty), and later in[clone .resume]
the handle in the frame at offset 24 is compared to zero to synthesize theif (!suspended)
condition.But it's not safe to store the returned handle in the coroutine frame unless it's zero: any other value indicates that
Register
took responsibility for the coroutine handle, and may have passed it off to another thread. So another thread may have calleddestroy
on the handle by the time we get around to writing into it. Similarly, the other thread may already have resumed the coroutine and see an uninitialized value at offset 24.I think this is a miscompilation. Consider for example that
Register
may contain a critical section under a mutex that hands the coroutine handle off to another thread to resume, with a similar critical section in the other thread synchronizing with the first. (This is the situation in my codebase.) So we have:The write of
suspended
inawait_suspend
is sequenced before the call toRegister
below it inawait_suspend
.The call to
Register
synchronizes with the function on the other thread that resumes the coroutine.That synchronization is sequenced before resuming the coroutine handle.
Resuming the coroutine handle is (I believe?) sequenced before the call to
await_resume
that readssuspended
.Therefore the write of
suspended
inter-thread happens before the read ofsuspended
.So there was no data race before, but clang has introduced one by delaying the write to the coroutine frame.
For what it's worth, I spent some time dumping IR after optimization passes with my real codebase, and in that case this seemed to be related to an interaction between `SROAPass` and `CoroSplitPass`:

- Until `SROAPass`, the write was a simple store to the coroutine frame, before the call to `Register`.
- `SROAPass` eliminated the write altogether, turning it into phi nodes that plumbed the value directly into the branch. The value was plumbed from before the call to `Register` to after it.
- `CoroSplitPass` re-introduced a `store` instruction, after the call to `Register`.

I am far from an expert here, but I wonder if `SROAPass` should be forbidden from making optimizations of this sort across an `llvm.coro.suspend`?