[D114119][libc++] Fix potential lost wake-up in counting semaphore

This issue tracks picking up https://reviews.llvm.org/D114119 from the Phabricator archive.


> The "futex table" in src/atomic.cpp contains 256 platform native futexes
(something like Linux/OpenBSD/NetBSD futex, macOS __ulock_*, FreeBSD
umtx, Windows WaitOnAddress). This table is needed because some platforms
don't natively support waiting on atomics of sizes 8/16/32/64 bits. On Linux,
for example, futexes are always 32 bit. On these platforms, calling
atomic_wait on an atomic of unsupported size then waits using one of
the table futexes "under the hood".
> Each futex of the table has an associated "waiter count" (__contention_state)
which tracks the number of waiters on that futex. This count is used to avoid
calling the platform wake function needlessly.
> If the platform _does_ support the size of the atomic natively then the
wait/notify syscalls (e.g. futex) are called directly on the atomic instead
of using one from the table. However, the "waiter count" of the table futex
is still used to minimize native wake up calls.
> The part of __libcpp_atomic_wait_backoff_impl where the waiting actually
happens looks like this:
```
auto const __monitor = __libcpp_atomic_monitor(__a);
if(__test_fn())
    return true;
__libcpp_atomic_wait(__a, __monitor);
```
> ...so the current value of the futex (either the table one if the size is not
natively supported, or the actual one otherwise) is saved. Then the test
function is called. If the test is unsuccessful (in case of semaphores the
atomic is "0") then we wait on the futex, telling it the __monitor value to
be free of races.
> This will work absolutely fine if one of the table futexes is used. Waking one
of them up will always increase the futex value by one, so the futex wait will
instantly return if the current atomic value is different than __monitor
(which indicates that a wakeup must have happened _after_ the call to
__test_fn but before going to sleep). There is a corner case when there are
~4 billion wakeups in this short interval. This is mitigated by waiting for at
most 2s when the native platform futex size is only 32 bit (explained by
Olivier Giroux here: https://github.com/ogiroux/atomic_wait/issues/3).
> Now, if the native atomic/futex is used (which is the case on macOS for
counting_semaphore!), deadlocks are _much_ more likely to occur, because now
the futex is no longer a simple counter that always increments, but instead
the "real" value (in case of counting_semaphore just the semaphore value
itself). There might be executions like the following:
```
- T1 calls acquire()
- T2 calls try_acquire()
- T3 calls release() two times
- semaphore starts with "0"

T1                        T2                        T3              sem
------------------------------------------------------------------------
                                                                     0
------------------------------------------------------------------------
acquire():
sees that sema=0
goes into
__libcpp_atomic_wait_backoff_impl                                    0
------------------------------------------------------------------------
                                                    release()        1
------------------------------------------------------------------------
__monitor=1                                                          1
------------------------------------------------------------------------
                          try_acquire():
                          acquires one
                          successfully,
                          returns true                               0
------------------------------------------------------------------------
__test_fn():
returns false
since sem==0                                                         0
------------------------------------------------------------------------
                                                    release()        1
------------------------------------------------------------------------
calls
__libcpp_atomic_wait(__a, 1);
--> deadlock!                                                        1
------------------------------------------------------------------------
```
> Here, __libcpp_atomic_wait(__a, 0); should have been called, because
__test_fn() made a decision to sleep based on the value "0" of sem. But
instead, the value of __monitor is used which is "1" and leads to this ABA
style problem where T1 sleeps although the semaphore is unlocked.
> I think to solve this problem you could require the "test function" to save its
current understanding of the atomic into an in/out argument so that the atomic
wait function knows which value to use for its second argument. In case of the
semaphores, it could look like this:
```
     _LIBCPP_AVAILABILITY_SYNC _LIBCPP_INLINE_VISIBILITY
     void acquire()
     {
-        auto const __test_fn = [this]() -> bool {
-            auto __old = __a.load(memory_order_relaxed);
-            return (__old != 0) && __a.compare_exchange_strong(__old, __old - 1, memory_order_acquire, memory_order_relaxed);
+        auto const __test_fn = [this](ptrdiff_t& __old) -> bool {
+            while (true) {
+                if (__old == 0)
+                    return false;
+                if (__a.compare_exchange_weak(__old, __old - 1, memory_order_acquire, memory_order_relaxed))
+                    return true;
+            }
         };
-        __cxx_atomic_wait(&__a.__a_, __test_fn);
+        __cxx_atomic_wait_fn(&__a.__a_, __test_fn, memory_order_relaxed);
     }
```
> For semaphores, the resulting __old is always "0" if "false" is returned. So
the second argument to futex is always "0" which should be correct..
I've been hacking on a patch which is very much WIP but I could post it if
there's interest (and if this analysis even makes sense!).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[D114119][libc++] Fix potential lost wake-up in counting semaphore #77659

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[D114119][libc++] Fix potential lost wake-up in counting semaphore #77659

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions