Parallelize epoll events on thread pool and process events in the same thread #35330
Conversation
…work and process write work in same thread
Tagging subscribers to this area: @dotnet/ncl
Perf on JsonPlatform benchmark
12-proc x64 machine
28-proc x64 machine
56-proc x64 machine (2-socket), limited to 1 socket
56-proc x64 machine (2-socket), not limited
32-proc arm64 machine
The event-processing work items can theoretically become long-running; we should probably fix that.
stephentoub
left a comment
Thanks for working on this.
```csharp
    // Sync operation. Signal waiting thread to continue processing.
    e.Set();
}
else if (processAsyncOperationSynchronously)
```
I think we need to change slightly how this is structured, in particular for the e != null case. We get there if a synchronous operation is performed on a Socket on which an asynchronous operation was ever performed. In such cases, we've permanently moved the socket to be non-blocking, which means we need to simulate the blocking behavior of all subsequent sync operations, and we do that by using an MRES instead of a callback. We don't want to require a thread pool thread just to set that event, as doing so could lead to thread pool starvation: a sync operation on one thread pool thread would be waiting for a work item, processed by that same thread pool, to unblock it. So, we want the epoll thread to set such an MRES rather than queuing a work item to do it.
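As a rough sketch of that idea (type and member names here are illustrative, not the actual runtime internals), the epoll-side dispatch might look like:

```csharp
using System.Threading;

// Hypothetical simplification: CallbackOrEvent holds either a completion
// callback (async operation) or a ManualResetEventSlim (sync operation
// simulated over a non-blocking socket), mirroring the pattern described above.
sealed class AsyncOperation
{
    public object? CallbackOrEvent;
    public ManualResetEventSlim? Event => CallbackOrEvent as ManualResetEventSlim;
    public void Process() { /* run the async continuation */ }
}

static class EpollDispatch
{
    // Called on the epoll thread for each ready operation.
    public static void DispatchEvent(AsyncOperation op)
    {
        if (op.Event is ManualResetEventSlim mres)
        {
            // Sync operation: wake the blocked caller directly. Routing this
            // through the thread pool could starve if all pool threads are
            // themselves blocked in sync socket operations.
            mres.Set();
        }
        else
        {
            // Async operation: safe to defer to the thread pool.
            ThreadPool.UnsafeQueueUserWorkItem(static s => ((AsyncOperation)s!).Process(), op);
        }
    }
}
```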
So, we want the epoll thread to set such an MRES rather than queuing a work item to do it.
I am afraid that it would require us to make the enqueueing more complex and decrease throughput in a noticeable way.
To set the MRES on an epoll thread we would need to check two queues (send and receive), which would require taking two locks:
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs
Line 832 in 6ef538d
`using (Lock())`
and then perform zero to two casts (depending on whether the queues are empty or not):
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs
Line 137 in 723e7f2
`get { return CallbackOrEvent as ManualResetEventSlim; }`
Of course, we would have to do that for every epoll event returned by epoll_wait.
With @kouvel's proposal, after we receive an epoll notification we just add a bunch of simple events to a queue (this is very fast) and schedule a work item to the thread pool.
I am afraid that it would require us to make the enqueueing more complex and decrease throughput in a noticeable way.
We only need to do it if the event is for a sync operation, which can be checked cheaply.
The alternative is potential deadlock / long delays while waiting for the thread pool's starvation detection to introduce more threads.
We only need to do it if the event is for a sync operation, which can be checked cheaply.
But we don't know this when we are receiving the epoll notification. To check it, we need to translate the socket handle to a socket context:
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs
Line 333 in 6ef538d
`_handleToContextMap.TryGetValue(handle, out SocketAsyncContext? context);`
and then take the two locks that I've described above.
Then we could consider having a dedicated epoll thread for sync-over-nonblocking sockets. EDIT: I realize this won't work well, as the socket may already be associated with a particular epoll.
I understand your pushback, but I think this is a big deal. Convince me it's not if you disagree :)
The operation may be cancelled by the time this gets dequeued for execution.
Looks like ProcessQueuedOperation accounts for that.
So the suggested change is something like:
- `processAsyncOperationSynchronously` becomes `processAsyncOperationOnConcurrentQueue`, and enqueues to the `ConcurrentQueue<AsyncOperation>`.
- `HandleEvents` gets called directly on the epoll thread; `ConcurrentQueue` processing is deferred to the ThreadPool after calling `HandleEvents`.
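A rough sketch of that restructuring (type and member names are illustrative, not the actual runtime code):

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Illustrative sketch of the suggested shape: the epoll thread calls
// HandleEvents directly, pushes async operations onto a ConcurrentQueue, and
// a single thread pool work item drains the queue afterwards.
sealed class AsyncOperation
{
    public void Process() { /* run the continuation */ }
}

sealed class EventDispatcher
{
    private readonly ConcurrentQueue<AsyncOperation> _pending = new();

    // Runs on the epoll thread for each batch of epoll events.
    public void HandleEvents(AsyncOperation[] readyOperations)
    {
        foreach (AsyncOperation op in readyOperations)
        {
            _pending.Enqueue(op);
        }

        // Defer queue processing to the thread pool; one work item drains
        // everything that was enqueued above.
        ThreadPool.UnsafeQueueUserWorkItem(static state =>
        {
            var q = (ConcurrentQueue<AsyncOperation>)state!;
            while (q.TryDequeue(out AsyncOperation? op))
            {
                op.Process();
            }
        }, _pending);
    }
}
```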
Taking the lock for each operation queue on the epoll thread seems to reduce RPS by ~20-25 K. In the new commits I made changes to track whether the first operation in each queue is synchronous and to check it speculatively from the epoll thread. The speculative check along with the concurrent dictionary dequeue is not making a noticeable difference to RPS. Considering that the alternative is not incorrect, it seems like a speculative check would be enough to avoid the starvation issue.
I don't think there's a need to queue AsyncOperation to the queue; in any case it involves taking the lock, and doing so seems to reduce RPS by an additional 15-20 K on top of taking the locks. I'm not fully sure why, though there is a bit more work involved in extracting those operations.
The code would be simpler using ConcurrentQueue<AsyncOperation>.
Taking the lock for each operation queue on the epoll thread seems to reduce RPS by ~20-25 K.
I plan to take a shot at replacing these locks with Interlocked operations.
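One hypothetical shape for that (purely illustrative; the real operation queue's state machine has more states and transitions) is to replace a lock-guarded state transition with a compare-exchange on an int state field:

```csharp
using System.Threading;

// Illustrative sketch of replacing a lock-guarded queue state transition with
// an Interlocked operation. This only shows the compare-exchange pattern, not
// the full set of states the actual queue would need.
sealed class OperationQueueState
{
    private const int Ready = 0;
    private const int Processing = 1;

    private int _state = Ready;

    // Attempt to move the queue from Ready to Processing without a lock;
    // returns false if another thread already claimed it.
    public bool TryBeginProcessing() =>
        Interlocked.CompareExchange(ref _state, Processing, Ready) == Ready;

    // Publish the return to Ready with release semantics.
    public void EndProcessing() => Volatile.Write(ref _state, Ready);
}
```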
it involves taking the lock, and doing so seems to reduce RPS by an additional 15-20 K RPS on top of taking the locks, not sure fully why, though there is a bit more work involved in extracting those.
What lock does this refer to?
It's the lock taken in SocketAsyncContext.HandleEvent before this change (linked above by @adamsitnik), in this change the lock taken in SocketAsyncContext.ProcessSyncEventOrGetAsyncEvent.
The code would be simpler using `ConcurrentQueue<AsyncOperation>`.
Agreed, I wanted to get that to work and tried it first but there currently seem to be obstacles to doing that.
```csharp
    return error == Interop.Error.SUCCESS;
}

private struct Event
```
readonly
+1 for readonly, I think that we could also give it a less generic name, EpollEvent for example.
```diff
- private struct Event
+ private readonly struct EpollEvent
```
BTW, do you think that changing the SocketEvents size from 4 bytes to 1 could improve the perf in any way?
runtime/src/libraries/Common/src/Interop/Unix/System.Native/Interop.SocketEvent.cs
Line 13 in 6ef538d
`internal enum SocketEvents : int`
You're suggesting that to help reduce the size of EpollEvent? It wouldn't change its size; it would just result in more of the same space being padding. In fact, I think it could actually be slightly worse in terms of codegen, e.g.
https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHunHjykDAKIA3GADsM5VrGwYY5ABQBJSQAUMUBgAts4gCYAbGGgYAhAJ7yGI7IYCuMAJQMAvAD4G4mAHdhYySVdA2NTWwdnAG4uHnomAX8JDFIZGDkYUhV1TR09IxMGAEtJGztHFw8vX0TJTOD8sLKomO4W3gZcTXswDBqpNs5qHmH21QwNLQAJPONooZHuOOArGAYANSa5hdb5hbjRJKUxidyQguXrcPK24cHtkemzt1P8rfueDYjnq5g3hYBfNqA3YMNpxTpQbq9A61AY3WKMY45R6veGLRjFXqfRx/EZo9ow5JZcY5eqhIolH5OfF3d7cFHGZ5k3744bY1auUoRXHDYG8mj/IA===
Done
By only executing at most N work items and then scheduling a replica and exiting? Seems reasonable.
Yea that's what I was thinking. Maybe a batch of
A smaller threshold would be better for that issue but probably worse for perf. For instance, if processing a request takes 10 ms, it wouldn't take running many of those for the work item to appear long-running. Maybe also a time-based threshold like the thread pool uses, or something else.
Co-Authored-By: Stephen Toub <[email protected]>
```csharp
// An event was successfully dequeued, and as there may be more events to process, speculatively schedule a work
// item to parallelize processing of events. Since this is only for additional parallelization, doing so
// speculatively is ok.
```
Maybe worth mentioning explicitly that the parallelization makes it impossible for continuations to block one another.
I don't think it's possible for a continuation to block on another continuation for the same socket even without the parallelization done here. If a continuation blocks on a synchronous socket operation the next epoll event that serves that blocking operation would schedule another work item. The parallelization done here is only for making use of more procs.
I don't think it's possible for a continuation to block on another continuation for the same socket even without the parallelization done here.
It may be on different sockets. And we shouldn't rely on the next epoll_wait with events to get things moving.
If parallelization is not done here, and a blocking socket operation is performed on this thread as a result of running a user continuation (on any socket), then the set of IO events that are already queued would not release that blocking operation anyway, since it's a new operation that got added. Only another epoll event would get that blocking operation moving (that was true before this change too).
However, if it's some other kind of blocking on already-queued work, for example if a user continuation blocks waiting for another already-in-progress socket operation to complete, then the parallelization done here would help unblock that more quickly, though it's not guaranteed and can still lead to thread pool starvation issues; those kinds of blocking could already cause thread pool starvation before this change.
I'm trying to understand if there would be a correctness issue from not parallelizing here. If there is a possibility of a correctness issue, then it may be necessary to queue up a replica before the loop instead.
To alleviate potential issues from other kinds of blocking in user callbacks, maybe it would be safer to ensure there is a queued replica (non-speculatively) if one has not yet been scheduled.
Made the change I mentioned above in the latest update
Maybe something that relates to how many events we get from epoll?
No significant change to perf on the 28-proc x64 machine with the latest changes.
…irst dequeue, delegating scheduling of more work items to other threads
```csharp
{
    // Sync operation. Signal waiting thread to continue processing.
    e.Set();
}
```
Am I reading the code correctly that we'll likely end up calling Set twice for a sync operation (with the second call just being a nop)? I think that's probably fine, but it'd be worth a comment calling that out.
Why do you think it might be called twice? My intention was that either the epoll thread handles an event or it queues it for processing in the background, but not both. If the epoll thread correctly sees that there is a pending sync operation next, then it would call Set and not queue an operation for that. Otherwise, it would not process the event and would queue it instead.
```csharp
// Called on the epoll thread, speculatively tries to process synchronous events and errors for synchronous events, and
// returns any remaining events that remain to be processed. Taking a lock for each operation queue to deterministically
// handle synchronous events on the epoll thread seems to significantly reduce throughput in benchmarks.
```
Are there situations where we may not signal a sync op synchronously? What do those look like and how likely are they to happen?
The check done on the epoll thread is speculative. The value that is checked is updated under the lock, and since the epoll thread does not take the lock, the value read may be stale. That is very unlikely to happen, because a wait of any sort would typically involve a memory barrier (often even if it does not actually end up waiting), and the value read would be at least as recent as when the epoll wait was released. Even if the wait did not block, the call sequence involved would likely include a memory barrier of some sort.

Estimations about memory barriers aside, at worst we are relying on the latency of processor cache coherency here, and how bad that can be depends entirely on the processor. Some old (especially arm) processors don't have any sort of cache coherency and rely entirely on software to do the right thing. When a processor has cache coherency, the whole idea is that it shouldn't take an inordinate amount of time to make caches consistent; otherwise it would defeat the purpose. For example, using Volatile.Write to exit a lock relies entirely on processor cache-coherency latency to work reasonably well.

My stance remains that, considering the alternative is not functionally incorrect, this should be good enough for the purpose. That is up for debate though; we can sacrifice some perf to guarantee that sync operations are signaled on the epoll thread.
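A minimal sketch of the speculative-read pattern being discussed (names are illustrative; the actual queue tracks more than a single flag):

```csharp
// Illustrative sketch: writers maintain the flag under the queue lock, while
// the epoll thread reads it without the lock and tolerates a stale value.
// A stale read only means the event falls back to the (correct) queued path.
sealed class OperationQueue
{
    private readonly object _lock = new();
    private bool _isNextOperationSynchronous; // written only under _lock

    public void OnOperationEnqueued(bool isSync)
    {
        lock (_lock)
        {
            _isNextOperationSynchronous = isSync;
            // ... enqueue the operation itself ...
        }
    }

    // Read from the epoll thread without taking the lock; may be stale, which
    // is acceptable because the non-speculative path remains correct.
    public bool IsNextOperationSynchronous_Speculative => _isNextOperationSynchronous;
}
```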
The feasibility of any speculative code like this, and the feasibility of the general change itself, may depend on various dynamics. I'm not an expert in this area. Generally I would recommend that a subject-matter expert review this change. I see that @geoffkizer made some changes here before. Who would be appropriate to review this change?
@kouvel besides Geoff, @stephentoub has the most expertise.
@antonfirsov may be able to chime in as well ...
we can sacrifice some perf to guarantee that sync operations are signaled on the epoll thread.
@adamsitnik benchmarked the non-speculative version at 1090k rps vs 1139k rps for json plaintext (due to taking the lock).
Coincidentally we're fully loading the 1 epoll thread on the Citrine machine with that benchmark, so not taking the lock avoids the epoll bottleneck.
Does adding a Volatile.Read in IsNextOperationSynchronous_Speculative have an effect on performance? It would make more clear at what point we're picking a value that got set on another thread.
Ok great, thanks. I think this is an open issue currently, whether it is ok to miss releasing a synchronous operation from the epoll thread sometimes. It could happen due to races, processor cache issues, etc., and could lead to thread pool starvation if all thread pool threads are blocked by the synchronous operations, perhaps with more synchronous operations waiting in the queue, and probably more likely if no more epoll notifications come in for the relevant sockets.
Does adding a Volatile.Read in IsNextOperationSynchronous_Speculative have an effect on performance? It would make more clear at what point we're picking a value that got set on another thread.
It wouldn't have any effect on x64, would have to check on arm64. On arm64 I think the memory barrier would usually be redundant (and in the wrong place, after the read instead of before), and the overhead would be incurred for each event and each queue. Adding an explicit memory barrier before the loop may be better, but I suspect it would be difficult to quantify how much it would help.
[Edited] Actually the explicit memory barrier may not help much. There is an interlocked operation to schedule a thread to process events. If events were not queued, then it would have taken at least one operation queue lock. And maybe a barrier from the epoll call as well. So caches would likely already be cleared and the speculative read would be reading a recent value, just not under a lock so there could still be races.
stephentoub
left a comment
Generally looks good. Thanks for continuing to push on this.
Addressed feedback above and added a few more comments.
Updated numbers below with preview 5 SDK. These are with hill climbing disabled.

JsonPlatform

28-proc x64 machine
12-proc x64 machine
32-proc arm64 machine

I'm not seeing an issue with epoll thread count > 1. Decreasing latency between getting an epoll notification and scheduling a thread didn't seem to help.

12-proc x64 machine with cpuset 0-3 (to sort of simulate a smaller VM)

FortunesPlatform

This benchmark seems to be affected by the number of connections and epoll threads. On the x64 machines, in some cases with 512 connections and 1 epoll thread the change seems to be performing slightly worse than the baseline, while with 256 connections and 1 epoll thread the change seems to be performing slightly better.

28-proc x64 machine
12-proc x64 machine
32-proc arm64 machine
12-proc x64 machine with cpuset 0-3 (to sort of simulate a smaller VM)
For the sync operation case I tried having a server do synchronous reads after an async operation on 256 sockets, while a client writes to those sockets using async operations. The starvation issue appeared very quickly before the fixes (within a couple of seconds). With the baseline and after the fixes, it did not hit a noticeable starvation issue after running for minutes. I don't think it would be easy to repro the races, or at least it seems like they would not be frequent enough to trigger a starvation sequence.
Hopefully this PR should be close to check-in now. I probably won't be able to spend much time on it this week or next, but let me know if I can help to unblock.
adamsitnik
left a comment
Looks great to me! thank you @kouvel !!
The PR looks ready to me. @stephentoub @tmds, is there anything that should be addressed? If not, I would like to merge this PR.
It does have 2 types of socket connection, one for DB and the other for HTTP, so the dual load might be a factor.
@sebastienros kindly collected some numbers on an x64 VM with 4 procs in the same modes as above; here are the results.

JsonPlatform

FortunesPlatform

Numbers are pretty close; there doesn't appear to be a noticeable regression. On this VM, FortunesPlatform seems to perform better with more epoll threads, both before and after the change. The extra load may have something to do with it; I'm not seeing a clear pattern yet, but the diff between 8 and 1 epoll threads seems to be higher than on the other machines.
Ok, thanks for fixing and confirming.
First updated numbers should be available tomorrow morning.
```csharp
if ((events & Interop.Sys.SocketEvents.Read) != 0 &&
    _receiveQueue.IsNextOperationSynchronous_Speculative &&
    _receiveQueue.ProcessSyncEventOrGetAsyncEvent(this) == null)
```
@kouvel @stephentoub @adamsitnik I think there is an issue when IsNextOperationSynchronous_Speculative is true but the operation is not really a SyncEvent operation. The queue moves to Processing, but no one dispatches the operation.
Nice catch, will fix
Fixes #35330 (comment) by skipping state transitions when an async operation needs to be processed.
Applies the technique from dotnet/runtime#35330 to IOQueue.
It was seen that on larger machines, when fewer epoll threads are used, throughput drops significantly. One issue was that the epoll threads were not able to queue work items to the thread pool quickly enough to keep thread pool threads fully occupied. When the thread pool is not fully occupied, thread pool threads end up waiting for work and enqueues are much slower because they need to release a thread, creating a positive feedback loop in which lots of thread pool threads are released to look for work items and don't find any. It also doesn't help that the thread pool requests many more threads than necessary for the number of work items enqueued, and that the enqueue overhead is repeated for each epoll socket event.
Following @adamsitnik's idea of batching the enqueues to the thread pool and requesting only one thread to limit the overhead, this change tries to reduce the overhead on epoll threads by delegating that work to the thread pool, where it is automatically parallelized, and to decrease the number of redundant thread pool work items a bit.
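The batching idea can be sketched roughly as follows (a simplified illustration, not the actual runtime code; `EventBatcher` and its members are hypothetical names):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Illustrative sketch: the epoll thread enqueues a whole batch of events and
// requests at most one thread pool work item; that work item replicates
// itself while events remain, so the pool parallelizes as capacity allows.
sealed class EventBatcher
{
    private readonly ConcurrentQueue<int> _events = new(); // event payload type is illustrative
    private int _workItemScheduled; // 0 = none pending, 1 = scheduled

    // Called on the epoll thread with a batch of events from epoll_wait.
    public void EnqueueBatch(ReadOnlySpan<int> batch)
    {
        foreach (int e in batch)
        {
            _events.Enqueue(e);
        }

        // Request only one thread for the whole batch; replicas fan out from
        // the work item itself. (Simplified: the real code must also handle
        // races between enqueuing and a concurrent drain completing.)
        if (Interlocked.Exchange(ref _workItemScheduled, 1) == 0)
        {
            ThreadPool.UnsafeQueueUserWorkItem(static s => ((EventBatcher)s!).ProcessEvents(), this);
        }
    }

    private void ProcessEvents()
    {
        if (_events.TryDequeue(out int e))
        {
            // More events may remain: speculatively schedule a replica so
            // other threads can help, then process this event here.
            ThreadPool.UnsafeQueueUserWorkItem(static s => ((EventBatcher)s!).ProcessEvents(), this);
            Process(e);
        }
        else
        {
            Volatile.Write(ref _workItemScheduled, 0);
        }
    }

    private static void Process(int e) { /* handle the socket event */ }
}
```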