Parallelize epoll events on thread pool and process events in the same thread #35330
Conversation
…work and process write work in same thread
Tagging subscribers to this area: @dotnet/ncl
Perf on JsonPlatform benchmark
12-proc x64 machine
28-proc x64 machine
56-proc x64 machine (2-socket), limited to 1 socket
56-proc x64 machine (2-socket), not limited
32-proc arm64 machine
The event-processing work items can theoretically become long-running; we should probably fix that.
stephentoub
left a comment
Thanks for working on this.
```csharp
    // Sync operation. Signal waiting thread to continue processing.
    e.Set();
}
else if (processAsyncOperationSynchronously)
```
I think we need to change slightly how this is structured, in particular for the e != null case. We get there if a synchronous operation is performed on a Socket on which an asynchronous operation was ever performed. In such cases, we've permanently moved the socket to be non-blocking, which means we need to simulate the blocking behavior of all subsequent sync operations, and we do that by using an MRES instead of a callback. We don't want to require a thread pool thread just to set that event, as doing so could lead to thread pool starvation: a sync operation on one thread pool thread would be waiting for a work item, processed by that same thread pool, to unblock it. So, we want the epoll thread to set such an MRES rather than queuing a work item to do it.
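As a rough sketch of that idea (type and member names here are illustrative, not the actual runtime internals), the epoll-side dispatch might look like:

```csharp
using System.Threading;

// Hypothetical simplification: CallbackOrEvent holds either a completion
// callback (async operation) or a ManualResetEventSlim (sync operation
// simulated over a non-blocking socket), mirroring the pattern described above.
sealed class AsyncOperation
{
    public object? CallbackOrEvent;
    public ManualResetEventSlim? Event => CallbackOrEvent as ManualResetEventSlim;
    public void Process() { /* run the async continuation */ }
}

static class EpollDispatch
{
    // Called on the epoll thread for each ready operation.
    public static void DispatchEvent(AsyncOperation op)
    {
        if (op.Event is ManualResetEventSlim mres)
        {
            // Sync operation: wake the blocked caller directly. Routing this
            // through the thread pool could starve if all pool threads are
            // themselves blocked in sync socket operations.
            mres.Set();
        }
        else
        {
            // Async operation: safe to defer to the thread pool.
            ThreadPool.UnsafeQueueUserWorkItem(static s => ((AsyncOperation)s!).Process(), op);
        }
    }
}
```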
So, we want the epoll thread to set such an MRES rather than queuing a work item to do it.
I am afraid that it would require us to make the enqueueing more complex and decrease throughput in a noticeable way.
To set the MRES on an epoll thread we would need to check two queues (send and receive), which would require taking two locks:
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs
Line 832 in 6ef538d
`using (Lock())`
and then perform zero to two casts (depending on whether the queues are empty or not):
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs
Line 137 in 723e7f2
`get { return CallbackOrEvent as ManualResetEventSlim; }`
Of course, we would have to do that for every epoll event returned by epoll_wait.
With @kouvel's proposal, after we receive an epoll notification we just add a bunch of simple events to a queue (this is very fast) and schedule a work item to the thread pool.
I am afraid that it would require us to make the enqueueing more complex and decrease throughput in a noticeable way.
We only need to do it if the event is for a sync operation, which can be checked cheaply.
The alternative is potential deadlock / long delays while waiting for the thread pool's starvation detection to introduce more threads.
We only need to do it if the event is for a sync operation, which can be checked cheaply.
But we don't know this when we are receiving the epoll notification. To check it, we need to translate the socket handle to a socket context:
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs
Line 333 in 6ef538d
`_handleToContextMap.TryGetValue(handle, out SocketAsyncContext? context);`
and then take the two locks that I've described above.
Then we could consider having a dedicated epoll thread for sync-over-nonblocking sockets. EDIT: I realize this won't work well, as the socket may already be associated with a particular epoll.
I understand your pushback, but I think this is a big deal. Convince me it's not if you disagree :)
The operation may be cancelled by the time this gets dequeued for execution.
Looks like ProcessQueuedOperation accounts for that.
So the suggested change is something like:
- `processAsyncOperationSynchronously` becomes `processAsyncOperationOnConcurrentQueue`, and enqueues to the `ConcurrentQueue<AsyncOperation>`.
- `HandleEvents` gets called directly on the epoll thread; `ConcurrentQueue` processing is deferred to the ThreadPool after calling `HandleEvents`.
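A rough sketch of that restructuring (type and member names are illustrative, not the actual runtime code):

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Illustrative sketch of the suggested shape: the epoll thread calls
// HandleEvents directly, pushes async operations onto a ConcurrentQueue, and
// a single thread pool work item drains the queue afterwards.
sealed class AsyncOperation
{
    public void Process() { /* run the continuation */ }
}

sealed class EventDispatcher
{
    private readonly ConcurrentQueue<AsyncOperation> _pending = new();

    // Runs on the epoll thread for each batch of epoll events.
    public void HandleEvents(AsyncOperation[] readyOperations)
    {
        foreach (AsyncOperation op in readyOperations)
        {
            _pending.Enqueue(op);
        }

        // Defer queue processing to the thread pool; one work item drains
        // everything that was enqueued above.
        ThreadPool.UnsafeQueueUserWorkItem(static state =>
        {
            var q = (ConcurrentQueue<AsyncOperation>)state!;
            while (q.TryDequeue(out AsyncOperation? op))
            {
                op.Process();
            }
        }, _pending);
    }
}
```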
Taking the lock for each operation queue on the epoll thread seems to reduce RPS by ~20-25 K. In the new commits I made changes to track whether the first operation in each queue is synchronous and to check it speculatively from the epoll thread. The speculative check along with the concurrent dictionary dequeue is not making a noticeable difference to RPS. Considering that the alternative is not incorrect, it seems like a speculative check would be enough to avoid the starvation issue.
I don't think there's a need to queue AsyncOperation to the queue; in any case it involves taking the lock, and doing so seems to reduce RPS by an additional 15-20 K on top of taking the locks. I'm not fully sure why, though there is a bit more work involved in extracting those operations.
The code would be simpler using ConcurrentQueue<AsyncOperation>.
Taking the lock for each operation queue on the epoll thread seems to reduce RPS by ~20-25 K.
I plan to take a shot at replacing these locks with Interlocked operations.
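One hypothetical shape for that (purely illustrative; the real operation queue's state machine has more states and transitions) is to replace a lock-guarded state transition with a compare-exchange on an int state field:

```csharp
using System.Threading;

// Illustrative sketch of replacing a lock-guarded queue state transition with
// an Interlocked operation. This only shows the compare-exchange pattern, not
// the full set of states the actual queue would need.
sealed class OperationQueueState
{
    private const int Ready = 0;
    private const int Processing = 1;

    private int _state = Ready;

    // Attempt to move the queue from Ready to Processing without a lock;
    // returns false if another thread already claimed it.
    public bool TryBeginProcessing() =>
        Interlocked.CompareExchange(ref _state, Processing, Ready) == Ready;

    // Publish the return to Ready with release semantics.
    public void EndProcessing() => Volatile.Write(ref _state, Ready);
}
```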
it involves taking the lock, and doing so seems to reduce RPS by an additional 15-20 K RPS on top of taking the locks, not sure fully why, though there is a bit more work involved in extracting those.
What lock does this refer to?
It's the lock taken in SocketAsyncContext.HandleEvent before this change (linked above by @adamsitnik), in this change the lock taken in SocketAsyncContext.ProcessSyncEventOrGetAsyncEvent.
The code would be simpler using `ConcurrentQueue<AsyncOperation>`.
Agreed, I wanted to get that to work and tried it first but there currently seem to be obstacles to doing that.
```csharp
    return error == Interop.Error.SUCCESS;
}

private struct Event
```
readonly
+1 for readonly, I think that we could also give it a less generic name, EpollEvent for example.
```diff
- private struct Event
+ private readonly struct EpollEvent
```
BTW, do you think that changing the SocketEvents size from 4 bytes to 1 could improve the perf in any way?
runtime/src/libraries/Common/src/Interop/Unix/System.Native/Interop.SocketEvent.cs
Line 13 in 6ef538d
`internal enum SocketEvents : int`
You're suggesting that to help reduce the size of EpollEvent? It wouldn't change its size; it would just result in more of the same space being padding. In fact, I think it could actually be slightly worse in terms of codegen, e.g.
https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHunHjykDAKIA3GADsM5VrGwYY5ABQBJSQAUMUBgAts4gCYAbGGgYAhAJ7yGI7IYCuMAJQMAvAD4G4mAHdhYySVdA2NTWwdnAG4uHnomAX8JDFIZGDkYUhV1TR09IxMGAEtJGztHFw8vX0TJTOD8sLKomO4W3gZcTXswDBqpNs5qHmH21QwNLQAJPONooZHuOOArGAYANSa5hdb5hbjRJKUxidyQguXrcPK24cHtkemzt1P8rfueDYjnq5g3hYBfNqA3YMNpxTpQbq9A61AY3WKMY45R6veGLRjFXqfRx/EZo9ow5JZcY5eqhIolH5OfF3d7cFHGZ5k3744bY1auUoRXHDYG8mj/IA===
Done
By only executing at most N work items and then scheduling a replica and exiting? Seems reasonable.
Yea that's what I was thinking. Maybe a batch of
A smaller threshold would be better for that issue but probably worse for perf. For instance, if processing a request takes 10 ms, it wouldn't take running many of those for the work item to appear long-running. Maybe also a time-based threshold like the thread pool uses, or something else.
Co-Authored-By: Stephen Toub <[email protected]>
```csharp
// An event was successfully dequeued, and as there may be more events to process, speculatively schedule a work
// item to parallelize processing of events. Since this is only for additional parallelization, doing so
// speculatively is ok.
```
Maybe worth mentioning explicitly that the parallelization makes it impossible for continuations to block one another.
I don't think it's possible for a continuation to block on another continuation for the same socket even without the parallelization done here. If a continuation blocks on a synchronous socket operation the next epoll event that serves that blocking operation would schedule another work item. The parallelization done here is only for making use of more procs.
I don't think it's possible for a continuation to block on another continuation for the same socket even without the parallelization done here.
It may be on different sockets. And we shouldn't rely on the next epoll_wait with events to get things moving.
If parallelization is not done here, and a blocking socket operation is performed on this thread as a result of running a user continuation (on any socket), then the set of IO events that are already queued would not release that blocking operation anyway, since it's a new operation that got added. Only another epoll event would get that blocking operation moving (that was true before this change too).
However, if it's some other kind of blocking on already-queued work, for example if a user continuation blocks waiting for another already-in-progress socket operation to complete, then the parallelization done here would help unblock that more quickly, though it's not guaranteed and can still lead to thread pool starvation issues; those kinds of blocking could already cause thread pool starvation before this change.
I'm trying to understand if there would be a correctness issue from not parallelizing here. If there is a possibility of a correctness issue, then it may be necessary to queue up a replica before the loop instead.
To alleviate potential issues from other kinds of blocking in user callbacks, maybe it would be safer to ensure there is a queued replica (non-speculatively) if one has not yet been scheduled.
Made the change I mentioned above in the latest update
Maybe something that relates to how many events we get from epoll?
No significant change to perf on the 28-proc x64 machine with the latest changes.
…irst dequeue, delegating scheduling of more work items to other threads
```csharp
{
    // Sync operation. Signal waiting thread to continue processing.
    e.Set();
}
```
Am I reading the code correctly that we'll likely end up calling Set twice for a sync operation (with the second call just being a nop)? I think that's probably fine, but it'd be worth a comment calling that out.
Why do you think it might be called twice? My intention was that either the epoll thread handles an event or it queues it for processing in the background, but not both. If the epoll thread correctly sees that there is a pending sync operation next, then it would call Set and not queue an operation for that. Otherwise, it would not process the event and would queue it instead.
```csharp
// Called on the epoll thread, speculatively tries to process synchronous events and errors for synchronous events, and
// returns any remaining events that remain to be processed. Taking a lock for each operation queue to deterministically
// handle synchronous events on the epoll thread seems to significantly reduce throughput in benchmarks.
```
Are there situations where we may not signal a sync op synchronously? What do those look like and how likely are they to happen?
The check done on the epoll thread is speculative. The value that is checked is updated under the lock, and since the epoll thread does not take the lock, the value read may be stale. That is very unlikely to happen, because a wait of any sort would typically involve a memory barrier (often even if it does not actually end up waiting), and the value read would be at least as recent as when the epoll wait was released. Even if the wait did not block, the call sequence involved would likely include a memory barrier of some sort.

Estimations about memory barriers aside, at worst we are relying on the latency of processor cache coherency here, and how bad that can be depends entirely on the processor. Some old (especially arm) processors don't have any sort of cache coherency and rely entirely on software to do the right thing. When a processor has cache coherency, the whole idea is that it shouldn't take an inordinate amount of time to make caches consistent; otherwise it would defeat the purpose. For example, using Volatile.Write to exit a lock relies entirely on processor cache-coherency latency to work reasonably well.

My stance remains that, considering the alternative is not functionally incorrect, this should be good enough for the purpose. That is up for debate though; we can sacrifice some perf to guarantee that sync operations are signaled on the epoll thread.
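A minimal sketch of the speculative-read pattern being discussed (names are illustrative; the actual queue tracks more than a single flag):

```csharp
// Illustrative sketch: writers maintain the flag under the queue lock, while
// the epoll thread reads it without the lock and tolerates a stale value.
// A stale read only means the event falls back to the (correct) queued path.
sealed class OperationQueue
{
    private readonly object _lock = new();
    private bool _isNextOperationSynchronous; // written only under _lock

    public void OnOperationEnqueued(bool isSync)
    {
        lock (_lock)
        {
            _isNextOperationSynchronous = isSync;
            // ... enqueue the operation itself ...
        }
    }

    // Read from the epoll thread without taking the lock; may be stale, which
    // is acceptable because the non-speculative path remains correct.
    public bool IsNextOperationSynchronous_Speculative => _isNextOperationSynchronous;
}
```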
The feasibility of any speculative code like this, and the feasibility of the general change itself, may depend on various dynamics. I'm not an expert in this area. Generally I would recommend that a subject-matter expert review this change. I see that @geoffkizer made some changes here before. Who would be appropriate to review this change?
@kouvel besides Geoff, @stephentoub has the most expertise.
@antonfirsov may be able to chime in as well ...
we can sacrifice some perf to guarantee that sync operations are signaled on the epoll thread.
@adamsitnik benchmarked the non-speculative version at 1090k rps vs 1139k rps for json plaintext (due to taking the lock).
Coincidentally we're fully loading the 1 epoll thread on the Citrine machine with that benchmark, so not taking the lock avoids the epoll bottleneck.
Does adding a Volatile.Read in IsNextOperationSynchronous_Speculative have an effect on performance? It would make more clear at what point we're picking a value that got set on another thread.
Ok great, thanks. I think this is an open issue currently, whether it is ok to miss releasing a synchronous operation from the epoll thread sometimes. It could happen due to races, processor cache issues, etc., and could lead to thread pool starvation if all thread pool threads are blocked by the synchronous operations, perhaps with more synchronous operations waiting in the queue, and probably more likely if no more epoll notifications come in for the relevant sockets.
Does adding a Volatile.Read in IsNextOperationSynchronous_Speculative have an effect on performance? It would make more clear at what point we're picking a value that got set on another thread.
It wouldn't have any effect on x64, would have to check on arm64. On arm64 I think the memory barrier would usually be redundant (and in the wrong place, after the read instead of before), and the overhead would be incurred for each event and each queue. Adding an explicit memory barrier before the loop may be better, but I suspect it would be difficult to quantify how much it would help.
[Edited] Actually the explicit memory barrier may not help much. There is an interlocked operation to schedule a thread to process events. If events were not queued, then it would have taken at least one operation queue lock. And maybe a barrier from the epoll call as well. So caches would likely already be cleared and the speculative read would be reading a recent value, just not under a lock so there could still be races.
stephentoub
left a comment
Generally looks good. Thanks for continuing to push on this.
Addressed feedback above and added a few more comments.
Updated numbers below with preview 5 SDK. These are with hill climbing disabled.

JsonPlatform

28-proc x64 machine
12-proc x64 machine
32-proc arm64 machine

I'm not seeing an issue with epoll thread count > 1. Decreasing latency between getting an epoll notification and scheduling a thread didn't seem to help.

12-proc x64 machine with cpuset 0-3 (to sort of simulate a smaller VM)

FortunesPlatform

This benchmark seems to be affected by the number of connections and epoll threads. On the x64 machines, in some cases with 512 connections and 1 epoll thread the change seems to be performing slightly worse than the baseline, while with 256 connections and 1 epoll thread the change seems to be performing slightly better.

28-proc x64 machine
12-proc x64 machine
32-proc arm64 machine
12-proc x64 machine with cpuset 0-3 (to sort of simulate a smaller VM)
For the sync operation case I tried having a server do synchronous reads after an async operation on 256 sockets, while a client writes to those sockets using async operations. The starvation issue appeared very quickly before the fixes (within a couple of seconds). With the baseline and after the fixes, it did not hit a noticeable starvation issue after running for minutes. I don't think it would be easy to repro the races, or at least it seems like they would not be frequent enough to trigger a starvation sequence.
Hopefully this PR should be close to check-in now. I probably won't be able to spend much time on it this week or next, but let me know if I can help to unblock.
adamsitnik
left a comment
Looks great to me! thank you @kouvel !!
The PR looks ready to me. @stephentoub @tmds, is there anything that should be addressed? If not, I would like to merge this PR.
It does have 2 types of socket connection, one for DB and the other for HTTP, so the dual load might be a factor.
@sebastienros kindly collected some numbers on an x64 VM with 4 procs in the same modes as above; here are the results.

JsonPlatform

FortunesPlatform

Numbers are pretty close; there doesn't appear to be a noticeable regression. On this VM, FortunesPlatform seems to perform better with more epoll threads, both before and after the change. The extra load may have something to do with it; I'm not seeing a clear pattern yet, but the diff between 8 and 1 epoll threads seems to be higher than on the other machines.
Ok, thanks for fixing and confirming.
First updated numbers should be available tomorrow morning.
```csharp
if ((events & Interop.Sys.SocketEvents.Read) != 0 &&
    _receiveQueue.IsNextOperationSynchronous_Speculative &&
    _receiveQueue.ProcessSyncEventOrGetAsyncEvent(this) == null)
```
@kouvel @stephentoub @adamsitnik I think there is an issue when IsNextOperationSynchronous_Speculative is true but the operation is not really a SyncEvent operation. The queue moves to Processing, but no one dispatches the operation.
Nice catch, will fix
Fixes #35330 (comment) by skipping state transitions when an async operation needs to be processed.
Applies the technique from dotnet/runtime#35330 to IOQueue.
It was seen that on larger machines, when fewer epoll threads are used, throughput drops significantly. One issue was that the epoll threads were not able to queue work items to the thread pool quickly enough to keep thread pool threads fully occupied. When the thread pool is not fully occupied, thread pool threads end up waiting for work and enqueues are much slower because they need to release a thread, creating a positive feedback loop in which lots of thread pool threads are released to look for work items and don't find any. It also doesn't help that the thread pool requests many more threads than necessary for the number of work items enqueued, and that the enqueue overhead is repeated for each epoll socket event.
Following @adamsitnik's idea of batching the enqueues to the thread pool and requesting only one thread to limit the overhead, this change tries to reduce the overhead on epoll threads by delegating that work to the thread pool, where it is automatically parallelized, and to decrease the number of redundant thread pool work items a bit.
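The batching idea can be sketched roughly as follows (a simplified illustration, not the actual runtime code; `EventBatcher` and its members are hypothetical names):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Illustrative sketch: the epoll thread enqueues a whole batch of events and
// requests at most one thread pool work item; that work item replicates
// itself while events remain, so the pool parallelizes as capacity allows.
sealed class EventBatcher
{
    private readonly ConcurrentQueue<int> _events = new(); // event payload type is illustrative
    private int _workItemScheduled; // 0 = none pending, 1 = scheduled

    // Called on the epoll thread with a batch of events from epoll_wait.
    public void EnqueueBatch(ReadOnlySpan<int> batch)
    {
        foreach (int e in batch)
        {
            _events.Enqueue(e);
        }

        // Request only one thread for the whole batch; replicas fan out from
        // the work item itself. (Simplified: the real code must also handle
        // races between enqueuing and a concurrent drain completing.)
        if (Interlocked.Exchange(ref _workItemScheduled, 1) == 0)
        {
            ThreadPool.UnsafeQueueUserWorkItem(static s => ((EventBatcher)s!).ProcessEvents(), this);
        }
    }

    private void ProcessEvents()
    {
        if (_events.TryDequeue(out int e))
        {
            // More events may remain: speculatively schedule a replica so
            // other threads can help, then process this event here.
            ThreadPool.UnsafeQueueUserWorkItem(static s => ((EventBatcher)s!).ProcessEvents(), this);
            Process(e);
        }
        else
        {
            Volatile.Write(ref _workItemScheduled, 0);
        }
    }

    private static void Process(int e) { /* handle the socket event */ }
}
```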