Managed RIO server #10
Conversation
Rename to indicate that managed code, rather than native code, is doing most of the work.
Mostly
Performance looks great. We're seeing between 5 and 5.9 million RPS. The immediate issue appears to be that pipelining above a depth of 11 results in the numbers going down. Our hunch is that there's an issue when reading request payloads across datagrams. Would be great if you could look into that, because we still have CPU to spare and I'd love to see this thing max it 😄 We'd love to see the "same" server written without RIO (just the usual WinSock) to compare the difference, as right now we only have our libuv-based servers to compare with.
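For context, HTTP/1.1 pipelining just concatenates several requests into a single write on one connection. A minimal sketch of what a pipelined batch looks like on the wire (the host, path, and headers here are illustrative placeholders, not the actual benchmark requests):

```python
# Build a pipelined batch of HTTP/1.1 requests: `depth` requests
# concatenated and sent as one payload on a single connection.
# Host and path are placeholder values for illustration only.
def build_pipelined_batch(depth, host="10.0.0.1", path="/plaintext"):
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: keep-alive\r\n"
        "\r\n"
    ).encode("ascii")
    return request * depth

# At depth 11, all eleven requests travel as one payload; past some
# depth the batch spans multiple datagrams, which is exactly where
# reading request payloads across datagram boundaries gets tricky.
batch = build_pipelined_batch(11)
```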
Do you know the request byte size, or is it variable? If you could post a list of 15 that would be helpful, e.g.
You can change Host etc. as long as the character count is the same; it will be easier to track down in the debugger.
Shouldn't be too hard (famous last words)
Should have fixed pipelining
Something weird is still happening at 11; still looking
OK, I ran it again. Got ~6.7 million RPS at 45% CPU utilization & 7.6 Gbps, so we have both CPU and network to spare 😄
Request packets shouldn't fragment until past a 26+ pipeline depth, depending on Host header size, so 11 is really weird... Digging through the code. Don't you have 2 wrk servers on the 10Gbps network to hit it with? Maybe wrk breaks? (though it's more likely my code)
I can hit it with two wrk machines to see if that makes a difference. |
Thinking aloud: in theory it should be able to pipeline 29 deep without the request fragmenting, and get the responses back in 2 packets with 20 bytes spare, for cool network saturation. Though you would need to find the sweet spot on that pipelining for each load server's threads.

The code isn't breaking on any numbers that make sense; though I'm also getting the peak at 11 between an Azure Windows VM and an Azure Linux VM, so that would suggest it's the code, as your test system and my Azure one are different environments. Maybe there are buffer size mismatches that like 1100 bytes but not 1200+ bytes... hmm...
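The arithmetic behind those numbers can be sketched as follows (the ~1460-byte TCP segment payload, ~50-byte request, and ~100-byte response sizes are my assumptions, chosen only to be consistent with the "29 deep" and "20 bytes spare" figures above, not measured values):

```python
MSS = 1460            # typical TCP payload per Ethernet segment (assumed)
REQUEST_BYTES = 50    # one pipelined GET request (assumed size)
RESPONSE_BYTES = 100  # one HTTP response (assumed size)

# Deepest pipeline whose requests still fit in one unfragmented segment:
max_depth = MSS // REQUEST_BYTES  # 29 requests per segment

# The matching responses then fit in two full segments with a little slack:
spare = 2 * MSS - max_depth * RESPONSE_BYTES  # 20 bytes spare
```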
Request fragment size is assuming a pipelining path of
Yep, that's what our pipeline script is doing
Quadrupled the RIO event queue read size, just in case events were getting dropped
Probably want the newer code too, as I was taking shortcuts on the reading
Sorry man, RSS was already on (we chose the "Web Server" profile in the Intel NIC settings; lots of knobs and dials we could change, and a profile named "Web Server" seemed like a good fit 😉). Trying the new code now.
New code is about the same as before. |
I played around with the wrk parameters a bit and just broke 7 million RPS for the first time: Yep, I pipelined at depth 100 😄 CPU gets lower as pipeline depth grows, which I think makes sense as network reads become more efficient (fewer datagrams containing only part of a request).
It also breaks the read→write dependency in the same way upping the connections will, as request packets can be in flight while responses are being generated; whereas with a single packet, request→response→request is serialised. Definitely sounds like a bug somewhere; however, it's hard to test as it happily crushes any network I put it on 😆
I think that's also an effect of RIO, as it surfaces receives in batches rather than per packet; so if 5 packets come in at the same time it will send 1 IOCP message, rather than the 5 IOCP messages you'd get with more traditional WinSock.
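A toy model of that coalescing, purely to illustrate the notification accounting (this is not the Windows RIO/IOCP API, just the wakeup counts the two styles imply):

```python
# Classic per-packet completion posts one notification per receive;
# a batched completion queue drains a whole burst per notification.
def notifications_for_burst(packets_in_burst, batched):
    if batched:
        return 1  # one wakeup, then dequeue all completed receives
    return packets_in_burst  # one wakeup per completed receive

# A burst of 5 simultaneous packets:
classic = notifications_for_burst(5, batched=False)  # 5 IOCP messages
rio_style = notifications_for_burst(5, batched=True)  # 1 IOCP message
```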
Added dynamic Date: header and changed response string to "Hello, World!"
Since there is spare CPU & network, added changes so it would pass TechEmpower Test Type 6: Plaintext at 256, 1024, and 4096 concurrency; specifically:
- Body value
- Compose headers and body
- Add Date header; that is, the current date in RFC1123 format
Also tweaked the final send in a batch; not sure if it will do anything, but since there is CPU to play with... Not sure if 16,384 concurrency would break it, but probably; there are some allocations that are a function of the thread count which might need tweaking for that, but trying to run with that concurrency breaks my wrk server. The response payload is 1.5× bigger and has some dynamic elements (Date), so it may perform worse; but the response-end tweak might help?
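A sketch of the response composition described above (Python here purely to show the wire format; the header set and RFC1123 Date formatting are standard HTTP, but the actual server is managed .NET code, so this is an illustration rather than the implementation):

```python
from email.utils import formatdate

def build_response(body=b"Hello, World!"):
    # formatdate(usegmt=True) yields an RFC1123 date,
    # e.g. "Sun, 06 Nov 1994 08:49:37 GMT"
    headers = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n"
        f"Date: {formatdate(usegmt=True)}\r\n"
        "\r\n"
    ).encode("ascii")
    return headers + body

resp = build_response()
```

In the real server the Date string would be cached and refreshed about once a second rather than formatted per response, since it only changes at that granularity.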
Think I found the other bug...
Should work better
Should also pass the 16,384 concurrency test on TechEmpower's 40-core environment
Seemed to exhaust them in some tests when given the exact amount; currently over-provisioned by 4×; only seems to need 4.
We moved things around in the repo. Can you rebase and move your project to the
Beyond my git ken; I have created another PR
#9 Managed Registered IO Server, using IOCP