Managed RIO server #10
Conversation
Rename to indicate that managed code, rather than native code, is doing most of the work.
Mostly
Performance looks great. We're seeing between 5 and 5.9 million RPS. The immediate issue appears to be that pipelining above a depth of 11 results in the numbers going down. Our hunch is that there's an issue when reading request payloads across datagrams. Would be great if you could look into that, because we still have CPU to spare and I'd love to see this thing max it 😄 We'd love to see the "same" server written without RIO (just the usual WinSock) to compare the difference, as right now we only have our libuv-based servers to compare with.
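For context, HTTP/1.1 pipelining just concatenates several requests into a single write on one connection. A minimal sketch of what a pipelined batch looks like on the wire (the host, path, and headers here are illustrative placeholders, not the actual benchmark requests):

```python
# Build a pipelined batch of HTTP/1.1 requests: `depth` requests
# concatenated and sent as one payload on a single connection.
# Host and path are placeholder values for illustration only.
def build_pipelined_batch(depth, host="10.0.0.1", path="/plaintext"):
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: keep-alive\r\n"
        "\r\n"
    ).encode("ascii")
    return request * depth

# At depth 11, all eleven requests travel as one payload; past some
# depth the batch spans multiple datagrams, which is exactly where
# reading request payloads across datagram boundaries gets tricky.
batch = build_pipelined_batch(11)
```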
Do you know the request byte size, or is it variable? If you could post a list of 15 that would be helpful, e.g.
You can change Host etc. as long as the character count is the same; it will be easier to track down in the debugger.
Shouldn't be too hard (famous last words)
Should have fixed pipelining
Something weird is still happening at 11; still looking
OK, I ran it again. Got ~6.7 million RPS at 45% CPU utilization & 7.6 Gbps, so we have both CPU and network to spare 😄
Request packets shouldn't fragment until past a 26+ pipeline depth, depending on Host header size, so 11 is really weird... Digging through the code. Don't you have 2 wrk servers on the 10Gbps network to hit it with? Maybe wrk breaks? (though it's more likely my code)
I can hit it with two wrk machines to see if that makes a difference. |
Thinking aloud: in theory it should be able to pipeline 29 deep without the request fragmenting, and get the responses back in 2 packets with 20 bytes spare, for cool network saturation. Though you would need to find the sweet spot on that pipelining for each load server's threads.

The code isn't breaking on any numbers that make sense; though I'm also getting the peak at 11 between an Azure Windows VM and an Azure Linux VM, so that would suggest it's the code, as your test system and my Azure one are different environments. Maybe there are buffer size mismatches that like 1100 bytes but not 1200+ bytes... hmm...
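The arithmetic behind those numbers can be sketched as follows (the ~1460-byte TCP segment payload, ~50-byte request, and ~100-byte response sizes are my assumptions, chosen only to be consistent with the "29 deep" and "20 bytes spare" figures above, not measured values):

```python
MSS = 1460            # typical TCP payload per Ethernet segment (assumed)
REQUEST_BYTES = 50    # one pipelined GET request (assumed size)
RESPONSE_BYTES = 100  # one HTTP response (assumed size)

# Deepest pipeline whose requests still fit in one unfragmented segment:
max_depth = MSS // REQUEST_BYTES  # 29 requests per segment

# The matching responses then fit in two full segments with a little slack:
spare = 2 * MSS - max_depth * RESPONSE_BYTES  # 20 bytes spare
```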
Request fragment size is assuming a pipelining path of
Yep, that's what our pipeline script is doing
Quadrupled the RIO event queue read size, just in case events were getting dropped
Probably want the newer code too, as I was taking shortcuts on the reading
Sorry man, RSS was already on (we chose the "Web Server" profile in the Intel NIC settings; lots of knobs and dials we could change, and a profile named "Web Server" seemed like a good fit 😉). Trying the new code now.
New code is about the same as before. |
I played around with the wrk parameters a bit and just broke 7 million RPS for the first time: Yep, I pipelined at depth 100 😄 CPU gets lower as pipeline depth grows, which I think makes sense as network reads become more efficient (fewer datagrams containing only part of a request).
It also breaks the read→write dependency in the same way upping the connections will, as request packets can be in flight while responses are being generated; whereas with a single packet, request→response→request is serialised. Definitely sounds like a bug somewhere; however, it's hard to test as it happily crushes any network I put it on 😆
I think that's also an effect of RIO, as it surfaces receives in batches rather than per packet; so if 5 packets come in at the same time it will send 1 IOCP message, rather than the 5 IOCP messages you'd get with more traditional WinSock.
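A toy model of that coalescing, purely to illustrate the notification accounting (this is not the Windows RIO/IOCP API, just the wakeup counts the two styles imply):

```python
# Classic per-packet completion posts one notification per receive;
# a batched completion queue drains a whole burst per notification.
def notifications_for_burst(packets_in_burst, batched):
    if batched:
        return 1  # one wakeup, then dequeue all completed receives
    return packets_in_burst  # one wakeup per completed receive

# A burst of 5 simultaneous packets:
classic = notifications_for_burst(5, batched=False)  # 5 IOCP messages
rio_style = notifications_for_burst(5, batched=True)  # 1 IOCP message
```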
Added dynamic Date: header and changed response string to "Hello, World!"
Since there is spare CPU & network, added changes so it would pass TechEmpower Test Type 6: Plaintext at 256, 1024, and 4096 concurrency; specifically:
- Body value
- Compose headers and body
- Add Date header; that is, the current date in RFC1123 format
Also tweaked the final send in a batch; not sure if it will do anything, but since there is CPU to play with... Not sure if 16,384 concurrency would break it, but probably; there are some allocations that are a function of the thread count which might need tweaking for that, but trying to run with that concurrency breaks my wrk server. The response payload is 1.5× bigger and has some dynamic elements (Date), so it may perform worse; but the response-end tweak might help?
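A sketch of the response composition described above (Python here purely to show the wire format; the header set and RFC1123 Date formatting are standard HTTP, but the actual server is managed .NET code, so this is an illustration rather than the implementation):

```python
from email.utils import formatdate

def build_response(body=b"Hello, World!"):
    # formatdate(usegmt=True) yields an RFC1123 date,
    # e.g. "Sun, 06 Nov 1994 08:49:37 GMT"
    headers = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n"
        f"Date: {formatdate(usegmt=True)}\r\n"
        "\r\n"
    ).encode("ascii")
    return headers + body

resp = build_response()
```

In the real server the Date string would be cached and refreshed about once a second rather than formatted per response, since it only changes at that granularity.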
Think I found the other bug...
Should work better
Should also pass the 16,384 concurrency test on TechEmpower's 40-core environment
Seemed to exhaust them in some tests when given the exact amount; currently over-provisioned by 4×; only seems to need 4.
We moved things around in the repo. Can you rebase and move your project to the
Beyond my git ken; I have created another PR
#9 Managed Registered IO Server, using IOCP