
runtime: reduce scheduling overhead on large ppc64 systems #16476

@bcbrock

Description

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?
    Go 1.7beta2
  2. What operating system and processor architecture are you using (go env)?
    X86 (Ivy Bridge) - Ubuntu 16.04
    POWER8 - Ubuntu 15.10

This issue/proposal is a bit long, and likely only of interest to those concerned with goroutine scheduling.

I work on the Hyperledger fabric (https://github.com/hyperledger/fabric), a large Go application implementing a permissioned blockchain. As part of this work I have observed what I would call "excessive" amounts of cumulative time spent in runtime.findrunnable when running on large systems with GOMAXPROCS defaulting to the number of available CPUs. In the following I assume the reader is familiar with the findrunnable routine in proc.go.

Drilling down into a findrunnable profile, the obvious culprit is seen to be the work-stealing loop. This loop is inefficient on large systems for several reasons:

  1. "Spinners" poll the system 4 times while holding a P, and all threads poll once again after releasing their P.

  2. The stealing loop checks for stealable work from all Ps, including Ps that have no possibility of having any work to steal. The atomic operations used to load the queue pointers in runqgrab require synchronization primitives on some architectures, and a subroutine call overhead on all architectures. This global polling is disruptive in an SMP-coherence sense, since the poller must pull cache lines from around the system in order to examine only a few fields of each line. The randomized polling order also defeats the hardware's prefetching heuristics.

Regarding 1): I understand why it is good to poll at least twice - first a pass for easy pickings from the local run queues, and a second pass for the longer-latency runnext stealing. It occurred to me that perhaps 4 loops were made in Go 1.6 because the randomization used there was not guaranteed to visit every P, so polling 4X increased the odds of looking at every local queue. Now that this has been fixed in Go 1.7, polling more than twice is arguably not necessary. The polling with runnext grabs included is so thorough that once this loop is finished there is no a priori reason to expect that another pass will bear fruit.
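
To make the two-pass idea concrete, here is a minimal, self-contained sketch. The toyP type, steal, and findWork below are simplified stand-ins for the runtime's P, runqsteal, and findrunnable; this is illustrative code, not the actual proc.go implementation, and it visits Ps in index order rather than the randomized order the real loop uses.

    package main

    import "fmt"

    // toyP is a simplified stand-in for a runtime P: a local run queue
    // plus a runnext slot, holding goroutine IDs instead of g pointers.
    type toyP struct {
        runq    []int
        runnext int // 0 means empty
    }

    // steal tries to take work from victim. Only on the second pass
    // (stealRunNext == true) may it also take the runnext slot, which
    // mirrors the longer-latency runnext stealing described above.
    func steal(victim *toyP, stealRunNext bool) (int, bool) {
        if n := len(victim.runq); n > 0 {
            g := victim.runq[n-1]
            victim.runq = victim.runq[:n-1]
            return g, true
        }
        if stealRunNext && victim.runnext != 0 {
            g := victim.runnext
            victim.runnext = 0
            return g, true
        }
        return 0, false
    }

    // findWork polls every other P exactly twice: a cheap first pass over
    // the local run queues, then one more pass that also considers runnext.
    // With an order that is guaranteed to visit every P, a third or fourth
    // pass has no a priori reason to find anything the second one missed.
    func findWork(self int, allp []*toyP) (int, bool) {
        for pass := 0; pass < 2; pass++ {
            stealRunNext := pass > 0
            for i, p := range allp {
                if i == self {
                    continue
                }
                if g, ok := steal(p, stealRunNext); ok {
                    return g, true
                }
            }
        }
        return 0, false
    }

    func main() {
        allp := []*toyP{
            {},                  // P0: the stealing P, currently idle
            {runnext: 42},       // P1: only a runnext candidate
            {runq: []int{7, 9}}, // P2: ordinary queued work
        }
        g, ok := findWork(0, allp)
        fmt.Println(g, ok) // 9 true: stolen from P2's queue on the first pass
    }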

Regarding 2): Note that the answer to the question "Could this P possibly have any work to steal?" can be efficiently centralized, since it is modified relatively rarely but observed often. I've created a modified scheduler that includes a global array called mayhavework, indexed by the id of a P. Currently, mayhavework[i] is false whenever a P is queued in the list of idle Ps, and true otherwise. More aggressive update protocols are also possible, but this simple protocol is sufficient to illustrate the benefit.

Setting/clearing mayhavework[i] adds a small overhead to the queue management of idle Ps, as well as a test during polling loops. Note that the polling loop in the "delicate dance" already includes what appears to be a redundant guard of allp[i] != nil, which is not made by the work-stealing loop.
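
A minimal sketch of the bookkeeping I have in mind follows. The pidleput/pidleget names echo the runtime's idle-P helpers and maxP is an illustrative bound; this is not the modified scheduler itself, just an outline of the update path and the guard in the stealing loop.

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    const maxP = 128 // illustrative bound on the number of Ps

    // mayhavework[i] is 0 while P i sits on the idle list and 1 otherwise.
    // It is written rarely (idle-list transitions) but read on every
    // iteration of the stealing loop, so a centralized array keeps the
    // common read cheap and avoids pulling each victim P's queue cache
    // lines across the machine just to find them empty.
    var mayhavework [maxP]uint32

    // pidleput marks P id as idle; the real runtime would also link the P
    // onto the scheduler's idle list under sched.lock.
    func pidleput(id int32) {
        atomic.StoreUint32(&mayhavework[id], 0)
    }

    // pidleget marks P id as possibly having work again as it leaves the
    // idle list.
    func pidleget(id int32) {
        atomic.StoreUint32(&mayhavework[id], 1)
    }

    // tryStealFrom stands in for one iteration of the work-stealing loop:
    // the guard skips Ps that cannot have anything to steal before any of
    // the victim's own queue pointers are loaded.
    func tryStealFrom(victim int32) (attempted bool) {
        if atomic.LoadUint32(&mayhavework[victim]) == 0 {
            return false // idle P: skip it without touching its cache lines
        }
        // ... a runqgrab-style steal attempt would go here ...
        return true
    }

    func main() {
        pidleget(3) // P3 leaves the idle list
        pidleput(5) // P5 goes idle
        fmt.Println(tryStealFrom(3), tryStealFrom(5)) // true false
    }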

Here are some results for an example Hyperledger fabric benchmark running on a 4-socket X86 Ivy Bridge server with 120 hardware threads. These examples are for illustration only and are not claimed to be exhaustive; the arguments for the proposal should be valid based on first principles. Performance (throughput) of the server is measured in transactions per second (TPS). Cumulative profile percentages were reported by the Go net/http/pprof profiling service running in the application. Results for GOMAXPROCS equal to 12 and 120 (the default) are presented.

GOMAXPROCS = 12
-------------------------------------------------------------------------
                        Baseline   2 Stealing Loops Only   Full Proposal
-------------------------------------------------------------------------
Throughput               996 TPS          987 TPS              997 TPS
runtime.findrunnable      14.0%            13.5%                14.1%
-------------------------------------------------------------------------

GOMAXPROCS = 120
-------------------------------------------------------------------------
                        Baseline   2 Stealing Loops Only   Full Proposal
-------------------------------------------------------------------------
Throughput               991 TPS          963 TPS              997 TPS
runtime.findrunnable      28.2%            21.9%                16.5%
-------------------------------------------------------------------------

The full proposal has no effect on findrunnable overhead or performance on this system with GOMAXPROCS=12. However, I have also run the experiment on a POWER8 server and observed a reduction in findrunnable overhead from 14.5% to 9.4% on that system with GOMAXPROCS=12. This may be due to the fact that atomic.Load includes a synchronization instruction on POWER.

For the full system (GOMAXPROCS = 120) there is a significant reduction in scheduling overhead. It is not clear whether the slight performance drop in the "2 Stealing Loops Only" case is real or due to experimental variation. In a number of experiments (on POWER8) I have seen what I believe are small, real performance increases and decreases from these modified heuristics, varying with the particular benchmark.
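
For reference, the cumulative findrunnable percentages in the tables above were collected through the standard net/http/pprof handlers; a minimal sketch of that setup (the address is illustrative) is:

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
    )

    func main() {
        // Serve the profiling endpoints alongside the application; the
        // address is illustrative.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        // ... the application's real work would run here ...
        select {}
    }

A CPU profile can then be pulled with go tool pprof http://localhost:6060/debug/pprof/profile, and the cumulative column of pprof's top report gives the runtime.findrunnable share.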

To summarize the proposal:

  1. Only poll twice in the work stealing loop;

  2. Implement an efficient centralized data structure that records which Ps might possibly have any work to steal.

Bishop Brock
