
runtime: reduce scheduling overhead on large ppc64 systems #16476

@bcbrock

Description

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?
    Go 1.7beta2
  2. What operating system and processor architecture are you using (go env)?
    X86 (Ivy Bridge) - Ubuntu 16.04
    POWER8 - Ubuntu 15.10

This issue/proposal is a bit long, and likely only of interest to those concerned with goroutine scheduling.

I work on the Hyperledger fabric (https://github.com/hyperledger/fabric), a large Go application implementing a permissioned blockchain. As part of this work I have observed what I would call "excessive" amounts of cumulative time spent in runtime.findrunnable when running on large systems with GOMAXPROCS defaulting to the number of available CPUs. In the following I assume the reader is familiar with the findrunnable routine in proc.go.

Drilling down into a findrunnable profile, the obvious culprit is seen to be the work-stealing loop. This loop is inefficient on large systems for several reasons:

  1. "Spinners" poll the system 4 times while holding a P, and all threads poll once again after releasing their P.

  2. The stealing loop checks for stealable work from all Ps, including Ps that have no possibility of having any work to steal. The atomic operations used to load the queue pointers in runqgrab require synchronization primitives on some architectures, and a subroutine call overhead on all architectures. This global polling is disruptive in an SMP-coherence sense, since the poller must pull cache lines from around the system in order to examine only a few fields of each line. The randomized polling order also defeats the hardware's prefetching heuristics.

Regarding 1): I understand why it is good to poll at least twice - first a pass for easy pickings from the local run queues, and a second pass for the longer-latency runnext stealing. It occurred to me that perhaps 4 loops were made in Go 1.6 because the randomization used there was not guaranteed to visit every P, so polling 4X increased the odds of looking at every local queue. Now that this has been fixed in Go 1.7, polling more than twice is arguably not necessary. The polling with runnext grabs included is so thorough that once this loop is finished there is no a priori reason to expect that another pass will bear fruit.
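
To make the two-pass idea concrete, here is a minimal, self-contained sketch. The toyP type, steal, and findWork below are simplified stand-ins for the runtime's P, runqsteal, and findrunnable; this is illustrative code, not the actual proc.go implementation, and it visits Ps in index order rather than the randomized order the real loop uses.

    package main

    import "fmt"

    // toyP is a simplified stand-in for a runtime P: a local run queue
    // plus a runnext slot, holding goroutine IDs instead of g pointers.
    type toyP struct {
        runq    []int
        runnext int // 0 means empty
    }

    // steal tries to take work from victim. Only on the second pass
    // (stealRunNext == true) may it also take the runnext slot, which
    // mirrors the longer-latency runnext stealing described above.
    func steal(victim *toyP, stealRunNext bool) (int, bool) {
        if n := len(victim.runq); n > 0 {
            g := victim.runq[n-1]
            victim.runq = victim.runq[:n-1]
            return g, true
        }
        if stealRunNext && victim.runnext != 0 {
            g := victim.runnext
            victim.runnext = 0
            return g, true
        }
        return 0, false
    }

    // findWork polls every other P exactly twice: a cheap first pass over
    // the local run queues, then one more pass that also considers runnext.
    // With an order that is guaranteed to visit every P, a third or fourth
    // pass has no a priori reason to find anything the second one missed.
    func findWork(self int, allp []*toyP) (int, bool) {
        for pass := 0; pass < 2; pass++ {
            stealRunNext := pass > 0
            for i, p := range allp {
                if i == self {
                    continue
                }
                if g, ok := steal(p, stealRunNext); ok {
                    return g, true
                }
            }
        }
        return 0, false
    }

    func main() {
        allp := []*toyP{
            {},                  // P0: the stealing P, currently idle
            {runnext: 42},       // P1: only a runnext candidate
            {runq: []int{7, 9}}, // P2: ordinary queued work
        }
        g, ok := findWork(0, allp)
        fmt.Println(g, ok) // 9 true: stolen from P2's queue on the first pass
    }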

Regarding 2): Note that the answer to the question "Could this P possibly have any work to steal?" can be efficiently centralized, since it is modified relatively rarely but observed often. I've created a modified scheduler that includes a global array called mayhavework, indexed by the id of a P. Currently, mayhavework[i] is false whenever a P is queued in the list of idle Ps, and true otherwise. More aggressive update protocols are also possible, but this simple protocol is sufficient to illustrate the benefit.

Setting/clearing mayhavework[i] adds a small overhead to the queue management of idle Ps, as well as a test during polling loops. Note that the polling loop in the "delicate dance" already includes what appears to be a redundant guard of allp[i] != nil, which is not made by the work-stealing loop.
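
A minimal sketch of the bookkeeping I have in mind follows. The pidleput/pidleget names echo the runtime's idle-P helpers and maxP is an illustrative bound; this is not the modified scheduler itself, just an outline of the update path and the guard in the stealing loop.

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    const maxP = 128 // illustrative bound on the number of Ps

    // mayhavework[i] is 0 while P i sits on the idle list and 1 otherwise.
    // It is written rarely (idle-list transitions) but read on every
    // iteration of the stealing loop, so a centralized array keeps the
    // common read cheap and avoids pulling each victim P's queue cache
    // lines across the machine just to find them empty.
    var mayhavework [maxP]uint32

    // pidleput marks P id as idle; the real runtime would also link the P
    // onto the scheduler's idle list under sched.lock.
    func pidleput(id int32) {
        atomic.StoreUint32(&mayhavework[id], 0)
    }

    // pidleget marks P id as possibly having work again as it leaves the
    // idle list.
    func pidleget(id int32) {
        atomic.StoreUint32(&mayhavework[id], 1)
    }

    // tryStealFrom stands in for one iteration of the work-stealing loop:
    // the guard skips Ps that cannot have anything to steal before any of
    // the victim's own queue pointers are loaded.
    func tryStealFrom(victim int32) (attempted bool) {
        if atomic.LoadUint32(&mayhavework[victim]) == 0 {
            return false // idle P: skip it without touching its cache lines
        }
        // ... a runqgrab-style steal attempt would go here ...
        return true
    }

    func main() {
        pidleget(3) // P3 leaves the idle list
        pidleput(5) // P5 goes idle
        fmt.Println(tryStealFrom(3), tryStealFrom(5)) // true false
    }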

Here are some results for an example Hyperledger fabric benchmark running on a 4-socket X86 Ivy Bridge server with 120 hardware threads. These examples are for illustration only and are not claimed to be exhaustive; the arguments for the proposal should be valid based on first principles. Performance (throughput) of the server is measured in transactions per second (TPS). Cumulative profile percentages were reported by the Go net/http/pprof profiling service running in the application. Results for GOMAXPROCS equal to 12 and 120 (the default) are presented.

GOMAXPROCS = 12
-------------------------------------------------------------------------
                        Baseline   2 Stealing Loops Only   Full Proposal
-------------------------------------------------------------------------
Throughput               996 TPS          987 TPS              997 TPS
runtime.findrunnable      14.0%            13.5%                14.1%
-------------------------------------------------------------------------

GOMAXPROCS = 120
-------------------------------------------------------------------------
                        Baseline   2 Stealing Loops Only   Full Proposal
-------------------------------------------------------------------------
Throughput               991 TPS          963 TPS              997 TPS
runtime.findrunnable      28.2%            21.9%                16.5%
-------------------------------------------------------------------------

The full proposal has no effect on findrunnable overhead or performance on this system with GOMAXPROCS=12. However, I have also run the experiment on a POWER8 server and observed a reduction in findrunnable overhead from 14.5% to 9.4% on that system with GOMAXPROCS=12. This may be due to the fact that atomic.Load includes a synchronization instruction on POWER.

For the full system (GOMAXPROCS = 120) there is a significant reduction in scheduling overhead. It is not clear whether the slight performance drop in the "2 Stealing Loops Only" case is real or due to experimental variation. In a number of experiments (on POWER8) I have seen what I believe are small, real performance increases and decreases from these modified heuristics, varying with the particular benchmark.
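
For reference, the cumulative findrunnable percentages in the tables above were collected through the standard net/http/pprof handlers; a minimal sketch of that setup (the address is illustrative) is:

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
    )

    func main() {
        // Serve the profiling endpoints alongside the application; the
        // address is illustrative.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        // ... the application's real work would run here ...
        select {}
    }

A CPU profile can then be pulled with go tool pprof http://localhost:6060/debug/pprof/profile, and the cumulative column of pprof's top report gives the runtime.findrunnable share.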

To summarize the proposal:

  1. Only poll twice in the work stealing loop;

  2. Implement an efficient centralized data structure that records which Ps might possibly have any work to steal.

Bishop Brock
