
runtime: scheduler work stealing slow for high GOMAXPROCS #28808

@ChrisHines

Description


What version of Go are you using (go version)?

We don't have Go installed in the production environment. The program reports this information as part of its version output.

Go Version:     "go1.11"
Go Compiler:    "gc"
Go ARCH:        "amd64"
Go OS:          "linux"

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

We don't have Go installed in the production environment. I think the following info is relevant to this issue.

$ cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                56
On-line CPU(s) list:   0-55
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
Stepping:              1
CPU MHz:               1207.421
BogoMIPS:              4004.63
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55

What did you do?

We have an application that primarily retrieves data in large chunks from a node-local cache server via HTTP, packetizes it, and sends the packets out at a target bitrate via UDP. It does this for many concurrent streaming sessions per node. A node with the CPU shown above will serve several hundred streaming sessions at once.

The application currently starts separate goroutines for each streaming session. Each session has a goroutine responsible for outputting the UDP packets as close to the target bitrate as possible. That goroutine uses a time.Ticker to wake up periodically and transmit data. The Go execution tracer shows that each of these transmitting goroutines typically runs for about 15µs once every 2.4ms.
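For concreteness, each transmitting goroutine has roughly this shape (a minimal sketch, not our production code; conn and nextPackets are hypothetical stand-ins for the real connection and packetizer, and the net and time imports are omitted):

func transmitLoop(conn *net.UDPConn, nextPackets func() [][]byte, done <-chan struct{}) {
	// Wake up once per pacing interval (~2.4ms) and spend a short burst
	// of CPU time (~15µs) writing the packets that are due.
	ticker := time.NewTicker(2400 * time.Microsecond)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			for _, pkt := range nextPackets() {
				conn.Write(pkt)
			}
		}
	}
}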

Until recently the application was built with Go 1.9. We found empirically that running five instances of the program on a node, each handling one fifth of the work, served our workload best. At the time we didn't understand why we had to shard the workload across multiple processes, but the empirical data was conclusive.

We built our latest release with Go 1.11 and began load testing it with the same configuration of five instances per node.

What did you expect to see?

We expected to see that the new release built with Go 1.11 performed at least as well as the previous release built with Go 1.9.

What did you see instead?

The new release built with Go 1.11 consumed significantly more CPU for the same workload than the previous release built with Go 1.9. We ruled out our own changes as the cause by rebuilding the previous release with Go 1.11; that build also consumed significantly more CPU under the same load than the same code built with Go 1.9.

We collected CPU profile data from these three builds of the program under load and began looking for culprits. The data showed that the Go 1.11 builds were spending about 3x as much CPU time in runtime.findrunnable and its helpers as the Go 1.9 builds.

Looking at the commit history since Go 1.9 for that part of the runtime, we identified the commit "runtime: improve timers scalability on multi-CPU systems" as the only change that seemed relevant. But we were puzzled: despite the improvements to timer handling, our program showed increased CPU load rather than improved performance.

After further analysis, however, we realized that the inefficient timer implementation was the likely reason we had been forced to shard the load across five processes with Go 1.9. The profile data for the Go 1.11 build showed that the cumulative time spent in runtime.runqsteal was the largest contributor to the cumulative time of runtime.findrunnable, so we hypothesized that, with the timer handling bottleneck reduced, each P in the scheduler now advances to the work stealing loop instead of contending for the timer lock. Furthermore, because we were running on hardware with 56 hardware threads and had not set an explicit GOMAXPROCS, that work stealing loop is expensive, especially when it finds all the other run queues empty, which we confirmed to be the common case by running with GODEBUG=schedtrace=1000.

With that hypothesis looking sound, we reasoned that running each of the five Go 1.11 processes with GOMAXPROCS=12 would reduce the work stealing iterations without underutilizing the available hardware threads. This also matches the conclusion of the similar (now closed) issue #16476. Load tests with five instances of a Go 1.11 build, each with GOMAXPROCS=12, showed CPU consumption comparable to the Go 1.9 builds. This is a reasonable workaround and we are using it now.
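For reference, the workaround simply caps the number of Ps per process, either by exporting GOMAXPROCS=12 in the environment or at startup. A minimal sketch, assuming "runtime" is imported; 12 is what fits our five-processes-per-node layout, not a general recommendation:

func init() {
	// Equivalent to exporting GOMAXPROCS=12 in the process environment.
	runtime.GOMAXPROCS(12)
}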

Although there is a simple workaround, it is unsettling that the runtime scheduler does not scale well for this type of workload on high core counts. Notably, the work stealing algorithm degenerates to O(N²) due to N cores all inspecting each other's mostly empty run queues. The CPU time spent fruitlessly attempting to steal work from empty run queues contributes to overall system load and competes with the demands of other processes on the node, such as the local content cache server mentioned earlier.
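To make the O(N²) shape concrete, here is a heavily simplified sketch of the stealing phase each idle P runs inside runtime.findrunnable (the real code is in runtime/proc.go; it visits Ps in a randomized order, rechecks global state between probes, and only steals runnext slots on later passes). With N mostly idle Ps each scanning the other N-1 run queues, the process as a whole performs on the order of N² queue inspections per scheduling round:

// Simplified sketch only; not the actual runtime source.
func stealWork(self *p, allp []*p) *g {
	for attempt := 0; attempt < 4; attempt++ {
		for _, p2 := range allp {
			if p2 == self {
				continue
			}
			// In our workload these queues are almost always empty,
			// so the entire scan is usually wasted CPU time.
			if gp := runqsteal(self, p2, attempt > 2); gp != nil {
				return gp
			}
		}
	}
	return nil // nothing to steal; fall through to the even costlier sleep path
}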

The problem I've described here is almost the same as #18237, in which @aclements explained:

Given how much time you're spending in findrunnable, it sounds like you're constantly switching between having something to do and being idle. Presumably the 1500 byte frames are coming in just a little slower than you can process them, so the runtime is constantly looking for work to do, going to sleep, and then immediately being woken up for the next frame. This is the most expensive path in the scheduler (we optimize for the case where there's another goroutine ready to run, which is extremely fast) and there's an implicit assumption here that the cost of going to sleep doesn't really matter if there's nothing to do. But that's violated if new work is coming in at just the wrong rate.

I am creating a new issue because our goroutines are made runnable largely by runtime timers instead of network events. I believe that is a significant difference because although the scheduler cannot easily predict the arrival of network events, runtime timers have known expiration times.

Reproducing

Below is a small program that models the critical path of the application. The code will not run on the playground, but here is a link in case that is more convenient than the inline code below: https://play.golang.org/p/gcGT2v2mZjU

Profile data we've collected from this program bears a strong resemblance to the profile data from the larger program in which we originally observed this issue, at least with regard to the Go runtime functions involved.

package main

import (
	"flag"
	"log"
	"math/rand"
	"os"
	"os/signal"
	"runtime"
	"runtime/pprof"
	"runtime/trace"
	"sync"
	"time"
)

func main() {
	var (
		runTime        = flag.Duration("runtime", 10*time.Second, "Run `duration` after target goroutine count is reached")
		workDur        = flag.Duration("work", 15*time.Microsecond, "CPU bound work `duration` each cycle")
		cycleDur       = flag.Duration("cycle", 2400*time.Microsecond, "Cycle `duration`")
		gCount         = flag.Int("gcount", runtime.NumCPU(), "Number of `goroutines` to use")
		gStartFreq     = flag.Int("gfreq", 1, "Number of goroutines to start each second until gcount is reached")
		cpuProfilePath = flag.String("cpuprofile", "", "Write CPU profile to `file`")
		tracePath      = flag.String("trace", "", "Write execution trace to `file`")
	)

	flag.Parse()

	sigC := make(chan os.Signal, 1)
	signal.Notify(sigC, os.Interrupt)

	var wg sync.WaitGroup
	done := make(chan struct{})
	stop := make(chan struct{})

	wg.Add(1)
	go func() {
		defer wg.Done()
		select {
		case sig := <-sigC:
			log.Print("got signal ", sig)
		case <-stop:
		}
		close(done)
	}()

	gFreq := time.Second / time.Duration(*gStartFreq)
	jitterCap := int64(gFreq / 2)

	for g := 0; g < *gCount; g++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			ticker := time.NewTicker(*cycleDur)
			defer ticker.Stop()
			for {
				select {
				case <-done:
					return
				case <-ticker.C:
					workUntil(time.Now().Add(*workDur))
				}
			}
		}(g)
		log.Print("goroutine count: ", g+1)
		jitter := time.Duration(rand.Int63n(jitterCap))
		select {
		case <-done:
			g = *gCount // stop loop early
		case <-time.After(gFreq + jitter):
		}
	}

	select {
	case <-done:
	default:
		log.Print("running for ", *runTime)
		runTimer := time.NewTimer(*runTime)
		wg.Add(1)
		go func() {
			defer wg.Done()
			select {
			case <-runTimer.C:
				log.Print("runTimer fired")
				close(stop)
			case <-done:
			}
		}()
	}

	if *cpuProfilePath != "" {
		f, err := os.Create(*cpuProfilePath)
		if err != nil {
			log.Fatal("could not create CPU profile: ", err)
		}
		if err := pprof.StartCPUProfile(f); err != nil {
			log.Fatal("could not start CPU profile: ", err)
		}
		log.Print("profiling")
		defer pprof.StopCPUProfile()
	}

	if *tracePath != "" {
		f, err := os.Create(*tracePath)
		if err != nil {
			log.Fatal("could not create execution trace: ", err)
		}
		defer f.Close()
		if err := trace.Start(f); err != nil {
			log.Fatal("could not start execution trace: ", err)
		}
		log.Print("tracing")
		defer trace.Stop()
	}

	wg.Wait()
}

func workUntil(deadline time.Time) {
	now := time.Now()
	for now.Before(deadline) {
		now = time.Now()
	}
}

Profile Data

We ran the above program in several configurations and captured profile and schedtrace data.
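Each configuration was driven roughly like this, five instances per node (the flag values shown are illustrative rather than the exact ones we used):

$ GOMAXPROCS=56 GODEBUG=schedtrace=1000 ./sched-test-linux-11 -gcount 500 -gfreq 25 -runtime 60s -cpuprofile cpu.pprof

The GOMAXPROCS=12 runs used the same command line with GOMAXPROCS=12 in the environment.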

Go 1.9, GOMAXPROCS=56, 5 procs x 500 goroutines

schedtrace sample

SCHED 145874ms: gomaxprocs=56 idleprocs=50 threads=60 spinningthreads=1 idlethreads=50 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 146880ms: gomaxprocs=56 idleprocs=43 threads=60 spinningthreads=4 idlethreads=43 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 147886ms: gomaxprocs=56 idleprocs=49 threads=60 spinningthreads=1 idlethreads=49 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 148892ms: gomaxprocs=56 idleprocs=56 threads=60 spinningthreads=0 idlethreads=56 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 149898ms: gomaxprocs=56 idleprocs=50 threads=60 spinningthreads=1 idlethreads=50 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

pprof data

File: sched-test-linux-9
Type: cpu
Time: Oct 30, 2018 at 3:26pm (EDT)
Duration: 1mins, Total samples = 4.70mins (468.80%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20 -cum
Showing nodes accounting for 229.49s, 81.39% of 281.97s total
Dropped 61 nodes (cum <= 1.41s)
Showing top 20 nodes out of 43
      flat  flat%   sum%        cum   cum%
         0     0%     0%    143.03s 50.73%  runtime._System
   135.95s 48.21% 48.21%    135.95s 48.21%  runtime._ExternalCode
     0.18s 0.064% 48.28%     73.63s 26.11%  runtime.mcall
     0.06s 0.021% 48.30%     73.45s 26.05%  runtime.park_m
     0.67s  0.24% 48.54%     72.55s 25.73%  runtime.schedule
    11.32s  4.01% 52.55%     60.58s 21.48%  runtime.findrunnable
     0.97s  0.34% 52.90%     56.95s 20.20%  main.main.func2
     7.71s  2.73% 55.63%     41.98s 14.89%  main.workUntil
     9.40s  3.33% 58.96%     29.40s 10.43%  time.Now
    25.23s  8.95% 67.91%     25.23s  8.95%  runtime.futex
    11.73s  4.16% 72.07%        20s  7.09%  time.now
     3.92s  1.39% 73.46%     19.21s  6.81%  runtime.runqsteal
     0.59s  0.21% 73.67%     15.43s  5.47%  runtime.stopm
    15.06s  5.34% 79.01%     15.29s  5.42%  runtime.runqgrab
     0.11s 0.039% 79.05%     14.20s  5.04%  runtime.futexsleep
     0.92s  0.33% 79.38%     13.86s  4.92%  runtime.notesleep
     4.47s  1.59% 80.96%     12.79s  4.54%  runtime.selectgo
     0.43s  0.15% 81.12%     12.30s  4.36%  runtime.startm
     0.56s   0.2% 81.31%     11.56s  4.10%  runtime.notewakeup
     0.21s 0.074% 81.39%     11.52s  4.09%  runtime.wakep

Go 1.11, GOMAXPROCS=56, 5 procs x 500 goroutines

schedtrace sample

SCHED 144758ms: gomaxprocs=56 idleprocs=52 threads=122 spinningthreads=2 idlethreads=59 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 145764ms: gomaxprocs=56 idleprocs=46 threads=122 spinningthreads=4 idlethreads=55 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 146769ms: gomaxprocs=56 idleprocs=46 threads=122 spinningthreads=3 idlethreads=56 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 147774ms: gomaxprocs=56 idleprocs=46 threads=122 spinningthreads=2 idlethreads=55 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 148780ms: gomaxprocs=56 idleprocs=52 threads=122 spinningthreads=2 idlethreads=60 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
SCHED 149785ms: gomaxprocs=56 idleprocs=46 threads=122 spinningthreads=1 idlethreads=57 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

pprof data

File: sched-test-linux-11
Type: cpu
Time: Oct 30, 2018 at 3:35pm (EDT)
Duration: 1mins, Total samples = 6.43mins (641.46%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20 -cum
Showing nodes accounting for 310.85s, 80.57% of 385.80s total
Dropped 61 nodes (cum <= 1.93s)
Showing top 20 nodes out of 49
      flat  flat%   sum%        cum   cum%
   160.61s 41.63% 41.63%    169.60s 43.96%  time.now
     0.83s  0.22% 41.85%    122.39s 31.72%  runtime.mcall
     0.13s 0.034% 41.88%    121.56s 31.51%  runtime.park_m
     0.64s  0.17% 42.05%    120.12s 31.14%  runtime.schedule
    19.14s  4.96% 47.01%    115.47s 29.93%  runtime.findrunnable
     1.54s   0.4% 47.41%     64.15s 16.63%  main.main.func2
    53.51s 13.87% 61.28%     53.51s 13.87%  runtime.futex
     1.47s  0.38% 61.66%     48.66s 12.61%  runtime.timerproc
    10.70s  2.77% 64.43%     47.49s 12.31%  runtime.runqsteal
     9.88s  2.56% 66.99%     44.13s 11.44%  main.workUntil
     0.14s 0.036% 67.03%     36.91s  9.57%  runtime.notetsleepg
    35.08s  9.09% 76.12%     36.79s  9.54%  runtime.runqgrab
     0.73s  0.19% 76.31%     36.66s  9.50%  runtime.futexsleep
     8.79s  2.28% 78.59%     30.07s  7.79%  time.Now
     1.12s  0.29% 78.88%     26.48s  6.86%  runtime.stopm
     0.33s 0.086% 78.96%     23.16s  6.00%  runtime.systemstack
     1.49s  0.39% 79.35%     22.39s  5.80%  runtime.notesleep
     0.39s   0.1% 79.45%     18.49s  4.79%  runtime.startm
     0.09s 0.023% 79.47%     17.68s  4.58%  runtime.futexwakeup
     4.24s  1.10% 80.57%     17.56s  4.55%  runtime.selectgo

Go 1.11, GOMAXPROCS=12, 5 procs x 500 goroutines

schedtrace sample

SCHED 145716ms: gomaxprocs=12 idleprocs=8 threads=31 spinningthreads=2 idlethreads=11 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 146721ms: gomaxprocs=12 idleprocs=8 threads=31 spinningthreads=1 idlethreads=12 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 147725ms: gomaxprocs=12 idleprocs=8 threads=31 spinningthreads=3 idlethreads=11 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 148730ms: gomaxprocs=12 idleprocs=9 threads=31 spinningthreads=0 idlethreads=12 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 149735ms: gomaxprocs=12 idleprocs=6 threads=31 spinningthreads=1 idlethreads=9 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0]
SCHED 150740ms: gomaxprocs=12 idleprocs=2 threads=31 spinningthreads=3 idlethreads=5 runqueue=0 [0 0 0 0 0 0 0 0 0 0 0 0]

pprof data

File: sched-test-linux-11
Type: cpu
Time: Oct 30, 2018 at 3:32pm (EDT)
Duration: 1mins, Total samples = 4.49mins (447.65%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20 -cum
Showing nodes accounting for 223.33s, 82.96% of 269.19s total
Dropped 61 nodes (cum <= 1.35s)
Showing top 20 nodes out of 46
      flat  flat%   sum%        cum   cum%
   152.74s 56.74% 56.74%    160.99s 59.81%  time.now
     0.82s   0.3% 57.05%     49.80s 18.50%  main.main.func2
     0.44s  0.16% 57.21%     42.75s 15.88%  runtime.mcall
     0.09s 0.033% 57.24%     42.03s 15.61%  runtime.park_m
     0.57s  0.21% 57.45%     41.32s 15.35%  runtime.schedule
    41.16s 15.29% 72.74%     41.16s 15.29%  runtime.futex
     9.33s  3.47% 76.21%     41.09s 15.26%  main.workUntil
        5s  1.86% 78.07%     36.28s 13.48%  runtime.findrunnable
     1.06s  0.39% 78.46%     33.86s 12.58%  runtime.timerproc
     0.55s   0.2% 78.67%     29.45s 10.94%  runtime.futexsleep
     7.98s  2.96% 81.63%     27.25s 10.12%  time.Now
     0.07s 0.026% 81.66%     25.42s  9.44%  runtime.notetsleepg
     0.55s   0.2% 81.86%     17.84s  6.63%  runtime.stopm
     0.72s  0.27% 82.13%     16.36s  6.08%  runtime.notesleep
     1.39s  0.52% 82.64%     15.18s  5.64%  runtime.notetsleep_internal
     0.22s 0.082% 82.73%     13.38s  4.97%  runtime.startm
     0.16s 0.059% 82.79%     12.66s  4.70%  runtime.systemstack
     0.35s  0.13% 82.92%     12.46s  4.63%  runtime.notewakeup
     0.05s 0.019% 82.93%     12.32s  4.58%  runtime.futexwakeup
     0.08s  0.03% 82.96%      8.78s  3.26%  runtime.entersyscallblock

I can test other configurations if it will help.

Metadata

Assignees: no one assigned

Labels: NeedsInvestigation, Performance, compiler/runtime
