What version of Go are you using (go version)?
go1.19.3 linux/amd64
Does this issue reproduce with the latest release?
Yes: go1.19.3 is the latest stable release. I have not tried with tip.
What operating system and processor architecture are you using (go env)?
linux/amd64
What did you do?
I have an app that serves HTTP requests and consumes other HTTP services as backends. The work required to process an inbound request is highly variable. Serving the largest requests involves rapid creation of several hundred goroutines, each of which does a bit of work and then allows other parts of the associated request to proceed. Smaller requests have a similar pattern, but may use only a few dozen goroutines.
Because request size is so variable, the program tries to limit the maximum concurrency that any single inbound request can consume, so smaller requests get a chance to use various HTTP clients and to have on-CPU time. It does this by creating a chan struct{} for each request, to use the channel's capacity as a semaphore. Before and after doing work for a request, a goroutine needs to interact with the channel to send or receive a permission token.
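For illustration, the pattern looks roughly like this (a sketch only; the names and the limit value are hypothetical, not the app's actual code):

```go
package limiter

// newRequestSem builds the per-request semaphore: the channel's capacity
// is the most goroutines from one request allowed to do work at once.
func newRequestSem(limit int) chan struct{} {
	return make(chan struct{}, limit)
}

// runLimited brackets one unit of work (~100 µs in this app) with the
// request's semaphore.
func runLimited(sem chan struct{}, work func()) {
	sem <- struct{}{}        // acquire a token; blocks once the limit is reached
	defer func() { <-sem }() // release the token
	work()
}
```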
This app runs on machines that provide 48 hardware threads as a single NUMA node. It uses the default GOMAXPROCS setting, leading to 48 Ps.
What did you expect to see?
I expected chan struct{} to be an effective way to build a semaphore. I expected the runtime to mediate access to the channel in a way that scales well to 48 threads. I expected the runtime to allow forward progress by goroutines that did not attempt to interact with an especially busy channel.
What did you see instead?
The app's work is bursty. When a large request arrives, there's a sudden increase in the amount of available work. Most units of work take around 100 µs to complete. During some bursts of work, the number of runnable goroutines grows to around 1000 and all Ps are able to cooperate in draining that queue.
But during other bursts of work, the execution tracer shows those goroutines taking more than a millisecond of on-CPU time when a P picks them up. The CPUSample events that arrive on goroutines that are taking an unusually long amount of time to do their usual work show that the P is busy trying to get the runtime-internal lock for one of the concurrency-limiting channels.
I think that part of what's going on is that a single goroutine launches a variable number of additional goroutines, and if they don't all fit in the P's local run queue (capacity of 256), they spill out onto the global run queue. Go's scheduler is good at spreading out that work, so if 700 goroutines land there and one of 48 Ps looks for work, it'll pick up about 14 of them and leave the rest for another P to find. Soon, every P in the program has a dozen goroutines in its local run queue, each of which needs to obtain the lock on the single channel they all share (because they were launched by the same goroutine as part of the same request) before doing any real work.
Goroutines related to other requests, whose concurrency-limiting channels are not contended (small / "mouse" requests), are stuck behind that elephant request's goroutines and cannot get running time on a P.
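To make the shape concrete, here is a standalone sketch of one such burst (sizes and names are illustrative; this is not the app's code):

```go
package main

import (
	"sync"
	"time"
)

// busy spins for roughly d to stand in for one ~100 µs unit of on-CPU work.
func busy(d time.Duration) {
	for start := time.Now(); time.Since(start) < d; {
	}
}

func main() {
	// One "elephant" request: a single goroutine fans out several hundred
	// goroutines that all share one concurrency-limiting channel, so each of
	// them must take that channel's runtime-internal lock before and after
	// its unit of work.
	sem := make(chan struct{}, 32) // illustrative per-request limit

	var wg sync.WaitGroup
	for i := 0; i < 700; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // acquire
			busy(100 * time.Microsecond)
			<-sem // release
		}()
	}
	wg.Wait()
}
```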
On its surface, this looks like a fine way to write Go code. IIUC, it's common for an "introduction to Go concurrency" to say that goroutines are cheap, to suggest using a shared channel like this for controlling concurrency (if desired), and to advise against worker pools (which in this app's case, we could size to below GOMAXPROCS). Having 48 Ps is a bit big, but it's not even into the realm of NUMA these days.
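(For comparison, the worker-pool shape mentioned above would look roughly like the following; again only a sketch, with illustrative names and sizes.)

```go
package main

import (
	"runtime"
	"sync"
)

func main() {
	// A fixed pool of workers, sized below GOMAXPROCS, pulls units of work
	// from a channel instead of spawning one goroutine per unit of work.
	workers := runtime.GOMAXPROCS(0) / 2
	if workers < 1 {
		workers = 1
	}

	tasks := make(chan func())
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for task := range tasks {
				task() // one ~100 µs unit of work
			}
		}()
	}

	for i := 0; i < 700; i++ {
		tasks <- func() {}
	}
	close(tasks)
	wg.Wait()
}
```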
CC @golang/runtime
Below is an execution trace that shows a few bursts of work that are large enough that no P is left asleep.
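(For context on how these samples appear in the trace: CPUSample events show up in an execution trace when CPU profiling is active while the trace is being collected. The setup looks roughly like this; file names are illustrative.)

```go
package main

import (
	"os"
	"runtime/pprof"
	"runtime/trace"
)

func main() {
	// Collect a CPU profile and an execution trace at the same time so the
	// trace includes CPUSample events (Go 1.19+).
	cpu, err := os.Create("cpu.pprof")
	if err != nil {
		panic(err)
	}
	defer cpu.Close()
	if err := pprof.StartCPUProfile(cpu); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	out, err := os.Create("trace.out")
	if err != nil {
		panic(err)
	}
	defer out.Close()
	if err := trace.Start(out); err != nil {
		panic(err)
	}
	defer trace.Stop()

	// ... serve requests / run the workload here ...
}
```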
First, from 370–380 ms, the process earns 32 CPUSample events. The leaf-most function in those is runtime.futex in 10 samples, runtime.procyield in 4 samples, and 1 sample each in runtime.chanrecv, runtime.lock2, and runtime.osyield. The other 15 samples are in application code and runtime calls unrelated to locking. The execution trace shows relatively quick execution for each goroutine.
Second, from 380–390 ms, the process earns 17 CPUSample events. The leaf-most function in those is runtime.chansend in 1 sample and runtime.sellock (for a different type of channel) in 1 sample. The other 15 samples are again in non-lock-related code, and the execution trace shows quick execution.
Finally, the problem: From 390–400 ms, the process earns 40 CPUSample events. The leaf-most function in those is runtime.futex in 33 samples. The other 7 samples are unrelated to locking. The execution trace shows that when a goroutine runs, it uses the P for multiple milliseconds, much longer than the typical 100 µs.
The runtime portion of the CPUSample events looks like the following. Note that this runs on a Linux machine, so these events correspond to timer_create timers delivering a thread-targeted SIGPROF during the kernel's timer interrupt. They arrive in bursts, and the bursts come every 4 ms (CONFIG_HZ=250).
393019281 CPUSample p=9 g=335635497 off=11040816 ts=0 p=9 g=335635497
46fe83 runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s:560
435cd5 runtime.futexsleep /usr/local/go/src/runtime/os_linux.go:69
40bd78 runtime.lock2 /usr/local/go/src/runtime/lock_futex.go:107
406957 runtime.lockWithRank /usr/local/go/src/runtime/lockrank_off.go:22
40694f runtime.lock /usr/local/go/src/runtime/lock_futex.go:48
406939 runtime.chanrecv /usr/local/go/src/runtime/chan.go:511
4067d7 runtime.chanrecv1 /usr/local/go/src/runtime/chan.go:442
397010328 CPUSample p=47 g=335636655 off=11041032 ts=0 p=47 g=335636655
46fe83 runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s:560
435cd5 runtime.futexsleep /usr/local/go/src/runtime/os_linux.go:69
40bd78 runtime.lock2 /usr/local/go/src/runtime/lock_futex.go:107
405b59 runtime.lockWithRank /usr/local/go/src/runtime/lockrank_off.go:22
405b51 runtime.lock /usr/local/go/src/runtime/lock_futex.go:48
405b3b runtime.chansend /usr/local/go/src/runtime/chan.go:202
405a7c runtime.chansend1 /usr/local/go/src/runtime/chan.go:145
397029549 CPUSample p=31 g=335636665 off=11041248 ts=0 p=31 g=335636665
46fe83 runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s:560
435db5 runtime.futexwakeup /usr/local/go/src/runtime/os_linux.go:82
40bf12 runtime.unlock2 /usr/local/go/src/runtime/lock_futex.go:121
405ffc runtime.unlockWithRank /usr/local/go/src/runtime/lockrank_off.go:32
405ff2 runtime.unlock /usr/local/go/src/runtime/lock_futex.go:112
405ff1 runtime.chansend /usr/local/go/src/runtime/chan.go:228
405a7c runtime.chansend1 /usr/local/go/src/runtime/chan.go:145