Commit cf4f1d0

runtime: bound scanobject to ~100 µs

Currently the time spent in scanobject is proportional to the size of
the object being scanned. Since scanobject is non-preemptible, large
objects can cause significant goroutine (and even whole application)
delays through several means:

1. If a GC assist picks up a large object, the allocating goroutine is
   blocked for the whole scan, even if that scan well exceeds that
   goroutine's debt.

2. Since the scheduler does not run on the P performing a large object
   scan, goroutines in that P's run queue do not run unless they are
   stolen by another P (which can take some time). If there are a few
   large objects, all of the Ps may get tied up so the scheduler
   doesn't run anywhere.

3. Even if a large object is scanned by a background worker and other
   Ps are still running the scheduler, the large object scan doesn't
   flush background credit until the whole scan is done. This can
   easily cause all allocations to block in assists, waiting for
   credit, causing an effective STW.

Fix this by splitting large objects into 128 KB "oblets" and scanning
at most one oblet at a time. Since we can scan 1–2 MB/ms, this equates
to bounding scanobject at roughly 100 µs. This improves assist
behavior both because assists can no longer get "unlucky" and be stuck
scanning a large object, and because it causes the background worker
to flush credit and unblock assists more frequently when scanning
large objects. This also improves GC parallelism if the heap consists
primarily of a small number of very large objects by letting multiple
workers scan a large object in parallel.

Fixes #10345. Fixes #16293.

This substantially improves goroutine latency in the benchmark from
issue #16293, which exercises several forms of very large objects:

name                      old max-latency  new max-latency  delta
SliceNoPointer-12         154µs ± 1%       155µs ± 2%       ~        (p=0.087 n=13+12)
SlicePointer-12           314ms ± 1%       5.94ms ±138%     -98.11%  (p=0.000 n=19+20)
SliceLivePointer-12       1148ms ± 0%      4.72ms ±167%     -99.59%  (p=0.000 n=19+20)
MapNoPointer-12           72509µs ± 1%     408µs ±325%      -99.44%  (p=0.000 n=19+18)
ChanPointer-12            313ms ± 0%       4.74ms ±140%     -98.49%  (p=0.000 n=18+20)
ChanLivePointer-12        1147ms ± 0%      3.30ms ±149%     -99.71%  (p=0.000 n=19+20)

name                      old P99.9-latency  new P99.9-latency  delta
SliceNoPointer-12         113µs ±25%         107µs ±12%         ~        (p=0.153 n=20+18)
SlicePointer-12           309450µs ± 0%      133µs ±23%         -99.96%  (p=0.000 n=20+20)
SliceLivePointer-12       961ms ± 0%         1.35ms ±27%        -99.86%  (p=0.000 n=20+20)
MapNoPointer-12           448µs ±288%        119µs ±18%         -73.34%  (p=0.000 n=18+20)
ChanPointer-12            309450µs ± 0%      134µs ±23%         -99.96%  (p=0.000 n=20+19)
ChanLivePointer-12        961ms ± 0%         1.35ms ±27%        -99.86%  (p=0.000 n=20+20)

This has negligible effect on all metrics from the garbage, JSON, and
HTTP x/benchmarks.

It shows slight improvement on some of the go1 benchmarks,
particularly Revcomp, which uses some multi-megabyte buffers:

name                      old time/op    new time/op    delta
BinaryTree17-12           2.46s ± 1%     2.47s ± 1%     +0.32%  (p=0.012 n=20+20)
Fannkuch11-12             2.82s ± 0%     2.81s ± 0%     -0.61%  (p=0.000 n=17+20)
FmtFprintfEmpty-12        50.8ns ± 5%    50.5ns ± 2%    ~       (p=0.197 n=17+19)
FmtFprintfString-12       131ns ± 1%     132ns ± 0%     +0.57%  (p=0.000 n=20+16)
FmtFprintfInt-12          117ns ± 0%     116ns ± 0%     -0.47%  (p=0.000 n=15+20)
FmtFprintfIntInt-12       180ns ± 0%     179ns ± 1%     -0.78%  (p=0.000 n=16+20)
FmtFprintfPrefixedInt-12  186ns ± 1%     185ns ± 1%     -0.55%  (p=0.000 n=19+20)
FmtFprintfFloat-12        263ns ± 1%     271ns ± 0%     +2.84%  (p=0.000 n=18+20)
FmtManyArgs-12            741ns ± 1%     742ns ± 1%     ~       (p=0.190 n=19+19)
GobDecode-12              7.44ms ± 0%    7.35ms ± 1%    -1.21%  (p=0.000 n=20+20)
GobEncode-12              6.22ms ± 1%    6.21ms ± 1%    ~       (p=0.336 n=20+19)
Gzip-12                   220ms ± 1%     219ms ± 1%     ~       (p=0.130 n=19+19)
Gunzip-12                 37.9ms ± 0%    37.9ms ± 1%    ~       (p=1.000 n=20+19)
HTTPClientServer-12       82.5µs ± 3%    82.6µs ± 3%    ~       (p=0.776 n=20+19)
JSONEncode-12             16.4ms ± 1%    16.5ms ± 2%    +0.49%  (p=0.003 n=18+19)
JSONDecode-12             53.7ms ± 1%    54.1ms ± 1%    +0.71%  (p=0.000 n=19+18)
Mandelbrot200-12          4.19ms ± 1%    4.20ms ± 1%    ~       (p=0.452 n=19+19)
GoParse-12                3.38ms ± 1%    3.37ms ± 1%    ~       (p=0.123 n=19+19)
RegexpMatchEasy0_32-12    72.1ns ± 1%    71.8ns ± 1%    ~       (p=0.397 n=19+17)
RegexpMatchEasy0_1K-12    242ns ± 0%     242ns ± 0%     ~       (p=0.168 n=17+20)
RegexpMatchEasy1_32-12    72.1ns ± 1%    72.1ns ± 1%    ~       (p=0.538 n=18+19)
RegexpMatchEasy1_1K-12    385ns ± 1%     384ns ± 1%     ~       (p=0.388 n=20+20)
RegexpMatchMedium_32-12   112ns ± 1%     112ns ± 3%     ~       (p=0.539 n=20+20)
RegexpMatchMedium_1K-12   34.4µs ± 2%    34.4µs ± 2%    ~       (p=0.628 n=18+18)
RegexpMatchHard_32-12     1.80µs ± 1%    1.80µs ± 1%    ~       (p=0.522 n=18+19)
RegexpMatchHard_1K-12     54.0µs ± 1%    54.1µs ± 1%    ~       (p=0.647 n=20+19)
Revcomp-12                387ms ± 1%     369ms ± 5%     -4.89%  (p=0.000 n=17+19)
Template-12               62.3ms ± 1%    62.0ms ± 0%    -0.48%  (p=0.002 n=20+17)
TimeParse-12              314ns ± 1%     314ns ± 0%     ~       (p=1.011 n=20+13)
TimeFormat-12             358ns ± 0%     354ns ± 0%     -1.12%  (p=0.000 n=17+20)
[Geo mean]                53.5µs         53.3µs         -0.23%

Change-Id: I2a0a179d1d6bf7875dd054b7693dd12d2a340132
Reviewed-on: https://go-review.googlesource.com/23540
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Rick Hudson <[email protected]>
1 parent b275e55 commit cf4f1d0
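
The "roughly 100 µs" figure in the commit message follows from dividing the oblet size by the quoted scan rate. The sketch below works through that arithmetic; `scanTime` is a hypothetical helper, not a runtime function, and it assumes "MB" means 2^20 bytes.

```go
package main

import (
	"fmt"
	"time"
)

// maxObletBytes mirrors the constant added by this commit (128 KB).
const maxObletBytes = 128 << 10

// scanTime is a hypothetical helper: it estimates how long scanning n
// bytes takes at a given scan rate expressed in bytes per millisecond.
func scanTime(n, bytesPerMs int64) time.Duration {
	return time.Duration(n * int64(time.Millisecond) / bytesPerMs)
}

func main() {
	// Assumption: 1 MB = 1<<20 bytes. The commit quotes 1–2 MB/ms.
	slow := scanTime(maxObletBytes, 1<<20) // at 1 MB/ms
	fast := scanTime(maxObletBytes, 2<<20) // at 2 MB/ms
	fmt.Println("one oblet at 1 MB/ms:", slow)
	fmt.Println("one oblet at 2 MB/ms:", fast)
}
```

Under these assumptions a single oblet takes between about 62 µs and 125 µs to scan, which brackets the ~100 µs bound in the commit title.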

3 files changed

+64
-7
lines changed
src/runtime/mgc.go

Lines changed: 9 additions & 0 deletions
@@ -122,6 +122,15 @@
 // proportion to the allocation cost. Adjusting GOGC just changes the linear constant
 // (and also the amount of extra memory used).
 
+// Oblets
+//
+// In order to prevent long pauses while scanning large objects and to
+// improve parallelism, the garbage collector breaks up scan jobs for
+// objects larger than maxObletBytes into "oblets" of at most
+// maxObletBytes. When scanning encounters the beginning of a large
+// object, it scans only the first oblet and enqueues the remaining
+// oblets as new scan jobs.
+
 package runtime
 
 import (
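
The splitting rule in the "Oblets" comment can be illustrated outside the runtime. This is a minimal sketch, not the runtime's code: the `oblet` type and `splitOblets` function are hypothetical names, and offsets are relative to the object base rather than real heap addresses.

```go
package main

import "fmt"

// maxObletBytes mirrors the constant added by this commit (128 KB).
const maxObletBytes = 128 << 10

// oblet is a hypothetical description of one scan job: a byte offset
// from the object base and the number of bytes to scan from there.
type oblet struct {
	off, size uintptr
}

// splitOblets returns the scan jobs for an object of the given size.
// Objects of at most maxObletBytes are scanned as a single job.
func splitOblets(size uintptr) []oblet {
	if size <= maxObletBytes {
		return []oblet{{0, size}}
	}
	var jobs []oblet
	for off := uintptr(0); off < size; off += maxObletBytes {
		n := size - off
		if n > maxObletBytes {
			n = maxObletBytes // every job except possibly the last is full-sized
		}
		jobs = append(jobs, oblet{off, n})
	}
	return jobs
}

func main() {
	// A 300 KB object splits into two full oblets and one 44 KB tail.
	for _, o := range splitOblets(300 << 10) {
		fmt.Printf("offset=%d size=%d\n", o.off, o.size)
	}
}
```

Because each job is independent, separate workers could pick up different oblets of the same object, which is the parallelism benefit the comment mentions.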

src/runtime/mgcmark.go

Lines changed: 54 additions & 6 deletions
@@ -25,6 +25,15 @@ const (
 	// rootBlockSpans is the number of spans to scan per span
 	// root.
 	rootBlockSpans = 8 * 1024 // 64MB worth of spans
+
+	// maxObletBytes is the maximum bytes of an object to scan at
+	// once. Larger objects will be split up into "oblets" of at
+	// most this size. Since we can scan 1–2 MB/ms, 128 KB bounds
+	// scan preemption at ~100 µs.
+	//
+	// This must be > _MaxSmallSize so that the object base is the
+	// span base.
+	maxObletBytes = 128 << 10
 )
 
 // gcMarkRootPrepare queues root scanning jobs (stacks, globals, and
@@ -1113,9 +1122,10 @@ func scanblock(b0, n0 uintptr, ptrmask *uint8, gcw *gcWork) {
 }
 
 // scanobject scans the object starting at b, adding pointers to gcw.
-// b must point to the beginning of a heap object; scanobject consults
-// the GC bitmap for the pointer mask and the spans for the size of the
-// object.
+// b must point to the beginning of a heap object or an oblet.
+// scanobject consults the GC bitmap for the pointer mask and the
+// spans for the size of the object.
+//
 //go:nowritebarrier
 func scanobject(b uintptr, gcw *gcWork) {
 	// Note that arena_used may change concurrently during
@@ -1130,16 +1140,54 @@ func scanobject(b uintptr, gcw *gcWork) {
 	arena_start := mheap_.arena_start
 	arena_used := mheap_.arena_used
 
-	// Find bits of the beginning of the object.
-	// b must point to the beginning of a heap object, so
-	// we can get its bits and span directly.
+	// Find the bits for b and the size of the object at b.
+	//
+	// b is either the beginning of an object, in which case this
+	// is the size of the object to scan, or it points to an
+	// oblet, in which case we compute the size to scan below.
 	hbits := heapBitsForAddr(b)
 	s := spanOfUnchecked(b)
 	n := s.elemsize
 	if n == 0 {
 		throw("scanobject n == 0")
 	}
 
+	if n > maxObletBytes {
+		// Large object. Break into oblets for better
+		// parallelism and lower latency.
+		if b == s.base() {
+			// It's possible this is a noscan object (not
+			// from greyobject, but from other code
+			// paths), in which case we must *not* enqueue
+			// oblets since their bitmaps will be
+			// uninitialized.
+			if !hbits.hasPointers(n) {
+				// Bypass the whole scan.
+				gcw.bytesMarked += uint64(n)
+				return
+			}
+
+			// Enqueue the other oblets to scan later.
+			// Some oblets may be in b's scalar tail, but
+			// these will be marked as "no more pointers",
+			// so we'll drop out immediately when we go to
+			// scan those.
+			for oblet := b + maxObletBytes; oblet < s.base()+s.elemsize; oblet += maxObletBytes {
+				if !gcw.putFast(oblet) {
+					gcw.put(oblet)
+				}
+			}
+		}
+
+		// Compute the size of the oblet. Since this object
+		// must be a large object, s.base() is the beginning
+		// of the object.
+		n = s.base() + s.elemsize - b
+		if n > maxObletBytes {
+			n = maxObletBytes
+		}
+	}
+
 	var i uintptr
 	for i = 0; i < n; i += sys.PtrSize {
 		// Find bits for this word.
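
The control flow of the new block in scanobject can be exercised with a small simulation. This is a sketch under stated assumptions, not the runtime's implementation: `span` and `worker` are hypothetical stand-ins for `mspan` and `gcWork`, a `noscan` flag replaces the `hbits.hasPointers` check, and "scanning" only counts bytes instead of reading the heap bitmap.

```go
package main

import "fmt"

// maxObletBytes mirrors the constant added by this commit (128 KB).
const maxObletBytes = 128 << 10

// span is a hypothetical stand-in for the runtime's mspan holding one object.
type span struct {
	base, elemsize uintptr
	noscan         bool // stand-in for !hbits.hasPointers(n)
}

// worker is a hypothetical stand-in for gcWork.
type worker struct {
	queue   []uintptr // pending scan jobs (object or oblet addresses)
	scanned uintptr   // bytes covered by scan calls
	marked  uintptr   // bytes credited via the noscan bypass
	maxScan uintptr   // largest single scan, to check the bound
}

// scanObject follows the same branching as the diff above.
func (w *worker) scanObject(b uintptr, s *span) {
	n := s.elemsize
	if n > maxObletBytes {
		if b == s.base {
			if s.noscan {
				w.marked += n // bypass the whole scan
				return
			}
			// Enqueue every oblet after the first.
			for ob := b + maxObletBytes; ob < s.base+s.elemsize; ob += maxObletBytes {
				w.queue = append(w.queue, ob)
			}
		}
		// Size of this oblet: the remainder, capped at maxObletBytes.
		n = s.base + s.elemsize - b
		if n > maxObletBytes {
			n = maxObletBytes
		}
	}
	w.scanned += n
	if n > w.maxScan {
		w.maxScan = n
	}
}

// drain runs scan jobs until the queue is empty and returns the call count.
func (w *worker) drain(s *span) int {
	calls := 0
	for len(w.queue) > 0 {
		b := w.queue[len(w.queue)-1]
		w.queue = w.queue[:len(w.queue)-1]
		w.scanObject(b, s)
		calls++
	}
	return calls
}

func main() {
	s := &span{base: 1 << 20, elemsize: 1 << 20} // one 1 MB object
	w := &worker{queue: []uintptr{s.base}}
	calls := w.drain(s)
	fmt.Println("scan calls:", calls, "bytes:", w.scanned, "largest scan:", w.maxScan)
}
```

In this model a 1 MB object is covered by eight scan calls of 128 KB each, so no single call exceeds the oblet bound, and a noscan object is credited in one step without enqueuing anything.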

src/runtime/mgcwork.go

Lines changed: 1 addition & 1 deletion
@@ -94,7 +94,7 @@ func (w *gcWork) init() {
 }
 
 // put enqueues a pointer for the garbage collector to trace.
-// obj must point to the beginning of a heap object.
+// obj must point to the beginning of a heap object or an oblet.
 //go:nowritebarrier
 func (w *gcWork) put(obj uintptr) {
 	wbuf := w.wbuf1.ptr()
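
The scanobject change enqueues oblets with a `putFast` attempt followed by a `put` fallback. The sketch below illustrates that fast-path/slow-path pattern in isolation. It is a simplified model, not the runtime's gcWork: `workQueue`, `bufCap`, and `enqueue` are hypothetical, and handing a full buffer to a slice stands in for whatever the real slow path does.

```go
package main

import "fmt"

// bufCap is a hypothetical, deliberately tiny buffer capacity.
const bufCap = 4

// workQueue is a hypothetical model of a per-worker work buffer.
type workQueue struct {
	buf  []uintptr   // current local buffer
	full [][]uintptr // full buffers handed off (stand-in for a shared list)
}

// putFast appends obj only if the local buffer has room; it reports
// false when the caller must fall back to put.
func (w *workQueue) putFast(obj uintptr) bool {
	if len(w.buf) >= bufCap {
		return false
	}
	w.buf = append(w.buf, obj)
	return true
}

// put always succeeds: if the local buffer is full it is handed off
// and replaced with a fresh one before obj is appended.
func (w *workQueue) put(obj uintptr) {
	if len(w.buf) >= bufCap {
		w.full = append(w.full, w.buf)
		w.buf = make([]uintptr, 0, bufCap)
	}
	w.buf = append(w.buf, obj)
}

// enqueue mirrors the call-site pattern used for oblets in the diff.
func (w *workQueue) enqueue(obj uintptr) {
	if !w.putFast(obj) {
		w.put(obj)
	}
}

func main() {
	var w workQueue
	for i := uintptr(1); i <= 10; i++ {
		w.enqueue(i * (128 << 10)) // oblet-aligned offsets as sample work
	}
	fmt.Println("full buffers:", len(w.full), "local items:", len(w.buf))
}
```

Trying the cheap path first keeps the common case short, and the fallback guarantees that every oblet is eventually queued.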
