# Proposal: Concurrent stack re-scanning

Author(s): Austin Clements, Rick Hudson

Last updated: 2016-10-18

Discussion at https://golang.org/issue/17505.

**Note:** We are not actually proposing this.
This design was developed before proposal #17503, which is a
dramatically simpler solution to the problem of stack re-scanning.
We're posting this design doc for its historical value.

## Abstract

Since the release of the concurrent garbage collector in Go 1.5, each
subsequent release has further reduced stop-the-world (STW) time by
moving more tasks to the concurrent phase.
As of Go 1.7, the only non-trivial STW task is stack re-scanning.
We propose to make stack re-scanning concurrent for Go 1.8, likely
resulting in sub-millisecond worst-case STW times.

## Background

Go's concurrent garbage collector consists of four phases: mark, mark
termination, sweep, and sweep termination.
The mark and sweep phases are *concurrent*, meaning that the
application (the *mutator*) continues to run during these phases,
while the mark termination and sweep termination phases are
*stop-the-world* (STW), meaning that the garbage collector pauses the
mutator for the duration of the phase.

Since Go 1.5, we've been steadily moving tasks from the STW phases to
the concurrent phases, with a particular focus on tasks that take time
proportional to something under application control, such as heap size
or number of goroutines.
As a result, in Go 1.7, most applications have sub-millisecond STW
times.

As of Go 1.7, the only remaining application-controllable STW task is
*stack re-scanning*.
Because of this one task, applications with large numbers of active
goroutines can still experience STW times in excess of 10ms.

Stack re-scanning is necessary because stacks are *permagray* in the
Go garbage collector.
Specifically, for performance reasons, there are no write barriers for
writes to pointers in the current stack frame.
As a result, even though the garbage collector scans all stacks at the
beginning of the mark phase, it must re-scan all modified stacks while
the world is stopped to catch any pointers the mutator "hid" on the
stack.

Unfortunately, this makes STW time proportional to the total amount of
stack that needs to be re-scanned.
Worse, stack scanning is relatively expensive (~5ms/MB).
Hence, applications with a large number of active goroutines can
quickly drive up STW time.

## Proposal

We propose to make stack re-scanning concurrent using a *transitive
mark* write barrier.

In this design, we add a new concurrent phase between mark and mark
termination called *stack re-scan*.
This phase starts as soon as the mark phase has marked all objects
reachable from roots *other than stacks*.
The phase re-scans stacks that have been modified since their initial
scan, and enables a special *transitive mark* write barrier.

Re-scanning and the write barrier ensure the following invariant
during this phase:

> *After a goroutine stack G has been re-scanned, all objects locally
> reachable to G are black.*

This depends on a goroutine-local notion of reachability, which is the
set of objects reachable from globals or a given goroutine's stack or
registers.
Unlike regular global reachability, this is not stable: as goroutines
modify heap pointers or communicate, an object that was locally
unreachable to a given goroutine may become locally reachable.
However, the concepts are closely related: a globally reachable object
must be locally reachable by at least one goroutine, and, conversely,
an object that is not locally reachable by any goroutine is not
globally reachable.

This invariant ensures that re-scanning a stack *blackens* that stack,
and that the stack remains black since the goroutine has no way to
find a white object once its stack has been re-scanned.

Furthermore, once every goroutine stack has been re-scanned, marking
is complete.
Every globally reachable object must be locally reachable by some
goroutine and, once every stack has been re-scanned, every object
locally reachable by some goroutine is black, so it follows that every
globally reachable object is black once every stack has been
re-scanned.

### Transitive mark write barrier

The transitive mark write barrier for an assignment `*dst = src`
(where `src` is a pointer) ensures that all objects reachable from
`src` are black *before* writing `src` to `*dst`.
Writing `src` to `*dst` may make any object reachable from `src`
(including `src` itself) locally reachable to some goroutine that has
been re-scanned.
Hence, to maintain the invariant, we must ensure these objects are all
black.

To do this, the write barrier greys `src` and then drains the mark
work queue until there are no grey objects (using the same work queue
logic that drives the mark phase).
At this point, it writes `src` to `*dst` and allows the goroutine to
proceed.
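
The structure of this barrier can be sketched against a toy
object-graph model.
This is an illustrative sketch only, not the runtime's implementation:
the names (`obj`, `shade`, `drain`, `writeBarrier`) are invented, and
this sequential version omits the coordination between simultaneous
barriers that the rest of this section requires.

```go
package main

import "fmt"

// Toy tri-color heap model. Invented for illustration; not the Go
// runtime's actual data structures or identifiers.

type color int

const (
	white color = iota
	grey
	black
)

type obj struct {
	name     string
	color    color
	children []*obj // outgoing heap pointers
}

// markQueue stands in for the global mark work queue.
var markQueue []*obj

// shade greys a white object and queues it for scanning.
func shade(o *obj) {
	if o != nil && o.color == white {
		o.color = grey
		markQueue = append(markQueue, o)
	}
}

// drain scans grey objects until none remain, blackening each one.
func drain() {
	for len(markQueue) > 0 {
		o := markQueue[len(markQueue)-1]
		markQueue = markQueue[:len(markQueue)-1]
		for _, c := range o.children {
			shade(c)
		}
		o.color = black
	}
}

// writeBarrier performs `*dst = src`, but only after everything
// reachable from src has been blackened.
func writeBarrier(dst **obj, src *obj) {
	shade(src)
	drain() // everything reachable from src is now black
	*dst = src
}

func main() {
	// S1 → O1 → O2 → O3, all initially white.
	o3 := &obj{name: "O3"}
	o2 := &obj{name: "O2", children: []*obj{o3}}
	o1 := &obj{name: "O1", children: []*obj{o2}}
	s1 := &obj{name: "S1", children: []*obj{o1}}

	var slot *obj // a pointer field of some already-black object
	writeBarrier(&slot, s1)

	for _, o := range []*obj{s1, o1, o2, o3} {
		fmt.Printf("%s black=%v\n", o.name, o.color == black)
	}
}
```

The key point of the sketch is the ordering in `writeBarrier`: the
drain completes before the pointer is published.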

The write barrier must not perform the write until all simultaneous
write barriers are also ready to perform the write.
We refer to this as *mark quiescence*.
To see why this is necessary, consider two simultaneous write barriers
for `*D1 = S1` and `*D2 = S2` on an object graph that looks like this
(both *S1* and *S2* point to *O1*):

    G1 [b] → D1 [b]   S1 [w]
                       ↓
                      O1 [w] → O2 [w] → O3 [w]
                       ↑
    D2 [b]            S2 [w]

Goroutine *G1* has been re-scanned (so *D1* must be black), while *Sn*
and *On* are all white.

Suppose the *S2* write barrier blackens *S2* and *O1* and greys *O2*,
then the *S1* write barrier blackens *S1* and observes that *O1* is
already black:

    G1 [b] → D1 [b]   S1 [b]
                       ↓
                      O1 [b] → O2 [g] → O3 [w]
                       ↑
    D2 [b]            S2 [b]

At this point, the *S1* barrier has run out of local work, but the
*S2* barrier is still going.
If *S1* were to complete and write `*D1 = S1` at this point, it would
make white object *O3* reachable to goroutine *G1*, violating the
invariant.
Hence, the *S1* barrier cannot complete until the *S2* barrier is also
ready to complete.

This requirement sounds onerous, but it can be achieved in a simple
and reasonably efficient manner by sharing a global mark work queue
between the write barriers.
This reuses the existing mark work queue and quiescence logic and
allows write barriers to help each other to completion.
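
A hedged sketch of how a shared queue yields this behavior: each
barrier pushes its source onto one global queue, helps drain it, and
returns only when the queue is empty *and* no barrier is mid-scan.
All names and the locking scheme here are invented for illustration;
the runtime's real mechanism is the quiescence barrier over shared
work buffers (`getfull`, discussed under Implementation).

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// Toy model of mark quiescence through a shared work queue.

type node struct {
	name     string
	marked   bool
	children []*node
}

var (
	mu       sync.Mutex
	queue    []*node
	scanning int // barriers currently scanning a popped object
)

// push marks-and-queues a node the first time it is seen.
func push(n *node) {
	if n == nil {
		return
	}
	mu.Lock()
	if !n.marked {
		n.marked = true
		queue = append(queue, n)
	}
	mu.Unlock()
}

// quiesce helps drain the shared queue and returns only once the
// queue is empty and no other barrier is still scanning an object,
// i.e. once every simultaneous barrier is ready to publish.
func quiesce() {
	for {
		mu.Lock()
		if len(queue) > 0 {
			n := queue[len(queue)-1]
			queue = queue[:len(queue)-1]
			scanning++
			mu.Unlock()
			for _, c := range n.children {
				push(c)
			}
			mu.Lock()
			scanning--
			mu.Unlock()
			continue
		}
		idle := scanning == 0
		mu.Unlock()
		if idle {
			return
		}
		runtime.Gosched() // another barrier may still produce work
	}
}

// writeBarrier publishes *dst = src only after global quiescence.
func writeBarrier(dst **node, src *node) {
	push(src)
	quiesce()
	*dst = src
}

func main() {
	// The object graph from the example: S1 and S2 both reach O1.
	o3 := &node{name: "O3"}
	o2 := &node{name: "O2", children: []*node{o3}}
	o1 := &node{name: "O1", children: []*node{o2}}
	s1 := &node{name: "S1", children: []*node{o1}}
	s2 := &node{name: "S2", children: []*node{o1}}

	var d1, d2 *node
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); writeBarrier(&d1, s1) }()
	go func() { defer wg.Done(); writeBarrier(&d2, s2) }()
	wg.Wait()

	fmt.Println("O3 marked:", o3.marked)
}
```

Because the queue is shared, the *S1* barrier in the race above would
help scan *O2* and *O3* rather than completing early.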

### Stack re-scanning

The stack re-scan phase re-scans the stacks of all goroutines that
have run since the initial stack scan to find pointers to white
objects.
The process of re-scanning a stack is identical to that of the initial
scan, except that it must participate in mark quiescence.
Specifically, the re-scanned goroutine must not resume execution until
the system has reached mark quiescence (even if no white pointers are
found on the stack).
Otherwise, the same sorts of races that were described above are
possible.

There are multiple ways to realize this.
The whole stack scan could participate in mark quiescence, but this
would block any contemporaneous stack scans or write barriers from
completing during a stack scan if any white pointers were found.
Alternatively, each white pointer found on the stack could participate
individually in mark quiescence, blocking the stack scan at that
pointer until mark quiescence, and the stack scan could again
participate in mark quiescence once all frames had been scanned.

We propose an intermediate: gather small batches of white pointers
from a stack at a time and reach mark quiescence on each batch
individually, as well as at the end of the stack scan (even if the
final batch is empty).
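
The batching scheme can be sketched as follows.
This is a simplified, sequential illustration with invented names and
an arbitrary batch size; `quiesce` here only drains a local queue,
whereas the full design also waits for concurrent write barriers, and
"shaded" conflates the grey and black states.

```go
package main

import "fmt"

// Toy model of batched stack re-scanning.

type obj struct {
	name     string
	black    bool
	children []*obj
}

var work []*obj

// shade queues an object the first time it is seen.
func shade(o *obj) {
	if o != nil && !o.black {
		o.black = true
		work = append(work, o)
	}
}

// quiesce drains the mark queue until no work remains.
func quiesce() {
	for len(work) > 0 {
		o := work[len(work)-1]
		work = work[:len(work)-1]
		for _, c := range o.children {
			shade(c)
		}
	}
}

const batchSize = 2 // illustrative; tuning is a separate question

// rescanStack re-scans one goroutine stack, reaching quiescence on
// each batch of white pointers and once more at the end. The
// goroutine may not resume until the final quiesce returns.
func rescanStack(stackSlots []*obj) {
	batch := 0
	for _, p := range stackSlots {
		if p != nil && !p.black {
			shade(p)
			batch++
			if batch == batchSize {
				quiesce()
				batch = 0
			}
		}
	}
	quiesce() // final quiescence, even if the last batch is empty
}

func main() {
	x := &obj{name: "X"}
	y := &obj{name: "Y", children: []*obj{x}}
	z := &obj{name: "Z"}
	rescanStack([]*obj{y, nil, z})
	fmt.Println(x.black, y.black, z.black)
}
```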

### Other considerations

Goroutines that start during stack re-scanning cannot reach any white
objects, so their stacks are immediately considered black.

Goroutines can also share pointers through channels, which are often
implemented as direct stack-to-stack copies.
Hence, channel receives also require write barriers in order to
maintain the invariant.
Channel receives already have write barriers to maintain stack
barriers, so there is no additional work here.

## Rationale

The primary drawback of this approach to concurrent stack re-scanning
is that a write barrier during re-scanning could introduce significant
mutator latency if the transitive mark finds a large unmarked region
of the heap, or if overlapping write barriers significantly delay mark
quiescence.
However, we consider this situation unlikely in non-adversarial
applications.
Furthermore, the resulting delay should be no worse than the mark
termination STW time applications currently experience, since mark
termination has to do exactly the same amount of marking work, in
addition to the cost of stack scanning.

### Alternative approaches

An alternative solution to concurrent stack re-scanning would be to
adopt DMOS-style quiescence [Hudson '97].
In this approach, greying any object during stack re-scanning (either
by finding a pointer to a white object on a stack or by installing a
pointer to a white object in the heap) forces the GC to drain this
marking work and *restart* the stack re-scanning phase.

This approach has a much simpler, constant-time write barrier
implementation, so the write barrier would not induce significant
mutator latency.
However, unlike the proposed approach, the amount of work performed by
DMOS-style stack re-scanning is potentially unbounded.
This interacts poorly with Go's GC pacer.
The pacer enforces the goal heap size by making allocation and GC work
proportional, but this requires an upper bound on possible GC work.
As a result, if the pacer underestimates the amount of re-scanning
work, it may need to block allocation entirely to avoid exceeding the
goal heap size.
This would be an effective STW.

There is also a hybrid solution: we could use the proposed transitive
marking write barrier, but bound the amount of work it can do (and
hence the latency it can induce).
If the write barrier exceeds this bound, it performs a DMOS-style
restart.
This is likely to get the best of both worlds, but it also inherits
the sum of their complexity.

A final alternative would be to eliminate concurrent stack re-scanning
entirely by adopting a *deletion-style* write barrier [Yuasa '90].
This style of write barrier allows the initial stack scan to *blacken*
the stack, rather than merely greying it (still without the need for
stack write barriers).
For full details, see proposal #17503.
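
The hybrid scheme can be sketched by giving the transitive drain a
work budget.
Everything below (the names, the budget value, the boolean restart
signal) is invented for illustration; a real implementation would have
to actually trigger and coordinate the phase restart.

```go
package main

import "fmt"

// Toy model of the hybrid bounded transitive mark write barrier.

type obj struct {
	black    bool
	children []*obj
}

var work []*obj

func shade(o *obj) {
	if o != nil && !o.black {
		o.black = true
		work = append(work, o)
	}
}

// boundedDrain scans at most budget objects. It reports whether it
// finished; false means the caller must fall back to a DMOS-style
// restart of the stack re-scan phase.
func boundedDrain(budget int) bool {
	for len(work) > 0 {
		if budget == 0 {
			return false // too much work: request a restart
		}
		o := work[len(work)-1]
		work = work[:len(work)-1]
		for _, c := range o.children {
			shade(c)
		}
		budget--
	}
	return true
}

// writeBarrier publishes the write either way; if the budget was
// exceeded, it reports that the stack re-scan phase must restart
// (the restart re-establishes the invariant for all stacks).
func writeBarrier(dst **obj, src *obj, budget int) (restart bool) {
	shade(src)
	ok := boundedDrain(budget)
	*dst = src
	return !ok
}

// chain builds a linked list of n objects, for demonstration.
func chain(n int) *obj {
	var head *obj
	for i := 0; i < n; i++ {
		head = &obj{children: []*obj{head}}
	}
	return head
}

func main() {
	var slot *obj
	fmt.Println(writeBarrier(&slot, chain(3), 10))   // small graph: no restart
	work = nil
	fmt.Println(writeBarrier(&slot, chain(100), 10)) // exceeds budget: restart
}
```

The budget caps the latency any single barrier can induce, at the cost
of inheriting the pacer problems of the restart path when it triggers.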

## Compatibility

This proposal does not affect the language or any APIs and hence
satisfies the Go 1 compatibility guidelines.

## Implementation

We do not plan to implement this proposal.
Instead, we plan to implement proposal #17503.

The implementation steps are as follows:

1. While not strictly necessary, first make GC assists participate in
   stack scanning.
   Currently this is not possible, which increases mutator latency at
   the beginning of the GC cycle.
   This proposal would compound this effect by also blocking GC
   assists at the end of the GC cycle, causing an effective STW.

2. Modify the write barrier to be pre-publication instead of
   post-publication.
   Currently the write barrier occurs after the write of a pointer,
   but this proposal requires that the write barrier complete
   transitive marking *before* writing the pointer to its destination.
   A pre-publication barrier is also necessary for
   [ROC](https://golang.org/s/gctoc).

3. Make the mark completion condition precise.
   Currently it's possible (albeit unlikely) to enter mark termination
   before all heap pointers have been marked.
   This proposal requires that we not start stack re-scanning until
   all objects reachable from globals are marked, which requires a
   precise completion condition.

4. Implement the transitive mark write barrier.
   This can reuse the existing work buffer pool lists and logic,
   including the global quiescence barrier in `getfull`.
   It may be necessary to improve the performance characteristics of
   the `getfull` barrier, since this proposal will lean far more
   heavily on it than we currently do.

5. Audit the stack re-scanning code to make sure it is safe to run
   outside of STW.
   Since it currently runs only during STW, it may omit
   synchronization that becomes necessary when running concurrently
   with the mutator.
   The changes are likely to be minimal, since most of the code is
   shared with the initial stack scan, which does run concurrently.

6. Make stack re-scanning participate in write barrier quiescence.

7. Create a new stack re-scanning phase.
   Make mark 2 completion transition to stack re-scanning instead of
   mark termination and enqueue stack re-scanning root jobs.
   Once all stack re-scanning jobs are complete, transition to mark
   termination.
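
The barrier-ordering change in step 2 can be illustrated side by side.
This is a sketch over an invented toy model, not the runtime's
`mbarrier` code; it only shows the difference in when the pointer
becomes visible relative to marking.

```go
package main

import "fmt"

// Toy model contrasting post- and pre-publication write barriers.

type obj struct {
	black    bool
	children []*obj
}

var work []*obj

func shade(o *obj) {
	if o != nil && !o.black {
		o.black = true
		work = append(work, o)
	}
}

func drain() {
	for len(work) > 0 {
		o := work[len(work)-1]
		work = work[:len(work)-1]
		for _, c := range o.children {
			shade(c)
		}
	}
}

// postPublicationBarrier (the current scheme): the pointer is visible
// to other goroutines before src has been marked.
func postPublicationBarrier(dst **obj, src *obj) {
	*dst = src
	shade(src)
}

// prePublicationBarrier (required by this proposal): everything
// reachable from src is blackened before the pointer becomes visible.
func prePublicationBarrier(dst **obj, src *obj) {
	shade(src)
	drain()
	*dst = src
}

func main() {
	a := &obj{}
	b := &obj{children: []*obj{a}}
	var slot *obj
	prePublicationBarrier(&slot, b)
	fmt.Println(slot == b, a.black, b.black)
}
```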

## Acknowledgments

We would like to thank Rhys Hiltner (@rhysh) for suggesting the idea
of a transitive mark write barrier.

## References

[Hudson '97] R. L. Hudson, R. Morrison, J. E. B. Moss, and D. S.
Munro. Garbage collecting the world: One car at a time. In *ACM
SIGPLAN Notices* 32(10):162–175, October 1997.

[Yuasa '90] T. Yuasa. Real-time garbage collection on general-purpose
machines. *Journal of Systems and Software*, 11(3):181–198, 1990.