# Proposal: Concurrent stack re-scanning

Author(s): Austin Clements, Rick Hudson

Last updated: 2016-10-18

Discussion at https://golang.org/issue/17505.

**Note:** We are not actually proposing this.
This design was developed before proposal #17503, which is a
dramatically simpler solution to the problem of stack re-scanning.
We're posting this design doc for its historical value.


## Abstract

Since the release of the concurrent garbage collector in Go 1.5, each
subsequent release has further reduced stop-the-world (STW) time by
moving more tasks to the concurrent phase.
As of Go 1.7, the only non-trivial STW task is stack re-scanning.
We propose to make stack re-scanning concurrent for Go 1.8, likely
resulting in sub-millisecond worst-case STW times.


## Background

Go's concurrent garbage collector consists of four phases: mark, mark
termination, sweep, and sweep termination.
The mark and sweep phases are *concurrent*, meaning that the
application (the *mutator*) continues to run during these phases,
while the mark termination and sweep termination phases are
*stop-the-world* (STW), meaning that the garbage collector pauses the
mutator for the duration of the phase.

Since Go 1.5, we've been steadily moving tasks from the STW phases to
the concurrent phases, with a particular focus on tasks that take time
proportional to something under application control, such as heap size
or number of goroutines.
As a result, in Go 1.7, most applications have sub-millisecond STW
times.

As of Go 1.7, the only remaining application-controllable STW task is
*stack re-scanning*.
Because of this one task, applications with large numbers of active
goroutines can still experience STW times in excess of 10ms.

Stack re-scanning is necessary because stacks are *permagray* in the
Go garbage collector.
Specifically, for performance reasons, there are no write barriers for
writes to pointers in the current stack frame.
As a result, even though the garbage collector scans all stacks at the
beginning of the mark phase, it must re-scan all modified stacks while
the world is stopped to catch any pointers the mutator "hid" on the
stack.
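
To make the hiding pattern concrete, here is a minimal sketch of how a
mutator can end up holding the only reference to an object in a
barrier-free stack slot. The GC's interleaving is narrated in
comments rather than simulated, and `node` and `global` are
illustrative names, not runtime internals:

```go
package main

import "fmt"

type node struct{ next *node }

// global is scanned as a root during the mark phase.
var global *node

func main() {
	global = &node{next: &node{}}

	// Suppose the GC has already scanned this goroutine's stack but
	// has not yet scanned `global`. The mutator can copy a pointer
	// into a stack slot with no write barrier (stack writes are
	// barrier-free)...
	hidden := global.next

	// ...and then erase the only heap copy. The heap write is
	// barriered, but an insertion barrier shades the *new* value
	// (nil here), which does nothing for the old one.
	global.next = nil

	// `hidden` is now the sole reference to that node. Without a
	// stack re-scan, the collector would never discover it.
	fmt.Println(hidden != nil)
}
```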

Unfortunately, this makes STW time proportional to the total amount of
stack that needs to be re-scanned.
Worse, stack scanning is relatively expensive (~5ms/MB).
Hence, applications with a large number of active goroutines can
quickly drive up STW time.


## Proposal

We propose to make stack re-scanning concurrent using a *transitive
mark* write barrier.

In this design, we add a new concurrent phase between mark and mark
termination called *stack re-scan*.
This phase starts as soon as the mark phase has marked all objects
reachable from roots *other than stacks*.
The phase re-scans stacks that have been modified since their initial
scan, and enables a special *transitive mark* write barrier.

Re-scanning and the write barrier ensure the following invariant
during this phase:

> *After a goroutine stack G has been re-scanned, all objects locally
> reachable to G are black.*

This depends on a goroutine-local notion of reachability, which is the
set of objects reachable from globals or a given goroutine's stack or
registers.
Unlike regular global reachability, this is not stable: as goroutines
modify heap pointers or communicate, an object that was locally
unreachable to a given goroutine may become locally reachable.
However, the concepts are closely related: a globally reachable object
must be locally reachable by at least one goroutine, and, conversely,
an object that is not locally reachable by any goroutine is not
globally reachable.

This invariant ensures that re-scanning a stack *blackens* that stack,
and that the stack remains black since the goroutine has no way to
find a white object once its stack has been re-scanned.

Furthermore, once every goroutine stack has been re-scanned, marking
is complete.
Every globally reachable object must be locally reachable by some
goroutine and, once every stack has been re-scanned, every object
locally reachable by some goroutine is black, so it follows that every
globally reachable object is black once every stack has been
re-scanned.

### Transitive mark write barrier

The transitive mark write barrier for an assignment `*dst = src`
(where `src` is a pointer) ensures that all objects reachable from
`src` are black *before* writing `src` to `*dst`.
Writing `src` to `*dst` may make any object reachable from `src`
(including `src` itself) locally reachable to some goroutine that has
been re-scanned.
Hence, to maintain the invariant, we must ensure these objects are all
black.

To do this, the write barrier greys `src` and then drains the mark
work queue until there are no grey objects (using the same work queue
logic that drives the mark phase).
At this point, it writes `src` to `*dst` and allows the goroutine to
proceed.
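
The grey-and-drain step can be sketched on a toy tricolor object
graph. This is a single-threaded model of the barrier's algorithm,
not the runtime's implementation: all types and the
`transitiveMarkBarrier` name are illustrative, and mark quiescence
among concurrent barriers is omitted:

```go
package main

import "fmt"

type color int

const (
	white color = iota
	grey
	black
)

type object struct {
	color    color
	children []*object
}

// workq stands in for the global mark work queue shared by all
// write barriers.
var workq []*object

// shade greys a white object and queues it for marking.
func shade(o *object) {
	if o != nil && o.color == white {
		o.color = grey
		workq = append(workq, o)
	}
}

// drain marks until there are no grey objects, mirroring the mark
// phase's work queue loop.
func drain() {
	for len(workq) > 0 {
		o := workq[len(workq)-1]
		workq = workq[:len(workq)-1]
		for _, c := range o.children {
			shade(c)
		}
		o.color = black
	}
}

// transitiveMarkBarrier performs *dst = src, but first blackens
// everything reachable from src.
func transitiveMarkBarrier(dst **object, src *object) {
	shade(src)
	drain()
	*dst = src
}

func main() {
	leaf := &object{}
	src := &object{children: []*object{leaf}}
	var dst *object
	transitiveMarkBarrier(&dst, src)
	fmt.Println(src.color == black && leaf.color == black)
}
```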

The write barrier must not perform the write until all simultaneous
write barriers are also ready to perform the write.
We refer to this as *mark quiescence*.
To see why this is necessary, consider two simultaneous write barriers
for `*D1 = S1` and `*D2 = S2` on an object graph that looks like this:

    G1 [b] → D1 [b]    S1 [w]
                             ↘
                               O1 [w] → O2 [w] → O3 [w]
                             ↗
             D2 [b]    S2 [w]

Goroutine *G1* has been re-scanned (so *D1* must be black), while *Sn*
and *On* are all white.

Suppose the *S2* write barrier blackens *S2* and *O1* and greys *O2*,
then the *S1* write barrier blackens *S1* and observes that *O1* is
already black:

    G1 [b] → D1 [b]    S1 [b]
                             ↘
                               O1 [b] → O2 [g] → O3 [w]
                             ↗
             D2 [b]    S2 [b]

At this point, the *S1* barrier has run out of local work, but the
*S2* barrier is still going.
If *S1* were to complete and write `*D1 = S1` at this point, it would
make the white object *O3* reachable to goroutine *G1*, violating the
invariant.
Hence, the *S1* barrier cannot complete until the *S2* barrier is also
ready to complete.

This requirement sounds onerous, but it can be achieved in a simple
and reasonably efficient manner by sharing a global mark work queue
between the write barriers.
This reuses the existing mark work queue and quiescence logic and
allows write barriers to help each other to completion.

### Stack re-scanning

The stack re-scan phase re-scans the stacks of all goroutines that
have run since the initial stack scan to find pointers to white
objects.
The process of re-scanning a stack is identical to that of the initial
scan, except that it must participate in mark quiescence.
Specifically, the re-scanned goroutine must not resume execution until
the system has reached mark quiescence (even if no white pointers are
found on the stack).
Otherwise, the same sorts of races that were described above are
possible.

There are multiple ways to realize this.
The whole stack scan could participate in mark quiescence, but this
would block any contemporaneous stack scans or write barriers from
completing during a stack scan if any white pointers were found.
Alternatively, each white pointer found on the stack could participate
individually in mark quiescence, blocking the stack scan at that
pointer until mark quiescence, and the stack scan could again
participate in mark quiescence once all frames had been scanned.

We propose an intermediate approach: gather small batches of white
pointers from a stack at a time and reach mark quiescence on each
batch individually, as well as at the end of the stack scan (even if
the final batch is empty).
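
The batching scheme might be sketched as follows, under the same kind
of toy marking model. The `flush` step stands in for greying the
batch, draining the work queue, and waiting for mark quiescence; all
names and the batch size are illustrative assumptions:

```go
package main

import "fmt"

const batchSize = 4 // hypothetical per-batch bound

type obj struct{ marked bool }

// rescanStack walks a stack's pointer slots, gathering white
// (unmarked) pointers in small batches. Each batch is processed to
// completion before scanning continues, and a final (possibly empty)
// batch runs at the end so the goroutine always waits for quiescence.
func rescanStack(slots []*obj) {
	var batch []*obj
	flush := func() {
		// Stand-in for: grey the batch, drain the global work
		// queue, and reach mark quiescence.
		for _, o := range batch {
			o.marked = true
		}
		batch = batch[:0]
	}
	for _, p := range slots {
		if p != nil && !p.marked {
			batch = append(batch, p)
			if len(batch) == batchSize {
				flush()
			}
		}
	}
	flush() // final batch, even if empty
}

func main() {
	slots := []*obj{{}, nil, {}, {marked: true}, {}, {}, {}}
	rescanStack(slots)
	n := 0
	for _, p := range slots {
		if p != nil && p.marked {
			n++
		}
	}
	fmt.Println(n) // prints 6: every non-nil slot is marked
}
```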

### Other considerations

Goroutines that start during stack re-scanning cannot reach any white
objects, so their stacks are immediately considered black.

Goroutines can also share pointers through channels, which are often
implemented as direct stack-to-stack copies.
Hence, channel receives also require write barriers in order to
maintain the invariant.
Channel receives already have write barriers to maintain stack
barriers, so there is no additional work here.


## Rationale

The primary drawback of this approach to concurrent stack re-scanning
is that a write barrier during re-scanning could introduce significant
mutator latency if the transitive mark finds a large unmarked region
of the heap, or if overlapping write barriers significantly delay mark
quiescence.
However, we consider this situation unlikely in non-adversarial
applications.
Furthermore, the resulting delay should be no worse than the mark
termination STW time applications currently experience, since mark
termination has to do exactly the same amount of marking work, in
addition to the cost of stack scanning.

### Alternative approaches

An alternative solution to concurrent stack re-scanning would be to
adopt DMOS-style quiescence [Hudson '97].
In this approach, greying any object during stack re-scanning (either
by finding a pointer to a white object on a stack or by installing a
pointer to a white object in the heap) forces the GC to drain this
marking work and *restart* the stack re-scanning phase.

This approach has a much simpler, constant-time write barrier
implementation, so the write barrier would not induce significant
mutator latency.
However, unlike the proposed approach, the amount of work performed by
DMOS-style stack re-scanning is potentially unbounded.
This interacts poorly with Go's GC pacer.
The pacer enforces the goal heap size by making allocation and GC work
proportional, but this requires an upper bound on the possible GC
work.
As a result, if the pacer underestimates the amount of re-scanning
work, it may need to block allocation entirely to avoid exceeding the
goal heap size.
This would be an effective STW.

There is also a hybrid solution: we could use the proposed transitive
mark write barrier, but bound the amount of work it can do (and hence
the latency it can induce).
If the write barrier exceeds this bound, it performs a DMOS-style
restart.
This is likely to get the best of both worlds, but it also inherits
the sum of their complexity.
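
The hybrid's bounded barrier might look like the following sketch,
where `budget` is a hypothetical work bound and a `false` return
signals that the caller must fall back to a DMOS-style restart of the
re-scan phase (types and names are illustrative, not runtime code):

```go
package main

import "fmt"

type node struct {
	marked   bool
	children []*node
}

// markTransitive marks everything reachable from src, spending at
// most budget units of marking work. It reports whether it finished;
// false means the bound was exceeded and a restart is required.
func markTransitive(src *node, budget int) bool {
	work := []*node{src}
	for len(work) > 0 {
		n := work[len(work)-1]
		work = work[:len(work)-1]
		if n == nil || n.marked {
			continue
		}
		if budget == 0 {
			return false // bound exceeded: DMOS-style restart
		}
		n.marked = true
		budget--
		work = append(work, n.children...)
	}
	return true
}

// chain builds a singly linked chain of n unmarked nodes.
func chain(n int) *node {
	var head *node
	for i := 0; i < n; i++ {
		head = &node{children: []*node{head}}
	}
	return head
}

func main() {
	fmt.Println(markTransitive(chain(10), 5))  // false: bound exceeded
	fmt.Println(markTransitive(chain(10), 10)) // true: finished in budget
}
```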

A final alternative would be to eliminate concurrent stack re-scanning
entirely by adopting a *deletion-style* write barrier [Yuasa '90].
This style of write barrier allows the initial stack scan to *blacken*
the stack, rather than merely greying it (still without the need for
stack write barriers).
For full details, see proposal #17503.


## Compatibility

This proposal does not affect the language or any APIs and hence
satisfies the Go 1 compatibility guidelines.


## Implementation

We do not plan to implement this proposal.
Instead, we plan to implement proposal #17503.

The implementation steps are as follows:

1. While not strictly necessary, first make GC assists participate in
   stack scanning.
   Currently this is not possible, which increases mutator latency at
   the beginning of the GC cycle.
   This proposal would compound this effect by also blocking GC
   assists at the end of the GC cycle, causing an effective STW.

2. Modify the write barrier to be pre-publication instead of
   post-publication.
   Currently the write barrier occurs after the write of a pointer,
   but this proposal requires that the write barrier complete
   transitive marking *before* writing the pointer to its destination.
   A pre-publication barrier is also necessary for
   [ROC](https://golang.org/s/gctoc).

3. Make the mark completion condition precise.
   Currently it's possible (albeit unlikely) to enter mark termination
   before all heap pointers have been marked.
   This proposal requires that we not start stack re-scanning until
   all objects reachable from globals are marked, which requires a
   precise completion condition.

4. Implement the transitive mark write barrier.
   This can reuse the existing work buffer pool lists and logic,
   including the global quiescence barrier in `getfull`.
   It may be necessary to improve the performance characteristics of
   the `getfull` barrier, since this proposal will lean far more
   heavily on it than we currently do.

5. Audit the stack re-scanning code to make sure it is safe to run
   concurrently (non-STW).
   Since this code currently runs only during STW, it may omit
   synchronization that will be necessary when running concurrently.
   The changes are likely to be minimal, since most of the code is
   shared with the initial stack scan, which already runs
   concurrently.

6. Make stack re-scanning participate in write barrier quiescence.

7. Create a new stack re-scanning phase.
   Make mark 2 completion transition to stack re-scanning instead of
   mark termination and enqueue stack re-scanning root jobs.
   Once all stack re-scanning jobs are complete, transition to mark
   termination.

## Acknowledgments

We would like to thank Rhys Hiltner (@rhysh) for suggesting the idea
of a transitive mark write barrier.


## References

[Hudson '97] R. L. Hudson, R. Morrison, J. E. B. Moss, and D. S.
Munro. Garbage collecting the world: One car at a time. In *ACM
SIGPLAN Notices* 32(10):162–175, October 1997.

[Yuasa '90] T. Yuasa. Real-time garbage collection on general-purpose
machines. *Journal of Systems and Software*, 11(3):181–198, 1990.