internal/trace/v2: TestTraceCgoCallback failures #64060
Found new dashboard test flakes for:
2023-11-10 15:49 linux-amd64-nocgo go@43ffe2a8 internal/trace/v2.TestTraceCgoCallback (log)
2023-11-10 15:50 linux-amd64-nocgo go@9d914015 internal/trace/v2.TestTraceCgoCallback (log)
2023-11-10 15:51 linux-amd64-nocgo go@3b303fa9 internal/trace/v2.TestTraceCgoCallback (log)
This is fixed as of abf8422.
Found new dashboard test flakes for:
2023-11-10 15:49 linux-ppc64-sid-buildlet go@43ffe2a8 internal/trace/v2.TestTraceCgoCallback (log)
2023-11-10 15:50 darwin-amd64-nocgo go@9d914015 internal/trace/v2.TestTraceCgoCallback (log)
2023-11-10 15:50 linux-ppc64-sid-buildlet go@9d914015 internal/trace/v2.TestTraceCgoCallback (log)
2023-11-10 15:51 darwin-amd64-nocgo go@3b303fa9 internal/trace/v2.TestTraceCgoCallback (log)
2023-11-10 15:51 linux-ppc64-sid-buildlet go@3b303fa9 internal/trace/v2.TestTraceCgoCallback (log)
Found new dashboard test flakes for:
2023-11-10 15:49 darwin-amd64-nocgo go@43ffe2a8 internal/trace/v2.TestTraceCgoCallback (log)
Found new dashboard test flakes for:
2023-11-10 15:51 openbsd-riscv64-jsing go@3b303fa9 internal/trace/v2.TestTraceCgoCallback (log)
Found new dashboard test flakes for:
2023-11-10 15:50 openbsd-riscv64-jsing go@9d914015 internal/trace/v2.TestTraceCgoCallback (log)
Found new dashboard test flakes for:
2023-11-17 20:40 dragonfly-amd64-622 go@3ff5632d internal/trace/v2.TestTraceCgoCallback (log)
Found new dashboard test flakes for:
2023-11-17 23:15 dragonfly-amd64-622 go@f67b2d8f internal/trace/v2.TestTraceCgoCallback (log)
Found new dashboard test flakes for:
2023-11-21 16:20 dragonfly-amd64-622 go@8be8bfea internal/trace/v2.TestTraceCgoCallback (log)
Found new dashboard test flakes for:
2023-11-21 21:29 dragonfly-amd64-622 go@4e3ac99a internal/trace/v2.TestTraceCgoCallback (log)
Some of these issues may be resolved by https://go.dev/cl/544215.

The staticlockranking builder found an issue that I think might explain a lot of these failures. See https://go.dev/cl/544396. EDIT: It wasn't a real issue.
Found new dashboard test flakes for:
2023-11-22 02:20 ios-arm64-corellium go@5f7a4085 internal/trace/v2.TestTraceCgoCallback (log)
Change https://go.dev/cl/545515 mentions this issue.

https://go.dev/cl/545515 fixes all the "expected no proc but had one" issues. The only one that doesn't fit the pattern is:

2023-11-17 20:40 dragonfly-amd64-622 go@3ff5632d internal/trace/v2.TestTraceCgoCallback (log)

But I'm fairly certain that's fixed by https://go.dev/cl/544215. I will try to confirm.

Huh. Actually, that dragonfly failure has something really weird going on. It kind of looks like there's more than one active thread with the same ID. EDIT: It's not that. The ID is just reused. I see the problem.
On non-pthread platforms, it's totally possible for the same M to GoCreateSyscall/GoDestroySyscall on the same thread multiple times. That same thread may hold onto its P through all those calls.

For #64060.

Change-Id: Ib968bfd439ecd5bc24fc98d78c06145b0d4b7802
Reviewed-on: https://go-review.googlesource.com/c/go/+/545515
Reviewed-by: Michael Pratt <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
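For context, the scenario being traced looks roughly like the following. This is a minimal sketch, not the actual TestTraceCgoCallback code; the file layout and the names callFromC, worker, and callback.c are illustrative. A C-created thread calls back into Go, so the runtime attaches an M on entry (GoCreateSyscall) and releases it when returning to C (GoDestroySyscall, on every return on non-pthread platforms).

main.go:

```go
package main

/*
#cgo LDFLAGS: -lpthread
// Declaration only: a file that uses //export may not define C
// functions in its preamble, so callFromC is defined in callback.c.
extern void callFromC(void);
*/
import "C"

import "fmt"

//export goCallback
func goCallback() {
	// Entered from a thread the Go runtime doesn't own: the runtime
	// attaches an M for the duration of this call.
	fmt.Println("called back into Go")
}

func main() {
	C.callFromC()
}
```

callback.c:

```c
#include <pthread.h>
#include "_cgo_export.h"

// Call into Go several times from a single C-created thread. Each call
// re-attaches an M; on non-pthread platforms each return detaches it
// again, which is the repeated GoCreateSyscall/GoDestroySyscall pattern
// the commit message describes.
static void *worker(void *arg) {
	for (int i = 0; i < 3; i++) {
		goCallback();
	}
	return NULL;
}

void callFromC(void) {
	pthread_t t;
	pthread_create(&t, NULL, worker, NULL);
	pthread_join(&t, NULL);
}
```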
Found new dashboard test flakes for:
2023-11-29 16:01 dragonfly-amd64-622 go@b9a4eaa6 internal/trace/v2.TestTraceCgoCallback (log)
Change https://go.dev/cl/546096 mentions this issue.
To begin with, CL 545515 made the trace parser tolerant of GoCreateSyscall having a P, but that was wrong. Because dropm trashes the M's syscalltick, that case should never be possible. So the first thing this change does is rewrite the test that CL introduced to expect a failure instead of a success.

What I'd misinterpreted as a case that should be allowed was actually the same as the other issues causing #64060: the parser doesn't correctly implement what happens to Ps when a thread calls back into Go on non-pthread platforms, and what happens when a thread dies on pthread platforms (or, more succinctly, what the runtime does when it calls dropm). Specifically, the GoDestroySyscall event implies that if any P is still running on that M when it's called, that P stops running. This is what is intended by the runtime trashing the M's syscalltick: when the thread calls back into Go, the tracer models it as obtaining a new P from scratch.

Handling this incorrectly manifests in one of two ways.

On pthread platforms, GoDestroySyscall is only emitted when a C thread that previously called into Go is destroyed. However, that thread ID can be reused. Because we have no thread events, whether it's the same thread or not is totally ambiguous to the tracer. Therefore, the tracer may observe a thread that previously died try to start running with a new P under the same identity. The association to the old P is still intact because the ID is the same, and the tracer gets confused: it appears as if two Ps are running on the same M!

On non-pthread platforms, GoDestroySyscall is emitted on every return to C from Go code. In this case, the same thread with the same identity is naturally going to keep calling back into Go. But again, since the runtime trashes syscalltick in dropm, it's always going to acquire a P from the tracer's perspective. If this is a different P than before, then just like the pthread case, the parser gets confused, since it looks like two Ps are running on the same M!

The case that CL 545515 actually handled was the non-pthread case, specifically where the same P is reacquired by an M calling back into Go. In that case, if we tolerate having a P, what we'll observe is the M stealing its own P from itself, then running with it.

Now that we know what the problem is, how do we fix it? This change addresses the problem by emitting an extra event when encountering a GoDestroySyscall with an active P in its context: an additional ProcSteal event, stealing from itself, indicating that the P stopped running. This removes any association between that M and that P, resolving the ambiguity in the tracer.

There's one other minor detail that needs to be worked out: what happens to any *real* ProcSteal event that stole the P we're now emitting an extra ProcSteal event for? That event is going to look for an M that may have moved on already, and the P at that point is already idle. Luckily, we have exactly the right fix for this. The handler for GoDestroySyscall now moves any active P it has to the ProcSyscallAbandoned state, indicating that we've lost information about the P and that it should be treated as already idle. Conceptually this all makes sense: this is a P in _Psyscall that has been abandoned by the M it was previously bound to.

It's unfortunate how complicated this has all ended up being, but we can take a closer look at that in the future.

Fixes #64060.
Change-Id: Ie9e6eb9cf738607617446e3487392643656069a2
Reviewed-on: https://go-review.googlesource.com/c/go/+/546096
Reviewed-by: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Run-TryBot: Michael Knyszek <[email protected]>
Auto-Submit: Michael Knyszek <[email protected]>
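To illustrate the shape of the fix, here is a simplified sketch. This is not the real internal/trace/v2 code; the types and names (mState, pState, handleGoDestroySyscall, the emit callback) are hypothetical stand-ins for the parser's internal state tracking. It shows the two steps the commit message describes: synthesize a self-ProcSteal when GoDestroySyscall arrives with a P still bound, and mark that P syscall-abandoned so a racing real ProcSteal treats it as already idle.

```go
package main

import "fmt"

type pStatus int

const (
	pRunning pStatus = iota
	pIdle
	pSyscallAbandoned // information about this P was lost; treat it as already idle
)

type pState struct {
	id     int
	status pStatus
}

type mState struct {
	id int
	p  *pState // P currently bound to this M, if any
}

// handleGoDestroySyscall models the parser's handling of a
// GoDestroySyscall event on m.
func handleGoDestroySyscall(m *mState, emit func(string)) {
	if p := m.p; p != nil {
		// The runtime trashed syscalltick in dropm, so this P cannot
		// still be running on m. Emit a synthetic ProcSteal in which
		// the M steals the P from itself, ending the binding.
		emit(fmt.Sprintf("ProcSteal p=%d by m=%d (synthetic)", p.id, m.id))
		// Mark the P abandoned so that a *real* ProcSteal that raced
		// with thread death sees it as already idle instead of
		// looking for an M that has moved on.
		p.status = pSyscallAbandoned
		m.p = nil
	}
	emit(fmt.Sprintf("GoDestroySyscall m=%d", m.id))
}

func main() {
	p := &pState{id: 4, status: pRunning}
	m := &mState{id: 12, p: p}
	handleGoDestroySyscall(m, func(s string) { fmt.Println(s) })
}
```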
Issue created automatically to collect these failures.
Example (log):
— watchflakes