cmd/internal/testdir: Test failures #64050
Comments
Found new dashboard test flakes for:
2023-11-09 22:35 linux-ppc64le-buildlet go@ff7cf2d4 cmd/internal/testdir.Test (log)
2023-11-09 22:35 linux-ppc64le-power9osu go@ff7cf2d4 cmd/internal/testdir.Test (log)
|
That's convenient; I was just about to file an issue. When I enabled the allocation headers by default, two builders started timing out (linux-ppc64le-buildlet and linux-ppc64-sid-power10). However, I'm a little skeptical that this is related to that change: another issue was briefly masking it, so we're not 100% sure that change is implicated. On top of that, some of the other ppc64 builders don't show this slowdown, and some appear to run the tests faster than before. This makes me wonder if it's potentially a builder issue. It seems to me like the builder might already be hovering somewhat close to its overall timeout. CC @golang/ppc64 |
Found new dashboard test flakes for:
2023-11-09 22:38 linux-ppc64le-buildlet go@130baf3d cmd/internal/testdir.Test (log)
2023-11-10 04:46 linux-ppc64le-buildlet go@f7c5cbb8 cmd/internal/testdir.Test (log)
|
I don't think it's a builder issue. I ran into a deadlock running all.bash against tip on a development machine on my first try. Likewise, one ppc64le builder is stuck bootstrapping the compiler (which seems to bypass the buildlet timeouts). Attaching gdb to the stuck ppc64le bootstrapping buildlet shows that the sysmon task is running, but everything else is stuck on a futex; the backtrace isn't helpful. |
@pmur Thanks for checking. I'll take a look. |
Found new dashboard test flakes for:
2023-11-10 15:49 linux-ppc64le-power10osu go@43ffe2a8 cmd/internal/testdir.Test (log)
2023-11-10 15:49 linux-ppc64le-power9osu go@43ffe2a8 cmd/internal/testdir.Test (log)
|
I'm struggling a bit to get a gomote. I'll just disable the experiment on ppc64 for now. |
Change https://go.dev/cl/541555 mentions this issue: |
Found new dashboard test flakes for:
2023-11-10 15:51 linux-ppc64le-power10osu go@3b303fa9 cmd/internal/testdir.Test (log)
|
I'm trying to submit https://go.dev/cl/541555 but it seems to be stuck even with the log
|
Oh! Actually, the first timeout is on https://go.dev/cl/533455. I was looking at the wrong thing. That... certainly seems more likely to cause a deadlock. |
Yes, I believe I've narrowed it down to https://go.dev/cl/533455. But why? |
@pmur Since I assume you're able to run on a ppc64 machine directly fairly easily, do you mind posting a stack dump of one of these deadlocks? I can't seem to ssh into the gomote. |
The bootstrap binaries seem to lack debug information. I am poking at the stuck compile binary on ppc64le/power_03. Are the following stack dumps helpful? dumps of the 6 thread stacks
|
Found new dashboard test flakes for:
2023-11-10 16:18 linux-ppc64le-buildlet go@abf84221 cmd/internal/testdir.Test (log)
|
Oof. It's a bit hard for me to make sense of these tracebacks. I think whatever is inferring symbols is probably being a little overzealous. |
Here is a proper log of backtraces from running |
Oh that's great, thanks! I think I have a couple leads now. |
Yeah, that seems to suggest everything is stuck on Notably, there is an attempt to allocate a new stack in the tracebacks, which supports this theory. That suggests there's a rogue stack growth at a point where |
Oh, this might suggest it's the change below the one I apparently identified as a culprit. That certainly introduced new calls in potentially sensitive contexts like this. |
Found new dashboard test flakes for:
2023-11-10 16:18 linux-ppc64le-power9osu go@abf84221 cmd/internal/testdir.Test (log)
|
Change https://go.dev/cl/541635 mentions this issue: |
With the CL I just put up, I can no longer reproduce the timeouts I was seeing before in a gomote on tip. It's definitely fixing a problem so we should land it anyway, but it looks like it might be the problem. If this really is the root cause, then it turns out it had very little to do with any of the changes landed today. |
Found new dashboard test flakes for:
2023-11-10 20:43 linux-ppc64-sid-buildlet go@4346ba34 cmd/internal/testdir.Test (log)
|
Closing as a duplicate of #64067. |
Found new dashboard test flakes for:
2023-11-10 18:45 linux-ppc64le-power10osu go@505dff4f cmd/internal/testdir.Test (log)
2023-11-10 21:09 linux-ppc64le-power10osu go@ea14b633 cmd/internal/testdir.Test (log)
|
These functions acquire the heap lock. If they're not called on the systemstack, a stack growth could cause a self-deadlock since stack growth may allocate memory from the page heap.

This has been a problem for a while. If this is what's plaguing the ppc64 port right now, it's very surprising (and probably just coincidental) that it's showing up now.

For #64050.
For #64062.
Fixes #64067.

Change-Id: I2b95dc134d17be63b9fe8f7a3370fe5b5438682f
Reviewed-on: https://go-review.googlesource.com/c/go/+/541635
LUCI-TryBot-Result: Go LUCI <[email protected]>
Run-TryBot: Michael Knyszek <[email protected]>
Auto-Submit: Michael Knyszek <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
Reviewed-by: Paul Murphy <[email protected]>
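To make the failure mode in the commit message concrete, here is a minimal user-space sketch in Go. It is not runtime code: the `heap` type, `allocStack`, and `lockedOp` are hypothetical names invented for illustration, and an ordinary `sync.Mutex` stands in for the runtime's heap lock. `TryLock` (available since Go 1.18) is used only so the sketch can report the re-entrant acquisition instead of actually hanging.

```go
package main

import (
	"fmt"
	"sync"
)

// heap is a toy stand-in (an assumption, not the runtime's type) for the
// page heap, guarded by a non-reentrant lock.
type heap struct {
	mu sync.Mutex
}

// allocStack models a stack growth that needs memory from the page heap
// and therefore must take the heap lock. It returns false when the lock
// is already held, which is where a real non-reentrant lock would hang.
func (h *heap) allocStack() bool {
	if !h.mu.TryLock() {
		return false
	}
	h.mu.Unlock()
	return true
}

// lockedOp models a function that acquires the heap lock. mayGrow says
// whether a stack growth can happen while the lock is held: true models
// running on an ordinary growable stack, false models running on a
// fixed-size stack where no growth is triggered.
func (h *heap) lockedOp(mayGrow bool) (wouldDeadlock bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if mayGrow {
		// The growth tries to take the lock this function already holds.
		return !h.allocStack()
	}
	return false
}

func main() {
	h := &heap{}
	fmt.Println("growable stack, would self-deadlock:", h.lockedOp(true))
	fmt.Println("fixed stack, would self-deadlock:", h.lockedOp(false))
}
```

In this model, the `mayGrow == false` path plays the role of calling the function on the systemstack: the lock is still taken, but nothing inside the critical section can come back and ask for it again.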
Found new dashboard test flakes for:
2023-11-10 20:43 linux-ppc64-sid-power10 go@4346ba34 cmd/internal/testdir.Test (log)
2023-11-11 02:02 linux-ppc64-sid-power10 go@8da6405e cmd/internal/testdir.Test (log)
|
These are failures from before the fix. |
Found new dashboard test flakes for:
2023-11-10 21:09 linux-ppc64le-power9osu go@ea14b633 cmd/internal/testdir.Test (log)
2023-11-10 21:25 linux-ppc64-sid-buildlet go@31887586 cmd/internal/testdir.Test (log)
2023-11-10 21:25 linux-ppc64le-power9osu go@31887586 cmd/internal/testdir.Test (log)
2023-11-11 02:02 linux-ppc64-sid-buildlet go@8da6405e cmd/internal/testdir.Test (log)
2023-11-11 02:02 linux-ppc64le-power9osu go@8da6405e cmd/internal/testdir.Test (log)
|
These are unrelated timeout failures due to #64067. |
Change https://go.dev/cl/541955 mentions this issue: |
… callees on the systemstack

These functions acquire the heap lock. If they're not called on the systemstack, a stack growth could cause a self-deadlock since stack growth may allocate memory from the page heap.

This has been a problem for a while. If this is what's plaguing the ppc64 port right now, it's very surprising (and probably just coincidental) that it's showing up now.

For #64050.
For #64062.
For #64067.
Fixes #64073.

Change-Id: I2b95dc134d17be63b9fe8f7a3370fe5b5438682f
Reviewed-on: https://go-review.googlesource.com/c/go/+/541635
LUCI-TryBot-Result: Go LUCI <[email protected]>
Run-TryBot: Michael Knyszek <[email protected]>
Auto-Submit: Michael Knyszek <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
Reviewed-by: Paul Murphy <[email protected]>
(cherry picked from commit 5f08b44)
Reviewed-on: https://go-review.googlesource.com/c/go/+/541955
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Auto-Submit: Dmitri Shuralyov <[email protected]>
Issue created automatically to collect these failures.
Example (log):
— watchflakes