Skip to content

internal/trace/v2: TestTraceCgoCallback failures #64060

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
gopherbot opened this issue Nov 10, 2023 · 18 comments
Closed

internal/trace/v2: TestTraceCgoCallback failures #64060

gopherbot opened this issue Nov 10, 2023 · 18 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@gopherbot
Copy link
Contributor

#!watchflakes
default <- pkg == "internal/trace/v2" && test == "TestTraceCgoCallback"

Issue created automatically to collect these failures.

Example (log):

--- FAIL: TestTraceCgoCallback (0.00s)
    --- FAIL: TestTraceCgoCallback/Default (0.00s)
        trace_test.go:548: stderr: go: no Go source files
        trace_test.go:550: exit status 1

watchflakes

@gopherbot gopherbot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 10, 2023
@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "internal/trace/v2" && test == "TestTraceCgoCallback"
2023-11-10 15:49 linux-amd64-nocgo go@43ffe2a8 internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (0.00s)
    --- FAIL: TestTraceCgoCallback/Default (0.00s)
        trace_test.go:548: stderr: go: no Go source files
        trace_test.go:550: exit status 1
2023-11-10 15:50 linux-amd64-nocgo go@9d914015 internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (0.00s)
    --- FAIL: TestTraceCgoCallback/Default (0.00s)
        trace_test.go:548: stderr: go: no Go source files
        trace_test.go:550: exit status 1
2023-11-10 15:51 linux-amd64-nocgo go@3b303fa9 internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (0.00s)
    --- FAIL: TestTraceCgoCallback/Default (0.00s)
        trace_test.go:548: stderr: go: no Go source files
        trace_test.go:550: exit status 1

watchflakes

@mknyszek
Copy link
Contributor

This is fixed as of abf8422.

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "internal/trace/v2" && test == "TestTraceCgoCallback"
2023-11-10 15:49 linux-ppc64-sid-buildlet go@43ffe2a8 internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (0.04s)
    --- FAIL: TestTraceCgoCallback/Default (0.04s)
        trace_test.go:548: stderr: go: no Go source files
        trace_test.go:550: exit status 1
2023-11-10 15:50 darwin-amd64-nocgo go@9d914015 internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (0.05s)
    --- FAIL: TestTraceCgoCallback/Default (0.05s)
        trace_test.go:548: stderr: go: no Go source files
        trace_test.go:550: exit status 1
2023-11-10 15:50 linux-ppc64-sid-buildlet go@9d914015 internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (0.05s)
    --- FAIL: TestTraceCgoCallback/Default (0.05s)
        trace_test.go:548: stderr: go: no Go source files
        trace_test.go:550: exit status 1
2023-11-10 15:51 darwin-amd64-nocgo go@3b303fa9 internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (0.05s)
    --- FAIL: TestTraceCgoCallback/Default (0.04s)
        trace_test.go:548: stderr: go: no Go source files
        trace_test.go:550: exit status 1
2023-11-10 15:51 linux-ppc64-sid-buildlet go@3b303fa9 internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (0.03s)
    --- FAIL: TestTraceCgoCallback/Default (0.03s)
        trace_test.go:548: stderr: go: no Go source files
        trace_test.go:550: exit status 1

watchflakes

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "internal/trace/v2" && test == "TestTraceCgoCallback"
2023-11-10 15:49 darwin-amd64-nocgo go@43ffe2a8 internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (0.06s)
    --- FAIL: TestTraceCgoCallback/Default (0.06s)
        trace_test.go:548: stderr: go: no Go source files
        trace_test.go:550: exit status 1

watchflakes

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "internal/trace/v2" && test == "TestTraceCgoCallback"
2023-11-10 15:51 openbsd-riscv64-jsing go@3b303fa9 internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (0.05s)
    --- FAIL: TestTraceCgoCallback/Default (0.05s)
        trace_test.go:548: stderr: go: no Go source files
        trace_test.go:550: exit status 1

watchflakes

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "internal/trace/v2" && test == "TestTraceCgoCallback"
2023-11-10 15:50 openbsd-riscv64-jsing go@9d914015 internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (0.05s)
    --- FAIL: TestTraceCgoCallback/Default (0.05s)
        trace_test.go:548: stderr: go: no Go source files
        trace_test.go:550: exit status 1

watchflakes

@gopherbot gopherbot reopened this Nov 17, 2023
@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "internal/trace/v2" && test == "TestTraceCgoCallback"
2023-11-17 20:40 dragonfly-amd64-622 go@3ff5632d internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (0.92s)
    --- FAIL: TestTraceCgoCallback/Default (0.92s)
        reader_test.go:65: unexpected error while reading the trace: broken trace: failed to advance: frontier:
            [gen=1]

            M 24 [ProcStart time=318751161344 p=5 p_seq=12]
            M 5 [GoSyscallEndBlocked time=318751246208]
            M 11 [ProcStart time=318751358144 p=2 p_seq=29]
            M 10 [ProcStart time=318751388992 p=3 p_seq=26]
            M 12 [ProcStart time=318751464256 p=4 p_seq=9]
...
            String id=89
            	data="internal/poll.(*FD).Write"
            String id=90
            	data="os.(*File).write"
            String id=91
            	data="/tmp/workdir/go/src/os/file_posix.go"
            String id=92
            	data="os.(*File).Write"
            String id=93
            	data="/tmp/workdir/go/src/os/file.go"

watchflakes

@mknyszek mknyszek self-assigned this Nov 17, 2023
@mknyszek mknyszek added this to the Go1.22 milestone Nov 17, 2023
@mknyszek mknyszek added the compiler/runtime Issues related to the Go compiler and/or runtime. label Nov 17, 2023
@mknyszek mknyszek moved this to In Progress in Go Compiler / Runtime Nov 17, 2023
@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "internal/trace/v2" && test == "TestTraceCgoCallback"
2023-11-17 23:15 dragonfly-amd64-622 go@f67b2d8f internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (1.50s)
    --- FAIL: TestTraceCgoCallback/Default (1.49s)
        reader_test.go:65: unexpected error while reading the trace: expected no proc but had one
        trace_test.go:569: found bad trace; dumping to test log...
        trace_test.go:570: Trace Go1.22
            EventBatch gen=1 m=8 time=7013092587 size=140
            GoCreateSyscall dt=100 new_g=34
            ProcStart dt=68 p=6 p_seq=1
            GoSyscallEndBlocked dt=22
            GoStart dt=16 g=34 g_seq=1
...
            String id=86
            	data="runtime.(*scavengerState).park"
            String id=87
            	data="/tmp/workdir/go/src/runtime/mgcscavenge.go"
            String id=88
            	data="runtime.bgscavenge"
            String id=89
            	data="runtime.(*traceAdvancerState).start"
            String id=90
            	data="main._Cfunc_foo"

watchflakes

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "internal/trace/v2" && test == "TestTraceCgoCallback"
2023-11-21 16:20 dragonfly-amd64-622 go@8be8bfea internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (1.53s)
    --- FAIL: TestTraceCgoCallback/Default (1.52s)
        reader_test.go:65: unexpected error while reading the trace: expected no proc but had one
        trace_test.go:569: found bad trace; dumping to test log...
        trace_test.go:570: Trace Go1.22
            EventBatch gen=1 m=13 time=6544938370 size=121
            GoCreateSyscall dt=107 new_g=35
            ProcStart dt=64 p=7 p_seq=1
            GoSyscallEndBlocked dt=23
            GoStart dt=16 g=35 g_seq=1
...
            String id=86
            	data="runtime.traceStartReadCPU"
            String id=87
            	data="runtime.(*scavengerState).park"
            String id=88
            	data="/tmp/workdir/go/src/runtime/mgcscavenge.go"
            String id=89
            	data="runtime.bgscavenge"
            String id=90
            	data="runtime.traceLocker.Gomaxprocs"

watchflakes

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "internal/trace/v2" && test == "TestTraceCgoCallback"
2023-11-21 21:29 dragonfly-amd64-622 go@4e3ac99a internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (1.50s)
    --- FAIL: TestTraceCgoCallback/Default (1.49s)
        reader_test.go:65: unexpected error while reading the trace: expected no proc but had one
        trace_test.go:569: found bad trace; dumping to test log...
        trace_test.go:570: Trace Go1.22
            EventBatch gen=1 m=25 time=7125512681 size=78
            GoCreateSyscall dt=85 new_g=72
            ProcStart dt=62 p=5 p_seq=12
            GoSyscallEndBlocked dt=16
            GoStart dt=12 g=72 g_seq=1
...
            String id=84
            	data="runtime.gopark"
            String id=85
            	data="runtime.(*traceAdvancerState).start.func1"
            String id=86
            	data="runtime.startTheWorld"
            String id=87
            	data="main._Cfunc_foo"
            String id=88
            	data="runtime.traceLocker.Gomaxprocs"

watchflakes

@mknyszek
Copy link
Contributor

Some of these issues may be resolved by https://go.dev/cl/544215.

@mknyszek
Copy link
Contributor

mknyszek commented Nov 22, 2023

The staticlockranking builder found an issue that I think might explain a lot of these failures. See https://go.dev/cl/544396.

EDIT: It wasn't a real issue.

@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "internal/trace/v2" && test == "TestTraceCgoCallback"
2023-11-22 02:20 ios-arm64-corellium go@5f7a4085 internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (1.88s)
    --- FAIL: TestTraceCgoCallback/Default (1.86s)
        reader_test.go:65: unexpected error while reading the trace: expected no proc but had one
        trace_test.go:570: found bad trace; dumping to test log...
        trace_test.go:581: Trace Go1.22
            EventBatch gen=1 m=6105411584 time=21723120546 size=68
            GoCreateSyscall dt=130 new_g=38
            GoSyscallEndBlocked dt=14
            ProcStart dt=6276 p=1 p_seq=28
            GoStart dt=10 g=38 g_seq=1
...
            String id=84
            	data="os.(*File).Write"
            String id=85
            	data="/tmp/workdir-host-ios-arm64-corellium-ios/go/src/os/file.go"
            String id=86
            	data="runtime.gopark"
            String id=87
            	data="runtime.chansend1"
            String id=88
            	data="runtime.(*traceAdvancerState).start"

watchflakes

@gopherbot
Copy link
Contributor Author

Change https://go.dev/cl/545515 mentions this issue: internal/trace/v2: tolerate having a P in GoCreateSyscall

@mknyszek
Copy link
Contributor

https://go.dev/cl/545515 fixes all the "expected no proc but had one" issues.

The only one that doesn't fit the pattern is:

2023-11-17 20:40 dragonfly-amd64-622 go@3ff5632d internal/trace/v2.TestTraceCgoCallback (log)

But I'm fairly certain that's fixed by https://go.dev/cl/544215. I will try and confirm.

@mknyszek
Copy link
Contributor

mknyszek commented Nov 28, 2023

Huh. Actually, that dragonfly failure has something really weird going on. It kind of looks like there's more than one active thread with the same procid, which is typically the system thread ID. Is it not unique on that platform?

EDIT: It's not that. The ID is just reused. I see the problem.

gopherbot pushed a commit that referenced this issue Nov 28, 2023
On non-pthread platforms, it's totally possible for the same M to
GoCreateSyscall/GoDestroySyscall on the same thread multiple times. That
same thread may hold onto its P through all those calls.

For #64060.

Change-Id: Ib968bfd439ecd5bc24fc98d78c06145b0d4b7802
Reviewed-on: https://go-review.googlesource.com/c/go/+/545515
Reviewed-by: Michael Pratt <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
@gopherbot
Copy link
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "internal/trace/v2" && test == "TestTraceCgoCallback"
2023-11-29 16:01 dragonfly-amd64-622 go@b9a4eaa6 internal/trace/v2.TestTraceCgoCallback (log)
--- FAIL: TestTraceCgoCallback (0.67s)
    --- FAIL: TestTraceCgoCallback/Default (0.67s)
        reader_test.go:65: unexpected error while reading the trace: broken trace: failed to advance: frontier:
            [gen=1]

            M 17 [ProcStart time=329731484928 p=3 p_seq=17]
            M 14 [ProcStart time=329731592512 p=0 p_seq=16]
            M 21 [ProcStart time=329731702144 p=2 p_seq=12]
            M 12 [ProcStart time=329731794816 p=5 p_seq=20]
            M 13 [ProcStart time=329731878464 p=3 p_seq=19]
...
            String id=84
            	data="os.(*File).write"
            String id=85
            	data="/tmp/workdir/go/src/os/file_posix.go"
            String id=86
            	data="os.(*File).Write"
            String id=87
            	data="/tmp/workdir/go/src/os/file.go"
            String id=88
            	data="main._Cfunc_foo"

watchflakes

@gopherbot
Copy link
Contributor Author

Change https://go.dev/cl/546096 mentions this issue: internal/trace/v2: handle thread ID reuse correctly

@github-project-automation github-project-automation bot moved this from In Progress to Done in Go Compiler / Runtime Dec 5, 2023
ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024
To begin with, CL 545515 made the trace parser tolerant of
GoCreateSyscall having a P, but that was wrong. Because dropm trashes
the M's syscalltick, that case should never be possible. So the first
thing this change does is it rewrites the test that CL introduced to
expect a failure instead of a success.

What I'd misinterpreted as a case that should be allowed was actually
the same as the other issues causing golang#64060, which is that the parser
doesn't correctly implement what happens to Ps when a thread calls back
into Go on non-pthread platforms, and what happens when a thread dies
on pthread platorms (or more succinctly, what the runtime does when it
calls dropm).

Specifically, the GoDestroySyscall event implies that if any P is still
running on that M when it's called, that the P stops running. This is
what is intended by the runtime trashing the M's syscalltick; when it
calls back into Go, the tracer models that thread as obtaining a new P
from scratch.

Handling this incorrectly manifests in one of two ways.

On pthread platforms, GoDestroySyscall is only emitted when a C thread
that previously called into Go is destroyed. However, that thread ID can
be reused. Because we have no thread events, whether it's the same
thread or not is totally ambiguous to the tracer. Therefore, the tracer
may observe a thread that previously died try to start running with a
new P under the same identity. The association to the old P is still
intact because the ID is the same, and the tracer gets confused -- it
appears as if two Ps are running on the same M!

On non-pthread platforms, GoDestroySyscall is emitted on every return to
C from Go code. In this case, the same thread with the same identity is
naturally going to keep calling back into Go. But again, since the
runtime trashes syscalltick in dropm, it's always going to acquire a P
from the tracer's perspective. But if this is a different P than before,
just like the pthread case, the parser is going to get confused, since
it looks like two Ps are running on the same M!

The case that CL 545515 actually handled was actually the non-pthread
case, specifically where the same P is reacquired by an M calling back
into Go. In this case, if we tolerate having a P, then what we'll
observe is the M stealing its own P from itself, then running with it.

Now that we know what the problem is, how do we fix it? This change
addresses the problem by emitting an extra event when encountering a
GoDestroySyscall with an active P in its context. In this case, it emits
an additional ProcSteal event to steal from itself, indicating that the
P stopped running. This removes any association between that M and that
P, resolving any ambiguity in the tracer.

There's one other minor detail that needs to be worked out, and that's
what happens to any *real* ProcSteal event that stole the P we're now
emitting an extra ProcSteal event for. Since, this event is going to
look for an M that may have moved on already and the P at this point is
already idle. Luckily, we have *exactly* the right fix for this. The
handler for GoDestroySyscall now moves any active P it has to the
ProcSyscallAbandoned state, indicating that we've lost information about
the P and that it should be treated as already idle. Conceptually this
all makes sense: this is a P in _Psyscall that has been abandoned by the
M it was previously bound to.

It's unfortunate how complicated this has all ended up being, but we can
take a closer look at that in the future.

Fixes golang#64060.

Change-Id: Ie9e6eb9cf738607617446e3487392643656069a2
Reviewed-on: https://go-review.googlesource.com/c/go/+/546096
Reviewed-by: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Run-TryBot: Michael Knyszek <[email protected]>
Auto-Submit: Michael Knyszek <[email protected]>
@golang golang locked and limited conversation to collaborators Dec 4, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Archived in project
Development

No branches or pull requests

2 participants