Commit ccad8a9

doujiang24 authored and cherrymui committed
runtime/cgo: store M for C-created thread in pthread key
This reapplies CL 392854, with the followup fixes in CL 479255, CL 479915, and CL 481057 incorporated.

CL 392854, by doujiang24 <[email protected]>, speeds up C to Go calls by binding the M to the C thread. See below for its description.
CL 479255 is a followup fix for a small bug in ARM assembly code.
CL 479915 is another followup fix to address C to Go calls after the C code uses some stack, but that CL is also buggy.
CL 481057, by Michael Knyszek, is a followup fix for a memory leak bug of CL 479915.

[Original CL 392854 description]

In a C thread, it's necessary to acquire an extra M by using needm while invoking a Go function from C. But needm and dropm are costly because of the signal-related syscalls. So, we change to not dropm while returning back to C, which means binding the extra M to the C thread until it exits, to avoid needm and dropm on each C to Go call. Instead, we only dropm while the C thread exits, so the extra M won't leak.

When invoking a Go function from C:
Allocate a pthread variable using pthread_key_create, only once per shared object, and register a thread-exit-time destructor. And store the g0 of the current m into the thread-specific value of the pthread key, only once per C thread, so that the destructor will put the extra M back onto the extra M list while the C thread exits.

When returning back to C:
Skip dropm in cgocallback, when the pthread variable has been created, so that the extra M will be reused the next time a Go function is invoked from C.

This is purely a performance optimization. The old version, in which needm & dropm happen on each cgo call, is still correct too, and we have to keep the old version on systems with cgo but without pthreads, like Windows.

This optimization is significant, and the specific value depends on the OS and CPU, but in general, it can be considered as 10x faster, for a simple Go function call from a C thread.

For the newly added BenchmarkCGoInCThread, some benchmark results:
1. it's 28x faster, from 3395 ns/op to 121 ns/op, in darwin OS & Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
2. it's 6.5x faster, from 1495 ns/op to 230 ns/op, in Linux OS & Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz

[CL 479915 description]

Currently, when C calls into Go the first time, we grab an M using needm, which sets m.g0's stack bounds using the SP. We don't know how big the stack is, so we simply assume 32K. Previously, when the Go function returns to C, we drop the M, and the next time C calls into Go, we put a new stack bound on the g0 based on the current SP. After CL 392854, we don't drop the M, and the next time C calls into Go, we reuse the same g0, without recomputing the stack bounds. If the C code uses quite a bit of stack space before calling into Go, the SP may be well below the 32K stack bound we assumed, so the runtime thinks the g0 stack overflows.

This CL makes needm get a more accurate stack bound from pthread. (On some platforms this may still be a guess, as we don't know exactly where we are in the C stack, but it is probably better than simply assuming 32K.)

Fixes #51676.
Fixes #59294.

Change-Id: I9bf1400106d5c08ce621d2ed1df3a2d9e3f55494
Reviewed-on: https://go-review.googlesource.com/c/go/+/481061
Reviewed-by: Michael Knyszek <[email protected]>
Run-TryBot: Cherry Mui <[email protected]>
Reviewed-by: DeJiang Zhu (doujiang) <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
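The binding scheme in the original CL 392854 description can be illustrated with a small stand-alone C model. This is only a sketch under stated assumptions, not the runtime's code: `fake_m`, `bound_m`, `release_m`, `run_calls_in_thread`, and the two counters are hypothetical names, and a heap allocation stands in for acquiring an extra M. It shows the key being created once, the expensive acquisition happening once per thread, and the destructor releasing the resource when the thread exits.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the runtime's extra M; not the real runtime type. */
typedef struct fake_m { int calls; } fake_m;

static pthread_key_t m_key;
static pthread_once_t m_key_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t stats_mu = PTHREAD_MUTEX_INITIALIZER;
static int acquired; /* times the slow, needm-like path ran */
static int released; /* times the exit-time, dropm-like destructor ran */

/* Thread-exit destructor: runs only for threads that stored a non-NULL value. */
static void release_m(void *p)
{
	free(p);
	pthread_mutex_lock(&stats_mu);
	released++;
	pthread_mutex_unlock(&stats_mu);
}

/* Creates the key once per process and registers the destructor. */
static void create_key(void)
{
	if (pthread_key_create(&m_key, release_m) != 0)
		abort();
}

/* Entry hook for every simulated C-to-Go call. */
static fake_m *bound_m(void)
{
	fake_m *m;

	pthread_once(&m_key_once, create_key);
	m = pthread_getspecific(m_key);
	if (m == NULL) {
		/* First call on this thread: acquire once and bind to the thread. */
		m = calloc(1, sizeof *m);
		if (m == NULL || pthread_setspecific(m_key, m) != 0)
			abort();
		pthread_mutex_lock(&stats_mu);
		acquired++;
		pthread_mutex_unlock(&stats_mu);
	}
	m->calls++;
	return m;
}

static void *caller_thread(void *arg)
{
	int i, max = *(int *)arg;

	for (i = 0; i < max; i++)
		bound_m();
	return NULL;
}

/* Runs max simulated calls on one new thread; returns 0 on success. */
static int run_calls_in_thread(int max)
{
	pthread_t t;

	if (pthread_create(&t, NULL, caller_thread, &max) != 0)
		return -1;
	return pthread_join(t, NULL) == 0 ? 0 : -1;
}
```

Calling `run_calls_in_thread(1000)` from a `main` function should leave both counters at 1: every call after the first takes the fast path, and the destructor runs exactly once as the thread exits.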
1 parent 33d8cde commit ccad8a9

43 files changed, +944 -67 lines changed

misc/cgo/test/cgo_test.go

Lines changed: 4 additions & 3 deletions
@@ -104,6 +104,7 @@ func TestThreadLock(t *testing.T) { testThreadLockFunc(t) }
 func TestUnsignedInt(t *testing.T) { testUnsignedInt(t) }
 func TestZeroArgCallback(t *testing.T) { testZeroArgCallback(t) }

-func BenchmarkCgoCall(b *testing.B)     { benchCgoCall(b) }
-func BenchmarkGoString(b *testing.B)    { benchGoString(b) }
-func BenchmarkCGoCallback(b *testing.B) { benchCallback(b) }
+func BenchmarkCgoCall(b *testing.B)      { benchCgoCall(b) }
+func BenchmarkGoString(b *testing.B)     { benchGoString(b) }
+func BenchmarkCGoCallback(b *testing.B)  { benchCallback(b) }
+func BenchmarkCGoInCThread(b *testing.B) { benchCGoInCthread(b) }

misc/cgo/test/cthread_unix.c

Lines changed: 24 additions & 0 deletions
@@ -32,3 +32,27 @@ doAdd(int max, int nthread)
 	for(i=0; i<nthread; i++)
 		pthread_join(thread_id[i], 0);
 }
+
+static void*
+goDummyCallbackThread(void* p)
+{
+	int i, max;
+
+	max = *(int*)p;
+	for(i=0; i<max; i++)
+		goDummy();
+	return NULL;
+}
+
+int
+callGoInCThread(int max)
+{
+	pthread_t thread;
+
+	if (pthread_create(&thread, NULL, goDummyCallbackThread, (void*)(&max)) != 0)
+		return -1;
+	if (pthread_join(thread, NULL) != 0)
+		return -1;
+
+	return max;
+}

misc/cgo/test/cthread_windows.c

Lines changed: 22 additions & 0 deletions
@@ -35,3 +35,25 @@ doAdd(int max, int nthread)
 		CloseHandle((HANDLE)thread_id[i]);
 	}
 }
+
+__stdcall
+static unsigned int
+goDummyCallbackThread(void* p)
+{
+	int i, max;
+
+	max = *(int*)p;
+	for(i=0; i<max; i++)
+		goDummy();
+	return 0;
+}
+
+int
+callGoInCThread(int max)
+{
+	uintptr_t thread_id;
+	thread_id = _beginthreadex(0, 0, goDummyCallbackThread, &max, 0, 0);
+	WaitForSingleObject((HANDLE)thread_id, INFINITE);
+	CloseHandle((HANDLE)thread_id);
+	return max;
+}

misc/cgo/test/testx.go

Lines changed: 14 additions & 0 deletions
@@ -24,6 +24,7 @@ import (
 /*
 // threads
 extern void doAdd(int, int);
+extern int callGoInCThread(int);

 // issue 1328
 void IntoC(void);
@@ -146,6 +147,10 @@ func Add(x int) {
 	*p = 2
 }

+//export goDummy
+func goDummy() {
+}
+
 func testCthread(t *testing.T) {
 	if (runtime.GOOS == "darwin" || runtime.GOOS == "ios") && runtime.GOARCH == "arm64" {
 		t.Skip("the iOS exec wrapper is unable to properly handle the panic from Add")
@@ -159,6 +164,15 @@ func testCthread(t *testing.T) {
 	}
 }

+// Benchmark measuring overhead from C to Go in a C thread.
+// Create a new C thread and invoke Go function repeatedly in the new C thread.
+func benchCGoInCthread(b *testing.B) {
+	n := C.callGoInCThread(C.int(b.N))
+	if int(n) != b.N {
+		b.Fatal("unmatch loop times")
+	}
+}
+
 // issue 1328

 //export BackIntoGo

misc/cgo/testcarchive/carchive_test.go

Lines changed: 54 additions & 0 deletions
@@ -1247,3 +1247,57 @@ func TestPreemption(t *testing.T) {
 		t.Error(err)
 	}
 }
+
+// Issue 59294. Test calling Go function from C after using some
+// stack space.
+func TestDeepStack(t *testing.T) {
+	t.Parallel()
+
+	if !testWork {
+		defer func() {
+			os.Remove("testp9" + exeSuffix)
+			os.Remove("libgo9.a")
+			os.Remove("libgo9.h")
+		}()
+	}
+
+	cmd := exec.Command("go", "build", "-buildmode=c-archive", "-o", "libgo9.a", "./libgo9")
+	out, err := cmd.CombinedOutput()
+	t.Logf("%v\n%s", cmd.Args, out)
+	if err != nil {
+		t.Fatal(err)
+	}
+	checkLineComments(t, "libgo9.h")
+	checkArchive(t, "libgo9.a")
+
+	// build with -O0 so the C compiler won't optimize out the large stack frame
+	ccArgs := append(cc, "-O0", "-o", "testp9"+exeSuffix, "main9.c", "libgo9.a")
+	out, err = exec.Command(ccArgs[0], ccArgs[1:]...).CombinedOutput()
+	t.Logf("%v\n%s", ccArgs, out)
+	if err != nil {
+		t.Fatal(err)
+	}
+
+	argv := cmdToRun("./testp9")
+	cmd = exec.Command(argv[0], argv[1:]...)
+	sb := new(strings.Builder)
+	cmd.Stdout = sb
+	cmd.Stderr = sb
+	if err := cmd.Start(); err != nil {
+		t.Fatal(err)
+	}
+
+	timer := time.AfterFunc(time.Minute,
+		func() {
+			t.Error("test program timed out")
+			cmd.Process.Kill()
+		},
+	)
+	defer timer.Stop()
+
+	err = cmd.Wait()
+	t.Logf("%v\n%s", cmd.Args, sb)
+	if err != nil {
+		t.Error(err)
+	}
+}
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+// Copyright 2023 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package main
+
+import "runtime"
+
+import "C"
+
+func main() {}
+
+//export GoF
+func GoF() { runtime.GC() }
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+// Copyright 2023 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+#include "libgo9.h"
+
+void use(int *x) { (*x)++; }
+
+void callGoFWithDeepStack() {
+	int x[10000];
+
+	use(&x[0]);
+	use(&x[9999]);
+
+	GoF();
+
+	use(&x[0]);
+	use(&x[9999]);
+}
+
+int main() {
+	GoF();                  // call GoF without using much stack
+	callGoFWithDeepStack(); // call GoF with a deep stack
+}
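The arithmetic behind this test can be worked through with a simplified model. The sketch below is an illustration, not the runtime's code: `bounds`, `guess_bounds`, `sp_in_bounds`, and `sp_after_frame` are hypothetical names, and the model takes the SP at the first call as the upper bound, which is a simplification. The 32K assumption comes from the CL 479915 description. With a 4-byte `int`, the `int x[10000]` frame is 40000 bytes, which is more than 32768, so a bound guessed at the first shallow call no longer covers the SP of the second call.

```c
#include <stdbool.h>
#include <stdint.h>

/* Size the old needm path assumed for the g0 stack, per the CL description. */
enum { ASSUMED_STACK = 32 * 1024 };

/* Simplified model of g0 stack bounds; the field names are illustrative. */
typedef struct { uintptr_t lo, hi; } bounds;

/* Guess the bounds from the SP observed when C first calls into Go. */
static bounds guess_bounds(uintptr_t sp)
{
	bounds b = { sp - ASSUMED_STACK, sp };
	return b;
}

/* The stack grows down, so a usable SP lies in (lo, hi]. */
static bool sp_in_bounds(bounds b, uintptr_t sp)
{
	return sp > b.lo && sp <= b.hi;
}

/* SP after the caller pushes a frame of frame_bytes. */
static uintptr_t sp_after_frame(uintptr_t sp, uintptr_t frame_bytes)
{
	return sp - frame_bytes;
}
```

In this model, a bound guessed at some SP still covers a later call made 1024 bytes deeper, but not one made `10000 * sizeof(int)` bytes deeper. That is the situation the commit message describes once the g0 is reused without recomputing its bounds.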

src/runtime/asm_386.s

Lines changed: 35 additions & 6 deletions
@@ -689,7 +689,20 @@ nosave:
 TEXT ·cgocallback(SB),NOSPLIT,$12-12 // Frame size must match commented places below
 	NO_LOCAL_POINTERS

-	// If g is nil, Go did not create the current thread.
+	// Skip cgocallbackg, just dropm when fn is nil, and frame is the saved g.
+	// It is used to dropm while thread is exiting.
+	MOVL fn+0(FP), AX
+	CMPL AX, $0
+	JNE loadg
+	// Restore the g from frame.
+	get_tls(CX)
+	MOVL frame+4(FP), BX
+	MOVL BX, g(CX)
+	JMP dropm
+
+loadg:
+	// If g is nil, Go did not create the current thread,
+	// or if this thread never called into Go on pthread platforms.
 	// Call needm to obtain one for temporary use.
 	// In this case, we're running on the thread stack, so there's
 	// lots of space, but the linker doesn't know. Hide the call from
@@ -707,9 +720,9 @@ TEXT ·cgocallback(SB),NOSPLIT,$12-12 // Frame size must match commented places
 	MOVL BP, savedm-4(SP) // saved copy of oldm
 	JMP havem
 needm:
-	MOVL $runtime·needm(SB), AX
+	MOVL $runtime·needAndBindM(SB), AX
 	CALL AX
-	MOVL $0, savedm-4(SP) // dropm on return
+	MOVL $0, savedm-4(SP)
 	get_tls(CX)
 	MOVL g(CX), BP
 	MOVL g_m(BP), BP
@@ -784,13 +797,29 @@ havem:
 	MOVL 0(SP), AX
 	MOVL AX, (g_sched+gobuf_sp)(SI)

-	// If the m on entry was nil, we called needm above to borrow an m
-	// for the duration of the call. Since the call is over, return it with dropm.
+	// If the m on entry was nil, we called needm above to borrow an m,
+	// 1. for the duration of the call on non-pthread platforms,
+	// 2. or the duration of the C thread alive on pthread platforms.
+	// If the m on entry wasn't nil,
+	// 1. the thread might be a Go thread,
+	// 2. or it wasn't the first call from a C thread on pthread platforms,
+	// since we skip dropm to reuse the m in the first call.
 	MOVL savedm-4(SP), DX
 	CMPL DX, $0
-	JNE 3(PC)
+	JNE droppedm
+
+	// Skip dropm to reuse it in the next call, when a pthread key has been created.
+	MOVL _cgo_pthread_key_created(SB), DX
+	// It means cgo is disabled when _cgo_pthread_key_created is a nil pointer, need dropm.
+	CMPL DX, $0
+	JEQ dropm
+	CMPL (DX), $0
+	JNE droppedm
+
+dropm:
 	MOVL $runtime·dropm(SB), AX
 	CALL AX
+droppedm:

 	// Done!
 	RET

src/runtime/asm_amd64.s

Lines changed: 33 additions & 5 deletions
@@ -918,7 +918,20 @@ GLOBL zeroTLS<>(SB),RODATA,$const_tlsSize
 TEXT ·cgocallback(SB),NOSPLIT,$24-24
 	NO_LOCAL_POINTERS

-	// If g is nil, Go did not create the current thread.
+	// Skip cgocallbackg, just dropm when fn is nil, and frame is the saved g.
+	// It is used to dropm while thread is exiting.
+	MOVQ fn+0(FP), AX
+	CMPQ AX, $0
+	JNE loadg
+	// Restore the g from frame.
+	get_tls(CX)
+	MOVQ frame+8(FP), BX
+	MOVQ BX, g(CX)
+	JMP dropm
+
+loadg:
+	// If g is nil, Go did not create the current thread,
+	// or if this thread never called into Go on pthread platforms.
 	// Call needm to obtain one m for temporary use.
 	// In this case, we're running on the thread stack, so there's
 	// lots of space, but the linker doesn't know. Hide the call from
@@ -956,9 +969,9 @@ needm:
 	// a bad value in there, in case needm tries to use it.
 	XORPS X15, X15
 	XORQ R14, R14
-	MOVQ $runtime·needm<ABIInternal>(SB), AX
+	MOVQ $runtime·needAndBindM<ABIInternal>(SB), AX
 	CALL AX
-	MOVQ $0, savedm-8(SP) // dropm on return
+	MOVQ $0, savedm-8(SP)
 	get_tls(CX)
 	MOVQ g(CX), BX
 	MOVQ g_m(BX), BX
@@ -1047,11 +1060,26 @@ havem:
 	MOVQ 0(SP), AX
 	MOVQ AX, (g_sched+gobuf_sp)(SI)

-	// If the m on entry was nil, we called needm above to borrow an m
-	// for the duration of the call. Since the call is over, return it with dropm.
+	// If the m on entry was nil, we called needm above to borrow an m,
+	// 1. for the duration of the call on non-pthread platforms,
+	// 2. or the duration of the C thread alive on pthread platforms.
+	// If the m on entry wasn't nil,
+	// 1. the thread might be a Go thread,
+	// 2. or it wasn't the first call from a C thread on pthread platforms,
+	// since we skip dropm to reuse the m in the first call.
 	MOVQ savedm-8(SP), BX
 	CMPQ BX, $0
 	JNE done
+
+	// Skip dropm to reuse it in the next call, when a pthread key has been created.
+	MOVQ _cgo_pthread_key_created(SB), AX
+	// It means cgo is disabled when _cgo_pthread_key_created is a nil pointer, need dropm.
+	CMPQ AX, $0
+	JEQ dropm
+	CMPQ (AX), $0
+	JNE done
+
+dropm:
 	MOVQ $runtime·dropm(SB), AX
 	CALL AX
 #ifdef GOOS_windows

src/runtime/asm_arm.s

Lines changed: 33 additions & 5 deletions
@@ -630,6 +630,16 @@ nosave:
 TEXT ·cgocallback(SB),NOSPLIT,$12-12
 	NO_LOCAL_POINTERS

+	// Skip cgocallbackg, just dropm when fn is nil, and frame is the saved g.
+	// It is used to dropm while thread is exiting.
+	MOVW fn+0(FP), R1
+	CMP $0, R1
+	B.NE loadg
+	// Restore the g from frame.
+	MOVW frame+4(FP), g
+	B dropm
+
+loadg:
 	// Load m and g from thread-local storage.
 #ifdef GOOS_openbsd
 	BL runtime·load_g(SB)
@@ -639,7 +649,8 @@ TEXT ·cgocallback(SB),NOSPLIT,$12-12
 	BL.NE runtime·load_g(SB)
 #endif

-	// If g is nil, Go did not create the current thread.
+	// If g is nil, Go did not create the current thread,
+	// or if this thread never called into Go on pthread platforms.
 	// Call needm to obtain one for temporary use.
 	// In this case, we're running on the thread stack, so there's
 	// lots of space, but the linker doesn't know. Hide the call from
@@ -653,7 +664,7 @@ TEXT ·cgocallback(SB),NOSPLIT,$12-12

 needm:
 	MOVW g, savedm-4(SP) // g is zero, so is m.
-	MOVW $runtime·needm(SB), R0
+	MOVW $runtime·needAndBindM(SB), R0
 	BL (R0)

 	// Set m->g0->sched.sp = SP, so that if a panic happens
@@ -724,14 +735,31 @@ havem:
 	MOVW savedsp-12(SP), R4 // must match frame size
 	MOVW R4, (g_sched+gobuf_sp)(g)

-	// If the m on entry was nil, we called needm above to borrow an m
-	// for the duration of the call. Since the call is over, return it with dropm.
+	// If the m on entry was nil, we called needm above to borrow an m,
+	// 1. for the duration of the call on non-pthread platforms,
+	// 2. or the duration of the C thread alive on pthread platforms.
+	// If the m on entry wasn't nil,
+	// 1. the thread might be a Go thread,
+	// 2. or it wasn't the first call from a C thread on pthread platforms,
+	// since we skip dropm to reuse the m in the first call.
 	MOVW savedm-4(SP), R6
 	CMP $0, R6
-	B.NE 3(PC)
+	B.NE done
+
+	// Skip dropm to reuse it in the next call, when a pthread key has been created.
+	MOVW _cgo_pthread_key_created(SB), R6
+	// It means cgo is disabled when _cgo_pthread_key_created is a nil pointer, need dropm.
+	CMP $0, R6
+	B.EQ dropm
+	MOVW (R6), R6
+	CMP $0, R6
+	B.NE done
+
+dropm:
 	MOVW $runtime·dropm(SB), R0
 	BL (R0)

+done:
 	// Done!
 	RET
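The exit path that the three cgocallback versions share can be restated as a small C predicate. This is a reading aid written from the assembly comments, not code from the commit: `should_dropm` and its parameter names are hypothetical. `savedm` models the m saved on entry, and `key_created` models the `_cgo_pthread_key_created` pointer, which the comments say is nil when cgo is disabled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns true when the normal return path of cgocallback should call dropm.
 * Hypothetical model: savedm is the m saved on entry (NULL when needm was
 * called), and key_created points at the "pthread key created" flag, or is
 * NULL when cgo is disabled.
 */
static bool should_dropm(const void *savedm, const uintptr_t *key_created)
{
	if (savedm != NULL)
		return false; /* an m already existed on entry: nothing to return */
	if (key_created == NULL)
		return true; /* cgo disabled: no key, so drop as before */
	/* Key created: keep the m bound to the thread. Otherwise drop it. */
	return *key_created == 0;
}
```

The separate entry case, where fn is nil and frame carries the saved g, bypasses this check and jumps straight to dropm. According to the comments in the diff, that path is used while the thread is exiting.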
