Description
The .NET runtime uses the membarrier() syscall in its Linux implementation of FlushProcessWriteBuffers(). The initialization call, membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED), can run substantially longer in a process with more than one thread, because the kernel skips its fast path when mm->mm_users > 1.
PAL_InitializeCoreCLR() hits the slow path because it initializes membarrier() after launching a sync manager worker thread. Startup time can be improved by moving the membarrier initialization ahead of thread creation.
A potential fix is in runtime PR 106724.
The issue can be demonstrated in this simple C program:
// membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED) init demo
// 1) Install tools: sudo apt install gcc libc6-dev hyperfine
// 2) Build test program: gcc -o mbdemo mbdemo.c -lpthread
// 3) Slow: hyperfine --style basic --time-unit millisecond "./mbdemo n"
// 4) Fast: hyperfine --style basic --time-unit millisecond "./mbdemo y"
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <pthread.h>
#include <assert.h>
#include <stdio.h>

static void* worker_funct(void* param) {
    printf("worker done\n");
    return param;
}

// The syscall is kept outside assert() so it still runs when built with -DNDEBUG.
static void register_membarrier(void) {
    long rc = syscall(SYS_membarrier, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0);
    assert(rc == 0);
    (void)rc;
}

int main(int argc, char** argv) {
    int rc;
    if (argc >= 2 && argv[1][0] == 'y') {
        // init before thread (fast path: process is still single-threaded)
        register_membarrier();
    }
    pthread_t worker_thread = {0};
    rc = pthread_create(&worker_thread, NULL, &worker_funct, NULL);
    assert(rc == 0);
    // init after thread (slow path unless already registered above)
    register_membarrier();
    rc = pthread_join(worker_thread, NULL);
    assert(rc == 0);
    (void)rc;
    printf("main done\n");
    return 0;
}
~11ms difference on a 16-core arm64 system (AWS r7g.4xlarge):
$ hyperfine --style basic --time-unit millisecond "./mbdemo n"
Benchmark 1: ./mbdemo n
Time (mean ± σ): 11.5 ms ± 3.0 ms [User: 0.8 ms, System: 0.0 ms]
Range (min … max): 5.5 ms … 23.5 ms 496 runs
$ hyperfine --style basic --time-unit millisecond "./mbdemo y"
Benchmark 1: ./mbdemo y
Time (mean ± σ): 0.5 ms ± 0.0 ms [User: 0.5 ms, System: 0.3 ms]
Range (min … max): 0.5 ms … 0.7 ms 2992 runs
~8ms difference on a 16-core x86_64 system (AWS r6i.4xlarge):
ubuntu@ip-172-31-41-194:~$ hyperfine --style basic --time-unit millisecond "./mbdemo n"
Benchmark 1: ./mbdemo n
Time (mean ± σ): 8.7 ms ± 2.0 ms [User: 0.5 ms, System: 0.0 ms]
Range (min … max): 5.5 ms … 16.5 ms 335 runs
ubuntu@ip-172-31-41-194:~$ hyperfine --style basic --time-unit millisecond "./mbdemo y"
Benchmark 1: ./mbdemo y
Time (mean ± σ): 0.5 ms ± 0.0 ms [User: 0.3 ms, System: 0.2 ms]
Range (min … max): 0.4 ms … 1.5 ms 3157 runs
A workaround can be implemented with an LD_PRELOAD shared library that calls membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED) before the first dotnet thread is created:
// LD_PRELOAD hack to run `membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED)` at process startup, before the first thread is created.
// 1) Build: gcc -shared -fPIC -o mbhack.so mbhack.c -lpthread
// 2) Run: LD_PRELOAD=/path/to/mbhack.so some-dotnet-binary
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

// Runs when the library is loaded, before main(). The return value is
// ignored: if registration fails, the process simply starts as it would
// without the preload.
__attribute__((constructor))
static void mbhack_init(void)
{
    syscall(SYS_membarrier, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0);
}
~15ms difference on a 16-core arm64 system (AWS r7g.4xlarge):
$ hyperfine --style basic --time-unit millisecond ./bin/Release/net9.0/hello-world-dotnet9
Benchmark 1: ./bin/Release/net9.0/hello-world-dotnet9
Time (mean ± σ): 46.3 ms ± 4.1 ms [User: 22.9 ms, System: 7.9 ms]
Range (min … max): 35.3 ms … 55.3 ms 62 runs
$ LD_PRELOAD=$HOME/mbhack.so hyperfine --style basic --time-unit millisecond ./bin/Release/net9.0/hello-world-dotnet9
Benchmark 1: ./bin/Release/net9.0/hello-world-dotnet9
Time (mean ± σ): 30.1 ms ± 0.3 ms [User: 22.9 ms, System: 8.0 ms]
Range (min … max): 29.5 ms … 30.9 ms 95 runs
~10ms difference on a 16-core x86_64 system (AWS r6i.4xlarge):
ubuntu@ip-172-31-41-194:~/hello-world-dotnet9$ hyperfine --style basic --time-unit millisecond ./bin/Release/net9.0/hello-world-dotnet9
Benchmark 1: ./bin/Release/net9.0/hello-world-dotnet9
Time (mean ± σ): 36.6 ms ± 2.0 ms [User: 19.3 ms, System: 6.4 ms]
Range (min … max): 31.4 ms … 42.6 ms 82 runs
ubuntu@ip-172-31-41-194:~/hello-world-dotnet9$ LD_PRELOAD=$HOME/mbhack.so hyperfine --style basic --time-unit millisecond ./bin/Release/net9.0/hello-world-dotnet9
Benchmark 1: ./bin/Release/net9.0/hello-world-dotnet9
Time (mean ± σ): 26.5 ms ± 0.9 ms [User: 20.4 ms, System: 6.2 ms]
Range (min … max): 25.5 ms … 29.2 ms 109 runs
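A slightly more defensive variant of the preload library could first ask the kernel which membarrier commands it supports, using MEMBARRIER_CMD_QUERY, and register only when the private expedited command is available. This is a sketch under that assumption, not something benchmarked here; `mask_has_cmd` is a hypothetical helper name, and the build and run steps would be the same as for mbhack.c.

```c
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: returns 1 if `cmd` is set in a MEMBARRIER_CMD_QUERY
// result, 0 otherwise (including when the query itself failed with -1).
static int mask_has_cmd(long mask, int cmd) {
    return mask >= 0 && (mask & cmd) != 0;
}

// Runs when the library is loaded, before main().
__attribute__((constructor))
static void mbhack_init(void)
{
    // MEMBARRIER_CMD_QUERY returns a bitmask of supported commands,
    // or -1 if the syscall is unavailable.
    long mask = syscall(SYS_membarrier, MEMBARRIER_CMD_QUERY, 0, 0);
    if (mask_has_cmd(mask, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED)) {
        // Result deliberately ignored: on failure the process starts
        // exactly as it would without the preload.
        (void)syscall(SYS_membarrier, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0);
    }
}
```

The extra query costs one additional syscall at startup, so whether it is worth having depends on how likely the target systems are to lack membarrier support.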