membarrier(REGISTER_PRIVATE_EXPEDITED) waits through an unnecessary RCU grace period during Linux process startup #106722

@harisokanovic

The dotnet runtime uses the membarrier() syscall in its Linux implementation of FlushProcessWriteBuffers(). The initialization call, membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED), can run substantially longer in a process with more than one thread: when mm->mm_users > 1 the call bypasses the kernel's fast path and waits through an RCU grace period.

PAL_InitializeCoreCLR() hits the slow path by initializing membarrier() after launching a sync manager worker thread. Startup time can be improved by reordering membarrier init ahead of thread creation.

Potential fix in runtime PR 106724.
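
The reordering described above can be sketched roughly as follows. This is a minimal illustration, not the actual PAL code or the change in the PR: `membarrier_init_early`, `worker_stub`, and `startup_with_early_membarrier` are hypothetical names, and the sketch assumes kernel headers that define MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <linux/membarrier.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: register for private expedited membarrier.
// Returns 0 on success, -1 if the kernel does not support the command
// or the registration call fails.
static int membarrier_init_early(void) {
  // MEMBARRIER_CMD_QUERY returns a bitmask of supported commands, or -1.
  long mask = syscall(SYS_membarrier, MEMBARRIER_CMD_QUERY, 0, 0);
  if (mask < 0 || !(mask & MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED))
    return -1;
  long rc = syscall(SYS_membarrier, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0);
  return rc == 0 ? 0 : -1;
}

// Stand-in for a runtime worker thread (hypothetical).
static void* worker_stub(void* param) {
  return param;
}

// Hypothetical startup routine: register while the process is still
// single-threaded, and only then create the first worker thread.
// Returns the pthread_create() result; the registration result is
// stored in *membarrier_status.
static int startup_with_early_membarrier(pthread_t* worker, int* membarrier_status) {
  assert(worker != NULL && membarrier_status != NULL);
  *membarrier_status = membarrier_init_early();
  return pthread_create(worker, NULL, worker_stub, NULL);
}
```

The only point of the sketch is the ordering inside `startup_with_early_membarrier`: the registration call is issued before any thread exists. A registration failure is reported to the caller instead of aborting startup, which is a design assumption of this sketch.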


The issue can be demonstrated in this simple C program:

// membarrier(REGISTER_PRIVATE_EXPEDITED) init demo
// 1) Install tools: sudo apt install gcc libc6-dev hyperfine
// 2) Build test program: gcc -o mbdemo mbdemo.c -lpthread
// 3) Slow: hyperfine --style basic --time-unit millisecond "./mbdemo n"
// 4) Fast: hyperfine --style basic --time-unit millisecond "./mbdemo y"

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <pthread.h>
#include <assert.h>
#include <stdio.h>

static void* worker_funct(void* param) {
  printf("worker done\n");
  return param;
}

int main(int argc, const char** argv) {
  // The calls are kept outside assert() so they still run when built with -DNDEBUG.
  long rc;

  if (argc >= 2 && argv[1][0] == 'y') {
    // init before thread: the process is still single-threaded here
    rc = syscall(SYS_membarrier, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0);
    assert(rc == 0);
  }

  pthread_t worker_thread = {0};
  rc = pthread_create(&worker_thread, NULL, &worker_funct, NULL);
  assert(rc == 0);

  // init after thread
  rc = syscall(SYS_membarrier, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0);
  assert(rc == 0);

  rc = pthread_join(worker_thread, NULL);
  assert(rc == 0);
  (void)rc;

  printf("main done\n");
  return 0;
}
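
To time the registration call by itself instead of the whole process, a variant along these lines might help. It is only a sketch: `idle_worker` and `time_register_ns` are hypothetical names, and each measurement runs in a forked child on the assumption that the calling process has not registered yet.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <linux/membarrier.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

// Keeps a second thread alive while the main thread registers.
// The thread ends when the child process calls _exit().
static void* idle_worker(void* param) {
  pause();
  return param;
}

// Hypothetical helper: fork a child, optionally start a thread in it,
// then time MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED in that child.
// Returns elapsed nanoseconds, or -1 if the measurement could not be made
// (pipe/fork/thread failure, or the membarrier call itself failed).
static int64_t time_register_ns(int thread_first) {
  assert(thread_first == 0 || thread_first == 1);
  int fds[2];
  if (pipe(fds) != 0)
    return -1;
  pid_t pid = fork();
  if (pid < 0) {
    close(fds[0]);
    close(fds[1]);
    return -1;
  }
  if (pid == 0) {
    // Child: measure and report the result through the pipe.
    close(fds[0]);
    int64_t ns = -1;
    pthread_t t;
    if (!thread_first || pthread_create(&t, NULL, idle_worker, NULL) == 0) {
      struct timespec a, b;
      clock_gettime(CLOCK_MONOTONIC, &a);
      long rc = syscall(SYS_membarrier, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0);
      clock_gettime(CLOCK_MONOTONIC, &b);
      if (rc == 0)
        ns = (int64_t)(b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
    }
    ssize_t written = write(fds[1], &ns, sizeof ns);
    (void)written;
    _exit(0);
  }
  // Parent: collect the child's measurement.
  close(fds[1]);
  int64_t ns = -1;
  if (read(fds[0], &ns, sizeof ns) != (ssize_t)sizeof ns)
    ns = -1;
  close(fds[0]);
  waitpid(pid, NULL, 0);
  return ns;
}
```

Calling `time_register_ns(0)` measures the single-threaded case and `time_register_ns(1)` the case where a thread already exists. Absolute numbers will vary with kernel version and system load.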

~11ms difference on a 16-core arm64 system (AWS r7g.4xlarge):

$ hyperfine --style basic --time-unit millisecond "./mbdemo n"
Benchmark 1: ./mbdemo n
  Time (mean ± σ):      11.5 ms ±   3.0 ms    [User: 0.8 ms, System: 0.0 ms]
  Range (min … max):     5.5 ms …  23.5 ms    496 runs

$ hyperfine --style basic --time-unit millisecond "./mbdemo y"
Benchmark 1: ./mbdemo y
  Time (mean ± σ):       0.5 ms ±   0.0 ms    [User: 0.5 ms, System: 0.3 ms]
  Range (min … max):     0.5 ms …   0.7 ms    2992 runs

~8ms difference on 16-core x86_64 (AWS r6i.4xlarge):

ubuntu@ip-172-31-41-194:~$ hyperfine --style basic --time-unit millisecond "./mbdemo n"
Benchmark 1: ./mbdemo n
  Time (mean ± σ):       8.7 ms ±   2.0 ms    [User: 0.5 ms, System: 0.0 ms]
  Range (min … max):     5.5 ms …  16.5 ms    335 runs

ubuntu@ip-172-31-41-194:~$ hyperfine --style basic --time-unit millisecond "./mbdemo y"
Benchmark 1: ./mbdemo y
  Time (mean ± σ):       0.5 ms ±   0.0 ms    [User: 0.3 ms, System: 0.2 ms]
  Range (min … max):     0.4 ms …   1.5 ms    3157 runs

A workaround can be implemented with an LD_PRELOAD shared library calling membarrier(REGISTER_PRIVATE_EXPEDITED) before the first dotnet thread:

// LD_PRELOAD hack to run `membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED)` at process startup, before the first thread is created.
// 1) Build: gcc -shared -fPIC -o mbhack.so mbhack.c
// 2) Run: LD_PRELOAD=/path/to/mbhack.so some-dotnet-binary

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

__attribute__((constructor))
static void mbhack_init()
{
  syscall(SYS_membarrier, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0);
}
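
One way to check that a registration like the one above is in effect is to issue MEMBARRIER_CMD_PRIVATE_EXPEDITED, which membarrier(2) documents as failing with EPERM when the process has not registered. The helper below is a hypothetical sketch built on that behavior, and `membarrier_is_registered` is an illustrative name. Note that a successful probe also executes one expedited barrier.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: reports whether this process is registered for
// private expedited membarrier.
// Returns 1 if registered, 0 if not registered (the call failed with EPERM),
// and -1 for any other failure, for example an unsupported kernel.
static int membarrier_is_registered(void) {
  long rc = syscall(SYS_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0, 0);
  if (rc == 0)
    return 1;
  return errno == EPERM ? 0 : -1;
}
```

Called from the start of a test program run under LD_PRELOAD, a return value of 1 would suggest that the constructor ran and the registration succeeded.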

~16ms difference on a 16-core arm64 system (AWS r7g.4xlarge):

$ hyperfine --style basic --time-unit millisecond ./bin/Release/net9.0/hello-world-dotnet9 
Benchmark 1: ./bin/Release/net9.0/hello-world-dotnet9
  Time (mean ± σ):      46.3 ms ±   4.1 ms    [User: 22.9 ms, System: 7.9 ms]
  Range (min … max):    35.3 ms …  55.3 ms    62 runs

$ LD_PRELOAD=$HOME/mbhack.so hyperfine --style basic --time-unit millisecond ./bin/Release/net9.0/hello-world-dotnet9 
Benchmark 1: ./bin/Release/net9.0/hello-world-dotnet9
  Time (mean ± σ):      30.1 ms ±   0.3 ms    [User: 22.9 ms, System: 8.0 ms]
  Range (min … max):    29.5 ms …  30.9 ms    95 runs

~10ms difference on 16-core x86_64 (AWS r6i.4xlarge):

ubuntu@ip-172-31-41-194:~/hello-world-dotnet9$ hyperfine --style basic --time-unit millisecond ./bin/Release/net9.0/hello-world-dotnet9
Benchmark 1: ./bin/Release/net9.0/hello-world-dotnet9
  Time (mean ± σ):      36.6 ms ±   2.0 ms    [User: 19.3 ms, System: 6.4 ms]
  Range (min … max):    31.4 ms …  42.6 ms    82 runs

ubuntu@ip-172-31-41-194:~/hello-world-dotnet9$ LD_PRELOAD=$HOME/mbhack.so hyperfine --style basic --time-unit millisecond ./bin/Release/net9.0/hello-world-dotnet9
Benchmark 1: ./bin/Release/net9.0/hello-world-dotnet9
  Time (mean ± σ):      26.5 ms ±   0.9 ms    [User: 20.4 ms, System: 6.2 ms]
  Range (min … max):    25.5 ms …  29.2 ms    109 runs


Labels: area-PAL-coreclr, in-pr (there is an active PR which will close this issue when it is merged), tenet-performance (performance related issue)
