
runtime: epoll scalability problem with 192 core machine and 1k+ ready sockets #65064

@prattmic

Split from #31908 (comment) and full write-up at https://jazco.dev/2024/01/10/golang-and-epoll/.

tl;dr is that a program on a 192 core machine with >2500 sockets and with >1k becoming ready at once results in huge costs in netpoll -> epoll_wait (~65% of total CPU).

Most interesting is that sharding these connections across 8 processes seems to solve the problem, implying some kind of super-linear scaling.

Given that the profile shows the time spent in epoll_wait itself, this may be a scalability problem in the kernel, but we may still be able to mitigate it.

@ericvolp12, some questions if you don't mind answering:

  • Which version of Go are you using? And which kernel version?
  • Do you happen to have a reproducer for this problem that you could share? (Sounds like no?)
  • On a similar note, do you have a perf profile of this problem that shows where the time in the kernel is spent?
  • The 128 event buffer size is mentioned several times, but it is not obvious to me that increasing this size would actually solve the problem. Did you try increasing the size and see improved results?

cc @golang/runtime


Labels: NeedsFix, OS-Linux, Performance, Scalability, compiler/runtime
