bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated. #5980

kernel-patches-daemon-bpf-rc · 2025-09-17T19:22:47Z

Pull request for series with
subject: bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated.
version: 9
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1003509

If memcg is enabled, accept() acquires lock_sock() twice for each new TCP/MPTCP socket in inet_csk_accept() and __inet_accept(). Let's move memcg operations from inet_csk_accept() to __inet_accept(). Note that SCTP somehow allocates a new socket by sk_alloc() in sk->sk_prot->accept() and clones fields manually, instead of using sk_clone_lock(). mem_cgroup_sk_alloc() is called for SCTP before __inet_accept(), so I added the protocol check in __inet_accept(), but this can be removed once SCTP uses sk_clone_lock(). Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Shakeel Butt <[email protected]>

…ing. Some protocols (e.g., TCP, UDP) implement memory accounting for socket buffers and charge memory to per-protocol global counters pointed to by sk->sk_proto->memory_allocated. If a socket has sk->sk_memcg, this memory is also charged to memcg as "sock" in memory.stat. We do not need to pay costs for two orthogonal memory accounting mechanisms. A microbenchmark result is in the subsequent bpf patch. Let's decouple sockets under memcg from the global per-protocol memory accounting if mem_cgroup_sk_exclusive() returns true. Note that this does NOT disable memcg, but rather the per-protocol one. mem_cgroup_sk_exclusive() starts to return true in the following patches, and then, the per-protocol memory accounting will be skipped. In __inet_accept(), we need to reclaim counts that are already charged for child sockets because we do not allocate sk->sk_memcg until accept(). trace_sock_exceed_buf_limit() will always show 0 as accounted for the memcg-exclusive sockets, but this can be obtained in memory.stat. Signed-off-by: Kuniyuki Iwashima <[email protected]> Nacked-by: Johannes Weiner <[email protected]>

If net.core.memcg_exclusive is 1 when sk->sk_memcg is allocated, the socket is flagged with SK_MEMCG_EXCLUSIVE internally and skips the global per-protocol memory accounting. OTOH, for accept()ed child sockets, this flag is inherited from the listening socket in sk_clone_lock() and set in __inet_accept(). This is to preserve the decision by BPF which will be supported later. Given sk->sk_memcg can be accessed in the fast path, it would be preferable to place the flag field in the same cache line as sk->sk_memcg. However, struct sock does not have such a 1-byte hole. Let's store the flag in the lowest bit of sk->sk_memcg and check it in mem_cgroup_sk_exclusive(). Tested with a script that creates local socket pairs and send()s a bunch of data without recv()ing. Setup: # mkdir /sys/fs/cgroup/test # echo $$ >> /sys/fs/cgroup/test/cgroup.procs # sysctl -q net.ipv4.tcp_mem="1000 1000 1000" Without net.core.memcg_exclusive, charged to memcg & tcp_mem: # prlimit -n=524288:524288 bash -c "python3 pressure.py" & # cat /sys/fs/cgroup/test/memory.stat | grep sock sock 22642688 <-------------------------------------- charged to memcg # cat /proc/net/sockstat| grep TCP TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376 <-- charged to tcp_mem # ss -tn | head -n 5 State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53188 ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:49972 ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53868 ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53554 # nstat | grep Pressure || echo no pressure TcpExtTCPMemoryPressures 1 0.0 With net.core.memcg_exclusive=1, only charged to memcg: # sysctl -q net.core.memcg_exclusive=1 # prlimit -n=524288:524288 bash -c "python3 pressure.py" & # cat /sys/fs/cgroup/test/memory.stat | grep sock sock 2757468160 <------------------------------------ charged to memcg # cat /proc/net/sockstat | grep TCP TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0 <- NOT charged to tcp_mem # ss -tn | head -n 5 State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 111000 0 127.0.0.1:36019 127.0.0.1:49026 ESTAB 110000 0 127.0.0.1:36019 127.0.0.1:45630 ESTAB 110000 0 127.0.0.1:36019 127.0.0.1:44870 ESTAB 111000 0 127.0.0.1:36019 127.0.0.1:45274 # nstat | grep Pressure || echo no pressure no pressure Signed-off-by: Kuniyuki Iwashima <[email protected]>

We will support flagging SK_MEMCG_EXCLUSIVE via bpf_setsockopt() at the BPF_CGROUP_INET_SOCK_CREATE hook. BPF_CGROUP_INET_SOCK_CREATE is invoked by __cgroup_bpf_run_filter_sk() that passes a pointer to struct sock to the bpf prog as void *ctx. But there are no bpf_func_proto for bpf_setsockopt() that receives the ctx as a pointer to struct sock. Also, bpf_getsockopt() will be necessary for a cgroup with multiple bpf progs running. Let's add new bpf_setsockopt() and bpf_getsockopt() variants for BPF_CGROUP_INET_SOCK_CREATE. Note that inet_create() is not under lock_sock() and has the same semantics with bpf_lsm_unlocked_sockopt_hooks. Signed-off-by: Kuniyuki Iwashima <[email protected]>

@start

If a socket has sk->sk_memcg with SK_MEMCG_EXCLUSIVE, it is decoupled from the global protocol memory accounting. This is controlled by net.core.memcg_exclusive sysctl, but it lacks flexibility. Let's support flagging (and clearing) SK_MEMCG_EXCLUSIVE via bpf_setsockopt() at the BPF_CGROUP_INET_SOCK_CREATE hook. u32 flags = SK_BPF_MEMCG_EXCLUSIVE; bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS, &flags, sizeof(flags)); As with net.core.memcg_exclusive, this is inherited to child sockets, and BPF always takes precedence over sysctl at socket(2) and accept(2). SK_BPF_MEMCG_FLAGS is only supported at BPF_CGROUP_INET_SOCK_CREATE and not supported on other hooks for some reasons: 1. UDP charges memory under sk->sk_receive_queue.lock instead of lock_sock() 2. For TCP child sockets, memory accounting is adjusted only in __inet_accept() which sk->sk_memcg allocation is deferred to 3. Modifying the flag after skb is charged to sk requires such adjustment during bpf_setsockopt() and complicates the logic unnecessarily We can support other hooks later if a real use case justifies that. Most changes are inline and hard to trace, but a microbenchmark on __sk_mem_raise_allocated() during neper/tcp_stream showed that more samples completed faster with SK_MEMCG_EXCLUSIVE. This will be more visible under tcp_mem pressure. # bpftrace -e 'kprobe:__sk_mem_raise_allocated { @start[tid] = nsecs; } kretprobe:__sk_mem_raise_allocated /@start[tid]/ { @EnD[tid] = nsecs - @start[tid]; @times = hist(@EnD[tid]); delete(@start[tid]); }' # tcp_stream -6 -F 1000 -N -T 256 Without bpf prog: [128, 256) 3846 | | [256, 512) 1505326 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [512, 1K) 1371006 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [1K, 2K) 198207 |@@@@@@ | [2K, 4K) 31199 |@ | With bpf prog in the next patch: (must be attached before tcp_stream) # bpftool prog load sk_memcg.bpf.o /sys/fs/bpf/sk_memcg type cgroup/sock_create # bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_create pinned /sys/fs/bpf/sk_memcg [128, 256) 6413 | | [256, 512) 1868425 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [512, 1K) 1101697 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [1K, 2K) 117031 |@@@@ | [2K, 4K) 11773 | | Signed-off-by: Kuniyuki Iwashima <[email protected]>

The test does the following for IPv4/IPv6 x TCP/UDP sockets with/without SK_MEMCG_EXCLUSIVE, which can be turned on by net.core.memcg_exclusive or bpf_setsockopt(SK_BPF_MEMCG_EXCLUSIVE). 1. Create socket pairs 2. Send a bunch of data that requires more than 1024 pages 3. Read memory_allocated from sk->sk_prot->memory_allocated and sk->sk_prot->memory_per_cpu_fw_alloc 4. Check if unread data is charged to memory_allocated If SK_MEMCG_EXCLUSIVE is set, memory_allocated should not be changed, but we allow a small error (up to 10 pages) in case other processes on the host use some amounts of TCP/UDP memory. The amount of allocated pages are buffered to per-cpu variable {tcp,udp}_memory_per_cpu_fw_alloc up to +/- net.core.mem_pcpu_rsv before reported to {tcp,udp}_memory_allocated. At 3., memory_allocated is calculated from the 2 variables twice at fentry and fexit of socket create function to check if the per-cpu value is drained during calculation. In that case, 3. is retried. We use kern_sync_rcu() for UDP because UDP recv queue is destroyed after RCU grace period. The test takes ~2s on QEMU (64 CPUs) w/ KVM but takes 6s w/o KVM. # time ./test_progs -t sk_memcg #370/1 sk_memcg/TCP :OK #370/2 sk_memcg/UDP :OK #370/3 sk_memcg/TCPv6:OK #370/4 sk_memcg/UDPv6:OK #370 sk_memcg:OK Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED real 0m1.623s user 0m0.165s sys 0m0.366s Signed-off-by: Kuniyuki Iwashima <[email protected]>

kernel-patches-daemon-bpf-rc · 2025-09-17T19:22:48Z

Upstream branch: 2dfd8b8
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1003509
version: 9

kernel-patches-daemon-bpf-rc · 2025-09-18T20:16:24Z

At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=1003509 expired. Closing PR.

q2ven added 6 commits September 17, 2025 12:22

kernel-patches-daemon-bpf-rc bot added new V9 bpf-net labels Sep 17, 2025

kernel-patches-daemon-bpf-rc bot added changes-requested and removed new labels Sep 18, 2025

kernel-patches-daemon-bpf-rc bot closed this Sep 18, 2025

kernel-patches-daemon-bpf-rc bot deleted the series/1003509=>bpf-net branch September 21, 2025 03:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated. #5980

bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated. #5980

Uh oh!

kernel-patches-daemon-bpf-rc bot commented Sep 17, 2025

Uh oh!

kernel-patches-daemon-bpf-rc bot commented Sep 17, 2025

Uh oh!

kernel-patches-daemon-bpf-rc bot commented Sep 18, 2025

Uh oh!

Uh oh!

bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated. #5980

bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated. #5980

Uh oh!

Conversation

kernel-patches-daemon-bpf-rc bot commented Sep 17, 2025

Uh oh!

kernel-patches-daemon-bpf-rc bot commented Sep 17, 2025

Uh oh!

kernel-patches-daemon-bpf-rc bot commented Sep 18, 2025

Uh oh!

Uh oh!