Commit 3c8d3e2

Merge: tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3301
JIRA: https://issues.redhat.com/browse/RHEL-11592

commit b650d95
Author: [email protected] <[email protected]>
Date: Sun Jun 11 22:05:24 2023 -0500

tcp: enforce receive buffer memory limits by allowing the tcp window to shrink

Under certain circumstances, the tcp receive buffer memory limit set by autotuning (sk_rcvbuf) is increased due to incoming data packets as a result of the window not closing when it should be. This can result in the receive buffer growing all the way up to tcp_rmem[2], even for tcp sessions with a low BDP.

To reproduce: Connect a TCP session with the receiver doing nothing and the sender sending small packets (an infinite loop of socket send() with 4 bytes of payload with a sleep of 1 ms in between each send()). This will cause the tcp receive buffer to grow all the way up to tcp_rmem[2].

As a result, a host can have individual tcp sessions with receive buffers of size tcp_rmem[2], and the host itself can reach tcp_mem limits, causing the host to go into tcp memory pressure mode.

The fundamental issue is the relationship between the granularity of the window scaling factor and the number of bytes ACKed back to the sender. This problem has previously been identified in RFC 7323, appendix F [1].

The Linux kernel currently adheres to never shrinking the window.

In addition to the overallocation of memory mentioned above, the current behavior is functionally incorrect, because once tcp_rmem[2] is reached when no remediations remain (i.e. tcp collapse fails to free up any more memory and there are no packets to prune from the out-of-order queue), the receiver will drop in-window packets resulting in retransmissions and an eventual timeout of the tcp session. A receive buffer full condition should instead result in a zero window and an indefinite wait.

In practice, this problem is largely hidden for most flows. It is not applicable to mice flows. Elephant flows can send data fast enough to "overrun" the sk_rcvbuf limit (in a single ACK), triggering a zero window.

But this problem does show up for other types of flows. Examples are websockets and other types of flows that send small amounts of data spaced apart slightly in time. In these cases, we directly encounter the problem described in [1].

RFC 7323, section 2.4 [2], says there are instances when a retracted window can be offered, and that TCP implementations MUST ensure that they handle a shrinking window, as specified in RFC 1122, section 4.2.2.16 [3]. All prior RFCs on the topic of tcp window management have made clear that the sender must accept a shrunk window from the receiver, including RFC 793 [4] and RFC 1323 [5].

This patch implements the functionality to shrink the tcp window when necessary to keep the right edge within the memory limit set by autotuning (sk_rcvbuf). This new functionality is enabled with the new sysctl: net.ipv4.tcp_shrink_window

Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323

Signed-off-by: Mike Freemon <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Felix Maurer <[email protected]>

Approved-by: Florian Westphal <[email protected]>
Approved-by: Marcelo Ricardo Leitner <[email protected]>

Signed-off-by: Jan Stancek <[email protected]>
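
The interaction between window-scale granularity and small ACKs can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: `struct rcv_model`, `align_up()` and `ack_never_shrink()` are hypothetical names, and the model ignores clamping and memory accounting. It mimics the never-shrink rule, where a smaller computed window is replaced by the current window rounded up to the scaling unit.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a multiple of the power-of-two a (same idea as the kernel's ALIGN). */
static uint32_t align_up(uint32_t x, uint32_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Hypothetical receiver model, not a kernel structure. */
struct rcv_model {
	uint32_t rcv_nxt;    /* next sequence number expected */
	uint32_t right_edge; /* right edge of the last advertised window */
};

/*
 * Accept `len` bytes, then pick a window under the never-shrink rule.
 * `wanted` is the window the memory-based calculation would like to offer.
 * Returns the window that is actually advertised.
 */
static uint32_t ack_never_shrink(struct rcv_model *m, uint32_t len,
				 uint32_t wanted, unsigned int wscale)
{
	uint32_t cur_win;
	uint32_t new_win = wanted;

	assert(wscale <= 14); /* largest shift allowed by RFC 7323 */

	m->rcv_nxt += len;
	cur_win = m->right_edge - m->rcv_nxt;

	/* Never shrink: fall back to the current window, rounded up. */
	if (new_win < cur_win)
		new_win = align_up(cur_win, 1u << wscale);

	m->right_edge = m->rcv_nxt + new_win;
	return new_win;
}
```

With a scale shift of 7 (128-byte units) and 4-byte segments, the advertised window in this model stays at 1280 bytes while the right edge advances by 4 bytes per ACK, even when the memory-based calculation asks for a zero window.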
2 parents c8de7e8 + 130ad87 commit 3c8d3e2

7 files changed: +100 -13 lines changed

Documentation/networking/ip-sysctl.rst

Lines changed: 15 additions & 0 deletions
@@ -939,6 +939,21 @@ tcp_tw_reuse - INTEGER
 tcp_window_scaling - BOOLEAN
 	Enable window scaling as defined in RFC1323.

+tcp_shrink_window - BOOLEAN
+	This changes how the TCP receive window is calculated.
+
+	RFC 7323, section 2.4, says there are instances when a retracted
+	window can be offered, and that TCP implementations MUST ensure
+	that they handle a shrinking window, as specified in RFC 1122.
+
+	- 0 - Disabled. The window is never shrunk.
+	- 1 - Enabled. The window is shrunk when necessary to remain within
+	  the memory limit set by autotuning (sk_rcvbuf).
+	  This only occurs if a non-zero receive window
+	  scaling factor is also in effect.
+
+	Default: 0
+
 tcp_wmem - vector of 3 INTEGERs: min, default, max
 	min: Amount of memory reserved for send buffers for TCP sockets.
 	Each TCP socket has rights to use it due to fact of its birth.
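
Since the sysctl defaults to 0, a tool that relies on the shrinking behaviour may want to check the setting at runtime. The following is a minimal sketch, assuming the usual procfs mapping of `net.ipv4.tcp_shrink_window` to `/proc/sys/net/ipv4/tcp_shrink_window`; the helper name `read_sysctl_flag()` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* procfs path corresponding to net.ipv4.tcp_shrink_window. */
#define TCP_SHRINK_WINDOW_PATH "/proc/sys/net/ipv4/tcp_shrink_window"

/*
 * Read a 0/1 sysctl from `path`.
 * Returns 0 or 1, or -1 if the file is missing or holds anything else
 * (for example on a kernel that does not carry this patch).
 */
static int read_sysctl_flag(const char *path)
{
	FILE *f;
	int value = -1;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (fscanf(f, "%d", &value) != 1 || (value != 0 && value != 1))
		value = -1;

	fclose(f);
	return value;
}
```

Calling `read_sysctl_flag(TCP_SHRINK_WINDOW_PATH)` then distinguishes "enabled", "disabled" and "not available on this kernel".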

include/net/netns/ipv4.h

Lines changed: 1 addition & 0 deletions
@@ -64,6 +64,7 @@ struct netns_ipv4 {
 #endif
 	bool fib_has_custom_local_routes;
 	bool fib_offload_disabled;
+	u8 sysctl_tcp_shrink_window;
 #ifdef CONFIG_IP_ROUTE_CLASSID
 	atomic_t fib_num_tclassid_users;
 #endif

include/net/tcp.h

Lines changed: 11 additions & 0 deletions
@@ -1459,6 +1459,17 @@ static inline int tcp_full_space(const struct sock *sk)
 	return tcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf));
 }

+static inline void tcp_adjust_rcv_ssthresh(struct sock *sk)
+{
+	int unused_mem = sk_unused_reserved_mem(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
+	if (unused_mem)
+		tp->rcv_ssthresh = max_t(u32, tp->rcv_ssthresh,
+					 tcp_win_from_space(sk, unused_mem));
+}
+
 void tcp_cleanup_rbuf(struct sock *sk, int copied);

 /* We provision sk_rcvbuf around 200% of sk_rcvlowat.
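
The arithmetic in tcp_adjust_rcv_ssthresh() can be tried out in user space. The sketch below is a simplified model, not the kernel helper: `adjust_rcv_ssthresh()` and `win_from_space()` are hypothetical names, and `win_from_space()` simply assumes that half of the memory counts as window, whereas the kernel's tcp_win_from_space() depends on per-socket and sysctl state.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }
static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Assumption: stand-in for tcp_win_from_space(); half the memory is window. */
static uint32_t win_from_space(uint32_t space)
{
	return space / 2;
}

/*
 * Model of the clamp applied under memory pressure:
 * cap rcv_ssthresh at 4 * advmss, but never go below the window that
 * the socket's unused reserved memory can still back.
 */
static uint32_t adjust_rcv_ssthresh(uint32_t rcv_ssthresh, uint32_t advmss,
				    uint32_t unused_reserved_mem)
{
	uint32_t ssthresh;

	assert(advmss > 0);

	ssthresh = min_u32(rcv_ssthresh, 4u * advmss);
	if (unused_reserved_mem)
		ssthresh = max_u32(ssthresh, win_from_space(unused_reserved_mem));
	return ssthresh;
}
```

In this model, a socket with no reserved memory is clamped to four segments, while a socket with enough unused reserved memory keeps a larger threshold.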

net/ipv4/sysctl_net_ipv4.c

Lines changed: 9 additions & 0 deletions
@@ -1354,6 +1354,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra1 = SYSCTL_ZERO,
 		.extra2 = &two,
 	},
+	{
+		.procname = "tcp_shrink_window",
+		.data = &init_net.ipv4.sysctl_tcp_shrink_window,
+		.maxlen = sizeof(u8),
+		.mode = 0644,
+		.proc_handler = proc_dou8vec_minmax,
+		.extra1 = SYSCTL_ZERO,
+		.extra2 = SYSCTL_ONE,
+	},
 	{ }
 };


net/ipv4/tcp_input.c

Lines changed: 10 additions & 2 deletions
@@ -491,8 +491,11 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)

 	room = min_t(int, tp->window_clamp, tcp_space(sk)) - tp->rcv_ssthresh;

+	if (room <= 0)
+		return;
+
 	/* Check #1 */
-	if (room > 0 && !tcp_under_memory_pressure(sk)) {
+	if (!tcp_under_memory_pressure(sk)) {
 		int incr;

 		/* Check #2. Increase window, if skb with such overhead
@@ -508,6 +511,11 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
 			tp->rcv_ssthresh += min(room, incr);
 			inet_csk(sk)->icsk_ack.quick |= 1;
 		}
+	} else {
+		/* Under pressure:
+		 * Adjust rcv_ssthresh according to reserved mem
+		 */
+		tcp_adjust_rcv_ssthresh(sk);
 	}
 }

@@ -5369,7 +5377,7 @@ static int tcp_prune_queue(struct sock *sk)
 	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
 		tcp_clamp_window(sk);
 	else if (tcp_under_memory_pressure(sk))
-		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
+		tcp_adjust_rcv_ssthresh(sk);

 	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
 		return 0;
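
The control-flow change in tcp_grow_window() is that, when there is room and the stack is under memory pressure, rcv_ssthresh is now clamped instead of being left untouched. A rough user-space model follows. It rests on several assumptions: `struct rcv_state` and `grow_window()` are hypothetical, the per-skb increment (the part of the function not shown in the hunk) is passed in by the caller as `incr`, and unused reserved memory is taken to be zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the receive-side state used by this model. */
struct rcv_state {
	int32_t rcv_ssthresh;
	int32_t advmss;
	int32_t window_clamp;
};

static int32_t min_i32(int32_t a, int32_t b) { return a < b ? a : b; }

/*
 * Model of the patched flow:
 *  - no room: leave rcv_ssthresh alone
 *  - room, no pressure: grow by at most `room`
 *  - room, under pressure: clamp to 4 * advmss
 *    (assumes no unused reserved memory)
 */
static void grow_window(struct rcv_state *s, int32_t space, int32_t incr,
			bool under_pressure)
{
	int32_t room;

	assert(incr >= 0);

	room = min_i32(s->window_clamp, space) - s->rcv_ssthresh;
	if (room <= 0)
		return;

	if (!under_pressure) {
		if (incr > 0)
			s->rcv_ssthresh += min_i32(room, incr);
	} else {
		s->rcv_ssthresh = min_i32(s->rcv_ssthresh, 4 * s->advmss);
	}
}
```

Before the patch, the pressure branch did not exist in this function, so the threshold was left as it was whenever memory pressure was detected here.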

net/ipv4/tcp_ipv4.c

Lines changed: 2 additions & 0 deletions
@@ -3195,6 +3195,8 @@ static int __net_init tcp_sk_init(struct net *net)
 	else
 		net->ipv4.tcp_congestion_control = &tcp_reno;

+	net->ipv4.sysctl_tcp_shrink_window = 0;
+
 	return 0;
 fail:
 	tcp_sk_exit(net);

net/ipv4/tcp_output.c

Lines changed: 52 additions & 11 deletions
@@ -259,8 +259,8 @@ static u16 tcp_select_window(struct sock *sk)
 	u32 old_win = tp->rcv_wnd;
 	u32 cur_win = tcp_receive_window(tp);
 	u32 new_win = __tcp_select_window(sk);
+	struct net *net = sock_net(sk);

-	/* Never shrink the offered window */
 	if (new_win < cur_win) {
 		/* Danger Will Robinson!
 		 * Don't update rcv_wup/rcv_wnd here or else
@@ -269,19 +269,22 @@ static u16 tcp_select_window(struct sock *sk)
 		 *
 		 * Relax Will Robinson.
 		 */
-		if (new_win == 0)
-			NET_INC_STATS(sock_net(sk),
-				      LINUX_MIB_TCPWANTZEROWINDOWADV);
-		new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+		if (!READ_ONCE(net->ipv4.sysctl_tcp_shrink_window) || !tp->rx_opt.rcv_wscale) {
+			/* Never shrink the offered window */
+			if (new_win == 0)
+				NET_INC_STATS(net, LINUX_MIB_TCPWANTZEROWINDOWADV);
+			new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+		}
 	}
+
 	tp->rcv_wnd = new_win;
 	tp->rcv_wup = tp->rcv_nxt;

 	/* Make sure we do not exceed the maximum possible
 	 * scaled window.
 	 */
 	if (!tp->rx_opt.rcv_wscale &&
-	    READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
+	    READ_ONCE(net->ipv4.sysctl_tcp_workaround_signed_windows))
 		new_win = min(new_win, MAX_TCP_WINDOW);
 	else
 		new_win = min(new_win, (65535U << tp->rx_opt.rcv_wscale));
@@ -293,10 +296,9 @@ static u16 tcp_select_window(struct sock *sk)
 	if (new_win == 0) {
 		tp->pred_flags = 0;
 		if (old_win)
-			NET_INC_STATS(sock_net(sk),
-				      LINUX_MIB_TCPTOZEROWINDOWADV);
+			NET_INC_STATS(net, LINUX_MIB_TCPTOZEROWINDOWADV);
 	} else if (old_win == 0) {
-		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFROMZEROWINDOWADV);
+		NET_INC_STATS(net, LINUX_MIB_TCPFROMZEROWINDOWADV);
 	}

 	return new_win;
@@ -2953,6 +2955,7 @@ u32 __tcp_select_window(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct net *net = sock_net(sk);
 	/* MSS for the peer's data. Previous versions used mss_clamp
 	 * here. I don't know if the value based on our guesses
 	 * of peer's MSS is better for the performance. It's more correct
@@ -2974,12 +2977,20 @@ u32 __tcp_select_window(struct sock *sk)
 		if (mss <= 0)
 			return 0;
 	}
+
+	/* Only allow window shrink if the sysctl is enabled and we have
+	 * a non-zero scaling factor in effect.
+	 */
+	if (READ_ONCE(net->ipv4.sysctl_tcp_shrink_window) && tp->rx_opt.rcv_wscale)
+		goto shrink_window_allowed;
+
+	/* do not allow window to shrink */
+
 	if (free_space < (full_space >> 1)) {
 		icsk->icsk_ack.quick = 0;

 		if (tcp_under_memory_pressure(sk))
-			tp->rcv_ssthresh = min(tp->rcv_ssthresh,
-					       4U * tp->advmss);
+			tcp_adjust_rcv_ssthresh(sk);

 		/* free_space might become our new window, make sure we don't
 		 * increase it due to wscale.
@@ -3029,6 +3040,36 @@ u32 __tcp_select_window(struct sock *sk)
 	}

 	return window;
+
+shrink_window_allowed:
+	/* new window should always be an exact multiple of scaling factor */
+	free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
+
+	if (free_space < (full_space >> 1)) {
+		icsk->icsk_ack.quick = 0;
+
+		if (tcp_under_memory_pressure(sk))
+			tcp_adjust_rcv_ssthresh(sk);
+
+		/* if free space is too low, return a zero window */
+		if (free_space < (allowed_space >> 4) || free_space < mss ||
+			free_space < (1 << tp->rx_opt.rcv_wscale))
+			return 0;
+	}
+
+	if (free_space > tp->rcv_ssthresh) {
+		free_space = tp->rcv_ssthresh;
+		/* new window should always be an exact multiple of scaling factor
+		 *
+		 * For this case, we ALIGN "up" (increase free_space) because
+		 * we know free_space is not zero here, it has been reduced from
+		 * the memory-based limit, and rcv_ssthresh is not a hard limit
+		 * (unlike sk_rcvbuf).
+		 */
+		free_space = ALIGN(free_space, (1 << tp->rx_opt.rcv_wscale));
+	}
+
+	return free_space;
 }

 void tcp_skb_collapse_tstamp(struct sk_buff *skb,
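
The window calculation on the shrink_window_allowed path can be approximated in user space to see which inputs yield a zero window. This is a sketch under stated assumptions, not the kernel function: `shrink_path_window()` is a hypothetical name, all inputs are taken as non-negative byte counts, and the memory-pressure adjustment of rcv_ssthresh is assumed to have been applied by the caller already.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a multiple of the power-of-two a (same idea as ALIGN). */
static uint32_t align_up(uint32_t x, uint32_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Model of the shrink-allowed path:
 *  1. round free space down to the scaling unit, so the advertised
 *     window never exceeds what memory allows;
 *  2. if less than half the buffer is free and the remainder is too
 *     small to be useful, advertise zero;
 *  3. otherwise cap at rcv_ssthresh, rounded up to the scaling unit.
 */
static uint32_t shrink_path_window(uint32_t free_space, uint32_t full_space,
				   uint32_t allowed_space, uint32_t mss,
				   uint32_t rcv_ssthresh, unsigned int wscale)
{
	uint32_t unit;

	/* The kernel only takes this path with a non-zero scale shift. */
	assert(wscale >= 1 && wscale <= 14);
	unit = 1u << wscale;

	free_space &= ~(unit - 1); /* round down to a multiple of unit */

	if (free_space < (full_space >> 1)) {
		if (free_space < (allowed_space >> 4) || free_space < mss ||
		    free_space < unit)
			return 0;
	}

	if (free_space > rcv_ssthresh)
		free_space = align_up(rcv_ssthresh, unit);

	return free_space;
}
```

Rounding down in step 1 and up in step 3 mirrors the asymmetry explained in the patch comment: the memory limit is hard, while rcv_ssthresh is not.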
