Commit 5fbc491

krikkku authored and NipaLocal committed
net: Prevent RPS table overwrite for active flows
This patch fixes an issue where two different flows on the same RXq
produce the same hash, resulting in continuous flow overwrites.

Flow #1: A packet for Flow #1 arrives and the kernel calls the steering
function. The driver returns a filter id, which the kernel saves in the
selected slot. Later, the driver's service task checks whether any
filters have expired and then installs the rule for Flow #1.

Flow #2: A packet for Flow #2 arrives and goes through the same steps,
but this time the chosen slot is already in use by Flow #1. The driver
returns a new filter id and the kernel saves it in the same slot. When
the driver's service task runs, it walks all the flows and checks
whether Flow #1 should be expired; the kernel returns true because the
slot now holds a different filter id, and the driver then installs the
rule for Flow #2.

Flow #1: Another packet for Flow #1 arrives and the same thing repeats:
the slot is overwritten with a new filter id for Flow #1.

This causes a repeated cycle of flow programming for missed packets,
wasting CPU cycles without improving performance. The problem happens
at higher rates when the RPS table is small, but tests show it still
occurs even with 12,000 connections and an RPS size of 16K per queue
(global table size = 144x16K = 64K).

This patch prevents overwriting an rps_dev_flow entry if it is active.
The intention is that it is better to do aRFS for the first flow than
to hurt all flows that share the same hash. Without this, two (or more)
flows on one RX queue with the same hash can keep overwriting each
other, causing the driver to reprogram the flow repeatedly.

Changes:
1. Add a new 'hash' field to struct rps_dev_flow.
2. Add rps_flow_is_active(): a helper to check whether a flow is
   active, extracted from rps_may_expire_flow().
3. In set_rps_cpu():
   - Avoid overwriting by programming a new filter only if:
     - the slot is not in use, or
     - the slot is in use but the flow is not active, or
     - the slot has an active flow with the same hash, but the target
       CPU differs.
   - Save the hash in the rps_dev_flow entry.
4. In rps_may_expire_flow(): use the extracted rps_flow_is_active().

Testing & results:
- Driver: ice (E810 NIC), Kernel: net-next
- #CPUs = #RXq = 144 (1:1)
- Number of flows: 12K
- Eight RPS settings from 256 to 32768. Though RPS=256 is not ideal, it
  is still sufficient to cover 12K flows (256*144 rx-queues = 64K global
  table slots).
- Global Table Size = 144 * RPS (effectively equal to 256 * RPS)
- Each RPS test duration = 8 mins (original code) + 8 mins (new code).
- Metrics captured on the client.

Legend for the following tables:
  Steer-C: #times ndo_rx_flow_steer() was Called by set_rps_cpu()
  Steer-L: #times ice_arfs_flow_steer() Looped over aRFS entries
  Add:     #times the driver actually programmed aRFS (ice_arfs_build_entry())
  Del:     #times the driver deleted the flow (ice_arfs_del_flow_rules())
  Units:   K = 1,000 times, M = 1 million times

Org Code:
| RPS   | Latency | CPU  | Add    | Del    | Steer-C | Steer-L |
|-------|---------|------|--------|--------|---------|---------|
| 256   | 227.0   | 93.2 | 1.6M   | 1.6M   | 121.7M  | 267.6M  |
| 512   | 225.9   | 94.1 | 11.5M  | 11.2M  | 65.7M   | 199.6M  |
| 1024  | 223.5   | 95.6 | 16.5M  | 16.5M  | 27.1M   | 187.3M  |
| 2048  | 222.2   | 96.3 | 10.5M  | 10.5M  | 12.5M   | 115.2M  |
| 4096  | 223.9   | 94.1 | 5.5M   | 5.5M   | 7.2M    | 65.9M   |
| 8192  | 224.7   | 92.5 | 2.7M   | 2.7M   | 3.0M    | 29.9M   |
| 16384 | 223.5   | 92.5 | 1.3M   | 1.3M   | 1.4M    | 13.9M   |
| 32768 | 219.6   | 93.2 | 838.1K | 838.1K | 965.1K  | 8.9M    |

New Code:
| RPS   | Latency | CPU  | Add    | Del    | Steer-C | Steer-L |
|-------|---------|------|--------|--------|---------|---------|
| 256   | 201.5   | 99.1 | 13.4K  | 5.0K   | 13.7K   | 75.2K   |
| 512   | 202.5   | 98.2 | 11.2K  | 5.9K   | 11.2K   | 55.5K   |
| 1024  | 207.3   | 93.9 | 11.5K  | 9.7K   | 11.5K   | 59.6K   |
| 2048  | 207.5   | 96.7 | 11.8K  | 11.1K  | 15.5K   | 79.3K   |
| 4096  | 206.9   | 96.6 | 11.8K  | 11.7K  | 11.8K   | 63.2K   |
| 8192  | 205.8   | 96.7 | 11.9K  | 11.8K  | 11.9K   | 63.9K   |
| 16384 | 200.9   | 98.2 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
| 32768 | 202.5   | 98.0 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |

Some observations:
1. Overall latency improved: (1790.19 - 1634.94) / 1790.19 * 100 = 8.67%
2. Overall CPU increased: (777.32 - 751.49) / 751.49 * 100 = 3.44%
3. Flow management (add/delete) remained almost constant at ~11K,
   compared to values in the millions with the original code.

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Krishna Kumar <[email protected]>
Signed-off-by: NipaLocal <nipa@local>
1 parent b539a66 commit 5fbc491
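
The overwrite cycle described in the commit message starts when rfs_slot()
maps the hashes of two different flows to the same per-queue table slot. The
following is a minimal user-space sketch (an illustration only, mirroring the
kernel's hash_32()/rfs_slot() arithmetic; it is not part of this patch) that
searches for such a collision in a 256-entry table:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define GOLDEN_RATIO_32 0x61C88647u	/* multiplier used by the kernel's hash_32() */

/* Same computation as hash_32(): multiply and keep the top 'bits' bits. */
static uint32_t hash_32(uint32_t val, unsigned int bits)
{
	return (val * GOLDEN_RATIO_32) >> (32 - bits);
}

/* Mirrors rfs_slot() in net/core/dev.c: slot = hash_32(hash, table->log). */
static uint32_t rfs_slot(uint32_t hash, unsigned int log)
{
	return hash_32(hash, log);
}

int main(void)
{
	const unsigned int log = 8;	/* per-queue RPS table of 256 slots */
	uint32_t seen_hash[256];
	int used[256] = { 0 };

	srand(1);
	for (int i = 0; i < 100000; i++) {
		/* Stand-in for skb_get_hash() of a random flow. */
		uint32_t hash = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
		uint32_t slot = rfs_slot(hash, log);

		if (used[slot] && seen_hash[slot] != hash) {
			printf("hashes 0x%08x and 0x%08x share slot %u\n",
			       seen_hash[slot], hash, slot);
			return 0;
		}
		used[slot] = 1;
		seen_hash[slot] = hash;
	}
	printf("no collision found\n");
	return 0;
}

With small tables a collision turns up almost immediately (birthday bound),
which is why the patch keys the overwrite decision on the stored hash rather
than on the slot index alone.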

3 files changed: +81 -10 lines changed

include/net/rps.h

Lines changed: 3 additions & 2 deletions
@@ -25,13 +25,14 @@ struct rps_map {
 
 /*
  * The rps_dev_flow structure contains the mapping of a flow to a CPU, the
- * tail pointer for that CPU's input queue at the time of last enqueue, and
- * a hardware filter index.
+ * tail pointer for that CPU's input queue at the time of last enqueue, a
+ * hardware filter index, and the hash of the flow.
  */
 struct rps_dev_flow {
 	u16 cpu;
 	u16 filter;
 	unsigned int last_qtail;
+	u32 hash;
 };
 #define RPS_NO_FILTER 0xffff
 

net/core/dev.c

Lines changed: 75 additions & 7 deletions
@@ -4855,6 +4855,28 @@ static u32 rfs_slot(u32 hash, const struct rps_dev_flow_table *flow_table)
 	return hash_32(hash, flow_table->log);
 }
 
+#ifdef CONFIG_RFS_ACCEL
+/**
+ * rps_flow_is_active - check whether the flow is recently active.
+ * @rflow: Specific flow to check activity.
+ * @flow_table: Check activity against the flow_table's size.
+ * @cpu: CPU saved in @rflow.
+ *
+ * If the CPU has processed many packets since the flow's last activity
+ * (beyond 10 times the table size), the flow is considered stale.
+ *
+ * Return: true if flow was recently active.
+ */
+static bool rps_flow_is_active(struct rps_dev_flow *rflow,
+			       struct rps_dev_flow_table *flow_table,
+			       unsigned int cpu)
+{
+	return cpu < nr_cpu_ids &&
+	       ((int)(READ_ONCE(per_cpu(softnet_data, cpu).input_queue_head) -
+		      READ_ONCE(rflow->last_qtail)) < (int)(10 << flow_table->log));
+}
+#endif
+
 static struct rps_dev_flow *
 set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	    struct rps_dev_flow *rflow, u16 next_cpu)
@@ -4865,8 +4887,11 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		struct netdev_rx_queue *rxqueue;
 		struct rps_dev_flow_table *flow_table;
 		struct rps_dev_flow *old_rflow;
+		struct rps_dev_flow *tmp_rflow;
+		unsigned int tmp_cpu;
 		u16 rxq_index;
 		u32 flow_id;
+		u32 hash;
 		int rc;
 
 		/* Should we steer this flow to a different hardware queue? */
@@ -4881,14 +4906,58 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		flow_table = rcu_dereference(rxqueue->rps_flow_table);
 		if (!flow_table)
 			goto out;
-		flow_id = rfs_slot(skb_get_hash(skb), flow_table);
+
+		hash = skb_get_hash(skb);
+		flow_id = rfs_slot(hash, flow_table);
+
+		tmp_rflow = &flow_table->flows[flow_id];
+		tmp_cpu = READ_ONCE(tmp_rflow->cpu);
+
+		/* Make sure this slot is usable before enabling steer */
+		if (READ_ONCE(tmp_rflow->filter) != RPS_NO_FILTER) {
+			/* This slot has an entry */
+			if (rps_flow_is_active(tmp_rflow, flow_table,
+					       tmp_cpu)) {
+				/*
+				 * This slot has an active "programmed" flow.
+				 * Break out if the cached value stored is for
+				 * a different flow, or (for our flow) the
+				 * rx-queue# did not change.
+				 */
+				if (hash != READ_ONCE(tmp_rflow->hash) ||
+				    next_cpu == tmp_cpu) {
+					/*
+					 * Don't unnecessarily reprogram if:
+					 * 1. This slot has an active different
+					 *    flow.
+					 * 2. This slot has the same flow (very
+					 *    likely but not guaranteed) and
+					 *    the rx-queue# did not change.
+					 */
+					goto out;
+				}
+			}
+
+			/*
+			 * When we overwrite the flow, the driver still has
+			 * the cached entry. But drivers will check if the
+			 * flow is active and rps_may_expire_flow() will
+			 * return false and the driver will delete it soon.
+			 * Hence the inconsistency between kernel & driver is
+			 * quickly resolved.
+			 */
+		}
+
 		rc = dev->netdev_ops->ndo_rx_flow_steer(dev, skb,
 							rxq_index, flow_id);
 		if (rc < 0)
 			goto out;
+
 		old_rflow = rflow;
-		rflow = &flow_table->flows[flow_id];
+		rflow = tmp_rflow;
 		WRITE_ONCE(rflow->filter, rc);
+		WRITE_ONCE(rflow->hash, hash);
+
 		if (old_rflow->filter == rc)
 			WRITE_ONCE(old_rflow->filter, RPS_NO_FILTER);
 	out:
@@ -5023,17 +5092,16 @@ bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index,
 	struct rps_dev_flow_table *flow_table;
 	struct rps_dev_flow *rflow;
 	bool expire = true;
-	unsigned int cpu;
 
 	rcu_read_lock();
 	flow_table = rcu_dereference(rxqueue->rps_flow_table);
 	if (flow_table && flow_id < (1UL << flow_table->log)) {
+		unsigned int cpu;
+
 		rflow = &flow_table->flows[flow_id];
 		cpu = READ_ONCE(rflow->cpu);
-		if (READ_ONCE(rflow->filter) == filter_id && cpu < nr_cpu_ids &&
-		    ((int)(READ_ONCE(per_cpu(softnet_data, cpu).input_queue_head) -
-			   READ_ONCE(rflow->last_qtail)) <
-		     (int)(10 << flow_table->log)))
+		if (READ_ONCE(rflow->filter) == filter_id &&
+		    rps_flow_is_active(rflow, flow_table, cpu))
 			expire = false;
 	}
 	rcu_read_unlock();
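
For reference, the activity threshold that rps_flow_is_active() applies above
is 10 times the per-queue table size, computed as a wraparound-safe signed
difference between the CPU's input_queue_head counter and the flow's
last_qtail. A minimal user-space sketch of that arithmetic (an illustration
assuming 32-bit counters; not kernel code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Same comparison as rps_flow_is_active(), minus the per-CPU lookup. */
static bool flow_is_active(uint32_t input_queue_head, uint32_t last_qtail,
			   unsigned int log)
{
	return (int32_t)(input_queue_head - last_qtail) < (int32_t)(10 << log);
}

int main(void)
{
	const unsigned int log = 8;	/* table of 256 entries -> threshold 2560 */

	/* 2559 packets processed since the flow's last enqueue: still active. */
	printf("%d\n", flow_is_active(10000, 10000 - 2559, log));	/* prints 1 */

	/* 2560 packets processed since the last enqueue: considered stale. */
	printf("%d\n", flow_is_active(10000, 10000 - 2560, log));	/* prints 0 */

	/* The counter wrapped past 2^32, but the flow is still recent. */
	printf("%d\n", flow_is_active(5, UINT32_MAX - 10, log));	/* prints 1 */
	return 0;
}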

net/core/net-sysfs.c

Lines changed: 3 additions & 1 deletion
@@ -1120,8 +1120,10 @@ static ssize_t store_rps_dev_flow_table_cnt(struct netdev_rx_queue *queue,
 			return -ENOMEM;
 
 		table->log = ilog2(mask) + 1;
-		for (count = 0; count <= mask; count++)
+		for (count = 0; count <= mask; count++) {
 			table->flows[count].cpu = RPS_NO_CPU;
+			table->flows[count].filter = RPS_NO_FILTER;
+		}
 	} else {
 		table = NULL;
 	}
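
One more note on the sysfs change above: set_rps_cpu() now treats any filter
value other than RPS_NO_FILTER as a slot that already holds a programmed
filter, so freshly created table slots must carry that sentinel alongside
cpu = RPS_NO_CPU. A small stand-alone sketch (field names mirrored from the
patched struct rps_dev_flow; an illustration, not kernel code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RPS_NO_FILTER 0xffff

/* Field layout mirrored from the patched struct rps_dev_flow. */
struct rps_dev_flow_sketch {
	uint16_t cpu;
	uint16_t filter;
	unsigned int last_qtail;
	uint32_t hash;
};

/* Mirrors the new "slot has an entry" test in set_rps_cpu(). */
static bool slot_has_entry(const struct rps_dev_flow_sketch *slot)
{
	return slot->filter != RPS_NO_FILTER;
}

int main(void)
{
	struct rps_dev_flow_sketch zeroed = { .filter = 0 };
	struct rps_dev_flow_sketch initialized = { .filter = RPS_NO_FILTER };

	/* 0 is a valid filter id, so a merely zeroed slot looks occupied... */
	printf("zeroed slot has entry: %d\n", slot_has_entry(&zeroed));
	/* ...while a slot initialized by the sysfs loop reads as empty. */
	printf("initialized slot has entry: %d\n", slot_has_entry(&initialized));
	return 0;
}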
