
Commit 4b837ad

Merge branch 'netfilter-flowtable'
Pablo Neira Ayuso says:

====================
netfilter: flowtable enhancements

[ This is v2 that includes documentation enhancements, including
  existing limitations. This is a rebase on top of net-next. ]

The following patchset augments the Netfilter flowtable fastpath to
support network topologies that combine IP forwarding, bridge, classic
VLAN devices, bridge VLAN filtering, DSA and PPPoE. This includes
support for the flowtable software and hardware datapaths.

The following picture provides an example scenario:

                        fast path!
                .------------------------.
               /                          \
              |         IP forwarding     |
              |        /             \    \/
              |     br0              wan ..... eth0
              .     / \                        host C
              -> veth1  veth2
              .             switch/router
              .
              .
            eth0
          host A

The bridge master device 'br0' has an IP address and a DHCP server is
also assumed to be running to provide connectivity to host A, which
reaches the Internet through 'br0' as default gateway. Packets then
enter the IP forwarding path and Netfilter is used to NAT them before
they leave through the wan device.

The general idea is to accelerate forwarding by building a fast path
that takes packets from the ingress path of the bridge port and places
them in the egress path of the wan device (and vice versa), hence
skipping the classic bridge and IP stack paths.

Patches #1 to #6 add the infrastructure which describes the list of
netdevice hops to reach a given destination MAC address in the local
network topology:

Patch #1 adds dev_fill_forward_path() and .ndo_fill_forward_path() to
netdev_ops.

Patch #2 adds .ndo_fill_forward_path for VLAN devices, which provides
the next device hop via vlan->real_dev, the VLAN ID and the protocol.

Patch #3 adds .ndo_fill_forward_path for bridge devices, which allows
FDB lookups to locate the next device hop (bridge port) in the
forwarding path.

Patch #4 extends bridge .ndo_fill_forward_path to support bridge VLAN
filtering.

Patch #5 adds .ndo_fill_forward_path for PPPoE devices.

Patch #6 adds .ndo_fill_forward_path for DSA.

Patches #7 to #14 update the flowtable software datapath:

Patch #7 adds the transmit path type field to the flow tuple. Two
transmit paths are supported so far: the neighbour and the xfrm
transmit paths.

Patches #8 and #9 update the flowtable datapath to use
dev_fill_forward_path() to obtain the real ingress/egress device for
the flowtable datapath. This adds the new ethernet xmit direct path to
the flowtable.

Patch #10 adds native flowtable VLAN support (up to 2 VLAN tags)
through dev_fill_forward_path(). The flowtable stores the VLAN id and
protocol in the flow tuple.

Patch #11 adds native flowtable bridge VLAN filter support through
dev_fill_forward_path().

Patch #12 adds native flowtable bridge PPPoE support through
dev_fill_forward_path().

Patch #13 adds DSA support through dev_fill_forward_path().

Patch #14 extends flowtable selftests to cover the flowtable software
datapath enhancements.

Patches #15 to #20 update the flowtable hardware offload datapath:

Patch #15 extends the flowtable hardware offload to support the direct
ethernet xmit path. This also includes VLAN support.

Patch #16 stores the egress real device in the flow tuple. The
software flowtable datapath uses dev_hard_header() to transmit
packets, hence it might refer to a VLAN/DSA/PPPoE software device, not
the real ethernet device.

Patch #17 deals with switchdev PVID hardware offload to skip it on
egress.

Patch #18 adds FLOW_ACTION_PPPOE_PUSH to the flow_offload action API.

Patch #19 extends the flowtable hardware offload to support PPPoE.

Patch #20 adds TC_SETUP_FT support for DSA.

Patches #21 to #23: Felix Fietkau adds a new driver which supports
hardware offload for the mtk PPE engine through the existing flow
offload API, covering the flowtable enhancements coming in this batch.

Patch #24 extends the documentation and describes existing
limitations.

Please, apply, thanks.
====================

Signed-off-by: David S. Miller <[email protected]>
2 parents: ad248f7 + 143490c

26 files changed: +2892 / -146 lines

Documentation/networking/nf_flowtable.rst
Lines changed: 143 additions & 27 deletions

@@ -4,35 +4,38 @@
 Netfilter's flowtable infrastructure
 ====================================
 
-This documentation describes the software flowtable infrastructure available in
-Netfilter since Linux kernel 4.16.
+This documentation describes the Netfilter flowtable infrastructure which allows
+you to define a fastpath through the flowtable datapath. This infrastructure
+also provides hardware offload support. The flowtable supports for the layer 3
+IPv4 and IPv6 and the layer 4 TCP and UDP protocols.
 
 Overview
 --------
 
-Initial packets follow the classic forwarding path, once the flow enters the
-established state according to the conntrack semantics (ie. we have seen traffic
-in both directions), then you can decide to offload the flow to the flowtable
-from the forward chain via the 'flow offload' action available in nftables.
+Once the first packet of the flow successfully goes through the IP forwarding
+path, from the second packet on, you might decide to offload the flow to the
+flowtable through your ruleset. The flowtable infrastructure provides a rule
+action that allows you to specify when to add a flow to the flowtable.
 
-Packets that find an entry in the flowtable (ie. flowtable hit) are sent to the
-output netdevice via neigh_xmit(), hence, they bypass the classic forwarding
-path (the visible effect is that you do not see these packets from any of the
-netfilter hooks coming after the ingress). In case of flowtable miss, the packet
-follows the classic forward path.
+A packet that finds a matching entry in the flowtable (ie. flowtable hit) is
+transmitted to the output netdevice via neigh_xmit(), hence, packets bypass the
+classic IP forwarding path (the visible effect is that you do not see these
+packets from any of the Netfilter hooks coming after ingress). In case that
+there is no matching entry in the flowtable (ie. flowtable miss), the packet
+follows the classic IP forwarding path.
 
-The flowtable uses a resizable hashtable, lookups are based on the following
-7-tuple selectors: source, destination, layer 3 and layer 4 protocols, source
-and destination ports and the input interface (useful in case there are several
-conntrack zones in place).
+The flowtable uses a resizable hashtable. Lookups are based on the following
+n-tuple selectors: layer 2 protocol encapsulation (VLAN and PPPoE), layer 3
+source and destination, layer 4 source and destination ports and the input
+interface (useful in case there are several conntrack zones in place).
 
-Flowtables are populated via the 'flow offload' nftables action, so the user can
-selectively specify what flows are placed into the flow table. Hence, packets
-follow the classic forwarding path unless the user explicitly instruct packets
-to use this new alternative forwarding path via nftables policy.
+The 'flow add' action allows you to populate the flowtable, the user selectively
+specifies what flows are placed into the flowtable. Hence, packets follow the
+classic IP forwarding path unless the user explicitly instruct flows to use this
+new alternative forwarding path via policy.
 
-This is represented in Fig.1, which describes the classic forwarding path
-including the Netfilter hooks and the flowtable fastpath bypass.
+The flowtable datapath is represented in Fig.1, which describes the classic IP
+forwarding path including the Netfilter hooks and the flowtable fastpath bypass.
 
 ::
 
@@ -67,11 +70,13 @@ including the Netfilter hooks and the flowtable fastpath bypass.
      Fig.1 Netfilter hooks and flowtable interactions
 
 The flowtable entry also stores the NAT configuration, so all packets are
-mangled according to the NAT policy that matches the initial packets that went
-through the classic forwarding path. The TTL is decremented before calling
-neigh_xmit(). Fragmented traffic is passed up to follow the classic forwarding
-path given that the transport selectors are missing, therefore flowtable lookup
-is not possible.
+mangled according to the NAT policy that is specified from the classic IP
+forwarding path. The TTL is decremented before calling neigh_xmit(). Fragmented
+traffic is passed up to follow the classic IP forwarding path given that the
+transport header is missing, in this case, flowtable lookups are not possible.
+TCP RST and FIN packets are also passed up to the classic IP forwarding path to
+release the flow gracefully. Packets that exceed the MTU are also passed up to
+the classic forwarding path to report packet-too-big ICMP errors to the sender.
 
 Example configuration
 ---------------------
@@ -85,7 +90,7 @@ flowtable and add one rule to your forward chain::
         }
         chain y {
             type filter hook forward priority 0; policy accept;
-            ip protocol tcp flow offload @f
+            ip protocol tcp flow add @f
             counter packets 0 bytes 0
         }
     }
@@ -103,6 +108,117 @@ flow is offloaded, you will observe that the counter rule in the example above
 does not get updated for the packets that are being forwarded through the
 forwarding bypass.
 
+You can identify offloaded flows through the [OFFLOAD] tag when listing your
+connection tracking table.
+
+::
+
+    # conntrack -L
+    tcp 6 src=10.141.10.2 dst=192.168.10.2 sport=52728 dport=5201 src=192.168.10.2 dst=192.168.10.1 sport=5201 dport=52728 [OFFLOAD] mark=0 use=2
+
+
+Layer 2 encapsulation
+---------------------
+
+Since Linux kernel 5.13, the flowtable infrastructure discovers the real
+netdevice behind VLAN and PPPoE netdevices. The flowtable software datapath
+parses the VLAN and PPPoE layer 2 headers to extract the ethertype and the
+VLAN ID / PPPoE session ID which are used for the flowtable lookups. The
+flowtable datapath also deals with layer 2 decapsulation.
+
+You do not need to add the PPPoE and the VLAN devices to your flowtable,
+instead the real device is sufficient for the flowtable to track your flows.
+
+Bridge and IP forwarding
+------------------------
+
+Since Linux kernel 5.13, you can add bridge ports to the flowtable. The
+flowtable infrastructure discovers the topology behind the bridge device. This
+allows the flowtable to define a fastpath bypass between the bridge ports
+(represented as eth1 and eth2 in the example figure below) and the gateway
+device (represented as eth0) in your switch/router.
+
+::
+
+                      fastpath bypass
+               .-------------------------.
+              /                           \
+             |       IP forwarding        |
+             |      /             \      \/
+             |   br0              eth0 ..... eth0
+             .   / \                      *host B*
+             -> eth1  eth2
+             .          *switch/router*
+             .
+             .
+            eth0
+          *host A*
+
+The flowtable infrastructure also supports for bridge VLAN filtering actions
+such as PVID and untagged. You can also stack a classic VLAN device on top of
+your bridge port.
+
+If you would like that your flowtable defines a fastpath between your bridge
+ports and your IP forwarding path, you have to add your bridge ports (as
+represented by the real netdevice) to your flowtable definition.
+
+Counters
+--------
+
+The flowtable can synchronize packet and byte counters with the existing
+connection tracking entry by specifying the counter statement in your flowtable
+definition, e.g.
+
+::
+
+    table inet x {
+        flowtable f {
+            hook ingress priority 0; devices = { eth0, eth1 };
+            counter
+        }
+        ...
+    }
+
+Counter support is available since Linux kernel 5.7.
+
+Hardware offload
+----------------
+
+If your network device provides hardware offload support, you can turn it on by
+means of the 'offload' flag in your flowtable definition, e.g.
+
+::
+
+    table inet x {
+        flowtable f {
+            hook ingress priority 0; devices = { eth0, eth1 };
+            flags offload;
+        }
+        ...
+    }
+
+There is a workqueue that adds the flows to the hardware. Note that a few
+packets might still run over the flowtable software path until the workqueue has
+a chance to offload the flow to the network device.
+
+You can identify hardware offloaded flows through the [HW_OFFLOAD] tag when
+listing your connection tracking table. Please, note that the [OFFLOAD] tag
+refers to the software offload mode, so there is a distinction between [OFFLOAD]
+which refers to the software flowtable fastpath and [HW_OFFLOAD] which refers
+to the hardware offload datapath being used by the flow.
+
+The flowtable hardware offload infrastructure also supports for the DSA
+(Distributed Switch Architecture).
+
+Limitations
+-----------
+
+The flowtable behaves like a cache. The flowtable entries might get stale if
+either the destination MAC address or the egress netdevice that is used for
+transmission changes.
+
+This might be a problem if:
+
+- You run the flowtable in software mode and you combine bridge and IP
+  forwarding in your setup.
+- Hardware offload is enabled.
+
 More reading
 ------------
 

drivers/net/ethernet/mediatek/Makefile
Lines changed: 1 addition & 1 deletion

@@ -4,5 +4,5 @@
 #
 
 obj-$(CONFIG_NET_MEDIATEK_SOC) += mtk_eth.o
-mtk_eth-y := mtk_eth_soc.o mtk_sgmii.o mtk_eth_path.o
+mtk_eth-y := mtk_eth_soc.o mtk_sgmii.o mtk_eth_path.o mtk_ppe.o mtk_ppe_debugfs.o mtk_ppe_offload.o
 obj-$(CONFIG_NET_MEDIATEK_STAR_EMAC) += mtk_star_emac.o

drivers/net/ethernet/mediatek/mtk_eth_soc.c
Lines changed: 33 additions & 8 deletions

@@ -19,6 +19,7 @@
 #include <linux/interrupt.h>
 #include <linux/pinctrl/devinfo.h>
 #include <linux/phylink.h>
+#include <net/dsa.h>
 
 #include "mtk_eth_soc.h"
 
@@ -1264,13 +1265,12 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget,
 			break;
 
 		/* find out which mac the packet come from. values start at 1 */
-		if (MTK_HAS_CAPS(eth->soc->caps, MTK_SOC_MT7628)) {
+		if (MTK_HAS_CAPS(eth->soc->caps, MTK_SOC_MT7628) ||
+		    (trxd.rxd4 & RX_DMA_SPECIAL_TAG))
 			mac = 0;
-		} else {
-			mac = (trxd.rxd4 >> RX_DMA_FPORT_SHIFT) &
-			      RX_DMA_FPORT_MASK;
-			mac--;
-		}
+		else
+			mac = ((trxd.rxd4 >> RX_DMA_FPORT_SHIFT) &
+			       RX_DMA_FPORT_MASK) - 1;
 
 		if (unlikely(mac < 0 || mac >= MTK_MAC_COUNT ||
 			     !eth->netdev[mac]))
@@ -2233,6 +2233,9 @@ static void mtk_gdm_config(struct mtk_eth *eth, u32 config)
 
 		val |= config;
 
+		if (!i && eth->netdev[0] && netdev_uses_dsa(eth->netdev[0]))
+			val |= MTK_GDMA_SPECIAL_TAG;
+
 		mtk_w32(eth, val, MTK_GDMA_FWD_CFG(i));
 	}
 	/* Reset and enable PSE */
@@ -2255,12 +2258,17 @@ static int mtk_open(struct net_device *dev)
 
 	/* we run 2 netdevs on the same dma ring so we only bring it up once */
 	if (!refcount_read(&eth->dma_refcnt)) {
-		int err = mtk_start_dma(eth);
+		u32 gdm_config = MTK_GDMA_TO_PDMA;
+		int err;
 
+		err = mtk_start_dma(eth);
 		if (err)
 			return err;
 
-		mtk_gdm_config(eth, MTK_GDMA_TO_PDMA);
+		if (eth->soc->offload_version && mtk_ppe_start(&eth->ppe) == 0)
+			gdm_config = MTK_GDMA_TO_PPE;
+
+		mtk_gdm_config(eth, gdm_config);
 
 		napi_enable(&eth->tx_napi);
 		napi_enable(&eth->rx_napi);
@@ -2327,6 +2335,9 @@ static int mtk_stop(struct net_device *dev)
 
 	mtk_dma_free(eth);
 
+	if (eth->soc->offload_version)
+		mtk_ppe_stop(&eth->ppe);
+
 	return 0;
 }
 
@@ -2832,6 +2843,7 @@ static const struct net_device_ops mtk_netdev_ops = {
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller	= mtk_poll_controller,
 #endif
+	.ndo_setup_tc		= mtk_eth_setup_tc,
 };
 
 static int mtk_add_mac(struct mtk_eth *eth, struct device_node *np)
@@ -3088,6 +3100,17 @@ static int mtk_probe(struct platform_device *pdev)
 			goto err_free_dev;
 	}
 
+	if (eth->soc->offload_version) {
+		err = mtk_ppe_init(&eth->ppe, eth->dev,
+				   eth->base + MTK_ETH_PPE_BASE, 2);
+		if (err)
+			goto err_free_dev;
+
+		err = mtk_eth_offload_init(eth);
+		if (err)
+			goto err_free_dev;
+	}
+
 	for (i = 0; i < MTK_MAX_DEVS; i++) {
 		if (!eth->netdev[i])
 			continue;
@@ -3162,6 +3185,7 @@ static const struct mtk_soc_data mt7621_data = {
 	.hw_features = MTK_HW_FEATURES,
 	.required_clks = MT7621_CLKS_BITMAP,
 	.required_pctl = false,
+	.offload_version = 2,
 };
 
 static const struct mtk_soc_data mt7622_data = {
@@ -3170,6 +3194,7 @@ static const struct mtk_soc_data mt7622_data = {
 	.hw_features = MTK_HW_FEATURES,
 	.required_clks = MT7622_CLKS_BITMAP,
 	.required_pctl = false,
+	.offload_version = 2,
 };
 
 static const struct mtk_soc_data mt7623_data = {
