Commit 5dca274

doc: fix and simplify network-for-clones.md
Address the discussion in #4720

- Reflect that the doc is just an example.
- Fix the doc so that it works.
- Split the ingress setup as that is a specific requirement that may not always
  be required, and complicates the implementation.
- Also simplify the test framework helpers since on revisiting some of the
  iptables rules were not needed.

Signed-off-by: Pablo Barbáchano <[email protected]>
1 parent a2daa4d commit 5dca274

File tree: 2 files changed, +96 -93 lines changed
network-for-clones.md

Lines changed: 64 additions & 78 deletions
@@ -1,8 +1,10 @@
 # Network Connectivity for Clones
 
-This document presents the strategy to ensure continued network connectivity for
-multiple clones created from a single Firecracker microVM snapshot. This
-document also provides an overview of the scalability benchmarks we performed.
+This document presents a strategy to ensure continued network connectivity for
+multiple clones created from a single Firecracker microVM snapshot.
+
+> \[!CAUTION\] This should be considered as just an example to get you started,
+> and we don't claim this is a performant or secure setup.
 
 ## Setup
 
@@ -13,14 +15,7 @@ names, and each guest will be resumed with the same network configuration, most
 importantly with the same IP address(es). To work around the former, each clone
 should be started within a separate network namespace (we can have multiple TAP
 interfaces with the same name, as long as they reside in distinct network
-namespaces). The latter can be mitigated by leveraging `iptables` `SNAT` and
-`DNAT` support. We choose a clone address (**CA**) for each clone, which is the
-new address that’s going to represent the guest, and make it so all packets
-leaving the VM have their source address rewritten to CA, and all incoming
-packets with the destination address equal to CA have it rewritten to the IP
-address configured inside the guest (which remains the same for all clones).
-Each individual clone continues to believe it’s using the original address, but
-outside the VM packets are assigned a different one for every clone.
+namespaces). The latter can be mitigated by leveraging `iptables` `NAT` support.
 
 Let’s have a more detailed look at this approach. We assume each VM has a single
 network interface attached. If multiple interfaces with full connectivity are
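
A quick way to see the point about namespaces in the hunk above is that identically named TAP devices can coexist, one per namespace. A minimal sketch (the `fc0`/`fc1` namespace names and the `tap0` device are illustrative, not part of the commit):

```bash
# identically named TAP devices, one per network namespace
sudo ip netns add fc0
sudo ip netns add fc1
sudo ip netns exec fc0 ip tuntap add tap0 mode tap
sudo ip netns exec fc1 ip tuntap add tap0 mode tap
# each namespace now reports its own tap0
sudo ip netns exec fc0 ip link show tap0
sudo ip netns exec fc1 ip link show tap0
```
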
@@ -37,8 +32,8 @@ corresponding virtio device (referred to as the guest IP address, for example
 Attempting to restore multiple clones from the same snapshot faces the problem
 of every single one of them attempting to use a TAP device with the original
 name, which is not possible by default. Therefore, we need to start each clone
-in a separate network namespace. This is already possible using the netns jailer
-parameter, described in the [documentation](../jailer.md). The specified
+in a separate network namespace. This is already possible using the `--netns`
+jailer parameter, described in the [documentation](../jailer.md). The specified
 namespace must already exist, so we have to create it first using
 
 ```bash
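
The hunk above ends at the opening fence of the doc's (unchanged) block that creates the namespace. For context, a hedged sketch of how a clone could then be launched into that namespace with the jailer's `--netns` parameter; the `--id`, `--exec-file`, `--uid` and `--gid` values are placeholders, not taken from this commit:

```bash
# the namespace must exist before the jailer is started with --netns
sudo ip netns add fc0
sudo jailer --id clone0 \
    --exec-file ./firecracker \
    --uid 1000 --gid 1000 \
    --netns /var/run/netns/fc0
```
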
@@ -94,9 +89,7 @@ use the following commands (for namespace `fc0`):
 
 ```bash
 # create the veth pair inside the namespace
-sudo ip netns exec fc0 ip link add veth1 type veth peer name veth0
-# move veth1 to the global host namespace
-sudo ip netns exec fc0 ip link set veth1 netns 1
+sudo ip link add name veth1 type veth peer name veth0 netns fc0
 
 sudo ip netns exec fc0 ip addr add 10.0.0.2/24 dev veth0
 sudo ip netns exec fc0 ip link set dev veth0 up
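
Between this hunk and the next (which continues the same code block), a quick sanity check can confirm the veth pair passes traffic. This sketch assumes the host end `veth1` carries `10.0.0.1/24`, which the default route in the next hunk implies but which is configured in an unchanged part of the doc:

```bash
# namespace side of the pair reaching the host side, and vice versa
sudo ip netns exec fc0 ping -c 1 10.0.0.1
ping -c 1 10.0.0.2
```
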
@@ -108,31 +101,32 @@ sudo ip link set dev veth1 up
 sudo ip netns exec fc0 ip route add default via 10.0.0.1
 ```
 
-### `iptables` rules for end-to-end connectivity
+### `iptables` rules for VM egress connectivity
 
 The last step involves adding the `iptables` rules which change the
 source/destination IP address of packets on the fly (thus allowing all clones to
-have the same internal IP). We need to choose a clone address, which is unique
-on the host for each VM. In the demo, we use
-`192.168.<idx / 30>.<(idx % 30) * 8 + 3>` (which is `192.168.0.3` for
-`clone 0`):
+have the same internal IP).
+
+```sh
+# Find the host egress device
+UPSTREAM=$(ip -j route list default |jq -r '.[0].dev')
+# anything coming from the VMs, we NAT the address
+iptables -t nat -A POSTROUTING -s 10.0.0.0/30 -o $UPSTREAM -j MASQUERADE
+# forward packets by default
+iptables -P FORWARD ACCEPT
+ip netns exec fc0 ip route add default via 10.0.0.1
+ip netns exec fc0 iptables -P FORWARD ACCEPT
+```
+
+You may also want to configure the guest with a default route and a DNS
+nameserver:
 
 ```bash
-# for packets that leave the namespace and have the source IP address of the
-# original guest, rewrite the source address to clone address 192.168.0.3
-sudo ip netns exec fc0 iptables -t nat -A POSTROUTING -o veth0 \
-  -s 192.168.241.2 -j SNAT --to 192.168.0.3
-
-# do the reverse operation; rewrites the destination address of packets
-# heading towards the clone address to 192.168.241.2
-sudo ip netns exec fc0 iptables -t nat -A PREROUTING -i veth0 \
-  -d 192.168.0.3 -j DNAT --to 192.168.241.2
-
-# (adds a route on the host for the clone address)
-sudo ip route add 192.168.0.3 via 10.0.0.2
+ip route default via 10.0.0.1
+echo nameserver 8.8.8.8 >/etc/resolv.conf
 ```
 
-**Full connectivity to/from the clone should be present at this point.**
+**Connectivity from the clone should be present at this point.**
 
 To make sure the guest also adjusts to the new environment, you can explicitly
 clear the ARP/neighbour table in the guest:
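
With the rules from the hunk above in place, egress can be checked in stages before involving the guest at all. A sketch (`8.8.8.8` is only used here as a well-known reachable address, matching the nameserver the doc configures; the same ping run inside the guest then exercises the full path):

```bash
# from inside the namespace: traffic should be masqueraded out of the host
sudo ip netns exec fc0 ping -c 1 8.8.8.8
# on the host: the POSTROUTING counters should show the masqueraded packets
sudo iptables -t nat -L POSTROUTING -n -v
```
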
@@ -146,47 +140,39 @@ Otherwise, packets originating from the guest might be using old Link Layer
 Address for up to arp cache timeout seconds. After said timeout period,
 connectivity will work both ways even without an explicit flush.
 
-## Scalability evaluation
-
-We ran synthetic tests to determine the impact of the addtional iptables rules
-and namespaces on network performance. We compare the case where each VM runs as
-regular Firecracker (gets assigned a TAP interface and a unique IP address in
-the global namespace) versus the setup with a separate network namespace for
-each VM (together with the veth pair and additional rules). We refer to the
-former as the basic case, while the latter is the ns case. We measure latency
-with the `ping` command and throughput with `iperf`.
-
-The experiments, ran on an Amazon AWS `m5d.metal` EC2 instace, go as follows:
-
-- Set up 3000 network resource slots (different TAP interfaces for the basic
-  case, and namespaces + everything else for ns). This is mainly to account for
-  any difference the setup itself might make, even if there are not as many
-  active endpoints at any given time.
-- Start 1000 Firecracker VMs, and pick `N < 1000` as the number of active VMs
-  that are going to generate network traffic. For ping experiments, we ping each
-  active VM from the host every 500ms for 30 seconds. For `iperf` experiments,
-  we measure the average bandwidth of connections from the host to every active
-  VM lasting 40 seconds. There is one separate client process per VM.
-- When `N = 100`, in the basic case we get average latencies of
-  `0.315 ms (0.160/0.430 min/max)` for `ping`, and an average throughput of
-  `2.25 Gbps (1.62/3.21 min/max)` per VM for `iperf`. In the ns case, the ping
-  results **are bumped higher by around 10-20 us**, while the `iperf` results
-  are virtually the same on average, with a higher minimum (1.73 Gbps) and a
-  lower maximum (3 Gbps).
-- When `N = 1000`, we start facing desynchronizations caused by difficulties in
-  starting (and thus finishing) the client processes all at the same time, which
-  creates a wider distribution of results. In the basic case, the average
-  latency for ping experiments has gone down to 0.305 ms, the minimum decreased
-  to 0.155 ms, but the maximum increased to 0.640 ms. The average `iperf` per VM
-  throughput is around `440 Mbps (110/3936 min/max)`. In the ns case, average
-  `ping` latency is now `0.318 ms (0.170/0.650 min/max)`. For `iperf`, the
-  average throughput is very close to basic at `~430 Mbps`, while the minimum
-  and maximum values are lower at `85/3803 Mbps`.
-
-**The above measurements give a significant degree of confidence in the
-scalability of the solution** (subject to repeating for different values of the
-experimental parameters, if necessary). The increase in latency is almost
-negligible considering usual end-to-end delays. The lower minimum throughput
-from the iperf measurements might be significant, but only if that magnitude of
-concurrent, data-intensive transfers is likely. Moreover, the basic measurements
-are close to an absolute upper bound.
+# Ingress connectivity
+
+The above setup only provides egress connectivity. If in addition we also want
+to add ingress (in other words, make the guest VM routable outside the network
+namespace), then we need to choose a "clone address" that will represent this VM
+uniquely. For our example we can use IPs from `172.16.0.0/12`, for example
+`172.16.0.1`.
+
+Then we can rewrite destination address heading towards the "clone address" to
+the guest IP.
+
+```bash
+ip netns exec fc0 iptables -t nat -A PREROUTING -i veth0 \
+  -d 172.16.0.1 -j DNAT --to 192.168.241.2
+```
+
+And add a route on the host so we can access the guest VM from the host network
+namespace:
+
+```bash
+ip route add 172.16.0.1 via 10.0.0.2
+```
+
+To confirm that ingress connectivity works, try
+
+```bash
+ping 172.16.0.1
+# or
+
+```
+
+# See also
+
+For an improved setup with full ingress and egress connectivity to the
+individual VMs, see
+[this discussion](https://github.com/firecracker-microvm/firecracker/discussions/4720).
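
Scaling the ingress hunk above to many clones mostly comes down to picking a distinct veth subnet and clone address per index. A hedged sketch of that bookkeeping; the `10.0.<i>.2` veth addressing extends the doc's single-clone example, the guest IP `192.168.241.2` is the doc's, and any non-overlapping scheme (such as the /30 subnets the test helper below carves out of `10.255.0.0/16`) works just as well:

```bash
GUEST_IP=192.168.241.2                 # identical inside every clone
for i in $(seq 0 9); do
    NS="fc$i"
    NS_VETH_IP="10.0.$i.2"             # namespace end of that clone's veth pair
    CLONE_ADDR="172.16.0.$((i + 1))"   # unique address representing the clone

    # rewrite packets addressed to the clone address to the shared guest IP
    sudo ip netns exec "$NS" iptables -t nat -A PREROUTING -i veth0 \
        -d "$CLONE_ADDR" -j DNAT --to "$GUEST_IP"
    # route the clone address towards that namespace's veth end
    sudo ip route add "$CLONE_ADDR" via "$NS_VETH_IP"
done
```
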

tests/framework/microvm_helpers.py

Lines changed: 32 additions & 15 deletions
@@ -58,6 +58,9 @@ class MicrovmHelpers:
     # private block
     _supernet = ipaddress.IPv4Network("10.255.0.0/16")
     _subnets_gen = _supernet.subnets(new_prefix=30)
+    # Addresses that can be used outside the netns. Could be public IPv4 blocks
+    _ingress_net = ipaddress.IPv4Network("172.16.0.0/12")
+    _ingress_gen = _ingress_net.hosts()
 
     def __init__(self, vm):
         self.vm = vm
@@ -140,15 +143,15 @@ def how_to_docker(self):
         """How to get into this container from outside"""
         return f"docker exec -it {DOCKER.id}"
 
-    def enable_ip_forwarding(self, iface="eth0"):
+    def enable_ip_forwarding(self, iface="eth0", ingress_ipv4=None):
         """Enables IP forwarding in the guest"""
         i = MicrovmHelpers.shared_subnet_ctr
         MicrovmHelpers.shared_subnet_ctr += 1
         netns = self.vm.netns.id
         veth_host = f"vethhost{i}"
-        veth_vpn = f"vethvpn{i}"
+        veth_guest = f"vethguest{i}"
         veth_net = next(self._subnets_gen)
-        veth_host_ip, veth_vpn_ip = list(veth_net.hosts())
+        veth_host_ip, veth_guest_ip = list(veth_net.hosts())
         iface = self.vm.iface[iface]["iface"]
         tap_host_ip = iface.host_ip
         tap_dev = iface.tap_name
@@ -170,30 +173,26 @@ def run_in_netns(cmd):
             return run(f"ip netns exec {netns} " + cmd)
 
         # outside netns
-        # iptables -L -v -n
+        # iptables -L -v -n --line-numbers
         run(
-            f"ip link add name {veth_host} type veth peer name {veth_vpn} netns {netns}"
+            f"ip link add name {veth_host} type veth peer name {veth_guest} netns {netns}"
         )
         run(f"ip addr add {veth_host_ip}/{veth_net.prefixlen} dev {veth_host}")
-        run_in_netns(f"ip addr add {veth_vpn_ip}/{veth_net.prefixlen} dev {veth_vpn}")
+        run_in_netns(
+            f"ip addr add {veth_guest_ip}/{veth_net.prefixlen} dev {veth_guest}"
+        )
         run(f"ip link set {veth_host} up")
-        run_in_netns(f"ip link set {veth_vpn} up")
+        run_in_netns(f"ip link set {veth_guest} up")
 
-        run("iptables -P FORWARD DROP")
+        run("iptables -P FORWARD ACCEPT")
         # iptables -L FORWARD
         # iptables -t nat -L
         run(
             f"iptables -t nat -A POSTROUTING -s {veth_net} -o {upstream_dev} -j MASQUERADE"
         )
-        run(f"iptables -A FORWARD -i {upstream_dev} -o {veth_host} -j ACCEPT")
-        run(f"iptables -A FORWARD -i {veth_host} -o {upstream_dev} -j ACCEPT")
-
-        # in the netns
         run_in_netns(f"ip route add default via {veth_host_ip}")
-        run_in_netns(f"iptables -A FORWARD -i {tap_dev} -o {veth_vpn} -j ACCEPT")
-        run_in_netns(f"iptables -A FORWARD -i {veth_vpn} -o {tap_dev} -j ACCEPT")
         run_in_netns(
-            f"iptables -t nat -A POSTROUTING -s {tap_net} -o {veth_vpn} -j MASQUERADE"
+            f"iptables -t nat -A POSTROUTING -s {tap_net} -o {veth_guest} -j MASQUERADE"
         )
 
         # Configure the guest
@@ -207,3 +206,21 @@ def run_in_netns(cmd):
             .strip()
         )
         self.vm.ssh.run(f"echo nameserver {nameserver} >/etc/resolv.conf")
+
+        # only configure ingress if we get an IP
+        if not ingress_ipv4:
+            return
+
+        if not isinstance(ingress_ipv4, ipaddress.IPv4Address):
+            ingress_ipv4 = next(self._ingress_gen)
+
+        guest_ip = iface.guest_ip
+
+        # packets heading towards the clone address are rewritten to the guest ip
+        run_in_netns(
+            f"iptables -t nat -A PREROUTING -i {veth_guest} -d {ingress_ipv4} -j DNAT --to {guest_ip}"
+        )
+
+        # add a route on the host for the clone address
+        run(f"ip route add {ingress_ipv4} via {veth_guest_ip}")
+
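
To see what the updated `enable_ip_forwarding` ends up installing when an ingress address is requested, the rules and routes can be inspected from a shell. A sketch; the namespace name is a placeholder to be replaced with the VM's netns id used by the test framework:

```bash
NETNS="fc0"   # substitute the VM's network namespace id
# NAT rules inside the VM's namespace: MASQUERADE for egress, DNAT for ingress
sudo ip netns exec "$NETNS" iptables -t nat -L -n -v --line-numbers
# host-side route pointing the 172.16.0.0/12 clone address at the veth end
ip route | grep 172.16
```
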
