Commit 7cc153f

Merge feature/host-network-device-ordering into master (#6725)
No conflicts. Adds two commits: (1) update datamodel_lifecycle; (2) make CI shellcheck happy, see #6724.
2 parents 0d03639 + bf125a2 commit 7cc153f

29 files changed

+2916
-227
lines changed
Lines changed: 342 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,342 @@
1+
---
2+
title: Host Network Device Ordering on Networkd
3+
description: How host network device ordering works on networkd.
4+
---
5+
6+
Purpose
7+
-------
8+
9+
One of the Toolstack's functions is to maintain a pool of hosts. A pool can be
10+
constructed by joining a host into an existing pool. One challenge in this
11+
process is determining which pool-wide network a network device on the joining
12+
host should connect to.
13+
14+
At first glance, this could be resolved by specifying a mapping between an
15+
individual network device and a pool-wide network. However, this approach
16+
would be burdensome for administrators when managing many hosts. It would be
17+
more efficient if the Toolstack could determine this automatically.
18+
19+
To achieve this, the Toolstack components on two hosts need to independently
20+
work out consistent identifications for the host network devices and connect
21+
the network devices with the same identification to the same pool-wide network.
22+
The identifications on a host can be considered as an order, with each network
23+
device assigned a unique position in the order as its identification. Network
24+
devices with the same position will connect to the same network.
25+
26+
27+
The assumption
28+
--------------
29+
30+
Why can the Toolstack components on two hosts independently work out an expected
31+
order without any communication? This is possible only under the assumption that
32+
the hosts have identical hardware, firmware, and software, and that network
33+
devices are plugged into them in the same way. For example, an administrator will always
34+
plug the network devices into the same PCI slot position on multiple hosts if
35+
they want these network devices to connect to the same network.
36+
37+
The ordering is considered consistent if the positions of such network devices
38+
(plugged into the same PCI slot position) in the generated orders are the same.
39+
40+
41+
The biosdevname
42+
---------------
43+
Particularly, when the assumption above holds, a consistent initial order can be
44+
worked out on multiple hosts independently with the help of `biosdevname`. The
45+
"all_ethN" policy of the `biosdevname` utility can generate a device order based
46+
on whether the device is embedded or not, PCI cards in ascending slot order, and
47+
ports in ascending PCI bus/device/function order breadth-first. Since the hosts
48+
are identical, the orders generated by the `biosdevname` are consistent across
49+
the hosts.
50+
51+
An example of `biosdevname` output is shown below. The initial order can
52+
be derived from the `BIOS device` field.
53+
54+
```
55+
# biosdevname --policy all_ethN -d -x
56+
BIOS device: eth0
57+
Kernel name: enp5s0
58+
Permanent MAC: 00:02:C9:ED:FD:F0
59+
Assigned MAC : 00:02:C9:ED:FD:F0
60+
Bus Info: 0000:05:00.0
61+
...
62+
63+
BIOS device: eth1
64+
Kernel name: enp5s1
65+
Permanent MAC: 00:02:C9:ED:FD:F1
66+
Assigned MAC : 00:02:C9:ED:FD:F1
67+
Bus Info: 0000:05:01.0
68+
...
69+
```
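
As a rough illustration of how the `BIOS device` field could be turned into an initial order, the sketch below parses output shaped like the example above. The function names `parse_biosdevname` and `initial_order` are hypothetical and are not part of xcp-networkd. The sketch assumes every record starts with a `BIOS device` line and that names follow the `ethN` pattern.

```python
import re

# Sample text copied from the example output above (elided lines kept as "...").
SAMPLE = """\
BIOS device: eth0
Kernel name: enp5s0
Permanent MAC: 00:02:C9:ED:FD:F0
Assigned MAC : 00:02:C9:ED:FD:F0
Bus Info: 0000:05:00.0
...

BIOS device: eth1
Kernel name: enp5s1
Permanent MAC: 00:02:C9:ED:FD:F1
Assigned MAC : 00:02:C9:ED:FD:F1
Bus Info: 0000:05:01.0
...
"""


def parse_biosdevname(output):
    """Return one dict of fields per device (hypothetical helper)."""
    devices = []
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank or elided lines carry no field
        key, value = key.strip(), value.strip()
        if key == "BIOS device":
            devices.append({})  # assumption: this field opens each record
        if devices:
            devices[-1][key] = value
    return devices


def initial_order(devices):
    """Return (mac, pci, position) tuples sorted by the N in ethN."""
    def index(dev):
        return int(re.fullmatch(r"eth(\d+)", dev["BIOS device"]).group(1))

    return [
        (dev["Permanent MAC"], dev["Bus Info"], index(dev))
        for dev in sorted(devices, key=index)
    ]
```

Calling `initial_order(parse_biosdevname(SAMPLE))` yields one tuple per device, with the position taken from the `ethN` suffix.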
70+
71+
However, the `BIOS device` of a particular network device may change with the
72+
addition or removal of devices. For example:
73+
74+
```
75+
# biosdevname --policy all_ethN -d -x
76+
BIOS device: eth0
77+
Kernel name: enp4s0
78+
Permanent MAC: EC:F4:BB:E6:D7:BB
79+
Assigned MAC : EC:F4:BB:E6:D7:BB
80+
Bus Info: 0000:04:00.0
81+
...
82+
83+
BIOS device: eth1
84+
Kernel name: enp5s0
85+
Permanent MAC: 00:02:C9:ED:FD:F0
86+
Assigned MAC : 00:02:C9:ED:FD:F0
87+
Bus Info: 0000:05:00.0
88+
...
89+
90+
BIOS device: eth2
91+
Kernel name: enp5s1
92+
Permanent MAC: 00:02:C9:ED:FD:F1
93+
Assigned MAC : 00:02:C9:ED:FD:F1
94+
Bus Info: 0000:05:01.0
95+
...
96+
```
97+
98+
Therefore, the order derived from these values is used solely for determining
99+
the initial order and the order of newly added devices.
100+
101+
Principles
102+
-----------
103+
* Initially, the order is aligned with PCI slots. This is to make the connection
104+
between cabling and order predictable: The network devices in identical PCI
105+
slots have the same position. The rationale is that PCI slots are more
106+
predictable than MAC addresses and correspond to physical locations.
107+
108+
* Once a previous order has been established, the ordering should be maintained
109+
as stable as possible despite changes to MAC addresses or PCI addresses. The
110+
rationale is that the assumption is less likely to hold as long as the hosts are
111+
experiencing updates and maintenance. Therefore, maintaining the stable order is
112+
the best choice for automatic ordering.
113+
114+
Notation
115+
--------
116+
117+
```
118+
mac:pci:position
119+
!mac:pci:position
120+
```
121+
122+
A network device is characterised by
123+
124+
* MAC address, which is unique.
125+
* PCI slot, which is not unique: multiple network devices can share a PCI
126+
slot. PCI addresses correspond to hardware PCI slots and thus are physically
127+
observable.
128+
* position, the position assigned to this network device by xcp-networkd. At any
129+
given time, no position is assigned twice but the sequence of positions may have
130+
holes.
131+
* The `!mac:pci:position` notation indicates that this position was previously
132+
used but is currently free because the device it was assigned to was removed.
133+
134+
On a Linux system, MAC and PCI addresses have specific formats. However, for
135+
simplicity, symbolic names are used here: MAC addresses use lowercase letters,
136+
PCI addresses use uppercase letters, and positions use numbers.
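
To make the notation concrete, here is a minimal sketch that reads and writes the symbolic form used in this document. The `Entry` type and the helper names are hypothetical. The sketch handles only the symbolic names, since real MAC and PCI addresses contain colons themselves and would need a different separator.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    mac: str
    pci: str
    position: Optional[int] = None  # None for devices not yet ordered
    present: bool = True            # False for the "!" (removed) form


def parse_entry(text):
    """Parse 'mac:pci', 'mac:pci:position' or '!mac:pci:position'."""
    present = not text.startswith("!")
    fields = text.lstrip("!").split(":")
    mac, pci = fields[0], fields[1]
    position = int(fields[2]) if len(fields) > 2 and fields[2] else None
    return Entry(mac, pci, position, present)


def format_entry(entry):
    """Inverse of parse_entry."""
    text = f"{entry.mac}:{entry.pci}"
    if entry.position is not None:
        text += f":{entry.position}"
    return text if entry.present else "!" + text


def parse_order(line):
    """Parse a whitespace-separated line of entries."""
    return [parse_entry(token) for token in line.split()]
```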
137+
138+
Scenarios
139+
---------
140+
141+
### The initial order
142+
143+
As mentioned above, the `biosdevname` can be used to generate consistent orders
144+
for the network devices on multiple hosts.
145+
146+
```
147+
current input: a:A b:D c:C
148+
initial order: a:A:0 c:C:1 b:D:2
149+
```
150+
151+
This only works if the assumption of identical hardware, firmware, software, and
152+
network device placement holds. The assumption is expected to
153+
hold for the majority of the use cases.
154+
155+
Otherwise, the order can be generated from a user's configuration. The user can
156+
specify the order explicitly for individual hosts. However, administrators would
157+
prefer to avoid this as much as possible when managing many hosts.
158+
159+
```
160+
user spec: a::0 c::1 b::2
161+
current input: a:A b:D c:C
162+
initial order: a:A:0 c:C:1 b:D:2
163+
```
164+
165+
### Keep the order as stable as possible
166+
167+
Once an initial order is created on an individual host, it should be kept as
168+
stable as possible across host boot-ups and at runtime. For example, unless
169+
there are hardware changes, the position of a network device in the initial
170+
order should remain the same regardless of how many times the host is rebooted.
171+
172+
To achieve this, the initial order should be saved persistently on the host's
173+
local storage so it can be referenced in subsequent orderings. When performing
174+
another ordering after the initial order has been saved, the position of a
175+
currently unordered network device should be determined by finding its position
176+
in the last saved order. The MAC address of the network device is a reliable
177+
attribute for this purpose, as it is considered unique for each network device
178+
globally.
179+
180+
Therefore, the network devices in the saved order should have their MAC
181+
addresses saved together, effectively mapping each position to a MAC address.
182+
When performing an ordering, the stable position can be found by searching the
183+
last saved order using the MAC address.
184+
185+
```
186+
last order: a:A:0 c:C:1 b:D:2
187+
current input: a:A b:D c:C
188+
new order: a:A:0 c:C:1 b:D:2
189+
```
190+
191+
Name labels of the network devices are not considered reliable enough to
192+
identify particular devices. For example, if the name labels are determined by
193+
the PCI address via systemd, and a firmware update changes the PCI addresses of
194+
the network devices, the name labels will also change.
195+
196+
The PCI addresses are not considered reliable either. They may change due to
197+
firmware updates or setting changes, or even plugging/unplugging other devices.
198+
199+
```
200+
last order: a:A:0 c:C:1 b:D:2
201+
current input: a:A b:B c:E
202+
new order: a:A:0 c:E:1 b:B:2
203+
```
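
The MAC-based lookup described in this section can be sketched as follows. This is an illustration only: the function name `keep_stable` and the tuple layout are assumptions, not the xcp-networkd implementation. Devices whose MAC address is not in the saved order are returned separately, because this section does not decide their position.

```python
def keep_stable(last_order, current):
    """Reuse saved positions for devices whose MAC address is known.

    last_order: list of (mac, pci, position) from the last saved order.
    current:    list of (mac, pci) for devices present now.
    Returns (ordered, unmatched).
    """
    position_by_mac = {mac: pos for mac, _pci, pos in last_order}
    ordered, unmatched = [], []
    for mac, pci in current:
        if mac in position_by_mac:
            # The current PCI address is recorded; the position does not change.
            ordered.append((mac, pci, position_by_mac[mac]))
        else:
            unmatched.append((mac, pci))
    ordered.sort(key=lambda entry: entry[2])
    return ordered, unmatched
```

With the last example above (`a:A:0 c:C:1 b:D:2` as the last order and `a:A b:B c:E` as the current input), every device keeps its position even though two PCI addresses changed.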
204+
205+
### Replacement
206+
207+
However, what happens when the MAC address of an unordered network device cannot
208+
be found in the last saved order? There are two possible scenarios:
209+
210+
1. It's a newly added network device since the last ordering.
211+
2. It's a new device that replaces an existing network device.
212+
213+
Replacement is a supported scenario, as an administrator might replace a broken
214+
network device with a new one.
215+
216+
This can be recognized by comparing the PCI address where the network device is
217+
located. Therefore, the PCI address of each network device should also be saved
218+
in the order. In this case, searching the PCI address in the order results in
219+
one of the following:
220+
221+
1. Not found: This means the PCI address was not occupied during the last
222+
ordering, indicating a newly added network device.
223+
2. Found with a MAC address, but another device with this MAC address is still
224+
present in the system: This suggests that the PCI address of an existing
225+
network device (with the same MAC address) has changed since the last ordering.
226+
This may be caused by a device move or by something else, such as a firmware update. In
227+
this case, the current unordered network device is considered newly added.
228+
229+
```
230+
last order: a:A:0 c:C:1 b:D:2
231+
current input: a:A b:B c:C d:D
232+
new order: a:A:0 c:C:1 b:B:2 d:D:3
233+
```
234+
235+
3. Found with a MAC address, and no current devices have this MAC address: This
236+
indicates that a new network device has replaced the old one in the same PCI
237+
slot.
238+
The replacing network device should be assigned the same position as the
239+
replaced one.
240+
241+
```
242+
last order: a:A:0 c:C:1 b:D:2
243+
current input: a:A c:C d:D
244+
new order: a:A:0 c:C:1 d:D:2
245+
```
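
The three outcomes of the PCI address search can be sketched as a small classifier. The function name `classify_unmatched` and its return values are hypothetical. The sketch also assumes at most one saved device per PCI address, which does not hold for multinic functions.

```python
def classify_unmatched(device, last_order, current):
    """Classify a device whose MAC address is not in the last saved order.

    device:     (mac, pci) of the unordered device.
    last_order: list of (mac, pci, position) from the last saved order.
    current:    list of (mac, pci) for all devices present now.
    Returns ("new", None) or ("replacement", position).
    """
    _mac, pci = device
    current_macs = {mac for mac, _pci in current}
    for old_mac, old_pci, position in last_order:
        if old_pci != pci:
            continue
        if old_mac in current_macs:
            # Case 2: the previous occupant still exists elsewhere.
            return ("new", None)
        # Case 3: the previous occupant is gone, so this device replaces it.
        return ("replacement", position)
    # Case 1: the PCI address was not occupied in the last ordering.
    return ("new", None)
```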
246+
247+
### Removed devices
248+
249+
A network device can be removed or unplugged since the last ordering. Its
250+
position, MAC address, and PCI address are saved for future reference, and its
251+
position will be reserved. This means there may be a gap in the order: a
252+
position that was previously assigned to a network device is now vacant because
253+
the device has been removed.
254+
255+
```
256+
last order: a:A:0 c:C:1 b:D:2
257+
current input: a:A b:D
258+
new order: a:A:0 !c:C:1 b:D:2
259+
```
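
One way to picture the reservation is to carry every saved entry forward and flag the ones that are no longer present. The sketch below assumes a four-field tuple `(mac, pci, position, present)`. Both that layout and the name `carry_over` are illustrative and not taken from xcp-networkd.

```python
def carry_over(last_order, current):
    """Carry saved entries forward, flagging devices that disappeared.

    last_order: list of (mac, pci, position, present).
    current:    list of (mac, pci) for devices present now.
    """
    current_pci = dict(current)
    result = []
    for mac, pci, position, _present in last_order:
        if mac in current_pci:
            result.append((mac, current_pci[mac], position, True))
        else:
            # Keep the saved MAC and PCI address and reserve the position.
            result.append((mac, pci, position, False))
    return result
```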
260+
261+
### Newly added devices
262+
263+
As long as `the assumption` holds, newly added devices since the last ordering
264+
can be assigned positions consistently across multiple hosts. Newly added
265+
devices will not be assigned the positions reserved for removed devices.
266+
267+
```
268+
last order: a:A:0 !c:C:1 d:D:2
269+
current input: a:A d:D e:E
270+
new order: a:A:0 !c:C:1 d:D:2 e:E:3
271+
```
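
A simple way to avoid reusing reserved positions, consistent with the example above, is to allocate new positions after the highest position ever assigned. This allocation rule is an assumption drawn from the examples, not a statement of how xcp-networkd does it. The name `append_new` and the tuple layout `(mac, pci, position, present)` are hypothetical.

```python
def append_new(order, new_devices):
    """Append newly added devices after the highest known position.

    order:       list of (mac, pci, position, present), removed ones included.
    new_devices: list of (mac, pci), already in the desired relative order.
    """
    # Reserved positions count too, so they are never handed out again.
    next_position = max((pos for _mac, _pci, pos, _present in order), default=-1) + 1
    result = list(order)
    for offset, (mac, pci) in enumerate(new_devices):
        result.append((mac, pci, next_position + offset, True))
    return result
```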
272+
273+
### Removed and then added back
274+
275+
It is a supported scenario for a removed device to be plugged back in,
276+
regardless of whether it is in the same PCI slot or not. This can be recognized
277+
by searching for the device in the saved removed devices using its MAC address.
278+
The reserved position will be reassigned to the device when it is added back.
279+
280+
```
281+
last order: a:A:0 !c:C:1 d:D:2
282+
current input: a:A c:F d:D e:E
283+
new order: a:A:0 c:F:1 d:D:2 e:E:3
284+
```
285+
286+
### Multinic functions
287+
288+
The multinic function is a special kind of network device. When this type of
289+
physical device is plugged into a PCI slot, multiple network devices are
290+
reported at a single PCI address. Additionally, the number of reported network
291+
devices may change due to driver updates.
292+
293+
```
294+
current input: a:A b:A c:A d:A
295+
initial order: a:A:0 b:A:1 c:A:2 d:A:3
296+
```
297+
298+
As long as `the assumption` holds, the initial order of these devices can be
299+
generated automatically and kept stable by using MAC addresses to identify
300+
individual devices. However, `biosdevname` cannot reliably generate an order for
301+
all devices reported at one PCI address. For devices located at the same PCI
302+
address, their MAC addresses are used to generate the initial order.
303+
304+
```
305+
last order: a:A:0 b:A:1 c:A:2 d:A:3 m:M:4 n:N:5
306+
current input: a:A b:A c:A d:A e:A f:A m:M n:N
307+
new order: a:A:0 b:A:1 c:A:2 d:A:3 m:M:4 n:N:5 e:A:6 f:A:7
308+
```
309+
310+
For example, suppose `biosdevname` generates an order for a multinic function
311+
and other non-multinic devices. Within this order, the N devices of the
312+
multinic function with MAC addresses mac[1], ..., mac[N] are assigned positions
313+
pos[1], ..., pos[N] correspondingly. `biosdevname` cannot ensure that the device
314+
with mac[1] is always assigned position pos[1]. Instead, it ensures that the
315+
entire set of positions pos[1], ..., pos[N] remains stable for the devices of
316+
the multinic function. Therefore, to ensure the order follows the MAC address
317+
order, the devices of the multinic function need to be sorted by their MAC
318+
addresses within the set of positions.
319+
320+
```
321+
last order: a:A:0 b:A:1 c:A:2 d:A:3 m:M:4
322+
current input: e:A f:A g:A h:A m:M
323+
new order: e:A:0 f:A:1 g:A:2 h:A:3 m:M:4
324+
```
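
The step of sorting multinic devices by MAC address within their set of positions can be sketched as follows. The function name `sort_within_pci` is hypothetical. Plain string comparison stands in for MAC address ordering, which is only valid when all addresses use the same case and format.

```python
def sort_within_pci(order):
    """Redistribute positions so devices sharing a PCI address follow MAC order.

    order: list of (mac, pci, position), e.g. derived from biosdevname.
    The set of positions held by each PCI address is left unchanged.
    """
    members_by_pci = {}
    for mac, pci, position in order:
        members_by_pci.setdefault(pci, []).append((mac, position))

    result = []
    for pci, members in members_by_pci.items():
        macs = sorted(mac for mac, _pos in members)
        positions = sorted(pos for _mac, pos in members)
        # The lowest MAC gets the lowest position of the group, and so on.
        result.extend((mac, pci, pos) for mac, pos in zip(macs, positions))
    return sorted(result, key=lambda entry: entry[2])
```

Devices that do not share a PCI address form groups of one, so their positions are left as they were.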
325+
326+
Rare cases that cannot be handled automatically
327+
------------------------------------------------
328+
329+
In summary, to keep the order stable, the auto-generated order needs to be saved
330+
for the next ordering. When performing an automatic ordering for the current
331+
network devices, either the MAC address or the PCI address is used to recognize
332+
the device that was assigned the same position in the last ordering. If neither
333+
the MAC address nor the PCI address can be used to find a position from the last
334+
ordering, the device is considered newly added and is assigned a new position.
335+
336+
However, following this sorting logic, the ordering result may not always be as
337+
expected. In practice, this can be caused by various rare cases, such as
338+
switching an existing network device to connect to another network, performing
339+
firmware updates, changing firmware settings, or plugging/unplugging network
340+
devices. It is not worth complicating the entire function for these rare cases.
341+
Instead, the initial user's configuration can be used to handle these rare
342+
scenarios.

ocaml/idl/datamodel_lifecycle.ml

Lines changed: 2 additions & 2 deletions
@@ -226,9 +226,9 @@ let prototyped_of_message = function
226226
| "VTPM", "create" ->
227227
Some "22.26.0"
228228
| "host", "update_firewalld_service_status" ->
229-
Some "25.33.0-next"
229+
Some "25.34.0"
230230
| "host", "get_tracked_user_agents" ->
231-
Some "25.33.0-next"
231+
Some "25.34.0"
232232
| "host", "set_ssh_auto_mode" ->
233233
Some "25.27.0"
234234
| "host", "set_console_idle_timeout" ->

ocaml/networkd/bin/network_monitor_thread.ml

Lines changed: 8 additions & 7 deletions
@@ -26,10 +26,7 @@ let bonds_status : (string, int * int) Hashtbl.t = Hashtbl.create 10
2626

2727
let monitor_whitelist =
2828
ref
29-
[
30-
"eth"
31-
; "vif" (* This includes "tap" owing to the use of standardise_name below *)
32-
]
29+
["vif" (* This includes "tap" owing to the use of standardise_name below *)]
3330

3431
let rpc xml =
3532
let open Xmlrpc_client in
@@ -108,7 +105,10 @@ let standardise_name name =
108105
newname
109106
with _ -> name
110107

111-
let get_link_stats () =
108+
let get_link_stats dbg () =
109+
let managed_host_net_devs =
110+
Network_server.Interface.get_interface_positions dbg () |> List.map fst
111+
in
112112
let open Netlink in
113113
let s = Socket.alloc () in
114114
Socket.connect s Socket.NETLINK_ROUTE ;
@@ -119,9 +119,10 @@ let get_link_stats () =
119119
List.exists
120120
(fun s -> Astring.String.is_prefix ~affix:s name)
121121
!monitor_whitelist
122+
|| List.mem name managed_host_net_devs
122123
in
123124
let is_vlan name =
124-
Astring.String.is_prefix ~affix:"eth" name && String.contains name '.'
125+
List.mem name managed_host_net_devs && String.contains name '.'
125126
in
126127
List.map (fun link -> standardise_name (Link.get_name link)) links
127128
|> (* Only keep interfaces with prefixes on the whitelist, and exclude VLAN
@@ -226,7 +227,7 @@ let rec monitor dbg () =
226227
Network_server.Bridge.get_all_bonds dbg from_cache
227228
in
228229
let add_bonds bonds devs = List.map fst bonds @ devs in
229-
let devs = get_link_stats () |> add_bonds bonds |> get_stats bonds in
230+
let devs = get_link_stats dbg () |> add_bonds bonds |> get_stats bonds in
230231
( if List.length bonds <> Hashtbl.length bonds_status then
231232
let dead_bonds =
232233
Hashtbl.fold
