| 1 | +--- |
| 2 | +title: Host Network Device Ordering on Networkd |
| 3 | +description: How host network device ordering works on networkd. |
| 4 | +--- |
| 5 | + |
| 6 | +Purpose |
| 7 | +------- |
| 8 | + |
| 9 | +One of the Toolstack's functions is to maintain a pool of hosts. A pool can be |
| 10 | +constructed by joining a host into an existing pool. One challenge in this |
| 11 | +process is determining which pool-wide network a network device on the joining |
| 12 | +host should connect to. |
| 13 | + |
| 14 | +At first glance, this could be resolved by specifying a mapping between an |
| 15 | +individual network device and a pool-wide network. However, this approach |
| 16 | +would be burdensome for administrators when managing many hosts. It would be |
| 17 | +more efficient if the Toolstack could determine this automatically. |
| 18 | + |
| 19 | +To achieve this, the Toolstack components on two hosts need to independently |
| 20 | +work out consistent identifications for the host network devices and connect |
| 21 | +the network devices with the same identification to the same pool-wide network. |
| 22 | +The identifications on a host can be considered as an order, with each network |
| 23 | +device assigned a unique position in the order as its identification. Network |
| 24 | +devices with the same position will connect to the same network. |
| 25 | + |
| 26 | + |
| 27 | +The assumption |
| 28 | +-------------- |
| 29 | + |
| 30 | +Why can the Toolstack components on two hosts independently work out an expected |
| 31 | +order without any communication? This is possible only under the assumption that |
| 32 | +the hosts are identical in hardware, firmware, software, and the way |
| 33 | +network devices are plugged into them. For example, an administrator will always |
| 34 | +plug the network devices into the same PCI slot position on multiple hosts if |
| 35 | +they want these network devices to connect to the same network. |
| 36 | + |
| 37 | +The ordering is considered consistent if the positions of such network devices |
| 38 | +(plugged into the same PCI slot position) in the generated orders are the same. |
| 39 | + |
| 40 | + |
| 41 | +The biosdevname |
| 42 | +--------------- |
| 43 | +Particularly, when the assumption above holds, a consistent initial order can be |
| 44 | +worked out on multiple hosts independently with the help of `biosdevname`. The |
| 45 | +"all_ethN" policy of the `biosdevname` utility generates a device order with |
| 46 | +embedded devices first, then PCI cards in ascending slot order, with |
| 47 | +ports in ascending PCI bus/device/function order breadth-first. Since the hosts |
| 48 | +are identical, the orders generated by `biosdevname` are consistent across |
| 49 | +the hosts. |
| 50 | + |
| 51 | +An example of `biosdevname` output is shown below. The initial order can |
| 52 | +be derived from the `BIOS device` field. |
| 53 | + |
| 54 | +``` |
| 55 | +# biosdevname --policy all_ethN -d -x |
| 56 | +BIOS device: eth0 |
| 57 | +Kernel name: enp5s0 |
| 58 | +Permanent MAC: 00:02:C9:ED:FD:F0 |
| 59 | +Assigned MAC : 00:02:C9:ED:FD:F0 |
| 60 | +Bus Info: 0000:05:00.0 |
| 61 | +... |
| 62 | + |
| 63 | +BIOS device: eth1 |
| 64 | +Kernel name: enp5s1 |
| 65 | +Permanent MAC: 00:02:C9:ED:FD:F1 |
| 66 | +Assigned MAC : 00:02:C9:ED:FD:F1 |
| 67 | +Bus Info: 0000:05:01.0 |
| 68 | +... |
| 69 | +``` |
| 70 | + |
| 71 | +However, the `BIOS device` of a particular network device may change with the |
| 72 | +addition or removal of devices. For example: |
| 73 | + |
| 74 | +``` |
| 75 | +# biosdevname --policy all_ethN -d -x |
| 76 | +BIOS device: eth0 |
| 77 | +Kernel name: enp4s0 |
| 78 | +Permanent MAC: EC:F4:BB:E6:D7:BB |
| 79 | +Assigned MAC : EC:F4:BB:E6:D7:BB |
| 80 | +Bus Info: 0000:04:00.0 |
| 81 | +... |
| 82 | + |
| 83 | +BIOS device: eth1 |
| 84 | +Kernel name: enp5s0 |
| 85 | +Permanent MAC: 00:02:C9:ED:FD:F0 |
| 86 | +Assigned MAC : 00:02:C9:ED:FD:F0 |
| 87 | +Bus Info: 0000:05:00.0 |
| 88 | +... |
| 89 | + |
| 90 | +BIOS device: eth2 |
| 91 | +Kernel name: enp5s1 |
| 92 | +Permanent MAC: 00:02:C9:ED:FD:F1 |
| 93 | +Assigned MAC : 00:02:C9:ED:FD:F1 |
| 94 | +Bus Info: 0000:05:01.0 |
| 95 | +... |
| 96 | +``` |
| 97 | + |
| 98 | +Therefore, the order derived from these values is used solely for determining |
| 99 | +the initial order and the order of newly added devices. |
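As a rough illustration of how an initial order could be read from the `BIOS device` field, the Python sketch below parses output in the record format shown above. The function name `initial_order` and the choice to identify devices by their lowercased permanent MAC are assumptions made for this example only; they are not taken from xcp-networkd.

```python
import re


def initial_order(biosdevname_output: str) -> list[tuple[int, str]]:
    """Return (position, permanent MAC) pairs sorted by position.

    Assumes blank-line-separated records of "key: value" lines, as printed
    by `biosdevname --policy all_ethN -d -x` in the examples above.
    """
    order = []
    for record in re.split(r"\n\s*\n", biosdevname_output.strip()):
        fields = {}
        for line in record.splitlines():
            # Split on the first colon only, so MAC addresses stay intact.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        match = re.fullmatch(r"eth(\d+)", fields.get("BIOS device", ""))
        if match and "Permanent MAC" in fields:
            order.append((int(match.group(1)), fields["Permanent MAC"].lower()))
    return sorted(order)
```

The position is simply the `N` of `ethN`, so the result can serve as a starting point only while no previous order has been saved.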
| 100 | + |
| 101 | +Principles |
| 102 | +----------- |
| 103 | +* Initially, the order is aligned with PCI slots. This is to make the connection |
| 104 | +between cabling and order predictable: The network devices in identical PCI |
| 105 | +slots have the same position. The rationale is that PCI slots are more |
| 106 | +predictable than MAC addresses and correspond to physical locations. |
| 107 | + |
| 108 | +* Once a previous order has been established, the ordering should be kept as |
| 109 | +stable as possible despite changes to MAC addresses or PCI addresses. The |
| 110 | +rationale is that the assumption becomes less likely to hold as the hosts |
| 111 | +undergo updates and maintenance. Therefore, keeping the established order stable |
| 112 | +is the best choice for automatic ordering. |
| 113 | + |
| 114 | +Notation |
| 115 | +-------- |
| 116 | + |
| 117 | +``` |
| 118 | +mac:pci:position |
| 119 | +!mac:pci:position |
| 120 | +``` |
| 121 | + |
| 122 | +A network device is characterised by |
| 123 | + |
| 124 | +* MAC address, which is unique. |
| 125 | +* PCI slot, which is not unique: multiple network devices can share a PCI |
| 126 | +slot. PCI addresses correspond to hardware PCI slots and thus are physically |
| 127 | +observable. |
| 128 | +* position, the position assigned to this network device by xcp-networkd. At any |
| 129 | +given time, no position is assigned twice but the sequence of positions may have |
| 130 | +holes. |
| 131 | +* The `!mac:pci:position` notation indicates that this position was previously |
| 132 | +used but is currently free because the device it was assigned to was removed. |
| 133 | + |
| 134 | +On a Linux system, MAC and PCI addresses have specific formats. However, for |
| 135 | +simplicity, symbolic names are used here: MAC addresses use lowercase letters, |
| 136 | +PCI addresses use uppercase letters, and positions use numbers. |
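To make the notation concrete, here is a small Python sketch that parses it into records. The names `Entry`, `parse_entry`, and `parse_order` are hypothetical and exist only for this illustration; the symbolic MAC and PCI names are treated as opaque strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    mac: str
    pci: str
    position: Optional[int] = None  # None for a not-yet-ordered device
    removed: bool = False           # True for the "!mac:pci:position" form


def parse_entry(text: str) -> Entry:
    """Parse one token such as "a:A", "a:A:0", "a::0" or "!c:C:1"."""
    removed = text.startswith("!")
    parts = text.lstrip("!").split(":")
    mac = parts[0]
    pci = parts[1] if len(parts) > 1 else ""
    position = int(parts[2]) if len(parts) > 2 and parts[2] else None
    return Entry(mac, pci, position, removed)


def parse_order(line: str) -> list[Entry]:
    """Parse a whitespace-separated list of tokens."""
    return [parse_entry(token) for token in line.split()]
```

For example, `parse_order("a:A:0 !c:C:1")` yields one present device at position 0 and one reserved position 1.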
| 137 | + |
| 138 | +Scenarios |
| 139 | +--------- |
| 140 | + |
| 141 | +### The initial order |
| 142 | + |
| 143 | +As mentioned above, `biosdevname` can be used to generate consistent orders |
| 144 | +for the network devices on multiple hosts. |
| 145 | + |
| 146 | +``` |
| 147 | +current input: a:A b:D c:C |
| 148 | +initial order: a:A:0 c:C:1 b:D:2 |
| 149 | +``` |
| 150 | + |
| 151 | +This only works if the assumption of identical hardware, firmware, software, and |
| 152 | +network device placement holds. The assumption is expected to hold for the |
| 153 | +majority of use cases. |
| 154 | + |
| 155 | +Otherwise, the order can be generated from a user's configuration. The user can |
| 156 | +specify the order explicitly for individual hosts. However, administrators would |
| 157 | +prefer to avoid this as much as possible when managing many hosts. |
| 158 | + |
| 159 | +``` |
| 160 | +user spec: a::0 c::1 b::2 |
| 161 | +current input: a:A b:D c:C |
| 162 | +initial order: a:A:0 c:C:1 b:D:2 |
| 163 | +``` |
| 164 | + |
| 165 | +### Keep the order as stable as possible |
| 166 | + |
| 167 | +Once an initial order is created on an individual host, it should be kept as |
| 168 | +stable as possible across host boot-ups and at runtime. For example, unless |
| 169 | +there are hardware changes, the position of a network device in the initial |
| 170 | +order should remain the same regardless of how many times the host is rebooted. |
| 171 | + |
| 172 | +To achieve this, the initial order should be saved persistently on the host's |
| 173 | +local storage so it can be referenced in subsequent orderings. When performing |
| 174 | +another ordering after the initial order has been saved, the position of a |
| 175 | +currently unordered network device should be determined by finding its position |
| 176 | +in the last saved order. The MAC address of the network device is a reliable |
| 177 | +attribute for this purpose, as it is considered unique for each network device |
| 178 | +globally. |
| 179 | + |
| 180 | +Therefore, the network devices in the saved order should have their MAC |
| 181 | +addresses saved together, effectively mapping each position to a MAC address. |
| 182 | +When performing an ordering, the stable position can be found by searching the |
| 183 | +last saved order using the MAC address. |
| 184 | + |
| 185 | +``` |
| 186 | +last order: a:A:0 c:C:1 b:D:2 |
| 187 | +current input: a:A b:D c:C |
| 188 | +new order: a:A:0 c:C:1 b:D:2 |
| 189 | +``` |
| 190 | + |
| 191 | +Name labels of the network devices are not considered reliable enough to |
| 192 | +identify particular devices. For example, if the name labels are determined by |
| 193 | +the PCI address via systemd, and a firmware update changes the PCI addresses of |
| 194 | +the network devices, the name labels will also change. |
| 195 | + |
| 196 | +PCI addresses are not considered reliable either. They may change due to |
| 197 | +firmware updates, setting changes, or even plugging/unplugging other devices. |
| 198 | + |
| 199 | +``` |
| 200 | +last order: a:A:0 c:C:1 b:D:2 |
| 201 | +current input: a:A b:B c:E |
| 202 | +new order: a:A:0 c:E:1 b:B:2 |
| 203 | +``` |
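The MAC-based lookup described in this section can be sketched in Python as follows. The function name `keep_stable` and the dictionary layout (`mac -> (pci, position)`) are assumptions chosen for the illustration, not the actual data model of xcp-networkd.

```python
def keep_stable(last_order, current):
    """Reuse saved positions for devices whose MAC address is already known.

    last_order: {mac: (pci, position)} as saved by the previous ordering.
    current:    [(mac, pci)] as seen on the host now.
    Returns (ordered, unknown): `ordered` maps each known MAC to its current
    PCI address and its saved position; `unknown` lists the devices whose
    MAC address is not in the saved order.
    """
    ordered, unknown = {}, []
    for mac, pci in current:
        if mac in last_order:
            _, position = last_order[mac]
            # The saved PCI address is ignored: only the MAC decides.
            ordered[mac] = (pci, position)
        else:
            unknown.append((mac, pci))
    return ordered, unknown
```

Applied to the last example above, devices `b` and `c` keep positions 2 and 1 even though their PCI addresses changed to `B` and `E`.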
| 204 | + |
| 205 | +### Replacement |
| 206 | + |
| 207 | +However, what happens when the MAC address of an unordered network device cannot |
| 208 | +be found in the last saved order? There are two possible scenarios: |
| 209 | + |
| 210 | +1. It's a newly added network device since the last ordering. |
| 211 | +2. It's a new device that replaces an existing network device. |
| 212 | + |
| 213 | +Replacement is a supported scenario, as an administrator might replace a broken |
| 214 | +network device with a new one. |
| 215 | + |
| 216 | +This can be recognized by comparing the PCI address where the network device is |
| 217 | +located. Therefore, the PCI address of each network device should also be saved |
| 218 | +in the order. In this case, searching the PCI address in the order results in |
| 219 | +one of the following: |
| 220 | + |
| 221 | +1. Not found: This means the PCI address was not occupied during the last |
| 222 | +ordering, indicating a newly added network device. |
| 223 | +2. Found with a MAC address, but another device with this MAC address is still |
| 224 | +present in the system: This suggests that the PCI address of an existing |
| 225 | +network device (with the same MAC address) has changed since the last ordering. |
| 226 | +This may be caused by a device move or by something else, such as a firmware |
| 227 | +update. In this case, the current unordered network device is considered newly added. |
| 228 | + |
| 229 | +``` |
| 230 | +last order: a:A:0 c:C:1 b:D:2 |
| 231 | +current input: a:A b:B c:C d:D |
| 232 | +new order: a:A:0 c:C:1 b:B:2 d:D:3 |
| 233 | +``` |
| 234 | + |
| 235 | +3. Found with a MAC address, and no current devices have this MAC address: This |
| 236 | +indicates that a new network device has replaced the old one in the same PCI |
| 237 | +slot. |
| 238 | +The replacing network device should be assigned the same position as the |
| 239 | +replaced one. |
| 240 | + |
| 241 | +``` |
| 242 | +last order: a:A:0 c:C:1 b:D:2 |
| 243 | +current input: a:A c:C d:D |
| 244 | +new order: a:A:0 c:C:1 d:D:2 |
| 245 | +``` |
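The three outcomes of the PCI address search can be sketched in Python as below. This is a minimal illustration under the assumption that at most one saved device occupies a given PCI address (so multinic functions are out of scope here); the name `replaced_position` is hypothetical.

```python
def replaced_position(last_order, current, device):
    """Return the position to reuse if `device` replaces an old device.

    last_order: {mac: (pci, position)} from the previous ordering.
    current:    [(mac, pci)] for all devices present now.
    device:     (mac, pci) of a device whose MAC is not in last_order.
    Returns None when the device should be treated as newly added.
    """
    _, pci = device
    current_macs = {mac for mac, _ in current}
    for old_mac, (old_pci, position) in last_order.items():
        if old_pci != pci:
            continue
        if old_mac in current_macs:
            # Case 2: the old device still exists elsewhere, so this
            # device is new rather than a replacement.
            return None
        # Case 3: the old device is gone, so this device replaces it.
        return position
    # Case 1: the PCI address was not occupied in the last ordering.
    return None
```

With the two examples above, device `d:D` is newly added while `b` is still present, and takes over position 2 once `b` has disappeared.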
| 246 | + |
| 247 | +### Removed devices |
| 248 | + |
| 249 | +A network device can be removed or unplugged since the last ordering. Its |
| 250 | +position, MAC address, and PCI address are saved for future reference, and its |
| 251 | +position will be reserved. This means there may be a gap in the order: a |
| 252 | +position that was previously assigned to a network device is now vacant because |
| 253 | +the device has been removed. |
| 254 | + |
| 255 | +``` |
| 256 | +last order: a:A:0 c:C:1 b:D:2 |
| 257 | +current input: a:A b:D |
| 258 | +new order: a:A:0 !c:C:1 b:D:2 |
| 259 | +``` |
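A possible way to track reserved positions is sketched below in Python. It deliberately ignores replacement, which would have to be checked first, and the names `find_removed` and `render` are hypothetical helpers for this illustration.

```python
def find_removed(last_order, current):
    """Return the saved entries whose MAC address is no longer present.

    last_order: {mac: (pci, position)}; current: [(mac, pci)].
    """
    current_macs = {mac for mac, _ in current}
    return {mac: entry for mac, entry in last_order.items()
            if mac not in current_macs}


def render(present, removed):
    """Render both maps in the mac:pci:position notation, by position."""
    entries = [(pos, f"{mac}:{pci}:{pos}")
               for mac, (pci, pos) in present.items()]
    entries += [(pos, f"!{mac}:{pci}:{pos}")
                for mac, (pci, pos) in removed.items()]
    return " ".join(text for _, text in sorted(entries))
```

When `c` is unplugged, `find_removed` reports it with its saved PCI address and position, and `render` shows the resulting gap as `!c:C:1`.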
| 260 | + |
| 261 | +### Newly added devices |
| 262 | + |
| 263 | +As long as `the assumption` holds, newly added devices since the last ordering |
| 264 | +can be assigned positions consistently across multiple hosts. Newly added |
| 265 | +devices will not be assigned the positions reserved for removed devices. |
| 266 | + |
| 267 | +``` |
| 268 | +last order: a:A:0 !c:C:1 d:D:2 |
| 269 | +current input: a:A d:D e:E |
| 270 | +new order: a:A:0 !c:C:1 d:D:2 e:E:3 |
| 271 | +``` |
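The following Python sketch shows one way to assign positions to newly added devices without touching reserved ones. It assumes that new positions continue after the highest used or reserved position, which matches the example above but is not stated as a rule in this document, and that the new devices arrive already sorted (for instance by their `biosdevname` order). The function name is hypothetical.

```python
def assign_new_positions(used_positions, reserved_positions, new_devices):
    """Give each new device the next position after all known ones.

    used_positions:     positions held by devices that are present.
    reserved_positions: positions kept for removed devices.
    new_devices:        [(mac, pci)] in the order they should be added.
    Returns {mac: (pci, position)} for the new devices only.
    """
    taken = set(used_positions) | set(reserved_positions)
    next_position = max(taken, default=-1) + 1
    assigned = {}
    for mac, pci in new_devices:
        assigned[mac] = (pci, next_position)
        next_position += 1
    return assigned
```

With positions 0 and 2 in use and position 1 reserved, a new device `e:E` receives position 3 rather than filling the gap.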
| 272 | + |
| 273 | +### Removed and then added back |
| 274 | + |
| 275 | +It is a supported scenario for a removed device to be plugged back in, |
| 276 | +regardless of whether it is in the same PCI slot or not. This can be recognized |
| 277 | +by searching for the device in the saved removed devices using its MAC address. |
| 278 | +The reserved position will be reassigned to the device when it is added back. |
| 279 | + |
| 280 | +``` |
| 281 | +last order: a:A:0 !c:C:1 d:D:2 |
| 282 | +current input: a:A c:F d:D e:E |
| 283 | +new order: a:A:0 c:F:1 d:D:2 e:E:3 |
| 284 | +``` |
| 285 | + |
| 286 | +### Multinic functions |
| 287 | + |
| 288 | +The multinic function is a special kind of network device. When this type of |
| 289 | +physical device is plugged into a PCI slot, multiple network devices are |
| 290 | +reported at a single PCI address. Additionally, the number of reported network |
| 291 | +devices may change due to driver updates. |
| 292 | + |
| 293 | +``` |
| 294 | +current input: a:A b:A c:A d:A |
| 295 | +initial order: a:A:0 b:A:1 c:A:2 d:A:3 |
| 296 | +``` |
| 297 | + |
| 298 | +As long as `the assumption` holds, the initial order of these devices can be |
| 299 | +generated automatically and kept stable by using MAC addresses to identify |
| 300 | +individual devices. However, `biosdevname` cannot reliably generate an order for |
| 301 | +all devices reported at one PCI address. For devices located at the same PCI |
| 302 | +address, their MAC addresses are used to generate the initial order. |
| 303 | + |
| 304 | +``` |
| 305 | +last order: a:A:0 b:A:1 c:A:2 d:A:3 m:M:4 n:N:5 |
| 306 | +current input: a:A b:A c:A d:A e:A f:A m:M n:N |
| 307 | +new order: a:A:0 b:A:1 c:A:2 d:A:3 m:M:4 n:N:5 e:A:6 f:A:7 |
| 308 | +``` |
| 309 | + |
| 310 | +For example, suppose `biosdevname` generates an order for a multinic function |
| 311 | +and other non-multinic devices. Within this order, the N devices of the |
| 312 | +multinic function with MAC addresses mac[1], ..., mac[N] are assigned positions |
| 313 | +pos[1], ..., pos[N] correspondingly. `biosdevname` cannot ensure that the device |
| 314 | +with mac[1] is always assigned position pos[1]. Instead, it ensures that the |
| 315 | +entire set of positions pos[1], ..., pos[N] remains stable for the devices of |
| 316 | +the multinic function. Therefore, to ensure the order follows the MAC address |
| 317 | +order, the devices of the multinic function need to be sorted by their MAC |
| 318 | +addresses within the set of positions. |
| 319 | + |
| 320 | +``` |
| 321 | +last order: a:A:0 b:A:1 c:A:2 d:A:3 m:M:4 |
| 322 | +current input: e:A f:A g:A h:A m:M |
| 323 | +new order: e:A:0 f:A:1 g:A:2 h:A:3 m:M:4 |
| 324 | +``` |
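Sorting the devices of a multinic function by MAC address within their set of positions can be sketched in Python as below. The function name `order_multinic` is hypothetical, and the sketch assumes MAC addresses are normalized strings of a fixed format, so that string order matches numeric order.

```python
def order_multinic(macs, positions):
    """Pair the devices of one multinic function with its set of positions.

    macs:      MAC addresses reported at a single PCI address.
    positions: the set of positions available to that function.
    The lowest MAC address gets the lowest position, and so on.
    """
    if len(macs) != len(positions):
        raise ValueError("need exactly one position per device")
    return dict(zip(sorted(macs), sorted(positions)))
```

For the last example above, the devices `e`, `f`, `g`, and `h` take positions 0 to 3 in MAC order, whatever order they were reported in.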
| 325 | + |
| 326 | +Rare cases that cannot be handled automatically |
| 327 | +----------------------------------------------- |
| 328 | + |
| 329 | +In summary, to keep the order stable, the auto-generated order needs to be saved |
| 330 | +for the next ordering. When performing an automatic ordering for the current |
| 331 | +network devices, either the MAC address or the PCI address is used to recognize |
| 332 | +the device that was assigned the same position in the last ordering. If neither |
| 333 | +the MAC address nor the PCI address can be used to find a position from the last |
| 334 | +ordering, the device is considered newly added and is assigned a new position. |
| 335 | + |
| 336 | +However, following this sorting logic, the ordering result may not always be as |
| 337 | +expected. In practice, this can be caused by various rare cases, such as |
| 338 | +switching an existing network device to connect to another network, performing |
| 339 | +firmware updates, changing firmware settings, or plugging/unplugging network |
| 340 | +devices. It is not worth complicating the entire function for these rare cases. |
| 341 | +Instead, the user's configuration described for the initial order can be used to |
| 342 | +handle these rare scenarios. |