Seen in dogfood after creating an instance, stopping it, and trying to start it again. I don't have better repro steps right now.
The sled agent logs on this sled rotated around the time the relevant Propolis was created. The old log shows sled agent telling Nexus that the VMM is starting, but then the log cuts off without a clear indication of why there's no Propolis process in the zone:
// gjc: this is the instance of interest
23:22:49.756Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
instance_id = 52b2d377-d02e-4040-82fd-e3a1c0ce4b5b
state = InstanceRuntimeState { run_state: Starting, sled_id: 585d0736-3f48-45a2-a9c0-a43d20796a96, propolis_id: 091d25a0-35cc-4556-8b14-a6f6d82fe93d, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:105::26]:12400), migration_id: None, propolis_gen: Generation(2), ncpus: InstanceCpuCount(2), memory: ByteCount(8589934592), hostname: "bootme", gen: Generation(11), time_updated: 2023-06-07T23:22:49.756949218Z }
23:22:49.838Z INFO SledAgent (PortManager): Created OPTE port
port = Port { inner: PortInner { name: "opte6", _ip: 172.30.0.5, mac: MacAddr6([168, 64, 37, 248, 158, 101]), slot: 0, vni: Vni { inner: 2783369 }, gateway: Gateway { mac: MacAddr6([168, 64, 37, 255, 119, 119]), ip: 172.30.0.1 }, vnic: "vopte6" } }
23:22:49.891Z INFO SledAgent (InstanceManager): Configuring new Omicron zone: oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d
instance_id = 52b2d377-d02e-4040-82fd-e3a1c0ce4b5b
23:22:49.923Z INFO SledAgent (InstanceManager): Installing Omicron zone: oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d
instance_id = 52b2d377-d02e-4040-82fd-e3a1c0ce4b5b
23:22:52.755Z INFO SledAgent (InstanceManager): Zone booting
instance_id = 52b2d377-d02e-4040-82fd-e3a1c0ce4b5b
zone = oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d
WARNING: Failed to delete OPTE port overlay VNIC while dropping port. The VNIC will not be cleaned up properly, and the xde device itself will not be deleted. Both the VNIC and the xde device must be deleted out of band, and it will not be possible to recreate the xde device until then. Error: DeleteVnicError { name: "vopte6", err: CommandFailure(CommandFailureInfo { command: "/usr/sbin/dladm delete-vnic vopte6", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "dladm: vnic deletion failed: link busy\n" }) }
// gjc: old log cuts off here
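As an aside, the out-of-band cleanup that warning calls for would look roughly like the sketch below. This is not verified on this sled, and the xde removal step depends on which OPTE admin tooling the build ships:
# From the global zone, once "link busy" clears (e.g. after the zone
# holding the VNIC is halted): confirm the orphaned VNIC still exists,
# then delete it with the same command the warning shows failing.
dladm show-vnic vopte6
dladm delete-vnic vopte6
# The underlying xde device must also be removed out of band; the exact
# command depends on the OPTE admin tooling present on this build.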
// gjc: new log begins here
BRM42220014 # looker -f ./oxide-sled-agent\:default.log
23:57:59.026Z INFO SledAgent (dropshot (SledAgent)): accepted connection
local_addr = [fd00:1122:3344:105::1]:12345
remote_addr = [fd00:1122:3344:106::4]:44379
23:57:59.026Z INFO SledAgent (InstanceManager): Mapping virtual NIC to physical host
mapping = SetVirtualNetworkInterfaceHost { virtual_ip: 172.30.0.8, virtual_mac: MacAddr(MacAddr6([168, 64, 37, 252, 135, 30])), physical_host_ip: fd00:1122:3344:108::1, vni: Vni(7796804) }
23:57:59.026Z INFO SledAgent (dropshot (SledAgent)): request completed
local_addr = [fd00:1122:3344:105::1]:12345
method = PUT
remote_addr = [fd00:1122:3344:106::4]:44379
req_id = b1839181-d657-40ac-a432-f880158cc52b
response_code = 204
uri = /v2p/1c626997-d332-49a2-a81e-1c5cfcd3d600
The zone exists:
BRM42220014 # zoneadm list
global
oxz_switch
oxz_ntp
oxz_crucible_oxp_53c0c042-5225-42df-81cf-6dbecf8e2582
oxz_crucible_oxp_3d9dfd38-97b4-41de-b0d0-4d2034f463b7
oxz_crucible_oxp_3a555b82-8caa-4373-8ac7-6a7b1950ec0a
oxz_crucible_oxp_8a51df51-d69c-45e2-9b71-0aa7b4fd58f4
oxz_crucible_oxp_e4d786f7-5735-4426-8c82-8b4604ae5484
oxz_crucible_oxp_a58f6b4f-b1b4-4f09-a72f-7aa17e7bda2e
oxz_crucible_oxp_9ceafe70-961e-438a-91bb-eb6eec9d6318
oxz_crucible_oxp_c24e47d1-0aae-4018-a21e-b0a63787cf9e
oxz_crucible_oxp_074fb480-5745-4a42-a3cb-b1b465cbd57f
oxz_crucible_oxp_50fd51c4-12af-4ac2-bda7-1b245705245a
oxz_propolis-server_2f422cf8-393c-4401-9ce6-820cede5dec5
oxz_propolis-server_f60ad21f-bad6-4d1d-8876-1d0529e86c13
oxz_propolis-server_3e387595-e2f6-4f51-97c9-10f81e93c3c7
oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d
But it has no Propolis service:
root@oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d:~# svcs | grep propolis
root@oxz_propolis-server_091d25a0-35cc-4556-8b14-a6f6d82fe93d:~#
And there are no leftover logs from a prior invocation of the service. It's been several hours since this happened, but at least right now there's plenty of ramdisk space available on the sled:
Filesystem            Size   Used   Available  Capacity  Mounted on
rpool/ROOT/ramdisk    3.81G  1.32G  2.49G      35%       /
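For anyone retracing this, the "no leftover logs" check amounts to something like the following inside the zone (assuming the standard SMF log location; sketch only):
# Inside the affected zone: list all service instances (including
# disabled ones) and check for any SMF log a prior propolis-server
# invocation would have left behind.
svcs -a | grep propolis
ls /var/svc/log/ | grep -i propolis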
Asking the instance to stop while it's in this state currently pushes it into a zombie Destroyed state; that's probably a result of #3260, the fix for which merged today and hasn't been picked up into dogfood yet. (The zone is successfully torn down in this case, so if the instance were instead "Stopped", as it should be once the fix for #3260 is in place, it should be possible to start it again.)
So the things to follow up on here are:
- Why didn't this Propolis start (or, if it did start, why did it fail)?
- If we can't figure out why this Propolis didn't start, how do we make it possible to figure that out?
- Why wasn't sled agent able to detect this condition and mark the instance as Failed? (A rough version of such a check is sketched after this list.)
- What is the appropriate remediation for an instance that's in this state? Is it just to stop the instance?
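On the detection question, a first-cut check run from the global zone might look like this; the zone-name prefix and process name are taken from this report, and it's illustrative rather than a proposed implementation:
# Flag any running propolis zone that has no propolis-server process.
for z in $(zoneadm list | grep '^oxz_propolis-server_'); do
  if ! pgrep -z "$z" -x propolis-server > /dev/null; then
    echo "$z: running zone with no propolis-server process"
  fi
done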