Labels: Sled Agent (Related to the Per-Sled Configuration and Management)
There are many spots in the Sled Agent where we send notifications up to Nexus, such as:
omicron/sled-agent/src/server.rs
Lines 83 to 123 in 17ab9fd
```rust
let notify_nexus = || async {
    info!(
        log,
        "contacting server nexus, registering sled: {}", sled_id
    );
    let role = if is_scrimlet {
        nexus_client::types::SledRole::Scrimlet
    } else {
        nexus_client::types::SledRole::Gimlet
    };
    let nexus_client = lazy_nexus_client
        .get()
        .await
        .map_err(|err| BackoffError::transient(err.to_string()))?;
    nexus_client
        .sled_agent_put(
            &sled_id,
            &nexus_client::types::SledAgentStartupInfo {
                sa_address: sled_address.to_string(),
                role,
            },
        )
        .await
        .map_err(|err| BackoffError::transient(err.to_string()))
};
let log_notification_failure = |err, delay| {
    warn!(
        log,
        "failed to notify nexus about sled agent: {}, will retry in {:?}", err, delay;
    );
};
retry_notify(
    internal_service_policy_with_max(
        std::time::Duration::from_secs(1),
    ),
    notify_nexus,
    log_notification_failure,
)
.await
.expect("Expected an infinite retry loop contacting Nexus");
```
These notification call sites are spread throughout Sled Agent, and each one retries on its own. For unrelated updates, this is no problem. For related updates, however, a retried "earlier update" can arrive after a "later one" and trample over it.
Example problems:
- If we notify Nexus about a new sled as a "Gimlet", and later notify Nexus that the sled has been updated to a "Scrimlet", then the ordering of the two notifications is critical. Nexus must eventually see the sled as a Scrimlet. If a retry loop causes the "see sled as Gimlet" notification to arrive last, we end up in an inconsistent state.
- We can try to notify Nexus about datasets before zpools, and about zpools before the sled itself. This results in a "failure + retry", which is fine, but produces some confusing log messages.
Proposal:
- Create a more broadly usable "notification queue", through which calls to Nexus can be serialized.
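
One possible shape for such a queue is sketched below. It is a minimal illustration, not the Sled Agent implementation: `NotificationQueue`, `enqueue`, `shutdown`, and `demo` are hypothetical names, and it uses a plain `std` worker thread with a fixed retry delay in place of the async `retry_notify` and backoff policy shown above. The key property is that a single worker drains notifications in FIFO order and retries each one until it succeeds before starting the next, so a retried "Gimlet" notification cannot land after a later "Scrimlet" one.

```rust
use std::sync::mpsc::{self, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// A notification is any fallible call; `Err` is treated as transient.
type Notification = Box<dyn FnMut() -> Result<(), String> + Send>;

/// Hypothetical queue: one worker thread delivers notifications in order.
pub struct NotificationQueue {
    tx: Option<Sender<Notification>>,
    worker: Option<JoinHandle<()>>,
}

impl NotificationQueue {
    pub fn new(retry_delay: Duration) -> Self {
        let (tx, rx) = mpsc::channel::<Notification>();
        let worker = thread::spawn(move || {
            // Only one notification is in flight at a time. The next one
            // is not attempted until the current one has succeeded.
            for mut notify in rx {
                while let Err(err) = notify() {
                    eprintln!("notification failed: {err}, retrying in {retry_delay:?}");
                    thread::sleep(retry_delay);
                }
            }
        });
        Self { tx: Some(tx), worker: Some(worker) }
    }

    /// Appends a notification behind everything already queued.
    pub fn enqueue(&self, notify: impl FnMut() -> Result<(), String> + Send + 'static) {
        self.tx
            .as_ref()
            .expect("queue is open")
            .send(Box::new(notify))
            .expect("worker is alive");
    }

    /// Closes the queue and blocks until all queued notifications are delivered.
    pub fn shutdown(mut self) {
        drop(self.tx.take());
        if let Some(worker) = self.worker.take() {
            worker.join().expect("worker panicked");
        }
    }
}

/// Simulates the Gimlet-then-Scrimlet example: the first notification
/// fails twice before succeeding, yet still arrives before the second.
pub fn demo() -> Vec<&'static str> {
    let seen = Arc::new(Mutex::new(Vec::new()));
    let queue = NotificationQueue::new(Duration::from_millis(1));

    let mut failures_left = 2;
    let log = Arc::clone(&seen);
    queue.enqueue(move || {
        if failures_left > 0 {
            failures_left -= 1;
            return Err("nexus unavailable".to_string());
        }
        log.lock().unwrap().push("gimlet");
        Ok(())
    });

    let log = Arc::clone(&seen);
    queue.enqueue(move || {
        log.lock().unwrap().push("scrimlet");
        Ok(())
    });

    queue.shutdown();
    let delivered = seen.lock().unwrap().clone();
    delivered
}
```

The trade-off in this sketch is head-of-line blocking: one notification that never succeeds stalls everything behind it. That may be acceptable for related updates (sled, then zpools, then datasets), while unrelated updates could use separate queues.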