[sled agent] Serialize Nexus notification queue #1917

@smklein

There are many spots in the Sled Agent where we send notifications up to Nexus, such as:

    let notify_nexus = || async {
        info!(
            log,
            "contacting server nexus, registering sled: {}", sled_id
        );
        let role = if is_scrimlet {
            nexus_client::types::SledRole::Scrimlet
        } else {
            nexus_client::types::SledRole::Gimlet
        };
        let nexus_client = lazy_nexus_client
            .get()
            .await
            .map_err(|err| BackoffError::transient(err.to_string()))?;
        nexus_client
            .sled_agent_put(
                &sled_id,
                &nexus_client::types::SledAgentStartupInfo {
                    sa_address: sled_address.to_string(),
                    role,
                },
            )
            .await
            .map_err(|err| BackoffError::transient(err.to_string()))
    };
    let log_notification_failure = |err, delay| {
        warn!(
            log,
            "failed to notify nexus about sled agent: {}, will retry in {:?}", err, delay;
        );
    };
    retry_notify(
        internal_service_policy_with_max(
            std::time::Duration::from_secs(1),
        ),
        notify_nexus,
        log_notification_failure,
    )
    .await
    .expect("Expected an infinite retry loop contacting Nexus");

These notification call sites are spread throughout Sled Agent. For unrelated updates, this is not a problem. For related updates, however, each call site retries independently, so an "earlier update" can arrive after a "later one" and trample over it.

Example problems:

  • If we notify Nexus about a new sled as a "Gimlet", and later notify Nexus that the sled has been updated to a "Scrimlet", then the ordering of the two notifications is critical. Nexus must eventually see the sled as a Scrimlet; if a retry loop causes the "see sled as Gimlet" notification to arrive last, we end up in an inconsistent state.
  • We can try to notify Nexus about datasets before zpools, and about zpools before the sled itself. This results in a "failure + retry", which is fine, but produces some confusing log messages.
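The first problem can be illustrated with a small deterministic model. This is only a sketch: `Notification`, `RETRY_DELAY`, and the tick-based timeline are hypothetical stand-ins, not Sled Agent types, and the local `SledRole` enum merely mirrors the two role names used above. Each notification is assumed to succeed after a fixed number of transient failures.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SledRole {
    Gimlet,
    Scrimlet,
}

/// Hypothetical model of one notification sent from some call site.
#[derive(Debug, Clone, Copy)]
pub struct Notification {
    pub role: SledRole,
    /// Tick at which the call site first tries to contact Nexus.
    pub submitted_at: u32,
    /// Number of transient failures before the request succeeds.
    pub failed_attempts: u32,
}

/// Assumed fixed delay (in ticks) between retries.
pub const RETRY_DELAY: u32 = 1;

/// Every call site retries on its own, so the role Nexus ends up with is
/// whichever notification happens to *arrive* last.
pub fn final_role_independent(notifications: &[Notification]) -> Option<SledRole> {
    let mut arrivals: Vec<(u32, SledRole)> = notifications
        .iter()
        .map(|n| (n.submitted_at + n.failed_attempts * RETRY_DELAY, n.role))
        .collect();
    arrivals.sort_by_key(|(tick, _)| *tick);
    arrivals.last().map(|(_, role)| *role)
}

/// With a serialized queue, a later notification is not attempted until the
/// earlier one has succeeded, so the last *submitted* notification wins.
pub fn final_role_serialized(notifications: &[Notification]) -> Option<SledRole> {
    let mut ordered = notifications.to_vec();
    ordered.sort_by_key(|n| n.submitted_at);
    ordered.last().map(|n| n.role)
}

/// The Gimlet notification is submitted first but fails three times; the
/// Scrimlet notification is submitted later and succeeds immediately.
pub fn gimlet_then_scrimlet() -> [Notification; 2] {
    [
        Notification { role: SledRole::Gimlet, submitted_at: 0, failed_attempts: 3 },
        Notification { role: SledRole::Scrimlet, submitted_at: 1, failed_attempts: 0 },
    ]
}
```

In this model, `final_role_independent(&gimlet_then_scrimlet())` yields the stale `Gimlet` role, while `final_role_serialized` yields `Scrimlet`.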

Proposal:

  • Create a more broadly usable "notification queue", through which calls to Nexus can be serialized.
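One possible shape for such a queue is sketched below. It is an assumption-laden illustration, not a proposed implementation: `NotificationQueue`, `enqueue`, `shutdown`, and `demo` are hypothetical names, and it uses a standard-library thread and channel to stay self-contained, whereas Sled Agent is async and would more likely use a task together with its existing `retry_notify` backoff. The key property is that a single worker retries each notification until it succeeds before starting the next.

```rust
use std::sync::mpsc::{self, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// A queued call to Nexus: Ok on success, Err on a transient failure.
type Notification = Box<dyn FnMut() -> Result<(), String> + Send>;

/// Hypothetical serialized queue: one worker, FIFO order, retry until success.
pub struct NotificationQueue {
    tx: Sender<Notification>,
    worker: JoinHandle<()>,
}

impl NotificationQueue {
    pub fn new(retry_delay: Duration) -> Self {
        let (tx, rx) = mpsc::channel::<Notification>();
        let worker = thread::spawn(move || {
            // The receiver yields notifications in the order they were sent
            // and stops once every sender has been dropped.
            for mut notify in rx {
                // Retry this notification until it succeeds; nothing queued
                // behind it is attempted in the meantime.
                while notify().is_err() {
                    thread::sleep(retry_delay);
                }
            }
        });
        Self { tx, worker }
    }

    /// Add a notification to the back of the queue.
    pub fn enqueue<F>(&self, notify: F)
    where
        F: FnMut() -> Result<(), String> + Send + 'static,
    {
        self.tx.send(Box::new(notify)).expect("worker thread exited");
    }

    /// Close the queue and wait for all pending notifications to be delivered.
    pub fn shutdown(self) {
        let Self { tx, worker } = self;
        drop(tx);
        worker.join().expect("worker thread panicked");
    }
}

/// Usage sketch: the "gimlet" notification fails twice before succeeding,
/// yet it is still delivered before the later "scrimlet" notification.
pub fn demo() -> Vec<&'static str> {
    let seen = Arc::new(Mutex::new(Vec::new()));
    let queue = NotificationQueue::new(Duration::from_millis(1));

    let log = Arc::clone(&seen);
    let mut attempts = 0;
    queue.enqueue(move || {
        attempts += 1;
        if attempts < 3 {
            return Err("nexus unavailable".to_string());
        }
        log.lock().unwrap().push("gimlet");
        Ok(())
    });

    let log = Arc::clone(&seen);
    queue.enqueue(move || {
        log.lock().unwrap().push("scrimlet");
        Ok(())
    });

    queue.shutdown();
    let delivered = seen.lock().unwrap().clone();
    delivered
}
```

A single FIFO worker also addresses the second problem above if the sled, zpool, and dataset notifications are enqueued in dependency order. The trade-off is head-of-line blocking: one notification that keeps failing delays everything queued behind it.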

Labels

Sled Agent: Related to the Per-Sled Configuration and Management
