[sled agent] Serialize Nexus notification queue

There are many places in Sled Agent where we send notifications up to Nexus, such as:

https://github.com/oxidecomputer/omicron/blob/17ab9fdc69022ebfc8f72e85ddc4da1e658b6d1b/sled-agent/src/server.rs#L83-L123

These notification call sites are spread around Sled Agent, and each one retries independently. For unrelated updates, this is no problem. However, for related updates, a retried "earlier update" can arrive after a "later one" and trample over it.

Example problems:
- If we notify Nexus about a new sled as a "Gimlet", and later notify Nexus that the sled has been updated to a "Scrimlet", then the ordering of the two notifications is critical. Nexus must eventually see the sled as a Scrimlet; if a retry loop causes the "sled is a Gimlet" notification to arrive last, we end up in an inconsistent state.
- We can try to notify Nexus about datasets before zpools, and about zpools before the sled itself. This results in a "failure + retry", which is fine, but produces some confusing log messages.
 
Proposal:
- Create a more broadly usable "notification queue" through which calls to Nexus are serialized.

	let notify_nexus = || async {
	    info!(
	        log,
	        "contacting server nexus, registering sled: {}", sled_id
	    );
	    let role = if is_scrimlet {
	        nexus_client::types::SledRole::Scrimlet
	    } else {
	        nexus_client::types::SledRole::Gimlet
	    };

	    let nexus_client = lazy_nexus_client
	        .get()
	        .await
	        .map_err(|err| BackoffError::transient(err.to_string()))?;
	    nexus_client
	        .sled_agent_put(
	            &sled_id,
	            &nexus_client::types::SledAgentStartupInfo {
	                sa_address: sled_address.to_string(),
	                role,
	            },
	        )
	        .await
	        .map_err(|err| BackoffError::transient(err.to_string()))
	};
	let log_notification_failure = |err, delay| {
	    warn!(
	        log,
	        "failed to notify nexus about sled agent: {}, will retry in {:?}", err, delay;
	    );
	};
	retry_notify(
	    internal_service_policy_with_max(
	        std::time::Duration::from_secs(1),
	    ),
	    notify_nexus,
	    log_notification_failure,
	)
	.await
	.expect("Expected an infinite retry loop contacting Nexus");
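
To make the proposal concrete, here is a minimal sketch of what a serialized notification queue could look like. It is not an implementation for Sled Agent: `NotificationQueue`, `Notification`, and the local `SledRole` enum are hypothetical names invented for illustration, the `send` closure stands in for the real Nexus client call, and a fixed retry delay stands in for the backoff policy. It uses only the standard library (a thread plus `std::sync::mpsc`) so that it is self-contained; a real version would presumably be an async task.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for `nexus_client::types::SledRole`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SledRole {
    Gimlet,
    Scrimlet,
}

/// Hypothetical set of updates that Sled Agent sends up to Nexus.
#[derive(Clone, Debug, PartialEq)]
pub enum Notification {
    Sled { role: SledRole },
    Zpool { id: u32 },
    Dataset { id: u32 },
}

/// A single worker thread drains the queue, so notifications reach
/// Nexus strictly in the order in which they were enqueued.
pub struct NotificationQueue {
    tx: Sender<Notification>,
    worker: JoinHandle<()>,
}

impl NotificationQueue {
    /// `send` is an assumed placeholder for the actual request to Nexus;
    /// returning `Err` means "transient failure, try again".
    pub fn new<F>(mut send: F, retry_delay: Duration) -> Self
    where
        F: FnMut(&Notification) -> Result<(), String> + Send + 'static,
    {
        let (tx, rx) = mpsc::channel::<Notification>();
        let worker = thread::spawn(move || {
            // The channel yields messages in FIFO order.
            for notification in rx {
                // Retry the current notification until it succeeds before
                // looking at the next one, so a retried earlier update can
                // never land after a later one.
                while send(&notification).is_err() {
                    thread::sleep(retry_delay);
                }
            }
        });
        Self { tx, worker }
    }

    /// Enqueue a notification without waiting for it to be delivered.
    pub fn enqueue(&self, notification: Notification) {
        // Sending fails only if the worker thread has already exited.
        self.tx.send(notification).expect("notification worker exited");
    }

    /// Close the queue and block until every pending notification is sent.
    pub fn shutdown(self) {
        drop(self.tx);
        self.worker.join().expect("notification worker panicked");
    }
}
```

With a queue like this, the Gimlet and Scrimlet registrations from the first example would both go through `enqueue`, and the Scrimlet update could not be overtaken by a retry of the Gimlet one. Enqueueing the sled before its zpools, and zpools before their datasets, would also avoid the "failure + retry" log noise, as long as call sites enqueue in that order. The trade-off is head-of-line blocking: one notification that keeps failing delays everything behind it.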

[sled agent] Serialize Nexus notification queue #1917
