Skip to content

Conversation

@labbott
Copy link
Contributor

@labbott labbott commented Oct 2, 2025

No description provided.

@morlandi7 morlandi7 linked an issue Oct 13, 2025 that may be closed by this pull request
@labbott labbott force-pushed the reconfigurator_cosmo branch from 164893c to ee4ae51 Compare October 15, 2025 16:48
Copy link
Contributor

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my comment! I have a few questions.

Has this been tested on a real cosmo yet? Is it even possible at this point?

Do the SP, RoT, RoT Bootloader, Host OS take roughly the same time to reset on a Cosmo as they do on a Gimlet? We have several timeouts scattered throughout the reconfigurator code based on reset times on a gimlet.

Here we give each component a very generous amount to finish its post update actions. In the simplest of cases like the SP, we just reset the device, but others like the RoT Bootloader and the Host OS have several steps.

// Timeouts, timeouts: always wrong!
//
// We have to pick some maximum time we're willing to wait for the `post_update`
// hook to complete. In general this hook is responsible for resetting the
// updated target to cause it to boot into its new version, but the details vary
// wildly by device type (e.g., post-update RoT bootloader actions require
// multiple RoT resets) and the expected amount of time also varies wildly
// (e.g., resetting a gimlet SP takes a few seconds, resetting a sidecar SP can
// take 10s of seconds, resetting a sled after a host OS update takes minutes).
fn post_update_timeout(update: &PendingMgsUpdate) -> Duration {
match &update.details {
PendingMgsUpdateDetails::Sp { .. } => {
// We're resetting an SP; use a generous timeout for sleds and power
// shelf controllers (which should take a few seconds) and an even
// more generous timeout for switches (which we've seen take 10-20
// seconds in practice).
match update.sp_type {
SpType::Sled | SpType::Power => Duration::from_secs(60),
SpType::Switch => Duration::from_secs(120),
}
}
PendingMgsUpdateDetails::Rot { .. } => {
// Resetting the RoT should be quick (a few seconds), but we wait
// for boot info after the reset.
Duration::from_secs(90)
}
PendingMgsUpdateDetails::RotBootloader { .. } => {
// Resetting the bootloader requires multiple RoT resets; give this
// a longer timeout.
Duration::from_secs(210)
}
PendingMgsUpdateDetails::HostPhase1(..) => {
// Resetting a sled takes several minutes (mostly DRAM training);
// give something very generous here to wait for it to come back.
Duration::from_secs(30 * 60)
}
}
}

Here we poll the device every 10 seconds to see if it has been reset.

if try_reset {
// We retry this until the component update has been successfully
// updated, or we get some error *other* than a communication error or
// some other transient error. There is intentionally no timeout here.
// If we've staged an update but not managed to reset the device,
// there's no point where we'd want to stop trying to do so.
while let Err(error) =
update_helper.post_update(log, &mut mgs_clients, update).await
{
if error.is_fatal() {
let error = InlineErrorChain::new(&error);
error!(log, "post_update failed"; &error);
return Err(ApplyUpdateError::SpComponentResetFailed(
error.to_string(),
));
}
// We only care whether the update has completed. We ignore all
// pre-check errors because they could all be transient if a reset
// is in the process of happening.
if let Ok(PrecheckStatus::UpdateComplete) =
update_helper.precheck(log, &mut mgs_clients, update).await
{
break;
}
tokio::time::sleep(RESET_DELAY_INTERVAL).await;
}
}

For the RoT and RoT bootloader we wait a specific amount of time for boot info to show up

pub const WAIT_FOR_BOOT_INFO_TIMEOUT: Duration = Duration::from_secs(120);
const WAIT_FOR_BOOT_INFO_INTERVAL: Duration = Duration::from_secs(10);

/// Poll the RoT asking for its boot information. This confirms that the RoT has
/// been succesfully reset
pub async fn wait_for_boot_info(
log: &Logger,
mgs_clients: &mut MgsClients,
sp_type: SpType,
sp_slot: u16,
timeout: Duration,
) -> Result<RotState, PostUpdateError> {
let before = Instant::now();
loop {
debug!(log, "waiting for boot info to confirm a successful reset");
match mgs_clients
.try_all_serially(log, |mgs_client| async move {
mgs_client
.sp_rot_boot_info(
&sp_type,
sp_slot,
SpComponent::ROT.const_as_str(),
&GetRotBootInfoParams {
version: RotBootInfo::HIGHEST_KNOWN_VERSION,
},
)
.await
})
.await
{
Ok(state) => match state.clone() {
RotState::V2 { .. } | RotState::V3 { .. } => {
debug!(log, "successfuly retrieved boot info");
return Ok(state.into_inner());
}
// The RoT is probably still booting
RotState::CommunicationFailed { message } => {
if before.elapsed() >= timeout {
error!(
log,
"failed to get RoT boot info";
"error" => %message
);
return Err(PostUpdateError::FatalError {
error: message,
});
}
info!(
log,
"failed getting RoT boot info (will retry)";
"error" => %message,
);
tokio::time::sleep(WAIT_FOR_BOOT_INFO_INTERVAL).await;
}
},
// The RoT might still be booting
Err(error) => {
let e = InlineErrorChain::new(&error);
if before.elapsed() >= timeout {
error!(
log,
"failed to get RoT boot info";
&e,
);
return Err(PostUpdateError::FatalError {
error: e.to_string(),
});
}
info!(
log,
"failed getting RoT boot info (will retry)";
e,
);
tokio::time::sleep(WAIT_FOR_BOOT_INFO_INTERVAL).await;
}
}
}
}

/// Poll the RoT asking for its boot information. This is used to check
/// the state for RoT bootloader image errors after RoT is reset
async fn wait_for_stage0_next_image_check(
log: &Logger,
mgs_clients: &mut MgsClients,
sp_type: SpType,
sp_slot: u16,
timeout: Duration,
) -> Result<Option<RotImageError>, PostUpdateError> {
debug!(log, "attempting to verify image validity");
match wait_for_boot_info(log, mgs_clients, sp_type, sp_slot, timeout).await
{
Ok(state) => match state {
// The minimum we will ever return is v3.
// Additionally, V2 does not report image errors, so we cannot
// know with certainty if a signature check came back with errors
RotState::V2 { .. } => {
let error = "unexpected RoT version: 2".to_string();
error!(
log,
"failed to get RoT boot info";
"error" => &error
);
return Err(PostUpdateError::FatalError { error });
}
RotState::V3 { stage0next_error, .. } => {
debug!(log, "successfully completed an image signature check");
return Ok(stage0next_error);
}
// This is unreachable because wait_for_boot_info loops for some
// time if it encounters `CommunicationFailed`, and if it hits the
// timeout, it will return an error.
RotState::CommunicationFailed { message } => {
error!(
log,
"failed to get RoT boot info";
"error" => %message
);
return Err(PostUpdateError::FatalError { error: message });
}
},
Err(error) => return Err(error),
}
}

// We wait for boot info to ensure a successful reset
wait_for_boot_info(
log,
mgs_clients,
update.sp_type,
update.slot_id,
WAIT_FOR_BOOT_INFO_TIMEOUT,
)
.await?;

Sorry for the late review btw! I was out for a couple of weeks

Comment on lines 968 to 1111
// We assume a valid model ID for sleds
model: match sp_id.type_ {
SpType::Sled => GIMLET_SLED_MODEL.to_string(),
_ => format!("dummy_{}", sp_id.type_),
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not too complex, it would be really nice to have some tests using a cosmo (unit or reconfigurator-cli ones).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the simulated tests to add a single cosmo sled

"913-0000023" => Some(Self::Cosmo),
"913-0000019" | "913-0000006" => Some(Self::Gimlet),
COSMO_SLED_MODEL => Some(Self::Cosmo),
GIMLET_SLED_MODEL | "913-0000006" => Some(Self::Gimlet),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realise this was already here, but is 913-0000006 just a Gimlet ID that is used for test or legacy devices? Or is this used elsewhere? Do you mind adding a comment or so we know where this ID comes from, and where it is used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, I actually think that's incorrect now that I'm double checking it.

@dancrossnyc When you were working on #9016 where did you grab "913-0000006" from? At least according to pilot that's a sidecar serial https://github.com/oxidecomputer/pilot/blob/main/common/src/sp.rs#L135

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(@labbott and I discussed this in chat; just recapping here.)

I cannot recall; I thought I saw it in an RFD, but that must be mistaken. My sleuthing uncovered the Duro/Terra is the source of truth here, and it's not registered in OANA. I must have cribbed it from somewhere, and I believe Pilot over my own recollection, so this should change.

@labbott
Copy link
Contributor Author

labbott commented Nov 4, 2025

Thanks for addressing my comment! I have a few questions.

Has this been tested on a real cosmo yet? Is it even possible at this point?

Unfortunately we've been blocked on cosmo device ID. Without a device ID, we can't connect via sprockets which means we can't run RSS which means we can't run any actual services to run reconfigurator. This is also why I've led this PR sit here.

Do the SP, RoT, RoT Bootloader, Host OS take roughly the same time to reset on a Cosmo as they do on a Gimlet? We have several timeouts scattered throughout the reconfigurator code based on reset times on a gimlet.

This is a slightly tricky question to answer. We know that we've seen Cosmo take more time than Gimlet for some things but we also haven't had a chance to finish profiling to see if this is expected or not. I think this will be something that will become more obvious as we actually start testing on actual hardware (thank you for the pointers on where to check!)

@jclulow
Copy link
Collaborator

jclulow commented Nov 6, 2025

Unfortunately we've been blocked on cosmo device ID. Without a device ID, we can't connect via sprockets which means we can't run RSS which means we can't run any actual services to run reconfigurator. This is also why I've led this PR sit here.

NB: I have just completed signing on the two Cosmos in question. They're now pending reinstallation into london; see oxidecomputer/meta#776!

@labbott labbott force-pushed the reconfigurator_cosmo branch from 9c98062 to a3eef8a Compare November 7, 2025 15:48
@labbott
Copy link
Contributor Author

labbott commented Nov 10, 2025

   15 2025-11-10T15:47:34.879Z b379914c-3d5d-4ecb-8c0b-d7ccabdc473c  enabled: update Sled 15 (BRM13250012) host phase 1 to 18.0.0-0.ci+gitca25fe74cf3
   16 2025-11-10T15:52:49.391Z d5ca6377-bbfc-45e0-9c04-a9a535ca63bf  enabled: update Sled 15 (BRM13250012) host phase 1 to 18.0.0-0.ci+gitca25fe74cf3
   17 2025-11-10T16:15:07.444Z 2754d860-0664-4ae2-ac52-d85a806759b0  enabled: 
   18 2025-11-10T16:15:59.222Z ee854b4e-c2cb-4868-9855-35453ad2f401  enabled: update Sled 17 (BRM22250001) host phase 1 to 18.0.0-0.ci+gitca25fe74cf3
   19 2025-11-10T16:21:00.640Z dccf1c5f-8fd5-47f1-96c5-6434de78e951  enabled: update Sled 17 (BRM22250001) host phase 1 to 18.0.0-0.ci+gitca25fe74cf3

Attempt on a live system seems to work! Going to see about cleaning up the tests a bit more.

Copy link
Contributor

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So excited to see this is working on a real system! I have a few questions regarding the tests, but otherwise this is looking great, thanks.

Comment on lines +1199 to 1200
.gimlet_host_phase_1_artifacts(
ARTIFACT_HASH_HOST_PHASE_1_V1,
ARTIFACT_HASH_HOST_PHASE_1_V1,
)
.cosmo_host_phase_1_artifacts(
ARTIFACT_HASH_HOST_PHASE_1_V1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question about these artifact hashes being the same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one where I think it's okay to be the same because these are used as an intermediate step to indicate an upgrade needs to be happen.

/// Hash of a fake "version 1" artifact for host OS phase 1
///     
/// This can be used to produce an inventory collection for a host slot that
/// needs an update.

pub(super) const ARTIFACT_HASH_GIMLET_HOST_PHASE_1: ArtifactHash =
ArtifactHash([29; 32]);
/// Hash of fake artifact for host OS phase 1 (for a fake version 1.5)
pub(super) const ARTIFACT_HASH_HOST_PHASE_1_V1_5: ArtifactHash =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: For consistency, it'd be nice to have this and any other hash that is gimlet-specific, say it in the name like the other constants that were changed.

Comment on lines +1402 to +1429
} else if kind == ArtifactKind::GIMLET_ROT_IMAGE_A
&& sled_type == Some(OxideSled::Cosmo)
{
ARTIFACT_HASH_ROT_COSMO_A
} else if kind == ArtifactKind::GIMLET_ROT_IMAGE_B
&& sled_type == Some(OxideSled::Cosmo)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these not be ArtifactKind::COSMO_ROT_IMAGE_*?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where some of the weirdness with cosmo comes in. We are treating cosmos as just another subset of gimlet as far as the SP/RoT are concerned so using AritfactKind::GIMLET_ROT_A is correct

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! That's interesting. Do you mind writing a comment explaining that?

Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just a couple minor style nits

assert_eq!(first_update.artifact_hash, ARTIFACT_HASH_HOST_PHASE_1);
assert_eq!(
first_update.artifact_hash,
ARTIFACT_HASH_COSMO_HOST_PHASE_1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the second update in this test pick up the Cosmo hash instead of Gimlet?

Nevermind - found the change to TestBoards that makes the second sled a Cosmo

use typed_rng::TypedUuidRng;
let unique = unique.unwrap_or_else(|| hardware_slot.to_string());
let model = format!("model{}", unique);
let model = GIMLET_SLED_MODEL.to_string();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably worth extending reconfigurator-cli to be able to specify the model of a sled when adding it, right? (Definitely not as part of this PR and not particularly urgent - happy to file an issue for someone to pick up at some point.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#9390 filed this

}

impl<'a> TestBoardCollectionBuilder<'a> {
#[allow(clippy::too_many_arguments)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor style nit - I think I'm inclined to go with clippy on this function, particularly since most of the arguments are the same type (so the caller has to just know the correct order). Maybe this warrants something like a

fn new(boards: &'a TestBoards, description: TestBoardCollectionBuilderDescription)

where description contains named fields for all these Artifact* arguments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and forth. This new is a private function so I figured it was okay to only have to do it a limited number of places (exactly one in the current code). If it's bothersome I'll do a refactor.

@labbott labbott force-pushed the reconfigurator_cosmo branch from 8dfdf58 to 1986a96 Compare November 12, 2025 15:15
@labbott labbott merged commit e02810b into main Nov 12, 2025
18 checks passed
@labbott labbott deleted the reconfigurator_cosmo branch November 12, 2025 20:17
sunshowers added a commit that referenced this pull request Nov 13, 2025
Fix a semantic merge conflict between #9134 and #9372.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Successfully update a Cosmo via reconfigurator

6 participants