-
Notifications
You must be signed in to change notification settings - Fork 63
Add cosmo support to reconfigurator #9134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
164893c to
ee4ae51
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing my comment! I have a few questions.
Has this been tested on a real cosmo yet? Is it even possible at this point?
Do the SP, RoT, RoT Bootloader, Host OS take roughly the same time to reset on a Cosmo as they do on a Gimlet? We have several timeouts scattered throughout the reconfigurator code based on reset times on a gimlet.
Here we give each component a very generous amount to finish its post update actions. In the simplest of cases like the SP, we just reset the device, but others like the RoT Bootloader and the Host OS have several steps.
omicron/nexus/mgs-updates/src/driver_update.rs
Lines 645 to 682 in 81d8226
| // Timeouts, timeouts: always wrong! | |
| // | |
| // We have to pick some maximum time we're willing to wait for the `post_update` | |
| // hook to complete. In general this hook is responsible for resetting the | |
| // updated target to cause it to boot into its new version, but the details vary | |
| // wildly by device type (e.g., post-update RoT bootloader actions require | |
| // multiple RoT resets) and the expected amount of time also varies wildly | |
| // (e.g., resetting a gimlet SP takes a few seconds, resetting a sidecar SP can | |
| // take 10s of seconds, resetting a sled after a host OS update takes minutes). | |
| fn post_update_timeout(update: &PendingMgsUpdate) -> Duration { | |
| match &update.details { | |
| PendingMgsUpdateDetails::Sp { .. } => { | |
| // We're resetting an SP; use a generous timeout for sleds and power | |
| // shelf controllers (which should take a few seconds) and an even | |
| // more generous timeout for switches (which we've seen take 10-20 | |
| // seconds in practice). | |
| match update.sp_type { | |
| SpType::Sled | SpType::Power => Duration::from_secs(60), | |
| SpType::Switch => Duration::from_secs(120), | |
| } | |
| } | |
| PendingMgsUpdateDetails::Rot { .. } => { | |
| // Resetting the RoT should be quick (a few seconds), but we wait | |
| // for boot info after the reset. | |
| Duration::from_secs(90) | |
| } | |
| PendingMgsUpdateDetails::RotBootloader { .. } => { | |
| // Resetting the bootloader requires multiple RoT resets; give this | |
| // a longer timeout. | |
| Duration::from_secs(210) | |
| } | |
| PendingMgsUpdateDetails::HostPhase1(..) => { | |
| // Resetting a sled takes several minutes (mostly DRAM training); | |
| // give something very generous here to wait for it to come back. | |
| Duration::from_secs(30 * 60) | |
| } | |
| } | |
| } |
Here we poll the device every 10 seconds to see if it has been reset.
omicron/nexus/mgs-updates/src/driver_update.rs
Lines 415 to 443 in 81d8226
| if try_reset { | |
| // We retry this until the component update has been successfully | |
| // updated, or we get some error *other* than a communication error or | |
| // some other transient error. There is intentionally no timeout here. | |
| // If we've staged an update but not managed to reset the device, | |
| // there's no point where we'd want to stop trying to do so. | |
| while let Err(error) = | |
| update_helper.post_update(log, &mut mgs_clients, update).await | |
| { | |
| if error.is_fatal() { | |
| let error = InlineErrorChain::new(&error); | |
| error!(log, "post_update failed"; &error); | |
| return Err(ApplyUpdateError::SpComponentResetFailed( | |
| error.to_string(), | |
| )); | |
| } | |
| // We only care whether the update has completed. We ignore all | |
| // pre-check errors because they could all be transient if a reset | |
| // is in the process of happening. | |
| if let Ok(PrecheckStatus::UpdateComplete) = | |
| update_helper.precheck(log, &mut mgs_clients, update).await | |
| { | |
| break; | |
| } | |
| tokio::time::sleep(RESET_DELAY_INTERVAL).await; | |
| } | |
| } |
For the RoT and RoT bootloader we wait a specific amount of time for boot info to show up
omicron/nexus/mgs-updates/src/rot_updater.rs
Lines 30 to 32 in 81d8226
| pub const WAIT_FOR_BOOT_INFO_TIMEOUT: Duration = Duration::from_secs(120); | |
| const WAIT_FOR_BOOT_INFO_INTERVAL: Duration = Duration::from_secs(10); |
omicron/nexus/mgs-updates/src/rot_updater.rs
Lines 268 to 344 in 81d8226
| /// Poll the RoT asking for its boot information. This confirms that the RoT has | |
| /// been succesfully reset | |
| pub async fn wait_for_boot_info( | |
| log: &Logger, | |
| mgs_clients: &mut MgsClients, | |
| sp_type: SpType, | |
| sp_slot: u16, | |
| timeout: Duration, | |
| ) -> Result<RotState, PostUpdateError> { | |
| let before = Instant::now(); | |
| loop { | |
| debug!(log, "waiting for boot info to confirm a successful reset"); | |
| match mgs_clients | |
| .try_all_serially(log, |mgs_client| async move { | |
| mgs_client | |
| .sp_rot_boot_info( | |
| &sp_type, | |
| sp_slot, | |
| SpComponent::ROT.const_as_str(), | |
| &GetRotBootInfoParams { | |
| version: RotBootInfo::HIGHEST_KNOWN_VERSION, | |
| }, | |
| ) | |
| .await | |
| }) | |
| .await | |
| { | |
| Ok(state) => match state.clone() { | |
| RotState::V2 { .. } | RotState::V3 { .. } => { | |
| debug!(log, "successfuly retrieved boot info"); | |
| return Ok(state.into_inner()); | |
| } | |
| // The RoT is probably still booting | |
| RotState::CommunicationFailed { message } => { | |
| if before.elapsed() >= timeout { | |
| error!( | |
| log, | |
| "failed to get RoT boot info"; | |
| "error" => %message | |
| ); | |
| return Err(PostUpdateError::FatalError { | |
| error: message, | |
| }); | |
| } | |
| info!( | |
| log, | |
| "failed getting RoT boot info (will retry)"; | |
| "error" => %message, | |
| ); | |
| tokio::time::sleep(WAIT_FOR_BOOT_INFO_INTERVAL).await; | |
| } | |
| }, | |
| // The RoT might still be booting | |
| Err(error) => { | |
| let e = InlineErrorChain::new(&error); | |
| if before.elapsed() >= timeout { | |
| error!( | |
| log, | |
| "failed to get RoT boot info"; | |
| &e, | |
| ); | |
| return Err(PostUpdateError::FatalError { | |
| error: e.to_string(), | |
| }); | |
| } | |
| info!( | |
| log, | |
| "failed getting RoT boot info (will retry)"; | |
| e, | |
| ); | |
| tokio::time::sleep(WAIT_FOR_BOOT_INFO_INTERVAL).await; | |
| } | |
| } | |
| } | |
| } |
omicron/nexus/mgs-updates/src/rot_bootloader_updater.rs
Lines 284 to 327 in 81d8226
| /// Poll the RoT asking for its boot information. This is used to check | |
| /// the state for RoT bootloader image errors after RoT is reset | |
| async fn wait_for_stage0_next_image_check( | |
| log: &Logger, | |
| mgs_clients: &mut MgsClients, | |
| sp_type: SpType, | |
| sp_slot: u16, | |
| timeout: Duration, | |
| ) -> Result<Option<RotImageError>, PostUpdateError> { | |
| debug!(log, "attempting to verify image validity"); | |
| match wait_for_boot_info(log, mgs_clients, sp_type, sp_slot, timeout).await | |
| { | |
| Ok(state) => match state { | |
| // The minimum we will ever return is v3. | |
| // Additionally, V2 does not report image errors, so we cannot | |
| // know with certainty if a signature check came back with errors | |
| RotState::V2 { .. } => { | |
| let error = "unexpected RoT version: 2".to_string(); | |
| error!( | |
| log, | |
| "failed to get RoT boot info"; | |
| "error" => &error | |
| ); | |
| return Err(PostUpdateError::FatalError { error }); | |
| } | |
| RotState::V3 { stage0next_error, .. } => { | |
| debug!(log, "successfully completed an image signature check"); | |
| return Ok(stage0next_error); | |
| } | |
| // This is unreachable because wait_for_boot_info loops for some | |
| // time if it encounters `CommunicationFailed`, and if it hits the | |
| // timeout, it will return an error. | |
| RotState::CommunicationFailed { message } => { | |
| error!( | |
| log, | |
| "failed to get RoT boot info"; | |
| "error" => %message | |
| ); | |
| return Err(PostUpdateError::FatalError { error: message }); | |
| } | |
| }, | |
| Err(error) => return Err(error), | |
| } | |
| } |
omicron/nexus/mgs-updates/src/rot_bootloader_updater.rs
Lines 269 to 277 in 81d8226
| // We wait for boot info to ensure a successful reset | |
| wait_for_boot_info( | |
| log, | |
| mgs_clients, | |
| update.sp_type, | |
| update.slot_id, | |
| WAIT_FOR_BOOT_INFO_TIMEOUT, | |
| ) | |
| .await?; |
Sorry for the late review btw! I was out for a couple of weeks
| // We assume a valid model ID for sleds | ||
| model: match sp_id.type_ { | ||
| SpType::Sled => GIMLET_SLED_MODEL.to_string(), | ||
| _ => format!("dummy_{}", sp_id.type_), | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not too complex, it would be really nice to have some tests using a cosmo (unit or reconfigurator-cli ones).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the simulated tests to add a single cosmo sled
sled-hardware/types/src/lib.rs
Outdated
| "913-0000023" => Some(Self::Cosmo), | ||
| "913-0000019" | "913-0000006" => Some(Self::Gimlet), | ||
| COSMO_SLED_MODEL => Some(Self::Cosmo), | ||
| GIMLET_SLED_MODEL | "913-0000006" => Some(Self::Gimlet), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realise this was already here, but is 913-0000006 just a Gimlet ID that is used for test or legacy devices? Or is this used elsewhere? Do you mind adding a comment or so we know where this ID comes from, and where it is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm, I actually think that's incorrect now that I'm double checking it.
@dancrossnyc When you were working on #9016 where did you grab "913-0000006" from? At least according to pilot that's a sidecar serial https://github.com/oxidecomputer/pilot/blob/main/common/src/sp.rs#L135
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(@labbott and I discussed this in chat; just recapping here.)
I cannot recall; I thought I saw it in an RFD, but that must be mistaken. My sleuthing uncovered the Duro/Terra is the source of truth here, and it's not registered in OANA. I must have cribbed it from somewhere, and I believe Pilot over my own recollection, so this should change.
Unfortunately we've been blocked on cosmo device ID. Without a device ID, we can't connect via sprockets which means we can't run RSS which means we can't run any actual services to run reconfigurator. This is also why I've led this PR sit here.
This is a slightly tricky question to answer. We know that we've seen Cosmo take more time than Gimlet for some things but we also haven't had a chance to finish profiling to see if this is expected or not. I think this will be something that will become more obvious as we actually start testing on actual hardware (thank you for the pointers on where to check!) |
NB: I have just completed signing on the two Cosmos in question. They're now pending reinstallation into london; see oxidecomputer/meta#776! |
9c98062 to
a3eef8a
Compare
Attempt on a live system seems to work! Going to see about cleaning up the tests a bit more. |
karencfv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So excited to see this is working on a real system! I have a few questions regarding the tests, but otherwise this is looking great, thanks.
| .gimlet_host_phase_1_artifacts( | ||
| ARTIFACT_HASH_HOST_PHASE_1_V1, | ||
| ARTIFACT_HASH_HOST_PHASE_1_V1, | ||
| ) | ||
| .cosmo_host_phase_1_artifacts( | ||
| ARTIFACT_HASH_HOST_PHASE_1_V1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question about these artifact hashes being the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one where I think it's okay to be the same because these are used as an intermediate step to indicate an upgrade needs to be happen.
/// Hash of a fake "version 1" artifact for host OS phase 1
///
/// This can be used to produce an inventory collection for a host slot that
/// needs an update.
| pub(super) const ARTIFACT_HASH_GIMLET_HOST_PHASE_1: ArtifactHash = | ||
| ArtifactHash([29; 32]); | ||
| /// Hash of fake artifact for host OS phase 1 (for a fake version 1.5) | ||
| pub(super) const ARTIFACT_HASH_HOST_PHASE_1_V1_5: ArtifactHash = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: For consistency, it'd be nice to have this and any other hash that is gimlet-specific, say it in the name like the other constants that were changed.
| } else if kind == ArtifactKind::GIMLET_ROT_IMAGE_A | ||
| && sled_type == Some(OxideSled::Cosmo) | ||
| { | ||
| ARTIFACT_HASH_ROT_COSMO_A | ||
| } else if kind == ArtifactKind::GIMLET_ROT_IMAGE_B | ||
| && sled_type == Some(OxideSled::Cosmo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these not be ArtifactKind::COSMO_ROT_IMAGE_*?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where some of the weirdness with cosmo comes in. We are treating cosmos as just another subset of gimlet as far as the SP/RoT are concerned so using AritfactKind::GIMLET_ROT_A is correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! That's interesting. Do you mind writing a comment explaining that?
jgallagher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, just a couple minor style nits
| assert_eq!(first_update.artifact_hash, ARTIFACT_HASH_HOST_PHASE_1); | ||
| assert_eq!( | ||
| first_update.artifact_hash, | ||
| ARTIFACT_HASH_COSMO_HOST_PHASE_1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the second update in this test pick up the Cosmo hash instead of Gimlet?
Nevermind - found the change to TestBoards that makes the second sled a Cosmo
| use typed_rng::TypedUuidRng; | ||
| let unique = unique.unwrap_or_else(|| hardware_slot.to_string()); | ||
| let model = format!("model{}", unique); | ||
| let model = GIMLET_SLED_MODEL.to_string(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably worth extending reconfigurator-cli to be able to specify the model of a sled when adding it, right? (Definitely not as part of this PR and not particularly urgent - happy to file an issue for someone to pick up at some point.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#9390 filed this
| } | ||
|
|
||
| impl<'a> TestBoardCollectionBuilder<'a> { | ||
| #[allow(clippy::too_many_arguments)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor style nit - I think I'm inclined to go with clippy on this function, particularly since most of the arguments are the same type (so the caller has to just know the correct order). Maybe this warrants something like a
fn new(boards: &'a TestBoards, description: TestBoardCollectionBuilderDescription)where description contains named fields for all these Artifact* arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back and forth. This new is a private function so I figured it was okay to only have to do it a limited number of places (exactly one in the current code). If it's bothersome I'll do a refactor.
8dfdf58 to
1986a96
Compare
No description provided.