Skip to content

Conversation

sunshowers
Copy link
Contributor

@sunshowers sunshowers commented May 20, 2025

Included in this PR:

  • Add validation for zone images
  • Write tests for success and failure cases
  • Extend the test_installinator_fetch wicketd integration test to also ensure installinator + sled-agent have a coherent view of the world

The last point forced me to expose a bunch of the status via a public interface. We'll be able to reuse this information (though probably not in this full form) and send it up as part of the inventory.

Depends on:

Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
@sunshowers sunshowers changed the title [9/n] [wip] [sled-agent] validate zone images as written in mupdate-override dataset [10/n] [sled-agent] validate zone images as written in mupdate-override dataset May 29, 2025
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1

[skip ci]
@sunshowers sunshowers marked this pull request as ready for review May 29, 2025 01:14
Created using spr 1.3.6-beta.1
fn from_directory(dir: &Utf8Path) -> Result<Self> {
let control_plane_dir = dir.join("zones");
// The install dataset goes into a directory called "install".
let control_plane_dir = dir.join("install");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this just wrong before? (Hopefully only used by tests??)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it was only used by tests. Actually hooking sled-agent's zone logic to installinator found the issue.

pub boot_disk_override:
Result<Option<MupdateOverrideInfo>, MupdateOverrideReadError>,

/// Status of the non-boot disks. This results in warnings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit / question - this only results in warnings in bad cases, right? The typical case is MatchesPresent which would not emit warnings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, updated the doc to make this clearer.

)?;
}
ArtifactReadResult::Error(error) => {
writeln!(f, " {}: error ({})", artifact.file_name, error)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this print the whole error chain since ArtifactReadResult itself isn't an Error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly -- in practice there isn't going to be a source, but I've added InlineErrorChain::new just in case there is one in the future.

let artifacts =
MupdateOverrideArtifactsResult::new(dataset_dir, data);
if artifacts.is_valid() {
// If there are errors, return them as appropriate.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment in the wrong branch? If artifacts.is_valid(), there are no errors right?

Copy link
Contributor Author

@sunshowers sunshowers Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Though I ended up redoing most of this in the latest update since I had to split up the zone manifest and mupdate override into separate files.

Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Comment on lines 207 to 224
#[derive(Clone, Debug, PartialEq)]
pub enum ZoneManifestNonBootResult {
/// The manifest is present and matches the value on the boot disk.
/// Information about individual zone hashes matching is stored in the
/// `ZoneManifestArtifactsResult`.
Matches(ZoneManifestArtifactsResult),

/// The manifest is absent -- this is an error case because it is expected
/// to be present.
NotFound,

/// A mismatch between the boot disk and the other disk was detected.
Mismatch(ZoneManifestNonBootMismatch),

/// An error occurred while reading the mupdate override info on this disk.
ReadError(InstallMetadataReadError),
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many of these types are duplicated with some minor modifications -- deliberately so, since there are subtle differences (e.g. the underlying install_dataset_metadata doesn't error out if the file is missing, but here we do).

Created using spr 1.3.6-beta.1
@jgallagher jgallagher self-requested a review June 6, 2025 12:08
@sunshowers sunshowers changed the title [10/n] [sled-agent] validate zone images as written in mupdate-override dataset [10/n] [sled-agent] validate zone images as written in zone manifest Jun 9, 2025
@sunshowers sunshowers changed the base branch from sunshowers/spr/main.9n-wip-sled-agent-validate-zone-images-as-written-in-mupdate-override-dataset to main June 10, 2025 00:32
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Comment on lines +38 to +39
/// Testing aid.
impl PartialEq for ArcSerdeJsonError {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to put this behind #[cfg(test)]? Maybe doesn't matter, since I wouldn't expect real code to be comparing errors anyway...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be ok for now, but in an upcoming PR I move this into a more central location and it would become more annoying to manage.

manifest,
)),
Ok(None) => {
// The file is missing -- this is an error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed this some in chat - I think this will break a couple of manufacturing / dev workflows, because they produce sleds that start up without installinator having run (and therefore without having had a chance to get a zones.json file). We could change the workflows to write out a zones.json (with some amount of work / logistical pain, I think), but it also seems okay for us to hash the zones we find ourselves if there's no zones.json present at all.

Sorry I didn't think about this the first time around :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, working on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the back and forth on this - the new synthesized manifest LGTM.

]
.into_iter()
.collect(),
mupdate_id: MupdateUuid::new_v4(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is causing the new test failures, maybe? The display impl now includes this ID, which is different each time the test runs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I get for writing code at midnight. Thanks.

Created using spr 1.3.6-beta.1
@sunshowers sunshowers enabled auto-merge (squash) June 18, 2025 17:42
@sunshowers sunshowers merged commit 5465ccf into main Jun 18, 2025
17 checks passed
@sunshowers sunshowers deleted the sunshowers/spr/9n-wip-sled-agent-validate-zone-images-as-written-in-mupdate-override-dataset branch June 18, 2025 19:05
sunshowers added a commit that referenced this pull request Jun 18, 2025
* Move `ZoneImageFileSource` to illumos-utils and use it in the zone
image builder.
* Move default file name logic to sled-agent-zone-images, since it cares
about that in the context of mupdate override logic.

Depends on:

* #8190
* everything before that
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants