Skip to content

Conversation

@smklein
Copy link
Collaborator

@smklein smklein commented Nov 4, 2022

  • Adds hardware module to Sled Agent for minimally monitoring devinfo output. This will certainly evolve, but this PR includes a "bare minimum" real example of tracking the tofino driver.
  • Stop relying on rss-config.toml to decide if we're running on a scrimlet. Instead...
    • (General case) Rely on monitoring hardware
    • (Testing, manual case) Provide a force-scrimlet option to make the sled agent assume that it is a scrimlet

Fixes https://github.com/oxidecomputer/minimum-upgradable-product/issues/19
Part of #1917
Part of #823
Pre-requisite for https://github.com/oxidecomputer/minimum-upgradable-product/issues/16
Pre-requisite for https://github.com/oxidecomputer/minimum-upgradable-product/issues/18

@smklein smklein added the Sled Agent Related to the Per-Sled Configuration and Management label Nov 4, 2022

// Scan the existing system for noteworthy events
// that may have happened before we started monitoring
if self.inner.hardware.is_scrimlet() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finding an attached tofino driver tells us that we're running on a scrimlet, but it doesn't mean we have all the resources needed to launch a switch zone. Specifically, we need the tfpkt0 link to exist. Without that, tfportd can't create tfports, so mgs and maghemite can't run. In practice, I would expect that link to exist long before sled-agent runs, but it's at least a theoretical concern.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed this in #1933 , when actually launching the switch zone

Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sean this looks good! Mostly minor comments from me. Based on @davepacheco's comments and my rethinking about the nexus queue in #1917, I'm not sure we want to stick with that strategy. However, I'm fine merging this as is so we can start using it and revisit that strategy in follow ups with whatever we decide upon.

/// as illumos systems.
pub struct Hardware {
log: Logger,
inner: Mutex<HardwareInner>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this Mutex actually needed? It looks like scrimlet config is only set during initialization.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admittedly, this isn't really needed. I was trying to emulate the OPTE usage here: https://github.com/oxidecomputer/omicron/tree/main/sled-agent/src/opte

... But in reality, the Sled Agent only support the "sim" version on non-illumos platforms. If we try running the "real" sled agent outside illumos, tons of stuff will break (access to zones, dladm, ipadm, etc).

I've decided to keep most of this module as stubs, so that editor support (like rust analyzer) will still work on platforms like Linux, but it'll hopefully be more obvious that this code should not be callable when we start approaching the "real illumos" interfaces. (done in 4863667)

Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks for the cleanup and comments.

@smklein smklein merged commit 6fa27a3 into main Nov 10, 2022
@smklein smklein deleted the scrimlet-autodetect branch November 10, 2022 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Sled Agent Related to the Per-Sled Configuration and Management

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants