
Conversation

@jgallagher (Contributor) commented Jun 8, 2022

Removes sprockets proxies, fixing #1161.

The messages sent inside the sprockets session are [u32-length-prefix | u32-version-number | json].
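As an illustration of the framing described above, here is a minimal sketch of encoding and decoding a `[u32-length-prefix | u32-version-number | json]` message. The function names and the choice of big-endian byte order are assumptions for the example, not the PR's actual API:

```rust
// Sketch of the [length | version | json] framing; names and byte order
// are illustrative, not taken from the PR.
use std::convert::TryInto;

/// Frame a JSON payload: 4-byte length prefix, 4-byte version, then the body.
/// The length prefix covers the version word plus the JSON body.
fn frame_message(version: u32, json: &[u8]) -> Vec<u8> {
    let len = (4 + json.len()) as u32;
    let mut buf = Vec::with_capacity(8 + json.len());
    buf.extend_from_slice(&len.to_be_bytes());
    buf.extend_from_slice(&version.to_be_bytes());
    buf.extend_from_slice(json);
    buf
}

/// Split a frame back into (version, json body), or None if malformed.
fn parse_message(buf: &[u8]) -> Option<(u32, &[u8])> {
    if buf.len() < 8 {
        return None;
    }
    let len = u32::from_be_bytes(buf[0..4].try_into().ok()?) as usize;
    // The bytes after the 4-byte prefix must match the advertised length.
    if buf.len() - 4 != len {
        return None;
    }
    let version = u32::from_be_bytes(buf[4..8].try_into().ok()?);
    Some((version, &buf[8..]))
}
```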

This PR does not replace the trust quorum client/server that performs share collection to rebuild the rack secret; that will come in a followup PR. The intent is that request_share (currently a placeholder, marked #[allow(dead_code)] as of this PR) will become not-a-placeholder, and it will be served inside the sprockets session instead of shares being fetched via SPDM.

Minor notes on testing:

  • Testing on a single sled without a simulated SP requires #1172 (Correct key names in config-rss.toml), and the logs indicate that no SP is available (and therefore none of the communications happen inside sprockets):
[2022-06-08T15:23:54.144568273Z]  INFO: SledAgent/BootstrapAgentServer/8063 on helios: Accepted connection (remote_addr=[fdb0:5254:f:4458::1]:36354)
[2022-06-08T15:23:54.145589403Z]  INFO: SledAgent/BootstrapAgentRssHandler/8063 on helios: No SP available; proceeding without sprockets auth (BootstrapAgentClient=[fdb0:5254:f:4458::1]:12346)
[2022-06-08T15:23:54.150164324Z]  INFO: SledAgent/BootstrapAgentServer/8063 on helios: No SP available; proceeding without sprockets auth (remote_addr=[fdb0:5254:f:4458::1]:36354)
[2022-06-08T15:23:54.153453961Z]  INFO: SledAgent/BootstrapAgent/8063 on helios: Loading Sled Agent: SledAgentRequest { subnet: Ipv6Subnet { net: Ipv6Net(Ipv6Network { addr: fd00:1122:3344:101::, prefix: 64 }) } } (server=fb0f7546-4d46-40ca-9d56-cbb810684ca7)
[2022-06-08T15:23:54.154737795Z]  INFO: SledAgent/8063 on helios: setting up sled agent server

(The double logs are because we're seeing both the client and server note that no sprockets are in play.)

  • Testing on multiple sleds requires (at least for me) reverting #1066 ([sled-agent] Allocate VNICs over etherstubs, fix inter-zone routing) and the minor related changes that came later, because bootstrap peer discovery is currently busted on main. After doing so, and after using thing-flinger to set up simulated SPs, we can see sprockets sessions negotiated. I'm currently logging the serial number of the peer because logging the entire cert is very noisy, but I'm sure that will change over time:

sled 0 (running RSS, simulated SP serial 1000...)

[2022-06-08T15:48:04.105546585Z]  INFO: SledAgent/RSS/1032 on helios1: Plan written to storage
[2022-06-08T15:48:04.105935659Z]  INFO: SledAgent/BootstrapAgentRssHandler/1032 on helios1: Received initialization request from RSS (target_sled=[fdb0:5254:8a:8c1c::1]:12346)
    request: SledAgentRequest { subnet: Ipv6Subnet { net: Ipv6Net(Ipv6Network { addr: fd00:1122:3344:102::, prefix: 64 }) } }
[2022-06-08T15:48:04.10644305Z]  INFO: SledAgent/BootstrapAgentRssHandler/1032 on helios1: Received initialization request from RSS (target_sled=[fdb0:5254:e4:589f::1]:12346)
    request: SledAgentRequest { subnet: Ipv6Subnet { net: Ipv6Net(Ipv6Network { addr: fd00:1122:3344:101::, prefix: 64 }) } }
[2022-06-08T15:48:04.106933454Z]  INFO: SledAgent/BootstrapAgentRssHandler/1032 on helios1: SP available; establishing sprockets session (BootstrapAgentClient=[fdb0:5254:8a:8c1c::1]:12346)
[2022-06-08T15:48:04.107368339Z]  INFO: SledAgent/BootstrapAgentServer/1032 on helios1: Accepted connection (remote_addr=[fdb0:5254:8a:8c1c::1]:48023)
[2022-06-08T15:48:04.107858463Z]  INFO: SledAgent/BootstrapAgentServer/1032 on helios1: SP available; establishing sprockets session (remote_addr=[fdb0:5254:8a:8c1c::1]:48023)
[2022-06-08T15:48:04.108294888Z]  INFO: SledAgent/BootstrapAgentRssHandler/1032 on helios1: SP available; establishing sprockets session (BootstrapAgentClient=[fdb0:5254:e4:589f::1]:12346)
[2022-06-08T15:48:04.108736952Z]  INFO: SledAgent/BootstrapAgentRssHandler/1032 on helios1: Negotiated sprockets session (BootstrapAgentClient=[fdb0:5254:8a:8c1c::1]:12346)
    peer_serial_number: SerialNumber([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
[2022-06-08T15:48:04.109228546Z]  INFO: SledAgent/BootstrapAgentServer/1032 on helios1: Negotiated sprockets session (remote_addr=[fdb0:5254:8a:8c1c::1]:48023)
    peer_serial_number: SerialNumber([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
[2022-06-08T15:48:04.109759482Z]  INFO: SledAgent/BootstrapAgent/1032 on helios1: Loading Sled Agent: SledAgentRequest { subnet: Ipv6Subnet { net: Ipv6Net(Ipv6Network { addr: fd00:1122:3344:102::, prefix: 64 }) } } (server=fb0f7546-4d46-40ca-9d56-cbb810684ca7)
[2022-06-08T15:48:04.110356616Z]  INFO: SledAgent/1032 on helios1: setting up sled agent server
[2022-06-08T15:48:04.111208528Z]  INFO: SledAgent/BootstrapAgentRssHandler/1032 on helios1: Negotiated sprockets session (BootstrapAgentClient=[fdb0:5254:e4:589f::1]:12346)
    peer_serial_number: SerialNumber([2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

sled 1 (simulated SP serial 2000...)

[2022-06-08T15:48:04.096270806Z]  INFO: SledAgent/BootstrapAgentServer/1939 on helios2: Accepted connection (remote_addr=[fdb0:5254:8a:8c1c::1]:59450)
[2022-06-08T15:48:04.097623645Z]  INFO: SledAgent/BootstrapAgentServer/1939 on helios2: SP available; establishing sprockets session (remote_addr=[fdb0:5254:8a:8c1c::1]:59450)
[2022-06-08T15:48:04.119871769Z]  INFO: SledAgent/BootstrapAgentServer/1939 on helios2: Negotiated sprockets session (remote_addr=[fdb0:5254:8a:8c1c::1]:59450)
    peer_serial_number: SerialNumber([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
[2022-06-08T15:48:04.120954244Z]  INFO: SledAgent/BootstrapAgent/1939 on helios2: Loading Sled Agent: SledAgentRequest { subnet: Ipv6Subnet { net: Ipv6Net(Ipv6Network { addr: fd00:1122:3344:101::, prefix: 64 }) } } (server=fb0f7546-4d46-40ca-9d56-cbb810684ca7)

@andrewjstone (Contributor) left a comment

Looks good. It's awesome how easy an async sprockets session is to use when it implements AsyncRead/AsyncWrite. Really happy we made that change!

#[derive(Clone, Copy, Debug, Serialize_repr, Deserialize_repr, PartialEq)]
#[repr(u32)]
pub enum Version {
    V1 = 1,
Contributor:

I'd suggest not using an enum here with such a large repr. The problem is that if we end up with many versions, each one has to be kept in this enum even after our code no longer supports it; otherwise, deserialization fails when we talk to an unsupported version, and we can't report which version the peer wanted. It's probably easier to just use a u32 or a newtype.

The other option

Contributor Author:

I'm happy to switch this to a plain u32, although it looks like your comment got cut off partway through. What's the other option? 😂

Contributor:

haha, probably left over from something I started typing and thought I deleted. There are probably other options, but u32 is my suggestion :)

Contributor Author:

switched in e076dd6
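For illustration, a plain-u32 version check along the lines discussed in this thread might look like the sketch below. The constant name and error shape are hypothetical; the point is that a bare integer always deserializes, so an unsupported version can be named in the error, which a closed enum can't do:

```rust
// Hypothetical version check using a plain u32 instead of an enum.
const CURRENT_VERSION: u32 = 1;

fn check_version(received: u32) -> Result<(), String> {
    if received == CURRENT_VERSION {
        Ok(())
    } else {
        // Because `received` is just a u32, we can report any version the
        // peer sends, even ones this build has never heard of.
        Err(format!(
            "unsupported version {} (expected {})",
            received, CURRENT_VERSION
        ))
    }
}
```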

.await
.map_err(Error::WriteLengthPrefix)?;
stream.write_all(&buf).await.map_err(Error::WriteRequest)?;
stream.flush().await.map_err(Error::FlushRequest)?;
Contributor:

Nice flush 😆

// allocating based on the length prefix they send, so it should be fine to
// be a little sloppy here and just pick something far larger than we ever
// expect to see.
const MAX_REQUEST_LEN: u32 = 128 << 20;
Contributor:

Maybe share this between client and server?

Contributor Author:

I quasi-intentionally picked different sizes here, although maybe that was being too clever. Requests will certainly be larger than responses in general (especially as the init request grows to include the list of devices in the quorum), but in practice we don't expect this bound to ever be hit in either case. I have a slight preference for keeping them separate only because it allows them to be local to their respective functions, but if you'd rather I share one I'm fine with that.

Contributor:

Oh, I didn't even notice they were different. I'm fine with this as is. Carry on!
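To show how a bound like MAX_REQUEST_LEN comes into play, here is a sketch of a length-checked read using a synchronous std::io::Read (the PR itself is async; the function name and error handling here are illustrative):

```rust
// Illustrative server-side read path: reject oversized length prefixes
// before allocating. The real code is async; this uses std::io for brevity.
use std::io::{self, Read};

const MAX_REQUEST_LEN: u32 = 128 << 20;

fn read_request<R: Read>(stream: &mut R) -> io::Result<Vec<u8>> {
    // Read the 4-byte big-endian length prefix.
    let mut prefix = [0u8; 4];
    stream.read_exact(&mut prefix)?;
    let len = u32::from_be_bytes(prefix);

    // Check the bound before allocating anything based on peer input.
    if len > MAX_REQUEST_LEN {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("request length {} exceeds max {}", len, MAX_REQUEST_LEN),
        ));
    }

    let mut buf = vec![0u8; len as usize];
    stream.read_exact(&mut buf)?;
    Ok(buf)
}
```

Since the check happens before the allocation, a malicious or buggy peer advertising a huge length costs the server nothing but an error.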

@jgallagher force-pushed the bootstrap-sprockets-raw branch from ae75789 to e076dd6 on June 8, 2022 20:20
@jgallagher enabled auto-merge (squash) on June 8, 2022 20:46
@jgallagher merged commit 4a3815b into main on June 8, 2022
@jgallagher deleted the bootstrap-sprockets-raw branch on June 8, 2022 21:18