Nexus-driven SP update: Check version before resetting? #9136

@jgallagher

(This is spawned out of #9133.) In the SP update's post_update hook, we unconditionally send a reset. At a higher level, post_update is called in a potentially-infinite loop to deal with transient communication problems. Should we change post_update to first check the version of the target SP? It's possible we thought a previous reset attempt failed with a transient error when the SP actually did receive and act on the request. (This is exactly what happened in #9133.)

Given the APIs we have today, this would inherently be a TOCTOU, similar to other flavors of Nexus <-> SP TOCTOU issues described in oxidecomputer/hubris#2178, but I think it would be a pretty harmless one? If we checked the version, got "old", and the device then reset for some other reason before we reset it again, the result is one extra reset, and this should correct itself the next time through.
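
To make the proposal concrete, here is a minimal sketch of a check-before-reset `post_update`, driven by a retry loop. Everything in it is an assumption for illustration: the `SpClient` trait, `SpError`, `PostUpdateOutcome`, `FakeSp`, and the version strings are hypothetical and do not reflect the real Nexus or MGS APIs. The retry loop is also bounded here so the sketch terminates, whereas the real caller may loop indefinitely.

```rust
/// Hypothetical error type for this sketch: a retryable communication failure.
#[derive(Debug, PartialEq)]
pub enum SpError {
    Transient(String),
}

/// Hypothetical, minimal view of an SP (not the real MGS client API).
pub trait SpClient {
    fn active_version(&mut self) -> Result<String, SpError>;
    fn reset(&mut self) -> Result<(), SpError>;
}

#[derive(Debug, PartialEq)]
pub enum PostUpdateOutcome {
    /// The SP already reports the target version, so no reset was sent.
    AlreadyAtTarget,
    /// The SP reported some other version, so a reset was sent.
    ResetSent,
}

/// Check the SP's version first and only reset if it is not the target.
/// This is still a TOCTOU: the SP may reset between the check and our reset,
/// in which case we send one redundant reset.
pub fn post_update<C: SpClient>(
    sp: &mut C,
    target: &str,
) -> Result<PostUpdateOutcome, SpError> {
    if sp.active_version()? == target {
        return Ok(PostUpdateOutcome::AlreadyAtTarget);
    }
    sp.reset()?;
    Ok(PostUpdateOutcome::ResetSent)
}

/// Retry `post_update` on transient errors, up to `max_attempts` times.
pub fn post_update_with_retries<C: SpClient>(
    sp: &mut C,
    target: &str,
    max_attempts: usize,
) -> Option<PostUpdateOutcome> {
    for _ in 0..max_attempts {
        if let Ok(outcome) = post_update(sp, target) {
            return Some(outcome);
        }
    }
    None
}

/// Fake SP that acts on a reset but can lose the reply, so the caller sees a
/// transient error even though the reset took effect.
pub struct FakeSp {
    pub active: String,
    pub pending: String,
    pub resets_performed: usize,
    pub drop_next_reset_reply: bool,
}

impl SpClient for FakeSp {
    fn active_version(&mut self) -> Result<String, SpError> {
        Ok(self.active.clone())
    }

    fn reset(&mut self) -> Result<(), SpError> {
        // The reset itself always happens: the pending image becomes active.
        self.active = self.pending.clone();
        self.resets_performed += 1;
        if self.drop_next_reset_reply {
            self.drop_next_reset_reply = false;
            return Err(SpError::Transient("reply lost".to_string()));
        }
        Ok(())
    }
}
```

With a `FakeSp` whose first reset reply is lost, the first pass returns a transient error, and the second pass sees the target version and returns `AlreadyAtTarget` without resetting again. An unconditional reset would instead have reset the SP a second time on the retry.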
