Description
(This is spawned out of #9133.) In the SP update's post_update hook, we unconditionally send a reset. At a higher level, post_update is called in a potentially-infinite loop to deal with transient communication problems. Should we change post_update to first check the version of the target SP? It's possible we thought a previous reset attempt failed with a transient error when the SP actually did receive and act on the request. (This is exactly what happened in #9133.)
Given the APIs we have today, this would inherently be a TOCTOU race, similar to other flavors of Nexus <-> SP TOCTOU issues described in oxidecomputer/hubris#2178, but I think it would be a pretty harmless one? If we checked the version and got "old", the device then reset for some other reason, and we then reset it again anyway, this should correct itself the next time through the loop.
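
A minimal sketch of the check-then-reset idea, to make the proposal concrete. Everything here is hypothetical: `SpClient`, `TransientError`, `PostUpdateOutcome`, `post_update_until_done`, and `FakeSp` are illustrative names, not the real Nexus or MGS APIs, and the retry loop is bounded only so the example terminates. `FakeSp` models the failure described above, where the SP acts on a reset but the caller sees a transient error.

```rust
/// Hypothetical transient communication failure (assumption, not a real type).
#[derive(Debug)]
struct TransientError;

/// Hypothetical stand-in for however the caller talks to an SP.
trait SpClient {
    fn active_version(&mut self) -> Result<String, TransientError>;
    fn reset(&mut self) -> Result<(), TransientError>;
}

#[derive(Debug, PartialEq)]
enum PostUpdateOutcome {
    /// The SP already reports the target version, so no reset was sent.
    AlreadyAtTarget,
    /// The SP reported some other version and the reset was acknowledged.
    ResetSent,
}

/// Check the version first and only reset if the SP is not at the target.
/// This is still a TOCTOU: the SP may reset between the check and our reset.
fn post_update<C: SpClient>(
    client: &mut C,
    target: &str,
) -> Result<PostUpdateOutcome, TransientError> {
    if client.active_version()? == target {
        return Ok(PostUpdateOutcome::AlreadyAtTarget);
    }
    client.reset()?;
    Ok(PostUpdateOutcome::ResetSent)
}

/// Retry on transient errors. Bounded here only to keep the sketch finite.
fn post_update_until_done<C: SpClient>(
    client: &mut C,
    target: &str,
    max_attempts: usize,
) -> Option<PostUpdateOutcome> {
    for _ in 0..max_attempts {
        if let Ok(outcome) = post_update(client, target) {
            return Some(outcome);
        }
    }
    None
}

/// Fake SP whose reset always takes effect, but whose first reply to a
/// reset can be "lost" so the caller sees a transient error.
struct FakeSp {
    active: String,
    pending: String,
    resets: usize,
    drop_next_reset_reply: bool,
}

impl FakeSp {
    fn new(active: &str, pending: &str, drop_next_reset_reply: bool) -> Self {
        FakeSp {
            active: active.to_string(),
            pending: pending.to_string(),
            resets: 0,
            drop_next_reset_reply,
        }
    }
}

impl SpClient for FakeSp {
    fn active_version(&mut self) -> Result<String, TransientError> {
        Ok(self.active.clone())
    }

    fn reset(&mut self) -> Result<(), TransientError> {
        // The reset happens regardless of whether the reply gets through.
        self.resets += 1;
        self.active = self.pending.clone();
        if self.drop_next_reset_reply {
            self.drop_next_reset_reply = false;
            return Err(TransientError);
        }
        Ok(())
    }
}
```

With the version check in place, a lost reply costs one extra version query on the retry rather than a second reset. Without it, the retry in the same scenario would reset an SP that is already running the target version.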