mupdate/update recovery flow should ensure that all deployment units are at known versions before proceeding with other operations #8726

@sunshowers

As of #8456, the blueprint planner is able to set and clear the remove_mupdate_override field within blueprints. Part of the PR is determining when we have fully recovered from a MUPdate.

Currently, recovery is considered complete only when both of the following conditions hold. If either is false, we still need to do work to recover from a MUPdate:

  1. The target release has been updated since the last time a MUPdate was detected.
  2. The remove_mupdate_override field has been cleared from all sleds in the blueprint.

We need to add a third requirement here: all deployment units on all sleds must be at known versions before we're confident about proceeding with updates. "All deployment units" includes:

  • zone image sources
  • host phase 2 and phase 1 images
  • SP and RoT images
  • (other units?)
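To make the proposed three-part check concrete, here is a minimal sketch of how the conditions could be combined. All type and function names (`UnitVersion`, `UnitKind`, `SledState`, `Pending`, `pending_recovery_work`) are hypothetical and are not taken from the planner code; the real blueprint types will differ, and the list of unit kinds is only what is enumerated above.

```rust
use std::collections::BTreeMap;

/// Hypothetical version state of one deployment unit.
#[derive(Debug, Clone, PartialEq, Eq)]
enum UnitVersion {
    /// Matched against an artifact in a known TUF repo.
    Known(String),
    /// Could not be matched against any known artifact.
    Unknown,
}

/// Hypothetical unit kinds, mirroring the list in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum UnitKind {
    ZoneImage,
    HostPhase1,
    HostPhase2,
    Sp,
    Rot,
}

/// Hypothetical per-sled view of the blueprint.
struct SledState {
    /// Stand-in for the blueprint's remove_mupdate_override field.
    remove_mupdate_override: Option<String>,
    units: BTreeMap<UnitKind, UnitVersion>,
}

/// One reason why MUPdate recovery is not yet complete.
#[derive(Debug, PartialEq, Eq)]
enum Pending {
    TargetReleaseNotUpdated,
    OverrideStillSet { sled: String },
    UnknownVersion { sled: String, unit: UnitKind },
}

/// Returns every outstanding reason; an empty list means all three
/// conditions hold.
fn pending_recovery_work(
    target_release_updated: bool,
    sleds: &BTreeMap<String, SledState>,
) -> Vec<Pending> {
    let mut pending = Vec::new();
    // Condition 1: the target release was updated after the last MUPdate.
    if !target_release_updated {
        pending.push(Pending::TargetReleaseNotUpdated);
    }
    for (sled_id, sled) in sleds {
        // Condition 2: remove_mupdate_override is cleared on every sled.
        if sled.remove_mupdate_override.is_some() {
            pending.push(Pending::OverrideStillSet { sled: sled_id.clone() });
        }
        // Condition 3 (proposed): every deployment unit is at a known version.
        for (kind, version) in &sled.units {
            if *version == UnitVersion::Unknown {
                pending.push(Pending::UnknownVersion {
                    sled: sled_id.clone(),
                    unit: *kind,
                });
            }
        }
    }
    pending
}
```

Returning the list of reasons rather than a single boolean is a design choice in this sketch: the same data could later feed whatever operator-facing report is chosen for these conditions.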

Some questions:

  • Do we want to block zone additions on this condition long-term? We have a chicken switch called add_zones_with_mupdate_override for this, set to true on customer systems for r16, but the current plan is for that switch to be set to false for r17. Is that desired?
  • Currently, the only TUF repo we check deployment unit sources against is the current target release. This makes remediation tedious when sleds are detected at different MUPdate versions: the operator would have to set each MUPdated-to release as the target release, wait for a planner run, and repeat until done.
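As an illustration of the second point, one alternative to checking only the current target release would be to look up each unit's artifact across several known repos in a single pass. The sketch below is an assumption, not existing behavior: `KnownRepos` and `releases_containing` are hypothetical names, and representing a repo as a set of artifact hash strings is a simplification.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical index: release version -> artifact hashes in that TUF repo.
type KnownRepos = BTreeMap<String, BTreeSet<String>>;

/// Returns the versions of every known release that contains the given
/// artifact hash, in sorted order. An empty result means the unit is at
/// an unknown version.
fn releases_containing<'a>(repos: &'a KnownRepos, artifact_hash: &str) -> Vec<&'a str> {
    repos
        .iter()
        .filter(|(_, hashes)| hashes.contains(artifact_hash))
        .map(|(version, _)| version.as_str())
        .collect()
}
```

With an index like this, sleds MUPdated to different releases could each be matched without the operator cycling the target release. Whether the planner should consult more than one repo is exactly the open question here.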

There's also the open question of how we present these conditions and remediation paths to the operator. That doesn't block this issue, but it would probably block r17.
