jgallagher
Contributor

Prior to this PR, any change to the OmicronZoneConfig of an already-running zone resulted in that zone being shut down and restarted. With this PR, we allow a single kind of change that does not restart the zone. If the image_source swaps from InstallDataset to Artifact { hash } (or vice versa) and the hash of this zone in the install dataset exactly matches the Artifact { hash }, we don't need to do anything: we're already running the exact zone as desired, just started from a different place.
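For readers who want the rule above in code form, here is a minimal sketch of the decision. It is not the actual sled-agent implementation: `ZoneConfig`, `Action`, `plan_action`, and the `install_dataset_hash` parameter are hypothetical, simplified stand-ins, and the real `OmicronZoneConfig` has more fields. The sketch only assumes what the description states: any config change restarts the zone, except an `InstallDataset` <-> `Artifact { hash }` swap where the install dataset's hash for this zone matches.

```rust
/// Hypothetical stand-in for a SHA-256 artifact hash.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ArtifactHash([u8; 32]);

/// Simplified image source, mirroring the two variants named in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ImageSource {
    InstallDataset,
    Artifact { hash: ArtifactHash },
}

/// Hypothetical, heavily simplified zone config.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ZoneConfig {
    id: u32,
    zone_type: String,
    image_source: ImageSource,
}

/// What the reconciler should do with an already-running zone.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    NoChange,
    UpdateConfigInPlace,
    Restart,
}

/// `install_dataset_hash` is the hash of this zone's image in the install
/// dataset, if known (an assumption of this sketch; `None` forces a restart).
fn plan_action(
    old: &ZoneConfig,
    new: &ZoneConfig,
    install_dataset_hash: Option<ArtifactHash>,
) -> Action {
    if old == new {
        return Action::NoChange;
    }

    // Check that `image_source` is the only field that differs.
    let mut candidate = old.clone();
    candidate.image_source = new.image_source;
    if candidate != *new {
        return Action::Restart;
    }

    // Only an install dataset <-> artifact swap with a matching hash
    // avoids the restart; everything else bounces the zone.
    match (old.image_source, new.image_source, install_dataset_hash) {
        (ImageSource::InstallDataset, ImageSource::Artifact { hash }, Some(installed))
        | (ImageSource::Artifact { hash }, ImageSource::InstallDataset, Some(installed))
            if hash == installed =>
        {
            Action::UpdateConfigInPlace
        }
        _ => Action::Restart,
    }
}
```

In this sketch an unknown install dataset hash falls through to `Restart`, on the assumption that bouncing the zone is the safe default when the match cannot be proven.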

Fixes #8463.

Makes #8510 slightly worse; yet another ZoneKind::make_a_string() method. (This one is at least guaranteed-by-test to be an extension of an existing one. I don't feel a lot better about that though.)

I'll test this on a racklette before merging; will put notes below once I do. But I think this is contained enough that it can be reviewed before that's done.

@jgallagher
Contributor Author

Testing this on berlin looks good.

Setup:

  • Installed this branch, using mupdate (to ensure we got the zone manifest from installinator)
  • Reconfigured Nexus to allow TUF repo uploads
  • Uploaded both the repo from this branch and a repo built from main

I derived a blueprint and changed the image source for every zone on one sled:

    omicron zones:
    ----------------------------------------------------------------------------------------------------------------------------------------------
    zone type         zone id                                image source                                     disposition   underlay IP
    ----------------------------------------------------------------------------------------------------------------------------------------------
*   cockroach_db      d3dc0053-5868-4099-9792-20daa1987f63   - install dataset                                in service    fd00:1122:3344:104::4
     └─                                                      + artifact: version 15.0.0-0.ci+git4d80d1cbc5a
*   cockroach_db      dd17acd5-4ba0-4a1d-92f1-e68f2d86606e   - install dataset                                in service    fd00:1122:3344:104::3
     └─                                                      + artifact: version 15.0.0-0.ci+git4d80d1cbc5a
*   crucible          11da1e85-ba84-42cb-b42f-f476cccc0577   - install dataset                                in service    fd00:1122:3344:104::c
     └─                                                      + artifact: version 15.0.0-0.ci+git4d80d1cbc5a
*   crucible          2d34ddee-25c5-428c-b560-713bc336d5c6   - install dataset                                in service    fd00:1122:3344:104::10
     └─                                                      + artifact: version 15.0.0-0.ci+git4d80d1cbc5a
*   crucible          395310b6-f69d-4a0a-948c-258ff21fad98   - install dataset                                in service    fd00:1122:3344:104::a
     └─                                                      + artifact: version 15.0.0-0.ci+git4d80d1cbc5a
*   crucible          6a39e344-52a7-4f6d-af50-5ba338d948dd   - install dataset                                in service    fd00:1122:3344:104::b
     └─                                                      + artifact: version 15.0.0-0.ci+git4d80d1cbc5a
*   crucible          88ed03ab-6587-48ec-bd4a-d68c30f98863   - install dataset                                in service    fd00:1122:3344:104::7
     └─                                                      + artifact: version 15.0.0-0.ci+git4d80d1cbc5a
*   crucible          9a656f0b-1555-4c0e-8965-06ea44add08f   - install dataset                                in service    fd00:1122:3344:104::e
     └─                                                      + artifact: version 15.0.0-0.ci+git4d80d1cbc5a
*   crucible          a0b488e3-8863-4d63-84ac-d15660c6b1b7   - install dataset                                in service    fd00:1122:3344:104::f
     └─                                                      + artifact: version 15.0.0-0.ci+git4d80d1cbc5a
*   crucible          a309c32c-4ed9-4f7e-a14d-13871864254e   - install dataset                                in service    fd00:1122:3344:104::d
     └─                                                      + artifact: version 15.0.0-0.ci+git4d80d1cbc5a
*   crucible          b7c91d70-50ff-4891-b456-941f345aae9b   - install dataset                                in service    fd00:1122:3344:104::8
     └─                                                      + artifact: version 15.0.0-0.ci+git4d80d1cbc5a
*   crucible          cdd73351-a66f-4360-a689-db6b92e0919b   - install dataset                                in service    fd00:1122:3344:104::9
     └─                                                      + artifact: version 15.0.0-0.ci+git4d80d1cbc5a
*   crucible_pantry   180b2bbc-89a5-4d2c-b7e0-632d305979bc   - install dataset                                in service    fd00:1122:3344:104::6
     └─                                                      + artifact: version 15.0.0-0.ci+gitd4cf01a9438
*   internal_ntp      bba45af2-3283-402e-91e0-5fcdf8e73d1e   - install dataset                                in service    fd00:1122:3344:104::11
     └─                                                      + artifact: version 15.0.0-0.ci+git4d80d1cbc5a
*   nexus             003d0721-b895-4e3b-aaee-5955b3214af2   - install dataset                                in service    fd00:1122:3344:104::5
     └─                                                      + artifact: version 15.0.0-0.ci+git4d80d1cbc5a

All the zones except crucible_pantry were set to a version from this branch (i.e., should not be bounced); crucible_pantry was set to use an artifact from main (i.e., with a different hash, so should be bounced).

After I made this new blueprint the target, the sled-agent logs showed the expected behavior: most zones had their configs updated in place, and the pantry zone was shut down and restarted:

18:18:56.297Z INFO SledAgent (ConfigReconcilerTask): starting reconciliation due to config change
    file = sled-agent/config-reconciler/src/reconciler_task.rs:309
18:18:56.297Z INFO SledAgent (ConfigReconcilerTask): updating config for zone without restarting; only change is zone image source (install dataset <-> artifact; hash of zone image matches in both)
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:1032
    zone_name = oxz_nexus_003d0721-b895-4e3b-aaee-5955b3214af2
18:18:56.298Z INFO SledAgent (ConfigReconcilerTask): updating config for zone without restarting; only change is zone image source (install dataset <-> artifact; hash of zone image matches in both)
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:1032
    zone_name = oxz_crucible_11da1e85-ba84-42cb-b42f-f476cccc0577
18:18:56.298Z INFO SledAgent (ConfigReconcilerTask): starting shutdown of running zone; config has changed
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:187
    new-config = OmicronZoneConfig { id: 180b2bbc-89a5-4d2c-b7e0-632d305979bc (service), filesystem_pool: Some(External(1d74a7e6-d4a5-4bba-8e0b-ea3c8ced62ad (external_zpool))), zone_type: CruciblePantry { address: [fd00:1122:3344:104::6]:17000 }, image_source: Artifact { hash: ArtifactHash("4a8b4283c848123bf5ac02ff3b8fae969e6c61f5a827a20cc70658e9127b93a8") } }
    old-config = OmicronZoneConfig { id: 180b2bbc-89a5-4d2c-b7e0-632d305979bc (service), filesystem_pool: Some(External(1d74a7e6-d4a5-4bba-8e0b-ea3c8ced62ad (external_zpool))), zone_type: CruciblePantry { address: [fd00:1122:3344:104::6]:17000 }, image_source: InstallDataset }
    zone = oxz_crucible_pantry_180b2bbc-89a5-4d2c-b7e0-632d305979bc
18:18:56.298Z INFO SledAgent (ConfigReconcilerTask): updating config for zone without restarting; only change is zone image source (install dataset <-> artifact; hash of zone image matches in both)
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:1032
    zone_name = oxz_crucible_2d34ddee-25c5-428c-b560-713bc336d5c6
18:18:56.298Z INFO SledAgent (ConfigReconcilerTask): updating config for zone without restarting; only change is zone image source (install dataset <-> artifact; hash of zone image matches in both)
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:1032
    zone_name = oxz_crucible_395310b6-f69d-4a0a-948c-258ff21fad98
18:18:56.298Z INFO SledAgent (ConfigReconcilerTask): updating config for zone without restarting; only change is zone image source (install dataset <-> artifact; hash of zone image matches in both)
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:1032
    zone_name = oxz_crucible_6a39e344-52a7-4f6d-af50-5ba338d948dd
18:18:56.298Z INFO SledAgent (ConfigReconcilerTask): updating config for zone without restarting; only change is zone image source (install dataset <-> artifact; hash of zone image matches in both)
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:1032
    zone_name = oxz_crucible_88ed03ab-6587-48ec-bd4a-d68c30f98863
18:18:56.298Z INFO SledAgent (ConfigReconcilerTask): updating config for zone without restarting; only change is zone image source (install dataset <-> artifact; hash of zone image matches in both)
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:1032
    zone_name = oxz_crucible_9a656f0b-1555-4c0e-8965-06ea44add08f
18:18:56.298Z INFO SledAgent (ConfigReconcilerTask): updating config for zone without restarting; only change is zone image source (install dataset <-> artifact; hash of zone image matches in both)
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:1032
    zone_name = oxz_crucible_a0b488e3-8863-4d63-84ac-d15660c6b1b7
18:18:56.298Z INFO SledAgent (ConfigReconcilerTask): updating config for zone without restarting; only change is zone image source (install dataset <-> artifact; hash of zone image matches in both)
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:1032
    zone_name = oxz_crucible_a309c32c-4ed9-4f7e-a14d-13871864254e
18:18:56.298Z INFO SledAgent (ConfigReconcilerTask): updating config for zone without restarting; only change is zone image source (install dataset <-> artifact; hash of zone image matches in both)
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:1032
    zone_name = oxz_crucible_b7c91d70-50ff-4891-b456-941f345aae9b
18:18:56.298Z INFO SledAgent (ConfigReconcilerTask): updating config for zone without restarting; only change is zone image source (install dataset <-> artifact; hash of zone image matches in both)
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:1032
    zone_name = oxz_ntp_bba45af2-3283-402e-91e0-5fcdf8e73d1e
18:18:56.298Z INFO SledAgent (ConfigReconcilerTask): updating config for zone without restarting; only change is zone image source (install dataset <-> artifact; hash of zone image matches in both)
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:1032
    zone_name = oxz_crucible_cdd73351-a66f-4360-a689-db6b92e0919b
18:18:56.298Z INFO SledAgent (ConfigReconcilerTask): updating config for zone without restarting; only change is zone image source (install dataset <-> artifact; hash of zone image matches in both)
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:1032
    zone_name = oxz_cockroachdb_d3dc0053-5868-4099-9792-20daa1987f63
18:18:56.298Z INFO SledAgent (ConfigReconcilerTask): updating config for zone without restarting; only change is zone image source (install dataset <-> artifact; hash of zone image matches in both)
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:1032
    zone_name = oxz_cockroachdb_dd17acd5-4ba0-4a1d-92f1-e68f2d86606e
18:18:56.298Z INFO SledAgent (ConfigReconcilerTask): shutting down running zone
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:589
    zone = oxz_crucible_pantry_180b2bbc-89a5-4d2c-b7e0-632d305979bc
18:18:59.234Z INFO SledAgent (ConfigReconcilerTask): halt_and_remove_logged: Previous zone state: Running
    file = illumos-utils/src/zone.rs:461
    zone = oxz_crucible_pantry_180b2bbc-89a5-4d2c-b7e0-632d305979bc
18:18:59.324Z INFO SledAgent (ConfigReconcilerTask): starting new zone
    config = OmicronZoneConfig { id: 180b2bbc-89a5-4d2c-b7e0-632d305979bc (service), filesystem_pool: Some(External(1d74a7e6-d4a5-4bba-8e0b-ea3c8ced62ad (external_zpool))), zone_type: CruciblePantry { address: [fd00:1122:3344:104::6]:17000 }, image_source: Artifact { hash: ArtifactHash("4a8b4283c848123bf5ac02ff3b8fae969e6c61f5a827a20cc70658e9127b93a8") } }
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:304

Confirmed by checking zone uptimes; the pantry zone had just restarted, and all other zones had uptimes back to when they originally started.

A final confirmation: inventory shows that all the zones (even the non-bounced ones) are running from the requested artifacts:

    ZONES: 15
        ID                                           KIND                    IMAGE_SOURCE
        003d0721-b895-4e3b-aaee-5955b3214af2         nexus                   artifact: b0ab1aa8304e51cdf6f90d3a50791f151743edc5108984e556726f0ccb692cb1
        11da1e85-ba84-42cb-b42f-f476cccc0577         crucible                artifact: a23d81d8e6fc69d8d791add2ebfb9bb29f98d1a7ce3c6f64ef9c32d829887c05
        180b2bbc-89a5-4d2c-b7e0-632d305979bc         crucible_pantry         artifact: 4a8b4283c848123bf5ac02ff3b8fae969e6c61f5a827a20cc70658e9127b93a8
        2d34ddee-25c5-428c-b560-713bc336d5c6         crucible                artifact: a23d81d8e6fc69d8d791add2ebfb9bb29f98d1a7ce3c6f64ef9c32d829887c05
        395310b6-f69d-4a0a-948c-258ff21fad98         crucible                artifact: a23d81d8e6fc69d8d791add2ebfb9bb29f98d1a7ce3c6f64ef9c32d829887c05
        6a39e344-52a7-4f6d-af50-5ba338d948dd         crucible                artifact: a23d81d8e6fc69d8d791add2ebfb9bb29f98d1a7ce3c6f64ef9c32d829887c05
        88ed03ab-6587-48ec-bd4a-d68c30f98863         crucible                artifact: a23d81d8e6fc69d8d791add2ebfb9bb29f98d1a7ce3c6f64ef9c32d829887c05
        9a656f0b-1555-4c0e-8965-06ea44add08f         crucible                artifact: a23d81d8e6fc69d8d791add2ebfb9bb29f98d1a7ce3c6f64ef9c32d829887c05
        a0b488e3-8863-4d63-84ac-d15660c6b1b7         crucible                artifact: a23d81d8e6fc69d8d791add2ebfb9bb29f98d1a7ce3c6f64ef9c32d829887c05
        a309c32c-4ed9-4f7e-a14d-13871864254e         crucible                artifact: a23d81d8e6fc69d8d791add2ebfb9bb29f98d1a7ce3c6f64ef9c32d829887c05
        b7c91d70-50ff-4891-b456-941f345aae9b         crucible                artifact: a23d81d8e6fc69d8d791add2ebfb9bb29f98d1a7ce3c6f64ef9c32d829887c05
        bba45af2-3283-402e-91e0-5fcdf8e73d1e         internal_ntp            artifact: 61788b075520f3d26e4ee01ed0390c3827a8bce58ecac24980ae86353620fefe
        cdd73351-a66f-4360-a689-db6b92e0919b         crucible                artifact: a23d81d8e6fc69d8d791add2ebfb9bb29f98d1a7ce3c6f64ef9c32d829887c05
        d3dc0053-5868-4099-9792-20daa1987f63         cockroach_db            artifact: faf8c04d0ae6af9ec2aaed4d7d8b7c6334e84c2785b40db9b7b5127ad5b75d1b
        dd17acd5-4ba0-4a1d-92f1-e68f2d86606e         cockroach_db            artifact: faf8c04d0ae6af9ec2aaed4d7d8b7c6334e84c2785b40db9b7b5127ad5b75d1b
    last reconciled config: matches ledgered config
        no orphaned datasets
        all disks reconciled successfully
        all datasets reconciled successfully
        all zones reconciled successfully
    reconciler task status: idle (finished at 2025-07-03 18:19:06.524143 UTC after running for 10.225717875s)

@jgallagher jgallagher merged commit c16f14f into main Jul 3, 2025
16 checks passed
@jgallagher jgallagher deleted the john/config-reconciler-dont-bounce-if-hash-matches branch July 3, 2025 18:36
sunshowers added a commit to oxidecomputer/tufaceous that referenced this pull request Jul 3, 2025
…file name (#33)

In oxidecomputer/omicron#8514 we found that the
artifact and file names don't actually match all the time. With this
change, both the artifact and the file name will have to be specified.

oxidecomputer/omicron#8510 is the issue that
tracks cleaning this up.
sunshowers added a commit that referenced this pull request Jul 3, 2025
In #8514 we found that the artifact and file names don't actually match
all the time. With this change, both the artifact and the file name have
to be specified.

#8510 is the issue that tracks cleaning this up.
Development

Successfully merging this pull request may close these issues.

sled-agent should not bounce zones if the image source changes but their hashes match