[Extensions] Improve maxSurge/maxUnavailable handling for multi-zone worker pools #798

@timuthy

How to categorize this issue?

/area usability
/kind enhancement
/priority 3

What would you like to be added:
Today, maxSurge and maxUnavailable values are configured at the worker pool level (ref). When a worker pool spans multiple zones, provider extensions usually distribute the configured values across those zones (ref).

Although distributing these numbers is generally acceptable, the behavior is not obvious to end-users and can therefore lead to unexpected and unacceptable cluster upgrade behavior. This is especially true when maxSurge < len(zones) and maxSurge < len(zones) && maxUnavailable < maxSurge.

Example:

    workers:
      - name: worker
        machine:
          type: n1-standard-4
          image:
            name: gardenlinux
            version: 318.8.0
        maximum: 5
        minimum: 3
        maxSurge: 1
        maxUnavailable: 0
        zones:
        - europe-west1-a
        - europe-west1-b
        - europe-west1-c

This will result in 3 MachineDeployments:

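| MachineDeployment | Zone           | maxSurge | maxUnavailable |
|-------------------|----------------|----------|----------------|
| worker-z1         | europe-west1-a | 1        | 0              |
| worker-z2         | europe-west1-b | 0        | 0              |
| worker-z3         | europe-west1-c | 0        | 0              |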

While the workers in europe-west1-a are upgraded in a rolling fashion, the ones in europe-west1-b and europe-west1-c are just replaced. During the upgrade procedure, the cluster will have fewer Nodes than configured in workers[*].minimum.
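
The per-zone numbers in the table above can be reproduced with a small sketch. It assumes that the extension gives every zone the integer share of the configured value and hands the remainder to the first zones in order. The function name `distribute_over_zones` is hypothetical, and only absolute (non-percentage) values are handled.

```python
def distribute_over_zones(value: int, zone_count: int) -> list[int]:
    """Split an absolute value across zones.

    Assumed scheme for illustration only, not the actual extension code.
    """
    if zone_count <= 0:
        raise ValueError("zone_count must be positive")
    base, remainder = divmod(value, zone_count)
    # The first `remainder` zones each receive one extra unit.
    return [base + (1 if i < remainder else 0) for i in range(zone_count)]


zones = ["europe-west1-a", "europe-west1-b", "europe-west1-c"]
max_surge = distribute_over_zones(1, len(zones))        # maxSurge: 1
max_unavailable = distribute_over_zones(0, len(zones))  # maxUnavailable: 0

for zone, surge, unavailable in zip(zones, max_surge, max_unavailable):
    print(f"{zone}: maxSurge={surge} maxUnavailable={unavailable}")
```

Under this assumption, any zone beyond the first `maxSurge` zones ends up with a surge of 0, which is the case the issue describes.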

We see the following options to improve this user experience (only when maxSurge < len(zones)):

  • Change API validation so that maxSurge >= len(zones) --> incompatible change that will probably break many automation functionalities around Gardener.
  • Automatically set maxSurge: 1 for each zone (suggested by @AxiomSamarth @himanshu-kun) --> solves many "standard" cases in which maxUnavailable is not used.
  • The worker actuator sets the configured values zone by zone when an upgrade is performed --> comes close to what is expected by end-users but implies long running worker reconciliations.
  • Other thoughts?
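
The first two options can be sketched as follows. This is only one possible reading of the proposals, not an agreed design: `validate_max_surge` and `surge_per_zone` are hypothetical names, values are assumed to be absolute integers, and the fallback distribution (integer share plus remainder to the first zones) is an assumption.

```python
def validate_max_surge(max_surge: int, zone_count: int) -> None:
    """Option 1 (sketch): reject worker pools where maxSurge < len(zones)."""
    if max_surge < zone_count:
        raise ValueError(
            f"maxSurge ({max_surge}) must be >= number of zones ({zone_count})"
        )


def surge_per_zone(max_surge: int, zone_count: int) -> list[int]:
    """Option 2 (sketch): give every zone maxSurge 1 when maxSurge < len(zones)."""
    if zone_count <= 0:
        raise ValueError("zone_count must be positive")
    if max_surge < zone_count:
        # Every zone can surge by one machine, so none is replaced without surge.
        return [1] * zone_count
    # Otherwise keep the assumed distribution: equal share, remainder to first zones.
    base, remainder = divmod(max_surge, zone_count)
    return [base + (1 if i < remainder else 0) for i in range(zone_count)]


print(surge_per_zone(1, 3))
print(surge_per_zone(4, 3))
```

Note that the second option lets the pool surge by up to one machine per zone in total, which is more than the configured maxSurge whenever maxSurge < len(zones).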

Why is this needed:
Needed for a better user experience and to avoid unexpected outages during cluster upgrades.

Labels

  • area/usability: Usability related
  • kind/enhancement: Enhancement, improvement, extension
  • lifecycle/rotten: Nobody worked on this for 12 months (final aging stage)
  • needs/planning: Needs (more) planning with other MCM maintainers
  • priority/2: Priority (lower number equals higher priority)
