[Extensions] Improve maxSurge/maxUnavailable handling for multi-zone worker pools

**How to categorize this issue?**

/area usability
/kind enhancement
/priority 3

**What would you like to be added**:
Today, `maxSurge` and `maxUnavailable` values are configured at the worker pool level ([ref](https://github.com/gardener/gardener/blob/8fe28dfd64377d5a62854c200b2618da4d01c8b9/example/90-shoot.yaml#L37-L38)). Provider extensions usually **distribute** the configured values across zones if multiple zones are configured ([ref](https://github.com/gardener/gardener-extension-provider-aws/blob/dee11db8d82be58a1b96168ceef7721a72c51e9d/pkg/controller/worker/machines.go#L188-L189)).

Although distributing these numbers is generally acceptable, the behavior is not obvious to end-users and can therefore lead to unexpected and unacceptable cluster upgrade behavior. This is especially true when `maxSurge < len(zones)`, and in particular when `maxSurge < len(zones) && maxUnavailable < maxSurge`.

Example:

```yaml
    workers:
      - name: worker
        machine:
          type: n1-standard-4
          image:
            name: gardenlinux
            version: 318.8.0
        maximum: 5
        minimum: 3
        maxSurge: 1
        maxUnavailable: 0
        zones:
        - europe-west1-a
        - europe-west1-b
        - europe-west1-c
```

This will result in 3 `MachineDeployments`:

| MachineDeployment | Zone | maxSurge | maxUnavailable |
| ------------- |:-------------|:-------------:| -----:|
| worker-z1 | europe-west1-a | 1 | 0 |
| worker-z2 | europe-west1-b | 0 | 0 |
| worker-z3 | europe-west1-c | 0 | 0 |

While the workers in `europe-west1-a` are upgraded in a rolling fashion, the ones in `europe-west1-b` and `europe-west1-c` are just replaced. During the upgrade procedure, the cluster will have fewer `Node`s than configured in `workers[*].minimum`.
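To make the distribution concrete, here is a minimal Python sketch of how an integer value could be split across zones. It assumes that earlier zones absorb the remainder, which matches the table above. The function name `distribute_over_zones` is hypothetical and is not the extension's actual helper, and percentage values are not handled.

```python
def distribute_over_zones(total: int, zone_count: int) -> list[int]:
    """Split an integer value across zones; earlier zones absorb the remainder.

    Hypothetical helper for illustration only (assumption: remainder goes
    to the first zones, percentages are not supported).
    """
    if zone_count <= 0:
        raise ValueError("zone_count must be positive")
    base, remainder = divmod(total, zone_count)
    return [base + (1 if idx < remainder else 0) for idx in range(zone_count)]


zones = ["europe-west1-a", "europe-west1-b", "europe-west1-c"]
surge = distribute_over_zones(1, len(zones))        # maxSurge: 1
unavailable = distribute_over_zones(0, len(zones))  # maxUnavailable: 0

for zone, s, u in zip(zones, surge, unavailable):
    print(f"{zone}: maxSurge={s} maxUnavailable={u}")
```

With `maxSurge: 1` and three zones, only the first zone receives a surge budget; the other two end up with `maxSurge: 0` and `maxUnavailable: 0`.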

We see the following options to improve this user experience (only when `maxSurge < len(zones)`):
- Change API validation so that `maxSurge >= len(zones)` --> incompatible and will probably break many automation functionalities around Gardener.
- Automatically set `maxSurge: 1` for each zone (suggested by @AxiomSamarth @himanshu-kun) --> solves many "standard" cases in which `maxUnavailable` is not used.
- The worker actuator sets the configured values zone by zone when an upgrade is performed --> comes close to what is expected by end-users but implies long-running worker reconciliations.
- Other thoughts?
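
As a rough illustration of the second option, the following Python sketch gives every zone `maxSurge: 1` whenever `maxSurge < len(zones)` and otherwise distributes the value as before. The function name `per_zone_max_surge` is hypothetical, integer values are assumed, and the rule is applied literally (including for `maxSurge: 0`), which the proposal does not spell out.

```python
def per_zone_max_surge(max_surge: int, zone_count: int) -> list[int]:
    """Per-zone maxSurge under the 'at least 1 per zone' proposal.

    Hypothetical sketch: if the pool-level maxSurge is smaller than the
    number of zones, every zone gets 1; otherwise the value is distributed
    with earlier zones absorbing the remainder.
    """
    if zone_count <= 0:
        raise ValueError("zone_count must be positive")
    if max_surge < zone_count:
        return [1] * zone_count
    base, remainder = divmod(max_surge, zone_count)
    return [base + (1 if idx < remainder else 0) for idx in range(zone_count)]


# Pool from the example above: maxSurge: 1, three zones.
print(per_zone_max_surge(1, 3))
```

Note that under this option the effective total surge across the pool can exceed the configured `maxSurge` (three additional machines instead of one in the example).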

**Why is this needed**:
Needed for better user experience to avoid unexpected outages.

[Extensions] Improve maxSurge/maxUnavailable handling for multi-zone worker pools #798
