34 changes: 22 additions & 12 deletions config/charts/inferencepool/README.md
@@ -103,19 +103,30 @@ $ helm install triton-llama3-8b-instruct \

To deploy the EndpointPicker in a high-availability (HA) active-passive configuration, you can enable leader election. When enabled, the EPP deployment will have multiple replicas, but only one "leader" replica will be active and ready to process traffic at any given time. If the leader pod fails, another pod will be elected as the new leader, ensuring service continuity.

To enable HA, set `inferenceExtension.flags.has-enable-leader-election` to `true` and increase the number of replicas in your `values.yaml` file:
To enable HA, set `inferenceExtension.enableLeaderElection` to `true`.

```yaml
inferenceExtension:
  replicas: 3
  has-enable-leader-election: true
```
* Via `--set` flag:

Then apply it with:
```txt
helm install vllm-llama3-8b-instruct \
  --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
  --set inferenceExtension.enableLeaderElection=true \
  --set provider=[none|gke] \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
```

```txt
helm install vllm-llama3-8b-instruct ./config/charts/inferencepool -f values.yaml
```
* Via `values.yaml`:

```yaml
inferenceExtension:
  enableLeaderElection: true
```

Then apply it with:

```txt
helm install vllm-llama3-8b-instruct ./config/charts/inferencepool -f values.yaml
```
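
After installation, you can confirm the active-passive behavior: as noted above, only the elected leader reports Ready, so listing the EPP pods should show exactly one `1/1` pod and the remaining replicas as `0/1`. A minimal sketch, assuming the chart applies the standard `app.kubernetes.io/instance` label and the release name used above; adjust the selector to match the labels your deployment actually carries:

```txt
# Only the current leader should be READY 1/1; the passive replicas stay 0/1.
kubectl get pods -l app.kubernetes.io/instance=vllm-llama3-8b-instruct
```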

### Install with Monitoring

@@ -171,8 +182,7 @@ The following table lists the configurable parameters of the chart.
| `inferenceExtension.extraServicePorts` | List of additional service ports to expose. Defaults to `[]`. |
| `inferenceExtension.flags` | List of flags which are passed through to the endpoint picker. Example flags: enable-pprof, grpc-port, etc. Refer to [runner.go](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/cmd/epp/runner/runner.go) for the complete list. |
| `inferenceExtension.affinity` | Affinity for the endpoint picker. Defaults to `{}`. |
| `inferenceExtension.tolerations` | Tolerations for the endpoint picker. Defaults to `[]`. |
| `inferenceExtension.flags.has-enable-leader-election` | Enable leader election for high availability. When enabled, only one EPP pod (the leader) will be ready to serve traffic. |
| `inferenceExtension.tolerations` | Tolerations for the endpoint picker. Defaults to `[]`. |
| `inferenceExtension.monitoring.interval` | Metrics scraping interval for monitoring. Defaults to `10s`. |
| `inferenceExtension.monitoring.secret.name` | Name of the service account token secret for metrics authentication. Defaults to `inference-gateway-sa-metrics-reader-secret`. |
| `inferenceExtension.monitoring.prometheus.enabled` | Enable Prometheus ServiceMonitor creation for EPP metrics collection. Defaults to `false`. |
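
For reference, each `inferenceExtension.flags` entry is rendered by the deployment template (see the `epp-deployment.yaml` diff below) as a `--<name>`/`<value>` argument pair on the EPP container. A minimal `values.yaml` sketch, using the example flag names from the table row above as assumed values:

```yaml
inferenceExtension:
  flags:
    - name: enable-pprof   # rendered as: --enable-pprof "true"
      value: "true"
    - name: grpc-port      # rendered as: --grpc-port "9002"
      value: "9002"
```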
30 changes: 23 additions & 7 deletions config/charts/inferencepool/templates/epp-deployment.yaml
@@ -6,7 +6,19 @@ metadata:
  labels:
    {{- include "gateway-api-inference-extension.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.inferenceExtension.replicas | default 1 }}
  {{- if .Values.inferenceExtension.enableLeaderElection }}
Collaborator:

Is there a way to articulate this where the default is 3 if leaderElection is enabled, and the default is 1 otherwise?

This would still allow a user to specify replica count if desired.

We currently suggest active-passive as a best HA practice, but a user could decide they would rather use active-active, incur the performance cost (or maybe their algo works fine with active-active), and use multiple replicas.

Contributor Author:

It's a tradeoff I had to make between simplicity and best practice vs. flexibility. I think in helm we should prioritize the former, as advanced users can always fork and tweak for the additional flexibility they want.

So for the current best practices, I think we recommend HA with 3 replicas for "critical" use cases, and 1 replica for non-critical. We don't recommend active-active due to routing performance reasons. Users can do that if they understand the details, but we don't offer that out of the box in helm. My worry is that if we offer that, they will find the performance worse than what we advertise, and it's not obvious why that happens.

Open for debate but I think simplicity is quite important here. In the current state it's easy to shoot yourself in the foot with multiple active-active replicas without understanding the performance penalty, and meanwhile it's hard to configure the leader election properly as there are 3 things going on (the flag, the replicas, and the rbac).

Contributor Author (@liu-cong, Sep 19, 2025):

Perhaps we just need to add some more documentation, and explain that if you want to go active-active, you can tweak this way, and here are the implications, bla bla.

Is this an acceptable outcome? I do think that users who want active-active need to understand the implications, and likely "advanced" use cases. We don't need to make it simple, but we need to articulate it.

Contributor:

I think 3 is a reasonable default that I don't think many would want to change.

Contributor:

This change makes the replicas field not overridable, but hardcoded.

> In the current state it's easy to shoot yourself in the foot with multiple active-active replicas without understanding the performance penalty

I agree with the above, therefore I suggest keeping replicas always 1 when leader election is disabled, but can we change it at least in the leader-election-enabled setup? It should be possible to override the number of replicas easily.
I'm expecting something like:

```yaml
{{- if .Values.inferenceExtension.enableLeaderElection }}
replicas: {{ .Values.inferenceExtension.replicas | default 3 }}
{{- else }}
replicas: 1
{{- end }}
```

So we get a default of 3/1 (depending on the HA setting), but can still override the value of replicas as we wish.

Contributor (@nirrozenbaum, Sep 21, 2025):

Alternatively, another proposal: we can remove the enableLeaderElection flag completely from helm and use only the replicas field. Then we add if/else to the helm templates:
- if replicas is 1, leader election is disabled
- if replicas is more than 1, leader election is enabled

This way, there is no way for users to get confused in their setup, because they set only a single field and we configure leader election for them automatically. So we change the deployment template as follows:

```yaml
{{- if gt .Values.inferenceExtension.replicas 1 }}
- --ha-enable-leader-election
```

I like this proposal more, since it keeps the user away from leader election enabled/disabled and keeps them focused only on the number of replicas. We currently don't want to support active-active mode, and therefore it shouldn't be possible to configure it through our helm chart.

Contributor:

created PR #1628

  replicas: 3
  {{- else }}
  replicas: 1
  {{- end }}
  strategy:
    # The current recommended EPP deployment pattern is to have a single active replica. This ensures
    # optimal performance of stateful operations such as the prefix cache aware scorer.
    # With the Recreate strategy, the old replica is killed immediately, allowing the new replica(s) to
    # quickly take over. This is particularly important in the high-availability setup with leader
    # election, as the rolling update strategy would prevent the old leader from being killed, because
    # otherwise maxUnavailable would be 100%.
    type: Recreate
  selector:
    matchLabels:
      {{- include "gateway-api-inference-extension.selectorLabels" . | nindent 6 }}
@@ -33,10 +45,6 @@ spec:
- "json"
- --config-file
- "/config/{{ .Values.inferenceExtension.pluginsConfigFile }}"
{{- range .Values.inferenceExtension.flags }}
- "--{{ .name }}"
- "{{ .value }}"
{{- end }}
{{- if eq (.Values.inferencePool.modelServerType | default "vllm") "triton-tensorrt-llm" }}
- --total-queued-requests-metric
- "nv_trt_llm_request_metrics{request_type=waiting}"
@@ -45,6 +53,14 @@ spec:
        - --lora-info-metric
        - "" # Set an empty metric to disable LoRA metric scraping as they are not supported by Triton yet.
        {{- end }}
        {{- if .Values.inferenceExtension.enableLeaderElection }}
        - --ha-enable-leader-election
        {{- end }}
        # Pass additional flags via the inferenceExtension.flags field in values.yaml.
        {{- range .Values.inferenceExtension.flags }}
        - "--{{ .name }}"
        - "{{ .value }}"
        {{- end }}
        ports:
        - name: grpc
          containerPort: 9002
@@ -77,8 +93,8 @@ spec:
            port: 9003
            service: inference-extension
          {{- end }}
          initialDelaySeconds: 5
          periodSeconds: 10
          periodSeconds: 2

        env:
        - name: NAMESPACE
          valueFrom:
4 changes: 4 additions & 0 deletions config/charts/inferencepool/templates/gke.yaml
@@ -13,6 +13,10 @@ spec:
    kind: InferencePool
    name: {{ .Release.Name }}
  default:
    # Set a more aggressive health check than the default 5s for faster switch
    # over during EPP rollout.
    timeoutSec: 2
    checkIntervalSec: 2
    config:
      type: HTTP
      httpHealthCheck: