Skip to content

Conversation

nirrozenbaum
Copy link
Contributor

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR fixes a bug of EPP crashing in high scale. the bug happens due to requestID being empty and then prefix plugin maps different requests to the same PluginState entry, which results in different go routines trying to write to the same map concurrently. more details in issue #1489.
This PR guarantees requestID is always set. if it's not supplied as a header, we generate it while handling the request headers, before starting to handle the request body.

Which issue(s) this PR fixes:

Fixes #1489

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 28, 2025
Copy link

netlify bot commented Aug 28, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit b100b9f
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68b1366f4bc5a90008e04154
😎 Deploy Preview https://deploy-preview-1490--gateway-api-inference-extension.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 28, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and danehans August 28, 2025 06:38
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nirrozenbaum

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 28, 2025
@nirrozenbaum
Copy link
Contributor Author

/cc @kfswain

@k8s-ci-robot k8s-ci-robot requested a review from kfswain August 28, 2025 06:38
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 28, 2025
@kfswain
Copy link
Collaborator

kfswain commented Aug 28, 2025

I notice this deals with the requestID problem, but leave the pluginState concurrent unsafe still.

I don't think we have any guarantees that a plugin will provide the appropriate keys when calling Write, leading to a similar issue

@nirrozenbaum
Copy link
Contributor Author

I notice this deals with the requestID problem, but leave the pluginState concurrent unsafe still.

I don't think we have any guarantees that a plugin will provide the appropriate keys when calling Write, leading to a similar issue

I'm not sure I followed the problem you're describing.
I'll state here the assumptions of PluginState code:

  • a state is maintained per request. as long as a plugin uses PluginState of its own (it's not shared with another plugin), which is the current state, and as long as different requests get different requestID, that means that in different requests we fetch different values from the sync.map.
  • a plugin never runs two extension points (or more) in parallel for the same request.

the sync map is used to map between requestID -> map[string]any (just a [key, value] pairs for that request).
each extension point of the prefix plugin can do with this map whatever we want, without the need to lock it, because the code is contention-free (based on the above assumptions).
the map is relevant only for current plugin and only for current request.

a plugin can reuse the same StateKey as many times as we want for different requests. this is not an issue.
in fact this is even verified in the unit tests.
a plugin can also use the same StateKey in the same request. but that would be serial and not in parallel, so again no issue.

@nirrozenbaum
Copy link
Contributor Author

I notice this deals with the requestID problem, but leave the pluginState concurrent unsafe still.
I don't think we have any guarantees that a plugin will provide the appropriate keys when calling Write, leading to a similar issue

I'm not sure I followed the problem you're describing. I'll state here the assumptions of PluginState code:

  • a state is maintained per request. as long as a plugin uses PluginState of its own (it's not shared with another plugin), which is the current state, and as long as different requests get different requestID, that means that in different requests we fetch different values from the sync.map.
  • a plugin never runs two extension points (or more) in parallel for the same request.

the sync map is used to map between requestID -> map[string]any (just a [key, value] pairs for that request). each extension point of the prefix plugin can do with this map whatever we want, without the need to lock it, because the code is contention-free (based on the above assumptions). the map is relevant only for current plugin and only for current request.

a plugin can reuse the same StateKey as many times as we want for different requests. this is not an issue. in fact this is even verified in the unit tests. a plugin can also use the same StateKey in the same request. but that would be serial and not in parallel, so again no issue.

was your intention that the internal map should also be using sync map because we don't trust the caller?

@kfswain
Copy link
Collaborator

kfswain commented Aug 28, 2025

was your intention that the internal map should also be using sync map because we don't trust the caller?

Yes, if a plugin implemented absentmindedly passes in an empty string for the requestID, that shouldn't crash the entire EPP, and thats an easy one to miss in testing

@nirrozenbaum
Copy link
Contributor Author

Yes, if a plugin implemented absentmindedly passes in an empty string for the requestID, that shouldn't crash the entire EPP, and thats an easy one to miss in testing

yeah, sure. I can change that.
do you prefer in this PR or a separate one?

@kfswain
Copy link
Collaborator

kfswain commented Aug 28, 2025

This PR works for me

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 29, 2025
@nirrozenbaum
Copy link
Contributor Author

This PR works for me

Done 💪🏻

@kfswain
Copy link
Collaborator

kfswain commented Aug 29, 2025

/lgtm

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 29, 2025
@k8s-ci-robot k8s-ci-robot merged commit 2220efc into kubernetes-sigs:main Aug 29, 2025
10 checks passed
@nirrozenbaum nirrozenbaum deleted the fix-empty-req-id branch August 29, 2025 17:13
kfswain pushed a commit to kfswain/llm-instance-gateway that referenced this pull request Sep 1, 2025
…igs#1490)

* if request id was not supplied in header, generate uuid

Signed-off-by: Nir Rozenbaum <[email protected]>

* convert map to sync.map in plugin state

Signed-off-by: Nir Rozenbaum <[email protected]>

---------

Signed-off-by: Nir Rozenbaum <[email protected]>
nirrozenbaum added a commit that referenced this pull request Sep 5, 2025
* if request id was not supplied in header, generate uuid

Signed-off-by: Nir Rozenbaum <[email protected]>

* convert map to sync.map in plugin state

Signed-off-by: Nir Rozenbaum <[email protected]>

---------

Signed-off-by: Nir Rozenbaum <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

EPP crashing during scale test
3 participants