feat: Add a scoring plugin to distribute new groups evenly #357
Conversation
Force-pushed from bc08bee to 5ee44e1.
Pull Request Overview
This PR adds a new scoring plugin called "no-hit-lru-scorer" that helps distribute new groups evenly across pods by tracking which pods have least recently handled cache-miss requests. The plugin uses an LRU cache to give higher scores to pods that haven't recently served new groups, helping to balance cache growth.
Key changes:
- Implements NoHitLRU scorer that tracks pod usage for cache-miss requests
- Adds comprehensive test coverage for the new plugin functionality
- Integrates the plugin into the registration system with sample configuration
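The mechanics described above can be sketched roughly as follows. This is a minimal, self-contained sketch of the idea only: the type and method names here are illustrative and not the plugin's actual API, which builds on `lru.Cache` and the scheduler's `types` package.

```go
package main

import "fmt"

// noHitLRU is an illustrative stand-in for the plugin's LRU state:
// pod names in usage order, most recently used at the end.
type noHitLRU struct {
	order []string
}

// touch records that pod just handled a cold (cache-miss) request.
func (s *noHitLRU) touch(pod string) {
	for i, p := range s.order {
		if p == pod {
			s.order = append(s.order[:i], s.order[i+1:]...)
			break
		}
	}
	s.order = append(s.order, pod)
}

// score maps each pod to a value in (0, 1]: pods that never handled a
// cold request get 1.0, and scores fall off toward 0 for pods that
// handled one more recently.
func (s *noHitLRU) score(pods []string) map[string]float64 {
	rank := make(map[string]int, len(s.order))
	for i, p := range s.order {
		rank[p] = i + 1 // 1 = least recently used
	}
	n := len(s.order) + 1
	out := make(map[string]float64, len(pods))
	for _, p := range pods {
		r, seen := rank[p]
		if !seen {
			out[p] = 1.0 // never served a new group
			continue
		}
		out[p] = 1.0 - float64(r)/float64(n)
	}
	return out
}

func main() {
	s := &noHitLRU{}
	s.touch("pod-a")
	s.touch("pod-b")
	// pod-c is unseen and scores highest; pod-b was touched last and scores lowest.
	fmt.Println(s.score([]string{"pod-a", "pod-b", "pod-c"}))
}
```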
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/plugins/scorer/no_hit_lru.go | Core implementation of the NoHitLRU scorer plugin |
| pkg/plugins/scorer/no_hit_lru_test.go | Comprehensive test suite covering functionality and edge cases |
| pkg/plugins/register.go | Registers the new plugin in the plugin system |
| deploy/config/sim-epp-no-hit-lru.yaml | Sample configuration demonstrating plugin usage |
Force-pushed from 1a56bda to 1e04f9d.
Overall looks good - a few comments.
pkg/plugins/scorer/no_hit_lru.go (outdated)

```go
// - LRU ordering is with respect to when a pod last received a cold request.
// - Least recently used (or never used) pods get highest score (1.0)
// - Most recently used pods get lowest score (approaching 0.0)
func (s *NoHitLRU) Score(ctx context.Context, cycleState *types.CycleState, request *types.LLMRequest, pods []types.Pod) map[types.Pod]float64 {
```
I think it might help readability if the main logical blocks in this call were separated into different functions.
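As a hedged illustration of what such a decomposition might look like, here is a sketch using stand-in types rather than the plugin's real signatures; `isCold`, `lruScores`, and `uniform` are hypothetical helper names, not the actual implementation:

```go
package main

import "fmt"

type pod string

// scorer tracks pods in LRU order: least recently used first.
type scorer struct{ order []pod }

// isCold reports whether the request missed the prefix cache.
func isCold(prefixHit bool) bool { return !prefixHit }

// lruScores maps each pod to (0, 1]: unseen pods score 1.0, and the
// most recently used pod scores lowest.
func (s *scorer) lruScores(pods []pod) map[pod]float64 {
	rank := map[pod]int{}
	for i, p := range s.order {
		rank[p] = i + 1 // 1 = least recently used
	}
	out := map[pod]float64{}
	for _, p := range pods {
		if r, ok := rank[p]; ok {
			out[p] = 1 - float64(r)/float64(len(s.order)+1)
		} else {
			out[p] = 1 // never handled a cold request
		}
	}
	return out
}

// uniform gives every pod the same neutral score for warm requests.
func uniform(pods []pod) map[pod]float64 {
	out := map[pod]float64{}
	for _, p := range pods {
		out[p] = 0
	}
	return out
}

// Score is the entry point: each logical block is one named helper,
// so the control flow reads as a short sequence of steps.
func (s *scorer) Score(pods []pod, prefixHit bool) map[pod]float64 {
	if !isCold(prefixHit) {
		return uniform(pods)
	}
	return s.lruScores(pods)
}

func main() {
	s := &scorer{order: []pod{"a"}}
	fmt.Println(s.Score([]pod{"a", "b"}, false)) // → map[a:0.5 b:1]
}
```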
```go
// Read prefix cache state to determine if this is a cold request.
// This is treated as an optimization - if the state isn't available, we assume a cold request.
prefixState, err := types.ReadCycleStateKey[*prefix.SchedulingContextState](cycleState, plugins.StateKey(s.prefixPluginName))
```
Not a blocker but would be great: do you think you can extend and support the same optimization for the `precise-prefix-cache-scorer` too? I personally am eager to use this scorer with the latter.
It is also possible that this optimization can be achieved with unequal weighting (e.g., ratio of 2:1 between prefix-cache scorers and this one) - but could also be fragile in some cases.
Great idea. Maybe I can add this in a follow up PR? I'm a bit worried about mixing concerns.
Sounds good.
pkg/plugins/scorer/no_hit_lru.go (outdated)

```go
type NoHitLRU struct {
	typedName plugins.TypedName
	lruCache  *lru.Cache[string, struct{}] // pod name -> dummy value (we only care about order)
	mutex     *sync.RWMutex
```
The `lru.Cache` package is thread-safe; the `Add` and `Keys` calls are protected by an internal lock. The use of this mutex is therefore purely for snapshot semantics: if `Keys()` is called, the mutex makes sure no other request will progress before the scoring. Might be good to test the effect of this locking and profile the overhead as well.
Only with unit tests, but I did run a benchmark and added a test with multiple threads. The results support removing the mutex, imo. Being stale by 1ms on rare occasions shouldn't significantly break the intent behind this -- cache balancing. Here are the benchmark results:
https://gist.github.com/usize/b936f67655029c05ba701218c1bfe5da
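The kind of comparison being discussed could be sketched like this: a toy structure with an internal lock standing in for `lru.Cache`, measured with and without a redundant outer lock. This is illustrative only, not the benchmark from the gist, and the numbers it prints are machine-dependent.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// store is a stand-in for an already thread-safe cache: every method
// takes its own internal lock, like lru.Cache does.
type store struct {
	mu   sync.Mutex
	keys map[string]struct{}
}

func (s *store) add(k string) {
	s.mu.Lock()
	s.keys[k] = struct{}{}
	s.mu.Unlock()
}

// snapshot copies out the current key set under the internal lock.
func (s *store) snapshot() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]string, 0, len(s.keys))
	for k := range s.keys {
		out = append(out, k)
	}
	return out
}

func main() {
	s := &store{keys: map[string]struct{}{}}
	var outer sync.RWMutex // the extra, possibly redundant lock

	// Parallel add+snapshot relying only on the internal lock.
	without := testing.Benchmark(func(b *testing.B) {
		b.RunParallel(func(pb *testing.PB) {
			for pb.Next() {
				s.add("pod")
				_ = s.snapshot()
			}
		})
	})
	// The same workload serialized by an outer lock for snapshot semantics.
	with := testing.Benchmark(func(b *testing.B) {
		b.RunParallel(func(pb *testing.PB) {
			for pb.Next() {
				outer.Lock()
				s.add("pod")
				_ = s.snapshot()
				outer.Unlock()
			}
		})
	})
	fmt.Println("without outer lock:", without.NsPerOp(), "ns/op")
	fmt.Println("with outer lock:   ", with.NsPerOp(), "ns/op")
}
```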
@usize looks good overall, thank you for the contribution! I think we're just missing updates to the architecture doc, which contains info about plugins and their configuration - then good to go.
TYSM, I'll follow up with the "phase 2" that the task described, and also file an issue and make a patch for the improvement mentioned here: #357 (comment) 🙏🏻
- `lruSize` (optional): The maximum number of pods to track in the LRU cache. Defaults to 1024.

**Note:** This scorer is designed to work alongside a prefix cache scorer (such as `prefix-cache-scorer` or `precise-prefix-cache-scorer`). If no prefix cache state is available, all requests are treated as cold.
I think following this with a YAML example of what a configuration looks like would be helpful here. Additionally, it should state that when integrating with a prefix-cache scorer, the latter should be defined first in the scheduling profile.
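Something along these lines, as a sketch only -- the field names follow the shape of the repo's other sample configs and may not match the actual schema exactly:

```yaml
# Sketch only: verify against deploy/config/sim-epp-no-hit-lru.yaml.
plugins:
  - type: prefix-cache-scorer    # must be defined first: produces the prefix state
  - type: no-hit-lru-scorer      # reads the prefix state to detect cold requests
    parameters:
      lruSize: 1024              # optional, defaults to 1024
schedulingProfiles:
  - name: default
    plugins:
      - pluginRef: prefix-cache-scorer
        weight: 2
      - pluginRef: no-hit-lru-scorer
        weight: 1
```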
Great work @usize, thank you! /lgtm
@usize can you verify the commits? See this for more info: https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification
When a request results in no cache hit, it opens a new group. New groups are likely to disproportionately grow the cache where they land. To avoid uneven cache growth, a new scorer prefers to send new groups to pods that have least recently handled a request from a new group. Signed-off-by: usize <[email protected]>
If no prefix cache state is available, treat all requests as cold. This will still allow for some benefit in terms of balancing. Signed-off-by: usize <[email protected]>
Instead of inspecting the state of the LRU cache directly, it's possible to verify behavior via the usual scoring API. Signed-off-by: usize <[email protected]>
The LRU Cache's size should be included in its error message. Signed-off-by: usize <[email protected]>
The plugin uses an LRU cache to assign higher scores to pods that have less recently handled a request matching no existing prefixes (and thus opening a new group). This should help even out cache growth, as new groups are likely to add more new blocks to the cache than existing groups.
For simplicity, this implementation relies on state produced by the prefix-cache-scorer plugin to track whether a request hits or misses the cache. Because of that, prefix-cache-scorer is required for this to work properly in production.
Fixes: #346