[Enhancement] Adaptive approximate prefix cache indexer #1304

@liu-cong

Description

What would you like to be added:

In essence, EPP is guessing the prefix cache state of each model server. The sources of error include: a) we don't know the total cache size on the model server accurately (though this can be obtained via metrics), b) we use character-based prefix matching instead of token-based matching, and c) unknown model server eviction behavior. How can we improve the accuracy of this guess?

  1. We can configure an initial cache size based on the total number of tokens the server can hold. vLLM has a cache_info metric that exposes the cache size.
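As a rough sketch of step 1: once the block count and block size are scraped from the server's metrics endpoint, the token capacity is just their product. The function and variable names below are illustrative assumptions, not the actual EPP implementation, and the exact metric/label names must be confirmed against the server's /metrics output.

```go
package main

import "fmt"

// tokenCapacity estimates the total number of tokens the model server's KV
// cache can hold, given values scraped from its metrics endpoint (e.g. the
// cache-config info metric vLLM exposes). Both parameters are assumed to
// come from that scrape.
func tokenCapacity(numGPUBlocks, blockSize int) int {
	return numGPUBlocks * blockSize
}

func main() {
	// Hypothetical values: 8192 KV cache blocks of 16 tokens each.
	fmt.Println(tokenCapacity(8192, 16)) // 131072 tokens
}
```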

  2. We can better estimate the character:token ratio by comparing request/response byte counts against the token counts returned in the usage_stats of the response, taking an average over the last X minutes.

  3. We can use a feedback signal: compare the actual prefix cache hit ratio of a request against the hit ratio EPP predicted. If the prediction was wrong, we can shrink the EPP cache size so it retains only more recent entries, which have higher accuracy.
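One way to sketch step 3 is a multiplicative-decrease controller on the indexer's entry budget: shrink when the predicted hit ratio overshoots the reported one, and grow back slowly otherwise. The step sizes, tolerance, and floor below are illustrative tuning knobs I made up for the sketch, not values from EPP.

```go
package main

import "fmt"

// adjustCapacity shrinks the indexer's entry budget when EPP's predicted hit
// ratio overshoots the hit ratio the model server actually reported, and
// slowly grows it back otherwise.
func adjustCapacity(capacity int, predictedHit, actualHit float64) int {
	const (
		tolerance  = 0.05 // acceptable prediction error (assumed)
		shrinkStep = 0.90 // multiplicative decrease on overestimate (assumed)
		growStep   = 1.01 // gentle recovery factor (assumed)
		minCap     = 1024 // never shrink below this many entries (assumed)
	)
	switch {
	case predictedHit-actualHit > tolerance:
		// We predicted cache hits that did not happen: stale entries are
		// polluting the index, so keep fewer (more recent) entries.
		capacity = int(float64(capacity) * shrinkStep)
	case actualHit >= predictedHit:
		capacity = int(float64(capacity) * growStep)
	}
	if capacity < minCap {
		capacity = minCap
	}
	return capacity
}

func main() {
	fmt.Println(adjustCapacity(10000, 0.8, 0.5)) // overestimate: shrinks to 9000
	fmt.Println(adjustCapacity(10000, 0.5, 0.6)) // accurate: grows to 10100
}
```

The asymmetry (fast shrink, slow growth) mirrors AIMD-style congestion control: a wrong guess is costly to routing quality, so the controller reacts quickly to overestimates and cautiously reclaims capacity.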

Why is this needed:

  • This improves the accuracy of the approximate prefix cache indexer
  • This removes the need for the user to configure the prefix cache indexer manually
