[Enhancement] Adaptive approximate prefix cache indexer #1304

@liu-cong

Description

What would you like to be added:

In essence, EPP is guessing the prefix cache state of each model server. The sources of error include: a) we don't know the total cache size on the model server accurately (though this can be obtained via metrics), b) we use character-based prefix matching instead of token-based matching, and c) unknown model server eviction behavior. How can we improve the accuracy of this guess?

  1. We can configure an initial cache size based on the total number of tokens the server can hold. vLLM has a cache_info metric that exposes the cache size.
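As a rough sketch of step 1: once the block count and block size are scraped from the server's metrics endpoint, the token capacity is just their product. The function and variable names below are illustrative assumptions, not the actual EPP implementation, and the exact metric/label names must be confirmed against the server's /metrics output.

```go
package main

import "fmt"

// tokenCapacity estimates the total number of tokens the model server's KV
// cache can hold, given values scraped from its metrics endpoint (e.g. the
// cache-config info metric vLLM exposes). Both parameters are assumed to
// come from that scrape.
func tokenCapacity(numGPUBlocks, blockSize int) int {
	return numGPUBlocks * blockSize
}

func main() {
	// Hypothetical values: 8192 KV cache blocks of 16 tokens each.
	fmt.Println(tokenCapacity(8192, 16)) // 131072 tokens
}
```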

  2. We can better estimate the character:token ratio by comparing request/response byte counts against the token counts returned in the usage_stats of the response, taking an average over the last X minutes.

  3. We can use a feedback signal: compare the actual prefix cache hit ratio of a request against the hit ratio EPP predicted. If the prediction was wrong, we can shrink the EPP cache size so it retains only more recent entries, which have higher accuracy.
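One way to sketch step 3 is a multiplicative-decrease controller on the indexer's entry budget: shrink when the predicted hit ratio overshoots the reported one, and grow back slowly otherwise. The step sizes, tolerance, and floor below are illustrative tuning knobs I made up for the sketch, not values from EPP.

```go
package main

import "fmt"

// adjustCapacity shrinks the indexer's entry budget when EPP's predicted hit
// ratio overshoots the hit ratio the model server actually reported, and
// slowly grows it back otherwise.
func adjustCapacity(capacity int, predictedHit, actualHit float64) int {
	const (
		tolerance  = 0.05 // acceptable prediction error (assumed)
		shrinkStep = 0.90 // multiplicative decrease on overestimate (assumed)
		growStep   = 1.01 // gentle recovery factor (assumed)
		minCap     = 1024 // never shrink below this many entries (assumed)
	)
	switch {
	case predictedHit-actualHit > tolerance:
		// We predicted cache hits that did not happen: stale entries are
		// polluting the index, so keep fewer (more recent) entries.
		capacity = int(float64(capacity) * shrinkStep)
	case actualHit >= predictedHit:
		capacity = int(float64(capacity) * growStep)
	}
	if capacity < minCap {
		capacity = minCap
	}
	return capacity
}

func main() {
	fmt.Println(adjustCapacity(10000, 0.8, 0.5)) // overestimate: shrinks to 9000
	fmt.Println(adjustCapacity(10000, 0.5, 0.6)) // accurate: grows to 10100
}
```

The asymmetry (fast shrink, slow growth) mirrors AIMD-style congestion control: a wrong guess is costly to routing quality, so the controller reacts quickly to overestimates and cautiously reclaims capacity.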

Why is this needed:

  • This improves the accuracy of the approximate prefix cache indexer
  • This removes the need for the user to configure the prefix cache indexer manually
