[v1] Remove bind_kv_cache and self.kv_cache in model runner #14098
Changes from all commits
```diff
@@ -34,7 +34,6 @@
 from vllm.v1.sample.metadata import SamplingMetadata
 from vllm.v1.sample.rejection_sampler import INVALID_TOKEN_ID, RejectionSampler
 from vllm.v1.spec_decode.ngram_proposer import NgramProposer
-from vllm.v1.utils import bind_kv_cache
 from vllm.v1.worker.gpu_input_batch import CachedRequestState, InputBatch
 from vllm.v1.worker.lora_model_runner_mixin import LoRAModelRunnerMixin
```
```diff
@@ -135,7 +134,6 @@ def __init__(
 
         # Lazy initialization
         # self.model: nn.Module  # Set after load_model
-        self.kv_caches: list[torch.Tensor] = []
         # req_id -> (input_id -> encoder_output)
         self.encoder_cache: dict[str, dict[int, torch.Tensor]] = {}
 
```
|
@@ -1382,10 +1380,13 @@ def initialize_kv_cache(self, kv_cache_config: KVCacheConfig) -> None: | |||
else: | ||||
raise NotImplementedError | ||||
|
||||
bind_kv_cache( | ||||
kv_caches, | ||||
self.vllm_config.compilation_config.static_forward_context, | ||||
self.kv_caches) | ||||
# Associates each attention layer in the `forward_context` with the | ||||
# initialized KV cache. | ||||
forward_context = self.vllm_config.compilation_config \ | ||||
.static_forward_context | ||||
for layer_name, kv_cache in kv_caches.items(): | ||||
# NOTE: Use list because of v0 PP virtual engine. | ||||
forward_context[layer_name].kv_cache = [kv_cache] | ||||
Comment on lines +1383 to +1389
This is still a kind of "binding"? To unify the interface for runners, would the following be better?

```python
bind_kv_cache(
    kv_caches,
    self.vllm_config.compilation_config.static_forward_context,
)
```

And in TPU model runner:

```python
bind_kv_cache(
    kv_caches,
    self.vllm_config.compilation_config.static_forward_context,
    self.kv_caches,
)
```

So we could have:

```python
def bind_kv_cache(
    kv_caches: dict[str, torch.Tensor],
    forward_context: dict[str, "Attention"],
    runner_kv_caches: Optional[list[torch.Tensor]] = None,
) -> None:
    # Bind runner kv caches. Also comment that
    # this is only used by TPU for now.
    if runner_kv_caches is not None:
        ...
    # Bind forward_context.
    ...
```

If TPU is the only model runner (now and in the future) that needs this, we could alternatively have a separate utility:

```python
def bind_runner_kv_caches(
    kv_caches: dict[str, torch.Tensor],
    runner_kv_caches: list[torch.Tensor],
) -> None:
    ...
```

And in TPU model runner:

```python
bind_kv_cache(
    kv_caches,
    self.vllm_config.compilation_config.static_forward_context,
)
bind_runner_kv_caches(
    kv_caches,
    self.kv_caches,
)
```

---

In tpu_model_runner, the binding to the forward context is quite simple, so I prefer to implement it in

---

OK, if this is confirmed, please add a TODO/FIXME to

---

Actually @heheda12345, could you check if v1's tpu_model_runner is really using `self.kv_caches`? I did a quick check. It seems that

> vllm/vllm/v1/worker/tpu_model_runner.py, line 823 in 04421df

---

It is used in
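For reference, a filled-in version of the unified `bind_kv_cache` proposed in this thread could look as follows. This is only an illustrative sketch: the signature is taken from the comment above, while the body (in particular the sorted layer order for `runner_kv_caches`) is an assumption, not the code merged in this PR.

```python
from typing import Optional

import torch


def bind_kv_cache(
    kv_caches: dict[str, torch.Tensor],
    forward_context: dict[str, "Attention"],
    runner_kv_caches: Optional[list[torch.Tensor]] = None,
) -> None:
    """Bind initialized KV cache tensors to their attention layers.

    `runner_kv_caches` is a flat per-layer list kept only by runners
    (currently TPU) that still index caches by position.
    """
    if runner_kv_caches is not None:
        # Assumed ordering: sort by layer name so the flat list is
        # deterministic across workers.
        for layer_name in sorted(kv_caches):
            runner_kv_caches.append(kv_caches[layer_name])
    for layer_name, kv_cache in kv_caches.items():
        # NOTE: Use a list because of the v0 PP virtual engine.
        forward_context[layer_name].kv_cache = [kv_cache]
```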
```diff
 
     def get_kv_cache_spec(self) -> KVCacheSpec:
         """
```
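To make the new binding concrete, here is a minimal, runnable sketch of what the loop added in `initialize_kv_cache` does. `AttentionStub` and the tensor shape are hypothetical stand-ins; the only property borrowed from vLLM is that each layer in the static forward context exposes a writable `kv_cache` attribute.

```python
import torch


class AttentionStub:
    """Hypothetical stand-in for vLLM's Attention layer."""

    def __init__(self) -> None:
        self.kv_cache: list[torch.Tensor] = []


# Mirrors compilation_config.static_forward_context: layer name -> module.
forward_context = {
    "model.layers.0.attn": AttentionStub(),
    "model.layers.1.attn": AttentionStub(),
}

# Freshly allocated caches keyed by the same layer names (shape made up).
kv_caches = {name: torch.zeros(2, 16, 8, 4) for name in forward_context}

# The loop from the diff: bind each layer's cache, wrapped in a list to
# match the v0 pipeline-parallel "virtual engine" convention.
for layer_name, kv_cache in kv_caches.items():
    forward_context[layer_name].kv_cache = [kv_cache]

# Each attention layer now reads its cache from the forward context, so
# the GPU runner no longer needs its own self.kv_caches list.
assert forward_context["model.layers.0.attn"].kv_cache[0] is kv_caches[
    "model.layers.0.attn"]
```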
---

This file shouldn't be here (in `vllm/`)?