Proposal to improve performance
In #3951 we disabled bonus tokens (the token sampled from the verifier model when all proposal tokens are accepted) because its KV is never generated for the draft model. We can fix this by "prefilling" the KV of the bonus token in the draft model. Note that for proposal methods that don't require KV (e.g. prompt lookup), we can re-enable bonus tokens and get a speedup there right away.
The impact of this improvement depends on the speculation length K. For low K, e.g. K = 1, where the probability of accepting the single speculative token is high (roughly a measure of how aligned the draft and target models are on the sequence), the impact is large: accepting that one token would let us emit 2 tokens (1 speculative and 1 bonus), but with bonus tokens disabled we emit only 1 (the accepted speculative token).
For higher K the impact is smaller, because the bonus token is only emitted when all K speculative tokens are accepted, and that probability decays exponentially in K. A rough per-step estimate is sketched below.
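
To make the K-dependence concrete, here is a quick back-of-the-envelope, under the simplifying assumption (mine, not from the issue) that each speculative token is accepted independently with probability alpha:

```python
# Expected tokens emitted per verification step, assuming each of the K
# speculative tokens is accepted independently with probability alpha.
def expected_emitted(alpha: float, k: int, bonus_enabled: bool) -> float:
    # P(the first i proposals are all accepted) = alpha**i, so the expected
    # number of accepted speculative tokens is sum_{i=1..K} alpha**i.
    accepted = sum(alpha**i for i in range(1, k + 1))
    if bonus_enabled:
        # One extra token is always emitted: the recovered token on a
        # rejection, or the bonus token when all K proposals are accepted.
        return accepted + 1.0
    # With bonus tokens disabled, the extra token only exists on a rejection
    # (probability 1 - alpha**K); the all-accepted case loses it.
    return accepted + (1.0 - alpha**k)

for k in (1, 3, 5):
    with_bonus = expected_emitted(0.8, k, True)
    without = expected_emitted(0.8, k, False)
    print(f"K={k}: {with_bonus:.2f} vs {without:.2f} tokens/step "
          f"(loss = {with_bonus - without:.2f} = 0.8**{k})")
```

Under this assumption the per-step loss from disabling bonus tokens is exactly alpha**K: with alpha = 0.8 that is 0.80 tokens/step at K = 1 but only ~0.33 at K = 5, matching the exponential decay described above.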
vllm/vllm/model_executor/layers/rejection_sampler.py
Lines 311 to 315 in 323f27b

```python
# We disable bonus tokens because it causes corrupt KV cache for
# proposal methods that require KV cache. We can fix it by "prefilling"
# the bonus token in the proposer. The following issue tracks the fix.
# https://github.com/vllm-project/vllm/issues/4212
output_with_bonus_tokens[:, -1] = -1
```
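
To make the proposed fix concrete, here is a minimal toy sketch (my illustration, not vLLM's actual model or cache interfaces): the bonus token comes from the verifier and never passes through the draft model, so the fix gives it one extra draft-model forward pass whose only purpose is to write its KV entries before the next proposal round.

```python
import torch

HIDDEN = 8  # toy hidden size

class ToyDraftModel:
    """Stands in for the draft model; only the KV bookkeeping matters."""
    def __init__(self):
        self.embed = torch.nn.Embedding(100, HIDDEN)
        self.k_cache = []  # one tensor per cached position
        self.v_cache = []

    def forward(self, token_id: int) -> None:
        # Real attention is omitted; all that matters here is the side
        # effect of appending this position's key/value to the cache.
        h = self.embed(torch.tensor(token_id)).detach()
        self.k_cache.append(h)
        self.v_cache.append(h)

draft = ToyDraftModel()
accepted = [5, 17, 42]  # the K proposal tokens, all accepted by the verifier
bonus = 7               # bonus token, sampled from the verifier only

for t in accepted:
    draft.forward(t)    # KV was already produced while proposing

# The fix: one extra draft-model forward pass on the bonus token, purely
# for its KV side effect. Without it, the next proposal round would attend
# over a cache missing this position -- the corruption described above.
draft.forward(bonus)
assert len(draft.k_cache) == len(accepted) + 1
```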