Proposal to improve performance
In #3951 we disabled bonus tokens (the token sampled from the verifier model when all proposal tokens are accepted) because its KV is never generated for the draft model. We can fix this by "prefilling" the KV of the bonus token in the draft model. Note that for proposal methods that don't require KV (e.g. prompt lookup), we can re-enable bonus tokens and get a speedup there right away.
The impact of this improvement depends on the speculation length K. For low K, e.g. K = 1, where the probability of accepting the single speculative token is high (roughly a measure of how aligned the draft and target models are on the sequence), the impact is large: accepting that one token would let us emit 2 tokens (1 speculative and 1 bonus), but with bonus tokens disabled we emit only 1 (the accepted speculative token).
For higher K the impact is smaller, because the bonus token is only emitted when all K speculative tokens are accepted, and that probability decays exponentially in K. A rough per-step estimate is sketched below.
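
To make the K-dependence concrete, here is a quick back-of-the-envelope, under the simplifying assumption (mine, not from the issue) that each speculative token is accepted independently with probability alpha:

```python
# Expected tokens emitted per verification step, assuming each of the K
# speculative tokens is accepted independently with probability alpha.
def expected_emitted(alpha: float, k: int, bonus_enabled: bool) -> float:
    # P(the first i proposals are all accepted) = alpha**i, so the expected
    # number of accepted speculative tokens is sum_{i=1..K} alpha**i.
    accepted = sum(alpha**i for i in range(1, k + 1))
    if bonus_enabled:
        # One extra token is always emitted: the recovered token on a
        # rejection, or the bonus token when all K proposals are accepted.
        return accepted + 1.0
    # With bonus tokens disabled, the extra token only exists on a rejection
    # (probability 1 - alpha**K); the all-accepted case loses it.
    return accepted + (1.0 - alpha**k)

for k in (1, 3, 5):
    with_bonus = expected_emitted(0.8, k, True)
    without = expected_emitted(0.8, k, False)
    print(f"K={k}: {with_bonus:.2f} vs {without:.2f} tokens/step "
          f"(loss = {with_bonus - without:.2f} = 0.8**{k})")
```

Under this assumption the per-step loss from disabling bonus tokens is exactly alpha**K: with alpha = 0.8 that is 0.80 tokens/step at K = 1 but only ~0.33 at K = 5, matching the exponential decay described above.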
vllm/vllm/model_executor/layers/rejection_sampler.py
Lines 311 to 315 in 323f27b

```python
# We disable bonus tokens because it causes corrupt KV cache for
# proposal methods that require KV cache. We can fix it by "prefilling"
# the bonus token in the proposer. The following issue tracks the fix.
# https://github.com/vllm-project/vllm/issues/4212
output_with_bonus_tokens[:, -1] = -1
```
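
To make the proposed fix concrete, here is a minimal toy sketch (my illustration, not vLLM's actual model or cache interfaces): the bonus token comes from the verifier and never passes through the draft model, so the fix gives it one extra draft-model forward pass whose only purpose is to write its KV entries before the next proposal round.

```python
import torch

HIDDEN = 8  # toy hidden size

class ToyDraftModel:
    """Stands in for the draft model; only the KV bookkeeping matters."""
    def __init__(self):
        self.embed = torch.nn.Embedding(100, HIDDEN)
        self.k_cache = []  # one tensor per cached position
        self.v_cache = []

    def forward(self, token_id: int) -> None:
        # Real attention is omitted; all that matters here is the side
        # effect of appending this position's key/value to the cache.
        h = self.embed(torch.tensor(token_id)).detach()
        self.k_cache.append(h)
        self.v_cache.append(h)

draft = ToyDraftModel()
accepted = [5, 17, 42]  # the K proposal tokens, all accepted by the verifier
bonus = 7               # bonus token, sampled from the verifier only

for t in accepted:
    draft.forward(t)    # KV was already produced while proposing

# The fix: one extra draft-model forward pass on the bonus token, purely
# for its KV side effect. Without it, the next proposal round would attend
# over a cache missing this position -- the corruption described above.
draft.forward(bonus)
assert len(draft.k_cache) == len(accepted) + 1
```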