[Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models #5765
Conversation
Pull from head
LMK once it's ready for review @sroy745
Approach looks good. Seems this will conflict with #5799; we should work out a way to combine both.
and batch sizes when bonus token acceptance is enabled. It ensures
correctness by comparing the output of speculative decoding with the baseline.
"""
run_greedy_equality_correctness_test(baseline_llm_generator,
Test looks good. Can you manually check the draft acceptance rate? For the same draft/target model it should be 100%. Without your fix (and with bonus token enabled) the acceptance rate goes to something like 80%.
Ideally we'd have a test for this; not sure how easy that is.
Yeah, I thought of adding such an e2e test, but I couldn't find an easy way to access the metrics_collector and the stats. As you suggested, I added an e2e test in test_multistep_correctness.py with the draft and target model set to the same model ("JackFram/llama-68m") and the number of speculative tokens set to 3. The system efficiency with the bonus token enabled is 1.0 and with the bonus token disabled is 0.75.
With bonus token
INFO 07-08 06:27:27 metrics.py:316] Speculative metrics: Draft acceptance rate: 1.000, System efficiency: 1.000, Number of speculative tokens: 3, Number of accepted tokens: 21696, Number of draft tokens tokens: 21696, Number of emitted tokens tokens: 28928.
Without bonus token
Speculative metrics: Draft acceptance rate: 1.000, System efficiency: 0.750, Number of speculative tokens: 3, Number of accepted tokens: 21696, Number of draft tokens tokens: 21696, Number of emitted tokens tokens: 21696.
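For reference, a rough back-of-the-envelope check of where the 1.0 vs 0.75 system efficiency comes from (a sketch; the variable names are mine, not vLLM's):

```python
# With num_speculative_tokens = 3 and a 100% draft acceptance rate, each
# verified step can emit up to 3 + 1 tokens once the bonus token is kept.
num_spec_tokens = 3
accepted = 21696
steps = accepted // num_spec_tokens              # 7232 scoring steps
max_emittable = steps * (num_spec_tokens + 1)    # 28928 tokens

print(28928 / max_emittable)   # 1.00 -> system efficiency with bonus token
print(21696 / max_emittable)   # 0.75 -> system efficiency without bonus token
```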
indices_of_original_sequence_groups = []
for seq_group in execute_model_req.seq_group_metadata_list:
    seq_ids_with_bonus_tokens = []
    for seq_id, seq_data in seq_group.seq_data.items():
Approach looks good. Pretty messy, but I don't see anything easy that would fix that.
Tried to simplify it a bit by moving some of the logic into two helper functions, _shallow_copy_seq_group_metadata and _copy_seq_metadata_excluding_last_token. PTAL.
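For readers following along, a toy sketch of what such helpers conceptually do. The classes and field names below are made-up stand-ins, not vLLM's actual SequenceGroupMetadata/SequenceData API:

```python
from copy import copy, deepcopy
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ToySeqData:
    prompt_token_ids: List[int]
    output_token_ids: List[int]
    num_computed_tokens: int = 0

@dataclass
class ToySeqGroupMetadata:
    request_id: str
    seq_data: Dict[int, ToySeqData] = field(default_factory=dict)

def shallow_copy_seq_group_metadata(sgm: ToySeqGroupMetadata) -> ToySeqGroupMetadata:
    # Copy the container and its seq_data dict so per-request edits don't
    # leak back into the original request; the payload objects are shared.
    new_sgm = copy(sgm)
    new_sgm.seq_data = dict(sgm.seq_data)
    return new_sgm

def copy_seq_data_excluding_last_token(seq_data: ToySeqData) -> ToySeqData:
    # Deep-copy one sequence's data, dropping its last output token and
    # decrementing num_computed_tokens so the draft model recomputes the KV
    # for that position.
    new_data = deepcopy(seq_data)
    new_data.output_token_ids = new_data.output_token_ids[:-1]
    new_data.num_computed_tokens = max(0, new_data.num_computed_tokens - 1)
    return new_data
```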
target_worker.execute_model.assert_called_once_with(execute_model_req)

@torch.inference_mode()
def test_populate_seq_ids_with_bonus_tokens():
One edge case is where a sequence is skipped but is still present in seq_with_bonus_token_in_last_step.
Added this case in Forward Pass #2.
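To illustrate the edge case, a purely toy example. The set name follows the discussion above, but the update rule shown is just one plausible way to keep it consistent, not the PR's actual logic:

```python
# Sequences that received a bonus token in the previous step.
seq_with_bonus_token_in_last_step = {101, 102, 103}

# This step, sequence 102 is skipped (not scheduled); 101 and 103 run,
# and only 103 gets a bonus token again.
scheduled_this_step = {101, 103}
got_bonus_this_step = {103}

# Entries for skipped sequences must be preserved untouched; only the
# scheduled sequences' entries are refreshed.
seq_with_bonus_token_in_last_step = (
    (seq_with_bonus_token_in_last_step - scheduled_this_step) | got_bonus_this_step
)
print(seq_with_bonus_token_in_last_step)  # {102, 103}
```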
Awesome. Will take a look tomorrow.
Can we manually verify the following?
- When bonus tokens are enabled and the same model is used for draft and target, we get 100% draft acceptance rate. This indicates that the KV of the draft model is ~equal to the KV of the target model.
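A minimal sketch of how one might run this check offline, assuming the engine kwargs available around the time of this PR (speculative_model, num_speculative_tokens, use_v2_block_manager) and reading the acceptance rate from the logged metrics:

```python
from vllm import LLM, SamplingParams

# Use the same checkpoint for draft and target so every proposed token should
# be accepted; the "Draft acceptance rate" in the logged Speculative metrics
# should then be ~1.0.
llm = LLM(
    model="JackFram/llama-68m",
    speculative_model="JackFram/llama-68m",  # draft == target
    num_speculative_tokens=3,
    use_v2_block_manager=True,
    disable_log_stats=False,  # keep the Speculative metrics log line
)

outputs = llm.generate(
    ["San Francisco is a"] * 64,
    SamplingParams(temperature=0.0, max_tokens=128),
)
# Then check the "Speculative metrics: Draft acceptance rate: ..." log line.
```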
# Also reduce num_computed_tokens by 1 since we are not
# including the last output token.
Can we add more of a comment here motivating this? I thought the workers themselves don't take this into account unless chunked prefill is enabled.
I included this to keep the value consistent. It's probably not needed since it only gets used for chunked prefill. However, I see this being updated elsewhere in the spec decode code, e.g. https://sourcegraph.com/github.com/vllm-project/vllm/-/blob/vllm/spec_decode/draft_model_runner.py?L116, hence I added it. Is this confusing? Should I remove it?
Gotcha -- yeah can leave in. Add a comment?
Added a note.
@torch.inference_mode()
def test_expand_execute_model_request_for_bonus_tokens():
Why do we need both this test and test_same_output_for_multi_step_with_batch_expansion?
Fine to have both, but we should add something to the docstring that explains the difference.
Removed this test as well as test_filter_model_output. Both should be covered by test_same_output_for_multi_step_with_batch_expansion as you suggested.
output_indices_to_retain = random.sample(range(num_steps),
                                         max(1, num_steps // 2))
I am confused why we sample from range(num_steps). Shouldn't it be range(batch_size)?
It should be range(batch_size). I removed the test, though, in favor of test_same_output_for_multi_step_with_batch_expansion.
# Forward Pass: 0
# Set the last token ID to -1 for all indices not in
# seq_indexes_with_bonus_tokens to indicate the lack of bonus token in
# those indices.
accepted_token_ids[mask, -1:] = -1
worker = SpecDecodeWorker(draft_worker,
Can we break the three tests in this test into their own tests? It will be easier to follow and debug.
I modified the test to initialize the internal data structures with some fake data and then run a forward pass to simulate all 3 cases. PTAL. It should be simpler to follow now while covering all the cases.
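A toy illustration of the masking shown in the snippet above, with made-up token ids and indices:

```python
import torch

# Each row holds the accepted tokens for one sequence; the last column is the
# bonus-token slot.
accepted_token_ids = torch.tensor([
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
])
seq_indexes_with_bonus_tokens = [0, 2]

# mask selects the rows that did NOT get a bonus token.
mask = torch.ones(accepted_token_ids.shape[0], dtype=torch.bool)
mask[seq_indexes_with_bonus_tokens] = False

accepted_token_ids[mask, -1:] = -1
print(accepted_token_ids)
# tensor([[11, 12, 13, 14],
#         [21, 22, 23, -1],
#         [31, 32, 33, 34]])
```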
Oh I just saw your response
This is awesome! So exciting to see this working!
Thanks for the review. Addressed the comments. PTAL
Looks great!
Thanks for the review. Addressed your comment and rebased. Should be ready to merge once the tests complete.
Merged!
FIX #4212
In this PR we make the following changes
Some numbers from the e2e tests. Note that the e2e tests don't use CUDA graphs. The draft model is JackFram/llama-68m and the target model is JackFram/llama-160m, with a batch size of 64. Completion time for num_speculation = 1 shows a ~33% speedup (see the quick check after the logs below).
w/o bonus Processed prompts: 100%|█████████████████64/64 [00:06<00:00, 10.13it/s, est. speed input: 78.48 toks/s, output: 2592.40 toks/s]
with bonus Processed prompts: 100%|█████████████████████ 64/64 [00:04<00:00, 15.50it/s, est. speed input: 120.13 toks/s, output: 3968.28 toks/s]
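Rough arithmetic behind the ~33% figure, using the rounded timings from the progress bars above (approximate):

```python
time_without_bonus = 6.0   # seconds, from "00:06"
time_with_bonus = 4.0      # seconds, from "00:04"

print(1 - time_with_bonus / time_without_bonus)  # ~0.33 -> ~33% shorter completion time
print(3968.28 / 2592.40)                         # ~1.53x output token throughput
```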