Your current environment
The output of `python collect_env.py`:
```text
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.9.1.dev705+g01220ce89 (git sha: 01220ce89)
```
🐛 Describe the bug
Setup:
- H100 80GB
- Llama 3.1 8B
- MTBench
It seems that the overhead of EAGLE on TTFT is 30-40%, especially in cases where prefix caching benefits the model with good cache hits.
MTBench TTFT (ms):

| BS | baseline | EAGLE | Degradation (EAGLE/baseline) |
|---|---|---|---|
| BS 4 - first run | 14.55 | 18.77 | 1.29 |
| BS 4 - repeat run | 14.26 | 18.56 | 1.30 |
"BS 4 - first run" means BS is 4 and we send the MTBench dataset once to a freshly started server. "BS 4 - repeat run" means BS is 4 and we send the MTBench dataset to the same server again. The hope is that the TTFT of the repeat run would be much lower than that of the first run, since the server has already processed MTBench once.
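For context, this is the kind of first-run vs. repeat-run comparison being described. Below is a minimal sketch (not the actual benchmark script) that times the same request twice against the running server; it assumes the standard OpenAI-compatible streaming /v1/completions endpoint on port 9001 from the commands below, and uses a made-up stand-in prompt:

```python
# Minimal sketch: measure TTFT for the same prompt twice against the running
# server, so the second call should hit the prefix cache. Assumes the vllm
# serve command below is running on port 9001.
import time
import requests

URL = "http://localhost:9001/v1/completions"
PROMPT = "Explain the difference between TCP and UDP in detail."  # hypothetical stand-in for an MTBench prompt

def measure_ttft(prompt: str) -> float:
    """Return seconds from request start until the first streamed chunk arrives."""
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": prompt,
        "max_tokens": 100,
        "stream": True,
    }
    start = time.perf_counter()
    with requests.post(URL, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # first non-empty SSE line ~= first token
                return time.perf_counter() - start
    return float("nan")

print(f"first run  TTFT: {measure_ttft(PROMPT) * 1e3:.2f} ms")
print(f"repeat run TTFT: {measure_ttft(PROMPT) * 1e3:.2f} ms")  # prefix should now be cached
```

On the second call the prompt prefix should already be cached, so whatever TTFT remains is mostly the non-prefill overhead.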
The serve log contains this:
```text
INFO 06-23 14:21:48 [gpu_worker.py:232] Available KV cache memory: 51.63 GiB
INFO 06-23 14:21:48 [kv_cache_utils.py:716] GPU KV cache size: 410,128 tokens
INFO 06-23 14:21:48 [kv_cache_utils.py:720] Maximum concurrency for 131,072 tokens per request: 3.13x
```
which means the KV cache can store roughly 410k unique tokens. MTBench has 80 prompts and the total number of context tokens in the dataset is about 8k. With 100 generated tokens per prompt, the total number of tokens residing in the block pool would be 8k + 80*100 = 16k, far less than the 410k-token capacity, so nothing should need to be evicted between runs.
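A quick back-of-envelope check of that arithmetic, using the numbers quoted from the serve log and the dataset description above:

```python
# Back-of-envelope check that MTBench fits comfortably in the KV cache.
kv_cache_capacity_tokens = 410_128   # from "GPU KV cache size: 410,128 tokens"
num_prompts = 80                     # MTBench prompts
total_context_tokens = 8_000         # approx. context tokens in the dataset
generated_per_prompt = 100           # tokens generated per prompt

tokens_in_block_pool = total_context_tokens + num_prompts * generated_per_prompt
print(tokens_in_block_pool)                              # 16000
print(tokens_in_block_pool < kv_cache_capacity_tokens)   # True -> no eviction expected
```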
However:
1. TTFT has two components: (a) prefill and (b) scheduler/sampler/other overhead. Since BS 4 repeat run is almost the same as BS 4 first run, (b) must be the dominating factor in this setup.
2. The overhead of EAGLE coming from (b) is ~30% higher than the baseline.
Point 2 becomes more important for longer contexts when the prompt is already cached. In those cases, prefix caching benefits the model and the remaining TTFT is mostly (b), so EAGLE in that scenario will see TTFT 30-40% higher than vanilla.
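To illustrate point 2 with hypothetical numbers (not measurements): writing TTFT = prefill + overhead, the EAGLE/vanilla TTFT ratio tends toward the pure overhead ratio as prefix caching shrinks the prefill term:

```python
# Hypothetical illustration of point 2: TTFT = prefill + overhead.
# As prefix caching shrinks the prefill term, the EAGLE/vanilla TTFT ratio
# approaches the ratio of the per-step overheads (b). Numbers are made up.
overhead_baseline_ms = 12.0   # hypothetical scheduler/sampler/other overhead, vanilla
overhead_eagle_ms = 16.0      # hypothetical overhead with EAGLE (~30% higher)

for prefill_ms in (200.0, 50.0, 10.0, 0.0):  # prefill cost shrinking with better cache hits
    ttft_baseline = prefill_ms + overhead_baseline_ms
    ttft_eagle = prefill_ms + overhead_eagle_ms
    print(f"prefill={prefill_ms:6.1f} ms -> eagle/baseline TTFT ratio = "
          f"{ttft_eagle / ttft_baseline:.2f}")
```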
Commands
```bash
# vanilla
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --port 9001

# eagle
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --port 9001 --speculative_config '{"method": "eagle","model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2}'

# bench
python3 benchmarks/benchmark_serving.py --port 9001 --save-result --backend vllm --model meta-llama/Llama-3.1-8B-Instruct --endpoint /v1/completions --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --max-concurrency 4
```
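And a small sketch for comparing the two saved results, assuming the JSON written by --save-result exposes a mean_ttft_ms field (the key name may differ across vLLM versions):

```python
# Sketch for comparing two saved benchmark results, e.g.:
#   python3 compare_ttft.py baseline.json eagle.json   (hypothetical file names)
# Assumes the JSON files written by --save-result expose "mean_ttft_ms";
# adjust the key if your version names it differently.
import json
import sys

def mean_ttft(path: str) -> float:
    with open(path) as f:
        return json.load(f)["mean_ttft_ms"]

baseline, eagle = mean_ttft(sys.argv[1]), mean_ttft(sys.argv[2])
print(f"baseline: {baseline:.2f} ms, eagle: {eagle:.2f} ms, "
      f"degradation: {eagle / baseline:.2f}x")
```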
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.