
[V1][Bug][Spec Decode]: Overhead of SD with PC is 30-40% higher than baseline #19996

@ekagra-ranjan

Description

Your current environment

The output of python collect_env.py
==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.9.1.dev705+g01220ce89 (git sha: 01220ce89)

🐛 Describe the bug

Setup:

  • H100 80GB
  • Llama 3.1 8B
  • MTBench

It seems that the TTFT overhead of EAGLE is 30-40% higher than baseline, especially in cases where prefix caching (PC) gives the model good cache hits.

mtbench TTFT (ms)

BS                   baseline   eagle   Degradation (eagle / baseline)
BS 4 - first run     14.55      18.77   1.29
BS 4 - repeat run    14.26      18.56   1.30

"BS 4 - first run" means BS is 4 and we send the MTBench dataset once to a freshly started server.
"BS 4 - repeat run" means BS is 4 and we send the MTBench dataset to the same server a second time.
The hope is that the TTFT of the repeat run will be much lower than that of the first run, since the server has already processed MTBench once and the prompts should hit the prefix cache.

The serve log contains this:

INFO 06-23 14:21:48 [gpu_worker.py:232] Available KV cache memory: 51.63 GiB
INFO 06-23 14:21:48 [kv_cache_utils.py:716] GPU KV cache size: 410,128 tokens
INFO 06-23 14:21:48 [kv_cache_utils.py:720] Maximum concurrency for 131,072 tokens per request: 3.13x

which means the KV cache can store ~410k unique tokens. MTBench has 80 prompts, and the dataset contains about 8k context tokens in total. Generating 100 tokens per prompt, the total number of tokens resident in the block pool would be 8k + 80*100 = 16k, which is far below the 410k-token capacity, so no cached blocks should be evicted between runs.
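
As a quick sanity check on that arithmetic (a minimal sketch; the prompt and output token counts are the rough figures quoted above, not exact dataset statistics):

# Rough KV-cache budget check for this setup. The counts below are the
# approximate figures quoted above, not exact dataset statistics.
kv_cache_capacity_tokens = 410_128   # "GPU KV cache size" from the serve log
num_prompts = 80                     # MTBench prompts sent per run
total_prompt_tokens = 8_000          # approximate context tokens in the dataset
output_tokens_per_prompt = 100       # tokens generated per prompt

tokens_in_block_pool = total_prompt_tokens + num_prompts * output_tokens_per_prompt
print(f"tokens needed: {tokens_in_block_pool:,} / capacity: {kv_cache_capacity_tokens:,}")
# -> tokens needed: 16,000 / capacity: 410,128, so the whole dataset fits and
#    the repeat run should get near-full prefix-cache hits.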

However:

  1. TTFT has two components: (a) prefill and (b) scheduler/sampler/other overhead. Since the BS 4 repeat run is almost the same as the BS 4 first run, even though the repeat run should hit the prefix cache and skip most of the prefill, (b) is the dominating factor in this setup.
  2. The overhead of EAGLE from (b) is ~30% higher than baseline.

Point 2 becomes more important for longer contexts when the prompt is already cached. In those cases PC benefits the model and the remaining TTFT is mostly (b), yet EAGLE in that scenario still sees a TTFT 30-40% higher than vanilla.
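
A back-of-the-envelope way to see this from the table above (a sketch that assumes the repeat run's prefill cost is negligible thanks to prefix caching; that is an assumption, not something measured directly):

# Split TTFT into prefill (a) and overhead (b) from the measured numbers,
# assuming the repeat run's prefill is ~0 thanks to prefix-cache hits.
baseline_first, baseline_repeat = 14.55, 14.26   # ms, from the table
eagle_first, eagle_repeat = 18.77, 18.56         # ms, from the table

for name, first, repeat in [("baseline", baseline_first, baseline_repeat),
                            ("eagle", eagle_first, eagle_repeat)]:
    prefill = first - repeat           # (a): roughly what the cache hit saves
    print(f"{name}: overhead ~{repeat:.2f} ms, prefill ~{prefill:.2f} ms")

print(f"eagle overhead vs baseline: {eagle_repeat / baseline_repeat:.2f}x")
# -> prefill accounts for well under 1 ms of TTFT in both cases, so the
#    scheduler/sampler/proposer overhead (b) dominates, and with EAGLE it is ~1.30x.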

Commands

# vanilla
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct  --disable-log-requests --port 9001

# eagle
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct  --disable-log-requests --port 9001  --speculative_config '{"method": "eagle","model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2}'


# bench
python3 benchmarks/benchmark_serving.py --port 9001 --save-result  --backend vllm  --model meta-llama/Llama-3.1-8B-Instruct  --endpoint /v1/completions  --dataset-name hf  --dataset-path philschmid/mt-bench  --num-prompts 80  --max-concurrency 4
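
For completeness, a minimal sketch of measuring TTFT directly against the running server (it assumes one of the serve commands above is listening on port 9001, streams from the OpenAI-compatible /v1/completions endpoint, and times the first returned chunk; the prompt is just an example):

import time
import requests

url = "http://localhost:9001/v1/completions"
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Compose an engaging travel blog post about a recent trip to Hawaii.",
    "max_tokens": 100,
    "stream": True,
}

start = time.perf_counter()
with requests.post(url, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        # The first non-empty SSE line ("data: {...}") marks the first token.
        if line and line != b"data: [DONE]":
            print(f"TTFT: {(time.perf_counter() - start) * 1000:.2f} ms")
            break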

