Your current environment
The output of `python collect_env.py`:
```text
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.9.1.dev705+g01220ce89 (git sha: 01220ce89)
```
🐛 Describe the bug
Setup:
- H100 80GB
- Llama 3.1 8B
- MTBench
It seems that the overhead of EAGLE on TTFT is 30-40%, especially in cases where prefix caching benefits the model with good cache hits.
MTBench TTFT (ms):

| BS | baseline | EAGLE | Degradation (EAGLE/baseline) |
|---|---|---|---|
| BS 4 - first run | 14.55 | 18.77 | 1.29 |
| BS 4 - repeat run | 14.26 | 18.56 | 1.30 |
"BS 4 - first run" means BS is 4 and we send the MTBench dataset once to a freshly started server. "BS 4 - repeat run" means BS is 4 and we send the MTBench dataset to the same server again. The hope is that the TTFT of the repeat run would be much lower than that of the first run, since the server has already processed MTBench once.
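For context, this is the kind of first-run vs. repeat-run comparison being described. Below is a minimal sketch (not the actual benchmark script) that times the same request twice against the running server; it assumes the standard OpenAI-compatible streaming /v1/completions endpoint on port 9001 from the commands below, and uses a made-up stand-in prompt:

```python
# Minimal sketch: measure TTFT for the same prompt twice against the running
# server, so the second call should hit the prefix cache. Assumes the vllm
# serve command below is running on port 9001.
import time
import requests

URL = "http://localhost:9001/v1/completions"
PROMPT = "Explain the difference between TCP and UDP in detail."  # hypothetical stand-in for an MTBench prompt

def measure_ttft(prompt: str) -> float:
    """Return seconds from request start until the first streamed chunk arrives."""
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": prompt,
        "max_tokens": 100,
        "stream": True,
    }
    start = time.perf_counter()
    with requests.post(URL, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # first non-empty SSE line ~= first token
                return time.perf_counter() - start
    return float("nan")

print(f"first run  TTFT: {measure_ttft(PROMPT) * 1e3:.2f} ms")
print(f"repeat run TTFT: {measure_ttft(PROMPT) * 1e3:.2f} ms")  # prefix should now be cached
```

On the second call the prompt prefix should already be cached, so whatever TTFT remains is mostly the non-prefill overhead.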
The serve log contains this:
```text
INFO 06-23 14:21:48 [gpu_worker.py:232] Available KV cache memory: 51.63 GiB
INFO 06-23 14:21:48 [kv_cache_utils.py:716] GPU KV cache size: 410,128 tokens
INFO 06-23 14:21:48 [kv_cache_utils.py:720] Maximum concurrency for 131,072 tokens per request: 3.13x
```
which means the KV cache can store roughly 410k unique tokens. MTBench has 80 prompts and the total number of context tokens in the dataset is about 8k. With 100 generated tokens per prompt, the total number of tokens residing in the block pool would be 8k + 80*100 = 16k, far less than the 410k-token capacity, so nothing should need to be evicted between runs.
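A quick back-of-envelope check of that arithmetic, using the numbers quoted from the serve log and the dataset description above:

```python
# Back-of-envelope check that MTBench fits comfortably in the KV cache.
kv_cache_capacity_tokens = 410_128   # from "GPU KV cache size: 410,128 tokens"
num_prompts = 80                     # MTBench prompts
total_context_tokens = 8_000         # approx. context tokens in the dataset
generated_per_prompt = 100           # tokens generated per prompt

tokens_in_block_pool = total_context_tokens + num_prompts * generated_per_prompt
print(tokens_in_block_pool)                              # 16000
print(tokens_in_block_pool < kv_cache_capacity_tokens)   # True -> no eviction expected
```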
However:
1. TTFT has two components: (a) prefill and (b) scheduler/sampler/other overhead. Since BS 4 repeat run is almost the same as BS 4 first run, (b) must be the dominating factor in this setup.
2. The overhead of EAGLE coming from (b) is ~30% higher than the baseline.
Point 2 becomes more important for longer contexts when the prompt is already cached. In those cases, prefix caching benefits the model and the remaining TTFT is mostly (b), so EAGLE in that scenario will see TTFT 30-40% higher than vanilla.
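To illustrate point 2 with hypothetical numbers (not measurements): writing TTFT = prefill + overhead, the EAGLE/vanilla TTFT ratio tends toward the pure overhead ratio as prefix caching shrinks the prefill term:

```python
# Hypothetical illustration of point 2: TTFT = prefill + overhead.
# As prefix caching shrinks the prefill term, the EAGLE/vanilla TTFT ratio
# approaches the ratio of the per-step overheads (b). Numbers are made up.
overhead_baseline_ms = 12.0   # hypothetical scheduler/sampler/other overhead, vanilla
overhead_eagle_ms = 16.0      # hypothetical overhead with EAGLE (~30% higher)

for prefill_ms in (200.0, 50.0, 10.0, 0.0):  # prefill cost shrinking with better cache hits
    ttft_baseline = prefill_ms + overhead_baseline_ms
    ttft_eagle = prefill_ms + overhead_eagle_ms
    print(f"prefill={prefill_ms:6.1f} ms -> eagle/baseline TTFT ratio = "
          f"{ttft_eagle / ttft_baseline:.2f}")
```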
Commands
```bash
# vanilla
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --port 9001

# eagle
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --port 9001 --speculative_config '{"method": "eagle","model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2}'

# bench
python3 benchmarks/benchmark_serving.py --port 9001 --save-result --backend vllm --model meta-llama/Llama-3.1-8B-Instruct --endpoint /v1/completions --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --max-concurrency 4
```
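And a small sketch for comparing the two saved results, assuming the JSON written by --save-result exposes a mean_ttft_ms field (the key name may differ across vLLM versions):

```python
# Sketch for comparing two saved benchmark results, e.g.:
#   python3 compare_ttft.py baseline.json eagle.json   (hypothetical file names)
# Assumes the JSON files written by --save-result expose "mean_ttft_ms";
# adjust the key if your version names it differently.
import json
import sys

def mean_ttft(path: str) -> float:
    with open(path) as f:
        return json.load(f)["mean_ttft_ms"]

baseline, eagle = mean_ttft(sys.argv[1]), mean_ttft(sys.argv[2])
print(f"baseline: {baseline:.2f} ms, eagle: {eagle:.2f} ms, "
      f"degradation: {eagle / baseline:.2f}x")
```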
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.