Log how much time loading a compiled artifact takes
When people say they don't like the "torch.compile startup time", they
mean one of two things:
1) the cold start time
2) the warm start time (when the vLLM disk cache has already been
populated).
We had logging for (1) but not for (2). This PR adds logging for (2).
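As a rough sketch of the approach, the warm-start path can wrap the cache load in a timer and emit the elapsed time via the module logger. Note this is a simplified illustration, not vLLM's actual code: `load_compiled_graph` below is a hypothetical placeholder for the real artifact loader in `backends.py`.

```python
import logging
import time

logger = logging.getLogger("vllm.compilation")


def load_compiled_graph(cache_dir, shape):
    # Hypothetical placeholder standing in for the real cache loader.
    return {"cache_dir": cache_dir, "shape": shape}


def load_with_timing(cache_dir, shape=None):
    """Load a compiled artifact from the disk cache and log how long it took."""
    start = time.perf_counter()
    artifact = load_compiled_graph(cache_dir, shape)
    elapsed = time.perf_counter() - start
    logger.info(
        "Directly load the compiled graph(s) for shape %s from the cache, "
        "took %.3f s", shape, elapsed)
    return artifact, elapsed
```

The `%.3f s` formatting matches the style of the log line shown in the test plan below.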
Test Plan:
I ran `VLLM_USE_V1=1 python benchmark_latency.py --model meta-llama/Meta-Llama-3-8B --batch-size 1 -O '{"level": 3, "compile_sizes": {1, 2}}'`
And observed the following logs:
```
INFO 04-18 08:26:11 [backends.py:431] Dynamo bytecode transform time: 5.03 s
INFO 04-18 08:26:15 [backends.py:120] Directly load the compiled graph(s) for shape None from the cache, took 4.190 s
INFO 04-18 08:26:18 [kv_cache_utils.py:634] GPU KV cache size: 532,032 tokens
```
Side note: it's probably not good that loading from the cache takes ~4
seconds; that may be worth investigating separately.
Signed-off-by: rzou <[email protected]>