feat: Report more vllm metrics #92
Open. Pavloveuge wants to merge 12 commits into triton-inference-server:main from Pavloveuge:report_more_vllm_metric (+228 −1).

Commits (the diff below shows changes from 3 of the 12 commits):
- b87a1cd: add new metrics reporting
- 1a1b66e: update readme
- 7a0096e: update test
- 41a3919: Update src/utils/metrics.py
- c5f9751: Update src/utils/metrics.py
- 6ac6108: Update README.md
- 221a1c1: Update ci/L0_backend_vllm/metrics_test/vllm_metrics_test.py
- f419648: Update README.md
- 6896178: Update src/utils/metrics.py
- 93d8895: move gauges before counters
- 6be53bd: revert suggested change
- 9b96279: remove deprecated metrics, fix namings
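Several of these commits touch src/utils/metrics.py, which is where vLLM engine stats are surfaced as Triton custom metrics. The snippet below is a minimal illustrative sketch of that general pattern using Triton's Python-backend pb_utils metrics API; it is not this PR's actual implementation, and the stats field names used here are assumptions. In this pattern, counters are incremented with per-interval deltas while gauges are overwritten with the latest snapshot value.

```python
# Illustrative sketch only -- runs inside Triton's Python backend, where
# triton_python_backend_utils is provided by the server at runtime.
import triton_python_backend_utils as pb_utils


class VllmStatLogger:
    """Hypothetical helper that mirrors a subset of vLLM stats as Triton metrics."""

    def __init__(self, labels):
        # Counter: cumulative number of preemptions reported by the engine.
        self.preemption_family = pb_utils.MetricFamily(
            name="vllm:num_preemptions_total",
            description="Number of preemption tokens processed.",
            kind=pb_utils.MetricFamily.COUNTER,
        )
        self.counter_preemption_tokens = self.preemption_family.Metric(labels=labels)

        # Gauge: current GPU KV-cache usage (1.0 means 100 percent usage).
        self.gpu_cache_family = pb_utils.MetricFamily(
            name="vllm:gpu_cache_usage_perc",
            description="GPU KV-cache usage. 1 means 100 percent usage.",
            kind=pb_utils.MetricFamily.GAUGE,
        )
        self.gauge_gpu_cache_usage = self.gpu_cache_family.Metric(labels=labels)

    def log(self, stats):
        # `stats` stands in for a vLLM Stats object; the attribute names below
        # are assumptions for illustration, not the real field names.
        self.counter_preemption_tokens.increment(stats.num_preemptions)
        self.gauge_gpu_cache_usage.set(stats.gpu_cache_usage)
```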
README.md
@@ -227,12 +227,24 @@ VLLM stats are reported by the metrics endpoint in fields that are prefixed with
counter_prompt_tokens
# Number of generation tokens processed.
counter_generation_tokens
# Number of preemption tokens processed.
counter_preemption_tokens
# Histogram of number of tokens per engine_step.
histogram_iteration_tokens
# Histogram of time to first token in seconds.
histogram_time_to_first_token
# Histogram of time per output token in seconds.
histogram_time_per_output_token
# Histogram of end to end request latency in seconds.
histogram_e2e_time_request
# Histogram of time spent in WAITING phase for request.
histogram_queue_time_request
# Histogram of time spent in RUNNING phase for request.
histogram_inference_time_request
# Histogram of time spent in PREFILL phase for request.
histogram_prefill_time_request
# Histogram of time spent in DECODE phase for request.
histogram_decode_time_request
# Number of prefill tokens processed.
histogram_num_prompt_tokens_request
# Number of generation tokens processed.
@@ -241,6 +253,20 @@ histogram_num_generation_tokens_request
histogram_best_of_request
# Histogram of the n request parameter.
histogram_n_request
# Number of requests currently running on GPU.
gauge_scheduler_running
# Number of requests waiting to be processed.
gauge_scheduler_waiting
# Number of requests swapped to CPU.
gauge_scheduler_swapped
# GPU KV-cache usage. 1 means 100 percent usage.
gauge_gpu_cache_usage
# CPU KV-cache usage. 1 means 100 percent usage.
gauge_cpu_cache_usage
# CPU prefix cache block hit rate.
gauge_cpu_prefix_cache_hit_rate
# GPU prefix cache block hit rate.
gauge_gpu_prefix_cache_hit_rate
```
Your output for these fields should look similar to the following:
```bash
@@ -250,6 +276,37 @@ vllm:prompt_tokens_total{model="vllm_model",version="1"} 10
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model="vllm_model",version="1"} 16
# HELP vllm:num_preemptions_total Number of preemption tokens processed.
# TYPE vllm:num_preemptions_total counter
vllm:num_preemptions_total{model="vllm_model",version="1"} 0
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model="vllm_model",version="1"} 0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model="vllm_model",version="1"} 0
# HELP vllm:num_requests_swapped Number of requests swapped to CPU.
# TYPE vllm:num_requests_swapped gauge
vllm:num_requests_swapped{model="vllm_model",version="1"} 0
# HELP vllm:gpu_cache_usage_perc Gauge of gpu cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model="vllm_model",version="1"} 0
# HELP vllm:cpu_cache_usage_perc Gauge of cpu cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
vllm:cpu_cache_usage_perc{model="vllm_model",version="1"} 0
# HELP vllm:cpu_prefix_cache_hit_rate CPU prefix cache block hit rate.
# TYPE vllm:cpu_prefix_cache_hit_rate gauge
vllm:cpu_prefix_cache_hit_rate{model="vllm_model",version="1"} -1
# HELP vllm:gpu_prefix_cache_hit_rate GPU prefix cache block hit rate.
# TYPE vllm:gpu_prefix_cache_hit_rate gauge
vllm:gpu_prefix_cache_hit_rate{model="vllm_model",version="1"} -1
# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
# TYPE vllm:iteration_tokens_total histogram
vllm:iteration_tokens_total_count{model="vllm_model",version="1"} 10
vllm:iteration_tokens_total_sum{model="vllm_model",version="1"} 12
vllm:iteration_tokens_total_bucket{model="vllm_model",version="1",le="1"} 9
...
vllm:iteration_tokens_total_bucket{model="vllm_model",version="1",le="+Inf"} 10
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_count{model="vllm_model",version="1"} 1
@@ -271,6 +328,34 @@ vllm:e2e_request_latency_seconds_sum{model="vllm_model",version="1"} 0.086861848
vllm:e2e_request_latency_seconds_bucket{model="vllm_model",version="1",le="1"} 1
...
vllm:e2e_request_latency_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_queue_time_seconds Histogram of time spent in WAITING phase for request.
# TYPE vllm:request_queue_time_seconds histogram
vllm:request_queue_time_seconds_count{model="vllm_model",version="1"} 1
vllm:request_queue_time_seconds_sum{model="vllm_model",version="1"} 0.0045166015625
vllm:request_queue_time_seconds_bucket{model="vllm_model",version="1",le="1"} 1
...
vllm:request_queue_time_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_inference_time_seconds Histogram of time spent in RUNNING phase for request.
# TYPE vllm:request_inference_time_seconds histogram
vllm:request_inference_time_seconds_count{model="vllm_model",version="1"} 1
vllm:request_inference_time_seconds_sum{model="vllm_model",version="1"} 0.1418392658233643
vllm:request_inference_time_seconds_bucket{model="vllm_model",version="1",le="1"} 1
...
vllm:request_inference_time_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_prefill_time_seconds Histogram of time spent in PREFILL phase for request.
# TYPE vllm:request_prefill_time_seconds histogram
vllm:request_prefill_time_seconds_count{model="vllm_model",version="1"} 1
vllm:request_prefill_time_seconds_sum{model="vllm_model",version="1"} 0.05302977561950684
vllm:request_prefill_time_seconds_bucket{model="vllm_model",version="1",le="1"} 1
...
vllm:request_prefill_time_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_decode_time_seconds Histogram of time spent in DECODE phase for request.
# TYPE vllm:request_decode_time_seconds histogram
vllm:request_decode_time_seconds_count{model="vllm_model",version="1"} 1
vllm:request_decode_time_seconds_sum{model="vllm_model",version="1"} 0.08880949020385742
vllm:request_decode_time_seconds_bucket{model="vllm_model",version="1",le="1"} 1
...
vllm:request_decode_time_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_prompt_tokens Number of prefill tokens processed.
# TYPE vllm:request_prompt_tokens histogram
vllm:request_prompt_tokens_count{model="vllm_model",version="1"} 1
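Example output like the above is served from Triton's Prometheus metrics endpoint. The sketch below is one minimal way to pull only the vllm:-prefixed lines; it assumes a locally running server with Triton's default metrics port (8002).

```python
# Minimal sketch: scrape Triton's metrics endpoint and keep vllm:* lines.
# Assumes Triton is running locally with the default metrics port (8002).
import requests


def fetch_vllm_metrics(url: str = "http://localhost:8002/metrics") -> list[str]:
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    # Keep HELP/TYPE comments and samples that belong to vllm:* families.
    return [line for line in response.text.splitlines() if "vllm:" in line]


if __name__ == "__main__":
    for line in fetch_vllm_metrics():
        print(line)
```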
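For the histogram families, the _count and _sum samples alone are enough to derive simple averages (for example, mean time to first token) without inspecting the buckets. The following is a small parsing sketch over text in the format shown above; the family name in the usage comment is taken from the example output.

```python
# Sketch: compute the mean of a Prometheus histogram from its _sum and _count
# samples in the plain-text exposition format shown above.
from typing import Optional


def histogram_mean(metrics_text: str, family: str) -> Optional[float]:
    """Return sum/count for one histogram family, or None if it is absent/empty."""
    total = None
    count = None
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        if line.startswith(f"{family}_sum{{"):
            total = float(line.rsplit(" ", 1)[1])
        elif line.startswith(f"{family}_count{{"):
            count = float(line.rsplit(" ", 1)[1])
    if total is None or not count:
        return None
    return total / count


# Example usage with the fetch helper sketched earlier:
# text = "\n".join(fetch_vllm_metrics())
# print(histogram_mean(text, "vllm:time_to_first_token_seconds"))
```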