Conversation

@luccafong
Collaborator

@luccafong luccafong commented Sep 6, 2025

Purpose

  • Support a customized global logger for data parallel (DP). This allows users to provide custom global loggers that aggregate metrics across multiple DP instances (see the sketch below).
  • Update the multi-host DP instance example with customized loggers.
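
Roughly, a custom aggregated logger is constructed with the engine indices it covers and combines per-engine stats before emitting a single line. A minimal sketch follows, assuming the hook names (record, log, log_engine_initialized) and the num_generation_tokens field match the v1 interface discussed in this thread; it is illustrative only, not the exact merged API:

from vllm.config import VllmConfig
from vllm.v1.metrics.loggers import StatLoggerBase

class SumAcrossEnginesLogger(StatLoggerBase):
    """Toy aggregated logger: totals generation tokens across DP engines."""

    def __init__(self, vllm_config: VllmConfig, engine_idxs: list[int]):
        self.engine_idxs = engine_idxs
        # One running counter per DP engine rank.
        self.generation_tokens = dict.fromkeys(engine_idxs, 0)

    def record(self, scheduler_stats, iteration_stats, engine_idx: int = 0):
        # iteration_stats is None on idle intervals; num_generation_tokens
        # is an assumed field name here.
        if iteration_stats is not None:
            self.generation_tokens[engine_idx] += (
                iteration_stats.num_generation_tokens)

    def log(self):
        # Emit one combined line instead of one line per engine.
        total = sum(self.generation_tokens.values())
        print(f"{len(self.engine_idxs)} engines aggregated: "
              f"{total} generation tokens so far")

    def log_engine_initialized(self):
        pass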

Test Plan

  • Unit tests
  • Verify in online e2e example that both per-engine and global loggers work correctly
  • Verify with e2e online serve and benchmark with --aggregate-engine-logging enabled
  • Verify with e2e online serve and benchmark that the default behavior is unchanged

Test Result

  • Unit tests pass
python -m pytest tests/v1/engine/test_async_llm.py::test_customize_aggregated_loggers -v
============================= 1 passed, 3 warnings in 28.54s ==============================

python -m pytest tests/v1/engine/test_async_llm.py::test_customize_loggers -v
============================= 1 passed, 3 warnings in 24.11s ==============================

  • e2e test with online examples:
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=WARN python3 /data/users/fanglu/gitrepos/vllm/examples/online_serving/multi_instance_data_parallel.py
CUDA_VISIBLE_DEVICES=1  vllm serve ibm-research/PowerMoE-3b -dp 2 -dpr 1         --data-parallel-address 127.0.0.1 --data-parallel-rpc-port 62300 --max-model-len 2048 --max-num-batched-tokens 2048          --data-parallel-size-local 1 --enforce-eager --headless

logs:

DEBUG 09-22 12:58:01 [v1/metrics/loggers.py:136] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
DEBUG 09-22 12:58:01 [v1/metrics/loggers.py:136] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 12:58:01 [v1/metrics/loggers.py:136] Engine 001: Avg prompt throughput: 4.4 tokens/s, Avg generation throughput: 36.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 12:58:01 [v1/metrics/loggers.py:136] Engine 001: Avg prompt throughput: 4.4 tokens/s, Avg generation throughput: 36.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 09-22 12:58:01 [v1/metrics/loggers.py:190] 2 Engines Aggregated: Avg prompt throughput: 4.4 tokens/s, Avg generation throughput: 36.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

  • e2e test with serving

Add option --aggregate-engine-logging to enable aggregated log stats.

serve

vllm serve "/data/local/models/DeepSeek-V3.1" --max_model_len=4096 --gpu_memory_utilization=0.9 --tensor_parallel_size 1 --data_parallel_size 8 --enable_expert_parallel  --max_num_seqs=256  --aggregate-engine-logging --enforce-eager > /tmp/test_serving.log 2>&1 &

benchmark

vllm bench serve  --model "/data/local/models/DeepSeek-V3.1" --dataset-name random  --random-input-len 2048 --random-output-len 1024  --num-prompts 400 --host 127.0.0.1 --port 8000 --ignore-eos --random-range-ratio 0 --max-concurrency 200 --ready-check-timeout-sec 0

Logs:

(APIServer pid=735328) INFO 10-13 11:43:03 [loggers.py:181] 8 Engines Aggregated: Avg prompt throughput: 13106.5 tokens/s, Avg generation throughput: 28.8 tokens/s, Running: 168 reqs, Waiting: 32 reqs, GPU KV cache usage: 17.6%, Prefix cache hit rate: 0.0%
(APIServer pid=735328) INFO 10-13 11:43:13 [loggers.py:181] 8 Engines Aggregated: Avg prompt throughput: 8184.7 tokens/s, Avg generation throughput: 1318.0 tokens/s, Running: 200 reqs, Waiting: 0 reqs, GPU KV cache usage: 21.8%, Prefix cache hit rate: 0.0%
(APIServer pid=735328) INFO 10-13 11:43:23 [loggers.py:181] 8 Engines Aggregated: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1339.9 tokens/s, Running: 200 reqs, Waiting: 0 reqs, GPU KV cache usage: 22.4%, Prefix cache hit rate: 0.0%
(APIServer pid=735328) INFO 10-13 11:43:33 [loggers.py:181] 8 Engines Aggregated: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1340.0 tokens/s, Running: 200 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.1%, Prefix cache hit rate: 0.0%

  • e2e test with serving with default settings
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 000: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 4.2 tokens/s, Running: 21 reqs, Waiting: 0 reqs, GPU KV cache usage: 17.6%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 001: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 4.2 tokens/s, Running: 21 reqs, Waiting: 0 reqs, GPU KV cache usage: 17.6%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 002: Avg prompt throughput: 1843.1 tokens/s, Avg generation throughput: 4.3 tokens/s, Running: 21 reqs, Waiting: 0 reqs, GPU KV cache usage: 17.6%, Prefix cache hit rate: 1.9%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 003: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 20 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.8%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 004: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 20 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.8%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 005: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 3.8 tokens/s, Running: 19 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 006: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 3.8 tokens/s, Running: 19 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1440812) INFO 10-13 11:55:31 [loggers.py:181] Engine 007: Avg prompt throughput: 1638.3 tokens/s, Avg generation throughput: 3.8 tokens/s, Running: 19 reqs, Waiting: 0 reqs, GPU KV cache usage: 16.0%, Prefix cache hit rate: 0.0%

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.

@mergify mergify bot added documentation Improvements or additions to documentation v1 labels Sep 6, 2025
@facebook-github-bot

@luccafong has imported this pull request. If you are a Meta employee, you can view this in D81832764.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for a customized global logger for data parallel (DP) serving, which is a valuable addition for monitoring distributed deployments. The changes primarily affect the V1 engine, with corresponding updates to examples and tests. My review has identified a few issues: a bug in the updated example code where a request ID is not unique within a loop, a leftover debug print statement in the new logger logic, and an unused parameter in the V0 engine's API that could cause confusion. Addressing these points will improve the quality and correctness of the implementation.

@facebook-github-bot

@luccafong has imported this pull request. If you are a Meta employee, you can view this in D81832764.

@luccafong luccafong force-pushed the support_global_dp_logging branch from cb2d450 to 893ea0f Compare September 6, 2025 01:08
@facebook-github-bot

@luccafong has imported this pull request. If you are a Meta employee, you can view this in D81832764.

@luccafong luccafong force-pushed the support_global_dp_logging branch 2 times, most recently from cb2d450 to 7bfddba Compare September 6, 2025 01:17
@facebook-github-bot

@luccafong has imported this pull request. If you are a Meta employee, you can view this in D81832764.

@mergify

mergify bot commented Sep 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @luccafong.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 6, 2025
@luccafong luccafong force-pushed the support_global_dp_logging branch from 7bfddba to 8156746 Compare September 6, 2025 01:21
@mergify mergify bot removed the needs-rebase label Sep 6, 2025
@luccafong luccafong force-pushed the support_global_dp_logging branch 2 times, most recently from 8633cf4 to 562f552 Compare September 8, 2025 23:58
# Each EngineCore's metrics are expressed as a unique label.
self.prometheus_logger = prometheus_factory(vllm_config, engine_idxs)
self.global_logger: Optional[StatLoggerBase] = None
Collaborator

let's name this as aggregated_logger?

Member

Perhaps better to type this as GlobalStatLoggerBase

@@ -145,6 +156,63 @@ def log_engine_initialized(self):
self.vllm_config.cache_config.num_gpu_blocks)


class GlobalStatLogger(LoggingStatLogger, GlobalStatLoggerBase):
Collaborator

AggregatedLogger is a better name.

Collaborator

Add some comments to explain the difference?
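
For instance, a docstring along these lines would make the distinction explicit (the wording is a sketch, not text from the PR):

class GlobalStatLogger(LoggingStatLogger, GlobalStatLoggerBase):
    """Aggregated variant of LoggingStatLogger.

    LoggingStatLogger emits one log line per DP engine (e.g.
    "Engine 000: ..."), whereas this logger accumulates the stats
    reported by every DP engine and emits a single combined line
    per logging interval (e.g. "2 Engines Aggregated: ...").
    """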

@@ -222,6 +230,7 @@ def from_engine_args(
start_engine_loop: bool = True,
usage_context: UsageContext = UsageContext.ENGINE_CONTEXT,
stat_loggers: Optional[list[StatLoggerFactory]] = None,
stat_logger_global: Optional[GlobalStatLoggerFactory] = None,
Collaborator

I think we should rename these to per_engine_loggers and aggregated_logger.

Member

@njhill njhill left a comment

Thanks @luccafong!

I think this is ok, but there's probably a cleaner way of organizing the interface overall (even the current state isn't great).

For example, I think it would be reasonable to just change the interface to be global/aggregated for all loggers, and modify the default LoggingStatLogger to have two modes - aggregated or per-engine. Need to think about how that would actually be configured... I guess it could be an env var.

Then we also wouldn't need a special case/field for the PrometheusLogger.

A more annoying problem is the multi-api-server case where we currently disable the logging logger, but that's kind of orthogonal to this I guess.

(btw my comments below might not be applicable in the same way if we refactor per above suggestion).

# Each EngineCore's metrics are expressed as a unique label.
self.prometheus_logger = prometheus_factory(vllm_config, engine_idxs)
self.global_logger: Optional[StatLoggerBase] = None
Member

Perhaps type here better as GlobalStatLoggerBase

Comment on lines 172 to 188
now = time.monotonic()
prompt_throughput = self._get_throughput(self.num_prompt_tokens, now)
generation_throughput = self._get_throughput(
self.num_generation_tokens, now)

self._reset(now)

scheduler_stats = self.last_scheduler_stats

log_fn = logger.info
if not any(
(prompt_throughput, generation_throughput,
self.last_prompt_throughput, self.last_generation_throughput)):
# Avoid log noise on an idle production system
log_fn = logger.debug
self.last_generation_throughput = generation_throughput
self.last_prompt_throughput = prompt_throughput
Member

Could we restructure so that this isn't duplicated with the superclass?
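
One way to do that (a sketch under assumed names, not the merged refactor) is to pull the shared computation into a helper on the superclass that both the per-engine and the aggregated log() implementations call:

# Hypothetical helper on LoggingStatLogger, replacing the duplicated
# block above; each log() override would call it and then format its
# own message.
def _compute_log_state(self):
    now = time.monotonic()
    prompt_tput = self._get_throughput(self.num_prompt_tokens, now)
    gen_tput = self._get_throughput(self.num_generation_tokens, now)
    self._reset(now)

    # Downgrade to DEBUG on an idle system to avoid log noise.
    log_fn = logger.info
    if not any((prompt_tput, gen_tput, self.last_prompt_throughput,
                self.last_generation_throughput)):
        log_fn = logger.debug
    self.last_prompt_throughput = prompt_tput
    self.last_generation_throughput = gen_tput
    return prompt_tput, gen_tput, self.last_scheduler_stats, log_fn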

@luccafong
Collaborator Author

luccafong commented Sep 19, 2025

Thanks @luccafong!

I think this is ok, but there's probably a cleaner way of organizing the interface overall (even the current state isn't great).

For example, I think it would be reasonable to just change the interface to be global/aggregated for all loggers, and modify the default LoggingStatLogger to have two modes - aggregated or per-engine. Need to think about how that would actually be configured... I guess it could be an env var.

Then we also wouldn't need a special case/field for the PrometheusLogger.

A more annoying problem is the multi-api-server case where we currently disable the logging logger, but that's kind of orthogonal to this I guess.

(btw my comments below might not be applicable in the same way if we refactor per above suggestion).

Thanks @njhill for the review and suggestions.

Good point about providing an aggregation option. I am wondering, if that is enabled, is it weird that we have multiple loggers? Or should we skip logging for non-zero DP engines in that mode? I am also not sure how the refactoring would impact current use cases that rely on PrometheusLogger.

@njhill
Member

njhill commented Sep 19, 2025

Good point about providing an aggregation option. I am wondering, if that is enabled, is it weird that we have multiple loggers? Or should we skip logging for non-zero DP engines in that mode? I am also not sure how the refactoring would impact current use cases that rely on PrometheusLogger.

Re PrometheusLogger... this is already a "global" one, right? So it would just mean we can include it in the list as one of the default ones and not have a separate field.

Thinking more, maybe we can keep the global stats logger abstract subclass after all (but I agree that it would be better to name it AggregateStatsLogger). I don't think we need a new arg for this; we can just check the type of the passed-in stat loggers.

And then have a simple adapter: if a provided stats logger is not an AggregateStatsLogger, wrap it in an adapter that converts non-aggregate into aggregate (it just contains a dict with n instances). Then our internal field can just be a list of AggregateStatsLoggers and we invoke them all in the same way. PrometheusLogger would just be updated to extend AggregateStatsLogger.

And your concrete aggregate logging implementation could be named e.g. AggregateLoggingStatsLogger.

I hope that makes sense!
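
A minimal sketch of that adapter idea (AggregateStatsLogger is the name suggested above; the wrapper below is illustrative, not the merged code):

class PerEngineToAggregateAdapter(AggregateStatsLogger):
    """Wraps a per-engine stats logger so it satisfies the aggregate
    interface: keep one delegate per DP engine and fan records out
    to the matching instance."""

    def __init__(self, factory, vllm_config, engine_idxs):
        # One underlying per-engine logger per DP rank.
        self.per_engine = {idx: factory(vllm_config, idx)
                           for idx in engine_idxs}

    def record(self, scheduler_stats, iteration_stats, engine_idx=0):
        self.per_engine[engine_idx].record(scheduler_stats,
                                           iteration_stats)

    def log(self):
        for delegate in self.per_engine.values():
            delegate.log()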

@mergify mergify bot added the needs-rebase label Oct 10, 2025
@luccafong luccafong force-pushed the support_global_dp_logging branch from 5907d05 to 016eacd Compare October 10, 2025 23:51
@mergify mergify bot removed the needs-rebase label Oct 10, 2025
Member

@njhill njhill left a comment

Thanks @luccafong

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 11, 2025
@njhill
Member

njhill commented Oct 13, 2025

@luccafong it looks like there's a test that needs updating: https://buildkite.com/vllm/ci/builds/34468#0199d15b-035d-435e-a030-a453e91e436e

and will need another rebase now that all those formatting changes have been made :(

@mergify

mergify bot commented Oct 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @luccafong.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 13, 2025
@luccafong luccafong force-pushed the support_global_dp_logging branch from 5484535 to ce5c02f Compare October 13, 2025 18:56
@mergify mergify bot removed the needs-rebase label Oct 13, 2025
Signed-off-by: Lu Fang <[email protected]>

fix the test

Signed-off-by: Lu Fang <[email protected]>

address comments

Signed-off-by: Lu Fang <[email protected]>

add aggregator interface and abstract common logic

Signed-off-by: Lu Fang <[email protected]>

add corrupted request aggregation

Signed-off-by: Lu Fang <[email protected]>

more refactor

Signed-off-by: Lu Fang <[email protected]>

fix ut

Signed-off-by: Lu Fang <[email protected]>

fix kv_connector_logging

Signed-off-by: Lu Fang <[email protected]>

fix merge conflicts

Signed-off-by: Lu Fang <[email protected]>

fix lint

Signed-off-by: Lu Fang <[email protected]>

address comments

Signed-off-by: Lu Fang <[email protected]>

address comments

Signed-off-by: Lu Fang <[email protected]>

address comments

Signed-off-by: Lu Fang <[email protected]>

address comment

Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
@luccafong luccafong force-pushed the support_global_dp_logging branch from 7ab416c to 554625d Compare October 13, 2025 20:44
@luccafong luccafong merged commit 8317f72 into vllm-project:main Oct 14, 2025
51 checks passed
1994 pushed a commit to 1994/vllm that referenced this pull request Oct 14, 2025
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
Zhathw pushed a commit to Zhathw/vllm that referenced this pull request Nov 12, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

documentation (Improvements or additions to documentation), frontend, ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

None yet

4 participants