-
-
Notifications
You must be signed in to change notification settings - Fork 11.8k
[Misc][DP] support customized aggregated logger for dp #24354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Misc][DP] support customized aggregated logger for dp #24354
Conversation
|
@luccafong has imported this pull request. If you are a Meta employee, you can view this in D81832764. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces support for a customized global logger for data parallel (DP) training, which is a valuable addition for monitoring distributed training. The changes primarily affect the V1 engine, with corresponding updates to examples and tests. My review has identified a few issues: a bug in the updated example code where a request ID is not unique within a loop, a leftover debug print statement in the new logger logic, and an unused parameter in the V0 engine's API that could cause confusion. Addressing these points will improve the quality and correctness of the implementation.
|
@luccafong has imported this pull request. If you are a Meta employee, you can view this in D81832764. |
cb2d450 to
893ea0f
Compare
|
@luccafong has imported this pull request. If you are a Meta employee, you can view this in D81832764. |
cb2d450 to
7bfddba
Compare
|
@luccafong has imported this pull request. If you are a Meta employee, you can view this in D81832764. |
|
This pull request has merge conflicts that must be resolved before it can be |
7bfddba to
8156746
Compare
8633cf4 to
562f552
Compare
vllm/v1/metrics/loggers.py
Outdated
| # Each EngineCore's metrics are expressed as a unique label. | ||
| self.prometheus_logger = prometheus_factory(vllm_config, engine_idxs) | ||
| self.global_logger: Optional[StatLoggerBase] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's name this as aggregated_logger?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps type here better as GlobalStatLoggerBase
vllm/v1/metrics/loggers.py
Outdated
| @@ -145,6 +156,63 @@ def log_engine_initialized(self): | |||
| self.vllm_config.cache_config.num_gpu_blocks) | |||
|
|
|||
|
|
|||
| class GlobalStatLogger(LoggingStatLogger, GlobalStatLoggerBase): | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AggregatedLogger is a better name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add some comments to explain the difference?
vllm/v1/engine/async_llm.py
Outdated
| @@ -222,6 +230,7 @@ def from_engine_args( | |||
| start_engine_loop: bool = True, | |||
| usage_context: UsageContext = UsageContext.ENGINE_CONTEXT, | |||
| stat_loggers: Optional[list[StatLoggerFactory]] = None, | |||
| stat_logger_global: Optional[GlobalStatLoggerFactory] = None, | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should rename this per_engine_loggers and aggregated_logger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @luccafong!
I think this is ok but I think there's probably a cleaner way of organizing the interface overall (even the current state isn't great).
For example, I think it would be reasonable to just change the interface to be global/aggregated for all loggers, and modify the default LoggingStatLogger to have two modes - aggregated or per engine. Need to think how that would actually be configured.. I guess it could be an env var.
Then we also wouldn't need a special case/field for the PrometheusLogger.
A more annoying problem is the multi-api-server case where we currently disable the logging logger, but that's kind of orthogonal to this I guess.
(btw my comments below might not be applicable in the same way if we refactor per above suggestion).
vllm/v1/metrics/loggers.py
Outdated
| # Each EngineCore's metrics are expressed as a unique label. | ||
| self.prometheus_logger = prometheus_factory(vllm_config, engine_idxs) | ||
| self.global_logger: Optional[StatLoggerBase] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps type here better as GlobalStatLoggerBase
vllm/v1/metrics/loggers.py
Outdated
| now = time.monotonic() | ||
| prompt_throughput = self._get_throughput(self.num_prompt_tokens, now) | ||
| generation_throughput = self._get_throughput( | ||
| self.num_generation_tokens, now) | ||
|
|
||
| self._reset(now) | ||
|
|
||
| scheduler_stats = self.last_scheduler_stats | ||
|
|
||
| log_fn = logger.info | ||
| if not any( | ||
| (prompt_throughput, generation_throughput, | ||
| self.last_prompt_throughput, self.last_generation_throughput)): | ||
| # Avoid log noise on an idle production system | ||
| log_fn = logger.debug | ||
| self.last_generation_throughput = generation_throughput | ||
| self.last_prompt_throughput = prompt_throughput |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we restructure so that this isn't duplicated with the superclass?
thanks @njhill for the review and suggestions. Good point to provide an aggregation option, I am wondering if this enabled, is it weird we have multiple logger? or we skip logging for non zero DP engine in that mode ? I am also not sure how refactoring impact current use cases who rely on PrometheusLogger. |
Re PrometheusLogger... this is already a "global" one right? so it would just mean we can include that in the list as one of the default ones and not have a separate field. Thinking more, maybe we can keep the global stats logger abstract subclass after all (but I agree that it would be better to name it And then have a simple adapter - if a provided stats logger is not an AggregateStatsLogger, wrap it in this adapter which converts non-aggregate into aggregate (just contains dict with n instances). Then our internal field can just be a list of AggregateStatsLoggers and we invoke them all in the same way. PrometheusLogger would just be updated to extend AggregateStatsLogger. And your concrete impl of the aggregate logging one we could name e.g. I hope that makes sense! |
5907d05 to
016eacd
Compare
njhill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @luccafong
|
@luccafong it looks like there's a test that needs updating: https://buildkite.com/vllm/ci/builds/34468#0199d15b-035d-435e-a030-a453e91e436e and will need another rebase now that all those formatting changes have been made :( |
|
This pull request has merge conflicts that must be resolved before it can be |
5484535 to
ce5c02f
Compare
Signed-off-by: Lu Fang <[email protected]> fix the test Signed-off-by: Lu Fang <[email protected]> address comments Signed-off-by: Lu Fang <[email protected]> add aggregator interface and abstract common logic Signed-off-by: Lu Fang <[email protected]> add corrupted request aggregation Signed-off-by: Lu Fang <[email protected]> more refactor Signed-off-by: Lu Fang <[email protected]> fix ut Signed-off-by: Lu Fang <[email protected]> fix kv_connector_logging Signed-off-by: Lu Fang <[email protected]> fix merge conflicts Signed-off-by: Lu Fang <[email protected]> fix lint Signed-off-by: Lu Fang <[email protected]> address comments Signed-off-by: Lu Fang <[email protected]> address comments Signed-off-by: Lu Fang <[email protected]> address comments Signed-off-by: Lu Fang <[email protected]> address commnet Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
7ab416c to
554625d
Compare
…24354) Signed-off-by: Lu Fang <[email protected]> Signed-off-by: 1994 <[email protected]>
…24354) Signed-off-by: Lu Fang <[email protected]> Signed-off-by: Dhruvil Bhatt <[email protected]>
…24354) Signed-off-by: Lu Fang <[email protected]> Signed-off-by: bbartels <[email protected]>
…24354) Signed-off-by: Lu Fang <[email protected]>
…24354) Signed-off-by: Lu Fang <[email protected]>
…24354) Signed-off-by: Lu Fang <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
…24354) Signed-off-by: Lu Fang <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
…24354) Signed-off-by: Lu Fang <[email protected]> Signed-off-by: 0xrushi <[email protected]>
…24354) Signed-off-by: Lu Fang <[email protected]> Signed-off-by: 0xrushi <[email protected]>
…24354) Signed-off-by: Lu Fang <[email protected]>
…24354) Signed-off-by: Lu Fang <[email protected]>
…24354) Signed-off-by: Lu Fang <[email protected]>
Purpose
Test Plan
-aggregate-engine-loggingTest Result
logs:
Add option
--aggregate-engine-loggingto enable aggregated log stats.serve
benchmark
Logs:
Essential Elements of an Effective PR Description Checklist