[V1] feat: add engine v1 tracing #20372
Conversation
Signed-off-by: Mu Huai <[email protected]>
Summary of Changes
Hello @RichardoMrMu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces comprehensive OpenTelemetry tracing capabilities to the vLLM V1 engine. It enables the capture and propagation of tracing context throughout the request lifecycle and generates detailed spans with granular performance metrics, providing deeper insights into request processing times, including queueing, prefill, decode, and overall inference.
Highlights
- V1 Engine Tracing Enablement: The vLLM V1 engine now supports OpenTelemetry tracing, allowing for detailed observability of request lifecycles. This includes removing a previous restriction that explicitly prevented tracing in V1.
- Trace Context Propagation: Trace headers are now propagated throughout the request handling pipeline, from the initial Request object through EngineCoreRequest and EngineCoreOutput, ensuring continuity of tracing context across the different stages of processing.
- Detailed Latency Metrics: New SpanAttributes have been introduced to capture granular latency metrics, specifically prefill, decode, and overall inference times within model execution, providing more precise performance insights.
- OpenTelemetry Span Generation: The OutputProcessor now includes logic to initialize an OpenTelemetry tracer and, upon request completion, generates detailed llm_request spans. These spans include comprehensive performance metrics such as queue time, time-to-first-token (TTFT), end-to-end latency, and the newly added prefill, decode, and inference times, along with request and usage details. A hedged sketch of this span generation follows the list below.
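To make the span-generation highlight concrete, here is a minimal sketch assuming the standard opentelemetry-api surface. The emit_request_span helper, the raw gen_ai.* attribute strings, and the metrics fields (arrival_time, queued_ts, scheduled_ts, first_token_ts, finished_ts) are illustrative assumptions, not the PR's exact code, which uses vLLM's SpanAttributes constants.
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("vllm.v1.engine")

def emit_request_span(request_id, metrics):
    # Anchor the span at the request's arrival time (OpenTelemetry expects
    # nanoseconds) so its duration covers the whole request lifetime.
    with tracer.start_as_current_span(
            "llm_request",
            kind=SpanKind.SERVER,
            start_time=int(metrics.arrival_time * 1e9)) as span:
        span.set_attribute("gen_ai.request.id", request_id)
        # Time spent waiting in the scheduler queue.
        span.set_attribute("gen_ai.latency.time_in_queue",
                           metrics.scheduled_ts - metrics.queued_ts)
        # Time to first token, measured from arrival.
        span.set_attribute("gen_ai.latency.time_to_first_token",
                           metrics.first_token_ts - metrics.arrival_time)
        # End-to-end request latency.
        span.set_attribute("gen_ai.latency.e2e",
                           metrics.finished_ts - metrics.arrival_time)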
Code Review
This pull request adds OpenTelemetry tracing support to the V1 engine, which is a great feature for observability. The changes correctly plumb the trace_headers through the request lifecycle and introduce a do_tracing method to generate spans with relevant latency and usage metrics.
My review has identified two main issues:
- A critical issue where the new tracing functionality is effectively disabled because the OutputProcessor is not instantiated with the necessary observability_config.
- A high-severity issue in the do_tracing method where a latency metric is calculated incorrectly and set twice.
Addressing these points will ensure the new tracing feature works correctly and provides accurate metrics. Please see the detailed comments for suggestions.
vllm/v1/engine/output_processor.py (outdated)
def __init__(self,
             tokenizer: TokenizerGroup,
             log_stats: bool,
             observability_config: Optional[ObservabilityConfig] = None):
The new observability_config parameter is not passed during OutputProcessor instantiation at its call sites, such as in vllm.v1.engine.async_llm.AsyncLLM and vllm.v1.engine.llm_engine.LLMEngine. This will result in self.tracer always being None, effectively disabling the new tracing functionality.
Please ensure that vllm_config.observability_config is passed to the OutputProcessor constructor in all instantiation locations.
For example, in vllm/v1/engine/async_llm.py:
# In AsyncLLM.__init__
self.output_processor = OutputProcessor(
    self.tokenizer,
    log_stats=self.log_stats,
    observability_config=vllm_config.observability_config,
)
This is a critical issue for the feature to work as intended.
vllm/v1/engine/output_processor.py (outdated)
span.set_attribute(SpanAttributes.GEN_AI_LATENCY_TIME_IN_QUEUE,
                   metrics.queued_ts - metrics.arrival_time)
The attribute SpanAttributes.GEN_AI_LATENCY_TIME_IN_QUEUE is set twice with different values.
- On line 500, it's set to metrics.queued_ts - metrics.arrival_time. This seems to represent the time from request arrival to being queued, not the time spent in the queue.
- On line 505, it's set to queued_time, which is calculated as metrics.scheduled_ts - metrics.queued_ts. This correctly represents the time spent in the queue, as per the comment on line 476.
The second assignment overwrites the first one. The first calculation appears to be incorrect for this specific attribute. I recommend removing the first assignment.
If metrics.queued_ts - metrics.arrival_time is a metric you want to track, it should be assigned to a different span attribute; a minimal sketch follows below.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: Ye Zhang <[email protected]>
[vllm] v1 tracing with fixes based on vllm-project#20372
cc @sallyom
Hi @RichardoMrMu. This PR is very helpful. Do you plan to continue developing it?
I hope to use it, please go on!
Currently, do we have tracing for V1? I am getting an error while enabling tracing with V1.
Signed-off-by: Ye Zhang <[email protected]>
fix: merge error by accident
This pull request has merge conflicts that must be resolved before it can be merged.
@RichardoMrMu Added some suggestions that should fix the most prominent build errors!
Co-authored-by: Benjamin Bartels <[email protected]> Signed-off-by: RichardoMu <[email protected]>
vllm/v1/engine/output_processor.py (outdated)
# meta
span.set_attribute(SpanAttributes.GEN_AI_REQUEST_ID, req_state.request_id)
if req_state.parent_req and req_state.parent_req.sampling_params:
If it is a pooling model or params.n is 1, are the following attributes missing?
vllm/vllm/v1/engine/async_llm.py, line 255 in 82216dc:
if is_pooling or params.n == 1:
Is it possible to add these attributes regardless of model type and params.n?
Yeah, good point. I added fields in RequestState to hold these sampling_params from EngineCoreRequest. Or please let me know if there's a better way to do it :) A minimal sketch of this approach is shown below.
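A minimal sketch of that approach, with assumed names (TraceSamplingInfo and capture_sampling_params are illustrative, not the PR's actual RequestState changes):
from dataclasses import dataclass
from typing import Optional

@dataclass
class TraceSamplingInfo:
    # Holds only the sampling fields needed when the span is emitted.
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    max_tokens: Optional[int] = None
    n: Optional[int] = None

def capture_sampling_params(engine_core_request):
    # Pooling requests carry no sampling params, so fall back to empty
    # values rather than dropping the span attributes entirely.
    params = getattr(engine_core_request, "sampling_params", None)
    if params is None:
        return TraceSamplingInfo()
    return TraceSamplingInfo(temperature=params.temperature,
                             top_p=params.top_p,
                             max_tokens=params.max_tokens,
                             n=params.n)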
Chris trace UT fix
…llm-project#24380) Signed-off-by: Aaron Pham <[email protected]>
…4404) Signed-off-by: 22quinn <[email protected]>
Fix guided decoding UT failure
fix trace test pipeline config
@simon-mo Hi, the number of unit test checks recently increased to over 70, and the runs keep failing for reasons unrelated to this PR. Could you please help check this?
Metrics tracing tests passed; the other tests appear to be fixed on main.
Signed-off-by: Mu Huai <[email protected]> Signed-off-by: Ye Zhang <[email protected]> Signed-off-by: RichardoMu <[email protected]> Signed-off-by: simon-mo <[email protected]> Signed-off-by: Aaron Pham <[email protected]> Signed-off-by: 22quinn <[email protected]> Co-authored-by: Mu Huai <[email protected]> Co-authored-by: Ye Zhang <[email protected]> Co-authored-by: Benjamin Bartels <[email protected]> Co-authored-by: simon-mo <[email protected]> Co-authored-by: 瑜琮 <[email protected]> Co-authored-by: Aaron Pham <[email protected]> Co-authored-by: 22quinn <[email protected]>
Signed-off-by: Mu Huai <[email protected]> Signed-off-by: Ye Zhang <[email protected]> Signed-off-by: RichardoMu <[email protected]> Signed-off-by: simon-mo <[email protected]> Signed-off-by: Aaron Pham <[email protected]> Signed-off-by: 22quinn <[email protected]> Co-authored-by: Mu Huai <[email protected]> Co-authored-by: Ye Zhang <[email protected]> Co-authored-by: Benjamin Bartels <[email protected]> Co-authored-by: simon-mo <[email protected]> Co-authored-by: 瑜琮 <[email protected]> Co-authored-by: Aaron Pham <[email protected]> Co-authored-by: 22quinn <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
Due to a mistake while syncing my fork, I accidentally closed the previous PR #18069, so I have recreated it here.
Original description:
Following issue #17794 and issue #10582 ("[P2] Make it work with tracing too (there's some request level stats that tracing needs, like queue time, ttft). These request level metrics should be possible to be surfaced in v1 too."), this PR makes the V1 engine generate model tracing spans. A hedged usage sketch for enabling tracing is included below.
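For reference, a hedged sketch of how tracing might be switched on, assuming the pre-existing otlp_traces_endpoint engine argument also gates V1 tracing; adjust names if the config surface differs.
from vllm.engine.arg_utils import EngineArgs

# Point the engine at an OTLP collector; request spans should then be
# exported once each request finishes.
engine_args = EngineArgs(
    model="facebook/opt-125m",
    otlp_traces_endpoint="http://localhost:4317",
)
vllm_config = engine_args.create_engine_config()
print(vllm_config.observability_config)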