
Conversation

@simon-mo (Collaborator) commented Oct 6, 2025

Summary

  • clarify the metrics design doc so the prometheus middleware note no longer references the legacy V0 engine migration
  • update the speculative decoding guide to state that draft-model support requires the V1 engine instead of pointing to the retired v0.10 release

Testing

  • not run (documentation changes only)

https://chatgpt.com/codex/tasks/task_e_68e3f11c47408329bf2324ac7b1ad7bf

@mergify bot added the documentation label Oct 6, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request provides a number of documentation updates to remove references to the legacy v0 engine and clarify concepts for the current v1 engine. The changes are well-executed across multiple files, improving the clarity and relevance of the documentation for users. The updates are consistent with the stated goals of the PR, and I have no further suggestions.


We have started the process of deprecating V0. Please read [RFC #18571](gh-issue:18571) for more details.

V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).
Member

Also update this paragraph?

| **Mamba Models** | <nobr>🟢 (Mamba-2), 🟢 (Mamba-1)</nobr> |
| **Multimodal Models** | <nobr>🟢 Functional</nobr> |

vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol.
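For readers unfamiliar with how that exclusion works, the snippet below is an illustrative sketch of marker-protocol filtering, not vLLM's actual code; the real `SupportsV0Only` interface lives in vLLM's model interfaces module and differs in detail, and the model classes here are hypothetical.

```python
from typing import Protocol, runtime_checkable

# Illustrative sketch only -- shows the idea of a marker protocol that lets
# the engine filter out architectures at runtime.
@runtime_checkable
class SupportsV0Only(Protocol):
    supports_v0_only: bool

class LegacyOnlyModel:          # hypothetical architecture still tied to V0
    supports_v0_only = True

class ModernModel:              # hypothetical architecture ported to V1
    pass

def eligible_for_v1(model_cls: type) -> bool:
    # V1 skips any architecture that advertises the marker attribute.
    return not isinstance(model_cls(), SupportsV0Only)

print(eligible_for_v1(LegacyOnlyModel))  # False -> excluded from V1
print(eligible_for_v1(ModernModel))      # True  -> eligible for V1
```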
@DarkLight1337 (Member) commented Oct 6, 2025

We should remove the V1 column from the Supported Models page and delete all models that don't support V1

Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.

In vLLM V1, **chunked prefill is always enabled by default**. This is different from vLLM V0, where it was conditionally enabled based on model characteristics.
In vLLM V1, **chunked prefill is always enabled by default** so that behavior is consistent across supported models.
Collaborator Author

Suggested change
In vLLM V1, **chunked prefill is always enabled by default** so that behavior is consistent across supported models.
In vLLM V1, **chunked prefill is always enabled by default**.
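As a side note for anyone reading this thread, here is a minimal sketch of exercising the chunked-prefill behavior described above from the offline API. It assumes the `enable_chunked_prefill` and `max_num_batched_tokens` engine arguments from the vLLM docs; the model name is a placeholder and defaults may differ between releases.

```python
from vllm import LLM

# Chunked prefill is on by default in V1; the flag is shown explicitly only
# for clarity. max_num_batched_tokens caps the per-step token budget, which
# bounds how large each prefill chunk mixed in with decode requests can be.
llm = LLM(
    model="facebook/opt-125m",       # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,     # smaller budget -> smaller chunks, better decode latency
)

outputs = llm.generate(["Chunked prefill lets vLLM"])
print(outputs[0].outputs[0].text)
```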

Collaborator Author

There are probably some mistakes here. @markmc PTAL

Member

Generally lgtm, although I guess my attitude is that design docs like these are naturally a snapshot in time of a design decision, but more discoverable than a random Google doc. It's really hard to be disciplined enough to keep a doc like this up to date

Collaborator Author

@njhill I guess this page can use a full clean up

Comment on lines +19 to +20
Speculative decoding with a draft model requires the V1 engine.
Older releases that predate V1 (such as the 0.10.x series) raise a `NotImplementedError`.
Collaborator Author

Suggested change
Speculative decoding with a draft model requires the V1 engine.
Older releases that predate V1 (such as the 0.10.x series) raise a `NotImplementedError`.
Speculative decoding with a draft model is not supported in vLLM V1.
You can use older versions before the 0.10.x series to continue to leverage it.
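For context on what the quoted guide is describing, below is a rough sketch of a draft-model configuration, based on the `speculative_config` pattern in vLLM's speculative decoding guide; whether a given engine version accepts a separate draft model is exactly what this thread is sorting out, argument names have shifted across releases, and the model names are placeholders.

```python
from vllm import LLM, SamplingParams

# Sketch of draft-model speculative decoding: a small draft model proposes
# tokens that the larger target model then verifies in a single forward pass.
llm = LLM(
    model="facebook/opt-6.7b",            # target model (placeholder)
    speculative_config={
        "model": "facebook/opt-125m",     # draft model (placeholder)
        "num_speculative_tokens": 5,      # tokens proposed per step
    },
)

params = SamplingParams(temperature=0.0, max_tokens=32)
print(llm.generate(["The future of AI is"], params)[0].outputs[0].text)
```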

Member

> We should remove the V1 column from the Supported Models page and delete all models that don't support V1

LGTM after doing this

Collaborator Author

We can probably gradually remove these docs

mergify bot commented Oct 8, 2025

Documentation preview: https://vllm--26311.org.readthedocs.build/en/26311/


### Multi-process Mode

In v0, metrics are collected in the engine core process and we use multiprocess mode to make them available in the API server process. See <gh-pr:7279>.
Member

Metrics are still collected in the API server process, but multiprocess mode was reinstated by #17546 in order to share metrics state between API server processes
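For anyone unfamiliar with Prometheus multiprocess mode, the sketch below shows the generic `prometheus_client` pattern being referred to (per-process mmap files aggregated at scrape time); it is not vLLM's actual wiring, and the directory path is just an example.

```python
import os

# The directory must exist and the env var must be set before prometheus_client
# is imported, because the client picks its storage backend at import time.
os.makedirs("/tmp/prom_multiproc", exist_ok=True)          # example path
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/prom_multiproc")

from prometheus_client import CollectorRegistry, Counter, generate_latest, multiprocess

# Each process records samples into its own mmap-backed file in that directory.
requests_total = Counter("app_requests_total", "Requests handled by this process")
requests_total.inc()

# The scrape handler aggregates the per-process files into a single exposition,
# which is how metrics state gets shared across API server processes.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
print(generate_latest(registry).decode())
```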

This is relevant because if we move away from multiprocess mode in v1,
we get these back. However, it's questionable how relevant these are
if they don't aggregate these stats for all processes that make up a
vLLM instance.
Member

Yeah, so these are gone again


Since metrics is a big enough topic on its own, we are going to tackle
the topic of tracing in v1 separately.
the topic of tracing separately.
Member

Tracing has since been reinstated - #20372

mergify bot commented Oct 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @simon-mo.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Oct 11, 2025

Labels

codex, documentation, needs-rebase


3 participants