[PERF] Async structured outputs #23224

vadiklyutiy · 2025-08-20T01:44:32Z

Purpose

This PR continues the structured-output optimization work started in #21862.
It computes structured outputs asynchronously: the heavy computation moves to a separate background process, which speeds up structured output generation.

Key Changes:

Multiprocess Architecture: Introduces a new 3-tier architecture:

StructuredOutputManager (Main process): Coordinates operations via multiprocessing queues
StructuredOutputGateway (Child process): Background process that receives and executes tasks
StructuredOutputExecutor (Child process): Performs actual grammar compilation, bitmask generation, and token acceptance

Asynchronous Grammar Bitmask Calculations:

Grammar bitmask generation now happens asynchronously in the background process
Main process receives a GrammarBitmaskPlaceholder and can continue execution
Actual bitmask is retrieved when needed via shared memory
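
The placeholder and shared-memory flow described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `BitmaskPlaceholder`, `write_bitmask`, and `read_bitmask` are hypothetical names, the bitmask is modeled as a flat list of integer words, and both sides run in one process here, whereas in the PR the writer would be the background process.

```python
import array
from dataclasses import dataclass
from multiprocessing import shared_memory

WORD = "i"  # signed C int (typically 32 bits); one bit per vocabulary token


@dataclass
class BitmaskPlaceholder:
    """Hypothetical handle returned immediately instead of the bitmask itself."""
    shm_name: str
    num_words: int


def write_bitmask(words):
    """Producer side (the background process in the PR): publish the bitmask."""
    data = array.array(WORD, words)
    nbytes = len(data) * data.itemsize
    shm = shared_memory.SharedMemory(create=True, size=max(nbytes, 1))
    shm.buf[:nbytes] = data.tobytes()
    return shm, BitmaskPlaceholder(shm.name, len(data))


def read_bitmask(placeholder):
    """Consumer side (the main process): resolve the placeholder when needed."""
    shm = shared_memory.SharedMemory(name=placeholder.shm_name)
    try:
        nbytes = placeholder.num_words * array.array(WORD).itemsize
        words = array.array(WORD)
        words.frombytes(bytes(shm.buf[:nbytes]))
        return words.tolist()
    finally:
        shm.close()
```

The producer keeps its `SharedMemory` object alive until the consumer has read the data, then calls `close()` and `unlink()` on it. Only the small placeholder travels through a queue; the bitmask bytes stay in shared memory.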

Communication:

Uses shared memory for large bitmask data transfer
Three dedicated queues for different types of communication:
- task_queue: Main → Child for task submission
- batch_validate_result_queue: Child → Main for token validation results
- grammar_init_notification_queue: Child → Main for grammar initialization notifications
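
A rough sketch of how tasks and results could be routed over the three queues listed above. The queue names come from this description; the task tuples, the `gateway_loop` function, and the toy validation rule are assumptions made for illustration. A thread and `queue.Queue` stand in for the child process and the multiprocessing queues so that the sketch runs anywhere.

```python
import queue
import threading

STOP = None  # sentinel that tells the worker loop to exit


def gateway_loop(task_queue, batch_validate_result_queue,
                 grammar_init_notification_queue, allowed_tokens):
    """Pull tasks from task_queue and route each result to its own queue."""
    while True:
        task = task_queue.get()
        if task is STOP:
            break
        kind, payload = task
        if kind == "grammar_init":
            # Stand-in for grammar compilation: just report the request ID.
            grammar_init_notification_queue.put(payload)
        elif kind == "batch_validate":
            # Toy validation: keep the longest prefix of allowed tokens.
            results = {}
            for request_id, tokens in payload.items():
                accepted = []
                for token in tokens:
                    if token not in allowed_tokens:
                        break
                    accepted.append(token)
                results[request_id] = accepted
            batch_validate_result_queue.put(results)


def demo():
    task_q, validate_q, init_q = queue.Queue(), queue.Queue(), queue.Queue()
    worker = threading.Thread(
        target=gateway_loop, args=(task_q, validate_q, init_q, {1, 2, 3}))
    worker.start()
    task_q.put(("grammar_init", "req-0"))
    task_q.put(("batch_validate", {"req-0": [1, 2, 9, 3]}))
    task_q.put(STOP)
    worker.join()
    return init_q.get(timeout=1), validate_q.get(timeout=1)
```

Keeping a separate queue per result type means the main side can wait for validation results without having to skip over unrelated grammar-initialization notifications.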

Batch processing, introduced to reduce the number of inter-process messages:

submit_batch_accept_tokens(): Batches token acceptance operations
submit_batch_validate_tokens(): Batches token validation for speculative decoding
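
To illustrate why batching cuts communication, the sketch below compares one message per request with a single batched message. The function name `submit_batch_accept_tokens` comes from this description, but its signature and the message layout here are assumptions, and `submit_individually` is a hypothetical baseline.

```python
import queue


def submit_individually(task_queue, accepted):
    """Baseline: one queue message per request."""
    for request_id, tokens in accepted.items():
        task_queue.put(("accept_tokens", request_id, tokens))


def submit_batch_accept_tokens(task_queue, accepted):
    """Batched: a single message carrying every request's tokens."""
    task_queue.put(("batch_accept_tokens", dict(accepted)))
```

With N running requests, the baseline enqueues N messages per step while the batched variant enqueues one, so the per-message serialization and wake-up cost is paid once per step.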

Test Plan

vllm serve Qwen/Qwen3-1.7B  --no-enable-prefix-caching

python3 benchmarks/benchmark_serving_structured_output.py --backend vllm --model Qwen/Qwen3-1.7B --structured-output-ratio 1.0 --request-rate 100 --num-prompts 2000 --json-schema-path ./test3.json  --output-len 2048

test3.json

{
    "description": "Schema for representing structured information about an animal species.",
    "properties": {
      "animal_name": {
        "description": "The most common English name for the species. Keep it under 50 characters. Avoid including subspecies or variation labels unless critical for identification. Do not invent or decorate the name.",
        "title": "Animal Name",
        "type": "string"
      },
      "short_summary": {
        "description": "One brief sentence (no more than 20 words) describing the animal’s main characteristic or behavior.",
        "title": "Short Summary",
        "type": "string"
      },
      "distinctive_trait": {
        "description": "The single feature that most clearly separates this species from close relatives (e.g., pattern, call, or body shape).",
        "title": "Distinctive Trait",
        "type": "string"
      },
      "notable_features": {
        "description": "List 2–4 features that help in identifying or understanding the species. Each item should be short and specific.",
        "items": {
          "type": "string"
        },
        "title": "Notable Features",
        "type": "array"
      },
      "key_facts": {
        "description": "3–5 factual points about the species (e.g., size range, diet, reproduction, lifespan). Should avoid repeating notable_features.",
        "items": {
          "type": "string"
        },
        "title": "Key Facts",
        "type": "array"
      },
      "habitats": {
        "description": "2–4 environments where this animal usually lives (e.g., tundra, tropical forest, river delta).",
        "items": {
          "type": "string"
        },
        "title": "Habitats",
        "type": "array"
      },
      "example_queries": {
        "description": "2–4 sample search phrases someone might use when trying to find this species online (e.g., 'bird with curved beak', 'forest antelope small').",
        "items": {
          "type": "string"
        },
        "title": "Example Queries",
        "type": "array"
      }
    },
    "required": [
      "animal_name",
      "short_summary",
      "distinctive_trait",
      "notable_features",
      "key_facts",
      "habitats",
      "example_queries"
    ],
    "title": "AnimalMetadataSchemaV2",
    "type": "object"
  }

Test Result

NVIDIA H200 GPU:

Main: 68.58 reqs/sec
This PR: 80.42 reqs/sec (+17.3%)
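
The headline percentage can be rechecked from the reported numbers. The snippet below is only a convenience for that arithmetic; `percent_change` is a hypothetical helper, and the values are copied from the two benchmark reports in this description.

```python
def percent_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Values copied from the "Main" and "This PR" benchmark reports.
main = {"req_per_s": 68.58, "mean_ttft_ms": 216.26, "mean_tpot_ms": 70.55}
this_pr = {"req_per_s": 80.42, "mean_ttft_ms": 92.96, "mean_tpot_ms": 26.62}

changes = {name: percent_change(main[name], this_pr[name]) for name in main}
for name, value in changes.items():
    print(f"{name}: {value:+.1f}%")
```

Throughput comes out at about +17.3%, matching the figure above, while mean TTFT and mean TPOT drop by roughly 57% and 62% respectively.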

This PR

============ Serving Benchmark Result ============
Successful requests:                     2000
Request rate configured (RPS):           100.00
Benchmark duration (s):                  24.87
Total input tokens:                      962000
Total generated tokens:                  255810
Request throughput (req/s):              80.42
Output token throughput (tok/s):         10286.35
Total Token throughput (tok/s):          48969.26
---------------Time to First Token----------------
Mean TTFT (ms):                          92.96
Median TTFT (ms):                        89.16
P99 TTFT (ms):                           229.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          26.62
Median TPOT (ms):                        27.61
P99 TPOT (ms):                           30.63
---------------Inter-token Latency----------------
Mean ITL (ms):                           26.41
Median ITL (ms):                         25.92
P99 ITL (ms):                            51.52
==================================================
correct_rate(%) 100.0

Main

============ Serving Benchmark Result ============
Successful requests:                     2000
Request rate configured (RPS):           100.00
Benchmark duration (s):                  29.16
Total input tokens:                      962000
Total generated tokens:                  255995
Request throughput (req/s):              68.58
Output token throughput (tok/s):         8777.49
Total Token throughput (tok/s):          41762.30
---------------Time to First Token----------------
Mean TTFT (ms):                          216.26
Median TTFT (ms):                        172.08
P99 TTFT (ms):                           615.28
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          70.55
Median TPOT (ms):                        75.79
P99 TPOT (ms):                           91.68
---------------Inter-token Latency----------------
Mean ITL (ms):                           69.98
Median ITL (ms):                         65.79
P99 ITL (ms):                            227.71
==================================================
correct_rate(%) 100.0

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results

mergify · 2025-08-20T01:45:36Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vadiklyutiy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gemini-code-assist

Code Review

This pull request introduces an asynchronous architecture for guided decoding to improve performance, which is a significant and well-thought-out change. The use of multiprocessing and shared memory to offload heavy computations is a solid approach. The code is generally well-structured, and the separation of concerns into Manager, Gateway, and Executor classes is clean.

I've found a critical bug in the TPU model runner and a couple of opportunities to improve performance and robustness in the shared memory handling. My comments focus on these areas.

Overall, this is a great step towards optimizing structured output generation.

vllm/v1/worker/tpu_model_runner.py

vllm/v1/structured_output/__init__.py

github-actions · 2025-08-20T01:58:25Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

WoosukKwon · 2025-08-20T04:23:24Z

Hi, thanks for the PR!

Actually, the bitmask construction can be naturally overlapped with the model execution if we restructure the loop a bit: #23233 (while the RFC doesn't consider the spec decode + structured outputs case yet). WDYT?
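
The overlap idea can be pictured with a small sketch. It is not taken from #23233: `decode_step`, `run_model`, and `build_bitmask` are hypothetical names, a worker thread stands in for whatever mechanism the restructured loop would use, and plain Python lists stand in for logits and the bitmask.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_step(executor, run_model, build_bitmask, batch):
    """One step: build the bitmask on a worker thread while the model runs."""
    bitmask_future = executor.submit(build_bitmask, batch)  # CPU-side work
    logits = run_model(batch)          # stands in for the GPU forward pass
    bitmask = bitmask_future.result()  # join just before sampling
    # Mask out disallowed tokens before sampling.
    return [x if ok else float("-inf") for x, ok in zip(logits, bitmask)]
```

The point of the ordering is that the bitmask is only needed at sampling time, so its construction can run concurrently with the forward pass instead of before it.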

mergify · 2025-08-23T03:02:39Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vadiklyutiy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

russellb · 2025-09-25T20:51:39Z

One performance concern is how TTFT is improved in one case (the larger JSON schema you provided), but worse in others (using the xgrammar_bench dataset of simpler JSON schemas). It seems that, at least for simpler schemas, the old threadpool executor performed initialization faster.

I'm hesitant to obsess over the results too much since I do believe this direction is good, and is required for us to move forward to adding support in the async scheduler.

russellb · 2025-09-25T21:03:35Z

I'm happy with this going in as-is. It's a great next step and will enable us to more easily work on the next steps of integration with async scheduling.

However, I would like to wait to merge until after we release v0.11.0. It feels like a very risky change to merge now when we plan to release any day. If we merge right after, that gives us more time to ensure there are no regressions.

vadiklyutiy · 2025-09-26T08:55:16Z

@russellb

I'm not worried about the 91% correctness

Likely you should increase the allowed number of open files. Each HTTP request needs one file descriptor, so ulimit -n 65000 should fix it.
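
For a Python benchmark client, the same limit can also be inspected and raised from inside the process. The sketch below is an assumption-laden alternative to running ulimit in the shell: it uses the POSIX-only standard `resource` module, the helper names are hypothetical, and it affects only the current process and its children.

```python
import resource


def raised_soft_limit(soft, hard, target=65000):
    """Return the soft limit to request: target, capped by the hard limit."""
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    return max(soft, target)  # never lower an already higher limit


def raise_open_file_limit(target=65000):
    """Raise RLIMIT_NOFILE for this process and return the new soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = raised_soft_limit(soft, hard, target)
    if new_soft != soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft
```

An unprivileged process cannot exceed the hard limit, which is why the target is capped rather than set unconditionally.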

russellb · 2025-09-26T12:25:49Z

@russellb

I'm not worried about the 91% correctness

Likely you should increase the allowed number of open files. Each HTTP request needs one file descriptor, so ulimit -n 65000 should fix it.

I think it's just from sometimes cutting the response off early so the result ends up not being valid JSON.

aarnphm

Discussed with Ben offline yesterday, +1 for merging after 0.11.x

aarnphm · 2025-09-26T12:41:34Z

vllm/v1/structured_output/__init__.py

+class StructuredOutputTask:
+
+    def __init__(self, task_type: TaskType, args: tuple, kwargs: dict):
+        self.task_type = task_type
+        self.args = args
+        self.kwargs = kwargs
+
+
+class StructuredOutputResult:
+
+    def __init__(self,
+                 task_type: TaskType,
+                 result: Any,
+                 error: Optional[Exception] = None):
+        self.task_type = task_type
+        self.result = result
+        self.error = error


Nit: these two classes could be dataclasses.
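
A sketch of what the suggested dataclass version might look like. The fields mirror the diff above; the `TaskType` enum members are hypothetical, since the PR's actual definition is not shown in this thread.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Optional


class TaskType(Enum):
    """Hypothetical members; the PR's actual TaskType is not shown here."""
    GRAMMAR_INIT = auto()
    BATCH_VALIDATE_TOKENS = auto()


@dataclass
class StructuredOutputTask:
    task_type: TaskType
    args: tuple
    kwargs: dict


@dataclass
class StructuredOutputResult:
    task_type: TaskType
    result: Any
    error: Optional[Exception] = None
```

Besides removing the hand-written `__init__`, this gives both classes a generated `__repr__` and `__eq__`, which tends to help when logging or testing queue messages.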

vllm/v1/structured_output/__init__.py

vadiklyutiy · 2025-09-26T14:49:47Z

merging after 0.11.x

@russellb @aarnphm
No problem waiting for the release.
In the meantime, could you please run the full CI to make sure there are no unexpected failures?

russellb · 2025-09-26T18:51:48Z

merging after 0.11.x

@russellb @aarnphm No problem waiting for the release. In the meantime, could you please run the full CI to make sure there are no unexpected failures?

Good idea. I just triggered a full CI run.

WoosukKwon

Thanks for the PR and review! I'd like to review this PR over the weekend. Please hold off on merging until then.

requirements/docs.txt


simon-mo · 2025-10-03T21:16:14Z

@WoosukKwon bump on review! thank you

simon-mo · 2025-10-06T16:21:50Z

Now that v0.11.0 has been released, is this ready to go?

vadiklyutiy · 2025-10-06T18:55:49Z

@simon-mo
I am always ready :)

russellb · 2025-10-07T12:48:24Z

Now that v0.11.0 has been released, is this ready to go?

waiting on @WoosukKwon 's approval

mergify · 2025-10-07T13:58:12Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vadiklyutiy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

hmellor · 2025-10-07T13:59:02Z

@vadiklyutiy if you've not seen it already, please check out https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1759663228844749 for instructions on resolving the merge conflicts

vadiklyutiy · 2025-10-08T23:39:03Z

@hmellor thank you for pointing out this post.

But over the last 1.5 months I have resolved merge conflicts in this PR 7-8 times. It is not always simple, and it takes time.

I would like to hear that we are going ahead with this PR first; after that I will resolve everything. I hope you understand.

vadiklyutiy · 2025-10-08T23:40:07Z

@WoosukKwon could you kindly clarify whether you are going to review this PR?

hmellor · 2025-10-09T15:48:45Z

No problem @vadiklyutiy let's wait for Woosuk's response.

I just wanted to share the instructions for if you do need them.

vadiklyutiy · 2025-10-09T23:29:56Z

No problem @vadiklyutiy let's wait for Woosuk's response.

I just wanted to share the instructions for if you do need them.

Yes, it seems that without these instructions the merge would be a pain. Thanks for sharing.

vadiklyutiy · 2025-10-24T20:13:36Z

We had a meeting.
The decision is "no go" for this approach.

vadiklyutiy requested review from WoosukKwon, aarnphm, alexm-redhat, comaniac, mgoin, njhill, robertgshaw2-redhat, russellb and ywang96 as code owners August 20, 2025 01:44

mergify bot added the structured-output, v1, and tpu labels Aug 20, 2025

github-project-automation bot added this to Structured Output Aug 20, 2025

mergify bot added the needs-rebase label Aug 20, 2025

gemini-code-assist bot reviewed Aug 20, 2025

vllm/v1/worker/tpu_model_runner.py (outdated, resolved)

vllm/v1/structured_output/__init__.py (outdated, resolved)

vllm/v1/structured_output/__init__.py (outdated, resolved)

vadiklyutiy force-pushed the async-guided-decoding branch from 1f7b15f to d676e2c Compare August 20, 2025 12:06

mergify bot removed the needs-rebase label Aug 20, 2025

vadiklyutiy mentioned this pull request Aug 21, 2025

[RFC]: Restructure the core loop to allow more asynchrony #23233

Open

1 task

mergify bot added the needs-rebase label Aug 23, 2025

shen-shanshan mentioned this pull request Aug 25, 2025

[Feature]: Add Support for Guided Decoding (Structured Output) vllm-project/vllm-ascend#177

Closed

20 tasks

benchislett mentioned this pull request Aug 25, 2025

[Perf][V1] Fully overlap model execution #23569

Merged

njhill mentioned this pull request Sep 5, 2025

Xgrammar fixed #24300

Open

vadiklyutiy force-pushed the async-guided-decoding branch from 985277e to e6da326 Compare September 18, 2025 09:08

vadiklyutiy requested review from ApostaC, NickLucche and heheda12345 as code owners September 18, 2025 09:08

aarnphm reviewed Sep 26, 2025

russellb added the ready label Sep 26, 2025

WoosukKwon requested changes Sep 26, 2025

hmellor reviewed Sep 29, 2025

requirements/docs.txt (outdated, resolved)

vadiklyutiy added 3 commits October 1, 2025 22:40

- fix doc build

03d43cd

- fix a bug with dp Signed-off-by: Vadim Gimpelson <[email protected]> #suppress-bc-linter

- pre-commit

97e8922

fix doc build Signed-off-by: Vadim Gimpelson <[email protected]> #suppress-bc-linter

rerun CI #suppress-bc-linter

c1c05bc

Signed-off-by: Vadim Gimpelson <[email protected]>

vadiklyutiy added 2 commits October 4, 2025 06:02

rerun CI #suppress-bc-linter

057ea6c

Signed-off-by: Vadim Gimpelson <[email protected]>

rerun CI #suppress-bc-linter

a092463

Signed-off-by: Vadim Gimpelson <[email protected]>

hmellor changed the title from "[PERF] Async guided decoding" to "[PERF] Async structured outputs" Oct 7, 2025

mergify bot added the needs-rebase label Oct 7, 2025

njhill mentioned this pull request Oct 15, 2025

[Core] Async scheduling + structured outputs compatibility #26866

Open

vadiklyutiy closed this Oct 24, 2025
