Conversation

@vadiklyutiy
Contributor

@vadiklyutiy vadiklyutiy commented Aug 20, 2025

Purpose

This PR continues the structured output optimization work started in #21862.
It computes structured outputs asynchronously: the heavy computation moves to a separate background process, which improves the performance of structured output generation.

Key Changes:

  1. Multiprocess Architecture: Introduces a new 3-tier architecture:
  • StructuredOutputManager (Main process): Coordinates operations via multiprocessing queues
  • StructuredOutputGateway (Child process): Background process that receives and executes tasks
  • StructuredOutputExecutor (Child process): Performs actual grammar compilation, bitmask generation, and token acceptance
  2. Asynchronous Grammar Bitmask Calculations:
  • Grammar bitmask generation now happens asynchronously in the background process
  • Main process receives a GrammarBitmaskPlaceholder and can continue execution
  • Actual bitmask is retrieved when needed via shared memory
  3. Communication:
  • Uses shared memory for large bitmask data transfer
  • Three dedicated queues for different types of communication:
    • task_queue: Main → Child for task submission
    • batch_validate_result_queue: Child → Main for token validation results
    • grammar_init_notification_queue: Child → Main for grammar initialization notifications
  4. Batch Processing: Introduces batched calls to reduce the number of inter-process messages:
  • submit_batch_accept_tokens(): Batches token acceptance operations
  • submit_batch_validate_tokens(): Batches token validation for speculative decoding
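
The placeholder-plus-shared-memory handoff described in the list above can be sketched roughly as follows. This is a minimal single-process illustration, not the PR's implementation: BitmaskPlaceholder, publish_bitmask, and resolve_bitmask are hypothetical names, and the bitmask is assumed to be a flat array of 32-bit integers. In the PR's design the writer side would run in the child process and the reader side in the main process.

```python
import struct
from dataclasses import dataclass
from multiprocessing import shared_memory


@dataclass(frozen=True)
class BitmaskPlaceholder:
    """Hypothetical handle: tells the reader where the bitmask lives."""
    shm_name: str
    num_words: int


def publish_bitmask(words):
    """Writer side: copy int32 words into a fresh shared memory segment."""
    shm = shared_memory.SharedMemory(create=True, size=4 * len(words))
    struct.pack_into(f"<{len(words)}i", shm.buf, 0, *words)
    return shm, BitmaskPlaceholder(shm.name, len(words))


def resolve_bitmask(placeholder):
    """Reader side: attach by name and copy the words out when needed."""
    shm = shared_memory.SharedMemory(name=placeholder.shm_name)
    try:
        fmt = f"<{placeholder.num_words}i"
        return list(struct.unpack_from(fmt, shm.buf, 0))
    finally:
        shm.close()


# The small placeholder is what would travel over a queue; the bulk
# data stays in shared memory until the reader actually needs it.
owner, handle = publish_bitmask([-1, 0, 5])
bitmask = resolve_bitmask(handle)
owner.close()
owner.unlink()  # the creating side is responsible for freeing the segment
```

Only the segment name and length cross the process boundary in this sketch, which is the reason a placeholder can be cheap to send even when the bitmask itself is large.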

Test Plan

vllm serve Qwen/Qwen3-1.7B  --no-enable-prefix-caching
python3 benchmarks/benchmark_serving_structured_output.py --backend vllm --model Qwen/Qwen3-1.7B --structured-output-ratio 1.0 --request-rate 100 --num-prompts 2000 --json-schema-path ./test3.json  --output-len 2048
test3.json
{
    "description": "Schema for representing structured information about an animal species.",
    "properties": {
      "animal_name": {
        "description": "The most common English name for the species. Keep it under 50 characters. Avoid including subspecies or variation labels unless critical for identification. Do not invent or decorate the name.",
        "title": "Animal Name",
        "type": "string"
      },
      "short_summary": {
        "description": "One brief sentence (no more than 20 words) describing the animal’s main characteristic or behavior.",
        "title": "Short Summary",
        "type": "string"
      },
      "distinctive_trait": {
        "description": "The single feature that most clearly separates this species from close relatives (e.g., pattern, call, or body shape).",
        "title": "Distinctive Trait",
        "type": "string"
      },
      "notable_features": {
        "description": "List 2–4 features that help in identifying or understanding the species. Each item should be short and specific.",
        "items": {
          "type": "string"
        },
        "title": "Notable Features",
        "type": "array"
      },
      "key_facts": {
        "description": "3–5 factual points about the species (e.g., size range, diet, reproduction, lifespan). Should avoid repeating notable_features.",
        "items": {
          "type": "string"
        },
        "title": "Key Facts",
        "type": "array"
      },
      "habitats": {
        "description": "2–4 environments where this animal usually lives (e.g., tundra, tropical forest, river delta).",
        "items": {
          "type": "string"
        },
        "title": "Habitats",
        "type": "array"
      },
      "example_queries": {
        "description": "2–4 sample search phrases someone might use when trying to find this species online (e.g., 'bird with curved beak', 'forest antelope small').",
        "items": {
          "type": "string"
        },
        "title": "Example Queries",
        "type": "array"
      }
    },
    "required": [
      "animal_name",
      "short_summary",
      "distinctive_trait",
      "notable_features",
      "key_facts",
      "habitats",
      "example_queries"
    ],
    "title": "AnimalMetadataSchemaV2",
    "type": "object"
  }

Test Result

NVIDIA H200 GPU:

  • Main: 68.58 reqs/sec
  • This PR: 80.42 reqs/sec (+17.3%)

This PR

============ Serving Benchmark Result ============
Successful requests:                     2000
Request rate configured (RPS):           100.00
Benchmark duration (s):                  24.87
Total input tokens:                      962000
Total generated tokens:                  255810
Request throughput (req/s):              80.42
Output token throughput (tok/s):         10286.35
Total Token throughput (tok/s):          48969.26
---------------Time to First Token----------------
Mean TTFT (ms):                          92.96
Median TTFT (ms):                        89.16
P99 TTFT (ms):                           229.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          26.62
Median TPOT (ms):                        27.61
P99 TPOT (ms):                           30.63
---------------Inter-token Latency----------------
Mean ITL (ms):                           26.41
Median ITL (ms):                         25.92
P99 ITL (ms):                            51.52
==================================================
correct_rate(%) 100.0

Main

============ Serving Benchmark Result ============
Successful requests:                     2000
Request rate configured (RPS):           100.00
Benchmark duration (s):                  29.16
Total input tokens:                      962000
Total generated tokens:                  255995
Request throughput (req/s):              68.58
Output token throughput (tok/s):         8777.49
Total Token throughput (tok/s):          41762.30
---------------Time to First Token----------------
Mean TTFT (ms):                          216.26
Median TTFT (ms):                        172.08
P99 TTFT (ms):                           615.28
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          70.55
Median TPOT (ms):                        75.79
P99 TPOT (ms):                           91.68
---------------Inter-token Latency----------------
Mean ITL (ms):                           69.98
Median ITL (ms):                         65.79
P99 ITL (ms):                            227.71
==================================================
correct_rate(%) 100.0
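
For reference, the relative changes can be recomputed from the two result tables above. The sketch below only uses figures reported in this description; the helper name percent_change and the dictionary keys are illustrative.

```python
def percent_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# Figures copied from the "Main" and "This PR" benchmark tables.
main = {"throughput_rps": 68.58, "mean_ttft_ms": 216.26, "mean_tpot_ms": 70.55}
pr = {"throughput_rps": 80.42, "mean_ttft_ms": 92.96, "mean_tpot_ms": 26.62}

changes = {key: percent_change(pr[key], main[key]) for key in main}
for key, value in changes.items():
    print(f"{key}: {value:+.1f}%")
```

A positive value for throughput and negative values for the latency metrics both indicate an improvement.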


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results

@mergify

mergify bot commented Aug 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vadiklyutiy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 20, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an asynchronous architecture for guided decoding to improve performance, which is a significant and well-thought-out change. The use of multiprocessing and shared memory to offload heavy computations is a solid approach. The code is generally well-structured, and the separation of concerns into Manager, Gateway, and Executor classes is clean.

I've found a critical bug in the TPU model runner and a couple of opportunities to improve performance and robustness in the shared memory handling. My comments focus on these areas.

Overall, this is a great step towards optimizing structured output generation.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@WoosukKwon
Collaborator

WoosukKwon commented Aug 20, 2025

Hi, thanks for the PR!

Actually, the bitmask construction can be naturally overlapped with the model execution if we restructure the loop a bit: #23233 (while the RFC doesn't consider the spec decode + structured outputs case yet). WDYT?
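
The general idea of overlapping bitmask construction with model execution can be illustrated with a toy loop. This is only a sketch of the overlap pattern, not the design in #23233: build_bitmask, run_forward, and decode_step are hypothetical stand-ins, and a thread pool is used here purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def build_bitmask(step):
    """Stand-in for the CPU-side grammar bitmask computation."""
    return [step] * 4


def run_forward(step):
    """Stand-in for model execution."""
    return step * 2


def decode_step(pool, step):
    mask_future = pool.submit(build_bitmask, step)  # starts in the background
    logits = run_forward(step)                      # proceeds in the meantime
    return logits, mask_future.result()             # join before sampling


with ThreadPoolExecutor(max_workers=1) as pool:
    results = [decode_step(pool, step) for step in range(3)]
```

The bitmask is only needed at sampling time, so the step blocks on it after the forward pass rather than before.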

@mergify

mergify bot commented Aug 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vadiklyutiy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@russellb
Member

One performance concern is how TTFT is improved in one case (the larger JSON schema you provided), but worse in others (using the xgrammar_bench dataset of simpler JSON schemas). It seems that, at least for simpler schemas, the old threadpool executor performed initialization faster.

I'm hesitant to obsess over the results too much since I do believe this direction is good, and is required for us to move forward to adding support in the async scheduler.

@russellb
Member

I'm happy with this going in as-is. It's a great next step and will enable us to more easily work on the next steps of integration with async scheduling.

However, I would like to wait to merge until after we release v0.11.0. It feels like a very risky change to merge now when we plan to release any day. If we merge right after, that gives us more time to ensure there are no regressions.

@vadiklyutiy
Contributor Author

@russellb

I'm not worried about the 91% correctness

You likely need to increase the allowed number of open files. Each HTTP request needs its own file descriptor. ulimit -n 65000 should fix it.
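
The current open-file limit can also be checked from Python before running the benchmark. This is a small sketch for Unix-like systems using the standard resource module; the figure of 2000 is taken from --num-prompts in the test plan and is only an assumed upper bound on simultaneously open connections.

```python
import resource

# Soft limit is what the process is held to; hard limit is the ceiling
# the soft limit may be raised to without extra privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

needed = 2000  # assumption: at most one descriptor per in-flight request
print(f"soft={soft} hard={hard} needed~{needed}")
if soft != resource.RLIM_INFINITY and soft < needed:
    print("Soft limit looks too low; consider raising it with ulimit -n.")
```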

@russellb
Member

@russellb

I'm not worried about the 91% correctness

You likely need to increase the allowed number of open files. Each HTTP request needs its own file descriptor. ulimit -n 65000 should fix it.

I think it's just from sometimes cutting the response off early so the result ends up not being valid JSON.
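
As a quick illustration of that failure mode, a response that is cut off mid-object no longer parses as JSON. The sample payload below is made up for this sketch and the helper name is_valid_json is hypothetical.

```python
import json


def is_valid_json(text):
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


complete = '{"animal_name": "Red fox", "habitats": ["forest", "tundra"]}'
truncated = complete[:30]  # simulates generation stopping early

print(is_valid_json(complete), is_valid_json(truncated))
```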

Collaborator

@aarnphm aarnphm left a comment

Discussed with Ben offline yesterday, +1 for merging after 0.11.x

Comment on lines +152 to +168
class StructuredOutputTask:

    def __init__(self, task_type: TaskType, args: tuple, kwargs: dict):
        self.task_type = task_type
        self.args = args
        self.kwargs = kwargs


class StructuredOutputResult:

    def __init__(self,
                 task_type: TaskType,
                 result: Any,
                 error: Optional[Exception] = None):
        self.task_type = task_type
        self.result = result
        self.error = error
Collaborator

nit: these could be made into dataclasses.
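
A possible shape for that change, as a minimal sketch: the field names and types follow the snippet under review, while the TaskType enum here is a placeholder because its real members are not shown in this thread.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Optional


class TaskType(Enum):
    """Placeholder enum; the PR's actual members are not shown here."""
    EXAMPLE = auto()


@dataclass
class StructuredOutputTask:
    task_type: TaskType
    args: tuple
    kwargs: dict


@dataclass
class StructuredOutputResult:
    task_type: TaskType
    result: Any
    error: Optional[Exception] = None


task = StructuredOutputTask(TaskType.EXAMPLE, (1, 2), {"flag": True})
outcome = StructuredOutputResult(TaskType.EXAMPLE, result="ok")
```

The generated __init__ matches the hand-written constructors, and the classes also gain __repr__ and __eq__, which tends to help when debugging messages passed between processes.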

@vadiklyutiy
Contributor Author

merging after 0.11.x

@russellb @aarnphm
No problem waiting for the release.
In the meantime, could you please run the full CI to make sure there are no unexpected failures?

@russellb russellb added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 26, 2025
@russellb
Member

merging after 0.11.x

@russellb @aarnphm No problem waiting for the release. In the meantime, could you please run the full CI to make sure there are no unexpected failures?

good idea. I just triggered full CI to run.

Collaborator

@WoosukKwon WoosukKwon left a comment

Thanks for the PR and review! I'd like to review this PR over the weekend. Please hold off on merging until then.

 - fix a bug with dp

Signed-off-by: Vadim Gimpelson <[email protected]>

#suppress-bc-linter
fix doc build

Signed-off-by: Vadim Gimpelson <[email protected]>

#suppress-bc-linter
Signed-off-by: Vadim Gimpelson <[email protected]>
@simon-mo
Collaborator

simon-mo commented Oct 3, 2025

@WoosukKwon bump on review! thank you

Signed-off-by: Vadim Gimpelson <[email protected]>
Signed-off-by: Vadim Gimpelson <[email protected]>
@simon-mo
Collaborator

simon-mo commented Oct 6, 2025

Now v0.11.0 has been released, ready to go?

@vadiklyutiy
Contributor Author

@simon-mo
I am always ready :)

@russellb
Member

russellb commented Oct 7, 2025

Now v0.11.0 has been released, ready to go?

waiting on @WoosukKwon 's approval

@hmellor hmellor changed the title from "[PERF] Async guided decoding" to "[PERF] Async structured outputs" Oct 7, 2025
@mergify

mergify bot commented Oct 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vadiklyutiy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 7, 2025
@hmellor
Member

hmellor commented Oct 7, 2025

@vadiklyutiy if you've not seen it already, please check out https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1759663228844749 for instructions on resolving the merge conflicts

@vadiklyutiy
Contributor Author

@hmellor thank you for pointing out this post.

But over the last 1.5 months I have resolved merge conflicts in this PR 7-8 times. It is not always simple, and it takes time.

I would like to hear that we are going ahead with this PR, and after that I will resolve everything. I hope for your understanding.

@vadiklyutiy
Contributor Author

@WoosukKwon could you kindly clarify whether you are going to review this PR?

@hmellor
Member

hmellor commented Oct 9, 2025

No problem @vadiklyutiy let's wait for Woosuk's response.

I just wanted to share the instructions for if you do need them.

@vadiklyutiy
Contributor Author

No problem @vadiklyutiy let's wait for Woosuk's response.

I just wanted to share the instructions for if you do need them.

Yes, it seems that without these instructions the merge would be a pain. Thanks for sharing.

@vadiklyutiy
Contributor Author

We had a meeting.
The decision is "no go" for this approach.

Labels

ci/build, needs-rebase, ready (ONLY add when PR is ready to merge/full CI is needed), structured-output, tpu (Related to Google TPUs), v1

Projects

Status: Done

6 participants