Motivation.
Today, the core loop looks like:
```python
while True:
    scheduler_output = self.scheduler.schedule()
    model_runner_output = self.model_executor.execute_model(scheduler_output)
    engine_core_outputs = self.scheduler.update_from_output(scheduler_output, model_runner_output)
    yield engine_core_outputs
```

While simple, this structure doesn't allow us to utilize CPU cycles while the GPU is running the model in execute_model.
Proposed Change.
The proposal is to carve out the sampling stage from the execute_model method. This way, execute_model becomes a non-blocking call that returns without any GPU->CPU synchronization at the end. That is:
```python
while True:
    scheduler_output = self.scheduler.schedule()
    # Prepare inputs and execute the model up to the last hidden states. This is non-blocking.
    self.model_executor.execute_model(scheduler_output)
    # If structured outputs are used, produce the bitmask here. Otherwise, the bitmask is None.
    bitmask = self.scheduler.get_grammar_bitmask(scheduler_output)
    # Prepare sampling metadata and sample the next token ids. This is blocking.
    model_runner_output = self.model_executor.sample(bitmask)
    engine_core_outputs = self.scheduler.update_from_output(scheduler_output, model_runner_output)
    yield engine_core_outputs
```

This gives two performance benefits:
- (On the worker side) We can overlap the construction of the sampling metadata (and logits processors) with execute_model; see the sketch after this list.
- (On the scheduler side) We can overlap the construction of the grammar bitmask with execute_model.
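To make the overlap concrete, here is a minimal, self-contained sketch of the split on the worker side. It is not vLLM's actual executor API; ToyExecutor, its tensor shapes, and the way the bitmask is built are all illustrative. The point is only that, because CUDA kernels are launched asynchronously, execute_model can return as soon as the forward pass is enqueued, and any CPU work done before sample() touches the results is hidden behind the forward pass.

```python
from typing import Optional

import torch


class ToyExecutor:
    """Illustrative stand-in for the worker, NOT vLLM's real executor."""

    def __init__(self, vocab_size: int = 4096):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.vocab_size = vocab_size
        # A single big matmul stands in for the model's forward pass.
        self.weight = torch.randn(vocab_size, vocab_size, device=self.device)
        self._logits = None

    def execute_model(self, batch_size: int) -> None:
        # Enqueue the forward pass. CUDA launches are asynchronous, so this
        # returns almost immediately -- no GPU->CPU synchronization here.
        x = torch.randn(batch_size, self.vocab_size, device=self.device)
        self._logits = x @ self.weight

    def sample(self, temperatures: list[float],
               bitmask: Optional[torch.Tensor] = None) -> list[int]:
        # CPU-side preparation (the analogue of sampling metadata / logits
        # processors) runs while the GPU is still busy with the forward pass.
        temps = torch.tensor(temperatures).unsqueeze(1).to(self.device)
        logits = self._logits / temps
        if bitmask is not None:
            # Mask out disallowed tokens (e.g., for structured outputs).
            logits = logits.masked_fill(~bitmask.to(self.device), float("-inf"))
        probs = torch.softmax(logits, dim=-1)
        sampled = torch.multinomial(probs, num_samples=1)
        # The only GPU->CPU synchronization of the whole step:
        return sampled.flatten().tolist()


executor = ToyExecutor()
executor.execute_model(batch_size=2)   # returns without waiting for the GPU

# This CPU work overlaps with the in-flight forward pass, e.g. building a
# grammar bitmask of allowed token ids for structured outputs.
bitmask = torch.zeros(2, executor.vocab_size, dtype=torch.bool)
bitmask[:, :100] = True                # pretend only tokens 0..99 are legal

print(executor.sample([0.8, 1.0], bitmask))   # blocks only here
```

In this sketch, the tolist() call in sample is the single synchronization point; everything executed between execute_model and that call, on either the worker or the scheduler side, stays off the critical path.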
For async scheduling, the core loop will look like:

```python
# Initial step: launch the first forward pass before entering the steady-state loop.
scheduler_output = self.scheduler.schedule()
self.model_executor.prepare_inputs(scheduler_output)
self.model_executor.execute_model()  # non-blocking: the forward pass is only enqueued
bitmask = self.scheduler.get_grammar_bitmask(scheduler_output)
prev_scheduler_output = scheduler_output

while True:
    # Schedule and prepare the next step while the GPU runs the previous forward pass.
    scheduler_output = self.scheduler.schedule()
    self.model_executor.prepare_inputs(scheduler_output)
    # Sample the tokens of the previous step; this is the blocking call.
    model_output = self.model_executor.sample(bitmask)
    # Launch the forward pass for the newly scheduled step (non-blocking).
    self.model_executor.execute_model()
    # Build the new step's bitmask on the CPU while that forward pass runs.
    bitmask = self.scheduler.get_grammar_bitmask(scheduler_output)
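    # model_output corresponds to prev_scheduler_output: sample() above returned
    # the tokens of the forward pass launched in the previous iteration, so the
    # scheduler state is updated one step behind the newly launched forward pass.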
    engine_core_outputs = self.scheduler.update_from_output(prev_scheduler_output, model_output)
    prev_scheduler_output = scheduler_output
```

Feedback Period.
No response
CC List.
No response
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.