
Conversation

rahul-tuli
Contributor

@rahul-tuli rahul-tuli commented Sep 19, 2025

This PR fixes two critical issues with speculators models, which were running without speculative decoding enabled, and enables users to combine engine-level arguments (like --tensor-parallel-size, --seed, --max-model-len) with speculators models using a simplified command syntax.

Problem

Previously, there were two issues with speculators models:

  1. Speculators models were running but NOT using speculative decoding - They were treated as regular models, missing the performance benefits of speculative decoding entirely. This happened because, while the verifier model and tokenizer were updated, the speculative config was never initialized
  2. Engine-level arguments were incompatible - Users had to use verbose commands with explicit speculative configuration:
VLLM_USE_V1=1 vllm serve "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-dynamic" \
    --seed 42 \
    --tensor-parallel-size 4 \
    --speculative-config '{"model": "nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717-quantized", "num_speculative_tokens": 3, "method":"eagle3"}'

The speculators model detection was happening too late in the configuration pipeline (in ModelConfig), which meant:

  • The speculative configuration was never properly initialized
  • Engine-level arguments couldn't be applied to the correct models
  • Users lost the core benefit of using speculators models

Solution

This PR fixes both issues:

  1. Enables proper speculative decoding - Speculators models now initialize the speculative configuration and run with the intended performance optimizations
  2. Simplified syntax with engine-level arguments - Users can now use the clean syntax:
vllm serve --seed 42 --tensor-parallel-size 4 "nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717-quantized"

The model will now correctly:

  • Detect that it's a speculators model
  • Extract the embedded speculative configuration
  • Use speculative decoding for improved performance
  • Apply engine-level arguments to both target and draft models
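
A minimal sketch of the detection-and-extraction flow described above (illustrative only: the helper names and the num_speculative_tokens parameter are simplifications of mine, with the field names taken from the diff discussed later in this thread, not the exact vLLM code):

def is_speculators_model(config_dict: dict) -> bool:
    # Speculators checkpoints embed their drafting setup under "speculators_config"
    return "speculators_config" in config_dict


def build_speculative_config(config_dict: dict, num_speculative_tokens: int) -> dict:
    """Translate the embedded speculators settings into a vLLM-style speculative config dict."""
    spec_config = config_dict["speculators_config"]
    vllm_config = {
        "method": config_dict.get("speculators_model_type"),      # e.g. "eagle3"
        "num_speculative_tokens": num_speculative_tokens,
        "target_model": spec_config["verifier"]["name_or_path"],  # the verifier is the target model
    }
    # Draft-transformer settings ride along with the speculative config
    vllm_config.update(config_dict.get("transformer_layer_config", {}))
    return vllm_config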

Implementation Details

Core Changes

  1. Moved speculators detection earlier: Relocated from ModelConfig to EngineArgs to ensure proper argument processing
  2. Improved function naming for readability:
    • convert_speculators_to_vllm → build_vllm_speculative_config
    • get_vllm_config_dict → extract_vllm_speculative_config, which now adds proper validation and algorithm-level updates
    • maybe_override_with_speculators_target_model → maybe_override_with_speculators
  3. Enhanced configuration merging: Engine-level CLI arguments now take precedence over embedded settings
  4. Backward compatibility: Maintains full compatibility with regular models and existing workflows

Technical Implementation

  • Automatic Detection: Detects speculators models by checking for embedded speculators_config
  • Configuration Extraction: Converts embedded speculative configuration to vLLM format
  • CLI Precedence: Engine-level arguments override embedded configuration values
  • Early Processing: Moves configuration resolution to EngineArgs.create_engine_config()
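
A rough illustration of the CLI-precedence rule (a sketch assuming both configs are plain dicts; the actual resolution happens inside EngineArgs.create_engine_config()):

def merge_speculative_config(embedded: dict, cli_overrides: dict) -> dict:
    """Engine-level CLI values win over values embedded in the checkpoint."""
    merged = dict(embedded)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged


# Example: the checkpoint embeds 3 speculative tokens, the CLI asks for 5
merged = merge_speculative_config(
    {"method": "eagle3", "num_speculative_tokens": 3},
    {"num_speculative_tokens": 5},
)
assert merged["num_speculative_tokens"] == 5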

Usage Examples

Basic Serve Command

export CUDA_VISIBLE_DEVICES=0,1
export VLLM_USE_V1=1
vllm serve \
    --host 127.0.0.1 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --seed 42 \
    --max-model-len 4096 \
    "nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717-quantized"

Test Request

curl -s \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "The capital of France is",
        "max_tokens": 10,
        "temperature": 0.7
    }' \
    "http://127.0.0.1:8000/v1/completions"

Benefits

  1. Actually Enables Speculative Decoding: Speculators models now run with the intended performance optimizations instead of as regular models
  2. Simplified UX: Users no longer need verbose speculative configuration
  3. Seamless Engine Integration: Engine-level arguments now work with speculators models out of the box

Files Modified

  • vllm/engine/arg_utils.py: Moved speculators detection logic
  • vllm/transformers_utils/config.py: Enhanced configuration resolution
  • vllm/transformers_utils/configs/speculators/base.py: Improved function naming and documentation
  • vllm/config/__init__.py: Removed old detection logic
  • tests/speculative_decoding/speculators/test_eagle3.py: Parameterized the tests and expanded them to check for the speculative config; they should catch future errors like these

Contributor Author


Changed to match vllm

Member

@mgoin mgoin left a comment


Looks reasonable to me, just a nit

Comment on lines -84 to +109
- # Build base vLLM config
+ # Build base vLLM speculative configuration
  vllm_config = {
      "method": config_dict.get("speculators_model_type"),
      "num_lookahead_tokens": num_lookahead_tokens,
      "num_speculative_tokens": num_speculative_tokens,
      "target_model": spec_config.get("verifier")["name_or_path"]
  }
- vllm_config.update(config_dict["transformer_layer_config"])
+
+ # Merge transformer layer configuration if present
+ transformer_config = config_dict.get("transformer_layer_config", {})
+ vllm_config.update(transformer_config)
Member


nit: we should validate that this is a valid SpeculativeConfig after construction

Contributor Author


We can validate at the engine args level, let me add that!

Contributor Author


On taking a deeper look, it would be non-trivial and sort of hacky to validate that here, since the create_speculative_config method adds the target model config and other things before initializing the SpeculativeConfig. I think it's fine to fail at that level for now?

@mgoin mgoin added the speculative-decoding and ready labels Sep 19, 2025

mergify bot commented Sep 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rahul-tuli.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 19, 2025
@@ -19,7 +19,6 @@ def test_llama(vllm_runner, example_prompts, model_path, monkeypatch):

Contributor


Do you think we can add a better non-smoke test to this PR to prevent something like this in the future?

Contributor Author


Updated this test itself to check for the speculative config and other things; the tests should be fine now and would catch errors like these in the future!

Collaborator

@aarnphm aarnphm left a comment


This change makes sense to me. Thanks for this.

@aarnphm
Collaborator

aarnphm commented Sep 19, 2025

@rahul-tuli there are merge conflicts? can you address this as well?

@rahul-tuli rahul-tuli force-pushed the feat/fix-speculators-model-support branch from d8df1d1 to b91f7df on September 21, 2025 10:53
@mergify mergify bot removed the needs-rebase label Sep 21, 2025
@rahul-tuli rahul-tuli marked this pull request as draft September 21, 2025 11:14
This commit enables users to combine engine-level arguments (like
--tensor-parallel-size, --seed, --max-model-len) with speculators models
using simplified command syntax.

Changes:
- Move speculators detection from ModelConfig to EngineArgs for earlier processing
- Refactor speculators config extraction with improved function naming:
  - convert_speculators_to_vllm → build_vllm_speculative_config
  - get_vllm_config_dict → extract_vllm_speculative_config
  - maybe_override_with_speculators_target_model → maybe_override_with_speculators
- Enhance test coverage to verify speculative config initialization
- Add comprehensive documentation and error handling
- Remove debug logging from production code
- Apply consistent code formatting per project standards

Users can now use simplified syntax like:
vllm serve --seed 42 --tensor-parallel-size 4 "speculators-model-name"

Instead of the verbose explicit configuration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>
@rahul-tuli rahul-tuli force-pushed the feat/fix-speculators-model-support branch from aa6b3ec to 125a424 on September 21, 2025 11:16
- Replace corrupted config/__init__.py with clean version from main
- Combine test_llama and test_qwen into single parameterized test
- Add descriptive test IDs: llama3-eagle3-speculator, qwen3-eagle3-speculator
- Fix inconsistent property access and enhance test documentation
- Verify speculative config initialization and text generation
- Apply formatting fixes from pre-commit hooks

Signed-off-by: Rahul Tuli <[email protected]>
@rahul-tuli rahul-tuli force-pushed the feat/fix-speculators-model-support branch from 125a424 to 8e10e4b on September 21, 2025 12:31
@rahul-tuli rahul-tuli marked this pull request as ready for review September 21, 2025 12:35
@rahul-tuli
Contributor Author

> @rahul-tuli there are merge conflicts? can you address this as well?

Addressed! @aarnphm

@mgoin mgoin merged commit c438b29 into vllm-project:main Sep 21, 2025
44 checks passed
minosfuture pushed a commit to minosfuture/vllm that referenced this pull request Sep 21, 2025
kingsmad pushed a commit to kingsmad/vllm that referenced this pull request Sep 22, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Rahul Tuli <[email protected]>
Co-authored-by: Claude <[email protected]>
Signed-off-by: yewentao256 <[email protected]>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025