
Conversation

rahul-tuli
Contributor

@rahul-tuli rahul-tuli commented Sep 19, 2025

This PR fixes two critical issues with speculators models, which were running without speculative decoding enabled, and enables users to combine engine-level arguments (like --tensor-parallel-size, --seed, --max-model-len) with speculators models using a simplified command syntax.

Problem

Previously, there were two issues with speculators models:

  1. Speculators models were running but NOT using speculative decoding - They were treated as regular models, missing the performance benefits of speculative decoding entirely. This happened because, while the verifier model and tokenizer were updated, the speculative config was never initialized
  2. Engine-level arguments were incompatible - Users had to use verbose commands with explicit speculative configuration:
VLLM_USE_V1=1 vllm serve "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-dynamic" \
    --seed 42 \
    --tensor-parallel-size 4 \
    --speculative-config '{"model": "nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717-quantized", "num_speculative_tokens": 3, "method":"eagle3"}'

The speculators model detection was happening too late in the configuration pipeline (in ModelConfig), which meant:

  • The speculative configuration was never properly initialized
  • Engine-level arguments couldn't be applied to the correct models
  • Users lost the core benefit of using speculators models

Solution

This PR fixes both issues:

  1. Enables proper speculative decoding - Speculators models now initialize the speculative configuration and run with the intended performance optimizations
  2. Simplified syntax with engine-level arguments - Users can now use the clean syntax:
vllm serve --seed 42 --tensor-parallel-size 4 "nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717-quantized"

The model will now correctly:

  • Detect that it's a speculators model
  • Extract the embedded speculative configuration
  • Use speculative decoding for improved performance
  • Apply engine-level arguments to both target and draft models
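
A minimal sketch of the detection-and-extraction flow described above (illustrative only: the helper names and the num_speculative_tokens parameter are simplifications of mine, with the field names taken from the diff discussed later in this thread, not the exact vLLM code):

def is_speculators_model(config_dict: dict) -> bool:
    # Speculators checkpoints embed their drafting setup under "speculators_config"
    return "speculators_config" in config_dict


def build_speculative_config(config_dict: dict, num_speculative_tokens: int) -> dict:
    """Translate the embedded speculators settings into a vLLM-style speculative config dict."""
    spec_config = config_dict["speculators_config"]
    vllm_config = {
        "method": config_dict.get("speculators_model_type"),      # e.g. "eagle3"
        "num_speculative_tokens": num_speculative_tokens,
        "target_model": spec_config["verifier"]["name_or_path"],  # the verifier is the target model
    }
    # Draft-transformer settings ride along with the speculative config
    vllm_config.update(config_dict.get("transformer_layer_config", {}))
    return vllm_config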

Implementation Details

Core Changes

  1. Moved speculators detection earlier: Relocated from ModelConfig to EngineArgs to ensure proper argument processing
  2. Improved function naming for readability:
    • convert_speculators_to_vllm → build_vllm_speculative_config
    • get_vllm_config_dict → extract_vllm_speculative_config, which now adds proper validation and algorithm-level updates
    • maybe_override_with_speculators_target_model → maybe_override_with_speculators
  3. Enhanced configuration merging: Engine-level CLI arguments now take precedence over embedded settings
  4. Backward compatibility: Maintains full compatibility with regular models and existing workflows

Technical Implementation

  • Automatic Detection: Detects speculators models by checking for embedded speculators_config
  • Configuration Extraction: Converts embedded speculative configuration to vLLM format
  • CLI Precedence: Engine-level arguments override embedded configuration values
  • Early Processing: Moves configuration resolution to EngineArgs.create_engine_config()
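
A rough illustration of the CLI-precedence rule (a sketch assuming both configs are plain dicts; the actual resolution happens inside EngineArgs.create_engine_config()):

def merge_speculative_config(embedded: dict, cli_overrides: dict) -> dict:
    """Engine-level CLI values win over values embedded in the checkpoint."""
    merged = dict(embedded)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged


# Example: the checkpoint embeds 3 speculative tokens, the CLI asks for 5
merged = merge_speculative_config(
    {"method": "eagle3", "num_speculative_tokens": 3},
    {"num_speculative_tokens": 5},
)
assert merged["num_speculative_tokens"] == 5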

Usage Examples

Basic Serve Command

export CUDA_VISIBLE_DEVICES=0,1
export VLLM_USE_V1=1
vllm serve \
    --host 127.0.0.1 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --seed 42 \
    --max-model-len 4096 \
    "nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717-quantized"

Test Request

curl -s \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "The capital of France is",
        "max_tokens": 10,
        "temperature": 0.7
    }' \
    "http://127.0.0.1:8000/v1/completions"

Benefits

  1. Actually Enables Speculative Decoding: Speculators models now run with the intended performance optimizations instead of as regular models
  2. Simplified UX: Users no longer need verbose speculative configuration
  3. Seamless Engine Integration: Engine-level arguments now work with speculators models out of the box

Files Modified

  • vllm/engine/arg_utils.py: Moved speculators detection logic
  • vllm/transformers_utils/config.py: Enhanced configuration resolution
  • vllm/transformers_utils/configs/speculators/base.py: Improved function naming and documentation
  • vllm/config/__init__.py: Removed old detection logic
  • tests/speculative_decoding/speculators/test_eagle3.py: Parameterized the tests and expanded them to check for the speculative config; they should catch future errors like these

Contributor Author


Changed to match vllm

Member

@mgoin mgoin left a comment


Looks reasonable to me, just a nit

Comment on lines -84 to +109
- # Build base vLLM config
+ # Build base vLLM speculative configuration
  vllm_config = {
      "method": config_dict.get("speculators_model_type"),
      "num_lookahead_tokens": num_lookahead_tokens,
      "num_speculative_tokens": num_speculative_tokens,
      "target_model": spec_config.get("verifier")["name_or_path"]
  }
- vllm_config.update(config_dict["transformer_layer_config"])
+
+ # Merge transformer layer configuration if present
+ transformer_config = config_dict.get("transformer_layer_config", {})
+ vllm_config.update(transformer_config)
Member


nit: we should validate that this is a valid SpeculativeConfig after construction

Contributor Author


We can validate at the engine args level, let me add that!

Contributor Author


On taking a deeper look, it would be non-trivial and sort of hacky to validate that here, since the create_speculative_config method adds the target model config and other things before initializing the SpeculativeConfig. I think it's fine to fail at that level for now?

@mgoin mgoin added the speculative-decoding and ready labels Sep 19, 2025

mergify bot commented Sep 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @rahul-tuli.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 19, 2025
@@ -19,7 +19,6 @@ def test_llama(vllm_runner, example_prompts, model_path, monkeypatch):

Contributor


Do you think we can add a better non-smoke test to this PR to prevent something like this in the future?

Contributor Author


Updated this test itself to check for the speculative config and other things; the tests should be fine now and would catch errors like these in the future!

Collaborator

@aarnphm aarnphm left a comment


This change makes sense to me. Thanks for this.

@aarnphm
Collaborator

aarnphm commented Sep 19, 2025

@rahul-tuli there are merge conflicts? can you address this as well?

@rahul-tuli rahul-tuli force-pushed the feat/fix-speculators-model-support branch from d8df1d1 to b91f7df on September 21, 2025 10:53
@mergify mergify bot removed the needs-rebase label Sep 21, 2025
@rahul-tuli rahul-tuli marked this pull request as draft September 21, 2025 11:14
This commit enables users to combine engine-level arguments (like
--tensor-parallel-size, --seed, --max-model-len) with speculators models
using simplified command syntax.

Changes:
- Move speculators detection from ModelConfig to EngineArgs for earlier processing
- Refactor speculators config extraction with improved function naming:
  - convert_speculators_to_vllm → build_vllm_speculative_config
  - get_vllm_config_dict → extract_vllm_speculative_config
  - maybe_override_with_speculators_target_model → maybe_override_with_speculators
- Enhance test coverage to verify speculative config initialization
- Add comprehensive documentation and error handling
- Remove debug logging from production code
- Apply consistent code formatting per project standards

Users can now use simplified syntax like:
vllm serve --seed 42 --tensor-parallel-size 4 "speculators-model-name"

Instead of the verbose explicit configuration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>
@rahul-tuli rahul-tuli force-pushed the feat/fix-speculators-model-support branch from aa6b3ec to 125a424 on September 21, 2025 11:16
- Replace corrupted config/__init__.py with clean version from main
- Combine test_llama and test_qwen into single parameterized test
- Add descriptive test IDs: llama3-eagle3-speculator, qwen3-eagle3-speculator
- Fix inconsistent property access and enhance test documentation
- Verify speculative config initialization and text generation
- Apply formatting fixes from pre-commit hooks

Signed-off-by: Rahul Tuli <[email protected]>
@rahul-tuli rahul-tuli force-pushed the feat/fix-speculators-model-support branch from 125a424 to 8e10e4b on September 21, 2025 12:31
@rahul-tuli rahul-tuli marked this pull request as ready for review September 21, 2025 12:35
@rahul-tuli
Contributor Author

> @rahul-tuli there are merge conflicts? can you address this as well?

Addressed! @aarnphm

@mgoin mgoin merged commit c438b29 into vllm-project:main Sep 21, 2025
44 checks passed
minosfuture pushed a commit to minosfuture/vllm that referenced this pull request Sep 21, 2025
kingsmad pushed a commit to kingsmad/vllm that referenced this pull request Sep 22, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Rahul Tuli <[email protected]>
Co-authored-by: Claude <[email protected]>
Signed-off-by: yewentao256 <[email protected]>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025