@Tostino (Contributor) commented Oct 9, 2023

Here is a PR adding support for conversation templates defined as JSON files, which can be specified for a model when starting the API. Just create your template and pass its path with the --conversation-template my_template.json argument.

There was nowhere for me to add support for this in the vLLM API, so I only added it to the OpenAI API.

Added an example, and updated the quickstart section of the README (the only place that documented the other arguments).
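For illustration only, a minimal sketch of what such a template file might look like. The field names below (system_prompt, roles, separator) are hypothetical placeholders, not the schema this PR actually defines; see the PR's example file for the real format.

```python
# Hypothetical sketch: the field names here are illustrative only,
# not the schema defined by this PR's example template.
import json

template = {
    "system_prompt": "You are a helpful assistant.",          # assumed field
    "roles": {"user": "USER:", "assistant": "ASSISTANT:"},    # assumed field
    "separator": "\n",                                        # assumed field
}

# Write the template to disk so it can be passed to the server via
# --conversation-template my_template.json
with open("my_template.json", "w") as f:
    json.dump(template, f, indent=2)
```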

@Tostino (Contributor, Author) commented Oct 16, 2023

Going to cancel this PR and work on another that properly implements the HF chat templates within the tokenizer.
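For context, a short sketch of the Hugging Face chat-template mechanism the follow-up PR targets: the template ships with the tokenizer itself, so no external JSON file is needed. The model name below is only an example.

```python
# The chat template lives in the tokenizer; apply_chat_template renders
# a message list into the model's expected prompt format.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "Hello, how are you?"},
]

# tokenize=False returns the rendered prompt string;
# add_generation_prompt=True appends the assistant turn prefix.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```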

@Tostino closed this Oct 16, 2023
amy-why-3459 pushed a commit to amy-why-3459/vllm that referenced this pull request Sep 15, 2025
### What this PR does / why we need it?

1. Add the MTP dummy_run, and adapt the main model's dummy_run when MTP is enabled
2. Adapt the main model's torchair graph mode when MTP is enabled
3. The MTP model's torchair graph mode will be supported in the future

### Does this PR introduce _any_ user-facing change?


### How was this patch tested?

This patch is tested by
`vllm-ascend/tests/long_term/spec_decode_v1/test_v1_mtp_correctness.py`


### Usage
Online example:
```shell
export VLLM_USE_V1=1
export VLLM_ENABLE_MC2=1
export VLLM_VERSION=0.9.1
export ASCEND_LAUNCH_BLOCKING=0

python -m vllm.entrypoints.openai.api_server \
 --model="/model_weight_path" \
 --trust-remote-code \
 --max-model-len 40000 \
 --tensor-parallel-size 4 \
 --data_parallel_size 4 \
 --enable_expert_parallel \
 --served-model-name deepseekr1 \
 --quantization ascend \
 --host 0.0.0.0 \
 --port 1234 \
 --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
 --additional-config '{"ascend_scheduler_config":{"enabled":true,"enable_chunked_prefill":false},"torchair_graph_config":{"enabled":true}}' \
 --gpu_memory_utilization 0.95
```
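Once the server is up, it exposes the standard OpenAI-compatible endpoints. A minimal smoke test, assuming the server is reachable on localhost:1234 with the served model name deepseekr1 from the command above:

```python
# Query the OpenAI-compatible /v1/completions endpoint of the server
# started above (host/port/model name taken from the launch command).
import requests

resp = requests.post(
    "http://localhost:1234/v1/completions",
    json={
        "model": "deepseekr1",
        "prompt": "The capital of France is",
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["text"])
```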

Offline example:
```python
from vllm import LLM

lm = LLM(
    model="/model_weight_path",
    tensor_parallel_size=16,
    max_num_seqs=128,
    gpu_memory_utilization=0.95,
    distributed_executor_backend="mp",
    enable_expert_parallel=True,
    speculative_config={
        "method": "deepseek_mtp",
        "num_speculative_tokens": 1,
    },
    trust_remote_code=True,
    enforce_eager=False,
    additional_config={
        "torchair_graph_config": {
            "enabled": True,
            "enable_multistream_shared_expert": False,
        },
        "ascend_scheduler_config": {
            "enabled": True,
        },
    },
)
```
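A short usage note to go with the snippet above: batched generation then goes through the standard vLLM generate call, for example:

```python
# Run a batch through the LLM constructed above; SamplingParams is
# vLLM's standard sampling configuration object.
from vllm import SamplingParams

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = lm.generate(["Explain speculative decoding in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```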

Signed-off-by: xuyexiong <[email protected]>