Description
Your current environment
==============================
System Info
==============================
OS : macOS 15.5 (arm64)
GCC version : Could not collect
Clang version : 17.0.0 (clang-1700.0.13.5)
CMake version : Could not collect
Libc version : N/A
==============================
PyTorch Info
==============================
PyTorch version : 2.7.0
Is debug build : False
CUDA used to build PyTorch : None
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.12.10 (main, Apr 8 2025, 11:35:47) [Clang 16.0.0 (clang-1600.0.26.6)] (64-bit runtime)
Python platform : macOS-15.5-arm64-arm-64bit
==============================
CUDA / GPU Info
==============================
Is CUDA available : False
CUDA runtime version : No CUDA
CUDA_MODULE_LOADING set to : N/A
GPU models and configuration : No CUDA
Nvidia driver version : No CUDA
cuDNN version : No CUDA
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Apple M3 Max
==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] pyzmq==26.4.0
[pip3] torch==2.7.0
[pip3] torchaudio==2.7.0
[pip3] torchvision==0.22.0
[pip3] transformers==4.52.4
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.9.2.dev44+gc742438f8 (git sha: c742438f8)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
==============================
Environment Variables
==============================
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
How would you like to use vllm
I'd like to run Whisper model inference with vLLM. To support transcribing long audio files, I need the Whisper model to return timestamps so that the output can be chunked and then merged. My code looks like the following:
from vllm import LLM, SamplingParams

model_name = "/Users/xxx/.cache/huggingface/hub/models--openai--whisper-base/snapshots/e37978b90ca9030d5170a5c07aadb050351a65bb"
llm = LLM(model_name, ...)

# Encoder gets the audio, decoder gets the Whisper prompt tokens.
decoder_prompt = "<|startoftranscript|>"
input_prompts = [
    {
        "encoder_prompt": {
            "prompt": "",
            "multi_modal_data": {"audio": audio_data},
        },
        "decoder_prompt": decoder_prompt,
    }
]

sampling_params = SamplingParams(decode_with_timestamps=True, ...)
output = llm.generate(input_prompts, sampling_params)
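For reference, Whisper represents timestamps as special tokens such as <|0.00|>, <|0.02|>, ..., interleaved with the text, so if they were being generated they should show up in the decoded output. This is roughly how I intend to consume them; the helper below is my own sketch, not part of vLLM:

import re

# My own sketch (not a vLLM API): split a transcript such as
# "<|0.00|> Hello world.<|4.20|>" into (start, end, text) chunks for merging.
def parse_timestamp_chunks(text):
    pieces = re.split(r"<\|(\d+\.\d+)\|>", text)
    # pieces alternate text, time, text, time, ... because of the capture group
    chunks = []
    for i in range(1, len(pieces) - 2, 2):
        start, segment, end = float(pieces[i]), pieces[i + 1], float(pieces[i + 2])
        if segment.strip():
            chunks.append((start, end, segment.strip()))
    return chunks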
Even though the model's generation_config.json has "return_timestamps": true and decode_with_timestamps=True is passed to SamplingParams, the output doesn't contain any timestamps, just the plain transcript text. I even tried adding WhisperTimeStampLogitsProcessor from transformers as an additional entry in logits_processors in SamplingParams, but unfortunately that didn't work either.
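In case it helps, this is roughly how I tried to plug it in. vLLM's per-request logits_processors expect a callable taking the generated token IDs and a logits tensor for a single sequence, while the HF processor operates on batched (input_ids, scores) tensors, so I wrapped it. The adapter class, the begin_index value, and the omission of my other sampling options are my own; this is an untested sketch:

import torch
from transformers import GenerationConfig, WhisperTimeStampLogitsProcessor

# Wrap the batched HF processor so it matches vLLM's per-sequence
# logits_processors signature: (generated_token_ids, logits) -> logits.
class TimestampProcessorAdapter:
    def __init__(self, gen_config: GenerationConfig, begin_index: int = 1):
        # begin_index is the position of the first freely generated token;
        # this value is a guess and may need adjusting for the Whisper prompt.
        self._proc = WhisperTimeStampLogitsProcessor(gen_config, begin_index=begin_index)

    def __call__(self, token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
        input_ids = torch.tensor([token_ids], dtype=torch.long)
        return self._proc(input_ids, logits.unsqueeze(0)).squeeze(0)

gen_config = GenerationConfig.from_pretrained(model_name)
sampling_params = SamplingParams(
    logits_processors=[TimestampProcessorAdapter(gen_config)],
)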
According to the HF transformers documentation for Whisper, return_timestamps can be passed to its pipeline (which doesn't exist in vLLM):
from transformers import pipeline

# Build the HF automatic-speech-recognition pipeline for the same Whisper model.
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-base")

generate_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}
result = pipe(sample, generate_kwargs=generate_kwargs)  # `sample` is the raw audio input
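With return_timestamps=True, the pipeline returns chunk-level timestamps alongside the full text, which is exactly the kind of output I'm after. The values below are invented; only the structure (as documented for the transformers ASR pipeline) matters:

# result looks roughly like:
# {
#     "text": " Hello world. How are you?",
#     "chunks": [
#         {"timestamp": (0.0, 1.5), "text": " Hello world."},
#         {"timestamp": (1.5, 3.2), "text": " How are you?"},
#     ],
# }
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start} - {end}] {chunk['text']}")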
I'm just wondering how to get Whisper to output timestamps with vLLM. Any pointers are welcome! Thanks!
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.