Whisper one-shot enc+dec path treats mel frames (3000) as “encoder length”, causing length assertion at 1500 and internal broadcast shape errors

Environment

Triton Server: nvcr.io/nvidia/tritonserver:25.08-trtllm-python-py3

TRT-LLM version inside server: 1.1.0rc1

Engines built with TRT-LLM: 1.1.0rc1 (also tried 1.1.0rc5 in devel)

Model: Whisper large-v3-turbo (TRT-LLM encoder+decoder)

Expected

For Whisper, input mel length is 3000 (30s @ hop=160), but encoder internal length is 1500 due to conv stride=2.

The one-shot enc+dec path should accept encoder_input_features with shape [B,3000,128] and input_lengths=1500, and route to decoder without errors (like the official Session-based flow).


Actual

Case A (encoder max_seq_len=1500, default):

[TensorRT-LLM][ERROR] Assertion failed:
Encoder length (3000) exceeds maximum encoder input length (1500).


Case B (encoder forced to max_seq_len=3000):

Invalid input shape ... ELEMENTWISE_SUM:
Broadcast incompatible: 6000 != 3000 ...


(internal shape mismatch likely between position embedding and conv outputs)

Repro Steps
trtllm-build \
  --checkpoint_dir /ws/tllm_ckpt_whisper/encoder \
  --output_dir     /ws/engines/whisper_large_v3_turbo/encoder \
  --max_batch_size 16 \
  --max_input_len  1500 \
  --bert_attention_plugin float16
  
trtllm-build \
  --checkpoint_dir /ws/tllm_ckpt_whisper/decoder \
  --output_dir     /ws/engines/whisper_large_v3_turbo/decoder \
  --max_batch_size 16 \
  --max_encoder_input_len 1500 \
  --max_input_len 224 \
  --max_seq_len 448 \
  --max_beam_width 5 \
  --gather_generation_logits \
  --logits_dtype float16 \
  --gpt_attention_plugin float16 \
  --kv_cache_type paged \
  --gemm_plugin float16  


Build encoder with --max_input_len 1500 (or leave deduced), decoder with --max_encoder_input_len 1500.

In Triton tensorrtllm model, send:

encoder_input_features: [B,3000,128] FP16

input_lengths: 1500

Observe assertion error above.

Rebuild encoder with --max_seq_len 3000 to bypass assertion, observe internal broadcast error.

Notes

Session-based flow (run_whisper.py) works: encoder accepts mel 3000 and returns encoder_output_lengths=1500 to decoder – no issues.

The one-shot path seems to validate “encoder length” against mel length instead of downsampled length.

Questions

Is this a known limitation for the one-shot enc+dec path with Whisper?

Can the length validation and internal shape setup be updated to account for encoder stride (×2 downsampling) so that the one-shot path matches the Session-based flow?

Any recommended flags or configuration to make one-shot work without external mel downsampling?









Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Whisper one-shot enc+dec path treats mel frames (3000) as “encoder length”, causing length assertion at 1500 and internal broadcast shape errors #804

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Whisper one-shot enc+dec path treats mel frames (3000) as “encoder length”, causing length assertion at 1500 and internal broadcast shape errors #804

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions