@neonwatty neonwatty commented Dec 15, 2025

This PR aims to fix #1357

The Problem

#1357 reports a problem I've run into as well: when using `chunk_length_s: 30` with `return_timestamps: "word"`, timestamps can exceed the actual audio duration. For example, a 60s audio file produces timestamps up to 69.98s.

**Root cause:** digging in, here's what I found: the model outputs timestamps up to ~29.98s (the maximum representable value, given `time_precision = 30/1500 = 0.02`). For final chunks shorter than 30s, these raw timestamps are added to the accumulated `time_offset`, causing the overflow.
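A minimal sketch of the arithmetic (the 60s duration and 69.98s result come from the report above; the 40s offset and 20s final-chunk length are illustrative values, since the exact offsets depend on the chunk/stride settings):

```javascript
// Whisper emits timestamp tokens quantized to time_precision seconds.
const time_precision = 30 / 1500; // 0.02s per timestamp token
const max_raw = 1499 * time_precision; // ~29.98s, the largest representable raw timestamp

// Suppose the final chunk of a 60s file starts at an accumulated offset of
// 40s but only contains 20s of audio (illustrative numbers).
const time_offset = 40;
const chunk_len = 20;

// Without clamping, a raw timestamp near the 30s boundary overflows the audio:
const unclamped = time_offset + max_raw; // ~69.98s, past the 60s duration

// With the clamp proposed in this PR, the result stays within the audio:
const clamped = time_offset + Math.min(max_raw, chunk_len); // 60s
```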

Proposed Solution

I think this simple solution works: clamp word-level timestamps to the actual `chunk_len` from the stride metadata before applying `time_offset`:

```js
if (current_chunk_len !== null) {
    raw_start = Math.min(raw_start, current_chunk_len);
    raw_end = Math.min(raw_end, current_chunk_len);
}
```
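To show the clamp in context, here is a hypothetical standalone version (the function and field names are mine for illustration; in the PR the logic lives inside the tokenizer's timestamp post-processing):

```javascript
// Hypothetical helper: clamp raw per-chunk word timestamps to the chunk's
// actual length before adding the accumulated time offset.
function toAbsoluteTimestamps(words, time_offset, current_chunk_len) {
  return words.map(({ text, raw_start, raw_end }) => {
    if (current_chunk_len !== null) {
      raw_start = Math.min(raw_start, current_chunk_len);
      raw_end = Math.min(raw_end, current_chunk_len);
    }
    return { text, start: time_offset + raw_start, end: time_offset + raw_end };
  });
}

// A final 20s chunk starting at offset 40s; one word's raw end has drifted
// to the 29.98s boundary the model can emit.
const out = toAbsoluteTimestamps(
  [{ text: 'goodbye', raw_start: 19.5, raw_end: 29.98 }],
  40,
  20,
);
// out[0].end is clamped to 60 instead of overflowing to 69.98
```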

Why not match Python's approach?

There's a similar reported issue upstream in Python transformers; the fix there crops the cross-attention matrices before DTW alignment (PR #25607).

To my knowledge, in transformers.js we receive pre-computed `token_timestamps` from the ONNX models, so we cannot modify the DTW computation.

Hence clamping at the tokenizer level seems like the appropriate fix here.

Testing


Clamp word-level timestamps to the actual `chunk_len` to prevent timestamps from exceeding the audio duration when the model outputs timestamps near the 30s boundary for shorter final chunks.


Successfully merging this pull request may close these issues.

whisper-large-v3-turbo_timestamped has broken timestamps