Fix word-level timestamp overflow in Whisper chunked transcription #1486
This PR aims to fix #1357
The Problem
#1357 reports a problem I've also hit in practice: when using chunk_length_s: 30 with return_timestamps: "word", timestamps can exceed the actual audio duration. For example, a 60s audio file produces timestamps up to 69.98s.
Root cause: Digging in, here's what I found. The model outputs timestamps up to ~29.98s (the maximum representable value given time_precision = 30/1500 = 0.02). For final chunks shorter than 30s, these raw timestamps are added to the accumulated time_offset, causing overflow.
Proposed Solution
I think this simple solution works: clamp word-level timestamps to the actual chunk_len from stride metadata before applying time_offset.
Why not match Python's approach?
There's a similar reported issue upstream in Python transformers; the fix there crops the cross-attention matrices before DTW alignment (PR #25607).
To my knowledge, here in JS we receive pre-computed token_timestamps from the ONNX models, so we cannot modify the DTW computation. Hence clamping at the tokenizer level seems to be the appropriate fix here.
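To make the clamping idea concrete, here is a minimal sketch. It is not the actual diff from this PR: offsetWordTimestamps, the word object shape, and the example values are hypothetical and only illustrate the approach. It also assumes the final chunk of the 60s file spans 40s to 60s (so chunk_len is 20 and time_offset is 40), which is consistent with the 69.98s figure above but is an assumption about the stride in use.

```javascript
// Hypothetical helper (illustration only, not the library's real code):
// clamp each raw per-chunk timestamp to the chunk's true length, then
// shift it by the offset accumulated from earlier chunks.
function offsetWordTimestamps(words, chunkLen, timeOffset) {
  return words.map(({ text, timestamp: [start, end] }) => ({
    text,
    timestamp: [
      Math.min(start, chunkLen) + timeOffset,
      Math.min(end, chunkLen) + timeOffset,
    ],
  }));
}

// Assumed final chunk: 20 s of real audio starting at 40 s.
const chunkLen = 20;
const timeOffset = 40;

// The model can still emit values up to ~29.98 s for this short chunk.
const rawWords = [{ text: " last", timestamp: [19.5, 29.98] }];

// Without clamping, the end time lands at roughly 69.98 s, past the audio end.
const unclamped = rawWords.map((w) => w.timestamp[1] + timeOffset);

// With clamping, the end time is capped at 20 + 40 = 60 s.
const clamped = offsetWordTimestamps(rawWords, chunkLen, timeOffset);

console.log(unclamped);
console.log(clamped[0].timestamp); // [59.5, 60]
```

Clamping before adding the offset keeps the change local to each chunk: timestamps that already fall inside the chunk pass through untouched, so only the overshooting values are affected.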
Testing
return_timestamps: true) unaffected