bug: Whisper VAD - Token Timestamp Issue #3174

Open
sysulq opened this issue May 20, 2025 · 1 comment

sysulq commented May 20, 2025

I'm using the VAD feature with Whisper to transcribe audio, with the command below. In the generated JSON file, the token start times begin at 0, which doesn't match the timestamps of the segment they belong to.

./build/bin/whisper-cli -m /Users/pilot/.aicutpro/whisper_cpp/models/ggml-tiny.bin -f test.wav -of test -oj -pp -l en -t 4 -bo 5 -bs 5 --vad --vad-model vad_models/ggml-silero-v5.1.2.bin -fa -sow -ojf
	"transcription": [
		{
			"timestamps": {
				"from": "00:00:04,480",
				"to": "00:00:07,860"
			},
			"offsets": {
				"from": 4480,
				"to": 7860
			},
			"text": " I want to tell you what I see coming.",
			"tokens": [
				{
					"text": "[_BEG_]",
					"timestamps": {
						"from": "00:00:00,000",
						"to": "00:00:00,000"
					},
					"offsets": {
						"from": 0,
						"to": 0
					},
					"id": 50363,
					"p": 0.995989,
					"t_dtw": -1
				},
				{
					"text": " I",
					"timestamps": {
						"from": "00:00:00,070",
						"to": "00:00:00,070"
					},
					"offsets": {
						"from": 70,
						"to": 70
					},
					"id": 314,
					"p": 0.928097,
					"t_dtw": -1
				},
				{
					"text": " want",
					"timestamps": {
						"from": "00:00:00,130",
						"to": "00:00:00,370"
					},
					"offsets": {
						"from": 130,
						"to": 370
					},
					"id": 765,
					"p": 0.985233,
					"t_dtw": -1
				},
				{
					"text": " to",
					"timestamps": {
						"from": "00:00:00,370",
						"to": "00:00:00,520"
					},
					"offsets": {
						"from": 370,
						"to": 520
					},
					"id": 284,
					"p": 0.997866,
					"t_dtw": -1
				},
				{
					"text": " tell",
					"timestamps": {
						"from": "00:00:00,520",
						"to": "00:00:00,820"
					},
					"offsets": {
						"from": 520,
						"to": 820
					},
					"id": 1560,
					"p": 0.999005,
					"t_dtw": -1
				},
				{
					"text": " you",
					"timestamps": {
						"from": "00:00:00,820",
						"to": "00:00:01,040"
					},
					"offsets": {
						"from": 820,
						"to": 1040
					},
					"id": 345,
					"p": 0.996679,
					"t_dtw": -1
				},
				{
					"text": " what",
					"timestamps": {
						"from": "00:00:01,040",
						"to": "00:00:01,340"
					},
					"offsets": {
						"from": 1040,
						"to": 1340
					},
					"id": 644,
					"p": 0.993718,
					"t_dtw": -1
				},
				{
					"text": " I",
					"timestamps": {
						"from": "00:00:01,340",
						"to": "00:00:01,410"
					},
					"offsets": {
						"from": 1340,
						"to": 1410
					},
					"id": 314,
					"p": 0.993655,
					"t_dtw": -1
				},
				{
					"text": " see",
					"timestamps": {
						"from": "00:00:01,410",
						"to": "00:00:01,630"
					},
					"offsets": {
						"from": 1410,
						"to": 1630
					},
					"id": 766,
					"p": 0.997687,
					"t_dtw": -1
				},
				{
					"text": " coming",
					"timestamps": {
						"from": "00:00:01,630",
						"to": "00:00:02,080"
					},
					"offsets": {
						"from": 1630,
						"to": 2080
					},
					"id": 2406,
					"p": 0.995338,
					"t_dtw": -1
				},
				{
					"text": ".",
					"timestamps": {
						"from": "00:00:02,080",
						"to": "00:00:02,360"
					},
					"offsets": {
						"from": 2080,
						"to": 2360
					},
					"id": 13,
					"p": 0.919863,
					"t_dtw": -1
				},
				{
					"text": "[_TT_118]",
					"timestamps": {
						"from": "00:00:02,360",
						"to": "00:00:02,360"
					},
					"offsets": {
						"from": 2360,
						"to": 2360
					},
					"id": 50481,
					"p": 0.291442,
					"t_dtw": -1
				}
			]
		},
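The mismatch above can be checked mechanically. The following is a minimal sketch, assuming the JSON layout shown in this report (a segment with `offsets` and a `tokens` list, each token with its own `offsets`, all in milliseconds). The helper name `tokens_outside_segment` is made up for illustration and is not part of whisper.cpp.

```python
def tokens_outside_segment(segment):
    """Return the text of every token whose offsets fall outside its segment.

    Assumes the layout from the report: segment["offsets"] and each
    token["offsets"] hold "from"/"to" values in milliseconds.
    """
    seg_from = segment["offsets"]["from"]
    seg_to = segment["offsets"]["to"]
    outside = []
    for token in segment.get("tokens", []):
        offsets = token["offsets"]
        if offsets["from"] < seg_from or offsets["to"] > seg_to:
            outside.append(token["text"])
    return outside


# Two tokens copied from the report, abbreviated to the fields that matter here.
segment = {
    "offsets": {"from": 4480, "to": 7860},
    "tokens": [
        {"text": " I", "offsets": {"from": 70, "to": 70}},
        {"text": " want", "offsets": {"from": 130, "to": 370}},
    ],
}

print(tokens_outside_segment(segment))  # [' I', ' want']
```

To run it against the real output, load the file with `json.load` and call the helper on each entry of `"transcription"`.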
danbev (Collaborator) commented May 20, 2025

The token timestamps are currently reported relative to the VAD-processed audio, when they should be mapped back to timestamps in the original input audio. I'll open a pull request to handle this situation. Thanks for reporting this and bringing it to our attention!

We have also opened #3173, which is slightly related to this; mentioning it just so that you are aware of that issue.
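The mapping described in the comment above can be sketched roughly as follows. This is not the whisper.cpp implementation: it assumes VAD yields a list of speech intervals in the original audio and that the processed audio is simply those intervals laid end to end with nothing in between. `SpeechSpan`, `build_spans`, and `to_original_ms` are hypothetical names, and the intervals in the example are made up, with the first loosely based on the segment offsets in the report.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechSpan:
    orig_start_ms: int  # start of the speech interval in the original audio
    orig_end_ms: int    # end of the speech interval in the original audio
    vad_start_ms: int   # start of the same interval in the VAD-processed audio


def build_spans(speech_intervals_ms):
    """Lay the original speech intervals end to end (assumed VAD behaviour)."""
    spans = []
    cursor = 0
    for start, end in speech_intervals_ms:
        spans.append(SpeechSpan(start, end, cursor))
        cursor += end - start
    return spans


def to_original_ms(spans, t_vad_ms):
    """Map a time in the VAD-processed audio back to the original audio."""
    starts = [span.vad_start_ms for span in spans]
    # Last span that starts at or before t_vad_ms (first span if none does).
    index = max(bisect_right(starts, t_vad_ms) - 1, 0)
    span = spans[index]
    mapped = span.orig_start_ms + (t_vad_ms - span.vad_start_ms)
    # Clamp so a time past the final span cannot run beyond its end.
    return min(mapped, span.orig_end_ms)


# Hypothetical speech intervals, in milliseconds of the original audio.
spans = build_spans([(4480, 7860), (9000, 10000)])

# Token " want" from the report: 130..370 ms in processed time.
print(to_original_ms(spans, 130), to_original_ms(spans, 370))  # 4610 4850
```

Under these assumptions a token offset is shifted by the start of the speech interval it falls in, so token times land inside the segment's own `from`/`to` range. If the real pipeline inserts padding between intervals, `vad_start_ms` would need to account for it.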

danbev self-assigned this May 20, 2025