@@ -115,6 +115,7 @@ options:
   -lpt N,    --logprob-thold N   [-1.00  ] log probability threshold for decoder fail
   -su,       --speed-up          [false  ] speed up audio by x2 (reduced accuracy)
   -tr,       --translate         [false  ] translate from source language to english
+  -tdrz,     --tinydiarize       [false  ] enable tinydiarize (requires a tdrz model)
   -di,       --diarize           [false  ] stereo audio diarization
   -nf,       --no-fallback       [false  ] do not use temperature fallback while decoding
   -otxt,     --output-txt        [false  ] output result in a text file
@@ -493,7 +494,7 @@ main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 pr
 [00:00:10.020 --> 00:00:11.000] country.
 ```
 
-## Word-level timestamp
+## Word-level timestamp (experimental)
 
 The `--max-len` argument can be used to obtain word-level timestamps. Simply use `-ml 1`:
 
@@ -534,6 +535,32 @@ main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 pr
 [00:00:10.510 --> 00:00:11.000] .
 ```
 
+## Speaker segmentation via tinydiarize (experimental)
+
+More information about this approach is available here: https://github.com/ggerganov/whisper.cpp/pull/1058
+
+Sample usage:
+
+```py
+# download a tinydiarize compatible model
+./models/download-ggml-model.sh small.en-tdrz
+
+# run as usual, adding the "-tdrz" command-line argument
+./main -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin -tdrz
+...
+main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, tdrz = 1, timestamps = 1 ...
+...
+[00:00:00.000 --> 00:00:03.800] Okay Houston, we've had a problem here. [SPEAKER_TURN]
+[00:00:03.800 --> 00:00:06.200] This is Houston. Say again please. [SPEAKER_TURN]
+[00:00:06.200 --> 00:00:08.260] Uh Houston we've had a problem.
+[00:00:08.260 --> 00:00:11.320] We've had a main beam up on a volt. [SPEAKER_TURN]
+[00:00:11.320 --> 00:00:13.820] Roger main beam interval. [SPEAKER_TURN]
+[00:00:13.820 --> 00:00:15.100] Uh uh [SPEAKER_TURN]
+[00:00:15.100 --> 00:00:18.020] So okay stand, by thirteen we're looking at it. [SPEAKER_TURN]
+[00:00:18.020 --> 00:00:25.740] Okay uh right now uh Houston the uh voltage is uh is looking good um.
+[00:00:27.620 --> 00:00:29.940] And we had a a pretty large bank or so.
+```
+
 ## Karaoke-style movie generation (experimental)
 
 The [main](examples/main) example provides support for output of karaoke-style movies, where the
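
The tinydiarize sample output in this diff marks the end of a speaker turn with a trailing `[SPEAKER_TURN]` token; it shows where the speaker changes, not who is speaking. As a rough illustration of how that output might be post-processed, the sketch below merges consecutive segments into one entry per turn. It is not part of whisper.cpp: `group_by_turn`, `SEGMENT_RE`, and `SAMPLE` are hypothetical names, and the line format is assumed to match the sample output shown in the diff.

```python
import re

# Assumed line format, taken from the sample output in the diff:
# "[HH:MM:SS.mmm --> HH:MM:SS.mmm] text [SPEAKER_TURN]"
SEGMENT_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$"
)
TURN_MARKER = "[SPEAKER_TURN]"


def group_by_turn(lines):
    """Merge transcript segments into (start, end, text) tuples, one per turn.

    A turn is closed by any segment whose text ends with TURN_MARKER.
    Lines that are not transcript segments (log output, "...") are skipped.
    """
    turns = []
    start = end = None
    parts = []
    for line in lines:
        match = SEGMENT_RE.match(line.strip())
        if match is None:
            continue
        seg_start, seg_end, text = match.groups()
        text = text.strip()
        ends_turn = text.endswith(TURN_MARKER)
        if ends_turn:
            text = text[: -len(TURN_MARKER)].strip()
        if start is None:
            start = seg_start  # first segment of a new turn
        end = seg_end
        if text:
            parts.append(text)
        if ends_turn:
            turns.append((start, end, " ".join(parts)))
            start, parts = None, []
    if start is not None:
        # Trailing segments with no closing marker form a final, open turn.
        turns.append((start, end, " ".join(parts)))
    return turns


# A few lines copied from the sample output in the diff.
SAMPLE = [
    "main: processing './samples/a13.wav' (480000 samples, 30.0 sec) ...",
    "...",
    "[00:00:00.000 --> 00:00:03.800] Okay Houston, we've had a problem here. [SPEAKER_TURN]",
    "[00:00:03.800 --> 00:00:06.200] This is Houston. Say again please. [SPEAKER_TURN]",
    "[00:00:06.200 --> 00:00:08.260] Uh Houston we've had a problem.",
    "[00:00:08.260 --> 00:00:11.320] We've had a main beam up on a volt. [SPEAKER_TURN]",
]

turns = group_by_turn(SAMPLE)
for index, (start, end, text) in enumerate(turns, 1):
    print(f"turn {index}: [{start} --> {end}] {text}")
```

Because the marker carries no speaker identity, the sketch numbers turns rather than labelling speakers; mapping turns to named speakers would need additional information that the output does not provide.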