Description
It's possible to fine-tune Whisper models so that they tolerate a reduced audio_ctx, without affecting their knowledge too much.
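For context, Whisper's encoder uses 1500 positions for its 30-second window, so each position covers about 20 ms of audio. The sketch below shows one way a caller might pick an audio_ctx value from a clip's duration. The helper name `audio_ctx_for` and the `margin` parameter are hypothetical, and this is not necessarily how the linked repository chooses the value.

```python
import math

FULL_AUDIO_CTX = 1500   # encoder positions for Whisper's full window
FULL_WINDOW_S = 30.0    # length of that window in seconds


def audio_ctx_for(duration_s: float, margin: int = 0) -> int:
    """Hypothetical helper: estimate an audio_ctx that covers duration_s.

    margin adds extra positions as a safety buffer (an assumption, not
    something the original post specifies).
    """
    frames = math.ceil(duration_s * FULL_AUDIO_CTX / FULL_WINDOW_S) + margin
    # Clamp to the range the encoder can actually use.
    return max(1, min(FULL_AUDIO_CTX, frames))


print(audio_ctx_for(10.0))   # 500
print(audio_ctx_for(45.0))   # 1500 (clamped to the full window)
```

Under this mapping, the `-ac 500` used in the examples below corresponds to roughly 10 seconds of audio.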
Example with default settings, stock model first and fine-tuned model second (notice the ~3x speed difference):
$ ./main -m ggml-model-whisper-tiny.en-q8_0.bin -f samples/jfk.wav -ac 500
[...]
[00:00:00.000 --> 00:00:09.760] and so my fellow Americans ask not what your country can do for you ask what you can do for
[00:00:09.760 --> 00:00:10.760] You are a country.
whisper_print_timings: load time = 47.05 ms
whisper_print_timings: fallbacks = 0 p / 1 h
whisper_print_timings: mel time = 17.20 ms
whisper_print_timings: sample time = 389.59 ms / 762 runs ( 0.51 ms per run)
whisper_print_timings: encode time = 191.74 ms / 2 runs ( 95.87 ms per run)
whisper_print_timings: decode time = 5.03 ms / 2 runs ( 2.51 ms per run)
whisper_print_timings: batchd time = 1040.05 ms / 752 runs ( 1.38 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 1699.19 ms
$ ./main -m tiny_en_acft_q8_0.bin -f samples/jfk.wav -ac 500
[...]
[00:00:00.000 --> 00:00:07.880] And so, my fellow Americans ask not what your country can do for you
[00:00:07.880 --> 00:00:09.880] ask what you can do for your...
[00:00:09.880 --> 00:00:10.880] country.
whisper_print_timings: load time = 60.26 ms
whisper_print_timings: fallbacks = 1 p / 0 h
whisper_print_timings: mel time = 15.26 ms
whisper_print_timings: sample time = 62.74 ms / 186 runs ( 0.34 ms per run)
whisper_print_timings: encode time = 208.25 ms / 2 runs ( 104.13 ms per run)
whisper_print_timings: decode time = 12.02 ms / 5 runs ( 2.40 ms per run)
whisper_print_timings: batchd time = 189.45 ms / 169 runs ( 1.12 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 556.35 ms
Example with greedy search and no timestamps (notice that the fine-tuned model doesn't repeat itself):
$ ./main -m ggml-model-whisper-tiny.en-q8_0.bin -f samples/jfk.wav -nt -ng -nf -bo 1 -bs 1 -ac 500
[...]
And so my fellow Americans ask not what your country can do for you. Ask what you can do for your country for you. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do
whisper_print_timings: load time = 41.61 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 13.48 ms
whisper_print_timings: sample time = 97.74 ms / 1 runs ( 97.74 ms per run)
whisper_print_timings: encode time = 114.27 ms / 1 runs ( 114.27 ms per run)
whisper_print_timings: decode time = 506.76 ms / 219 runs ( 2.31 ms per run)
whisper_print_timings: batchd time = 3.95 ms / 2 runs ( 1.98 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 783.24 ms
$ ./main -m ft3-quant/tiny_en_acft_q8_0.bin -f samples/jfk.wav -nt -ng -nf -bo 1 -bs 1 -ac 500
[...]
And so my fellow Americans ask not what your country can do for you, ask what you can do for your
whisper_print_timings: load time = 46.31 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 16.60 ms
whisper_print_timings: sample time = 9.33 ms / 1 runs ( 9.33 ms per run)
whisper_print_timings: encode time = 95.40 ms / 1 runs ( 95.40 ms per run)
whisper_print_timings: decode time = 47.55 ms / 22 runs ( 2.16 ms per run)
whisper_print_timings: batchd time = 3.45 ms / 2 runs ( 1.73 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 222.61 ms
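To make the comparison concrete, here is a minimal sketch that pulls the `total time` value out of `whisper_print_timings` output and computes the speedup for both pairs of runs above. The function names are illustrative, and the figures come from single runs, so treat the ratios as rough.

```python
import re

# Matches e.g. "whisper_print_timings: total time = 556.35 ms"
TOTAL_RE = re.compile(r"total time\s*=\s*([0-9.]+)\s*ms")


def total_time_ms(log: str) -> float:
    """Return the total time in milliseconds reported in a timings log."""
    match = TOTAL_RE.search(log)
    if match is None:
        raise ValueError("no 'total time' line found")
    return float(match.group(1))


def speedup(baseline_log: str, tuned_log: str) -> float:
    """Ratio of baseline total time to fine-tuned total time."""
    return total_time_ms(baseline_log) / total_time_ms(tuned_log)


# Total-time lines copied from the four runs above.
beam = speedup(
    "whisper_print_timings: total time = 1699.19 ms",
    "whisper_print_timings: total time = 556.35 ms",
)
greedy = speedup(
    "whisper_print_timings: total time = 783.24 ms",
    "whisper_print_timings: total time = 222.61 ms",
)

print(f"default settings: {beam:.2f}x")    # 3.05x
print(f"greedy search:    {greedy:.2f}x")  # 3.52x
```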
Models and method are available here: https://github.com/futo-org/whisper-acft
Feedback and comments are welcome! The fine-tuning method probably isn't perfect: it may need fewer epochs, more data, or less aggressive random subtraction from the context. It still produces good results, though.
Related to #137, but I opened a new issue to discuss this specific method.
(Edit: The original results were from an older version of whisper.cpp, which showed a 10x speed difference with default beam search. I have updated the results to a56f435; the speed difference is no longer as significant, but it is still there.)