Finetuning models for audio_ctx support

It's possible to fine-tune models so that they can run with a reduced audio_ctx (the `-ac` flag in the commands below) without affecting their knowledge too much.
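As a rough sketch of what a given audio_ctx value means in seconds: the snippet below assumes Whisper's encoder spreads 1500 positions over its 30-second window (50 positions per second). The helper names `estimate_audio_ctx` and `covered_seconds` and the `margin` parameter are hypothetical, made up for illustration, and not part of whisper.cpp.

```python
import math

# Assumption: the encoder uses 1500 positions for a 30-second window,
# i.e. 50 positions per second of audio.
FULL_AUDIO_CTX = 1500
WINDOW_SECONDS = 30.0


def covered_seconds(audio_ctx: int) -> float:
    """Seconds of audio that a given audio_ctx spans (hypothetical helper)."""
    return audio_ctx * WINDOW_SECONDS / FULL_AUDIO_CTX


def estimate_audio_ctx(duration_s: float, margin: int = 0) -> int:
    """Smallest audio_ctx that spans duration_s seconds (hypothetical helper).

    margin adds extra positions as headroom; the result is capped at the
    full context size.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    positions_per_second = FULL_AUDIO_CTX / WINDOW_SECONDS
    needed = math.ceil(duration_s * positions_per_second) + margin
    return min(FULL_AUDIO_CTX, needed)


if __name__ == "__main__":
    print(covered_seconds(500))      # span of the -ac 500 setting, in seconds
    print(estimate_audio_ctx(11.0))  # positions needed for an 11-second clip
```

Under that assumption, `-ac 500` spans about 10 seconds, which is slightly shorter than the last timestamps printed in the examples below.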

Example with default settings, stock model first and fine-tuned model second (notice the ~3x difference in total time):
```
$ ./main -m ggml-model-whisper-tiny.en-q8_0.bin -f samples/jfk.wav -ac 500
[...]
[00:00:00.000 --> 00:00:09.760]   and so my fellow Americans ask not what your country can do for you ask what you can do for
[00:00:09.760 --> 00:00:10.760]   You are a country.


whisper_print_timings:     load time =    47.05 ms
whisper_print_timings:     fallbacks =   0 p /   1 h
whisper_print_timings:      mel time =    17.20 ms
whisper_print_timings:   sample time =   389.59 ms /   762 runs (    0.51 ms per run)
whisper_print_timings:   encode time =   191.74 ms /     2 runs (   95.87 ms per run)
whisper_print_timings:   decode time =     5.03 ms /     2 runs (    2.51 ms per run)
whisper_print_timings:   batchd time =  1040.05 ms /   752 runs (    1.38 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1699.19 ms
```

```
$ ./main -m tiny_en_acft_q8_0.bin -f samples/jfk.wav -ac 500
[...]
[00:00:00.000 --> 00:00:07.880]   And so, my fellow Americans ask not what your country can do for you
[00:00:07.880 --> 00:00:09.880]   ask what you can do for your...
[00:00:09.880 --> 00:00:10.880]   country.


whisper_print_timings:     load time =    60.26 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =    15.26 ms
whisper_print_timings:   sample time =    62.74 ms /   186 runs (    0.34 ms per run)
whisper_print_timings:   encode time =   208.25 ms /     2 runs (  104.13 ms per run)
whisper_print_timings:   decode time =    12.02 ms /     5 runs (    2.40 ms per run)
whisper_print_timings:   batchd time =   189.45 ms /   169 runs (    1.12 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   556.35 ms
```



Example with greedy search and no timestamps (notice that the fine-tuned model doesn't repeat itself):
```
$ ./main -m ggml-model-whisper-tiny.en-q8_0.bin -f samples/jfk.wav -nt -ng -nf -bo 1 -bs 1 -ac 500
[...]
 And so my fellow Americans ask not what your country can do for you. Ask what you can do for your country for you. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do

whisper_print_timings:     load time =    41.61 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    13.48 ms
whisper_print_timings:   sample time =    97.74 ms /     1 runs (   97.74 ms per run)
whisper_print_timings:   encode time =   114.27 ms /     1 runs (  114.27 ms per run)
whisper_print_timings:   decode time =   506.76 ms /   219 runs (    2.31 ms per run)
whisper_print_timings:   batchd time =     3.95 ms /     2 runs (    1.98 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   783.24 ms
```

```
$ ./main -m ft3-quant/tiny_en_acft_q8_0.bin -f samples/jfk.wav -nt -ng -nf -bo 1 -bs 1 -ac 500
[...]
 And so my fellow Americans ask not what your country can do for you, ask what you can do for your

whisper_print_timings:     load time =    46.31 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    16.60 ms
whisper_print_timings:   sample time =     9.33 ms /     1 runs (    9.33 ms per run)
whisper_print_timings:   encode time =    95.40 ms /     1 runs (   95.40 ms per run)
whisper_print_timings:   decode time =    47.55 ms /    22 runs (    2.16 ms per run)
whisper_print_timings:   batchd time =     3.45 ms /     2 runs (    1.73 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   222.61 ms
```
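To make the speed comparison concrete, here is a small sketch that pulls the `total time` value out of `whisper_print_timings` output and computes the ratio between the stock and fine-tuned runs. The log lines are copied from the four runs above; the helper names `total_time_ms` and `speedup` are hypothetical.

```python
import re


def total_time_ms(log: str) -> float:
    """Return the 'total time' value in ms from whisper_print_timings output."""
    match = re.search(r"total time\s*=\s*([\d.]+)\s*ms", log)
    if match is None:
        raise ValueError("no 'total time' line found")
    return float(match.group(1))


def speedup(stock_log: str, tuned_log: str) -> float:
    """How many times faster the fine-tuned run finished than the stock run."""
    return total_time_ms(stock_log) / total_time_ms(tuned_log)


# 'total time' lines copied verbatim from the four runs above.
DEFAULT_STOCK = "whisper_print_timings:    total time =  1699.19 ms"
DEFAULT_TUNED = "whisper_print_timings:    total time =   556.35 ms"
GREEDY_STOCK = "whisper_print_timings:    total time =   783.24 ms"
GREEDY_TUNED = "whisper_print_timings:    total time =   222.61 ms"

if __name__ == "__main__":
    print(f"default settings: {speedup(DEFAULT_STOCK, DEFAULT_TUNED):.2f}x")
    print(f"greedy search:    {speedup(GREEDY_STOCK, GREEDY_TUNED):.2f}x")
```

These are single runs on one sample, so the ratios are only indicative.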

Models and method are available here: https://github.com/futo-org/whisper-acft

Feedback and comments are welcome! The finetuning method probably isn't perfect: it may need fewer epochs, more data, or less aggressive random subtraction from the context. It still produces good results, though.

This is related to #137, but I opened a new issue to discuss this specific method.

(Edit: The original results came from an older version of whisper.cpp, which showed a 10x speed difference with default beam search. I have rerun them on a56f435fd475afd7edf02bfbf9f8c77f527198c2; the speed difference is no longer as significant, but it is still there.)

Finetuning models for audio_ctx support #1951