
Finetuning models for audio_ctx support #1951

@abb128

It's possible to fine-tune models so that audio_ctx can be reduced much more freely, without affecting their knowledge too much.
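For background, the -ac flag shrinks the encoder context from the default 1500 states (the full 30 s window), which speeds up encoding but is something the stock models were never trained for. Below is a minimal sketch of the training-time idea, assuming PyTorch and the openai-whisper model layout (an AudioEncoder with conv1/conv2 and a positional_embedding buffer). The helper name encode_with_ctx is illustrative and not from the whisper-acft repo:

```python
# Sketch only: assumes the openai-whisper package and PyTorch.
import random

import torch.nn.functional as F
import whisper

model = whisper.load_model("tiny.en")

def encode_with_ctx(model, mel, audio_ctx):
    """Encode a mel spectrogram with a reduced audio context.

    conv2 has stride 2, so audio_ctx encoder states consume
    2 * audio_ctx mel frames; only the first audio_ctx positional
    embeddings are added, mirroring what -ac does at inference.
    """
    enc = model.encoder
    mel = mel[:, :, : audio_ctx * 2]          # truncate the 30 s window
    x = F.gelu(enc.conv1(mel))
    x = F.gelu(enc.conv2(x))
    x = x.permute(0, 2, 1)                    # (batch, audio_ctx, n_state)
    x = x + enc.positional_embedding[: x.shape[1]]
    for block in enc.blocks:
        x = block(x)
    return enc.ln_post(x)
```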

Example with default settings (notice the ~3x speed difference):

$ ./main -m ggml-model-whisper-tiny.en-q8_0.bin -f samples/jfk.wav -ac 500
[...]
[00:00:00.000 --> 00:00:09.760]   and so my fellow Americans ask not what your country can do for you ask what you can do for
[00:00:09.760 --> 00:00:10.760]   You are a country.


whisper_print_timings:     load time =    47.05 ms
whisper_print_timings:     fallbacks =   0 p /   1 h
whisper_print_timings:      mel time =    17.20 ms
whisper_print_timings:   sample time =   389.59 ms /   762 runs (    0.51 ms per run)
whisper_print_timings:   encode time =   191.74 ms /     2 runs (   95.87 ms per run)
whisper_print_timings:   decode time =     5.03 ms /     2 runs (    2.51 ms per run)
whisper_print_timings:   batchd time =  1040.05 ms /   752 runs (    1.38 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1699.19 ms
$ ./main -m tiny_en_acft_q8_0.bin -f samples/jfk.wav -ac 500
[...]
[00:00:00.000 --> 00:00:07.880]   And so, my fellow Americans ask not what your country can do for you
[00:00:07.880 --> 00:00:09.880]   ask what you can do for your...
[00:00:09.880 --> 00:00:10.880]   country.


whisper_print_timings:     load time =    60.26 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =    15.26 ms
whisper_print_timings:   sample time =    62.74 ms /   186 runs (    0.34 ms per run)
whisper_print_timings:   encode time =   208.25 ms /     2 runs (  104.13 ms per run)
whisper_print_timings:   decode time =    12.02 ms /     5 runs (    2.40 ms per run)
whisper_print_timings:   batchd time =   189.45 ms /   169 runs (    1.12 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   556.35 ms

Example with greedy search and no timestamps (notice the fine-tuned model doesn't repeat itself):

$ ./main -m ggml-model-whisper-tiny.en-q8_0.bin -f samples/jfk.wav -nt -ng -nf -bo 1 -bs 1 -ac 500
[...]
 And so my fellow Americans ask not what your country can do for you. Ask what you can do for your country for you. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do

whisper_print_timings:     load time =    41.61 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    13.48 ms
whisper_print_timings:   sample time =    97.74 ms /     1 runs (   97.74 ms per run)
whisper_print_timings:   encode time =   114.27 ms /     1 runs (  114.27 ms per run)
whisper_print_timings:   decode time =   506.76 ms /   219 runs (    2.31 ms per run)
whisper_print_timings:   batchd time =     3.95 ms /     2 runs (    1.98 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   783.24 ms
$ ./main -m ft3-quant/tiny_en_acft_q8_0.bin -f samples/jfk.wav -nt -ng -nf -bo 1 -bs 1 -ac 500
[...]
 And so my fellow Americans ask not what your country can do for you, ask what you can do for your

whisper_print_timings:     load time =    46.31 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    16.60 ms
whisper_print_timings:   sample time =     9.33 ms /     1 runs (    9.33 ms per run)
whisper_print_timings:   encode time =    95.40 ms /     1 runs (   95.40 ms per run)
whisper_print_timings:   decode time =    47.55 ms /    22 runs (    2.16 ms per run)
whisper_print_timings:   batchd time =     3.45 ms /     2 runs (    1.73 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   222.61 ms

Models and method are available here: https://github.com/futo-org/whisper-acft

Feedback and comments are welcome! The finetuning method probably isn't perfect; it may need fewer epochs, more data, or less aggressive random subtraction from the audio context, but it already produces good results.
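To make the "randomly subtracting from context" point concrete, here is a hedged sketch of one training step, reusing encode_with_ctx from the sketch above. The plain cross-entropy objective, the sampling range, and the batch layout are all assumptions for illustration; the actual whisper-acft objective and schedule live in the repository linked above:

```python
def train_step(model, mel, tokens, optimizer):
    # Sample a context per step so the model sees many encoder lengths;
    # subtracting too aggressively may hurt, so this (guessed) lower
    # bound stays well above zero.
    audio_ctx = random.randint(500, 1500)
    audio = encode_with_ctx(model, mel, audio_ctx)
    # Teacher-forced decoding: predict token t+1 from tokens <= t.
    logits = model.decoder(tokens[:, :-1], audio)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]), tokens[:, 1:].reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```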

Related to #137, but I opened a new issue to discuss this specific method.

(Edit: the original results were from an older version of whisper.cpp, which showed a 10x speed difference with default beam search. I have updated the results to a56f435; the speed difference is no longer as significant, but it is still there.)
