
Quantization produces invalid decoder.token_embedding.weight in resulting file #2906

Open
shyperson0 opened this issue Mar 19, 2025 · 4 comments

@shyperson0

shyperson0 commented Mar 19, 2025

After quantizing with q3_k, the resulting model is unusable. Quantize runs without errors, but it appears to write some wrong metadata to the model (?):
whisper_model_load: tensor 'decoder.token_embedding.weight' has wrong size in model file: got 5705095, expected 2190735360
Tested on latest master e27fd6f
Relevant logs below.

> quantize models/ggml-tiny.en.bin models/ggml-tiny.en.q3_k 11

whisper_model_quantize: n_vocab       = 51864
whisper_model_quantize: n_audio_ctx   = 1500
whisper_model_quantize: n_audio_state = 384
whisper_model_quantize: n_audio_head  = 6
whisper_model_quantize: n_audio_layer = 4
whisper_model_quantize: n_text_ctx    = 448
whisper_model_quantize: n_text_state  = 384
whisper_model_quantize: n_text_head   = 6
whisper_model_quantize: n_text_layer  = 4
whisper_model_quantize: n_mels        = 80
whisper_model_quantize: ftype (src)   = 1
whisper_model_quantize: qntvr (src)   = 0
whisper_model_quantize: ftype (dst)   = 2011
whisper_model_quantize: qntvr (dst)   = 2
whisper_model_quantize: loading model from 'models/ggml-tiny.en.bin'
                                    decoder.positional_embedding - [  384,   448,     1], type =    f32 size =    0.656 MB
                                    encoder.positional_embedding - [  384,  1500,     1], type =    f32 size =    2.197 MB
                                  decoder.token_embedding.weight - [  384, 51864,     1], type =    f16 size =    75.97 MB ->     8.16 MB
                                  decoder.blocks.0.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    decoder.blocks.0.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.0.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.0.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   decoder.blocks.0.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.0.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 decoder.blocks.0.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.0.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              decoder.blocks.0.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.0.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.0.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              decoder.blocks.0.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.0.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.0.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  decoder.blocks.0.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                           decoder.blocks.0.cross_attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                             decoder.blocks.0.cross_attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                        decoder.blocks.0.cross_attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.0.cross_attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.0.cross_attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                        decoder.blocks.0.cross_attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.0.cross_attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.0.cross_attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                            decoder.blocks.0.cross_attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  decoder.blocks.1.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    decoder.blocks.1.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.1.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.1.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   decoder.blocks.1.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.1.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 decoder.blocks.1.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.1.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              decoder.blocks.1.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.1.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.1.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              decoder.blocks.1.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.1.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.1.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  decoder.blocks.1.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                           decoder.blocks.1.cross_attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                             decoder.blocks.1.cross_attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                        decoder.blocks.1.cross_attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.1.cross_attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.1.cross_attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                        decoder.blocks.1.cross_attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.1.cross_attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.1.cross_attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                            decoder.blocks.1.cross_attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  decoder.blocks.2.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    decoder.blocks.2.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.2.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.2.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   decoder.blocks.2.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.2.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 decoder.blocks.2.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.2.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              decoder.blocks.2.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.2.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.2.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              decoder.blocks.2.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.2.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.2.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  decoder.blocks.2.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                           decoder.blocks.2.cross_attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                             decoder.blocks.2.cross_attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                        decoder.blocks.2.cross_attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.2.cross_attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.2.cross_attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                        decoder.blocks.2.cross_attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.2.cross_attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.2.cross_attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                            decoder.blocks.2.cross_attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  decoder.blocks.3.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    decoder.blocks.3.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.3.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.3.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   decoder.blocks.3.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.3.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 decoder.blocks.3.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.3.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              decoder.blocks.3.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.3.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.3.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              decoder.blocks.3.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.3.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.3.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  decoder.blocks.3.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                           decoder.blocks.3.cross_attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                             decoder.blocks.3.cross_attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                        decoder.blocks.3.cross_attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.3.cross_attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.3.cross_attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                        decoder.blocks.3.cross_attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.3.cross_attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.3.cross_attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                            decoder.blocks.3.cross_attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                               decoder.ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                                 decoder.ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                            encoder.conv1.weight - [    3,    80,   384], type =    f16 size =    0.176 MB
                                              encoder.conv1.bias - [    1,   384,     1], type =    f32 size =    0.001 MB
                                            encoder.conv2.weight - [    3,   384,   384], type =    f16 size =    0.844 MB
                                              encoder.conv2.bias - [    1,   384,     1], type =    f32 size =    0.001 MB
                                  encoder.blocks.0.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    encoder.blocks.0.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.0.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.0.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   encoder.blocks.0.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.0.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 encoder.blocks.0.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.0.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              encoder.blocks.0.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.0.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.0.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              encoder.blocks.0.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.0.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.0.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  encoder.blocks.0.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  encoder.blocks.1.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    encoder.blocks.1.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.1.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.1.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   encoder.blocks.1.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.1.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 encoder.blocks.1.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.1.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              encoder.blocks.1.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.1.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.1.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              encoder.blocks.1.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.1.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.1.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  encoder.blocks.1.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  encoder.blocks.2.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    encoder.blocks.2.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.2.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.2.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   encoder.blocks.2.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.2.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 encoder.blocks.2.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.2.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              encoder.blocks.2.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.2.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.2.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              encoder.blocks.2.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.2.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.2.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  encoder.blocks.2.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  encoder.blocks.3.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    encoder.blocks.3.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.3.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.3.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   encoder.blocks.3.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.3.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 encoder.blocks.3.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.3.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              encoder.blocks.3.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.3.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.3.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              encoder.blocks.3.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.3.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.3.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  encoder.blocks.3.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                          encoder.ln_post.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                            encoder.ln_post.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
ggml_common_quantize_0: model size  =   144.04 MB
ggml_common_quantize_0: quant size  =    18.98 MB | ftype = 11 (q3_K)

main: quantize time =  1125.16 ms
main:    total time =  1125.16 ms

> whisper-cli -f samples/jfk.wav -m models/ggml-tiny.en.q3_k

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-tiny.en.q3_k'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_init_with_params_no_state: devices    = 1
whisper_init_with_params_no_state: backends   = 1
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 11
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 1 (tiny)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =    15.36 MB
whisper_model_load: tensor 'decoder.token_embedding.weight' has wrong size in model file: got 5705095, expected 2190735360
whisper_init_with_params_no_state: failed to load model
error: failed to initialize whisper context

edit: fixed formatting :3

@fujimotos
Contributor

fujimotos commented Mar 20, 2025

After quantizing with q3_k, the resulting model is unusable.
Quantize runs without errors, but it appears to write some wrong metadata to the model (?).

I suspect the super-block size is the likely cause. Look at this log line:

decoder.token_embedding.weight - [  384, 51864,     1], type =    f16 size =    75.97 MB ->     8.16 MB

Note that the first dimension length is 384. Since q3_k uses a super-block
size of 256, it cannot handle tensors whose first dimension is not
divisible by 256.
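
A minimal sketch of that constraint (assuming QK_K = 256, ggml's k-quant super-block size; the check itself is plain divisibility):

#include <stdio.h>

#define QK_K 256   /* ggml's k-quant super-block size */

int main(void) {
    const int ne0 = 384; /* first dimension of tiny's decoder.token_embedding.weight */
    printf("%d %% %d = %d -> %s\n", ne0, QK_K, ne0 % QK_K,
           ne0 % QK_K == 0 ? "OK for q3_K" : "not quantizable with q3_K");
    return 0;
}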

I believe this is why llama.cpp introduced the LLAMA_QKK_64 build flag
(back in 2023):

ggml-org/llama.cpp#2001

I think the simplest workaround at the moment is to use a model other than tiny.
For example, I confirmed that the base model works (its state size is 512, which is divisible by 256):

$ quantize models/ggml-base.en.bin models/ggml-base.q3_k.bin 11
$ whisper-cli -m models/ggml-base.q3_k.bin -f jfk.wav
...
[00:00:00.000 --> 00:00:08.000]   And so, my fellow Americans, ask not what your country can do for you.
[00:00:08.000 --> 00:00:11.000]   Ask what you can do for your country.
...

@opsec-ai

opsec-ai commented May 9, 2025

_K quantization is not working with CUDA in b6f3fa4

It generates a different error, though. None of the _K types are supported, but the _0 and _1 types work.

I dunno... if the _K types aren't supposed to work with CUDA, maybe mention that in the docs.

cd models
quantize ggml-base.en.bin ggml-base.q3_k.bin 11
../build/bin/whisper-cli -m ggml-base.q3_k.bin ../samples/jfk.wav

whisper.cpp/ggml/src/ggml-cuda/getrows.cu:195: ggml_cuda_get_rows_switch_src0_type:
unsupported src0 type: q3_K
... backtrace

@fujimotos
Contributor

I dunno... if the _K types aren't supposed to work with CUDA,

I think ggml-cuda does not support k-quants. Check out:

https://github.com/ggml-org/whisper.cpp/blob/master/ggml/src/ggml-cuda/getrows.cu#L160-L198

to see which quant types are supported.
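
As an illustration, here is a simplified sketch of the dispatch pattern (hypothetical names, not the actual getrows.cu source): an explicit switch over the source tensor type, aborting on anything that has no kernel, which currently includes the k-quants.

#include <stdio.h>
#include <stdlib.h>

typedef enum { TYPE_F16, TYPE_Q4_0, TYPE_Q5_0, TYPE_Q8_0, TYPE_Q3_K } src0_type;

static void get_rows(src0_type type, const char * name) {
    switch (type) {
        case TYPE_F16:
        case TYPE_Q4_0:
        case TYPE_Q5_0:
        case TYPE_Q8_0:
            printf("dispatching get_rows kernel for %s\n", name);
            break;
        default:
            /* mirrors the "unsupported src0 type" abort in the backtrace above */
            fprintf(stderr, "unsupported src0 type: %s\n", name);
            abort();
    }
}

int main(void) {
    get_rows(TYPE_Q8_0, "q8_0"); /* supported: runs fine */
    get_rows(TYPE_Q3_K, "q3_K"); /* unsupported: aborts like the report above */
    return 0;
}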

The fact is that not all GGML backends have reached feature parity,
which is, in my opinion, entirely natural for an actively developed project
like ggml.

@ggerganov
Member

The correct solution is to keep the embedding tensors (the ones from which we get rows) in pinned host memory, similar to how we do it in llama.cpp. This way it's faster and all quantizations will be supported.
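
For reference, a minimal sketch of the pinned-host-memory primitive via the plain CUDA runtime API (just the underlying mechanism, not the whisper.cpp/llama.cpp implementation): a page-locked host buffer that the GPU can read directly, so the embedding tensor can stay on the host in any quantization format while rows are fetched from it.

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    void * embd = NULL;
    const size_t size = 8u * 1024 * 1024; /* e.g. a quantized token-embedding tensor */

    /* cudaMallocHost allocates page-locked (pinned) host memory; async copies
     * from it skip the pageable staging buffer, and kernels can access it
     * without keeping a device-side copy of the tensor. */
    if (cudaMallocHost(&embd, size) != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost failed\n");
        return 1;
    }

    /* ... load the quantized embedding data into embd here ... */

    cudaFreeHost(embd);
    return 0;
}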
