
Quantization produces invalid decoder.token_embedding.weight in resulting file #2906

Open
shyperson0 opened this issue Mar 19, 2025 · 4 comments

@shyperson0

shyperson0 commented Mar 19, 2025

After quantizing with q3_k, the resulting model is unusable. Quantize runs without errors, but it appears to write some wrong metadata to the model (?):
whisper_model_load: tensor 'decoder.token_embedding.weight' has wrong size in model file: got 5705095, expected 2190735360
Tested on latest master e27fd6f
Relevant logs below.

> quantize models/ggml-tiny.en.bin models/ggml-tiny.en.q3_k 11

whisper_model_quantize: n_vocab       = 51864
whisper_model_quantize: n_audio_ctx   = 1500
whisper_model_quantize: n_audio_state = 384
whisper_model_quantize: n_audio_head  = 6
whisper_model_quantize: n_audio_layer = 4
whisper_model_quantize: n_text_ctx    = 448
whisper_model_quantize: n_text_state  = 384
whisper_model_quantize: n_text_head   = 6
whisper_model_quantize: n_text_layer  = 4
whisper_model_quantize: n_mels        = 80
whisper_model_quantize: ftype (src)   = 1
whisper_model_quantize: qntvr (src)   = 0
whisper_model_quantize: ftype (dst)   = 2011
whisper_model_quantize: qntvr (dst)   = 2
whisper_model_quantize: loading model from 'models/ggml-tiny.en.bin'
                                    decoder.positional_embedding - [  384,   448,     1], type =    f32 size =    0.656 MB
                                    encoder.positional_embedding - [  384,  1500,     1], type =    f32 size =    2.197 MB
                                  decoder.token_embedding.weight - [  384, 51864,     1], type =    f16 size =    75.97 MB ->     8.16 MB
                                  decoder.blocks.0.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    decoder.blocks.0.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.0.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.0.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   decoder.blocks.0.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.0.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 decoder.blocks.0.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.0.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              decoder.blocks.0.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.0.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.0.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              decoder.blocks.0.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.0.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.0.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  decoder.blocks.0.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                           decoder.blocks.0.cross_attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                             decoder.blocks.0.cross_attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                        decoder.blocks.0.cross_attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.0.cross_attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.0.cross_attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                        decoder.blocks.0.cross_attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.0.cross_attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.0.cross_attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                            decoder.blocks.0.cross_attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  decoder.blocks.1.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    decoder.blocks.1.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.1.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.1.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   decoder.blocks.1.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.1.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 decoder.blocks.1.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.1.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              decoder.blocks.1.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.1.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.1.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              decoder.blocks.1.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.1.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.1.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  decoder.blocks.1.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                           decoder.blocks.1.cross_attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                             decoder.blocks.1.cross_attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                        decoder.blocks.1.cross_attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.1.cross_attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.1.cross_attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                        decoder.blocks.1.cross_attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.1.cross_attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.1.cross_attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                            decoder.blocks.1.cross_attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  decoder.blocks.2.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    decoder.blocks.2.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.2.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.2.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   decoder.blocks.2.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.2.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 decoder.blocks.2.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.2.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              decoder.blocks.2.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.2.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.2.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              decoder.blocks.2.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.2.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.2.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  decoder.blocks.2.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                           decoder.blocks.2.cross_attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                             decoder.blocks.2.cross_attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                        decoder.blocks.2.cross_attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.2.cross_attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.2.cross_attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                        decoder.blocks.2.cross_attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.2.cross_attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.2.cross_attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                            decoder.blocks.2.cross_attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  decoder.blocks.3.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    decoder.blocks.3.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.3.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.3.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   decoder.blocks.3.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     decoder.blocks.3.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 decoder.blocks.3.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   decoder.blocks.3.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              decoder.blocks.3.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.3.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.3.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              decoder.blocks.3.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                decoder.blocks.3.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                decoder.blocks.3.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  decoder.blocks.3.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                           decoder.blocks.3.cross_attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                             decoder.blocks.3.cross_attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                        decoder.blocks.3.cross_attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.3.cross_attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.3.cross_attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                        decoder.blocks.3.cross_attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                          decoder.blocks.3.cross_attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                          decoder.blocks.3.cross_attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                            decoder.blocks.3.cross_attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                               decoder.ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                                 decoder.ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                            encoder.conv1.weight - [    3,    80,   384], type =    f16 size =    0.176 MB
                                              encoder.conv1.bias - [    1,   384,     1], type =    f32 size =    0.001 MB
                                            encoder.conv2.weight - [    3,   384,   384], type =    f16 size =    0.844 MB
                                              encoder.conv2.bias - [    1,   384,     1], type =    f32 size =    0.001 MB
                                  encoder.blocks.0.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    encoder.blocks.0.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.0.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.0.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   encoder.blocks.0.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.0.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 encoder.blocks.0.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.0.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              encoder.blocks.0.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.0.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.0.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              encoder.blocks.0.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.0.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.0.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  encoder.blocks.0.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  encoder.blocks.1.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    encoder.blocks.1.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.1.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.1.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   encoder.blocks.1.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.1.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 encoder.blocks.1.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.1.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              encoder.blocks.1.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.1.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.1.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              encoder.blocks.1.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.1.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.1.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  encoder.blocks.1.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  encoder.blocks.2.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    encoder.blocks.2.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.2.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.2.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   encoder.blocks.2.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.2.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 encoder.blocks.2.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.2.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              encoder.blocks.2.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.2.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.2.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              encoder.blocks.2.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.2.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.2.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  encoder.blocks.2.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                  encoder.blocks.3.mlp_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                    encoder.blocks.3.mlp_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.3.mlp.0.weight - [  384,  1536,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.3.mlp.0.bias - [ 1536,     1,     1], type =    f32 size =    0.006 MB
                                   encoder.blocks.3.mlp.2.weight - [ 1536,   384,     1], type =    f16 size =     2.25 MB ->     0.24 MB
                                     encoder.blocks.3.mlp.2.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                 encoder.blocks.3.attn_ln.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                   encoder.blocks.3.attn_ln.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                              encoder.blocks.3.attn.query.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.3.attn.query.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.3.attn.key.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                              encoder.blocks.3.attn.value.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                encoder.blocks.3.attn.value.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                encoder.blocks.3.attn.out.weight - [  384,   384,     1], type =    f16 size =     0.56 MB ->     0.06 MB
                                  encoder.blocks.3.attn.out.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
                                          encoder.ln_post.weight - [  384,     1,     1], type =    f32 size =    0.001 MB
                                            encoder.ln_post.bias - [  384,     1,     1], type =    f32 size =    0.001 MB
ggml_common_quantize_0: model size  =   144.04 MB
ggml_common_quantize_0: quant size  =    18.98 MB | ftype = 11 (q3_K)

main: quantize time =  1125.16 ms
main:    total time =  1125.16 ms

> whisper-cli -f samples/jfk.wav -m models/ggml-tiny.en.q3_k

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-tiny.en.q3_k'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_init_with_params_no_state: devices    = 1
whisper_init_with_params_no_state: backends   = 1
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 11
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 1 (tiny)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =    15.36 MB
whisper_model_load: tensor 'decoder.token_embedding.weight' has wrong size in model file: got 5705095, expected 2190735360
whisper_init_with_params_no_state: failed to load model
error: failed to initialize whisper context

edit: fixed formatting :3

@fujimotos
Contributor

fujimotos commented Mar 20, 2025

After quantizing with q3_k, the resulting model is unusable.
Quantize runs without errors, but it appears to write some wrong metadata to the model (?).

I suspect the super-block size is the likely cause. Look at this log line:

decoder.token_embedding.weight - [  384, 51864,     1], type =    f16 size =    75.97 MB ->     8.16 MB

Note that the first dimension length is 384. Since q3_k uses a super-block
size of 256, it cannot handle tensors whose first dimension is not
divisible by 256.
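
A minimal sketch of that constraint (assuming QK_K = 256, ggml's k-quant super-block size; the check itself is plain divisibility):

#include <stdio.h>

#define QK_K 256   /* ggml's k-quant super-block size */

int main(void) {
    const int ne0 = 384; /* first dimension of tiny's decoder.token_embedding.weight */
    printf("%d %% %d = %d -> %s\n", ne0, QK_K, ne0 % QK_K,
           ne0 % QK_K == 0 ? "OK for q3_K" : "not quantizable with q3_K");
    return 0;
}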

I believe this is why llama.cpp introduced the LLAMA_QKK_64 build flag
(back in 2023):

ggml-org/llama.cpp#2001

I think the simplest workaround at the moment is to use a model other than tiny.
For example, I confirmed that the base model works (its state size is 512, which is divisible by 256):

$ quantize models/ggml-base.en.bin models/ggml-base.q3_k.bin 11
$ whisper-cli -m models/ggml-base.q3_k.bin -f jfk.wav
...
[00:00:00.000 --> 00:00:08.000]   And so, my fellow Americans, ask not what your country can do for you.
[00:00:08.000 --> 00:00:11.000]   Ask what you can do for your country.
...

@opsec-ai

opsec-ai commented May 9, 2025

_K quantization is not working with CUDA in b6f3fa4

It generates a different error, though. None of the _K types are supported, but the _0 and _1 types work.

I dunno... if the _K types aren't supposed to work with CUDA, maybe mention that in the docs.

cd models
quantize ggml-base.en.bin ggml-base.q3_k.bin 11
../build/bin/whisper-cli -m ggml-base.q3_k.bin ../samples/jfk.wav

whisper.cpp/ggml/src/ggml-cuda/getrows.cu:195: ggml_cuda_get_rows_switch_src0_type:
unsupported src0 type: q3_K
... backtrace

@fujimotos
Contributor

I dunno... if the _K types aren't supposed to work with CUDA,

I think ggml-cuda does not support k-quants. Check out:

https://github.com/ggml-org/whisper.cpp/blob/master/ggml/src/ggml-cuda/getrows.cu#L160-L198

to see which quant types are supported.
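
As an illustration, here is a simplified sketch of the dispatch pattern (hypothetical names, not the actual getrows.cu source): an explicit switch over the source tensor type, aborting on anything that has no kernel, which currently includes the k-quants.

#include <stdio.h>
#include <stdlib.h>

typedef enum { TYPE_F16, TYPE_Q4_0, TYPE_Q5_0, TYPE_Q8_0, TYPE_Q3_K } src0_type;

static void get_rows(src0_type type, const char * name) {
    switch (type) {
        case TYPE_F16:
        case TYPE_Q4_0:
        case TYPE_Q5_0:
        case TYPE_Q8_0:
            printf("dispatching get_rows kernel for %s\n", name);
            break;
        default:
            /* mirrors the "unsupported src0 type" abort in the backtrace above */
            fprintf(stderr, "unsupported src0 type: %s\n", name);
            abort();
    }
}

int main(void) {
    get_rows(TYPE_Q8_0, "q8_0"); /* supported: runs fine */
    get_rows(TYPE_Q3_K, "q3_K"); /* unsupported: aborts like the report above */
    return 0;
}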

The fact is that not all GGML backends have reached feature parity,
which is, in my opinion, entirely natural for an actively developed project
like ggml.

@ggerganov
Member

The correct solution is to keep the embedding tensors (the ones from which we get rows) in pinned host memory, similar to how we do it in llama.cpp. This way it's faster and all quantizations will be supported.
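
For reference, a minimal sketch of the pinned-host-memory primitive via the plain CUDA runtime API (just the underlying mechanism, not the whisper.cpp/llama.cpp implementation): a page-locked host buffer that the GPU can read directly, so the embedding tensor can stay on the host in any quantization format while rows are fetched from it.

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    void * embd = NULL;
    const size_t size = 8u * 1024 * 1024; /* e.g. a quantized token-embedding tensor */

    /* cudaMallocHost allocates page-locked (pinned) host memory; async copies
     * from it skip the pageable staging buffer, and kernels can access it
     * without keeping a device-side copy of the tensor. */
    if (cudaMallocHost(&embd, size) != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost failed\n");
        return 1;
    }

    /* ... load the quantized embedding data into embd here ... */

    cudaFreeHost(embd);
    return 0;
}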
