forked from ggml-org/llama.cpp
Rebase #4
Open
abhilash1910 wants to merge 3,738 commits into abhilash1910:sycl_quant from ggml-org:master
Conversation
* vulkan: Handle updated FA dim2/3 definition
  Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
* handle null mask for gqa
* allow gqa with dim3>1
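As an aside on the push-constant packing mentioned in the first bullet above, a minimal C++ sketch of how a mask flag and n_head_log2 can share one 32-bit word might look like the following; the bit layout and function names are assumptions, not the actual shader interface:

```cpp
#include <cstdint>

// Hypothetical bit layout: bit 0 holds the "mask present" flag,
// bits 1..31 hold n_head_log2. Packing both into one dword saves
// push-constant space (the block must stay under 128 bytes).
static uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    return (n_head_log2 << 1) | (has_mask ? 1u : 0u);
}

static void unpack_mask_n_head_log2(uint32_t packed, bool & has_mask, uint32_t & n_head_log2) {
    has_mask    = (packed & 1u) != 0;
    n_head_log2 =  packed >> 1;
}
```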
The fused operation was grabbing the epsilon value from the wrong place. Add an env var to disable fusion. Add some missing checks for supported shapes/types. Handle fused rms_norm+mul in check_results.
Commit taken from remyoudompheng's PR #12260

Co-authored-by: Rémy Oudompheng <[email protected]>
Signed-off-by: Xiaodong Ye <[email protected]>
* cuda : fix rope non-cont ggml-ci
* cont : fix multi-rope + add test ggml-ci
* sycl : try fix ggml-ci
* cont : fix sycl + clean-up cuda ggml-ci
* model : add hunyuan moe
* tokenizer ok
* fix tensor name
* cgraph init
* chat template
* wip
* almost working
* skip embed, fix bos
* cleanup
* yarn scaling
* cleanup
* correct rope type
* failed token fix
* ntk alpha freq_base
* tokenization working
* cleanup and pr changes
* vocab_size sanity check
* ntk alpha generic
* Update convert_hf_to_gguf.py
* Apply suggestions from code review
* fix regression
* fix style

---------

Co-authored-by: kooshi <[email protected]>
* Add server_prefix
* Correct server path env
* Rename cli flag to --api-prefix
* Change all to api_prefix
Splits producing more than one ubatch per batch for recurrent models were broken with #14512. This fixes it by moving the completeness check after the ubatch split loop.
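A minimal, self-contained sketch of the control-flow change described above, with all names hypothetical: the completeness check runs once after the split loop, so batches that legitimately split into several ubatches are not rejected.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct ubatch { size_t n_tokens; };

// Split a batch of n_tokens into ubatches of at most n_ubatch tokens each.
// The check that the whole batch was consumed happens after the loop.
static bool split_batch(size_t n_tokens, size_t n_ubatch, std::vector<ubatch> & out) {
    size_t consumed = 0;
    while (consumed < n_tokens) {                    // may iterate more than once per batch
        const size_t n = std::min(n_ubatch, n_tokens - consumed);
        out.push_back({n});
        consumed += n;
    }
    return consumed == n_tokens;                     // completeness check, outside the loop
}
```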
* Init - first pass.
* Model -> ModelBase.
* fix errors in conversion.
* Update the graph.
* up.
* up.
* wip
* cgraph ok
* rm redundant code

---------

Co-authored-by: Vaibhavs10 <[email protected]>
Signed-off-by: stevenkuang <[email protected]>
* vulkan: allow FA split_k with smaller KV values
* vulkan: spread split_k_reduce work across more threads
  k_num can get rather large. Use the whole workgroup to reduce the M/L values. Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).
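For context on what the split_k reduce computes, here is a hedged CPU-side reference of the standard log-sum-exp combine over k_num splits; this is only the math, not the Vulkan shader, and the names are made up. In the shader, per the commit message above, the whole workgroup participates and one thread handles each element of the HSV dimension.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// m[k]   : running max of the attention scores seen by split k
// l[k]   : running sum of exp(score - m[k]) for split k
// acc[k] : unnormalized output accumulator of split k, length = HSV
static std::vector<float> split_k_reduce_ref(const std::vector<float> & m,
                                             const std::vector<float> & l,
                                             const std::vector<std::vector<float>> & acc) {
    const size_t k_num = m.size();
    const size_t hsv   = acc[0].size();

    // global max across all splits
    float m_all = -INFINITY;
    for (size_t k = 0; k < k_num; ++k) {
        m_all = std::max(m_all, m[k]);
    }

    // rescale each split's sum and accumulator to the global max, then combine
    float l_all = 0.0f;
    std::vector<float> out(hsv, 0.0f);
    for (size_t k = 0; k < k_num; ++k) {
        const float scale = std::exp(m[k] - m_all);
        l_all += l[k]*scale;
        for (size_t d = 0; d < hsv; ++d) {   // one thread per HSV element in the shader
            out[d] += acc[k][d]*scale;
        }
    }
    for (size_t d = 0; d < hsv; ++d) {
        out[d] /= l_all;
    }
    return out;
}
```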
* v1
* push more fixes
* another fix
* fix
* more fixes
* minor fix
* more cleaning on python code
* python fixes
* changed precision for multipliers float 32->64
* fixes
* another fix
* fix
* pre-norm -> norm
* fix
* Revert "fix"
  This reverts commit 243e4d1.
* fix
* small fix ffn_norm
* try
* mix instead of max
* fix vocab size
* conflict solve
* fixed multipliers
* falcon-h1 specific vocab resolved
* read arch from gguf.MODEL_ARCH
* mamba_d_ssm added to d_inner find_hparam
* remove unused functions from gguf_writer.py
* override modify_tensors instead of get_tensors
* fix conversion and d_inner
* added some cb functions for debugging purposes
* inp_out_ids moved outside of layers loop
* mup_vec create as float64
* fix rope_theta
* injected mup
* clean ups
* rm extra space
* rm unused MAMBA_CHUNK_SIZE
* rm unused key
* add bos False
* changed ROPE_TYPE
* cleaning debugging stuff
* cleaning debug quant
* fix comment
* some cleanups
* some cleanups
* Update src/llama-model-loader.cpp
* more cleanups
* moe cleanups
* d_ssm -> d_inner;
* cleaning unused hparams
* cleanup
* more cleanups
* more cleanups on python conversion;
* minor cleanups
* Apply suggestions from code review
  Co-authored-by: Georgi Gerganov <[email protected]>
* remove todo
* added falcon-h1
* tensor not required
* clean
* remove unneeded attributes
* more cleanups and fixed conversion
* remove final_norm
* flake8 fixes
* Update src/llama-model.cpp
  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* flake8 fixes
* Update src/llama-hparams.cpp
  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Update src/llama-model.cpp
  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Update src/llama-model.cpp
  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Update src/llama-arch.cpp
  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Update convert_hf_to_gguf.py
  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* added hashes
* Update src/llama-arch.cpp
  Co-authored-by: Georgi Gerganov <[email protected]>
* Update src/llama-vocab.cpp
  Co-authored-by: Georgi Gerganov <[email protected]>
* update the update file
* Revert "update the update file"
  This reverts commit 082ab4a.
* fix: address suggestions
* fix: update convert_hf_to_gguf.py
* Update gguf-py/gguf/constants.py
  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Update src/llama-model-loader.cpp
  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* d_inner fixed
* Update src/llama-model.cpp
  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* reshaping ssm_norm for 34B
* removing generate_mup
* remove duplicates metadata keys
* rm comment
* final comment
* fix unused args
* fix constants
* fix bad merge
* Update src/llama-model.cpp
  Co-authored-by: compilade <[email protected]>
* falcon-h1: remove unused ssm_in_b and bad merge
* Update src/llama-model.cpp
  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* falcon-h1: fix last comment
* Update convert_hf_to_gguf.py
  Co-authored-by: compilade <[email protected]>
* falcon-h1: revert add_add_bos(False)
* falcon-h1: fix tied weights
* falcon-h1: remove whitespace
* falcon-h1: fix wrong size param
* falcon-h1: fix whitespace issues

---------

Co-authored-by: younesbelkada <[email protected]>
Co-authored-by: Younes B <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: compilade <[email protected]>
* ggml : add ggml_scale_bias
* ggml_vec_mad1_f32
* add more simd
* add CUDA
* sycl
* vulkan
* cann (placeholder)
* opencl
* will this fix cpu?
* fix cuda
* suggestions from coderabbit
* fix cann compile error
* vDSP_vsmsa
* rm __ARM_FEATURE_SVE
* use memcpy for op params
* make code look more consistent
* use scalar for __ARM_FEATURE_SVE
* add x param to ggml_vec_mad1_f32
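The scalar reference behind a helper like ggml_vec_mad1_f32 is simply y[i] = x[i]*s + b, which is also what ggml_scale_bias computes element-wise. The sketch below shows only that fallback loop; it is not the actual ggml implementation, which adds the SIMD paths listed above (e.g. vDSP_vsmsa on Apple platforms).

```cpp
// Hedged sketch of a scalar "mad1" fallback: y[i] = x[i]*s + b.
// Just the reference loop the SIMD variants reduce to.
static void vec_mad1_f32_ref(const int n, float * y, const float * x, const float s, const float b) {
    for (int i = 0; i < n; ++i) {
        y[i] = x[i]*s + b;
    }
}
```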
* wip: llama : separate recurrent states from the KV cache
  This will be necessary to support Jamba (and other recurrent models mixed with Attention). Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
* llama : use std::find for seq_nodes in llama_rs_cache
* llama : state checkpoints for recurrent models
* llama : correctly handle more edge cases for the rs cache
* llama : rename many llama_kv_cache_* functions
* llama : remove useless return value for some llama_cache_* functions
* llama : rethink recurrent state cell counts
* llama : begin work on support for variable GQA
  This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.
* llama : gracefully fail when not finding hybrid slot
* llama : support Jamba
* llama : fix BERT inference without KV cache
* convert-hf : check for unprocessed Jamba experts
* convert-hf : support Mini-Jamba conversion
* llama : fix Jamba quantization sanity checks
* llama : sequence-length-aware batch splitting
* llama : use equal-sequence-length sub-batches for recurrent models
* ggml : simplify SSM-related operators
* llama : make recurrent state slot allocation contiguous
* llama : adapt internal uses of batches to llama_ubatch
* llama : fix batch split output count for embeddings
* llama : minimize swaps when reordering logits
  This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.
* llama : fix edge case finding batch seq_id of split recurrent cell
  This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.
* llama : avoid copies for simple batch splits
* ggml : make ggml_ssm_scan not modify its source tensors
* llama : fix shared recurrent tail cell count for small ubatch sizes
  Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.
* llama : fix .base() compilation error on Windows
* llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
  The implementation already supported it, and this makes Mamba's conv step slightly faster.
* mamba : fix non-contiguous usage of ggml_silu
* llama : session saving and reloading for hybrid models
* convert_hf : fix Jamba conversion
* llama : fix mixed signedness comparison
* llama : use unused n_embd_k_gqa in k_shift
  This also slightly reduces the diff from the master branch
* llama : begin renaming llama_past back to llama_kv_cache
* llama : remove implicit recurrent state rollbacks
* llama : partially apply clang-format style
* convert : fix jamba conv1d shape squeezing
* graph : add back hybrid memory graph input
  But this time it contains the sub-cache graph inputs. This *should* make it easier to handle updating the inputs when caching the graph (eventually).
* model : add Jamba to Mamba-specific hparams printing
* jamba : remove redundant nullptr initializations
* model : remove unnecessary prefix for tensor loading constants
  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* model : use ggml_swiglu_split for Mamba
  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* model : make falcon-h1 use shared mamba2 layer builder
* memory : avoid referring to KV in recurrent cache logs
* gguf-py : avoid adding duplicate tensor mappings for Jamba
  Some of the tensor names are common with Llama4

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
…4392)

* compare-commits.sh: support both llama-bench and test-backend-ops
  Signed-off-by: Xiaodong Ye <[email protected]>
* Speed up the build by specifying -j 12
  Signed-off-by: Xiaodong Ye <[email protected]>
* Remove build_number from test-backend-ops db
  Signed-off-by: Xiaodong Ye <[email protected]>
* Apply suggestion from @JohannesGaessler
  Co-authored-by: Johannes Gäßler <[email protected]>
* Refine tool selection logic
  Signed-off-by: Xiaodong Ye <[email protected]>
* Address review comments
  Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
Signed-off-by: Xiaodong Ye <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
* docker: add cann build pipeline
* docker: add cann build pipeline
* docker: fix cann devops
* cann : fix multi card hccl
* Update ggml/src/ggml-cann/ggml-cann.cpp
  Co-authored-by: Xuan-Son Nguyen <[email protected]>
* Update ggml-cann.cpp

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Xuan-Son Nguyen <[email protected]>
* Initial Q2_K Block Interleaving Implementation
* Addressed review comments and clean up of the code
* Post rebase fixes
* Initial CI/CD fixes
* Update declarations in arch-fallback.h
* Changes for GEMV Q2_K in arch-fallback.h
* Enable repacking only on AVX-512 machines
* Update comments in repack.cpp
* Address q2k comments

---------

Co-authored-by: Manogna-Sree <[email protected]>
* support hunyuan_v1_dense
  Signed-off-by: stevenkuang <[email protected]>
* update hunyuan_moe to hunyuan_v1_moe
  Signed-off-by: stevenkuang <[email protected]>
* fix rope alpha assert and bos token
  Signed-off-by: stevenkuang <[email protected]>
* add blank line
  Signed-off-by: stevenkuang <[email protected]>
* Revert "update hunyuan_moe to hunyuan_v1_moe"
  This reverts commit aa973ca.
* use hunyuan_dense instead of hunyuan_v1_dense
  Signed-off-by: stevenkuang <[email protected]>
* fix hunyuan_moe chat template
  Signed-off-by: stevenkuang <[email protected]>
* remove leftover code
  Signed-off-by: stevenkuang <[email protected]>
* update hunyuan dense chat template
  Signed-off-by: stevenkuang <[email protected]>
* fix hunyuan dense vocab and chat template
  Signed-off-by: stevenkuang <[email protected]>

---------

Signed-off-by: stevenkuang <[email protected]>
* vendor : update vendored copy of google/minja
  Signed-off-by: Lennart Austenfeld <[email protected]>
* Re-remove trailing whitespace
  Signed-off-by: Lennart Austenfeld <[email protected]>
* Remove another trailing whitespace
  Signed-off-by: Lennart Austenfeld <[email protected]>

---------

Signed-off-by: Lennart Austenfeld <[email protected]>
* vulkan: optimizations for direct convolution
  - Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill the GPU. The new size should be amenable to using coopmat, too.
  - Fix shmem bank conflicts. 16B padding should work with coopmat.
  - Some explicit loop unrolling.
  - Skip math/stores work for parts of the tile that are OOB.
  - Apply fastdiv opt.
  - Disable shuffles for NV.
* Three tile sizes for CONV_2D, and a heuristic to choose
* reallow collectives for pre-Turing
* make SHMEM_PAD a spec constant
* fixes for intel perf - no shmem padding, placeholder shader core count
* shader variants with/without unrolling
* 0cc4m's fixes for AMD perf
  Co-authored-by: 0cc4m <[email protected]>

---------

Co-authored-by: 0cc4m <[email protected]>
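To illustrate the bank-conflict padding mentioned above: with 32 four-byte shared-memory banks, a tile whose row stride is a multiple of 32 floats makes column-wise accesses land in the same bank, so a small per-row pad (16 bytes, as the commit describes) staggers them across banks. The declaration below is only a toy C++ illustration, not the shader code, and the tile sizes are assumptions.

```cpp
constexpr int TILE_ROWS = 16;
constexpr int TILE_COLS = 32;
constexpr int SHMEM_PAD = 4;   // 4 floats = 16 bytes of padding per row (assumed values)

// In the real kernel this array would live in shared/workgroup memory;
// the padded row stride of TILE_COLS + SHMEM_PAD avoids stride-32 bank conflicts.
static float tile[TILE_ROWS][TILE_COLS + SHMEM_PAD];
```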
Signed-off-by: Xiaodong Ye <[email protected]>
- Increase tile size for k-quants, to match non-k-quants
- Choose more carefully between large and medium tiles, considering how it interacts with split_k
- Allow larger/non-power-of-two split_k, and make the splits a multiple of 256
- Use split_k==3 when >1/2 and <=2/3 of the SMs would have been used
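One plausible reading of the last bullet, expressed as a hedged sketch (the surrounding selection logic and parameter names are assumptions): if the dispatch would occupy more than 1/2 but no more than 2/3 of the SMs, bump split_k to 3 so the extra splits fill the remaining SMs.

```cpp
#include <cstdint>

// used_sms  : number of SMs the dispatch would occupy before this adjustment
// total_sms : number of SMs on the GPU
static uint32_t pick_split_k(uint32_t used_sms, uint32_t total_sms, uint32_t split_k) {
    // ">1/2 and <=2/3 of the SMs would have been used" -> prefer split_k == 3
    if (2*used_sms > total_sms && 3*used_sms <= 2*total_sms) {
        return 3;
    }
    return split_k;
}
```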
* torch is not required for convert_hf_to_gguf_update
* add --check-missing parameter
* check that pre-tokenizer hashes are up-to-date
* cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 ggml-ci
* cont : fix cont types ggml-ci
* cont : adopt variable names and comment from the other branch
…5040)

This commit removes the right alignment of the `n_stream` value in the log message in the `llama_kv_cache_unified` constructor.

The motivation for this change is to enhance the readability of the log message. Currently the output looks like this:

```console
llama_kv_cache_unified: size = 2048.00 MiB ( 4096 cells, 32 layers, 1/ 1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
```

Notice that the `n_stream` value is right aligned, which makes it a little harder to read. With the change in this commit the output will look like:

```console
llama_kv_cache_unified: size = 2048.00 MiB ( 4096 cells, 32 layers, 1/1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
```
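The difference comes down to a printf-style field width: a width specifier right-aligns the value and inserts the extra space. The snippet below is only an illustration of that effect, not the actual llama.cpp format string.

```cpp
#include <cstdio>

int main() {
    unsigned n_seq_max = 1, n_stream = 1;
    std::printf("%u/%2u seqs\n", n_seq_max, n_stream); // width 2 right-aligns: "1/ 1 seqs"
    std::printf("%u/%u seqs\n",  n_seq_max, n_stream); // no width: "1/1 seqs"
    return 0;
}
```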
… text_config) (#15051)

* basic kimi-vl textmodel conversion
* check config["text_config"] for special tokens
…14994)

* imatrix : use a single count for dense 3d tensors
* imatrix : fix 3d activations when model tensor is 2d
* imatrix : fix 3d tensor counts
* imatrix : use GGUF by default
* imatrix : use GGUF regardless of the output filename
  The legacy format can only be produced with --output-format dat