Sync master with upstream release b5833 #156

jan-service-account · 2025-07-06T04:07:25Z

Updates dev branch with latest release (b5833) from ggml-org/llama.cpp

Co-authored-by: dinhhuy <[email protected]>

* Add Reorder to Q6_K mmvq implementation
* Address PR comments: clean up comments
* Remove unused parameter after refactoring q4_k
* Adding inline to function and removing unnecessary reference to int

---------

Signed-off-by: nscipione <[email protected]>

* webui: fix sidebar being covered by main content

  Signed-off-by: Xiaodong Ye <[email protected]>
* webui: update index.html.gz

  Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>

* Simplify the environment variable setting to specify the memory pool type.
* Adjust the GGML_CANN_ASYNC_MODE setting to accept yes, enable, 1, or on (case-insensitive) as valid options.
* update
* fix CI
* update
* delete whitespace
* fix according to review
* update CANN.md
* update CANN.md
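The accepted spellings for `GGML_CANN_ASYNC_MODE` can be pictured with a small sketch. It is written in Python for brevity rather than in the backend's own C++, and the helper name `env_flag_enabled` is hypothetical. Only the variable name and the four accepted values come from the commit message; treating every other value as "off" is an assumption.

```python
import os

# Values the commit message lists as enabling the option.
TRUTHY = {"yes", "enable", "1", "on"}


def env_flag_enabled(name, environ=None):
    """Return True if `name` is set to one of the accepted spellings.

    Hypothetical helper for illustration; this is not the CANN backend's code.
    The comparison is case-insensitive, as the commit message describes.
    """
    env = os.environ if environ is None else environ
    value = env.get(name)
    if value is None:
        return False  # assumption: an unset variable means "disabled"
    return value.lower() in TRUTHY


# Example with an explicit mapping instead of the real process environment.
print(env_flag_enabled("GGML_CANN_ASYNC_MODE", {"GGML_CANN_ASYNC_MODE": "On"}))
```

Passing a mapping instead of reading `os.environ` directly keeps the check easy to exercise without touching the process environment.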

ggml-ci

* move ggml-cpu-aarch64 to repack
* split quantize_row_q8_0/1
* split helper functions
* split ggml_vec_dot_q4_0_q8_0
* split ggml_vec_dot_q4_1_q8_1
* split ggml_vec_dot_q5_0_q8_0
* split ggml_vec_dot_q5_1_q8_1
* split ggml_vec_dot_q8_0_q8_0
* split ggml_vec_dot_tq1_0_q8_K
* split ggml_vec_dot_tq2_0_q8_K
* split ggml_vec_dot_q2_K_q8_K
* split ggml_vec_dot_q3_K_q8_K
* split ggml_vec_dot_q4_K_q8_K
* split ggml_vec_dot_q5_K_q8_K
* split ggml_vec_dot_q6_K_q8_K
* split ggml_vec_dot_iq2_xxs_q8_K
* split ggml_vec_dot_iq2_xs_q8_K
* split ggml_vec_dot_iq2_s_q8_K
* split ggml_vec_dot_iq3_xxs_q8_K
* split ggml_vec_dot_iq3_s_q8_K
* split ggml_vec_dot_iq1_s_q8_K
* split ggml_vec_dot_iq1_m_q8_K
* split ggml_vec_dot_iq4_nl_q8_0
* split ggml_vec_dot_iq4_xs_q8_K
* fix typos
* fix missing prototypes
* rename ggml-cpu-quants.c
* rename ggml-cpu-traits
* rename arm folder
* move cpu-feats-x86.cpp
* rename ggml-cpu-hbm
* update arm detection macro in quants.c
* move iq quant tables
* split ggml_quantize_mat_q8_0/K
* split ggml_gemv_*
* split ggml_gemm_*
* rename namespace aarch64 to repack
* use weak aliases to replace test macros
* rename GGML_CPU_AARCH64 to GGML_CPU_REPACK
* rename more aarch64 to repack
* clean up rebase leftover
* fix compilation errors
* remove trailing spaces
* try to fix clang compilation errors
* try to fix clang compilation errors again
* try to fix clang compilation errors, 3rd attempt
* try to fix clang compilation errors, 4th attempt
* try to fix clang compilation errors, 5th attempt
* try to fix clang compilation errors, 6th attempt
* try to fix clang compilation errors, 7th attempt
* try to fix clang compilation errors, 8th attempt
* try to fix clang compilation errors, 9th attempt
* more cleanup
* fix compilation errors
* fix apple targets
* fix a typo in arm version of ggml_vec_dot_q4_K_q8_K

  Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>

ggml-org#13980)

* llama : allow building all tests on windows when not using shared libraries
* add static windows build to ci
* tests : enable debug logs for test-chat

---------

Co-authored-by: Georgi Gerganov <[email protected]>

ggml-ci

… device is available, to allow fallback to CPU backend (ggml-org#14099)

ggml-ci

…org#13834)

* kv-cache : avoid modifying recurrent cells when setting inputs
* kv-cache : remove inp_s_mask

  It was replaced with equivalent and simpler functionality with rs_z (the first zeroed state) and the already-existing inp_s_copy.
* kv-cache : fix non-consecutive token pos warning for recurrent models

  The problem was apparently caused by how the tail cells were swapped.
* graph : simplify logic for recurrent state copies
* kv-cache : use cell without src refs for rs_z in recurrent cache
* llama-graph : fix recurrent state copy

  The `state_copy` shuffle assumes everything is moved at once, which is not true when `states_extra` is copied back to the cache before copying the range of states between `head` and `head + n_seqs`. This is only a problem if any of the cells in [`head`, `head + n_seqs`) have an `src` in [`head + n_seqs`, `head + n_kv`), which does happen when `n_ubatch > 1` in the `llama-parallel` example.

  Changing the order of the operations avoids the potential overwrite before use, although when copies are avoided (like with Mamba2), this will require further changes.
* llama-graph : rename n_state to state_size in build_recurrent_state

  This naming should reduce confusion between the state size and the number of states.
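The overwrite-before-use hazard described for the recurrent state copy can be reproduced with a toy model. The sketch below is in Python and is not the llama.cpp graph code: the function names are hypothetical, the cache is a plain list, and `src[i]` is assumed to hold the absolute index of the cell whose state cell `head + i` should receive. It only shows why reading the main range before writing the extra range back gives the intended result.

```python
def read_main_after_extra(cache, src, head, n_seqs):
    """Problematic order: write the extra range back, then gather the main range."""
    cache = list(cache)  # work on a copy so the caller's list is untouched
    n_kv = len(src)
    # Gather the extra states and write them over [head + n_seqs, head + n_kv).
    extra = [cache[src[i]] for i in range(n_seqs, n_kv)]
    cache[head + n_seqs:head + n_kv] = extra
    # The main range now reads cells that may already have been overwritten.
    return [cache[src[i]] for i in range(n_seqs)]


def read_main_before_extra(cache, src, head, n_seqs):
    """Reordered: gather the main range first, then write the extra range back."""
    cache = list(cache)
    n_kv = len(src)
    main = [cache[src[i]] for i in range(n_seqs)]
    extra = [cache[src[i]] for i in range(n_seqs, n_kv)]
    cache[head + n_seqs:head + n_kv] = extra
    return main


# Cell 0 wants the state of cell 3, which lies in the extra range.
cache = ["A", "B", "C", "D"]
src = [3, 1, 2, 2]
print(read_main_after_extra(cache, src, head=0, n_seqs=2))   # ['C', 'B']
print(read_main_before_extra(cache, src, head=0, n_seqs=2))  # ['D', 'B']
```

In the first ordering, cell 3 has already been replaced by the state of cell 2 when cell 0 reads it, so the state "D" is lost. This matches the condition in the commit message: the problem only appears when a main-range cell has its `src` in the extra range.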

Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8) and move it to the vk_device. Move all the descriptor pool and set tracking to the context - none of it is specific to pipelines anymore. It has a single vector of pools and vector of sets, and a single counter to track requests and a single counter to track use.

ggml-ci

…org#14062)

* webui: Wrap long numbers instead of infinite horizontal scroll
* Use tailwind class
* update index.html.gz

This change moves the command pool/buffer tracking into a vk_command_pool structure. There are two instances per context (for compute+transfer) and two instances per device for operations that don't go through a context. This should prevent separate contexts from stomping on each other.

* ggml-cpu: Factor out feature detection build from x86
* ggml-cpu: Add ARM feature detection and scoring

  This is analogous to cpu-feats-x86.cpp. However, to detect compile-time activation of features, we rely on GGML_USE_<FEAT>, which needs to be set in cmake, instead of GGML_<FEAT> that users would set for x86. This is because on ARM, users specify features with GGML_CPU_ARM_ARCH, rather than with individual flags.
* ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for ARM

  Like x86, however to pass around arch flags within cmake, we use GGML_INTERNAL_<FEAT> as we don't have GGML_<FEAT>. Some features are optional, so we may need to build multiple backends per arch version (armv8.2_1, armv8.2_2, ...), and let the scoring function sort out which one can be used.
* ggml-cpu: Limit ARM GGML_CPU_ALL_VARIANTS to Linux for now

  The other platforms will need their own specific variants. This also fixes the bug that the variant-building branch was always being executed as the else-branch of GGML_NATIVE=OFF. The branch is moved to an elseif-branch which restores the previous behavior.
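How a scoring function can "sort out" which of several prebuilt variants is usable might look roughly like the sketch below. It is a Python illustration under stated assumptions, not ggml's scoring code: the feature names (`dotprod`, `fp16`), the `baseline` variant, and the rule that a variant requiring more features scores higher are all hypothetical. Only the variant names `armv8.2_1` and `armv8.2_2` appear in the commit message.

```python
def score_variant(required, available):
    """Return 0 if the host lacks a required feature, else a positive score."""
    required = set(required)
    if not required <= set(available):
        return 0  # this build would use instructions the host does not have
    # Assumption: a variant that needs more features is the more specialised build.
    return 1 + len(required)


def pick_variant(variants, available):
    """Return the name of the usable variant with the highest score, or None."""
    best_name, best_score = None, 0
    for name, required in variants.items():
        score = score_variant(required, available)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical feature sets per variant.
variants = {
    "baseline": [],
    "armv8.2_1": ["dotprod"],
    "armv8.2_2": ["dotprod", "fp16"],
}
print(pick_variant(variants, {"dotprod"}))  # armv8.2_1
```

Returning 0 for an unusable variant lets the caller treat "cannot run here" and "no variant at all" the same way, which is the property the multi-variant build relies on.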

…rg#14130) ggml-ci

ggml-ci

* batch : remove logits_all flag

  ggml-ci
* context : simplify output counting logic during decode

  ggml-ci
* cont : fix comments

* cmake: Simplify build-info.cpp generation

  The rebuild of build-info.cpp still gets triggered when .git/index changes.
* cmake: generate build-info.cpp in build dir

Co-authored-by: dinhhuy <[email protected]>

* cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT
* cmake: Pass on LLAMA_BUILD_* to GGML_BUILD_*

* CUDA: add softmax broadcast
* Pass by const ref
* Review: Use blockDims for indexing, remove designated initializers
* Add TODO for noncontiguous input/output

…amic libraries search for dependencies in their origin directory. (ggml-org#14309)

* ggml : add version function to get lib version

  This commit adds a function `ggml_version()` to the ggml library that returns the version of the library as a string.

  The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used.

  Usage:
  ```c
  printf("GGML version: %s\n", ggml_version());
  ```
  Output:
  ```console
  GGML version: 0.0.2219
  ```
* ggml : add ggml_commit()

---------

Co-authored-by: Georgi Gerganov <[email protected]>

ggml-ci

* llama : initial Mamba-2 support
* ggml : SIMD ggml_ssm_scan for Mamba-2
* ggml : improve ggml_mul speed when masking recurrent states
* llama : support running Mamba-Codestral-7B-v0.1
* llama : fix Mamba-2 conv state saving
* ggml : make the ggml_mul fast broadcast path more consistently formatted
* llama : remove unused variable
* llama : add missing break
* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

  The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
* llama : avoid redundant state copy for Mamba 1 and 2
* metal : attempt to adapt SSM_SCAN for Mamba-2
* metal : fix SSM_SCAN pipeline scope
* metal : use log and exp instead of log1pf and expf in SSM_SCAN
* metal : remove unused arguments for SSM_SCAN

  The max index is 31, so trimming the arguments is necessary.
* metal : add back n_seqs to SSM_SCAN args

  Whoops, this is needed for the offset in the concatenated output.
* metal : fix SSM_SCAN state head offset
* metal : fix wrong number of tokens per sequence in SSM_SCAN
* ggml : remove unused fast broadcast path in GGML_MUL

  This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

  This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM

  This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
* convert : fix flake8 lint
* metal : fix confusion between ; and ,
* metal : add missing args for nb references in ssm_scan_f32_group
* metal : single-user mamba2 inference works
* kv-cache : remove const_cast when setting inputs for s_copy

  And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.
* convert : avoid AutoConfig for Mamba and Mamba2 hparams
* kv-cache : allow context shift for recurrent models
* graph : fix recurrent state copies when avoiding copies

  Works, but using lambda functions might not be that clean.
* ggml : fix mamba2 ssm scan when compiled with SVE
* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
* cuda : implement ssm scan for Mamba2

  There is still room for improvement, but it works!
* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
* mamba : fix mismatched new and delete size for llm_build_mamba

  Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON
* cuda : graceful fallback for Mamba-1 models with weird embd size

* add support for chat template jinja files
* remove gemma3n hack

…-org#14497)

ggml-ci

* ggml : fix FA mask dim 2 and 3

  ggml-ci
* backends : unsupport batched FA in CUDA and Vulkan

  ggml-ci
* vulkan : disable FA for mask->ne[2] != 1

* kv-cache : use ggml_set_rows

  ggml-ci
* graph : separate k and v indices

  ggml-ci
* cont : remove redundant ifs

  ggml-ci
* kv-cache : improve find_slot impl
* kv-cache : bounds-check when accessing slot_info indices
* kv-cache : add comments

  ggml-ci
* ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

  ggml-ci

* convert : correct gemma 3n conversion
* rm redundant code

…g#14504) Signed-off-by: nscipione <[email protected]>

* vulkan: better parameterize FA by head sizes
* vulkan: support mixed/deepseekR1 FA head sizes

…g#14002) Co-authored-by: luyuhong <[email protected]>

ggml-ci

…14368)

* test-backend-ops: add support for specifying output format

  Signed-off-by: Xiaodong Ye <[email protected]>
* Address review comments

  Signed-off-by: Xiaodong Ye <[email protected]>
* Add build_commit and build_number in test_result

  Signed-off-by: Xiaodong Ye <[email protected]>
* Address review comments

  Signed-off-by: Xiaodong Ye <[email protected]>
* refactor

  Signed-off-by: Xiaodong Ye <[email protected]>
* Get build commit from ggml_commit()

  Signed-off-by: Xiaodong Ye <[email protected]>
* Merge errors into test_operation_info && address review comments

  Signed-off-by: Xiaodong Ye <[email protected]>
* Address review comments

  Signed-off-by: Xiaodong Ye <[email protected]>
* Address review comments

  Signed-off-by: Xiaodong Ye <[email protected]>
* remove visitor nonsense
* remove visitor comment

  Signed-off-by: Xiaodong Ye <[email protected]>
* Address review comments

  Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
Co-authored-by: slaren <[email protected]>

…14360)

* vulkan: Handle updated FA dim2/3 definition

  Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
* handle null mask for gqa
* allow gqa with dim3>1
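Packing a boolean and a small integer into one 32-bit word, as the commit describes for the mask flag and `n_head_log2`, can be sketched as follows. The bit layout here (flag in bit 16, value in the low 16 bits) and the function names are assumptions made for illustration; the commit message only says the two values share a single dword so the push constant block stays under 128 bytes.

```python
MASK_FLAG_SHIFT = 16                      # assumed bit position of the mask flag
VALUE_MASK = (1 << MASK_FLAG_SHIFT) - 1   # low 16 bits hold n_head_log2


def pack_mask_n_head_log2(has_mask, n_head_log2):
    """Pack the mask flag and n_head_log2 into one unsigned 32-bit value."""
    if not 0 <= n_head_log2 <= VALUE_MASK:
        raise ValueError("n_head_log2 does not fit in the low 16 bits")
    return (int(bool(has_mask)) << MASK_FLAG_SHIFT) | n_head_log2


def unpack_mask_n_head_log2(word):
    """Recover (has_mask, n_head_log2) from the packed value."""
    return bool((word >> MASK_FLAG_SHIFT) & 1), word & VALUE_MASK


packed = pack_mask_n_head_log2(True, 8)
print(hex(packed))                      # 0x10008
print(unpack_mask_n_head_log2(packed))  # (True, 8)
```

Two separate 4-byte fields become one, which saves 4 bytes in a parameter block with a fixed size budget.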

huydt84 and others added 30 commits July 6, 2025 09:41

add geglu activation function (ggml-org#14074)

27cb396

Co-authored-by: dinhhuy <[email protected]>

graph : fix geglu (ggml-org#14077)

1404858

ggml-ci

sync : ggml

7916c5a

ggml-ci

Vulkan: Don't default to CPU device (like llvmpipe), even if no other…

63786e9

… device is available, to allow fallback to CPU backend (ggml-org#14099)

ggml : fix weak alias win32 (whisper/0)

cf6d8dd

ggml-ci

sync : ggml

fd91217

ggml-ci

vulkan: force device 0 in CI (ggml-org#14106)

756410c

llama : support GEGLU for jina-bert-v2 (ggml-org#14090)

aac5723

convert : fix duplicate key DeepSeek-R1 conversion error (ggml-org#14103)

5952acc

opencl: add mul_mv_id_q4_0_f32_8x_flat (ggml-org#14003)

b98886c

kv-cache : add LLAMA_KV_CACHE_DEBUG environment variable (ggml-org#14121)

856f024

kv-cache : relax SWA masking condition (ggml-org#14119)

a22916c

ggml-ci

webui: Wrap long numbers instead of infinite horizontal scroll (ggml-…

07fae80

…org#14062)

* webui: Wrap long numbers instead of infinite horizontal scroll
* Use tailwind class
* update index.html.gz

tests : add test-tokenizers-repo (ggml-org#14017)

d8e7703

chore : clean up relative source dir paths (ggml-org#14128)

996c2fc

kv-cache : fix split_equal handling in unified implementation (ggml-o…

0095364

…rg#14130) ggml-ci

batch : remove logits_all flag (ggml-org#14141)

1a6b4e6

ggml-ci

context : simplify output counting logic during decode (ggml-org#14142)

63a9403

* batch : remove logits_all flag

  ggml-ci
* context : simplify output counting logic during decode

  ggml-ci
* cont : fix comments

cmake : Improve build-info.cpp generation (ggml-org#14156)

dabef7e

* cmake: Simplify build-info.cpp generation

  The rebuild of build-info.cpp still gets triggered when .git/index changes.
* cmake: generate build-info.cpp in build dir

pooling : make cls_b and cls_out_b optional (ggml-org#14165)

8f86e0d

Co-authored-by: dinhhuy <[email protected]>

cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT (ggml-org#14167)

bb11279

* cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT
* cmake: Pass on LLAMA_BUILD_* to GGML_BUILD_*

am17an and others added 26 commits July 6, 2025 09:59

CUDA: add softmax broadcast (ggml-org#14475)

659dba8

* CUDA: add softmax broadcast
* Pass by const ref
* Review: Use blockDims for indexing, remove designated initializers
* Add TODO for noncontiguous input/output

Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dyn…

7454d93

…amic libraries search for dependencies in their origin directory. (ggml-org#14309)

sync : ggml

e4aebb6

ggml-ci

gguf-py : add support for chat template jinja files (ggml-org#14508)

3f77b9b

* add support for chat template jinja files
* remove gemma3n hack

CUDA: add dynamic shared mem to softmax, refactor general usage (ggml…

9455f96

…-org#14497)

ggml : remove kompute backend (ggml-org#14501)

cca7b95

ggml-ci

ggml : fix FA mask dim 2 and 3 (ggml-org#14505)

ebadc44

* ggml : fix FA mask dim 2 and 3

  ggml-ci
* backends : unsupport batched FA in CUDA and Vulkan

  ggml-ci
* vulkan : disable FA for mask->ne[2] != 1

convert : correct gemma 3n conversion (ggml-org#14450)

6a270ce

* convert : correct gemma 3n conversion
* rm redundant code

Fix conditional enabling following arch checks for ggml-sycl (ggml-or…

0412572

…g#14504) Signed-off-by: nscipione <[email protected]>

ggml: backward pass for split swiglu (ggml-org#14483)

9085ac3

vulkan: support mixed/deepseekR1 FA head sizes (ggml-org#14509)

a10a803

* vulkan: better parameterize FA by head sizes
* vulkan: support mixed/deepseekR1 FA head sizes

opencl : broadcast for soft_max (ggml-org#14510)

54d339e

ggml : implement GEGLU_ERF and GEGLU_QUICK ops (ggml-org#14445)

78eadc1

CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (ggml-or…

f37f966

…g#14002) Co-authored-by: luyuhong <[email protected]>

batch : add n_used count (ggml-org#14512)

9d1aee7

ggml-ci

graph : prepare for 4D mask (ggml-org#14515)

f44cbba

ggml-ci

batch : add optional for sequential equal split (ggml-org#14511)

52132b4

ggml-ci

metal : disable fast math in all quantize kernels (ggml-org#14528)

430ab86

ggml-ci

eval-callback : check for empty input (ggml-org#14539)

2fc8e94

opencl: add GELU_ERF (ggml-org#14476)

1ec29d3

server : fix assistant prefilling when content is an array (ggml-org#…

e88a353

…14360)

vulkan: Handle updated FA dim2/3 definition (ggml-org#14518)

f24278e

* vulkan: Handle updated FA dim2/3 definition

  Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
* handle null mask for gqa
* allow gqa with dim3>1

qnixsynapse force-pushed the update-dev-from-master-2025-07-06-04-07 branch from a0374a6 to f24278e Compare July 6, 2025 04:29

Minh141120 approved these changes Jul 8, 2025

Minh141120 merged commit f7de784 into dev Jul 8, 2025
9 checks passed

Minh141120 deleted the update-dev-from-master-2025-07-06-04-07 branch July 8, 2025 03:14
