
Conversation

@chraac (Contributor) commented on Dec 24, 2025

Performance

Device: 8Gen2

Baseline: ed7597771
Optimization: 2058f28b3

| Operation | Params | Baseline (GFLOPS) | Optimization (GFLOPS) | Speedup |
| --- | --- | --- | --- | --- |
| MUL_MAT (f16, f32) | k=128, n=1 | 3.68 | 5.74 | 1.56x |
| MUL_MAT (f16, f32) | k=14336, n=1 | 3.43 | 7.26 | 2.12x |
| MUL_MAT (f16, f32) | k=14336, n=2 | 3.46 | 7.29 | 2.11x |
| MUL_MAT (f16, f32) | k=14336, n=3 | 3.46 | 7.29 | 2.11x |
| MUL_MAT (f16, f32) | k=14336, n=4 | 3.46 | 7.34 | 2.12x |
| MUL_MAT (f16, f32) | k=14336, n=5 | 3.46 | 7.31 | 2.11x |
| MUL_MAT (f16, f32) | k=14336, n=8 | 3.48 | 7.35 | 2.11x |

@chraac marked this pull request as draft on December 24, 2025 03:01
@github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) on Dec 24, 2025
```diff
-volatile HVX_Vector hi = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_hi_W(xp)), Q6_V_hi_W(yp));
-volatile HVX_Vector lo = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_lo_W(xp)), Q6_V_lo_W(yp));
+HVX_Vector hi = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_hi_W(xp)), Q6_V_hi_W(yp));
+HVX_Vector lo = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_lo_W(xp)), Q6_V_lo_W(yp));
```
@chraac (Contributor, Author) commented on Dec 24, 2025

Based on my observations, using volatile here seems to have several drawbacks:

  • Prevents inlining: with volatile, the binary retains a separate vec_dot_f16_f32 function instead of inlining it into matmul_f16_f32 (see the first disassembly screenshot).
  • Generates extra store instructions: the compiler emits extra vmem instructions to write the result registers out to the stack, as highlighted in the second screenshot. This increases memory-bandwidth pressure, which hurts processing speed. (A minimal sketch of the non-volatile pattern follows below.)
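To illustrate what dropping volatile buys, here is a minimal, self-contained sketch. It is not the PR's actual vec_dot_f16_f32: the function name, the loop structure, the loader arguments, and the Q6_Vqf32_vadd_Vqf32Vqf32 accumulation are assumptions for illustration; only the hi/lo multiply lines come from the diff above. With plain (non-volatile) temporaries, the compiler is free to keep hi and lo in HVX registers, fold them into the accumulator, and inline the helper into the matmul loop instead of spilling each product to the stack.

```c
// Hypothetical sketch of a qf32 dot-product kernel (not the PR's code).
#include <hexagon_types.h>
#include <hexagon_protos.h>

static inline HVX_Vector vec_dot_qf32_sketch(const HVX_VectorPair *xp, // qf32 pairs (e.g. widened from f16)
                                             const HVX_VectorPair *yp, // f32 (sf) pairs
                                             int nblk) {
    HVX_Vector acc = Q6_V_vzero(); // qf32 accumulator; all-zero bits represent 0

    for (int i = 0; i < nblk; ++i) {
        // Plain (non-volatile) temporaries: nothing forces a vmem store to the
        // stack, so hi/lo can live entirely in HVX registers.
        HVX_Vector hi = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_hi_W(xp[i])), Q6_V_hi_W(yp[i]));
        HVX_Vector lo = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_lo_W(xp[i])), Q6_V_lo_W(yp[i]));

        acc = Q6_Vqf32_vadd_Vqf32Vqf32(acc, hi);
        acc = Q6_Vqf32_vadd_Vqf32Vqf32(acc, lo);
    }

    // Convert the qf32 accumulator back to IEEE f32 lanes; the caller would
    // still need a horizontal reduction across lanes to get a scalar.
    return Q6_Vsf_equals_Vqf32(acc);
}
```

With volatile on hi and lo, the compiler must instead materialize each product through its stack slot on every iteration and cannot treat the helper as a pure, inlinable function, which matches the extra vmem traffic and the retained vec_dot_f16_f32 symbol observed in the disassembly.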
