
Conversation

@chraac (Contributor) commented on Dec 24, 2025

Performance

Device: 8Gen2

Baseline: ed7597771
Optimization: 2058f28b3

| Operation | Params | Baseline (GFLOPS) | Optimization (GFLOPS) | Speedup |
| --- | --- | --- | --- | --- |
| MUL_MAT (f16, f32) | k=128, n=1 | 3.68 | 5.74 | 1.56x |
| MUL_MAT (f16, f32) | k=14336, n=1 | 3.43 | 7.26 | 2.12x |
| MUL_MAT (f16, f32) | k=14336, n=2 | 3.46 | 7.29 | 2.11x |
| MUL_MAT (f16, f32) | k=14336, n=3 | 3.46 | 7.29 | 2.11x |
| MUL_MAT (f16, f32) | k=14336, n=4 | 3.46 | 7.34 | 2.12x |
| MUL_MAT (f16, f32) | k=14336, n=5 | 3.46 | 7.31 | 2.11x |
| MUL_MAT (f16, f32) | k=14336, n=8 | 3.48 | 7.35 | 2.11x |

@chraac marked this pull request as draft on December 24, 2025 03:01
@github-actions bot added the `ggml` label (changes relating to the ggml tensor library for machine learning) on Dec 24, 2025
```diff
-volatile HVX_Vector hi = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_hi_W(xp)), Q6_V_hi_W(yp));
-volatile HVX_Vector lo = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_lo_W(xp)), Q6_V_lo_W(yp));
+HVX_Vector hi = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_hi_W(xp)), Q6_V_hi_W(yp));
+HVX_Vector lo = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_lo_W(xp)), Q6_V_lo_W(yp));
```
@chraac (Contributor, Author) commented on Dec 24, 2025

Based on my observations, using volatile here seems to have several drawbacks:

  • Prevents inlining: with volatile, the binary retains a separate vec_dot_f16_f32 function instead of inlining it into matmul_f16_f32 (see the first disassembly screenshot).
  • Generates extra store instructions: the compiler emits extra vmem instructions to write the result registers out to the stack, as highlighted in the second screenshot. This increases memory-bandwidth pressure, which hurts processing speed. (A minimal sketch of the non-volatile pattern follows below.)
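To illustrate what dropping volatile buys, here is a minimal, self-contained sketch. It is not the PR's actual vec_dot_f16_f32: the function name, the loop structure, the loader arguments, and the Q6_Vqf32_vadd_Vqf32Vqf32 accumulation are assumptions for illustration; only the hi/lo multiply lines come from the diff above. With plain (non-volatile) temporaries, the compiler is free to keep hi and lo in HVX registers, fold them into the accumulator, and inline the helper into the matmul loop instead of spilling each product to the stack.

```c
// Hypothetical sketch of a qf32 dot-product kernel (not the PR's code).
#include <hexagon_types.h>
#include <hexagon_protos.h>

static inline HVX_Vector vec_dot_qf32_sketch(const HVX_VectorPair *xp, // qf32 pairs (e.g. widened from f16)
                                             const HVX_VectorPair *yp, // f32 (sf) pairs
                                             int nblk) {
    HVX_Vector acc = Q6_V_vzero(); // qf32 accumulator; all-zero bits represent 0

    for (int i = 0; i < nblk; ++i) {
        // Plain (non-volatile) temporaries: nothing forces a vmem store to the
        // stack, so hi/lo can live entirely in HVX registers.
        HVX_Vector hi = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_hi_W(xp[i])), Q6_V_hi_W(yp[i]));
        HVX_Vector lo = Q6_Vqf32_vmpy_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_V_lo_W(xp[i])), Q6_V_lo_W(yp[i]));

        acc = Q6_Vqf32_vadd_Vqf32Vqf32(acc, hi);
        acc = Q6_Vqf32_vadd_Vqf32Vqf32(acc, lo);
    }

    // Convert the qf32 accumulator back to IEEE f32 lanes; the caller would
    // still need a horizontal reduction across lanes to get a scalar.
    return Q6_Vsf_equals_Vqf32(acc);
}
```

With volatile on hi and lo, the compiler must instead materialize each product through its stack slot on every iteration and cannot treat the helper as a pure, inlinable function, which matches the extra vmem traffic and the retained vec_dot_f16_f32 symbol observed in the disassembly.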
