ggml-cpu: extend support for RVV floating-point kernels #17318
base: master
Conversation
@ggerganov, could this be reviewed? Thanks.
To clarify my previous point, I now understand the various microarchitectural optimizations tuned for the BPI-F3. My main question now is whether these optimizations remain beneficial for other microarchitectures compared to a more generic implementation. I understand that the scarcity of other RVV 1.0 devices makes specific experiments challenging. Would
@xctan the BPI-F3 is currently the best board that is broadly commercially available, which makes it easy for anyone else to replicate and verify. Every other contribution to this project is optimized with a certain micro-architecture in mind (as it should be, to make sure it goes faster!). Now the question is which platform it should be optimized for by default, and I still believe that should be based on something that's broadly commercially available. To your point, I think other micro-architectures should have specific optimizations, with dynamic selection as proposed in #17461. More specifically, the choice among the various LMUL configurations could be based on some dynamic detection.
@xctan, any update on this?
@xctan I don't have a strong opinion as I am not familiar with the specifics of RISC-V architectures. Was #17461 what you had in mind, or do you have something additional in mind?
In my mind, #17318 (comment) is ok! |
I think #17461 is a good starting point. As for this PR, it works well on any RVV device, so I'm fine with using this implementation first before tuning it for other hardware. I'm aware that some microarchitectures implement RVV using element-wise uops, meaning a larger LMUL will be preferred for vector operations; more kernels designed for maximum LMUL usage, rather than relying solely on benchmark-based tuning, can be added later. RISC-V's openness allows the coevolution of hardware and software designs, so we just need to stay open to other design choices. Also, these types of vector operations should be simple enough for compiler auto-vectorization, making a generic implementation with intrinsics not as necessary as I previously thought.
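As a concrete illustration of the kind of loop that last point refers to (illustrative only; the function name is hypothetical and not code from this PR), a plain scalar conversion like the one below is typically auto-vectorized into RVV code by an RVV-enabled GCC or Clang at -O3, with no intrinsics at all:

    // illustrative generic kernel: scalar fp16 -> fp32 conversion,
    // simple enough for the compiler's auto-vectorizer to turn into RVV code
    void fp16_to_fp32_generic(const _Float16 * x, float * y, int n) {
        for (int i = 0; i < n; ++i) {
            y[i] = (float) x[i];
        }
    }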
const int step = epr * 2;
const int np = (n & ~(step - 1));

// unroll by 2
Is there a reason not to use f16m4 -> f32m8 directly, rather than manual unrolling?
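A minimal sketch of the alternative the question describes, assuming the same x/y/n layout as the PR's fp16-to-fp32 kernel (variable names and tail handling here are illustrative, not the PR's code):

    // hypothetical: one f16m4 -> f32m8 widening conversion per iteration, no manual unrolling
    const int epr = (int) __riscv_vsetvlmax_e16m4();
    const int np  = n & ~(epr - 1);
    for (int i = 0; i < np; i += epr) {
        vfloat16m4_t ax = __riscv_vle16_v_f16m4((const _Float16 *)x + i, epr);
        vfloat32m8_t ay = __riscv_vfwcvt_f_f_v_f32m8(ax, epr);
        __riscv_vse32_v_f32m8(y + i, ay, epr);
    }
    // leftovers handled as in the PR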
unroll by 2 is what yielded the best results: https://docs.google.com/presentation/d/1Vrb4qt8YBt0pbiOA4-z2XcIcZIbLwizJa7-s5DclGpo/edit?slide=id.g39983ae8256_0_47#slide=id.g39983ae8256_0_47
Decisions around LMUL and unrolling are a result of the benchmarking numbers summarized in the above PR. We benchmarked various LMUL and unrolling configurations, as well as preventing the compiler from rearranging any load accesses, etc. These permutations were tested with cache-hot and cache-cold data, with cache-hot numbers prioritized.
    __riscv_vse32_v_f32m4(y + i + epr, ay1, epr);
}

// leftovers
We can eliminate the separate leftover loop by configuring the vector length directly within the main loop. This simplifies the code and enables the CPU implementation to distribute tail elements more evenly. There are some examples in vec.h.
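A minimal sketch of that suggestion for this fp16-to-fp32 loop (assuming the same x/y/n as in the diff; intent only, not benchmarked here):

    // hypothetical: vsetvl bounds every iteration, so the main loop also consumes the tail
    // and no separate leftover loop is needed
    int vl;
    for (int i = 0; i < n; i += vl) {
        vl = __riscv_vsetvl_e16m2(n - i);
        vfloat16m2_t ax = __riscv_vle16_v_f16m2((const _Float16 *)x + i, vl);
        vfloat32m4_t ay = __riscv_vfwcvt_f_f_v_f32m4(ax, vl);
        __riscv_vse32_v_f32m4(y + i, ay, vl);
    }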
Assuming we are keeping the unroll by 2 (see https://github.com/ggml-org/llama.cpp/pull/17318/files#r2564828448), this leftover loop allows the remaining elements to be handled in a vectorized manner. There is redundancy between this loop and the scalar one; however, the compiler is smart enough to remove the scalar loop.
// unroll by 2
for (; i < np; i += step) {
    vbfloat16m2_t ax0 = __riscv_vle16_v_bf16m2((const __bf16*)x + i, epr);
    vfloat32m4_t ay0 = __riscv_vfwcvtbf16_f_f_v_f32m4(ax0, epr);
    __riscv_vse32_v_f32m4(y + i, ay0, epr);

    vbfloat16m2_t ax1 = __riscv_vle16_v_bf16m2((const __bf16*)x + i + epr, epr);
    vfloat32m4_t ay1 = __riscv_vfwcvtbf16_f_f_v_f32m4(ax1, epr);
    __riscv_vse32_v_f32m4(y + i + epr, ay1, epr);
}

// leftovers
int vl;
for (i = np; i < n; i += vl) {
    vl = __riscv_vsetvl_e16m2(n - i);
    vbfloat16m2_t ax0 = __riscv_vle16_v_bf16m2((const __bf16*)x + i, vl);
    vfloat32m4_t ay0 = __riscv_vfwcvtbf16_f_f_v_f32m4(ax0, vl);
    __riscv_vse32_v_f32m4(y + i, ay0, vl);
}
Same as above.
https://docs.google.com/presentation/d/1Vrb4qt8YBt0pbiOA4-z2XcIcZIbLwizJa7-s5DclGpo/edit?slide=id.g39983ae8256_0_47#slide=id.g39983ae8256_0_47, lmul=2 and unroll=2 is what yields the best performance.
vbfloat16m2_t ax0 = __riscv_vle16_v_bf16m2((const __bf16 *)&x[i], epr);
vbfloat16m2_t ay0 = __riscv_vle16_v_bf16m2((const __bf16 *)&y[i], epr);
vsum0 = __riscv_vfwmaccbf16_vv_f32m4(vsum0, ax0, ay0, epr);
__asm__ __volatile__ ("" ::: "memory");
Why?
// reduce
vl = __riscv_vsetvlmax_e32m2();
vfloat32m2_t acc0 = __riscv_vfadd_vv_f32m2(__riscv_vget_v_f32m4_f32m2(vsum0, 0), __riscv_vget_v_f32m4_f32m2(vsum0, 1), vl);
vl = __riscv_vsetvlmax_e32m1();
vfloat32m1_t acc1 = __riscv_vfadd_vv_f32m1(__riscv_vget_v_f32m2_f32m1(acc0, 0), __riscv_vget_v_f32m2_f32m1(acc0, 1), vl);
vfloat32m1_t redsum = __riscv_vfredusum_vs_f32m1_f32m1(acc1, __riscv_vfmv_v_f_f32m1(0.0f, 1), vl);
sumf += __riscv_vfmv_f_s_f32m1_f32(redsum);
Why not directly use f32m4 -> f32m1 instead of multiple accumulation steps?
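For reference, a sketch of the single-step reduction being suggested, reusing vsum0 and sumf from the snippet above (assumption: an unordered reduction straight from LMUL=4 to m1):

    // hypothetical: reduce the whole f32m4 accumulator in one vfredusum,
    // with no intermediate vfadd steps
    size_t vlmax = __riscv_vsetvlmax_e32m4();
    vfloat32m1_t zero   = __riscv_vfmv_v_f_f32m1(0.0f, 1);
    vfloat32m1_t redsum = __riscv_vfredusum_vs_f32m4_f32m1(vsum0, zero, vlmax);
    sumf += __riscv_vfmv_f_s_f32m1_f32(redsum);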
Fixed.
vfloat32m4_t vsum0 = __riscv_vfmv_v_f_f32m4(0.0f, vl);
vfloat32m4_t vsum1 = __riscv_vfmv_v_f_f32m4(0.0f, vl);
Consider increasing LMUL for unrolling to prevent code duplication.
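A rough illustration of that suggestion (hypothetical, not benchmarked): a single LMUL=8 accumulator replaces the two unrolled LMUL=4 accumulators, so the loop body is written only once. This assumes the m8 variants of the bf16 widening multiply-accumulate intrinsics are available in the toolchain:

    // hypothetical: one f32m8 accumulator instead of vsum0/vsum1
    int epr = (int) __riscv_vsetvlmax_e16m4();
    vfloat32m8_t vsum = __riscv_vfmv_v_f_f32m8(0.0f, __riscv_vsetvlmax_e32m8());
    int i = 0;
    for (; i + epr <= n; i += epr) {
        vbfloat16m4_t ax = __riscv_vle16_v_bf16m4((const __bf16 *)&x[i], epr);
        vbfloat16m4_t ay = __riscv_vle16_v_bf16m4((const __bf16 *)&y[i], epr);
        vsum = __riscv_vfwmaccbf16_vv_f32m8(vsum, ax, ay, epr);
    }
    // leftovers handled as in the PR, then reduce vsum (e.g. via __riscv_vfredusum_vs_f32m8_f32m1)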
Similarly to fp16_to_fp32, this is what leads to the best performance: https://docs.google.com/presentation/d/1Vrb4qt8YBt0pbiOA4-z2XcIcZIbLwizJa7-s5DclGpo/edit?slide=id.g39983ae8256_0_23#slide=id.g39983ae8256_0_23
}

#elif defined(__riscv_v_intrinsic) && defined(__riscv_zvfh)
size_t vl = __riscv_vsetvlmax_e32m4();
The same suggestions from vec.cpp are applicable here.
ggml/src/ggml-cpu/vec.h
for (int i = 0; i < n; ++i) {
    y[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(y[i]) + GGML_CPU_FP16_TO_FP32(x[i])*v);
#elif defined(__riscv_v_intrinsic) && defined(__riscv_zvfh)
const ggml_fp16_t s = GGML_CPU_FP32_TO_FP16(v);
Consider a true VLEN-agnostic loop here for a cleaner implementation.
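A minimal sketch of such a VLEN-agnostic loop for this kernel (assuming the same x/y/n/v and the fp32 intermediate used by the scalar reference; illustrative only, not the PR's code):

    // hypothetical: strip-mined loop, vl chosen per iteration, so any VLEN and any tail length works
    int vl;
    for (int i = 0; i < n; i += vl) {
        vl = __riscv_vsetvl_e16m2(n - i);
        vfloat16m2_t ax   = __riscv_vle16_v_f16m2((const _Float16 *)&x[i], vl);
        vfloat16m2_t ay   = __riscv_vle16_v_f16m2((const _Float16 *)&y[i], vl);
        vfloat32m4_t ax32 = __riscv_vfwcvt_f_f_v_f32m4(ax, vl);
        vfloat32m4_t ay32 = __riscv_vfwcvt_f_f_v_f32m4(ay, vl);
        ay32 = __riscv_vfmacc_vf_f32m4(ay32, v, ax32, vl);
        __riscv_vse16_v_f16m2((_Float16 *)&y[i], __riscv_vfncvt_f_f_w_f16m2(ay32, vl), vl);
    }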
vy32 = __riscv_vfmul_vf_f32m4(vy32, v, vl);
vy = __riscv_vfncvt_f_f_w_f16m2(vy32, vl);
__riscv_vse16_v_f16m2((_Float16 *)&y[i], vy, vl);
const ggml_fp16_t s = GGML_CPU_FP32_TO_FP16(v);
I admire the commitment to remove the unnecessary float32 proxy and use float16 directly, but using RVV merely to emulate fixed-length SIMD seems like a missed opportunity for elegance. It would be delightful to see an implementation that actually leverages the hardware's native agility.
I'm not sure I understand what you mean here? Would you rather have the fp16 -> fp32 -> fp16 conversion on all the elements of y, rather than the single fp32 -> fp16 conversion on v?
// unroll by 2
for (int i = 0; i < np; i += step) {
    vfloat16m4_t ay0 = __riscv_vle16_v_f16m4((const _Float16*)y + i, epr);
See https://docs.google.com/presentation/d/1Vrb4qt8YBt0pbiOA4-z2XcIcZIbLwizJa7-s5DclGpo/edit?slide=id.g39983ae8256_0_35#slide=id.g39983ae8256_0_35 for the numbers behind the choice of lmul=4 and unroll=2.
This PR extends the existing RISC-V Vector (RVV) floating-point support introduced in #15075, adding new kernels.
Summary
BF16 RVV flag added to ggml-cpu/CMakeLists.txt to enable the zvfbfwma extension
Newly Added Kernels
Testing
Kernels were functionally tested on QEMU with VLENs of 128, 256, 512, and 1024 bits, across a range of input sizes.