
Conversation

@taimur-10x

This PR extends the existing RISC-V Vector (RVV) floating-point support introduced in #15075, adding new kernels.

Summary

  • Adds a BF16 RVV flag to ggml-cpu/CMakeLists.txt to enable the zvfbfwma extension (a compile-time guard sketch follows this list).
  • Adds 6 new kernels for floating-point operations.
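
For context, a minimal sketch of how such an extension-gated path is typically guarded at compile time. The __riscv_v_intrinsic and __riscv_zvfbfwma test macros follow the RISC-V C API convention (the same pattern as the existing __riscv_zvfh guards in this PR); the exact name of the new CMake option is not reproduced here.

#if defined(__riscv_v_intrinsic) && defined(__riscv_zvfbfwma)
    // RVV path using the bf16 widening multiply-accumulate intrinsics,
    // e.g. __riscv_vfwmaccbf16_vv_f32m4(...)
#else
    // generic scalar fallback
#endif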

Newly Added Kernels

  • ggml_vec_dot_bf16
  • ggml_vec_mad_f16
  • ggml_vec_scale_f16
  • ggml_vec_dot_f16_unroll
  • ggml_cpu_bf16_to_fp32
  • ggml_cpu_fp16_to_fp32

Testing

Kernels were functionally tested on QEMU at VLENs of 128, 256, 512, and 1024 bits across a range of input sizes.

@taimur-10x taimur-10x changed the title [RISC-V] Extend support for RVV floating-point kernels ggml-cpu: extend support for RVV floating-point kernels Nov 17, 2025
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Nov 17, 2025
@taimur-10x
Author

@ggerganov, could this be reviewed? Thanks.

@xctan
Collaborator

xctan commented Nov 27, 2025

To clarify my previous point, I now understand the various microarchitectural optimizations tuned for BPI-F3. My main question now is whether these optimizations remain beneficial for other microarchitectures compared to a more generic implementation. I understand that the scarcity of other RVV 1.0 devices makes specific experiments challenging. Would GGML_BACKEND_DL be suitable for this tuning purpose? @ggerganov, what are your thoughts?

@luhenry

luhenry commented Nov 27, 2025

@xctan the BPI-F3 is currently the best board that is broadly commercially available, which makes it easy for anyone else to replicate and verify. Every other contribution to this project is optimized with a certain micro-architecture in mind (as it should be, to make sure it goes faster!). Now the question is: what platform should it be optimized for by default? I still believe that it should be based on something that's broadly commercially available.

To your point, I think other micro-architectures should have their own specific optimizations, with dynamic selection as proposed in #17461. More specifically, the choice among the various LMUL configurations could be based on some dynamic detection.

@taimur-10x
Author

@xctan, any update on this?

@ggerganov
Member

Would GGML_BACKEND_DL be suitable for this tuning purpose? @ggerganov, what are your thoughts?

@xctan I don't have a strong opinion as I am not familiar with the specifics of RISC architectures. Was #17461 what you had in mind, or do you have something additional in mind?

@ixgbe
Contributor

ixgbe commented Dec 12, 2025

Would GGML_BACKEND_DL be suitable for this tuning purpose? @ggerganov, what are your thoughts?

@xctan I don't have a strong opinion as I am not familiar with the specifics of RISC architectures. Was #17461 what you had in mind, or do you have something additional in mind?

In my mind, #17318 (comment) is ok!

@xctan
Collaborator

xctan commented Dec 12, 2025

I think #17461 is a good starting point. The riscv_hwprobe syscall can provide detailed microarchitecture info like vendor ID, march ID, and mimpl ID. This information would be helpful for implementing microarchitecture-tuning infrastructure in a future PR.

As for this PR, it works well on any RVV device, so I'm fine with using this implementation first before tuning it for other hardware. I'm aware that some microarchitectures implement RVV using element-wise uops, meaning a larger LMUL will be preferred for vector operations. More kernels designed for maximum LMUL usage, rather than relying solely on benchmark-based tuning, can then be added later. RISC-V's openness allows the coevolution of hardware and software designs, so we just need to be open to other design choices. Also, these types of vector operations should be simple enough for compiler auto-vectorization, making a generic implementation with intrinsics not as necessary as I previously thought.
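
For reference, a minimal sketch of querying those IDs through the riscv_hwprobe syscall, assuming Linux 6.4+ headers that provide <asm/hwprobe.h> (newer glibc also offers a __riscv_hwprobe wrapper). This only illustrates the detection mentioned above; it is not code from this PR or from #17461.

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/hwprobe.h>   // struct riscv_hwprobe, RISCV_HWPROBE_KEY_* (Linux >= 6.4)

int main(void) {
    // ask the kernel for the vendor/arch/implementation IDs of the current harts
    struct riscv_hwprobe pairs[] = {
        { .key = RISCV_HWPROBE_KEY_MVENDORID },
        { .key = RISCV_HWPROBE_KEY_MARCHID   },
        { .key = RISCV_HWPROBE_KEY_MIMPID    },
    };
    if (syscall(__NR_riscv_hwprobe, pairs, 3, 0, NULL, 0) != 0) {
        return 1;
    }
    printf("mvendorid=%llx marchid=%llx mimpid=%llx\n",
           (unsigned long long) pairs[0].value,
           (unsigned long long) pairs[1].value,
           (unsigned long long) pairs[2].value);
    return 0;
}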

const int step = epr * 2;
const int np = (n & ~(step - 1));

// unroll by 2
Collaborator

Is there a reason not to use f16m4 -> f32m8 directly, rather than manual unrolling?
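
For illustration, a sketch of the direct f16m4 -> f32m8 form being asked about, written against the same x, y, n variables as the hunk above; this is a hypothetical alternative, not code from the PR.

for (int i = 0; i < n; ) {
    // widen f16 (LMUL=4) to f32 (LMUL=8) in a single stripmined loop, no manual unroll
    int vl = __riscv_vsetvl_e16m4(n - i);
    vfloat16m4_t ax = __riscv_vle16_v_f16m4((const _Float16 *)x + i, vl);
    vfloat32m8_t ay = __riscv_vfwcvt_f_f_v_f32m8(ax, vl);
    __riscv_vse32_v_f32m8(y + i, ay, vl);
    i += vl;
}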


Author

@taimur-10x Dec 15, 2025

riseproject-dev#1

Decisions around LMUL and unrolling are a result of the benchmarking numbers summarized in the above PR. We benchmarked various LMUL and unrolling configurations, as well as variants that prevent the compiler from rearranging load accesses. These permutations were tested for both cache-hot and cache-cold cases, with cache-hot numbers prioritized.

__riscv_vse32_v_f32m4(y + i + epr, ay1, epr);
}

// leftovers
Collaborator

We can eliminate the separate leftover loop by configuring the vector length directly within the main loop. This simplifies the code and enables the CPU implementation to distribute tail elements more evenly. There are some examples in vec.h.


Assuming we are keeping the unroll by 2 (see https://github.com/ggml-org/llama.cpp/pull/17318/files#r2564828448), this leftover loop allows the remaining elements to be handled in a vectorized manner. There is redundancy between this loop and the scalar one, but the compiler is smart enough to remove the scalar loop.

Comment on lines +3376 to +3394
// unroll by 2
for (; i < np; i += step) {
    vbfloat16m2_t ax0 = __riscv_vle16_v_bf16m2((const __bf16*)x + i, epr);
    vfloat32m4_t ay0 = __riscv_vfwcvtbf16_f_f_v_f32m4(ax0, epr);
    __riscv_vse32_v_f32m4(y + i, ay0, epr);

    vbfloat16m2_t ax1 = __riscv_vle16_v_bf16m2((const __bf16*)x + i + epr, epr);
    vfloat32m4_t ay1 = __riscv_vfwcvtbf16_f_f_v_f32m4(ax1, epr);
    __riscv_vse32_v_f32m4(y + i + epr, ay1, epr);
}

// leftovers
int vl;
for (i = np; i < n; i += vl) {
    vl = __riscv_vsetvl_e16m2(n - i);
    vbfloat16m2_t ax0 = __riscv_vle16_v_bf16m2((const __bf16*)x + i, vl);
    vfloat32m4_t ay0 = __riscv_vfwcvtbf16_f_f_v_f32m4(ax0, vl);
    __riscv_vse32_v_f32m4(y + i, ay0, vl);
}
Collaborator

Same as above.
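
For illustration, the suggested single-loop form of this bf16 -> f32 hunk would look roughly like the leftover loop generalised to the whole range, reusing the intrinsics and variables above; a sketch, not the PR's code.

for (int i = 0; i < n; ) {
    // vl is set per iteration, so the tail is handled by the same loop
    int vl = __riscv_vsetvl_e16m2(n - i);
    vbfloat16m2_t ax0 = __riscv_vle16_v_bf16m2((const __bf16*)x + i, vl);
    vfloat32m4_t ay0 = __riscv_vfwcvtbf16_f_f_v_f32m4(ax0, vl);
    __riscv_vse32_v_f32m4(y + i, ay0, vl);
    i += vl;
}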


vbfloat16m2_t ax0 = __riscv_vle16_v_bf16m2((const __bf16 *)&x[i], epr);
vbfloat16m2_t ay0 = __riscv_vle16_v_bf16m2((const __bf16 *)&y[i], epr);
vsum0 = __riscv_vfwmaccbf16_vv_f32m4(vsum0, ax0, ay0, epr);
__asm__ __volatile__ ("" ::: "memory");
Collaborator

Why?


Comment on lines 234 to 240
// reduce
vl = __riscv_vsetvlmax_e32m2();
vfloat32m2_t acc0 = __riscv_vfadd_vv_f32m2(__riscv_vget_v_f32m4_f32m2(vsum0, 0), __riscv_vget_v_f32m4_f32m2(vsum0, 1), vl);
vl = __riscv_vsetvlmax_e32m1();
vfloat32m1_t acc1 = __riscv_vfadd_vv_f32m1(__riscv_vget_v_f32m2_f32m1(acc0, 0), __riscv_vget_v_f32m2_f32m1(acc0, 1), vl);
vfloat32m1_t redsum = __riscv_vfredusum_vs_f32m1_f32m1(acc1, __riscv_vfmv_v_f_f32m1(0.0f, 1), vl);
sumf += __riscv_vfmv_f_s_f32m1_f32(redsum);
Collaborator

Why not directly use f32m4 -> f32m1 instead of multiple accumulation steps?

Author

Fixed.
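
For reference, the direct reduction suggested above would look roughly like this, reusing vsum0 and sumf from the hunk (a sketch of the idea, not necessarily the exact code that landed):

// reduce the f32m4 accumulator to a scalar in a single vfredusum
size_t vl = __riscv_vsetvlmax_e32m4();
vfloat32m1_t zero   = __riscv_vfmv_v_f_f32m1(0.0f, 1);
vfloat32m1_t redsum = __riscv_vfredusum_vs_f32m4_f32m1(vsum0, zero, vl);
sumf += __riscv_vfmv_f_s_f32m1_f32(redsum);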

Comment on lines +202 to +203
vfloat32m4_t vsum0 = __riscv_vfmv_v_f_f32m4(0.0f, vl);
vfloat32m4_t vsum1 = __riscv_vfmv_v_f_f32m4(0.0f, vl);
Collaborator

Consider increasing LMUL for unrolling to prevent code duplication.
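
For illustration, a sketch of the higher-LMUL variant: one f32m8 accumulator fed by bf16 loads at LMUL=4, instead of two f32m4 accumulators. x, y and n are taken from the surrounding kernel; the m8 form of the widening bf16 MAC is assumed to exist per the standard intrinsic naming scheme, and tail handling is omitted.

size_t vlmax = __riscv_vsetvlmax_e16m4();   // same element count as e32m8
vfloat32m8_t vsum = __riscv_vfmv_v_f_f32m8(0.0f, __riscv_vsetvlmax_e32m8());
int i = 0;
for (; i + (int)vlmax <= n; i += (int)vlmax) {
    vbfloat16m4_t ax = __riscv_vle16_v_bf16m4((const __bf16 *)&x[i], vlmax);
    vbfloat16m4_t ay = __riscv_vle16_v_bf16m4((const __bf16 *)&y[i], vlmax);
    vsum = __riscv_vfwmaccbf16_vv_f32m8(vsum, ax, ay, vlmax);
}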


}

#elif defined(__riscv_v_intrinsic) && defined(__riscv_zvfh)
size_t vl = __riscv_vsetvlmax_e32m4();
Collaborator

The same suggestions from vec.cpp are applicable here.


for (int i = 0; i < n; ++i) {
y[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(y[i]) + GGML_CPU_FP16_TO_FP32(x[i])*v);
#elif defined(__riscv_v_intrinsic) && defined(__riscv_zvfh)
const ggml_fp16_t s = GGML_CPU_FP32_TO_FP16(v);
Collaborator

Consider a true VLEN-agnostic loop here for a cleaner implementation.
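
For illustration, a VLEN-agnostic sketch of this kernel body that stays in f16 throughout, with x, y, n and v as in the surrounding code. Note the assumption: accumulating directly in f16 with vfmacc skips the f32 round-trip of the scalar reference, which may or may not be acceptable for accuracy.

const _Float16 s = (_Float16)v;   // _Float16 is available whenever the zvfh intrinsics are
for (int i = 0; i < n; ) {
    size_t vl = __riscv_vsetvl_e16m8(n - i);
    vfloat16m8_t vx = __riscv_vle16_v_f16m8((const _Float16 *)&x[i], vl);
    vfloat16m8_t vy = __riscv_vle16_v_f16m8((const _Float16 *)&y[i], vl);
    vy = __riscv_vfmacc_vf_f16m8(vy, s, vx, vl);      // vy += s * vx
    __riscv_vse16_v_f16m8((_Float16 *)&y[i], vy, vl);
    i += (int)vl;
}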

vy32 = __riscv_vfmul_vf_f32m4(vy32, v, vl);
vy = __riscv_vfncvt_f_f_w_f16m2(vy32, vl);
__riscv_vse16_v_f16m2((_Float16 *)&y[i], vy, vl);
const ggml_fp16_t s = GGML_CPU_FP32_TO_FP16(v);
Collaborator

I admire the commitment to remove the unnecessary float32 proxy and use float16 directly, but using RVV merely to emulate fixed-length SIMD seems like a missed opportunity for elegance. It would be delightful to see an implementation that actually leverages the hardware's native agility.

@luhenry Dec 12, 2025

I'm not sure I understand what you mean here? Would you rather have the fp16 -> fp32 -> fp16 conversion on all the elements of y, rather than the single fp32 -> fp16 conversion on v?


