ggml-cpu: extend support for RVV floating-point kernels #17318
base: master
Conversation
@ggerganov, could this be reviewed? Thanks.
To clarify my previous point, I now understand the various microarchitectural optimizations tuned for the BPI-F3. My main question now is whether these optimizations remain beneficial for other microarchitectures compared to a more generic implementation. I understand that the scarcity of other RVV 1.0 devices makes specific experiments challenging. Would
@xctan the BPI-F3 is currently the best board that is broadly commercially available, which makes it easy for anyone else to replicate and verify. Every other contribution to this project is optimized with a certain micro-architecture in mind (as it should be, to make sure it goes faster!). Now the question is which platform it should be optimized for by default, and I still believe that should be based on something that's broadly commercially available. To your point, I think other micro-architectures should have specific optimizations, with dynamic selection as proposed in #17461. More specifically, the choice among the various LMUL configurations could be based on some dynamic detection.
@xctan, any update on this?
@xctan I don't have a strong opinion as I am not familiar with the specifics of RISC-V architectures. Was #17461 what you had in mind, or do you have something additional in mind?
In my mind, #17318 (comment) is ok! |
I think #17461 is a good starting point. As for this PR, it works well on any RVV device, so I'm fine with using this implementation first before tuning it for other hardware. I'm aware that some microarchitectures implement RVV using element-wise uops, meaning a larger LMUL will be preferred for vector operations; more kernels designed for maximum LMUL usage, rather than relying solely on benchmark-based tuning, can be added later. RISC-V's openness allows the coevolution of hardware and software designs, so we just need to stay open to other design choices. Also, these types of vector operations should be simple enough for compiler auto-vectorization, making a generic implementation with intrinsics not as necessary as I previously thought.
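As a concrete illustration of the kind of loop that last point refers to (illustrative only; the function name is hypothetical and not code from this PR), a plain scalar conversion like the one below is typically auto-vectorized into RVV code by an RVV-enabled GCC or Clang at -O3, with no intrinsics at all:

    // illustrative generic kernel: scalar fp16 -> fp32 conversion,
    // simple enough for the compiler's auto-vectorizer to turn into RVV code
    void fp16_to_fp32_generic(const _Float16 * x, float * y, int n) {
        for (int i = 0; i < n; ++i) {
            y[i] = (float) x[i];
        }
    }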
const int step = epr * 2;
const int np = (n & ~(step - 1));

// unroll by 2
Is there a reason not to use f16m4 -> f32m8 directly, rather than manual unrolling?
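A minimal sketch of the alternative the question describes, assuming the same x/y/n layout as the PR's fp16-to-fp32 kernel (variable names and tail handling here are illustrative, not the PR's code):

    // hypothetical: one f16m4 -> f32m8 widening conversion per iteration, no manual unrolling
    const int epr = (int) __riscv_vsetvlmax_e16m4();
    const int np  = n & ~(epr - 1);
    for (int i = 0; i < np; i += epr) {
        vfloat16m4_t ax = __riscv_vle16_v_f16m4((const _Float16 *)x + i, epr);
        vfloat32m8_t ay = __riscv_vfwcvt_f_f_v_f32m8(ax, epr);
        __riscv_vse32_v_f32m8(y + i, ay, epr);
    }
    // leftovers handled as in the PR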
unroll by 2 is what yielded the best results: https://docs.google.com/presentation/d/1Vrb4qt8YBt0pbiOA4-z2XcIcZIbLwizJa7-s5DclGpo/edit?slide=id.g39983ae8256_0_47#slide=id.g39983ae8256_0_47
Decisions around LMUL and unrolling are a result of the benchmarking numbers summarized in the above PR. We benchmarked various LMUL and unrolling configurations, as well as preventing the compiler from rearranging any load accesses, etc. These permutations were tested with cache-hot and cache-cold data, with cache-hot numbers prioritized.
    __riscv_vse32_v_f32m4(y + i + epr, ay1, epr);
}

// leftovers
We can eliminate the separate leftover loop by configuring the vector length directly within the main loop. This simplifies the code and enables the CPU implementation to distribute tail elements more evenly. There are some examples in vec.h.
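A minimal sketch of that suggestion for this fp16-to-fp32 loop (assuming the same x/y/n as in the diff; intent only, not benchmarked here):

    // hypothetical: vsetvl bounds every iteration, so the main loop also consumes the tail
    // and no separate leftover loop is needed
    int vl;
    for (int i = 0; i < n; i += vl) {
        vl = __riscv_vsetvl_e16m2(n - i);
        vfloat16m2_t ax = __riscv_vle16_v_f16m2((const _Float16 *)x + i, vl);
        vfloat32m4_t ay = __riscv_vfwcvt_f_f_v_f32m4(ax, vl);
        __riscv_vse32_v_f32m4(y + i, ay, vl);
    }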
Assuming we are keeping the unroll by 2 (see https://github.com/ggml-org/llama.cpp/pull/17318/files#r2564828448), this leftover loop allows the remaining elements to be handled in a vectorized manner. There is redundancy between this loop and the scalar one; however, the compiler is smart enough to remove the scalar loop.
// unroll by 2
for (; i < np; i += step) {
    vbfloat16m2_t ax0 = __riscv_vle16_v_bf16m2((const __bf16*)x + i, epr);
    vfloat32m4_t ay0 = __riscv_vfwcvtbf16_f_f_v_f32m4(ax0, epr);
    __riscv_vse32_v_f32m4(y + i, ay0, epr);

    vbfloat16m2_t ax1 = __riscv_vle16_v_bf16m2((const __bf16*)x + i + epr, epr);
    vfloat32m4_t ay1 = __riscv_vfwcvtbf16_f_f_v_f32m4(ax1, epr);
    __riscv_vse32_v_f32m4(y + i + epr, ay1, epr);
}

// leftovers
int vl;
for (i = np; i < n; i += vl) {
    vl = __riscv_vsetvl_e16m2(n - i);
    vbfloat16m2_t ax0 = __riscv_vle16_v_bf16m2((const __bf16*)x + i, vl);
    vfloat32m4_t ay0 = __riscv_vfwcvtbf16_f_f_v_f32m4(ax0, vl);
    __riscv_vse32_v_f32m4(y + i, ay0, vl);
}
Same as above.
https://docs.google.com/presentation/d/1Vrb4qt8YBt0pbiOA4-z2XcIcZIbLwizJa7-s5DclGpo/edit?slide=id.g39983ae8256_0_47#slide=id.g39983ae8256_0_47, lmul=2 and unroll=2 is what yields the best performance.
vbfloat16m2_t ax0 = __riscv_vle16_v_bf16m2((const __bf16 *)&x[i], epr);
vbfloat16m2_t ay0 = __riscv_vle16_v_bf16m2((const __bf16 *)&y[i], epr);
vsum0 = __riscv_vfwmaccbf16_vv_f32m4(vsum0, ax0, ay0, epr);
__asm__ __volatile__ ("" ::: "memory");
Why?
// reduce
vl = __riscv_vsetvlmax_e32m2();
vfloat32m2_t acc0 = __riscv_vfadd_vv_f32m2(__riscv_vget_v_f32m4_f32m2(vsum0, 0), __riscv_vget_v_f32m4_f32m2(vsum0, 1), vl);
vl = __riscv_vsetvlmax_e32m1();
vfloat32m1_t acc1 = __riscv_vfadd_vv_f32m1(__riscv_vget_v_f32m2_f32m1(acc0, 0), __riscv_vget_v_f32m2_f32m1(acc0, 1), vl);
vfloat32m1_t redsum = __riscv_vfredusum_vs_f32m1_f32m1(acc1, __riscv_vfmv_v_f_f32m1(0.0f, 1), vl);
sumf += __riscv_vfmv_f_s_f32m1_f32(redsum);
Why not directly use f32m4 -> f32m1 instead of multiple accumulation steps?
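For reference, a sketch of the single-step reduction being suggested, reusing vsum0 and sumf from the snippet above (assumption: an unordered reduction straight from LMUL=4 to m1):

    // hypothetical: reduce the whole f32m4 accumulator in one vfredusum,
    // with no intermediate vfadd steps
    size_t vlmax = __riscv_vsetvlmax_e32m4();
    vfloat32m1_t zero   = __riscv_vfmv_v_f_f32m1(0.0f, 1);
    vfloat32m1_t redsum = __riscv_vfredusum_vs_f32m4_f32m1(vsum0, zero, vlmax);
    sumf += __riscv_vfmv_f_s_f32m1_f32(redsum);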
Fixed.
vfloat32m4_t vsum0 = __riscv_vfmv_v_f_f32m4(0.0f, vl);
vfloat32m4_t vsum1 = __riscv_vfmv_v_f_f32m4(0.0f, vl);
Consider increasing LMUL for unrolling to prevent code duplication.
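A rough illustration of that suggestion (hypothetical, not benchmarked): a single LMUL=8 accumulator replaces the two unrolled LMUL=4 accumulators, so the loop body is written only once. This assumes the m8 variants of the bf16 widening multiply-accumulate intrinsics are available in the toolchain:

    // hypothetical: one f32m8 accumulator instead of vsum0/vsum1
    int epr = (int) __riscv_vsetvlmax_e16m4();
    vfloat32m8_t vsum = __riscv_vfmv_v_f_f32m8(0.0f, __riscv_vsetvlmax_e32m8());
    int i = 0;
    for (; i + epr <= n; i += epr) {
        vbfloat16m4_t ax = __riscv_vle16_v_bf16m4((const __bf16 *)&x[i], epr);
        vbfloat16m4_t ay = __riscv_vle16_v_bf16m4((const __bf16 *)&y[i], epr);
        vsum = __riscv_vfwmaccbf16_vv_f32m8(vsum, ax, ay, epr);
    }
    // leftovers handled as in the PR, then reduce vsum (e.g. via __riscv_vfredusum_vs_f32m8_f32m1)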
Similarly to fp16_to_fp32, this is what leads to the best performance: https://docs.google.com/presentation/d/1Vrb4qt8YBt0pbiOA4-z2XcIcZIbLwizJa7-s5DclGpo/edit?slide=id.g39983ae8256_0_23#slide=id.g39983ae8256_0_23
}

#elif defined(__riscv_v_intrinsic) && defined(__riscv_zvfh)
size_t vl = __riscv_vsetvlmax_e32m4();
The same suggestions from vec.cpp are applicable here.
ggml/src/ggml-cpu/vec.h
for (int i = 0; i < n; ++i) {
    y[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(y[i]) + GGML_CPU_FP16_TO_FP32(x[i])*v);
#elif defined(__riscv_v_intrinsic) && defined(__riscv_zvfh)
const ggml_fp16_t s = GGML_CPU_FP32_TO_FP16(v);
Consider a true VLEN-agnostic loop here for a cleaner implementation.
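A minimal sketch of such a VLEN-agnostic loop for this kernel (assuming the same x/y/n/v and the fp32 intermediate used by the scalar reference; illustrative only, not the PR's code):

    // hypothetical: strip-mined loop, vl chosen per iteration, so any VLEN and any tail length works
    int vl;
    for (int i = 0; i < n; i += vl) {
        vl = __riscv_vsetvl_e16m2(n - i);
        vfloat16m2_t ax   = __riscv_vle16_v_f16m2((const _Float16 *)&x[i], vl);
        vfloat16m2_t ay   = __riscv_vle16_v_f16m2((const _Float16 *)&y[i], vl);
        vfloat32m4_t ax32 = __riscv_vfwcvt_f_f_v_f32m4(ax, vl);
        vfloat32m4_t ay32 = __riscv_vfwcvt_f_f_v_f32m4(ay, vl);
        ay32 = __riscv_vfmacc_vf_f32m4(ay32, v, ax32, vl);
        __riscv_vse16_v_f16m2((_Float16 *)&y[i], __riscv_vfncvt_f_f_w_f16m2(ay32, vl), vl);
    }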
vy32 = __riscv_vfmul_vf_f32m4(vy32, v, vl);
vy = __riscv_vfncvt_f_f_w_f16m2(vy32, vl);
__riscv_vse16_v_f16m2((_Float16 *)&y[i], vy, vl);
const ggml_fp16_t s = GGML_CPU_FP32_TO_FP16(v);
I admire the commitment to remove the unnecessary float32 proxy and use float16 directly, but using RVV merely to emulate fixed-length SIMD seems like a missed opportunity for elegance. It would be delightful to see an implementation that actually leverages the hardware's native agility.
I'm not sure I understand what you mean here? Would you rather have the fp16 -> fp32 -> fp16 conversion on all the elements of y, rather than the single fp32 -> fp16 conversion on v?
// unroll by 2
for (int i = 0; i < np; i += step) {
    vfloat16m4_t ay0 = __riscv_vle16_v_f16m4((const _Float16*)y + i, epr);
See https://docs.google.com/presentation/d/1Vrb4qt8YBt0pbiOA4-z2XcIcZIbLwizJa7-s5DclGpo/edit?slide=id.g39983ae8256_0_35#slide=id.g39983ae8256_0_35 for the numbers behind the choice of lmul=4 and unroll=2.
This PR extends the existing RISC-V Vector (RVV) floating-point support introduced in #15075, adding new kernels.
Summary
BF16 RVV flag added to ggml-cpu/CMakeLists.txt to enable the zvfbfwma extension
Newly Added Kernels
Testing
Kernels were functionally tested on QEMU with VLENs of 128, 256, 512, and 1024 bits, across a range of input sizes.