igzip/riscv64: Optimize isal_adler32_rvv with 4x loop unrolling and tail agnostic(ta) #373
base: master
9e77959 to d571a01
It looks like, of the 3 commits, one is empty and the other two are identical. Can you merge them into one?
d571a01 to 3d3eee7
Done. I have merged them as requested.
add a1, a1, t4
sub a2, a2, t4

vsetvli zero, t1, e32, m4, tu, ma
Maybe it's better to use zero, zero here (vsetvli zero, zero, ...) to indicate that the same number of elements is processed as before? The same applies to some of the other vset instructions.
I will make the modifications as suggested.
vle8.v v2, (a4)
add a5, a4, t1
vle8.v v3, (a5)
mv t5, a2
As before the unrolling, this can be moved outside the loop, as long as vrsub.vx is changed to use a2 and the update of a2 is moved a bit later.
Thanks for the review. I will modify this part.
add a4, a3, t1
vle8.v v2, (a4)
add a5, a4, t1
vle8.v v3, (a5)
Here we can use the same register.
I will make the modifications as suggested.
vmv.x.s a4, v0 // B = a4
vmv.x.s t2, v24 // A = t2
add t3, t4, t3
add t3, a4, t3
No need to change the name here.
I will make the modifications as suggested.
f6b8c54 to 3d3eee7
How is it looking now, @sunyuechi?
vid.v v12 // 0, 1, 2, .. vl-1
vadd.vv v8, v8, v4
vrsub.vx v12, v12, a2 // len, len-1, len-2
vwmaccu.vv v16, v12, v4 // v16: B += weight * next
These three comment lines have extra spaces added. Please remove them to restore the original alignment.
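For readers following the weights in the snippet above: vid.v followed by vrsub.vx yields the sequence len, len-1, len-2, ..., and vwmaccu.vv accumulates weight times byte into B. A minimal C sketch of that identity is given below as an illustration, not as the PR's code. The function names, the ADLER_MOD macro, and the length bound are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u

/* Byte-at-a-time reference: A sums the bytes, B sums the running values of A. */
static uint32_t adler32_ref(uint32_t init, const unsigned char *buf, size_t len)
{
    uint32_t a = init & 0xffffu;
    uint32_t b = init >> 16;
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}

/* Weighted form: byte i of an n-byte buffer contributes (n - i) times to B,
 * so B can be updated once per block instead of once per byte. */
static uint32_t adler32_weighted(uint32_t init, const unsigned char *buf, size_t len)
{
    /* Hypothetical bound for this sketch; keeps the 64-bit sums far from overflow. */
    assert(len <= ((size_t)1 << 20));

    uint64_t a = init & 0xffffu;
    uint64_t b = init >> 16;
    uint64_t sum = 0;   /* plain byte sum, feeds A */
    uint64_t wsum = 0;  /* weighted byte sum, feeds B */

    for (size_t i = 0; i < len; i++) {
        sum += buf[i];
        wsum += (uint64_t)(len - i) * buf[i];  /* weights len, len-1, ..., 1 */
    }

    /* The incoming A is added to B once per byte, hence the len * a term. */
    b = (b + (uint64_t)len * a + wsum) % ADLER_MOD;
    a = (a + sum) % ADLER_MOD;
    return (uint32_t)((b << 16) | a);
}
```

Both functions should agree for any input within the bound; the weighted form is the shape that maps naturally onto a multiply-accumulate over a vector of bytes.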
sub a2, a2, t1
bnez a2, single

3:
The labels are numbered 1:, 3:, 4:, with 2: missing. It might be better to use 1:, 2:, 3: instead.
vadd.vv v8, v8, v28
vwmaccu.vv v16, v12, v28
sub a2, a2, a4
bge a2, t0, unroll_loop_4x
Please move the sub instruction earlier, so that bge does not depend on the result of the instruction immediately before it.
vwmaccu.vv v16, v12, v4 // v16: B += weight * next
add a1, a1, t1
bnez a2, 1b
sub a2, a2, t1
Please move the sub instruction earlier to avoid the dependency with bge.
After addressing the above minor issues,
Thanks for the review. I'll address the issues and merge the commits.
fbcf370 to 8b9e5e6
…ail agnostic(ta) Signed-off-by: WenLei <[email protected]>
8b9e5e6 to 7a11b91
LGTM
This PR optimizes the isal_adler32_rvv implementation by introducing 4x loop unrolling and the tail-agnostic (ta) policy.
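As a rough scalar illustration of what 4x unrolling means for Adler-32, the C sketch below processes four chunks per iteration with independent accumulators and falls back to a byte loop for the tail. It is not the RVV code from this PR: the CHUNK size stands in for the hardware vector length, and the function name and accumulator layout are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u
#define CHUNK 16u  /* hypothetical stand-in for the number of bytes per vector load */

static uint32_t adler32_unrolled4(uint32_t init, const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    uint32_t a = init & 0xffffu;
    uint32_t b = init >> 16;
    const uint32_t block = 4u * CHUNK;  /* bytes consumed per unrolled iteration */

    while (len >= block) {
        uint32_t s[4] = {0, 0, 0, 0};  /* per-lane byte sums (feed A) */
        uint32_t w[4] = {0, 0, 0, 0};  /* per-lane weighted sums (feed B) */

        for (uint32_t i = 0; i < CHUNK; i++) {
            /* Four independent load/accumulate streams, one per chunk.
             * The byte at block offset p carries weight (block - p). */
            s[0] += buf[i];
            w[0] += (block - i) * buf[i];
            s[1] += buf[CHUNK + i];
            w[1] += (block - CHUNK - i) * buf[CHUNK + i];
            s[2] += buf[2 * CHUNK + i];
            w[2] += (block - 2 * CHUNK - i) * buf[2 * CHUNK + i];
            s[3] += buf[3 * CHUNK + i];
            w[3] += (block - 3 * CHUNK - i) * buf[3 * CHUNK + i];
        }

        /* Fold the four lanes together and reduce once per block. */
        b = (b + block * a + w[0] + w[1] + w[2] + w[3]) % ADLER_MOD;
        a = (a + s[0] + s[1] + s[2] + s[3]) % ADLER_MOD;
        buf += block;
        len -= block;
    }

    /* Tail: fewer than one full block left, handled byte by byte. */
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}
```

Keeping the four lanes independent is the point of the sketch: no accumulation in one lane waits on another, and the modulo reduction is paid once per block rather than once per byte.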
The optimization has been verified on the SG2044 platform: