-
Notifications
You must be signed in to change notification settings - Fork 288
sse4.1 instructions #98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- HACK warning: messing with the constify macros - Selecting only one buffer gets optimized away and tests need to take this into account
#[inline(always)] | ||
#[target_feature = "+sse4.1"] | ||
#[cfg_attr(test, assert_instr(pblendw, imm8=0xF0))] | ||
pub unsafe fn _mm_blend_epi16(a: i16x8, b: i16x8, imm8: u8) -> i16x8 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the imm8
be u8
or i32
? The public interface in C is const int
, but I followed _mm_dp_ps
for consistency and used u8
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd imagine that so long as more than 8 bits aren't read then passing u8
here seems fine to me, but @BurntSushi do you think we should super strictly follow C?
macro_rules! call { | ||
($imm2:expr) => { blendpd(a, b, $imm2) } | ||
} | ||
constify_imm2!(imm2, call) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the correct way, since I'm messing with macros, but it was the most obvious way to make it work for now.
src/x86/sse41.rs
Outdated
#[inline(always)] | ||
#[target_feature = "+sse4.1"] | ||
#[cfg_attr(test, assert_instr(extractps, imm8=0))] | ||
pub unsafe fn _mm_extract_ps(a: f32x4, imm8: u8) -> i32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is again a question of how much should C be followed.
This intrinsic returns somewhat nonsensically an i32
:
int _mm_extract_ps (__m128 a, const int imm8)
Should we just return an f32
or continue to use the transmute here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me it seems ok to return f32
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it really should be an i32
. See answers at https://stackoverflow.com/questions/5526658/intel-sse-why-does-mm-extract-ps-return-int-instead-of-float
src/x86/sse41.rs
Outdated
#[inline(always)] | ||
#[target_feature = "+sse4.1"] | ||
#[cfg_attr(test, assert_instr(pextrb, imm8=0))] | ||
pub unsafe fn _mm_extract_epi8(a: i8x16, imm8: u8) -> i32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some as the previous one, but should it be i8
instead of i32
?
Looks like CI may still be failing? |
Some context:
|
Win64 uses a different calling convention. Unless you define a different calling convention, it will pass vector arguments in memory instead of via the XMM/YMM registers. This causes different code to be generated. See, e.g., https://msdn.microsoft.com/en-us/library/9z1stfyw.aspx |
Thanks @nominolo ! Is there a way I can fix this? |
@p32blo You probably want to just change the expected code. I.e., something like: https://github.com/rust-lang-nursery/stdsimd/blob/2dbe8d0b39a1b07e9cf4846d593908bed588ce9f/src/x86/sse.rs#L751 |
src/x86/sse41.rs
Outdated
#[cfg(target_arch = "x86_64")] | ||
#[inline(always)] | ||
#[target_feature = "+sse4.1"] | ||
#[cfg_attr(all(test, windows), assert_instr(mov, imm8=1))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should Windows be tested at all, since it does not generate the correct output? Or is testing for mov
ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think testing for plain mov
doesn't make sense, because it'll always be satisfied by mov rbp,rsp
which is part of every function prelude.
Ok looks great, thanks! We can always work on adding more tests later if necessary. |
* sse4.1: _mm_blendv_ps and _mm_blendv_pd * sse4.1: _mm_blend_ps and _mm_blend_pd - HACK warning: messing with the constify macros - Selecting only one buffer gets optimized away and tests need to take this into account * sse4.1: _mm_blend_epi16 * sse4.1: _mm_extract_ps * sse4.1: _mm_extract_epi8 * see4.1: _mm_extract_epi32 * sse4.1: _mm_extract_epi64 * sse4.1: _mm_insert_ps * sse4.1: _mm_insert_epi8 * sse4.1: _mm_insert_epi32 and _mm_insert_epi64 * Formmating * sse4.1: _mm_max_epi8, _mm_max_epu16, _mm_max_epi32 and _mm_max_epu32 * Fix wrong compiler flag - avx -> sse4.1 * Fix intrinsics that only work with x86-64 * sse4.1: use appropriate types * Revert '_mm_extract_ps' to return i32 * sse4.1: Use the v128 types for consistency * Try fix for windows * Try "vectorcall" calling convention * Revert "Try "vectorcall" calling convention" This reverts commit 12936e9. * Revert "Try fix for windows" This reverts commit 9c47380. * Change tests for windows * Remove useless Windows test
I have some doubts over several aspects of the PR so I will add inline comments.