-
Notifications
You must be signed in to change notification settings - Fork 288
Implement SSE _mm_load* instructions #99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/// ``` | ||
#[inline(always)] | ||
#[target_feature = "+sse"] | ||
// TODO: generates MOVHPD if the CPU supports SSE2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this actually got me thinking! I remembered that x86_64 is defined with SSE2 support by default, which may explain this function on that target. It turns out as well, however, that i686 also has SSE2 turned on by default!
I think to assert this instruction we'd have to use the i586 target (which has sse/sse2 disabled), but in the meantime this is probably ok.
src/x86/sse.rs
Outdated
pub unsafe fn _mm_loadu_ps(p: *const f32) -> f32x4 { | ||
// TODO: This also seems to generate the same code. Don't know which one | ||
// will behave better when inlined into other code. | ||
// f32x4::new(*p, *p.offset(1), *p.offset(2), *p.offset(3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that if there's no restrictions on alignment here that the implementation below is the one we'll have to go with, AFAIK a *p
will require correct alignment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK, then I'll remove this TODO.
Awesome, thanks! |
Looks like the tests may be failing on i686? |
Looks like it picks different instructions in 32-bit mode:
Not sure what to do about this. The implementation is based on the one from We have some other problems with |
Hm maybe we could just relax it here for now? I'm not sure how to deal with this as I'm not sure why there's different codegen... |
src/x86/sse.rs
Outdated
pub unsafe fn _mm_loadu_ps(p: *const f32) -> f32x4 { | ||
// Note: Using `*p` would require `f32` alignment, but `movups` has no | ||
// alignment restrictions. | ||
let mut dst = mem::uninitialized(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may be causing the faults seen on Travis, I had to use f32x4::splat(uninitialized())
to fix these issues.
unsafe fn _mm_loadu_ps() { | ||
let vals = &[1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]; | ||
let p = vals.as_ptr().offset(3); | ||
let r = sse::_mm_loadu_ps(black_box(p)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think another case of a possible crash here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the failing tests are due to different instruction selection on i586 vs i686 for loadh_ps
and loadl_ps
(movss
vs movsd
, and unpcklps
vs unpcklpd
). I don't see a convenient way to distinguish them, though. They're both target_arch = "x86"
and I see no other way to distinguish them based on standard attributes: https://stackoverflow.com/questions/41742046/is-there-a-list-of-all-cfg-features
The perhaps actionable difference is whether SSE2 is available or not. I'm guessing all the sse42
and sse2
tests are just no-ops on i586?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops misread the logs!
You can conditionalize for i586/i686 with #[cfg(target_feature = "sse2")]
, that's enabled on i686 and disabled on i586
Thanks again! |
This implements SSE instructions:
_mm_loadh_pi
_mm_loadl_pi
_mm_load_ss
_mm_load1_ps
_mm_load_ps1
_mm_load_ps
_mm_loadu_ps
_mm_loadr_ps
I also added some experimental doctests, but I can't get them to run. (That is, I can't get them to fail by adding
assert!(false)
. They are being compiled, though)