Skip to content

Implement SSE _mm_load* instructions #99

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Oct 8, 2017

Conversation

nominolo
Copy link
Contributor

@nominolo nominolo commented Oct 6, 2017

This implements SSE instructions:

  • _mm_loadh_pi
  • _mm_loadl_pi
  • _mm_load_ss
  • _mm_load1_ps
  • _mm_load_ps1
  • _mm_load_ps
  • _mm_loadu_ps
  • _mm_loadr_ps

I also added some experimental doctests, but I can't get them to run. (That is, I can't get them to fail by adding assert!(false). They are being compiled, though)

/// ```
#[inline(always)]
#[target_feature = "+sse"]
// TODO: generates MOVHPD if the CPU supports SSE2.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this actually got me thinking! I remembered that x86_64 is defined with SSE2 support by default, which may explain this function on that target. It turns out as well, however, that i686 also has SSE2 turned on by default!

I think to assert this instruction we'd have to use the i586 target (which has sse/sse2 disabled), but in the meantime this is probably ok.

src/x86/sse.rs Outdated
pub unsafe fn _mm_loadu_ps(p: *const f32) -> f32x4 {
// TODO: This also seems to generate the same code. Don't know which one
// will behave better when inlined into other code.
// f32x4::new(*p, *p.offset(1), *p.offset(2), *p.offset(3))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if there's no restrictions on alignment here that the implementation below is the one we'll have to go with, AFAIK a *p will require correct alignment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK, then I'll remove this TODO.

@alexcrichton
Copy link
Member

Awesome, thanks!

@alexcrichton
Copy link
Member

Looks like the tests may be failing on i686?

@nominolo
Copy link
Contributor Author

nominolo commented Oct 6, 2017

Looks like it picks different instructions in 32-bit mode:

disassembly for stdsimd::x86::sse::_mm_loadh_pi: 
	 0: push %ebp 
	 1: mov %esp,%ebp 
	 2: mov 0x8(%ebp),%eax 
	 3: movsd (%eax),%xmm1        // 3&4 could be combined into
	 4: unpcklpd %xmm1,%xmm0      // movhps (%eax),%xmm0
	 5: pop %ebp 
	 6: ret 
disassembly for stdsimd::x86::sse::_mm_loadl_pi: 
	 0: push %ebp 
	 1: mov %esp,%ebp 
	 2: mov 0x8(%ebp),%eax 
	 3: movsd (%eax),%xmm1   // 3&4 could be written as
	 4: movsd %xmm1,%xmm0    // movlps (%eax),%xmm0
	 5: pop %ebp 
	 6: ret

Not sure what to do about this. The implementation is based on the one from xmmintrin.h

We have some other problems with assert_instr being too strict / fragile. We can make it less strict, or we could have a test based on the result matching the C intrinsic. But that would require invoking a C compiler somewhere. Perhaps, we can have a separate C file reference_sse.c and then have a macro matches_c_body(<function_name>).

@alexcrichton
Copy link
Member

Hm maybe we could just relax it here for now? I'm not sure how to deal with this as I'm not sure why there's different codegen...

src/x86/sse.rs Outdated
pub unsafe fn _mm_loadu_ps(p: *const f32) -> f32x4 {
// Note: Using `*p` would require `f32` alignment, but `movups` has no
// alignment restrictions.
let mut dst = mem::uninitialized();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may be causing the faults seen on Travis, I had to use f32x4::splat(uninitialized()) to fix these issues.

unsafe fn _mm_loadu_ps() {
let vals = &[1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
let p = vals.as_ptr().offset(3);
let r = sse::_mm_loadu_ps(black_box(p));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think another case of a possible crash here?

Copy link
Contributor Author

@nominolo nominolo Oct 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the failing tests are due to different instruction selection on i586 vs i686 for loadh_ps and loadl_ps (movss vs movsd, and unpcklps vs unpcklpd). I don't see a convenient way to distinguish them, though. They're both target_arch = "x86" and I see no other way to distinguish them based on standard attributes: https://stackoverflow.com/questions/41742046/is-there-a-list-of-all-cfg-features
The perhaps actionable difference is whether SSE2 is available or not. I'm guessing all the sse42 and sse2 tests are just no-ops on i586?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops misread the logs!

You can conditionalize for i586/i686 with #[cfg(target_feature = "sse2")], that's enabled on i686 and disabled on i586

@alexcrichton alexcrichton merged commit c9d23d2 into rust-lang:master Oct 8, 2017
@alexcrichton
Copy link
Member

Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants