Skip to content

sse4.1 instructions #98

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Oct 18, 2017
Merged

sse4.1 instructions #98

merged 27 commits into from
Oct 18, 2017

Conversation

p32blo
Copy link
Contributor

@p32blo p32blo commented Oct 6, 2017

I have some doubts over several aspects of the PR so I will add inline comments.

#[inline(always)]
#[target_feature = "+sse4.1"]
#[cfg_attr(test, assert_instr(pblendw, imm8=0xF0))]
pub unsafe fn _mm_blend_epi16(a: i16x8, b: i16x8, imm8: u8) -> i16x8 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the imm8 be u8 or i32? The public interface in C is const int, but I followed _mm_dp_ps for consistency and used u8.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd imagine that so long as more than 8 bits aren't read then passing u8 here seems fine to me, but @BurntSushi do you think we should super strictly follow C?

macro_rules! call {
($imm2:expr) => { blendpd(a, b, $imm2) }
}
constify_imm2!(imm2, call)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the correct way, since I'm messing with macros, but it was the most obvious way to make it work for now.

src/x86/sse41.rs Outdated
#[inline(always)]
#[target_feature = "+sse4.1"]
#[cfg_attr(test, assert_instr(extractps, imm8=0))]
pub unsafe fn _mm_extract_ps(a: f32x4, imm8: u8) -> i32 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is again a question of how much should C be followed.

This intrinsic returns somewhat nonsensically an i32:

int _mm_extract_ps (__m128 a, const int imm8)

Should we just return an f32 or continue to use the transmute here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it seems ok to return f32 here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/x86/sse41.rs Outdated
#[inline(always)]
#[target_feature = "+sse4.1"]
#[cfg_attr(test, assert_instr(pextrb, imm8=0))]
pub unsafe fn _mm_extract_epi8(a: i8x16, imm8: u8) -> i32 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some as the previous one, but should it be i8 instead of i32?

@p32blo p32blo changed the title sse4. 1 instructions sse4.1 instructions Oct 6, 2017
@alexcrichton
Copy link
Member

Looks like CI may still be failing?

@p32blo
Copy link
Contributor Author

p32blo commented Oct 10, 2017

Some context:

  • Travis is failing with some avx tests. They work on my machine so I don't know what to do to fix this. Also can't get why AppVeyor is failing.
  • Couldn't change _mm_extract_ps to return f32 because it does not generate the correct instruction with the change.
  • Changed _mm_extract_epi64 and _mm_insert_epi64 to only work on x86_64 since it failed to generate the instruction on 32-bit builds. See here.
  • used i8x16 instead of __m128i on _mm_blendv_epi8 for consistency.

@nominolo
Copy link
Contributor

Win64 uses a different calling convention. Unless you define a different calling convention, it will pass vector arguments in memory instead of via the XMM/YMM registers. This causes different code to be generated. See, e.g., https://msdn.microsoft.com/en-us/library/9z1stfyw.aspx

@p32blo
Copy link
Contributor Author

p32blo commented Oct 17, 2017

Thanks @nominolo !

Is there a way I can fix this?

@nominolo
Copy link
Contributor

@p32blo You probably want to just change the expected code. I.e., something like: https://github.com/rust-lang-nursery/stdsimd/blob/2dbe8d0b39a1b07e9cf4846d593908bed588ce9f/src/x86/sse.rs#L751

src/x86/sse41.rs Outdated
#[cfg(target_arch = "x86_64")]
#[inline(always)]
#[target_feature = "+sse4.1"]
#[cfg_attr(all(test, windows), assert_instr(mov, imm8=1))]
Copy link
Contributor Author

@p32blo p32blo Oct 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should Windows be tested at all, since it does not generate the correct output? Or is testing for mov ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think testing for plain mov doesn't make sense, because it'll always be satisfied by mov rbp,rsp which is part of every function prelude.

@alexcrichton
Copy link
Member

Ok looks great, thanks! We can always work on adding more tests later if necessary.

@alexcrichton alexcrichton merged commit ac1e68a into rust-lang:master Oct 18, 2017
pythoneer pushed a commit to pythoneer/stdsimd that referenced this pull request Oct 19, 2017
* sse4.1: _mm_blendv_ps and _mm_blendv_pd

* sse4.1: _mm_blend_ps and _mm_blend_pd

- HACK warning: messing with the constify macros
- Selecting only one buffer gets optimized away and tests need to take this into account

* sse4.1: _mm_blend_epi16

* sse4.1: _mm_extract_ps

* sse4.1: _mm_extract_epi8

* see4.1: _mm_extract_epi32

* sse4.1: _mm_extract_epi64

* sse4.1: _mm_insert_ps

* sse4.1: _mm_insert_epi8

* sse4.1: _mm_insert_epi32 and _mm_insert_epi64

* Formmating

* sse4.1: _mm_max_epi8, _mm_max_epu16, _mm_max_epi32 and _mm_max_epu32

* Fix wrong compiler flag

- avx -> sse4.1

* Fix intrinsics that only work with x86-64

* sse4.1: use appropriate types

* Revert '_mm_extract_ps' to return i32

* sse4.1: Use the v128 types for consistency

* Try fix for windows

* Try "vectorcall" calling convention

* Revert "Try "vectorcall" calling convention"

This reverts commit 12936e9.

* Revert "Try fix for windows"

This reverts commit 9c47380.

* Change tests for windows

* Remove useless Windows test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants