Avx512f #921
Conversation
r? @Amanieu (rust_highfive has picked a reviewer for you, use r? to override)
```rust
#[target_feature(enable = "avx512f")]
pub unsafe fn _mm512_setr4_epi64(d: i64, c: i64, b: i64, a: i64) -> __m512i {
    _mm512_set_epi64(a, b, c, d, a, b, c, d)
}
```
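For context, a quick usage sketch (hypothetical values, assuming the definition above): the first argument of `_mm512_setr4_epi64` lands in the lowest lane, and the four values repeat across the register.

```rust
use core::arch::x86_64::*;

#[target_feature(enable = "avx512f")]
unsafe fn setr4_demo() -> __m512i {
    // Lanes in low-to-high order: [1, 2, 3, 4, 1, 2, 3, 4], because
    // setr4 forwards its arguments to _mm512_set_epi64 in reversed
    // (memory) order, repeated twice.
    _mm512_setr4_epi64(1, 2, 3, 4)
}
```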
Why are these only available on x86_64?
> Why are these only available on x86_64?
I put them in x86_64 because _mm512_set_epi64 is in x86_64.
I don't think any of the SSE/AVX intrinsics are specific to x86_64; they should all work on x86.
When I move set4 to x86, it shows:
```text
---- verify_all_signatures stdout ----
failed to verify `_mm512_set4_epi64`
  - intrinsic `_mm512_set4_epi64` uses a 64-bit bare type but may be available on 32-bit platforms
failed to verify `_mm512_setr4_epi64`
  - intrinsic `_mm512_setr4_epi64` uses a 64-bit bare type but may be available on 32-bit platforms
thread 'verify_all_signatures' panicked at 'assertion failed: all_valid', crates/stdarch-verify/tests/x86-intel.rs:362:5
```
You can add them to the whitelist in crates/stdarch-verify/tests/x86-intel.rs.
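I haven't verified the exact shape of that whitelist, but it is essentially a name-based exemption from the 64-bit bare-type check; a hypothetical sketch of the kind of entry involved (the real structure in x86-intel.rs may differ):

```rust
// Hypothetical sketch only; the actual whitelist in
// crates/stdarch-verify/tests/x86-intel.rs may be shaped differently.
fn bare_64_bit_arg_allowed(name: &str) -> bool {
    matches!(
        name,
        // These build the vector element-wise, so taking `i64`
        // arguments is fine even on 32-bit targets.
        "_mm512_set4_epi64" | "_mm512_setr4_epi64"
    )
}
```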
All the `set` intrinsics should be added to the whitelist and moved to x86.
> All the `set` intrinsics should be added to the whitelist and moved to x86.
How about _mm512_set_epi64 and _mm512_setr_epi64?
Those too.
crates/core_arch/src/x86/avx512f.rs (Outdated)
```rust
        _mm512_setzero_si512().as_i32x16(),
        k,
    ));
    ptr::write_unaligned(mem_addr as *mut __m512i, r);
```
This is incorrect: the intrinsic is only supposed to store the elements selected by the mask, not the full 512 bits. This is done with a special LLVM intrinsic: https://godbolt.org/z/vWf7KE
> This is incorrect: the intrinsic is only supposed to store the elements selected by the mask, not the full 512 bits. This is done with a special LLVM intrinsic: https://godbolt.org/z/vWf7KE
Yes, I tried to implement it using llvm.masked.compressstore.v16f32, but no matter which order I tried for the three parameters, I got a link error about bad parameters (I tried all 3×2×1 = 6 orders).
Also, compress_ps and compressstoreu_ps use the same instruction: vcompressps.
The memory and register forms of the instruction behave differently: https://www.felixcloutier.com/x86/vcompressps

> Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z must be zero.
>
> Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper bits are zeroed.
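To make the difference concrete, here is a scalar sketch of the two forms (illustrative only, not the stdarch implementation; `f32x16` is modeled as `[f32; 16]` and `__mmask16` as `u16`):

```rust
/// Memory-destination form (vcompressps to memory): only the selected
/// elements are written, contiguously; popcount(k) floats in total.
unsafe fn compressstoreu_ps_model(mem_addr: *mut f32, k: u16, a: [f32; 16]) {
    let mut m = 0;
    for j in 0..16 {
        if k & (1 << j) != 0 {
            *mem_addr.add(m) = a[j];
            m += 1;
        }
    }
}

/// Register-destination form (vcompressps to register, merge-masking):
/// the selected elements are packed into the low lanes and the
/// remaining lanes keep their values from `src`.
fn compress_ps_model(src: [f32; 16], k: u16, a: [f32; 16]) -> [f32; 16] {
    let mut dst = src;
    let mut m = 0;
    for j in 0..16 {
        if k & (1 << j) != 0 {
            dst[m] = a[j];
            m += 1;
        }
    }
    dst
}
```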
I tried to follow the LLVM documentation, `declare void @llvm.masked.compressstore.v16f32(<16 x float>, float*, <16 x i1>)`, to implement it:

```rust
extern "C" {
    #[link_name = "llvm.masked.compressstore.v16f32"]
    fn vcompresspss(a: f32x16, p: *mut f32, mask: i16);
}
```

However, when I compile it, it shows:

```text
Intrinsic has incorrect argument type
void (<16 x float>, float*, i16)* @llvm.masked.compressstore.v16f32
```

Any clue how to solve this?
I think this will require special support in the compiler. We don't currently have a way to express vectors of `i1`.
> I think this will require special support in the compiler. We don't currently have a way to express vectors of `i1`.
True; I think I will drop these functions for now.
crates/core_arch/src/x86/avx512f.rs (Outdated)
```rust
    k: __mmask16,
    mem_addr: *const i32,
) -> __m512i {
    let load = ptr::read_unaligned(mem_addr as *const __m512i);
```
Same here.
You also need to remove `expandloadu` since it reads more than it should, which could cause out-of-bounds reads.
expandloadu copies the value from src when the mask bit is false:
```text
m := 0
FOR j := 0 to 15
    i := j*32
    IF k[j]
        dst[i+31:i] := MEM[mem_addr+m+31:mem_addr+m]
        m := m + 32
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
```
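A direct scalar rendering of that pseudocode (a model for illustration, not the intrinsic itself) makes the out-of-bounds concern visible: only popcount(k) elements may be read from mem_addr, whereas the read_unaligned version always reads all 64 bytes:

```rust
/// Scalar model of _mm512_mask_expandloadu_epi32: reads exactly
/// popcount(k) contiguous elements from `mem_addr`; lanes whose mask
/// bit is clear are filled from `src` instead.
unsafe fn expandloadu_epi32_model(src: [i32; 16], k: u16, mem_addr: *const i32) -> [i32; 16] {
    let mut dst = [0i32; 16];
    let mut m = 0;
    for j in 0..16 {
        if k & (1 << j) != 0 {
            dst[j] = *mem_addr.add(m);
            m += 1;
        } else {
            dst[j] = src[j];
        }
    }
    dst
}
```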
Yes, `@llvm.masked.expandload.v16f32(float* %{{.}}, <16 x i1> %{{.}}, <16 x float> %{{.*}})`. I will remove it.
Btw, I thought expandloadu_ps = loadu_ps + expand_ps because the latencies are similar (11 vs. 4 + 8). It seems expandloadu_ps is a little faster.
Intrinsics covered by this PR:

alignr: epi32, epi64
zextps128_ps512, zextps256_ps512, zextpd128_pd512, zextpd256_pd512, zextsi128_si512, zextsi256_si512
undefined: epi32, (); setzero: epi32, ()
set_epi8, set_epi16, set1_epi8, set1_epi16
set4: epi32, epi64, ps, pd; setr4: epi32, epi64, ps, pd
cvtepi8_epi32, cvtepi8_epi64, cvtepu8_epi32, cvtepu8_epi64
cvtepi16_epi32, cvtepi16_epi64, cvtepu16_epi32, cvtepu16_epi64
cvtepi32_epi64, cvtepu32_epi64, cvtepi32_ps, cvtepi32_pd
cvtepu32_ps, cvtepu32_pd, cvtepi32lo_pd, cvtepu32lo_pd
cvtepi32_epi16, cvtepi32_epi8
cvtepi64_epi32, cvtepi64_epi16, cvtepi64_epi8
cvtsepi32_epi16, cvtsepi32_epi8
cvtsepi64_epi32, cvtsepi64_epi16, cvtsepi64_epi8
cvtusepi32_epi16, cvtusepi32_epi8, cvtusepi64_epi32, cvtusepi64_epi16, cvtusepi64_epi8
cvtpd_ps, cvt_roundpd_ps
cvtpd_pslo, cvtpslo_pd
cvt_roundpd_epi32, cvt_roundpd_epu32
cvt_roundepi32_ps, cvt_roundepu32_ps
cvt_roundps_ph, cvtps_ph
cvt_roundph_ps, cvtph_ps
reduce_add: epi32, ps, pd
reduce_mul: epi32, ps, pd
reduce_max: epi32, epu32, ps, pd
reduce_min: epi32, epu32, ps, pd
reduce_and: epi32
loadu: epi32, epi64, si512; storeu: epi32, epi64, si512
load: epi32, epi64, si512, ps, pd; store: epi32, epi64, si512, ps, pd
extractf32x4_ps, extractf64x4_pd, extracti32x4_epi32, extracti64x4_epi64
reduce_or: epi32
compress: epi32, epi64, ps, pd
expand: epi32, epi64, ps, pd