Avx512f #921


Merged
merged 48 commits into from
Oct 10, 2020

Conversation

@minybot (Contributor) commented Sep 30, 2020

alignr: epi32,epi64
zextps128_ps512,zextps256_ps512,zextpd128_pd512,zextpd256_pd512,zextsi128_si512,zextsi256_si512
undefined: epi32, (); set_zero: epi32, ()
set_epi8, set_epi16, set1_epi8, set1_epi16
set4: epi32,epi64,ps,pd; setr4: epi32,epi64,ps,pd
cvtepi8_epi32, cvtepi8_epi64, cvtepu8_epi32, cvtepu8_epi64
cvtepi16_epi32, cvtepi16_epi64, cvtepu16_epi32, cvtepu16_epi64
cvtepi32_epi64, cvtepu32_epi64, cvtepi32_ps, cvtepi32_pd
cvtepu32_ps, cvtepu32_pd, cvtepi32lo_pd, cvtepu32lo_pd
cvtepi32_epi16, cvtepi32_epi8,
cvtepi64_epi32, cvtepi64_epi16, cvtepi64_epi8
cvtsepi32_epi16, cvtsepi32_epi8
cvtsepi64_epi32, cvtsepi64_epi16, cvtsepi64_epi8
cvtusepi32_epi16, cvtusepi32_epi8, cvtusepi64_epi32, cvtusepi64_epi16, cvtusepi64_epi8
cvtpd_ps, cvt_roundpd_ps
cvtpd_pslo, cvtpslo_pd
cvt_roundpd_epi32, cvt_roundpd_epu32
cvt_roundepi32_ps, cvt_roundepu32_ps
cvt_roundps_ph, cvtps_ph
cvt_roundph_ps, cvtph_ps
reduce_add: epi32,ps,pd
reduce_mul: epi32,ps,pd
reduce_max: epi32,epu32,ps,pd
reduce_min: epi32,epu32,ps,pd
reduce_and: epi32
loadu: epi32,epi64,si512; storeu: epi32,epi64,si512
load: epi32,epi64,si512,ps,pd; store: epi32,epi64,si512,ps,pd
extractf32x4_ps, extractf64x4_pd, extracti32x4_epi32, extracti64x4_epi64
reduce_or: epi32
compress: epi32,epi64,ps,pd
expand: epi32,epi64,ps,pd
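As a scalar sketch of the semantics behind one of the listed families (an illustrative model, not the stdarch implementation), reduce_add over epi32 lanes is simply a wrapping sum of all 16 elements:

```rust
// Illustrative scalar model of _mm512_reduce_add_epi32's semantics
// (hypothetical helper): sum all 16 i32 lanes with wrapping arithmetic.
fn reduce_add_epi32_model(v: [i32; 16]) -> i32 {
    v.iter().fold(0i32, |acc, &x| acc.wrapping_add(x))
}

fn main() {
    let v: [i32; 16] = core::array::from_fn(|i| i as i32); // 0..=15
    assert_eq!(reduce_add_epi32_model(v), 120);
    println!("ok");
}
```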

@rust-highfive
r? @Amanieu

(rust_highfive has picked a reviewer for you, use r? to override)

#[target_feature(enable = "avx512f")]
pub unsafe fn _mm512_setr4_epi64(d: i64, c: i64, b: i64, a: i64) -> __m512i {
    _mm512_set_epi64(a, b, c, d, a, b, c, d)
}
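For reference, the lane layout this definition produces can be modelled in plain scalar Rust (an illustrative sketch treating __m512i as eight i64 lanes in memory order; `setr4_epi64_model` is a hypothetical name, not a stdarch function):

```rust
// Scalar model of _mm512_setr4_epi64: repeat the four arguments twice in
// memory (lowest-lane-first) order. _mm512_set_epi64 takes its arguments
// highest lane first, which is why the wrapper above reverses them.
fn setr4_epi64_model(d: i64, c: i64, b: i64, a: i64) -> [i64; 8] {
    [d, c, b, a, d, c, b, a]
}

fn main() {
    assert_eq!(setr4_epi64_model(1, 2, 3, 4), [1, 2, 3, 4, 1, 2, 3, 4]);
    println!("ok");
}
```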
Member

Why are these only available on x86_64?

Contributor Author

Why are these only available on x86_64?
I put them in x86_64 because _mm512_set_epi64 is also defined in the x86_64 module.

Member

I don't think any of the SSE/AVX intrinsics are specific to x86_64, they should all work on x86.

Contributor Author

When I move set4 to x86, it shows:

---- verify_all_signatures stdout ----
failed to verify _mm512_set4_epi64
  • intrinsic _mm512_set4_epi64 uses a 64-bit bare type but may be available on 32-bit platforms
failed to verify _mm512_setr4_epi64
  • intrinsic _mm512_setr4_epi64 uses a 64-bit bare type but may be available on 32-bit platforms
thread 'verify_all_signatures' panicked at 'assertion failed: all_valid', crates/stdarch-verify/tests/x86-intel.rs:362:5

Member

You can add them to the whitelist in crates/stdarch-verify/tests/x86-intel.rs.

Member

All the set intrinsics should be added to the whitelist and moved to x86.

Contributor Author

All the set intrinsics should be added to the whitelist and moved to x86.

How about _mm512_set_epi64 and _mm512_setr_epi64?

Member

Those too.

_mm512_setzero_si512().as_i32x16(),
k,
));
ptr::write_unaligned(mem_addr as *mut __m512i, r);
Member

This is incorrect: the intrinsic is only supposed to store the elements selected by the mask, not the full 512 bits. This is done with a special LLVM intrinsic: https://godbolt.org/z/vWf7KE

Contributor Author

This is incorrect: the intrinsic is only supposed to store the elements selected by the mask, not the full 512 bits. This is done with a special LLVM intrinsic: https://godbolt.org/z/vWf7KE

Yes, I tried to implement it using llvm.masked.compressstore.v16f32, but every ordering of the three parameters fails: the link step reports an error about bad parameters (I tried all 3 × 2 × 1 = 6 orders).
Also, compress_ps and compressstoreu_ps use the same instruction: vcompressps.

Member

The memory and register forms of the instruction behave differently: https://www.felixcloutier.com/x86/vcompressps

Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z must be zero.

Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper bits are zeroed.
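The memory-form behaviour described above can be sketched in plain scalar Rust (an illustrative model with hypothetical names, not the stdarch code): only the mask-selected lanes are written, packed contiguously, so the number of bytes stored depends on the mask's popcount rather than being a full 512-bit store.

```rust
// Scalar model of the memory form of vcompressps
// (_mm512_mask_compressstoreu_ps): write only the lanes selected by the
// mask, contiguously starting at dst[0]; elements of dst beyond
// popcount(mask) are left untouched.
fn compressstoreu_ps_model(dst: &mut [f32], src: &[f32; 16], mask: u16) {
    let mut m = 0;
    for j in 0..16 {
        if (mask >> j) & 1 == 1 {
            dst[m] = src[j];
            m += 1;
        }
    }
}

fn main() {
    let src: [f32; 16] = core::array::from_fn(|i| i as f32);
    let mut dst = [-1.0f32; 16];
    compressstoreu_ps_model(&mut dst, &src, 0b0000_0000_0000_1010);
    // lanes 1 and 3 are packed to the front; the rest of dst is unmodified
    assert_eq!(&dst[..3], &[1.0, 3.0, -1.0]);
    println!("ok");
}
```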

Contributor Author

I tried to follow the LLVM document, "declare void @llvm.masked.compressstore.v16f32(<16 x float>, float*, <16 x i1>)", and implement it as:

#[link_name = "llvm.masked.compressstore.v16f32"]
fn vcompresspss(a: f32x16, p: *mut f32, mask: i16);

However, when I compile it, it shows:

"Intrinsic has incorrect argument type
void (<16 x float>, float*, i16)* @llvm.masked.compressstore.v16f32"

Any clue how to solve this?

Member

I think this will require special support in the compiler. We don't currently have a way to express vectors of i1.

Contributor Author

I think this will require special support in the compiler. We don't currently have a way to express vectors of i1.
True, I think I will drop these functions for now.

k: __mmask16,
mem_addr: *const i32,
) -> __m512i {
let load = ptr::read_unaligned(mem_addr as *const __m512i);
Member

Same here.

Member

You also need to remove expandloadu since it reads more than it should, which could cause out-of-bounds reads.

Contributor Author

expandloadu copies the value from src when the mask bit is false:
m := 0
FOR j := 0 to 15
    i := j*32
    IF k[j]
        dst[i+31:i] := MEM[mem_addr+m+31:mem_addr+m]
        m := m + 32
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR
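That pseudocode translates directly to scalar Rust (a sketch over a plain slice with a hypothetical name; note it reads only popcount(mask) elements from memory, which is why an unconditional full-width read is out of bounds):

```rust
// Scalar model of _mm512_mask_expandloadu_epi32: for each set mask bit,
// read the next contiguous element from mem; otherwise copy the lane
// from src.
fn mask_expandloadu_epi32_model(src: [i32; 16], k: u16, mem: &[i32]) -> [i32; 16] {
    let mut dst = src;
    let mut m = 0;
    for j in 0..16 {
        if (k >> j) & 1 == 1 {
            dst[j] = mem[m];
            m += 1;
        }
    }
    dst
}

fn main() {
    // Only two mask bits set, so only two elements are read from memory.
    let dst = mask_expandloadu_epi32_model([0; 16], 0b0000_0000_0001_0001, &[7, 8]);
    assert_eq!(dst[0], 7);
    assert_eq!(dst[4], 8);
    assert_eq!(dst[1], 0); // unselected lane keeps the src value
    println!("ok");
}
```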

Contributor Author

Yes, the underlying intrinsic is @llvm.masked.expandload.v16f32(float* %{{.}}, <16 x i1> %{{.}}, <16 x float> %{{.*}}).
I will remove it.

Contributor Author

Btw, I implemented expandloadu_ps as loadu_ps + expand_ps because the latencies are similar: 11 vs. 4 + 8 = 12. It seems the real expandloadu_ps is a little faster.

@Amanieu Amanieu merged commit 7b92756 into rust-lang:master Oct 10, 2020
@minybot minybot deleted the avx512f branch October 11, 2020 00:09