[X86][AVX] Prefer per-element vector shifts for known splats

|  |  |
| --- | --- |
| Bugzilla Link | [40077](https://llvm.org/bz40077) |
| Version | trunk |
| OS | Windows NT |
| CC | @adibiagio,@topperc,@RKSimon,@rotateright |

## Extended Description 
As detailed on https://reviews.llvm.org/rL340813, many recent machines have better throughput for the 'per-element' variable vector shifts than for the old-style 'scalar-count-in-xmm' variable shifts, so we should prefer the per-element form when we know that the shift amount is already splatted:

Probably the wrong place to report this, but I looked at some other sequences:
```
; AVX-LABEL: splatvar_shift_v4i32:
; AVX:       # %bb.0:
; AVX-NEXT:    vpmovzxdq {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero   # 1 uop / 1c latency
; AVX-NEXT:    vpsrad %xmm1, %xmm0, %xmm0                # 2 uops / 2c latency on Intel since Haswell at least
; AVX-NEXT:    retq
```
On Skylake, variable shifts (vpsravd) are a single uop, but count-in-xmm shifts are 2 uops. They are probably implemented internally as a broadcast that feeds the SIMD variable-shift hardware.

The above is 3 uops / 3c latency on SKL.

So for AVX2 Skylake (but not Broadwell or earlier) we want this 2 uop / 2c latency implementation:
```
vpbroadcastd %xmm1, %xmm1         # xmm1 = xmm1[0],xmm1[0],xmm1[0],xmm1[0]   # 1 uop / 1c latency
vpsravd      %xmm1, %xmm0, %xmm0                          # 1 uop / 1c latency on SKL.   3 / 3 on BDW and earlier.
```
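To make the equivalence of the two sequences above concrete, here is a minimal scalar model in Python. It is only a sketch: the function names mirror the instructions but are otherwise hypothetical, registers are modelled as lists of four 32-bit lanes, and nothing here reflects how LLVM implements the lowering.

```python
def to_signed32(x):
    """Interpret the low 32 bits of x as a signed 32-bit integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def vpmovzxdq(reg):
    """Zero-extend lanes 0 and 1 to 64 bits each (lane 1 becomes zero)."""
    return [reg[0], 0, reg[1], 0]

def vpbroadcastd(reg):
    """Copy lane 0 into every lane."""
    return [reg[0]] * 4

def vpsrad_count_in_xmm(values, count_reg):
    """One shared count, read from the low 64 bits (lanes 0 and 1)."""
    count = (count_reg[0] & 0xFFFFFFFF) | ((count_reg[1] & 0xFFFFFFFF) << 32)
    count = min(count, 31)  # counts above 31 fill each lane with its sign bit
    return [to_signed32(v) >> count for v in values]

def vpsravd_per_element(values, counts):
    """Each lane is shifted by the 32-bit count in the matching lane."""
    return [to_signed32(v) >> min(c & 0xFFFFFFFF, 31)
            for v, c in zip(values, counts)]

values = [-8, 7, 0x7FFFFFFF, -1]
splat = [3, 3, 3, 3]  # shift amount already splatted across the lanes

old_style = vpsrad_count_in_xmm(values, vpmovzxdq(splat))
new_style = vpsravd_per_element(values, vpbroadcastd(splat))
print(old_style == new_style)
```

The model also shows why the old sequence needs the `vpmovzxdq`: without it, lane 1 of a splatted register lands in the upper half of the 64-bit count, so the count exceeds 31 and every lane collapses to its sign.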
Same for SKX AVX512 with vpsravw and so on. There are some test cases where we use the same shift-count register multiple times, and it would be significantly better to broadcast it and use variable-shifts instead of count-from-the-low-element shifts.
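The benefit of reusing one shift-count register can be sketched with simple arithmetic. The snippet below only adds up the Skylake uop counts quoted above; it assumes the zero-extend or broadcast is done once and then shared by every shift, and the helper names are hypothetical.

```python
# Skylake uop counts as quoted in this report.
ZEXT_UOPS = 1            # vpmovzxdq to prepare the 64-bit count
COUNT_IN_XMM_UOPS = 2    # vpsrad with the count in an xmm register
BROADCAST_UOPS = 1       # vpbroadcastd
VARIABLE_SHIFT_UOPS = 1  # vpsravd

def count_in_xmm_total(num_shifts):
    """uops for one zero-extend followed by num_shifts count-in-xmm shifts."""
    return ZEXT_UOPS + num_shifts * COUNT_IN_XMM_UOPS

def broadcast_variable_total(num_shifts):
    """uops for one broadcast followed by num_shifts variable shifts."""
    return BROADCAST_UOPS + num_shifts * VARIABLE_SHIFT_UOPS

for n in (1, 2, 4):
    print(n, count_in_xmm_total(n), broadcast_variable_total(n))
```

With a single shift the totals are 3 and 2 uops, matching the figures above; the gap grows by one uop for each additional shift that shares the count.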

But on Ryzen, and on Broadwell and earlier, variable shifts cost more. Interestingly, on Ryzen they run on a different execution port from normal count-in-xmm shifts: still a single uop (per lane), but with 3c latency and not fully pipelined. Ryzen's count-in-xmm shifts are as efficient as immediate shifts, unlike Intel, where count-in-xmm shifts are always 2 uops (port 5 + shift port).

KNL is horrible for pslld xmm,xmm (13c throughput/latency), but it has the same throughput as immediate for variable shifts like VPSRLVD z,z,z. I don't totally trust Agner's numbers for x,x shifts; maybe he only used the non-VEX encoding?

Anyway, for AVX512 we should prefer broadcast + variable-shift instead of pmovzxb/wq / regular shift, because it's better on SKX and at least as good on KNL. This includes 16-bit elements for AVX512BW, unlike AVX2.

(With AVX1, we don't have variable shifts so the earlier implementation with vpsrad is our best option.)
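The preferences described above could be summarised as a small decision function. This is a sketch of the proposal only, not LLVM code: the function name, the parameters, and in particular the `fast_variable_shift` tuning flag are hypothetical, and the model ignores per-opcode gaps such as 64-bit arithmetic right shifts, which need AVX512.

```python
def prefer_broadcast_variable_shift(element_bits, has_avx2=False,
                                    has_avx512f=False, has_avx512bw=False,
                                    fast_variable_shift=False):
    """Return True if a splatted shift should use broadcast + variable shift."""
    if element_bits == 16:
        # 16-bit variable shifts (vpsravw etc.) only exist with AVX512BW.
        return has_avx512bw
    if element_bits in (32, 64):
        if has_avx512f:
            # Better on SKX and at least as good on KNL.
            return True
        # AVX2 only: worthwhile on Skylake, not on Broadwell/earlier or Ryzen.
        return has_avx2 and fast_variable_shift
    # No variable shifts for other element sizes; AVX1 also ends up here
    # or in the branch above with has_avx2=False.
    return False

# Skylake client (AVX2, fast variable shifts) vs. Broadwell (AVX2, slow).
print(prefer_broadcast_variable_shift(32, has_avx2=True, fast_variable_shift=True))
print(prefer_broadcast_variable_shift(32, has_avx2=True))
```

Keeping the AVX2 case behind a tuning flag rather than a feature check reflects the point that the instructions exist on Broadwell and Ryzen but are slower there.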

[X86][AVX] Prefer per-element vector shifts for known splats #39424