[X86] Failure to use PHADDD on Intel CPUs on the second to last step of a v8i32 pairwise reduction #39267

Closed
@topperc

Description

Bugzilla Link 39920
Resolution FIXED
Resolved on May 09, 2019 11:14
Version trunk
OS Windows NT
Blocks #35132
CC @adibiagio,@topperc,@RKSimon,@rotateright
Fixed by commit(s) r360360

Extended Description

I think we should use PHADDD for the first reduction step of this function on Intel CPUs:

define fastcc i32 @pairwise_reduction4i32(<4 x i32> %rdx, i32 %f1) {
  %rdx.shuf.1.0 = shufflevector <4 x i32> %rdx, <4 x i32> undef, <4 x i32> <i32 0, i32 2, i32 undef, i32 undef>
  %rdx.shuf.1.1 = shufflevector <4 x i32> %rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 undef, i32 undef>
  %bin.rdx8 = add <4 x i32> %rdx.shuf.1.0, %rdx.shuf.1.1
  %rdx.shuf.2.0 = shufflevector <4 x i32> %bin.rdx8, <4 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>
  %rdx.shuf.2.1 = shufflevector <4 x i32> %bin.rdx8, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx9 = add <4 x i32> %rdx.shuf.2.0, %rdx.shuf.2.1
  %r = extractelement <4 x i32> %bin.rdx9, i32 0
  ret i32 %r
}

This is the assembly we get with SSE4.1:

    pshufd  $232, %xmm0, %xmm1      # xmm1 = xmm0[0,2,2,3]
    pshufd  $237, %xmm0, %xmm0      # xmm0 = xmm0[1,3,2,3]
    paddd   %xmm1, %xmm0
    pshufd  $229, %xmm0, %xmm1      # xmm1 = xmm0[1,1,2,3]
    paddd   %xmm0, %xmm1
    movd    %xmm1, %eax
    retq

PHADDD uses 2 shuffles internally on Intel CPUs, but as you can see, the assembly we emitted also uses 2 shuffles for that step (the two PSHUFDs feeding the first PADDD). So I don't think we saved anything by avoiding PHADDD.
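
To make the equivalence concrete, here is a minimal scalar sketch in C that models the instructions lane by lane. It is not the compiler's output and does not use real intrinsics; the type `v4i32` and the helpers `pshufd_model`, `paddd_model`, `phaddd_model`, `reduce_with_shuffles`, and `reduce_with_phaddd` are hypothetical names introduced only for illustration. The `reduce_with_shuffles` function mirrors the assembly above, while `reduce_with_phaddd` assumes the first step is replaced by a single PHADDD with both operands equal to the input.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4 x i32 vector type used only by this model. */
typedef struct { int32_t lane[4]; } v4i32;

/* 32-bit wrapping add, matching PADDD/PHADDD lane arithmetic. */
static int32_t wrap_add(int32_t a, int32_t b) {
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Model of PSHUFD: out[i] = src[idx_i] for the four selected lanes. */
static v4i32 pshufd_model(v4i32 src, int i0, int i1, int i2, int i3) {
    const int idx[4] = { i0, i1, i2, i3 };
    v4i32 out;
    for (int i = 0; i < 4; ++i) {
        assert(idx[i] >= 0 && idx[i] < 4);
        out.lane[i] = src.lane[idx[i]];
    }
    return out;
}

/* Model of PADDD: lane-wise add. */
static v4i32 paddd_model(v4i32 a, v4i32 b) {
    v4i32 out;
    for (int i = 0; i < 4; ++i)
        out.lane[i] = wrap_add(a.lane[i], b.lane[i]);
    return out;
}

/* Model of PHADDD dst, src:
   { dst0+dst1, dst2+dst3, src0+src1, src2+src3 }. */
static v4i32 phaddd_model(v4i32 dst, v4i32 src) {
    v4i32 out;
    out.lane[0] = wrap_add(dst.lane[0], dst.lane[1]);
    out.lane[1] = wrap_add(dst.lane[2], dst.lane[3]);
    out.lane[2] = wrap_add(src.lane[0], src.lane[1]);
    out.lane[3] = wrap_add(src.lane[2], src.lane[3]);
    return out;
}

/* Mirrors the emitted sequence: 2 x PSHUFD + PADDD, then PSHUFD + PADDD. */
static int32_t reduce_with_shuffles(v4i32 x) {
    v4i32 even = pshufd_model(x, 0, 2, 2, 3);   /* pshufd $232 */
    v4i32 odd  = pshufd_model(x, 1, 3, 2, 3);   /* pshufd $237 */
    v4i32 sum  = paddd_model(even, odd);
    v4i32 hi   = pshufd_model(sum, 1, 1, 2, 3); /* pshufd $229 */
    v4i32 res  = paddd_model(sum, hi);
    return res.lane[0];                         /* movd */
}

/* Assumed alternative: one PHADDD replaces the first three instructions. */
static int32_t reduce_with_phaddd(v4i32 x) {
    v4i32 sum = phaddd_model(x, x);
    v4i32 hi  = pshufd_model(sum, 1, 1, 2, 3);
    v4i32 res = paddd_model(sum, hi);
    return res.lane[0];
}
```

In both functions, lanes 0 and 1 after the first step hold `x0+x1` and `x2+x3`, so the second step and the extracted result are identical. The only difference is whether the two shuffles are explicit instructions or internal to PHADDD.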
