[MCP] Move dependencies if they block copy propagation #105562

Open
wants to merge 15 commits into main

Conversation

spaits
Contributor

@spaits spaits commented Aug 21, 2024

As discussed in a previous PR (#98087), here is an implementation that uses ScheduleDAG in MCP.

This PR is not fully finished yet, and I have not done any precise benchmarking.

So far I have only measured how long generating some regression tests takes before and after my patch. I have not seen any increase on my machine, but this is not a precise way of measuring.

I have not updated all the tests yet; I have only partially checked the RISCV, AArch64, ARM, X86 and Thumb2 tests.

Could you please take a quick look at this PR and give some feedback?
Is this direction good, and should we continue with it? (If so, I will do some compile-time benchmarking and update the tests.)
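
For context, here is a minimal illustrative sketch of the situation the patch targets. It is pseudo-MIR with made-up registers and opcodes, not code taken from the patch: backward copy propagation wants to fold a COPY into the instruction that defines its source, and this change moves an intervening instruction out of the way when the ScheduleDAG shows it has no dependency on that definition.

    ; Before: the read of $x1 between the def of $x0 and the COPY is an
    ; anti-dependency, so backward propagation previously gave up here.
    $x0 = LDRXui $sp, 0
    $x2 = ADDXrr $x1, $x1        ; reads $x1 but has no dependency on $x0
    $x1 = COPY killed $x0

    ; After: the blocking instruction is moved above the def (its dependencies
    ; allow it), and the COPY can be folded into the defining instruction.
    $x2 = ADDXrr $x1, $x1
    $x1 = LDRXui $sp, 0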

@llvmbot
Member

llvmbot commented Aug 21, 2024

@llvm/pr-subscribers-backend-aarch64
@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-llvm-regalloc

Author: Gábor Spaits (spaits)

Changes

As discussed in a previous PR (#98087), here is an implementation that uses ScheduleDAG in MCP.

This PR is not fully finished yet, and I have not done any precise benchmarking.

So far I have only measured how long generating some regression tests takes before and after my patch. I have not seen any increase on my machine, but this is not a precise way of measuring.

I have not updated all the tests yet.

Could you please take a quick look at this PR and give some feedback?
Is this direction good, and should we continue with it? (If so, I will do some compile-time benchmarking and update the tests.)


Patch is 1.38 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/105562.diff

241 Files Affected:

  • (modified) llvm/lib/CodeGen/MachineCopyPropagation.cpp (+255-23)
  • (modified) llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll (+5-5)
  • (modified) llvm/test/CodeGen/AArch64/GlobalISel/arm64-pcsections.ll (+98-98)
  • (modified) llvm/test/CodeGen/AArch64/GlobalISel/merge-stores-truncating.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/aarch64-mulv.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/aarch64-wide-mul.ll (+4-6)
  • (modified) llvm/test/CodeGen/AArch64/addp-shuffle.ll (+2-4)
  • (added) llvm/test/CodeGen/AArch64/anti-dependencies-mcp.mir (+201)
  • (modified) llvm/test/CodeGen/AArch64/arm64-non-pow2-ldst.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/arm64-subvector-extend.ll (+30-72)
  • (modified) llvm/test/CodeGen/AArch64/arm64-windows-calls.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/avoid-zero-copy.mir (+3)
  • (modified) llvm/test/CodeGen/AArch64/cgp-usubo.ll (+5-10)
  • (modified) llvm/test/CodeGen/AArch64/cmpxchg-idioms.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/duplane-index-patfrags.ll (+4-8)
  • (modified) llvm/test/CodeGen/AArch64/fcmp.ll (+9-9)
  • (modified) llvm/test/CodeGen/AArch64/fexplog.ll (+180-330)
  • (modified) llvm/test/CodeGen/AArch64/fpext.ll (+14-32)
  • (modified) llvm/test/CodeGen/AArch64/fpow.ll (+20-36)
  • (modified) llvm/test/CodeGen/AArch64/fpowi.ll (+36-66)
  • (modified) llvm/test/CodeGen/AArch64/frem.ll (+20-36)
  • (modified) llvm/test/CodeGen/AArch64/fsincos.ll (+72-132)
  • (modified) llvm/test/CodeGen/AArch64/ldrpre-ldr-merge.mir (+76-76)
  • (modified) llvm/test/CodeGen/AArch64/llvm.exp10.ll (+6-12)
  • (modified) llvm/test/CodeGen/AArch64/load.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/lr-reserved-for-ra-live-in.ll (+2-2)
  • (modified) llvm/test/CodeGen/AArch64/machine-cp-sub-reg.mir (+3-3)
  • (modified) llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/named-vector-shuffles-neon.ll (+2-4)
  • (modified) llvm/test/CodeGen/AArch64/neon-extadd.ll (+18-36)
  • (modified) llvm/test/CodeGen/AArch64/neon-extmul.ll (+4-6)
  • (modified) llvm/test/CodeGen/AArch64/neon-perm.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/sext.ll (+27-54)
  • (modified) llvm/test/CodeGen/AArch64/shufflevector.ll (+5-12)
  • (modified) llvm/test/CodeGen/AArch64/spillfill-sve.mir (+6-106)
  • (modified) llvm/test/CodeGen/AArch64/streaming-compatible-memory-ops.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/sve-sext-zext.ll (+9-18)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-trunc.ll (+2-2)
  • (modified) llvm/test/CodeGen/AArch64/sve-vector-deinterleave.ll (+5-10)
  • (modified) llvm/test/CodeGen/AArch64/sve-vector-interleave.ll (+2-4)
  • (modified) llvm/test/CodeGen/AArch64/vec_umulo.ll (+5-9)
  • (modified) llvm/test/CodeGen/AArch64/vecreduce-add.ll (+24-25)
  • (modified) llvm/test/CodeGen/AArch64/vselect-ext.ll (+7-10)
  • (modified) llvm/test/CodeGen/AArch64/zext-to-tbl.ll (+9-9)
  • (modified) llvm/test/CodeGen/AArch64/zext.ll (+27-54)
  • (modified) llvm/test/CodeGen/ARM/addsubo-legalization.ll (+2-4)
  • (modified) llvm/test/CodeGen/ARM/fpclamptosat_vec.ll (+19-24)
  • (modified) llvm/test/CodeGen/ARM/funnel-shift.ll (+2-3)
  • (modified) llvm/test/CodeGen/ARM/llvm.exp10.ll (+3-9)
  • (modified) llvm/test/CodeGen/ARM/load-combine-big-endian.ll (+3-9)
  • (modified) llvm/test/CodeGen/ARM/load-combine.ll (+2-6)
  • (modified) llvm/test/CodeGen/ARM/sub-cmp-peephole.ll (+6-14)
  • (modified) llvm/test/CodeGen/ARM/vecreduce-fadd-legalization-strict.ll (+16-18)
  • (modified) llvm/test/CodeGen/ARM/vlddup.ll (+10-20)
  • (modified) llvm/test/CodeGen/ARM/vldlane.ll (+9-22)
  • (modified) llvm/test/CodeGen/RISCV/alu64.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/branch-on-zero.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/condops.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/double-fcmp-strict.ll (+12-24)
  • (modified) llvm/test/CodeGen/RISCV/float-fcmp-strict.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/half-fcmp-strict.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/llvm.frexp.ll (+12-18)
  • (modified) llvm/test/CodeGen/RISCV/machine-cp.mir (+5-4)
  • (modified) llvm/test/CodeGen/RISCV/neg-abs.ll (+4-6)
  • (modified) llvm/test/CodeGen/RISCV/nontemporal.ll (+50-75)
  • (modified) llvm/test/CodeGen/RISCV/overflow-intrinsics.ll (+5-8)
  • (modified) llvm/test/CodeGen/RISCV/rv32zbb-zbkb.ll (+3-5)
  • (added) llvm/test/CodeGen/RISCV/rv64-legal-i32/xaluo.ll (+2603)
  • (modified) llvm/test/CodeGen/RISCV/rv64-statepoint-call-lowering.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/constant-folding-crash.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-deinterleave-load.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fmaximum-vp.ll (+17-25)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fmaximum.ll (+20-32)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fminimum-vp.ll (+17-25)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fminimum.ll (+20-32)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-store-fp.ll (+4-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-store-int.ll (+4-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll (+12-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-int-vp.ll (+12-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fmaximum-sdnode.ll (+24-34)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fmaximum-vp.ll (+22-30)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fminimum-sdnode.ll (+24-34)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fminimum-vp.ll (+22-30)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fpclamptosat_vec.ll (+3-6)
  • (modified) llvm/test/CodeGen/RISCV/rvv/mask-reg-alloc.mir (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/no-reserved-frame.ll (+3-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-deinterleave.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmfeq.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmfge.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmfgt.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmfle.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmflt.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmfne.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmseq.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsge.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsgeu.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsgt.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsgtu.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsle.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsleu.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmslt.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsltu.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsne.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vselect-fp.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vsetvli-regression.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vxrm.mir (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/shifts.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/srem-vector-lkk.ll (+11-20)
  • (modified) llvm/test/CodeGen/RISCV/tail-calls.ll (+6-8)
  • (modified) llvm/test/CodeGen/RISCV/unaligned-load-store.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/urem-vector-lkk.ll (+13-24)
  • (modified) llvm/test/CodeGen/RISCV/wide-mem.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-by-byte-multiple-legalization.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-legalization.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/xaluo.ll (+27-54)
  • (modified) llvm/test/CodeGen/RISCV/xtheadmemidx.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/zcmp-cm-popretz.mir (+8-8)
  • (modified) llvm/test/CodeGen/Thumb/smul_fix_sat.ll (+2-4)
  • (modified) llvm/test/CodeGen/Thumb/umulo-128-legalisation-lowering.ll (+2-2)
  • (modified) llvm/test/CodeGen/Thumb2/mve-div-expand.ll (+6-11)
  • (modified) llvm/test/CodeGen/Thumb2/mve-fmath.ll (+29-66)
  • (modified) llvm/test/CodeGen/Thumb2/mve-fpclamptosat_vec.ll (+16-16)
  • (modified) llvm/test/CodeGen/Thumb2/mve-fptosi-sat-vector.ll (+19-21)
  • (modified) llvm/test/CodeGen/Thumb2/mve-fptoui-sat-vector.ll (+20-22)
  • (modified) llvm/test/CodeGen/Thumb2/mve-frint.ll (+6-18)
  • (modified) llvm/test/CodeGen/Thumb2/mve-laneinterleaving.ll (+3-3)
  • (modified) llvm/test/CodeGen/Thumb2/mve-sext-masked-load.ll (+3-6)
  • (modified) llvm/test/CodeGen/Thumb2/mve-shuffle.ll (+12-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-shufflemov.ll (+25-25)
  • (modified) llvm/test/CodeGen/Thumb2/mve-simple-arith.ll (+6-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vabdus.ll (+3-3)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vcvt.ll (+2-6)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vcvt16.ll (+2-2)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vld4.ll (+1-3)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vmovn.ll (+2-2)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vst4.ll (+2-2)
  • (modified) llvm/test/CodeGen/Thumb2/mve-zext-masked-load.ll (+3-7)
  • (modified) llvm/test/CodeGen/X86/apx/mul-i1024.ll (+13-6)
  • (modified) llvm/test/CodeGen/X86/atomic-unordered.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/avx10_2_512ni-intrinsics.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/avx10_2ni-intrinsics.ll (+4-8)
  • (modified) llvm/test/CodeGen/X86/avx512-calling-conv.ll (+19-19)
  • (modified) llvm/test/CodeGen/X86/avx512-gfni-intrinsics.ll (+24-36)
  • (modified) llvm/test/CodeGen/X86/avx512-insert-extract.ll (+7-7)
  • (modified) llvm/test/CodeGen/X86/avx512-intrinsics.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/avx512-mask-op.ll (+5-5)
  • (modified) llvm/test/CodeGen/X86/avx512bw-intrinsics-upgrade.ll (+8-12)
  • (modified) llvm/test/CodeGen/X86/avx512bw-intrinsics.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/avx512bwvl-intrinsics-upgrade.ll (+4-8)
  • (modified) llvm/test/CodeGen/X86/avx512bwvl-intrinsics.ll (+10-20)
  • (modified) llvm/test/CodeGen/X86/avx512vbmi2vl-intrinsics-upgrade.ll (+28-56)
  • (modified) llvm/test/CodeGen/X86/avx512vbmi2vl-intrinsics.ll (+4-8)
  • (modified) llvm/test/CodeGen/X86/avx512vl-intrinsics-upgrade.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/div-rem-pair-recomposition-signed.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/div-rem-pair-recomposition-unsigned.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/element-wise-atomic-memory-intrinsics.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/expand-vp-cast-intrinsics.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/extract-bits.ll (+13-20)
  • (modified) llvm/test/CodeGen/X86/icmp-abs-C-vec.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/is_fpclass.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/ldexp.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/legalize-shl-vec.ll (+8-8)
  • (modified) llvm/test/CodeGen/X86/matrix-multiply.ll (+29-31)
  • (modified) llvm/test/CodeGen/X86/mul-i1024.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/mul-i256.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/mul-i512.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/pmul.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/pmulh.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/pointer-vector.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/pr11334.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/pr34177.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/pr61964.ll (+4-6)
  • (modified) llvm/test/CodeGen/X86/shift-i128.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/sibcall.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/smul_fix.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/smul_fix_sat.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/smulo-128-legalisation-lowering.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/subvectorwise-store-of-vector-splat.ll (+15-15)
  • (modified) llvm/test/CodeGen/X86/umul-with-overflow.ll (+3-4)
  • (modified) llvm/test/CodeGen/X86/umul_fix.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/umul_fix_sat.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/umulo-128-legalisation-lowering.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/vec_int_to_fp.ll (+5-10)
  • (modified) llvm/test/CodeGen/X86/vec_saddo.ll (+5-9)
  • (modified) llvm/test/CodeGen/X86/vec_ssubo.ll (+2-3)
  • (modified) llvm/test/CodeGen/X86/vec_umulo.ll (+11-18)
  • (modified) llvm/test/CodeGen/X86/vector-interleave.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-2.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-3.ll (+8-8)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-4.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-5.ll (+13-13)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-6.ll (+10-10)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-7.ll (+23-26)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-8.ll (+32-32)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-3.ll (+22-22)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-4.ll (+20-20)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-5.ll (+65-68)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-6.ll (+24-27)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-7.ll (+30-30)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-8.ll (+85-85)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i64-stride-4.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i64-stride-5.ll (+20-20)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i64-stride-6.ll (+8-8)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i64-stride-7.ll (+80-84)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i64-stride-8.ll (+232-232)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i8-stride-3.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i8-stride-4.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i8-stride-5.ll (+5-5)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i8-stride-6.ll (+11-11)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i8-stride-7.ll (+41-41)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i8-stride-8.ll (+48-50)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-3.ll (+32-32)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-4.ll (+12-12)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-5.ll (+15-15)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-6.ll (+24-24)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-7.ll (+39-42)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-2.ll (+12-12)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-3.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-5.ll (+11-11)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-6.ll (+56-60)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-7.ll (+57-58)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-8.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-3.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-4.ll (+24-24)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-5.ll (+5-5)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-7.ll (+140-142)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-8.ll (+48-48)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-3.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-5.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-6.ll (+19-19)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-7.ll (+17-20)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-8.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/vector-intrinsics.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/vector-sext.ll (+5-10)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-combining-avx.ll (+10-11)
  • (modified) llvm/test/CodeGen/X86/vector-zext.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/wide-scalar-shift-legalization.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/x86-interleaved-access.ll (+5-5)
  • (modified) llvm/test/CodeGen/X86/xmulo.ll (+46-88)
diff --git a/llvm/lib/CodeGen/MachineCopyPropagation.cpp b/llvm/lib/CodeGen/MachineCopyPropagation.cpp
index b34e0939d1c7c6..493d7cd7d8c920 100644
--- a/llvm/lib/CodeGen/MachineCopyPropagation.cpp
+++ b/llvm/lib/CodeGen/MachineCopyPropagation.cpp
@@ -48,19 +48,27 @@
 //
 //===----------------------------------------------------------------------===//
 
+#include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DepthFirstIterator.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SetVector.h"
 #include "llvm/ADT/SmallSet.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Statistic.h"
 #include "llvm/ADT/iterator_range.h"
+#include "llvm/Analysis/AliasAnalysis.h"
+#include "llvm/CodeGen/LiveIntervals.h"
 #include "llvm/CodeGen/MachineBasicBlock.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/CodeGen/MachineInstr.h"
+#include "llvm/CodeGen/MachineLoopInfo.h"
 #include "llvm/CodeGen/MachineOperand.h"
 #include "llvm/CodeGen/MachineRegisterInfo.h"
+#include "llvm/CodeGen/ScheduleDAG.h"
+#include "llvm/CodeGen/ScheduleDAGInstrs.h"
+#include "llvm/CodeGen/SelectionDAGNodes.h"
 #include "llvm/CodeGen/TargetInstrInfo.h"
 #include "llvm/CodeGen/TargetRegisterInfo.h"
 #include "llvm/CodeGen/TargetSubtargetInfo.h"
@@ -70,9 +78,15 @@
 #include "llvm/Pass.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/DebugCounter.h"
+#include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/raw_ostream.h"
+#include <algorithm>
 #include <cassert>
 #include <iterator>
+#include <optional>
+#include <queue>
+#include <utility>
+#include <variant>
 
 using namespace llvm;
 
@@ -92,6 +106,113 @@ static cl::opt<cl::boolOrDefault>
     EnableSpillageCopyElimination("enable-spill-copy-elim", cl::Hidden);
 
 namespace {
+// A ScheduleDAG subclass that is used as a dependency graph.
+class ScheduleDAGMCP : public ScheduleDAGInstrs {
+public:
+  void schedule() override {
+    llvm_unreachable("This schedule dag is only used as a dependency graph for "
+                     "Machine Copy Propagation\n");
+  }
+
+  ScheduleDAGMCP(MachineFunction &MF, const MachineLoopInfo *MLI,
+                 bool RemoveKillFlags = false)
+      : ScheduleDAGInstrs(MF, MLI, RemoveKillFlags) {
+    CanHandleTerminators = true;
+  }
+};
+
+static bool moveInstructionsOutOfTheWayIfWeCan(SUnit *Dst,
+                                               SUnit *Src,
+                                               ScheduleDAGMCP &DG) {
+  MachineInstr *DstInstr = Dst->getInstr();
+  MachineInstr *SrcInstr = Src->getInstr();
+
+  if (DstInstr == nullptr || SrcInstr == nullptr)
+    return false;
+  MachineBasicBlock *MBB = SrcInstr->getParent();
+  assert(DstInstr->getParent() == SrcInstr->getParent() &&
+         "This function only operates on a basic block level.");
+
+  int SectionSize =
+      std::distance(SrcInstr->getIterator(), DstInstr->getIterator());
+
+  // The bit vector representing the instructions in the section.
+  // This vector stores which instructions need to be moved and which do not.
+  BitVector SectionInstr(SectionSize, false);
+
+  // The queue for the breadth first search.
+  std::queue<const SUnit *> Edges;
+
+  // Process the children of a node.
+  // Every node is checked before it is put into the queue.
+  // A node is enqueued if it has no dependency on the source of the copy
+  // (unless it is the destination node, which is a special case indicated by
+  // a flag) and it is located between the source of the copy and the
+  // destination of the copy.
+  auto ProcessSNodeChildren = [SrcInstr, &SectionSize, &SectionInstr](
+                                  std::queue<const SUnit *> &Queue,
+                                  const SUnit *Node, bool IsRoot) -> bool {
+    for (llvm::SDep I : Node->Preds) {
+      SUnit *SU = I.getSUnit();
+      MachineInstr &MI = *(SU->getInstr());
+      if (!IsRoot && &MI == SrcInstr)
+        return false;
+
+      int DestinationFromSource =
+          std::distance(SrcInstr->getIterator(), MI.getIterator());
+
+      if (&MI != SrcInstr && DestinationFromSource > 0 &&
+          DestinationFromSource < SectionSize) {
+        // If an instruction is already marked in the section bit vector, then
+        // that means it has already been processed with all of its
+        // dependencies. We do not need to do anything with it again.
+        if (!SectionInstr[DestinationFromSource]) {
+          SectionInstr[DestinationFromSource] = true;
+          Queue.push(SU);
+        }
+      }
+    }
+    return true;
+  };
+
+  // The BFS happens here.
+  //
+  // We could not use the ADT implementation of BFS here.
+  // In the ADT graph traversals we do not have the chance to select exactly
+  // which children are put into the "nodes to traverse" queue or stack.
+  //
+  // We could not work around this by checking whether a node is needed at the
+  // processing stage either. In some contexts it does matter what the parent
+  // of the instruction was: namely, when we start the traversal with the
+  // source of the copy propagation. This instruction must have the
+  // destination as a dependency. For any other instruction that has the
+  // destination as a dependency, this dependency would mean the end of the
+  // traversal, but in this scenario it must be ignored. Say we cannot control
+  // which nodes to process and we come across the copy source. How do we
+  // know which node has that copy source as its dependency? We can check
+  // which nodes the copy source is a dependency of; this list always contains
+  // the source. To decide whether it is a dependency of another instruction,
+  // we would have to check whether the already-traversed list contains any
+  // instruction that depends on the source, which would introduce extra cost.
+  ProcessSNodeChildren(Edges, Dst, true);
+  while (!Edges.empty()) {
+    const auto *Current = Edges.front();
+    Edges.pop();
+    if (!ProcessSNodeChildren(Edges, Current, false))
+      return false;
+  }
+
+  // If all of the dependencies were deemed valid during the BFS, then we
+  // move them here before the copy source, keeping their relative
+  // order to each other.
+  auto CurrentInst = SrcInstr->getIterator();
+  for (int I = 0; I < SectionSize; I++) {
+    if (SectionInstr[I])
+      MBB->splice(SrcInstr->getIterator(), MBB, CurrentInst->getIterator());
+    ++CurrentInst;
+  }
+  return true;
+}
 
 static std::optional<DestSourcePair> isCopyInstr(const MachineInstr &MI,
                                                  const TargetInstrInfo &TII,
@@ -114,6 +235,7 @@ class CopyTracker {
   };
 
   DenseMap<MCRegUnit, CopyInfo> Copies;
+  DenseMap<MCRegUnit, CopyInfo> InvalidCopies;
 
 public:
   /// Mark all of the given registers and their subregisters as unavailable for
@@ -130,9 +252,14 @@ class CopyTracker {
     }
   }
 
+  int getInvalidCopiesSize() {
+    return InvalidCopies.size();
+  }
+
   /// Remove register from copy maps.
   void invalidateRegister(MCRegister Reg, const TargetRegisterInfo &TRI,
-                          const TargetInstrInfo &TII, bool UseCopyInstr) {
+                          const TargetInstrInfo &TII, bool UseCopyInstr,
+                          bool MayStillBePropagated = false) {
     // Since Reg might be a subreg of some registers, only invalidate Reg is not
     // enough. We have to find the COPY defines Reg or registers defined by Reg
     // and invalidate all of them. Similarly, we must invalidate all of the
@@ -158,8 +285,11 @@ class CopyTracker {
           InvalidateCopy(MI);
       }
     }
-    for (MCRegUnit Unit : RegUnitsToInvalidate)
+    for (MCRegUnit Unit : RegUnitsToInvalidate) {
+      if (Copies.contains(Unit) && MayStillBePropagated)
+        InvalidCopies[Unit] = Copies[Unit];
       Copies.erase(Unit);
+    }
   }
 
   /// Clobber a single register, removing it from the tracker's copy maps.
@@ -252,6 +382,10 @@ class CopyTracker {
     return !Copies.empty();
   }
 
+  bool hasAnyInvalidCopies() {
+    return !InvalidCopies.empty();
+  }
+
   MachineInstr *findCopyForUnit(MCRegUnit RegUnit,
                                 const TargetRegisterInfo &TRI,
                                 bool MustBeAvailable = false) {
@@ -263,6 +397,17 @@ class CopyTracker {
     return CI->second.MI;
   }
 
+  MachineInstr *findInvalidCopyForUnit(MCRegUnit RegUnit,
+                                const TargetRegisterInfo &TRI,
+                                bool MustBeAvailable = false) {
+    auto CI = InvalidCopies.find(RegUnit);
+    if (CI == InvalidCopies.end())
+      return nullptr;
+    if (MustBeAvailable && !CI->second.Avail)
+      return nullptr;
+    return CI->second.MI;
+  }
+
   MachineInstr *findCopyDefViaUnit(MCRegUnit RegUnit,
                                    const TargetRegisterInfo &TRI) {
     auto CI = Copies.find(RegUnit);
@@ -274,12 +419,28 @@ class CopyTracker {
     return findCopyForUnit(RU, TRI, true);
   }
 
+  MachineInstr *findInvalidCopyDefViaUnit(MCRegUnit RegUnit,
+                                   const TargetRegisterInfo &TRI) {
+    auto CI = InvalidCopies.find(RegUnit);
+    if (CI == InvalidCopies.end())
+      return nullptr;
+    if (CI->second.DefRegs.size() != 1)
+      return nullptr;
+    MCRegUnit RU = *TRI.regunits(CI->second.DefRegs[0]).begin();
+    return findInvalidCopyForUnit(RU, TRI, false);
+  }
+
+  // TODO: This is ugly; there should be a more elegant solution for invalid
+  //       copy searching, e.g. a variant that returns either a valid copy, an
+  //       invalid copy, or no copy at all (std::monostate).
   MachineInstr *findAvailBackwardCopy(MachineInstr &I, MCRegister Reg,
                                       const TargetRegisterInfo &TRI,
                                       const TargetInstrInfo &TII,
-                                      bool UseCopyInstr) {
+                                      bool UseCopyInstr,
+                                      bool SearchInvalid = false) {
     MCRegUnit RU = *TRI.regunits(Reg).begin();
-    MachineInstr *AvailCopy = findCopyDefViaUnit(RU, TRI);
+    MachineInstr *AvailCopy = SearchInvalid ? findInvalidCopyDefViaUnit(RU, TRI)
+                                            : findCopyDefViaUnit(RU, TRI);
 
     if (!AvailCopy)
       return nullptr;
@@ -377,13 +538,20 @@ class CopyTracker {
 
   void clear() {
     Copies.clear();
+    InvalidCopies.clear();
   }
 };
 
+using Copy = MachineInstr*;
+using InvalidCopy = std::pair<Copy, MachineInstr *>;
+using CopyLookupResult = std::variant<std::monostate, Copy, InvalidCopy>;
+
 class MachineCopyPropagation : public MachineFunctionPass {
+  LiveIntervals *LIS = nullptr;
   const TargetRegisterInfo *TRI = nullptr;
   const TargetInstrInfo *TII = nullptr;
   const MachineRegisterInfo *MRI = nullptr;
+  AAResults *AA = nullptr;
 
   // Return true if this is a copy instruction and false otherwise.
   bool UseCopyInstr;
@@ -398,6 +566,7 @@ class MachineCopyPropagation : public MachineFunctionPass {
 
   void getAnalysisUsage(AnalysisUsage &AU) const override {
     AU.setPreservesCFG();
+    AU.addUsedIfAvailable<LiveIntervalsWrapperPass>();
     MachineFunctionPass::getAnalysisUsage(AU);
   }
 
@@ -414,11 +583,11 @@ class MachineCopyPropagation : public MachineFunctionPass {
   void ReadRegister(MCRegister Reg, MachineInstr &Reader, DebugType DT);
   void readSuccessorLiveIns(const MachineBasicBlock &MBB);
   void ForwardCopyPropagateBlock(MachineBasicBlock &MBB);
-  void BackwardCopyPropagateBlock(MachineBasicBlock &MBB);
+  void BackwardCopyPropagateBlock(MachineBasicBlock &MBB, bool ResolveAntiDeps = false);
   void EliminateSpillageCopies(MachineBasicBlock &MBB);
   bool eraseIfRedundant(MachineInstr &Copy, MCRegister Src, MCRegister Def);
   void forwardUses(MachineInstr &MI);
-  void propagateDefs(MachineInstr &MI);
+  void propagateDefs(MachineInstr &MI, ScheduleDAGMCP &DG, bool ResolveAntiDeps = false);
   bool isForwardableRegClassCopy(const MachineInstr &Copy,
                                  const MachineInstr &UseI, unsigned UseIdx);
   bool isBackwardPropagatableRegClassCopy(const MachineInstr &Copy,
@@ -427,7 +596,7 @@ class MachineCopyPropagation : public MachineFunctionPass {
   bool hasImplicitOverlap(const MachineInstr &MI, const MachineOperand &Use);
   bool hasOverlappingMultipleDef(const MachineInstr &MI,
                                  const MachineOperand &MODef, Register Def);
-
+  
   /// Candidates for deletion.
   SmallSetVector<MachineInstr *, 8> MaybeDeadCopies;
 
@@ -986,8 +1155,10 @@ static bool isBackwardPropagatableCopy(const DestSourcePair &CopyOperands,
   return CopyOperands.Source->isRenamable() && CopyOperands.Source->isKill();
 }
 
-void MachineCopyPropagation::propagateDefs(MachineInstr &MI) {
-  if (!Tracker.hasAnyCopies())
+void MachineCopyPropagation::propagateDefs(MachineInstr &MI,
+                                           ScheduleDAGMCP &DG,
+                                           bool MoveDependenciesForBetterCopyPropagation) {
+  if (!Tracker.hasAnyCopies() && !Tracker.hasAnyInvalidCopies())
     return;
 
   for (unsigned OpIdx = 0, OpEnd = MI.getNumOperands(); OpIdx != OpEnd;
@@ -1010,8 +1181,30 @@ void MachineCopyPropagation::propagateDefs(MachineInstr &MI) {
 
     MachineInstr *Copy = Tracker.findAvailBackwardCopy(
         MI, MODef.getReg().asMCReg(), *TRI, *TII, UseCopyInstr);
-    if (!Copy)
-      continue;
+    if (!Copy) {
+      if (!MoveDependenciesForBetterCopyPropagation)
+        continue;
+
+      LLVM_DEBUG(
+          dbgs()
+          << "MCP: Couldn't find any backward copy that has no dependency.\n");
+      Copy = Tracker.findAvailBackwardCopy(MI, MODef.getReg().asMCReg(), *TRI,
+                                           *TII, UseCopyInstr, true);
+      if (!Copy) {
+        LLVM_DEBUG(
+            dbgs()
+            << "MCP: Couldn't find any backward copy that has dependency.\n");
+        continue;
+      }
+      LLVM_DEBUG(
+          dbgs()
+          << "MCP: Found potential backward copy that has dependency.\n");
+      SUnit *DstSUnit = DG.getSUnit(Copy);
+      SUnit *SrcSUnit = DG.getSUnit(&MI);
+
+      if (!moveInstructionsOutOfTheWayIfWeCan(DstSUnit, SrcSUnit, DG))
+        continue;
+    }
 
     std::optional<DestSourcePair> CopyOperands =
         isCopyInstr(*Copy, *TII, UseCopyInstr);
@@ -1033,23 +1226,35 @@ void MachineCopyPropagation::propagateDefs(MachineInstr &MI) {
     LLVM_DEBUG(dbgs() << "MCP: Replacing " << printReg(MODef.getReg(), TRI)
                       << "\n     with " << printReg(Def, TRI) << "\n     in "
                       << MI << "     from " << *Copy);
+    if (!MoveDependenciesForBetterCopyPropagation) {
+      MODef.setReg(Def);
+      MODef.setIsRenamable(CopyOperands->Destination->isRenamable());
 
-    MODef.setReg(Def);
-    MODef.setIsRenamable(CopyOperands->Destination->isRenamable());
-
-    LLVM_DEBUG(dbgs() << "MCP: After replacement: " << MI << "\n");
-    MaybeDeadCopies.insert(Copy);
-    Changed = true;
-    ++NumCopyBackwardPropagated;
+      LLVM_DEBUG(dbgs() << "MCP: After replacement: " << MI << "\n");
+      MaybeDeadCopies.insert(Copy);
+      Changed = true;
+      ++NumCopyBackwardPropagated;
+    }
   }
 }
 
 void MachineCopyPropagation::BackwardCopyPropagateBlock(
-    MachineBasicBlock &MBB) {
+    MachineBasicBlock &MBB, bool MoveDependenciesForBetterCopyPropagation) {
+  ScheduleDAGMCP DG{*(MBB.getParent()), nullptr, false};
+  if (MoveDependenciesForBetterCopyPropagation) {
+    DG.startBlock(&MBB);
+    DG.enterRegion(&MBB, MBB.begin(), MBB.end(), MBB.size());
+    DG.buildSchedGraph(nullptr);
+    // DG.viewGraph();
+  }
+ 
+
   LLVM_DEBUG(dbgs() << "MCP: BackwardCopyPropagateBlock " << MBB.getName()
                     << "\n");
 
   for (MachineInstr &MI : llvm::make_early_inc_range(llvm::reverse(MBB))) {
+    //llvm::errs() << "Next MI: ";
+    //MI.dump();
     // Ignore non-trivial COPYs.
     std::optional<DestSourcePair> CopyOperands =
         isCopyInstr(MI, *TII, UseCopyInstr);
@@ -1062,7 +1267,7 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
         // just let forward cp do COPY-to-COPY propagation.
         if (isBackwardPropagatableCopy(*CopyOperands, *MRI)) {
           Tracker.invalidateRegister(SrcReg.asMCReg(), *TRI, *TII,
-                                     UseCopyInstr);
+                                     UseCopyInstr, MoveDependenciesForBetterCopyPropagation);
           Tracker.invalidateRegister(DefReg.asMCReg(), *TRI, *TII,
                                      UseCopyInstr);
           Tracker.trackCopy(&MI, *TRI, *TII, UseCopyInstr);
@@ -1077,10 +1282,10 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
         MCRegister Reg = MO.getReg().asMCReg();
         if (!Reg)
           continue;
-        Tracker.invalidateRegister(Reg, *TRI, *TII, UseCopyInstr);
+        Tracker.invalidateRegister(Reg, *TRI, *TII, UseCopyInstr, false);
       }
 
-    propagateDefs(MI);
+    propagateDefs(MI, DG, MoveDependenciesForBetterCopyPropagation);
     for (const MachineOperand &MO : MI.operands()) {
       if (!MO.isReg())
         continue;
@@ -1104,7 +1309,7 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
           }
         } else {
           Tracker.invalidateRegister(MO.getReg().asMCReg(), *TRI, *TII,
-                                     UseCopyInstr);
+                                     UseCopyInstr, MoveDependenciesForBetterCopyPropagation);
         }
       }
     }
@@ -1122,6 +1327,15 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
     Copy->eraseFromParent();
     ++NumDeletes;
   }
+  if (MoveDependenciesForBetterCopyPropagation) {
+    DG.exitRegion();
+    DG.finishBlock();
+    // QUESTION: Does it make sense to keep the kill flags here?
+    //           In the other parts of this pass we just throw out
+    //           the kill flags.
+    DG.fixupKills(MBB);
+  }
+
 
   MaybeDeadCopies.clear();
   CopyDbgUsers.clear();
@@ -1472,11 +1686,29 @@ bool MachineCopyPropagation::runOnMachineFunction(MachineFunction &MF) {
   TRI = MF.getSubtarget().getRegisterInfo();
   TII = MF.getSubtarget().getInstrInfo();
   MRI = &MF.getRegInfo();
+  auto *LISWrapper = getAnalysisIfAvailable<LiveIntervalsWrapperPass>();
+  LIS = LISWrapper ? &LISWrapper->getLIS() : nullptr;
 
   for (MachineBasicBlock &MBB : MF) {
     if (isSpillageCopyElimEnabled)
       EliminateSpillageCopies(MBB);
+
+    // BackwardCopyPropagateBlock happens in two stages.
+    // First we move the unnecessary dependencies that may block copy
+    // propagation out of the way.
+    //
+    // The reason for this two-stage approach is that the ScheduleDAG cannot
+    // handle register renaming.
+    // QUESTION: I think these two stages could be merged if I were to change
+    // the renaming mechanism.
+    //
+    // The renaming wouldn't happen instantly. There would be a data structure
+    // that records which register should be renamed to which. Then, after the
+    // backward propagation has concluded, the renaming would happen.
+    BackwardCopyPropagateBlock(MBB, true);
+    // Then we do the actual copy propagation.
     BackwardCopyPropagateBlock(MBB);
+
     ForwardCopyPropagateBlock(MBB);
   }
 
diff --git a/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll b/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll
index de3f323891a36a..92575d701f4281 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll
@@ -6026,8 +6026,8 @@ define { i8, i1 } @cmpxchg_i8(ptr %ptr, i8 %desired, i8 %new) {
 ; CHECK-OUTLINE-O1-NEXT:    .cfi_offset w29, -16
 ; CHECK-OUTLINE-O1-NEXT:    .cfi_offset w19, -24
 ; CHECK-OUTLINE-O1-NEXT:    .cfi_offset w20, -32
-; CHECK-OUTLINE-O1-NEXT:    mov x3, x0
 ; CHECK-OUTLINE-O1-NEXT:    mov w19, w1
+; CHECK-OUTLINE-O1-NEXT:    mov x3, x0
 ; CHECK-OUTLINE-O1-NEXT:    mov w1, w2
 ; CHECK-OUTLINE-O1-NEXT:    mov w0, w19
 ; CHECK-OUTLINE-O1-NEXT:    mov x2, x3
@@ -6133,8 +6133,8 @@ define { i16, i1 } @cmpxchg_i16(ptr %ptr, i16 %desired, i16 %new) {
 ; CHECK-OUTLINE-O1-NEXT:    .cfi_offset w29, -16
 ; CHECK-OUTLINE-O1-NEXT:    .cfi_offset w19, -24
 ; CHECK-OUTLINE-O1-NEXT:    .cfi_offset w20, -32
-; CHECK-OUTLINE-O1-NEXT:    mov x3, x0
 ; CHECK-OUTLI...
[truncated]

@llvmbot
Member

llvmbot commented Aug 21, 2024

@llvm/pr-subscribers-backend-x86

Author: Gábor Spaits (spaits)

Changes

As we have discussed in a previous PR (#98087) here is an implementation using ScheduleDAG in the MCP.

This PR is not fully finished yet. I have not really done any precise benchmarking.

The only thing I have done is that, I have tested how much time does the generation of some regression tests take before my patch and after my path. I have not seen any increases there on my machine. But this is not a precise way of measuring.

I have not updated all the tests yet.

Could you please take a quick look at this PR and give some feedback?
Is this direction good? Should we continue with this? (Then I will try to do some compile time benchmarking and also update the tests).


Patch is 1.38 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/105562.diff

241 Files Affected:

  • (modified) llvm/lib/CodeGen/MachineCopyPropagation.cpp (+255-23)
  • (modified) llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll (+5-5)
  • (modified) llvm/test/CodeGen/AArch64/GlobalISel/arm64-pcsections.ll (+98-98)
  • (modified) llvm/test/CodeGen/AArch64/GlobalISel/merge-stores-truncating.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/aarch64-mulv.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/aarch64-wide-mul.ll (+4-6)
  • (modified) llvm/test/CodeGen/AArch64/addp-shuffle.ll (+2-4)
  • (added) llvm/test/CodeGen/AArch64/anti-dependencies-mcp.mir (+201)
  • (modified) llvm/test/CodeGen/AArch64/arm64-non-pow2-ldst.ll (+4-4)
  • (modified) llvm/test/CodeGen/AArch64/arm64-subvector-extend.ll (+30-72)
  • (modified) llvm/test/CodeGen/AArch64/arm64-windows-calls.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/avoid-zero-copy.mir (+3)
  • (modified) llvm/test/CodeGen/AArch64/cgp-usubo.ll (+5-10)
  • (modified) llvm/test/CodeGen/AArch64/cmpxchg-idioms.ll (+6-6)
  • (modified) llvm/test/CodeGen/AArch64/duplane-index-patfrags.ll (+4-8)
  • (modified) llvm/test/CodeGen/AArch64/fcmp.ll (+9-9)
  • (modified) llvm/test/CodeGen/AArch64/fexplog.ll (+180-330)
  • (modified) llvm/test/CodeGen/AArch64/fpext.ll (+14-32)
  • (modified) llvm/test/CodeGen/AArch64/fpow.ll (+20-36)
  • (modified) llvm/test/CodeGen/AArch64/fpowi.ll (+36-66)
  • (modified) llvm/test/CodeGen/AArch64/frem.ll (+20-36)
  • (modified) llvm/test/CodeGen/AArch64/fsincos.ll (+72-132)
  • (modified) llvm/test/CodeGen/AArch64/ldrpre-ldr-merge.mir (+76-76)
  • (modified) llvm/test/CodeGen/AArch64/llvm.exp10.ll (+6-12)
  • (modified) llvm/test/CodeGen/AArch64/load.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/lr-reserved-for-ra-live-in.ll (+2-2)
  • (modified) llvm/test/CodeGen/AArch64/machine-cp-sub-reg.mir (+3-3)
  • (modified) llvm/test/CodeGen/AArch64/machine-sink-kill-flags.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/named-vector-shuffles-neon.ll (+2-4)
  • (modified) llvm/test/CodeGen/AArch64/neon-extadd.ll (+18-36)
  • (modified) llvm/test/CodeGen/AArch64/neon-extmul.ll (+4-6)
  • (modified) llvm/test/CodeGen/AArch64/neon-perm.ll (+1-2)
  • (modified) llvm/test/CodeGen/AArch64/sext.ll (+27-54)
  • (modified) llvm/test/CodeGen/AArch64/shufflevector.ll (+5-12)
  • (modified) llvm/test/CodeGen/AArch64/spillfill-sve.mir (+6-106)
  • (modified) llvm/test/CodeGen/AArch64/streaming-compatible-memory-ops.ll (+2-3)
  • (modified) llvm/test/CodeGen/AArch64/sve-sext-zext.ll (+9-18)
  • (modified) llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-trunc.ll (+2-2)
  • (modified) llvm/test/CodeGen/AArch64/sve-vector-deinterleave.ll (+5-10)
  • (modified) llvm/test/CodeGen/AArch64/sve-vector-interleave.ll (+2-4)
  • (modified) llvm/test/CodeGen/AArch64/vec_umulo.ll (+5-9)
  • (modified) llvm/test/CodeGen/AArch64/vecreduce-add.ll (+24-25)
  • (modified) llvm/test/CodeGen/AArch64/vselect-ext.ll (+7-10)
  • (modified) llvm/test/CodeGen/AArch64/zext-to-tbl.ll (+9-9)
  • (modified) llvm/test/CodeGen/AArch64/zext.ll (+27-54)
  • (modified) llvm/test/CodeGen/ARM/addsubo-legalization.ll (+2-4)
  • (modified) llvm/test/CodeGen/ARM/fpclamptosat_vec.ll (+19-24)
  • (modified) llvm/test/CodeGen/ARM/funnel-shift.ll (+2-3)
  • (modified) llvm/test/CodeGen/ARM/llvm.exp10.ll (+3-9)
  • (modified) llvm/test/CodeGen/ARM/load-combine-big-endian.ll (+3-9)
  • (modified) llvm/test/CodeGen/ARM/load-combine.ll (+2-6)
  • (modified) llvm/test/CodeGen/ARM/sub-cmp-peephole.ll (+6-14)
  • (modified) llvm/test/CodeGen/ARM/vecreduce-fadd-legalization-strict.ll (+16-18)
  • (modified) llvm/test/CodeGen/ARM/vlddup.ll (+10-20)
  • (modified) llvm/test/CodeGen/ARM/vldlane.ll (+9-22)
  • (modified) llvm/test/CodeGen/RISCV/alu64.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/branch-on-zero.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/condops.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/double-fcmp-strict.ll (+12-24)
  • (modified) llvm/test/CodeGen/RISCV/float-fcmp-strict.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/half-fcmp-strict.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/llvm.frexp.ll (+12-18)
  • (modified) llvm/test/CodeGen/RISCV/machine-cp.mir (+5-4)
  • (modified) llvm/test/CodeGen/RISCV/neg-abs.ll (+4-6)
  • (modified) llvm/test/CodeGen/RISCV/nontemporal.ll (+50-75)
  • (modified) llvm/test/CodeGen/RISCV/overflow-intrinsics.ll (+5-8)
  • (modified) llvm/test/CodeGen/RISCV/rv32zbb-zbkb.ll (+3-5)
  • (added) llvm/test/CodeGen/RISCV/rv64-legal-i32/xaluo.ll (+2603)
  • (modified) llvm/test/CodeGen/RISCV/rv64-statepoint-call-lowering.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/constant-folding-crash.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-deinterleave-load.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fmaximum-vp.ll (+17-25)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fmaximum.ll (+20-32)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fminimum-vp.ll (+17-25)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fminimum.ll (+20-32)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-store-fp.ll (+4-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-store-int.ll (+4-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll (+12-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-int-vp.ll (+12-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fmaximum-sdnode.ll (+24-34)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fmaximum-vp.ll (+22-30)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fminimum-sdnode.ll (+24-34)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fminimum-vp.ll (+22-30)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fpclamptosat_vec.ll (+3-6)
  • (modified) llvm/test/CodeGen/RISCV/rvv/mask-reg-alloc.mir (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/no-reserved-frame.ll (+3-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-deinterleave.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmfeq.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmfge.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmfgt.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmfle.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmflt.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmfne.ll (+6-12)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmseq.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsge.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsgeu.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsgt.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsgtu.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsle.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsleu.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmslt.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsltu.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vmsne.ll (+10-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vselect-fp.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vsetvli-regression.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vxrm.mir (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/shifts.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/srem-vector-lkk.ll (+11-20)
  • (modified) llvm/test/CodeGen/RISCV/tail-calls.ll (+6-8)
  • (modified) llvm/test/CodeGen/RISCV/unaligned-load-store.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/urem-vector-lkk.ll (+13-24)
  • (modified) llvm/test/CodeGen/RISCV/wide-mem.ll (+1-2)
  • (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-by-byte-multiple-legalization.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/wide-scalar-shift-legalization.ll (+2-4)
  • (modified) llvm/test/CodeGen/RISCV/xaluo.ll (+27-54)
  • (modified) llvm/test/CodeGen/RISCV/xtheadmemidx.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/zcmp-cm-popretz.mir (+8-8)
  • (modified) llvm/test/CodeGen/Thumb/smul_fix_sat.ll (+2-4)
  • (modified) llvm/test/CodeGen/Thumb/umulo-128-legalisation-lowering.ll (+2-2)
  • (modified) llvm/test/CodeGen/Thumb2/mve-div-expand.ll (+6-11)
  • (modified) llvm/test/CodeGen/Thumb2/mve-fmath.ll (+29-66)
  • (modified) llvm/test/CodeGen/Thumb2/mve-fpclamptosat_vec.ll (+16-16)
  • (modified) llvm/test/CodeGen/Thumb2/mve-fptosi-sat-vector.ll (+19-21)
  • (modified) llvm/test/CodeGen/Thumb2/mve-fptoui-sat-vector.ll (+20-22)
  • (modified) llvm/test/CodeGen/Thumb2/mve-frint.ll (+6-18)
  • (modified) llvm/test/CodeGen/Thumb2/mve-laneinterleaving.ll (+3-3)
  • (modified) llvm/test/CodeGen/Thumb2/mve-sext-masked-load.ll (+3-6)
  • (modified) llvm/test/CodeGen/Thumb2/mve-shuffle.ll (+12-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-shufflemov.ll (+25-25)
  • (modified) llvm/test/CodeGen/Thumb2/mve-simple-arith.ll (+6-12)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vabdus.ll (+3-3)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vcvt.ll (+2-6)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vcvt16.ll (+2-2)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vld4.ll (+1-3)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vmovn.ll (+2-2)
  • (modified) llvm/test/CodeGen/Thumb2/mve-vst4.ll (+2-2)
  • (modified) llvm/test/CodeGen/Thumb2/mve-zext-masked-load.ll (+3-7)
  • (modified) llvm/test/CodeGen/X86/apx/mul-i1024.ll (+13-6)
  • (modified) llvm/test/CodeGen/X86/atomic-unordered.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/avx10_2_512ni-intrinsics.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/avx10_2ni-intrinsics.ll (+4-8)
  • (modified) llvm/test/CodeGen/X86/avx512-calling-conv.ll (+19-19)
  • (modified) llvm/test/CodeGen/X86/avx512-gfni-intrinsics.ll (+24-36)
  • (modified) llvm/test/CodeGen/X86/avx512-insert-extract.ll (+7-7)
  • (modified) llvm/test/CodeGen/X86/avx512-intrinsics.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/avx512-mask-op.ll (+5-5)
  • (modified) llvm/test/CodeGen/X86/avx512bw-intrinsics-upgrade.ll (+8-12)
  • (modified) llvm/test/CodeGen/X86/avx512bw-intrinsics.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/avx512bwvl-intrinsics-upgrade.ll (+4-8)
  • (modified) llvm/test/CodeGen/X86/avx512bwvl-intrinsics.ll (+10-20)
  • (modified) llvm/test/CodeGen/X86/avx512vbmi2vl-intrinsics-upgrade.ll (+28-56)
  • (modified) llvm/test/CodeGen/X86/avx512vbmi2vl-intrinsics.ll (+4-8)
  • (modified) llvm/test/CodeGen/X86/avx512vl-intrinsics-upgrade.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/div-rem-pair-recomposition-signed.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/div-rem-pair-recomposition-unsigned.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/element-wise-atomic-memory-intrinsics.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/expand-vp-cast-intrinsics.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/extract-bits.ll (+13-20)
  • (modified) llvm/test/CodeGen/X86/icmp-abs-C-vec.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/is_fpclass.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/ldexp.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/legalize-shl-vec.ll (+8-8)
  • (modified) llvm/test/CodeGen/X86/matrix-multiply.ll (+29-31)
  • (modified) llvm/test/CodeGen/X86/mul-i1024.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/mul-i256.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/mul-i512.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/pmul.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/pmulh.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/pointer-vector.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/pr11334.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/pr34177.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/pr61964.ll (+4-6)
  • (modified) llvm/test/CodeGen/X86/shift-i128.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/sibcall.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/smul_fix.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/smul_fix_sat.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/smulo-128-legalisation-lowering.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/subvectorwise-store-of-vector-splat.ll (+15-15)
  • (modified) llvm/test/CodeGen/X86/umul-with-overflow.ll (+3-4)
  • (modified) llvm/test/CodeGen/X86/umul_fix.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/umul_fix_sat.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/umulo-128-legalisation-lowering.ll (+1-2)
  • (modified) llvm/test/CodeGen/X86/vec_int_to_fp.ll (+5-10)
  • (modified) llvm/test/CodeGen/X86/vec_saddo.ll (+5-9)
  • (modified) llvm/test/CodeGen/X86/vec_ssubo.ll (+2-3)
  • (modified) llvm/test/CodeGen/X86/vec_umulo.ll (+11-18)
  • (modified) llvm/test/CodeGen/X86/vector-interleave.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-2.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-3.ll (+8-8)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-4.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-5.ll (+13-13)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-6.ll (+10-10)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-7.ll (+23-26)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-8.ll (+32-32)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-3.ll (+22-22)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-4.ll (+20-20)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-5.ll (+65-68)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-6.ll (+24-27)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-7.ll (+30-30)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i32-stride-8.ll (+85-85)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i64-stride-4.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i64-stride-5.ll (+20-20)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i64-stride-6.ll (+8-8)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i64-stride-7.ll (+80-84)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i64-stride-8.ll (+232-232)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i8-stride-3.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i8-stride-4.ll (+6-6)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i8-stride-5.ll (+5-5)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i8-stride-6.ll (+11-11)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i8-stride-7.ll (+41-41)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-load-i8-stride-8.ll (+48-50)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-3.ll (+32-32)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-4.ll (+12-12)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-5.ll (+15-15)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-6.ll (+24-24)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-7.ll (+39-42)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-2.ll (+12-12)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-3.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-5.ll (+11-11)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-6.ll (+56-60)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-7.ll (+57-58)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-8.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-3.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-4.ll (+24-24)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-5.ll (+5-5)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-7.ll (+140-142)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i64-stride-8.ll (+48-48)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-3.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-5.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-6.ll (+19-19)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-7.ll (+17-20)
  • (modified) llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-8.ll (+4-4)
  • (modified) llvm/test/CodeGen/X86/vector-intrinsics.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/vector-sext.ll (+5-10)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-combining-avx.ll (+10-11)
  • (modified) llvm/test/CodeGen/X86/vector-zext.ll (+3-3)
  • (modified) llvm/test/CodeGen/X86/wide-scalar-shift-legalization.ll (+1-1)
  • (modified) llvm/test/CodeGen/X86/x86-interleaved-access.ll (+5-5)
  • (modified) llvm/test/CodeGen/X86/xmulo.ll (+46-88)
diff --git a/llvm/lib/CodeGen/MachineCopyPropagation.cpp b/llvm/lib/CodeGen/MachineCopyPropagation.cpp
index b34e0939d1c7c6..493d7cd7d8c920 100644
--- a/llvm/lib/CodeGen/MachineCopyPropagation.cpp
+++ b/llvm/lib/CodeGen/MachineCopyPropagation.cpp
@@ -48,19 +48,27 @@
 //
 //===----------------------------------------------------------------------===//
 
+#include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DepthFirstIterator.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SetVector.h"
 #include "llvm/ADT/SmallSet.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Statistic.h"
 #include "llvm/ADT/iterator_range.h"
+#include "llvm/Analysis/AliasAnalysis.h"
+#include "llvm/CodeGen/LiveIntervals.h"
 #include "llvm/CodeGen/MachineBasicBlock.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/CodeGen/MachineInstr.h"
+#include "llvm/CodeGen/MachineLoopInfo.h"
 #include "llvm/CodeGen/MachineOperand.h"
 #include "llvm/CodeGen/MachineRegisterInfo.h"
+#include "llvm/CodeGen/ScheduleDAG.h"
+#include "llvm/CodeGen/ScheduleDAGInstrs.h"
+#include "llvm/CodeGen/SelectionDAGNodes.h"
 #include "llvm/CodeGen/TargetInstrInfo.h"
 #include "llvm/CodeGen/TargetRegisterInfo.h"
 #include "llvm/CodeGen/TargetSubtargetInfo.h"
@@ -70,9 +78,15 @@
 #include "llvm/Pass.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/DebugCounter.h"
+#include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/raw_ostream.h"
+#include <algorithm>
 #include <cassert>
 #include <iterator>
+#include <optional>
+#include <queue>
+#include <utility>
+#include <variant>
 
 using namespace llvm;
 
@@ -92,6 +106,113 @@ static cl::opt<cl::boolOrDefault>
     EnableSpillageCopyElimination("enable-spill-copy-elim", cl::Hidden);
 
 namespace {
+// A ScheduleDAG subclass that is used as a dependency graph.
+class ScheduleDAGMCP : public ScheduleDAGInstrs {
+public:
+  void schedule() override {
+    llvm_unreachable("This schedule dag is only used as a dependency graph for "
+                     "Machine Copy Propagation\n");
+  }
+
+  ScheduleDAGMCP(MachineFunction &MF, const MachineLoopInfo *MLI,
+                 bool RemoveKillFlags = false)
+      : ScheduleDAGInstrs(MF, MLI, RemoveKillFlags) {
+    CanHandleTerminators = true;
+  }
+};
+
+static bool moveInstructionsOutOfTheWayIfWeCan(SUnit *Dst,
+                                               SUnit *Src,
+                                               ScheduleDAGMCP &DG) {
+  MachineInstr *DstInstr = Dst->getInstr();
+  MachineInstr *SrcInstr = Src->getInstr();
+
+  if (DstInstr == nullptr || SrcInstr == nullptr)
+    return false;
+  MachineBasicBlock *MBB = SrcInstr->getParent();
+  assert(MBB == DstInstr->getParent() &&
+         "This function only operates on a basic block level.");
+
+  int SectionSize =
+      std::distance(SrcInstr->getIterator(), DstInstr->getIterator());
+
+  // The bit vector representing the instructions in the section.
+  // This vector stores which instruction needs to be moved and which does not.
+  BitVector SectionInstr(SectionSize, false);
+
+  // The queue for the breadth first search.
+  std::queue<const SUnit *> Edges;
+
+  // Process the children of a node.
+  // Basically every node is checked before it is put into the queue.
+  // A node is enqueued if it has no dependency on the source of the copy
+  // (unless we are processing the destination node, which is a special
+  // case indicated by a flag) and it is located between the source of the
+  // copy and the destination of the copy.
+  auto ProcessSNodeChildren = [SrcInstr, &SectionSize, &SectionInstr](
+                                  std::queue<const SUnit *> &Queue,
+                                  const SUnit *Node, bool IsRoot) -> bool {
+    for (llvm::SDep I : Node->Preds) {
+      SUnit *SU = I.getSUnit();
+      MachineInstr &MI = *(SU->getInstr());
+      if (!IsRoot && &MI == SrcInstr)
+        return false;
+
+      int DestinationFromSource =
+          std::distance(SrcInstr->getIterator(), MI.getIterator());
+
+      if (&MI != SrcInstr && DestinationFromSource > 0 &&
+          DestinationFromSource < SectionSize) {
+        // If an instruction is already marked in the section bit vector,
+        // then it has already been processed with all of its dependencies.
+        // We do not need to do anything with it again.
+        if (!SectionInstr[DestinationFromSource]) {
+          SectionInstr[DestinationFromSource] = true;
+          Queue.push(SU);
+        }
+      }
+    }
+    return true;
+  };
+
+  // The BFS happens here.
+  //
+  // We could not use the ADT implementation of BFS here.
+  // In the ADT graph traversals we have no way to select exactly which
+  // children are put into the "nodes to traverse" queue or stack.
+  //
+  // We could not work around this by checking whether a node is needed in
+  // the processing stage either. In some contexts it matters what the
+  // parent of an instruction was: namely when we start the traversal with
+  // the source of the copy propagation. This instruction must have the
+  // destination as a dependency. For any other instruction that has the
+  // destination as a dependency, this would mean the end of the traversal,
+  // but in this scenario it must be ignored. Say we could not control which
+  // nodes are processed and we came across the copy source. How would we
+  // know which node has that copy source as its dependency? We could check
+  // which nodes the copy source is a dependency of; that list will always
+  // contain the source. To decide whether another instruction has it as a
+  // dependency, we would have to check whether any instruction that depends
+  // on the source is in the already traversed list. That adds extra cost.
+  ProcessSNodeChildren(Edges, Dst, true);
+  while (!Edges.empty()) {
+    const auto *Current = Edges.front();
+    Edges.pop();
+    if (!ProcessSNodeChildren(Edges, Current, false))
+      return false;
+  }
+
+  // If all of the dependencies were deemed valid during the BFS then we
+  // are moving them before the copy source here keeping their relative
+  // order to each other.
+  auto CurrentInst = SrcInstr->getIterator();
+  for (int I = 0; I < SectionSize; I++) {
+    if (SectionInstr[I])
+      MBB->splice(SrcInstr->getIterator(), MBB, CurrentInst->getIterator());
+    ++CurrentInst;
+  }
+  return true;
+}
 
 static std::optional<DestSourcePair> isCopyInstr(const MachineInstr &MI,
                                                  const TargetInstrInfo &TII,
@@ -114,6 +235,7 @@ class CopyTracker {
   };
 
   DenseMap<MCRegUnit, CopyInfo> Copies;
+  DenseMap<MCRegUnit, CopyInfo> InvalidCopies;
 
 public:
   /// Mark all of the given registers and their subregisters as unavailable for
@@ -130,9 +252,14 @@ class CopyTracker {
     }
   }
 
+  int getInvalidCopiesSize() {
+    return InvalidCopies.size();
+  }
+
   /// Remove register from copy maps.
   void invalidateRegister(MCRegister Reg, const TargetRegisterInfo &TRI,
-                          const TargetInstrInfo &TII, bool UseCopyInstr) {
+                          const TargetInstrInfo &TII, bool UseCopyInstr,
+                          bool MayStillBePropagated = false) {
     // Since Reg might be a subreg of some registers, only invalidate Reg is not
     // enough. We have to find the COPY defines Reg or registers defined by Reg
     // and invalidate all of them. Similarly, we must invalidate all of the
@@ -158,8 +285,11 @@ class CopyTracker {
           InvalidateCopy(MI);
       }
     }
-    for (MCRegUnit Unit : RegUnitsToInvalidate)
+    for (MCRegUnit Unit : RegUnitsToInvalidate) {
+      if (Copies.contains(Unit) && MayStillBePropagated)
+        InvalidCopies[Unit] = Copies[Unit];
       Copies.erase(Unit);
+    }
   }
 
   /// Clobber a single register, removing it from the tracker's copy maps.
@@ -252,6 +382,10 @@ class CopyTracker {
     return !Copies.empty();
   }
 
+  bool hasAnyInvalidCopies() {
+    return !InvalidCopies.empty();
+  }
+
   MachineInstr *findCopyForUnit(MCRegUnit RegUnit,
                                 const TargetRegisterInfo &TRI,
                                 bool MustBeAvailable = false) {
@@ -263,6 +397,17 @@ class CopyTracker {
     return CI->second.MI;
   }
 
+  MachineInstr *findInvalidCopyForUnit(MCRegUnit RegUnit,
+                                const TargetRegisterInfo &TRI,
+                                bool MustBeAvailable = false) {
+    auto CI = InvalidCopies.find(RegUnit);
+    if (CI == InvalidCopies.end())
+      return nullptr;
+    if (MustBeAvailable && !CI->second.Avail)
+      return nullptr;
+    return CI->second.MI;
+  }
+
   MachineInstr *findCopyDefViaUnit(MCRegUnit RegUnit,
                                    const TargetRegisterInfo &TRI) {
     auto CI = Copies.find(RegUnit);
@@ -274,12 +419,28 @@ class CopyTracker {
     return findCopyForUnit(RU, TRI, true);
   }
 
+  MachineInstr *findInvalidCopyDefViaUnit(MCRegUnit RegUnit,
+                                   const TargetRegisterInfo &TRI) {
+    auto CI = InvalidCopies.find(RegUnit);
+    if (CI == InvalidCopies.end())
+      return nullptr;
+    if (CI->second.DefRegs.size() != 1)
+      return nullptr;
+    MCRegUnit RU = *TRI.regunits(CI->second.DefRegs[0]).begin();
+    return findInvalidCopyForUnit(RU, TRI, false);
+  }
+
+  // TODO: This is ugly; there should be a more elegant solution to invalid
+  //       copy searching. Create a variant that returns either a valid copy,
+  //       an invalid copy, or no copy at all (std::monostate).
   MachineInstr *findAvailBackwardCopy(MachineInstr &I, MCRegister Reg,
                                       const TargetRegisterInfo &TRI,
                                       const TargetInstrInfo &TII,
-                                      bool UseCopyInstr) {
+                                      bool UseCopyInstr,
+                                      bool SearchInvalid = false) {
     MCRegUnit RU = *TRI.regunits(Reg).begin();
-    MachineInstr *AvailCopy = findCopyDefViaUnit(RU, TRI);
+    MachineInstr *AvailCopy = SearchInvalid ? findInvalidCopyDefViaUnit(RU, TRI)
+                                            : findCopyDefViaUnit(RU, TRI);
 
     if (!AvailCopy)
       return nullptr;
@@ -377,13 +538,20 @@ class CopyTracker {
 
   void clear() {
     Copies.clear();
+    InvalidCopies.clear();
   }
 };
 
+using Copy = MachineInstr*;
+using InvalidCopy = std::pair<Copy, MachineInstr *>;
+using CopyLookupResult = std::variant<std::monostate, Copy, InvalidCopy>;
+
 class MachineCopyPropagation : public MachineFunctionPass {
+  LiveIntervals *LIS = nullptr;
   const TargetRegisterInfo *TRI = nullptr;
   const TargetInstrInfo *TII = nullptr;
   const MachineRegisterInfo *MRI = nullptr;
+  AAResults *AA = nullptr;
 
   // Return true if this is a copy instruction and false otherwise.
   bool UseCopyInstr;
@@ -398,6 +566,7 @@ class MachineCopyPropagation : public MachineFunctionPass {
 
   void getAnalysisUsage(AnalysisUsage &AU) const override {
     AU.setPreservesCFG();
+    AU.addUsedIfAvailable<LiveIntervalsWrapperPass>();
     MachineFunctionPass::getAnalysisUsage(AU);
   }
 
@@ -414,11 +583,11 @@ class MachineCopyPropagation : public MachineFunctionPass {
   void ReadRegister(MCRegister Reg, MachineInstr &Reader, DebugType DT);
   void readSuccessorLiveIns(const MachineBasicBlock &MBB);
   void ForwardCopyPropagateBlock(MachineBasicBlock &MBB);
-  void BackwardCopyPropagateBlock(MachineBasicBlock &MBB);
+  void BackwardCopyPropagateBlock(MachineBasicBlock &MBB, bool ResolveAntiDeps = false);
   void EliminateSpillageCopies(MachineBasicBlock &MBB);
   bool eraseIfRedundant(MachineInstr &Copy, MCRegister Src, MCRegister Def);
   void forwardUses(MachineInstr &MI);
-  void propagateDefs(MachineInstr &MI);
+  void propagateDefs(MachineInstr &MI, ScheduleDAGMCP &DG, bool ResolveAntiDeps = false);
   bool isForwardableRegClassCopy(const MachineInstr &Copy,
                                  const MachineInstr &UseI, unsigned UseIdx);
   bool isBackwardPropagatableRegClassCopy(const MachineInstr &Copy,
@@ -427,7 +596,7 @@ class MachineCopyPropagation : public MachineFunctionPass {
   bool hasImplicitOverlap(const MachineInstr &MI, const MachineOperand &Use);
   bool hasOverlappingMultipleDef(const MachineInstr &MI,
                                  const MachineOperand &MODef, Register Def);
-
+  
   /// Candidates for deletion.
   SmallSetVector<MachineInstr *, 8> MaybeDeadCopies;
 
@@ -986,8 +1155,10 @@ static bool isBackwardPropagatableCopy(const DestSourcePair &CopyOperands,
   return CopyOperands.Source->isRenamable() && CopyOperands.Source->isKill();
 }
 
-void MachineCopyPropagation::propagateDefs(MachineInstr &MI) {
-  if (!Tracker.hasAnyCopies())
+void MachineCopyPropagation::propagateDefs(MachineInstr &MI,
+                                           ScheduleDAGMCP &DG,
+                                           bool MoveDependenciesForBetterCopyPropagation) {
+  if (!Tracker.hasAnyCopies() && !Tracker.hasAnyInvalidCopies())
     return;
 
   for (unsigned OpIdx = 0, OpEnd = MI.getNumOperands(); OpIdx != OpEnd;
@@ -1010,8 +1181,30 @@ void MachineCopyPropagation::propagateDefs(MachineInstr &MI) {
 
     MachineInstr *Copy = Tracker.findAvailBackwardCopy(
         MI, MODef.getReg().asMCReg(), *TRI, *TII, UseCopyInstr);
-    if (!Copy)
-      continue;
+    if (!Copy) {
+      if (!MoveDependenciesForBetterCopyPropagation)
+        continue;
+
+      LLVM_DEBUG(
+          dbgs()
+          << "MCP: Couldn't find any backward copy that has no dependency.\n");
+      Copy = Tracker.findAvailBackwardCopy(MI, MODef.getReg().asMCReg(), *TRI,
+                                           *TII, UseCopyInstr, true);
+      if (!Copy) {
+        LLVM_DEBUG(
+            dbgs()
+            << "MCP: Couldn't find any backward copy that has dependency.\n");
+        continue;
+      }
+      LLVM_DEBUG(
+          dbgs()
+          << "MCP: Found potential backward copy that has dependency.\n");
+      SUnit *DstSUnit = DG.getSUnit(Copy);
+      SUnit *SrcSUnit = DG.getSUnit(&MI);
+
+      if (!moveInstructionsOutOfTheWayIfWeCan(DstSUnit, SrcSUnit, DG))
+        continue;
+    }
 
     std::optional<DestSourcePair> CopyOperands =
         isCopyInstr(*Copy, *TII, UseCopyInstr);
@@ -1033,23 +1226,35 @@ void MachineCopyPropagation::propagateDefs(MachineInstr &MI) {
     LLVM_DEBUG(dbgs() << "MCP: Replacing " << printReg(MODef.getReg(), TRI)
                       << "\n     with " << printReg(Def, TRI) << "\n     in "
                       << MI << "     from " << *Copy);
+    if (!MoveDependenciesForBetterCopyPropagation) {
+      MODef.setReg(Def);
+      MODef.setIsRenamable(CopyOperands->Destination->isRenamable());
 
-    MODef.setReg(Def);
-    MODef.setIsRenamable(CopyOperands->Destination->isRenamable());
-
-    LLVM_DEBUG(dbgs() << "MCP: After replacement: " << MI << "\n");
-    MaybeDeadCopies.insert(Copy);
-    Changed = true;
-    ++NumCopyBackwardPropagated;
+      LLVM_DEBUG(dbgs() << "MCP: After replacement: " << MI << "\n");
+      MaybeDeadCopies.insert(Copy);
+      Changed = true;
+      ++NumCopyBackwardPropagated;
+    }
   }
 }
 
 void MachineCopyPropagation::BackwardCopyPropagateBlock(
-    MachineBasicBlock &MBB) {
+    MachineBasicBlock &MBB, bool MoveDependenciesForBetterCopyPropagation) {
+  ScheduleDAGMCP DG{*(MBB.getParent()), nullptr, false};
+  if (MoveDependenciesForBetterCopyPropagation) {
+    DG.startBlock(&MBB);
+    DG.enterRegion(&MBB, MBB.begin(), MBB.end(), MBB.size());
+    DG.buildSchedGraph(nullptr);
+    // DG.viewGraph();
+  }
+ 
+
   LLVM_DEBUG(dbgs() << "MCP: BackwardCopyPropagateBlock " << MBB.getName()
                     << "\n");
 
   for (MachineInstr &MI : llvm::make_early_inc_range(llvm::reverse(MBB))) {
+    //llvm::errs() << "Next MI: ";
+    //MI.dump();
     // Ignore non-trivial COPYs.
     std::optional<DestSourcePair> CopyOperands =
         isCopyInstr(MI, *TII, UseCopyInstr);
@@ -1062,7 +1267,7 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
         // just let forward cp do COPY-to-COPY propagation.
         if (isBackwardPropagatableCopy(*CopyOperands, *MRI)) {
           Tracker.invalidateRegister(SrcReg.asMCReg(), *TRI, *TII,
-                                     UseCopyInstr);
+                                     UseCopyInstr, MoveDependenciesForBetterCopyPropagation);
           Tracker.invalidateRegister(DefReg.asMCReg(), *TRI, *TII,
                                      UseCopyInstr);
           Tracker.trackCopy(&MI, *TRI, *TII, UseCopyInstr);
@@ -1077,10 +1282,10 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
         MCRegister Reg = MO.getReg().asMCReg();
         if (!Reg)
           continue;
-        Tracker.invalidateRegister(Reg, *TRI, *TII, UseCopyInstr);
+        Tracker.invalidateRegister(Reg, *TRI, *TII, UseCopyInstr, false);
       }
 
-    propagateDefs(MI);
+    propagateDefs(MI, DG, MoveDependenciesForBetterCopyPropagation);
     for (const MachineOperand &MO : MI.operands()) {
       if (!MO.isReg())
         continue;
@@ -1104,7 +1309,7 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
           }
         } else {
           Tracker.invalidateRegister(MO.getReg().asMCReg(), *TRI, *TII,
-                                     UseCopyInstr);
+                                     UseCopyInstr, MoveDependenciesForBetterCopyPropagation);
         }
       }
     }
@@ -1122,6 +1327,15 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
     Copy->eraseFromParent();
     ++NumDeletes;
   }
+  if (MoveDependenciesForBetterCopyPropagation) {
+    DG.exitRegion();
+    DG.finishBlock();
+    // QUESTION: Does it make sense to keep the kill flags here?
+    //           In the other parts of this pass we just throw out
+    //           the kill flags.
+    DG.fixupKills(MBB);
+  }
+
 
   MaybeDeadCopies.clear();
   CopyDbgUsers.clear();
@@ -1472,11 +1686,29 @@ bool MachineCopyPropagation::runOnMachineFunction(MachineFunction &MF) {
   TRI = MF.getSubtarget().getRegisterInfo();
   TII = MF.getSubtarget().getInstrInfo();
   MRI = &MF.getRegInfo();
+  auto *LISWrapper = getAnalysisIfAvailable<LiveIntervalsWrapperPass>();
+  LIS = LISWrapper ? &LISWrapper->getLIS() : nullptr;
 
   for (MachineBasicBlock &MBB : MF) {
     if (isSpillageCopyElimEnabled)
       EliminateSpillageCopies(MBB);
+
+    // BackwardCopyPropagateBlock happens in two stages.
+    // First we move those unnecessary dependencies out of the way
+    // that may block copy propagations.
+    //
+    // The reason for this two stage approach is that the ScheduleDAG can not
+    // handle register renaming.
+    // QUESTION: I think these two stages could be merged together if I were
+    // to change the renaming mechanism.
+    //
+    // The renaming wouldn't happen instantly. There would be a data structure
+    // that contained what register should be renamed to what. Then after the
+    // backward propagation has concluded the renaming would happen.
+    BackwardCopyPropagateBlock(MBB, true);
+    // Then we do the actual copy propagation.
     BackwardCopyPropagateBlock(MBB);
+
     ForwardCopyPropagateBlock(MBB);
   }
 
diff --git a/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll b/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll
index de3f323891a36a..92575d701f4281 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll
@@ -6026,8 +6026,8 @@ define { i8, i1 } @cmpxchg_i8(ptr %ptr, i8 %desired, i8 %new) {
 ; CHECK-OUTLINE-O1-NEXT:    .cfi_offset w29, -16
 ; CHECK-OUTLINE-O1-NEXT:    .cfi_offset w19, -24
 ; CHECK-OUTLINE-O1-NEXT:    .cfi_offset w20, -32
-; CHECK-OUTLINE-O1-NEXT:    mov x3, x0
 ; CHECK-OUTLINE-O1-NEXT:    mov w19, w1
+; CHECK-OUTLINE-O1-NEXT:    mov x3, x0
 ; CHECK-OUTLINE-O1-NEXT:    mov w1, w2
 ; CHECK-OUTLINE-O1-NEXT:    mov w0, w19
 ; CHECK-OUTLINE-O1-NEXT:    mov x2, x3
@@ -6133,8 +6133,8 @@ define { i16, i1 } @cmpxchg_i16(ptr %ptr, i16 %desired, i16 %new) {
 ; CHECK-OUTLINE-O1-NEXT:    .cfi_offset w29, -16
 ; CHECK-OUTLINE-O1-NEXT:    .cfi_offset w19, -24
 ; CHECK-OUTLINE-O1-NEXT:    .cfi_offset w20, -32
-; CHECK-OUTLINE-O1-NEXT:    mov x3, x0
 ; CHECK-OUTLI...
[truncated]

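To make the truncated diff above easier to follow: the heart of the patch is moveInstructionsOutOfTheWayIfWeCan, which runs a BFS over the dependency-graph predecessors of the copy, marks every instruction between the copy and the instruction we want to propagate into, bails out if any of them depends on that instruction, and finally splices the marked instructions above it while keeping their relative order. Below is a rough, self-contained sketch of that idea only; it uses plain indices and adjacency lists as stand-ins for MachineInstr and SUnit (these simplifications are assumptions of the illustration, not types or helpers taken from the PR).

#include <cstdio>
#include <queue>
#include <vector>

// Toy model of the BFS in moveInstructionsOutOfTheWayIfWeCan. Instructions
// are indices in program order and Preds[i] lists the indices instruction i
// depends on (its dependency-graph predecessors). Src is the instruction we
// want to propagate the copy into; Dst is the COPY that appears later in the
// block. Collect everything strictly between them that Dst transitively
// depends on; give up if any of it depends on Src.
static bool collectMovable(const std::vector<std::vector<int>> &Preds,
                           int Src, int Dst, std::vector<int> &ToMove) {
  std::vector<bool> Marked(Preds.size(), false);
  std::queue<int> Work;

  auto ProcessPreds = [&](int Node, bool IsRoot) {
    for (int P : Preds[Node]) {
      // A dependency on Src is only tolerated for the root (the copy itself).
      if (!IsRoot && P == Src)
        return false;
      if (P > Src && P < Dst && !Marked[P]) {
        Marked[P] = true;
        Work.push(P);
      }
    }
    return true;
  };

  if (!ProcessPreds(Dst, /*IsRoot=*/true))
    return false;
  while (!Work.empty()) {
    int Cur = Work.front();
    Work.pop();
    if (!ProcessPreds(Cur, /*IsRoot=*/false))
      return false;
  }
  // Preserve the relative order of the instructions to hoist, like the
  // splice loop in the patch does.
  for (int I = Src + 1; I < Dst; ++I)
    if (Marked[I])
      ToMove.push_back(I);
  return true;
}

int main() {
  // 0: unrelated def, 1: Src (defines the copied register),
  // 2: clobbers the copy's destination (the blocking anti-dependency),
  // 3: Dst, the COPY, which depends on 1 and 2.
  std::vector<std::vector<int>> Preds = {{}, {0}, {}, {1, 2}};
  std::vector<int> ToMove;
  if (collectMovable(Preds, /*Src=*/1, /*Dst=*/3, ToMove))
    for (int I : ToMove)
      std::printf("hoist instruction %d above instruction 1\n", I);
  return 0;
}

In this toy graph, instruction 2 does not depend on instruction 1, so it can be hoisted and the copy becomes backward-propagatable; if Preds[2] contained 1, the sketch would refuse to move anything, mirroring the early return in the patch.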
@llvmbot
Copy link
Member

llvmbot commented Aug 21, 2024

@llvm/pr-subscribers-backend-arm



github-actions bot commented Aug 21, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 22d3fb182c9199ac3d51e5577c6647508a7a37f0 027d7761dbb0293451a2c00e32cd6dc5ce83252c --extensions cpp -- llvm/lib/CodeGen/MachineCopyPropagation.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/CodeGen/MachineCopyPropagation.cpp b/llvm/lib/CodeGen/MachineCopyPropagation.cpp
index c1fe3f1964..2d07f4383c 100644
--- a/llvm/lib/CodeGen/MachineCopyPropagation.cpp
+++ b/llvm/lib/CodeGen/MachineCopyPropagation.cpp
@@ -123,15 +123,17 @@ public:
 };
 
 static std::optional<llvm::SmallVector<MachineInstr *>>
-moveInstructionsOutOfTheWayIfWeCan(MachineInstr *DstInstr, MachineInstr *SrcInstr, ScheduleDAGMCP &DG) {
+moveInstructionsOutOfTheWayIfWeCan(MachineInstr *DstInstr,
+                                   MachineInstr *SrcInstr, ScheduleDAGMCP &DG) {
   SUnit *Dst;
-  //SUnit *Src;
+  // SUnit *Src;
 
   MachineBasicBlock *MBB = SrcInstr->getParent();
   int SectionSize =
       std::distance(SrcInstr->getIterator(), DstInstr->getIterator());
 
-  DG.enterRegion(MBB, (SrcInstr->getIterator()), ++(DstInstr->getIterator()), SectionSize+1);
+  DG.enterRegion(MBB, (SrcInstr->getIterator()), ++(DstInstr->getIterator()),
+                 SectionSize + 1);
   DG.buildSchedGraph(nullptr);
   Dst = DG.getSUnit(DstInstr);
   unsigned MaxNumberOfNodesToBeProcessed = 10;
@@ -141,7 +143,6 @@ moveInstructionsOutOfTheWayIfWeCan(MachineInstr *DstInstr, MachineInstr *SrcInst
   assert("This function only operates on a basic block level." &&
          MBB == DstInstr->getParent());
 
-
   assert(SectionSize > 0 &&
          "The copy source must precede the copy destination.");
 
@@ -160,8 +161,9 @@ moveInstructionsOutOfTheWayIfWeCan(MachineInstr *DstInstr, MachineInstr *SrcInst
   // (only if we are not talking about the destination node which is a special
   // case indicated by a flag) and is located between the source of the copy and
   // the destination of the copy.
-  auto ProcessSNodeChildren = [&Edges, SrcInstr, &SectionSize, &SectionInstr, &NumProcessedNode, &MaxNumberOfNodesToBeProcessed](
-                                  const SUnit *Node, bool IsRoot) -> bool {
+  auto ProcessSNodeChildren =
+      [&Edges, SrcInstr, &SectionSize, &SectionInstr, &NumProcessedNode,
+       &MaxNumberOfNodesToBeProcessed](const SUnit *Node, bool IsRoot) -> bool {
     for (llvm::SDep I : Node->Preds) {
       SUnit *SU = I.getSUnit();
       MachineInstr &MI = *(SU->getInstr());
@@ -183,7 +185,7 @@ moveInstructionsOutOfTheWayIfWeCan(MachineInstr *DstInstr, MachineInstr *SrcInst
         }
       }
     }
-    return NumProcessedNode < MaxNumberOfNodesToBeProcessed;      
+    return NumProcessedNode < MaxNumberOfNodesToBeProcessed;
   };
 
   // The BFS happens here.
@@ -268,9 +270,7 @@ public:
     }
   }
 
-  int getInvalidCopiesSize() {
-    return InvalidCopies.size();
-  }
+  int getInvalidCopiesSize() { return InvalidCopies.size(); }
 
   /// Remove register from copy maps.
   void invalidateRegister(MCRegister Reg, const TargetRegisterInfo &TRI,
@@ -398,9 +398,7 @@ public:
     return !Copies.empty();
   }
 
-  bool hasAnyInvalidCopies() {
-    return !InvalidCopies.empty();
-  }
+  bool hasAnyInvalidCopies() { return !InvalidCopies.empty(); }
 
   MachineInstr *findCopyForUnit(MCRegUnit RegUnit,
                                 const TargetRegisterInfo &TRI,
@@ -414,8 +412,8 @@ public:
   }
 
   MachineInstr *findInvalidCopyForUnit(MCRegUnit RegUnit,
-                                const TargetRegisterInfo &TRI,
-                                bool MustBeAvailable = false) {
+                                       const TargetRegisterInfo &TRI,
+                                       bool MustBeAvailable = false) {
     auto CI = InvalidCopies.find(RegUnit);
     if (CI == InvalidCopies.end())
       return nullptr;
@@ -436,7 +434,7 @@ public:
   }
 
   MachineInstr *findInvalidCopyDefViaUnit(MCRegUnit RegUnit,
-                                   const TargetRegisterInfo &TRI) {
+                                          const TargetRegisterInfo &TRI) {
     auto CI = InvalidCopies.find(RegUnit);
     if (CI == InvalidCopies.end())
       return nullptr;
@@ -447,8 +445,8 @@ public:
   }
 
   // TODO: This is ugly there shall be a more elegant solution to invalid
-  //       copy searching. Create a variant that either returns a valid an invalid
-  //       copy or no copy at all (std::monotype).
+  //       copy searching. Create a variant that either returns a valid an
+  //       invalid copy or no copy at all (std::monotype).
   MachineInstr *findAvailBackwardCopy(MachineInstr &I, MCRegister Reg,
                                       const TargetRegisterInfo &TRI,
                                       const TargetInstrInfo &TII,
@@ -558,7 +556,7 @@ public:
   }
 };
 
-using Copy = MachineInstr*;
+using Copy = MachineInstr *;
 using InvalidCopy = std::pair<Copy, MachineInstr *>;
 using CopyLookupResult = std::variant<std::monostate, Copy, InvalidCopy>;
 
@@ -599,7 +597,8 @@ private:
   void ReadRegister(MCRegister Reg, MachineInstr &Reader, DebugType DT);
   void readSuccessorLiveIns(const MachineBasicBlock &MBB);
   void ForwardCopyPropagateBlock(MachineBasicBlock &MBB);
-  void BackwardCopyPropagateBlock(MachineBasicBlock &MBB, ScheduleDAGMCP *DG = nullptr);
+  void BackwardCopyPropagateBlock(MachineBasicBlock &MBB,
+                                  ScheduleDAGMCP *DG = nullptr);
   void EliminateSpillageCopies(MachineBasicBlock &MBB);
   bool eraseIfRedundant(MachineInstr &Copy, MCRegister Src, MCRegister Def);
   void forwardUses(MachineInstr &MI);
@@ -612,7 +611,7 @@ private:
   bool hasImplicitOverlap(const MachineInstr &MI, const MachineOperand &Use);
   bool hasOverlappingMultipleDef(const MachineInstr &MI,
                                  const MachineOperand &MODef, Register Def);
-  
+
   /// Candidates for deletion.
   SmallSetVector<MachineInstr *, 8> MaybeDeadCopies;
 
@@ -1216,8 +1215,7 @@ void MachineCopyPropagation::propagateDefs(MachineInstr &MI,
           dbgs()
           << "MCP: Found potential backward copy that has dependency.\n");
 
-      InstructionsToMove =
-          moveInstructionsOutOfTheWayIfWeCan(Copy, &MI, *DG);
+      InstructionsToMove = moveInstructionsOutOfTheWayIfWeCan(Copy, &MI, *DG);
       if (!InstructionsToMove)
         continue;
     }
@@ -1252,27 +1250,27 @@ void MachineCopyPropagation::propagateDefs(MachineInstr &MI,
       ++NumCopyBackwardPropagated;
     } else if (InstructionsToMove) {
       for (auto *I : *InstructionsToMove) {
-        MI.getParent()->splice(MI.getIterator(), MI.getParent(), I->getIterator());
+        MI.getParent()->splice(MI.getIterator(), MI.getParent(),
+                               I->getIterator());
       }
     }
   }
 }
 
-void MachineCopyPropagation::BackwardCopyPropagateBlock(
-    MachineBasicBlock &MBB, ScheduleDAGMCP *DG) {
+void MachineCopyPropagation::BackwardCopyPropagateBlock(MachineBasicBlock &MBB,
+                                                        ScheduleDAGMCP *DG) {
   if (DG) {
     DG->startBlock(&MBB);
     // DG.viewGraph();
   }
- 
 
   LLVM_DEBUG(dbgs() << "MCP: BackwardCopyPropagateBlock " << MBB.getName()
                     << "\n");
 
   for (MachineInstr &MI : llvm::make_early_inc_range(llvm::reverse(MBB))) {
-    //llvm::errs() << "Next MI: ";
-    //MI.dump();
-    // Ignore non-trivial COPYs.
+    // llvm::errs() << "Next MI: ";
+    // MI.dump();
+    //  Ignore non-trivial COPYs.
     std::optional<DestSourcePair> CopyOperands =
         isCopyInstr(MI, *TII, UseCopyInstr);
     if (CopyOperands && MI.getNumOperands() == 2) {
@@ -1326,8 +1324,7 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
           }
         } else {
           Tracker.invalidateRegister(MO.getReg().asMCReg(), *TRI, *TII,
-                                     UseCopyInstr,
-                                     DG);
+                                     UseCopyInstr, DG);
         }
       }
     }
@@ -1353,7 +1350,6 @@ void MachineCopyPropagation::BackwardCopyPropagateBlock(
     DG->fixupKills(MBB);
   }
 
-
   MaybeDeadCopies.clear();
   CopyDbgUsers.clear();
   Tracker.clear();
@@ -1716,8 +1712,8 @@ bool MachineCopyPropagation::runOnMachineFunction(MachineFunction &MF) {
     //
     // The reason for this two stage approach is that the ScheduleDAG can not
     // handle register renaming.
-    // QUESTION: I think these two stages could be merged together, if I were to change
-    // the renaming mechanism.
+    // QUESTION: I think these two stages could be merged together, if I were to
+    // change the renaming mechanism.
     //
     // The renaming wouldn't happen instantly. There would be a data structure
     // that contained what register should be renamed to what. Then after the

Collaborator

@qcolombet qcolombet left a comment

Only had a cursory look.
The direction seems fine (we compute a DDG and do some stuff with it) but I didn't look closely at the logic.
I remain skeptical that the compile time impact is negligible.
Could you use CTMark from the LLVM test suite to validate your initial measurements?

MBB == SrcInstr->getParent());

int SectionSize =
std::distance(SrcInstr->getIterator(), DstInstr->getIterator());
Collaborator

Can the size be negative?
I.e., are we sure DstInstr appears after SrcInstr?

Contributor Author

No, we should not have that. It should be an unsigned instead of an int.

Contributor Author

@spaits spaits Aug 22, 2024

I will keep using int for the section size. It is compared with a possibly negative int (DestinationFromSource) and is also useful for the assertion. Is that OK with you?

Collaborator

Yes, an assertion sounds good.

SUnit *SU = I.getSUnit();
MachineInstr &MI = *(SU->getInstr());
if (!IsRoot && &MI == SrcInstr)
return false;
Collaborator

What does the returned boolean mean?

Contributor Author

@spaits spaits Aug 22, 2024

It signals to the main loop that we have found an instruction that is not the root, that the copy propagation destination depends on it, and that it in turn has a dependency on the copy source.

This means that moving the instructions is not possible. If this lambda returns false, the main loop that does the BFS stops.

SUnit *SU = I.getSUnit();
MachineInstr &MI = *(SU->getInstr());
if (!IsRoot && &MI == SrcInstr)
return false;
Collaborator

Should we dequeue any of the units pushed?

(I don't understand what the queue is used for yet, but it feels weird that depending on how the Node->Preds is ordered we enqueue potentially different things.)
For instance, let's say we have two SUnits in Node->Preds and one of them is SrcInstr.
If SrcInstr is viewed first, the queue will be left untouched, but if it is seen last then the queue may have some additional entries.

Contributor Author

We are doing the dequeuing in the main loop. I should capture the queue instead of passing it as an argument to the lambda.

If we have an instruction that is not the Destination, is part of the dependency tree starting from the Destination (if the traversal gets there, it must be part of that tree), and has a dependency on the Source instruction, that means moving instructions between Destination and Source is not possible. We signal this to the main loop. The main loop returns and no move is performed.
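
To make the rule above concrete, here is a minimal stand-alone sketch of the check. The helper name, the worklist, and the node budget are illustrative; the patch implements this as a lambda plus a BFS inside moveInstructionsOutOfTheWayIfWeCan and additionally restricts the collected instructions to those located between the copy source and the copy destination:

#include <deque>
#include <optional>
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/ScheduleDAG.h"

// Walk the predecessor edges of the copy destination's SUnit with a BFS.
// If any node other than the root turns out to depend on the copy source,
// the dependencies cannot be untangled and we give up; otherwise the visited
// instructions are the candidates that have to be moved out of the way.
static std::optional<llvm::SmallVector<llvm::MachineInstr *>>
collectMovableDependencies(llvm::SUnit *DstSU, llvm::MachineInstr *SrcInstr,
                           unsigned NodeBudget) {
  llvm::SmallVector<llvm::MachineInstr *> ToMove;
  std::deque<llvm::SUnit *> Worklist{DstSU};
  llvm::SmallPtrSet<llvm::SUnit *, 16> Visited;
  Visited.insert(DstSU);

  while (!Worklist.empty()) {
    llvm::SUnit *SU = Worklist.front();
    Worklist.pop_front();
    bool IsRoot = (SU == DstSU);

    for (const llvm::SDep &Dep : SU->Preds) {
      llvm::SUnit *PredSU = Dep.getSUnit();
      llvm::MachineInstr *MI = PredSU->getInstr();
      if (!MI) // Skip the artificial entry/exit nodes.
        continue;
      // A non-root node that depends on the copy source blocks the move.
      if (!IsRoot && MI == SrcInstr)
        return std::nullopt;
      if (MI == SrcInstr || !Visited.insert(PredSU).second)
        continue;
      if (NodeBudget == 0)
        return std::nullopt; // Too many dependencies; stay conservative.
      --NodeBudget;
      ToMove.push_back(MI);
      Worklist.push_back(PredSU);
    }
  }
  return ToMove;
}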

Collaborator

We are doing the dequeuing in the main loop.

Thanks for the confirmation!

@spaits
Contributor Author

spaits commented Aug 22, 2024

@qcolombet Thank you for taking a look at this PR. Tomorrow I will check out CTMark.

Today I was also thinking about benchmarking. I have compiled coremark for RISC-V. Here are the results (only code sizes and compilation times were measured):

coremark (riscv), FLAGS: -O3 -ffunction-sections -fdata-sections, TARGET: rv32imafdc

               text section size   compile time (five compilations)
before patch   78746               0.76 + 0.90 + 0.85 + 0.87 + 0.88 = 4.26
after patch    78704               0.75 + 0.86 + 0.86 + 0.89 + 0.93 = 4.29

I think I could inline the lambda into the BFS; shall I do it?

@qcolombet
Collaborator

I think I could inline the lambda into the BFS; shall I do it?

I'm not sure what this would look like, but generally speaking I like subroutines (I even prefer static functions to lambdas) because it makes the contract between the functionality and the surrounding code more obvious and forces you to document the API (what the input arguments and the returned value mean!)

@spaits
Contributor Author

spaits commented Aug 25, 2024

I have done some benchmarking.

I used this cmake command for the benchmarks:

cmake -DCMAKE_C_COMPILER=llvm-project/build/bin/clang  -DCMAKE_CXX_COMPILER=llvm-project/build/bin/clang++ -DTEST_SUITE_BENCHMARKING_ONLY=1 .. -GNinja

I used this command to compile:

ninja -j1

This is the llvm-lit invocation:

llvm-lit -v -j 1 -o resnew21.json . && /home/spaits/repo/llvm-project/build/bin/llvm-lit -v -j 1 -o resnew22.json .

I ran compare with this command:

python3 ../utils/compare.py resnew21.json resnew22.json vs resold21.json resold22.json

And here are the results for compile time:

Tests: 1173
Metric: compile_time

/home/spaits/repo/llvm-test-suite/build/../utils/compare.py:206: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  name0 = names[0]
Program                                       compile_time             
                                              lhs          rhs    diff 
SingleSource/Benchmarks/Misc/pi                 0.14         0.16 13.2%
SingleSource/Benchmarks/Misc/fp-convert         0.24         0.26  6.1%
SingleSour...bench/stencils/fdtd-2d/fdtd-2d     1.63         1.73  6.0%
SingleSour...h/stencils/jacobi-1d/jacobi-1d     0.93         0.98  5.7%
SingleSour...near-algebra/kernels/bicg/bicg     0.81         0.86  5.2%
SingleSour...h/linear-algebra/solvers/lu/lu     1.18         1.22  3.4%
SingleSour...-algebra/solvers/durbin/durbin     0.89         0.92  3.4%
MultiSourc...Benchmarks/Olden/health/health     1.38         1.43  3.4%
SingleSource/Benchmarks/Misc/salsa20            0.56         0.58  3.1%
SingleSour...rks/Polybench/stencils/adi/adi     0.95         0.98  2.8%
SingleSour...bench/medley/nussinov/nussinov     1.13         1.16  2.6%
SingleSource/Benchmarks/Misc/mandel-2           0.28         0.29  2.4%
SingleSour...Benchmarks/Misc/matmul_f64_4x4     0.30         0.31  2.3%
SingleSource/Benchmarks/SmallPT/smallpt         1.95         2.00  2.3%
MultiSourc...arks/FreeBench/distray/distray     1.06         1.08  2.2%
                           Geomean difference                     -1.8%
      compile_time                         
l/r            lhs          rhs        diff
count  1173.000000  1173.000000  284.000000
mean   5.521519     5.432766    -0.018112  
std    35.607681    35.240664    0.026822  
min    0.000000     0.000000    -0.093135  
25%    0.000000     0.000000    -0.033920  
50%    0.000000     0.000000    -0.016928  
75%    0.000000     0.000000    -0.004431  
max    891.948600   888.081400   0.132171  

And for execution time:

Metric: exec_time

/home/spaits/repo/llvm-test-suite/build/../utils/compare.py:206: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  name0 = names[0]
Program                                       exec_time               
                                              lhs       rhs    diff   
SingleSour...Benchmarks/Stanford/Oscar.test     0.00      0.00    inf%
SingleSour...ncils/jacobi-1d/jacobi-1d.test     0.00      0.00    inf%
MultiSourc.../Prolangs-C/bison/mybison.test     0.00      0.00    inf%
MultiSourc...adpcm/rawcaudio/rawcaudio.test     0.00      0.00  140.0%
MultiSourc...abench/jpeg/jpeg-6a/cjpeg.test     0.00      0.00  100.0%
MultiSourc...ks/Prolangs-C++/city/city.test     0.00      0.00   61.5%
MultiSourc...cCat/03-testtrie/testtrie.test     0.00      0.00   60.0%
SingleSour...ootout/Shootout-ackermann.test     0.01      0.01   51.9%
SingleSour...s/BenchmarkGame/recursive.test     0.42      0.59   40.0%
SingleSour...tout-C++/Shootout-C++-ary.test     0.01      0.01   35.0%
SingleSour...++/Shootout-C++-ackermann.test     0.66      0.87   31.0%
SingleSour...out-C++/Shootout-C++-ary2.test     0.01      0.01   25.8%
MicroBench...st:BM_DIFF_PREDICT_LAMBDA/5001    23.13     29.09   25.7%
MultiSourc...telecomm-FFT/telecomm-fft.test     0.01      0.01   24.6%
MultiSourc...lications/ClamAV/clamscan.test     0.04      0.05   21.8%
                           Geomean difference                  -100.0%
/home/spaits/.local/lib/python3.10/site-packages/pandas/core/nanops.py:1016: RuntimeWarning: invalid value encountered in subtract
  sqr = _ensure_numeric((avg - values) ** 2)
           exec_time                            
l/r              lhs            rhs         diff
count  1173.000000    1173.000000    1149.000000
mean   2071.330440    2088.736758    inf        
std    25427.791777   25650.260153  NaN         
min    0.000000       0.000000      -1.000000   
25%    1.204800       1.205000      -0.010028   
50%    5.687519       5.608971       0.000000   
75%    133.613683     132.687198     0.017685   
max    643010.637615  646757.697936  inf 

text section sizes:

Tests: 1173
Metric: size..text

/home/spaits/repo/llvm-test-suite/build/../utils/compare.py:206: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  name0 = names[0]
Program                                       size..text                
                                              lhs        rhs       diff 
MultiSourc...nchmarks/Olden/treeadd/treeadd      589.00     605.00  2.7%
SingleSour...chmarks/BenchmarkGame/fannkuch     1619.00    1635.00  1.0%
MultiSourc...enchmarks/McCat/17-bintr/bintr     1781.00    1797.00  0.9%
MultiSourc...work-patricia/network-patricia     1999.00    2015.00  0.8%
MultiSourc...chmarks/McCat/04-bisect/bisect     3261.00    3277.00  0.5%
SingleSource/Benchmarks/McGill/chomp            6641.00    6673.00  0.5%
SingleSour.../Benchmarks/Misc-C++/Large/ray     4577.00    4593.00  0.3%
MultiSource/Benchmarks/Olden/bh/bh             12026.00   12058.00  0.3%
MultiSourc...e/Applications/SIBsim4/SIBsim4    45430.00   45494.00  0.1%
Bitcode/Be...hmarks/Halide/blur/halide_blur    34776.00   34824.00  0.1%
MultiSource/Benchmarks/sim/sim                 17775.00   17791.00  0.1%
MultiSourc...e/Benchmarks/MallocBench/gs/gs   152609.00  152737.00  0.1%
Bitcode/Be...ral_grid/halide_bilateral_grid    58680.00   58728.00  0.1%
MultiSourc...hmarks/MallocBench/cfrac/cfrac    20661.00   20677.00  0.1%
MultiSourc...e/Applications/minisat/minisat    21741.00   21757.00  0.1%
                           Geomean difference                       0.0%
          size..text                           
l/r              lhs            rhs        diff
count  1173.000000    1173.000000    318.000000
mean   17475.696505   17479.068201   0.000287  
std    77419.292919   77433.314426   0.001806  
min    0.000000       0.000000       0.000000  
25%    0.000000       0.000000       0.000000  
50%    0.000000       0.000000       0.000000  
75%    786.000000     786.000000     0.000000  
max    906913.000000  907121.000000  0.027165 

I should do some fine-tuning:

  • Only enable this optimization at O2, O3, or Os.
  • If O2 or O3 is enabled, then take instruction latencies into account when doing the instruction moving.
  • If Os (size optimization, if I am correct) is enabled, then do the optimization regardless of latencies.

What do you think: would this PR be fine for O2, O3, or Os? (A rough sketch of how such gating could look follows below.)
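
For illustration, here is how that gating could be wired up inside the pass. This is only an assumption about a possible shape, not code from the patch; CodeGenOptLevel::Default/Aggressive correspond to -O2/-O3, and Function::hasOptSize() covers -Os/-Oz:

#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/CodeGen.h"
#include "llvm/Target/TargetMachine.h"

// Decide whether the dependency-moving variant of the backward propagation
// should run for this function (sketch only).
static bool shouldMoveDependencies(const llvm::MachineFunction &MF) {
  llvm::CodeGenOptLevel OptLevel = MF.getTarget().getOptLevel();
  bool OptForSize = MF.getFunction().hasOptSize();
  // At -Os/-Oz always try, since smaller code is the goal; at -O2/-O3 a
  // later refinement could additionally consult instruction latencies
  // before actually moving anything.
  return OptForSize || OptLevel >= llvm::CodeGenOptLevel::Default;
}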

@bzEq
Collaborator

bzEq commented Aug 25, 2024

I'm still concerned about introducing rescheduling of instructions in MCP. Is there a possibility to enhance the pre-RA machine scheduler to achieve the same effect?
For example, the changes in llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll can also be achieved by disabling pre-RA machine scheduling, i.e., -enable-misched=0.

@spaits
Contributor Author

spaits commented Aug 25, 2024

I'm still concerned about introducing rescheduling of instructions in MCP. Is there a possibility to enhance the pre-RA machine scheduler to achieve the same effect?
For example, the changes in llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll can also be achieved by disabling pre-RA machine scheduling, i.e., -enable-misched=0.

I don't really think the logic of this would fit into the current scheduling mechanism.

  • The scheduler operates on a scheduler region level, not on a basic block level. As far as I know, the scheduler regions are smaller than a basic block.

  • This could be worked around, but the result would be worse, since the copy propagations would happen at a smaller scope.

  • We couldn't really integrate the stuff into the scheduler. What we would have is this:
    -- Modify backward propagation, so it doesn't actually propagate, but just gives back potential propagations.
    -- Then call the existing logic for moveInstructionsOutOfTheWayIfWeCan (I promise I will find a better name for it :) ).
    -- All this would happen after scheduling. So basically it would work the same, but at a different place.

In conclusion, just the current code would be moved, and we would have less code reuse.

So I think this logic is a bit incompatible with the scheduler. The only thing that is common between them is the use of a Dependency graph.

Or maybe I could take a whole different approach:

  • Do a copy prop, save the potential copy propagations.
  • When scheduling take these into account somehow.

This would be less general, I think, because if a dependency that blocks some copies is not caused by the scheduler, it might not be recognized and moved. Also, it would only work in scheduling regions, so this might make it less effective.

@spaits
Contributor Author

spaits commented Aug 25, 2024

For example, changes in llvm/test/CodeGen/AArch64/GlobalISel/arm64-atomic.ll can be also achived by disabling pre-RA machine scheduling, i.e., -enable-misched=0.

This doesn't work in all cases. It only works if the scheduler is the one that "spoils" the data dependencies. For example, let's look at the llvm/test/CodeGen/X86/xmulo.ll test:

Here is one function that this patch improves:

define zeroext i1 @smuloi8(i8 %v1, i8 %v2, ptr %res) {
  %t = call {i8, i1} @llvm.smul.with.overflow.i8(i8 %v1, i8 %v2)
  %val = extractvalue {i8, i1} %t, 0
  %obit = extractvalue {i8, i1} %t, 1
  store i8 %val, ptr %res
  ret i1 %obit
}

If I compile it without my patch with this command:

bin/llc -disable-peephole -enable-misched=0 -mtriple=x86_64-linux-unknown X86ex.txt -o oldNoPreRASched.s

I get:

smuloi8:                                # @smuloi8
# %bb.0:
	movl	%edi, %eax
	imulb	%sil
	seto	%cl
	movb	%al, (%rdx)
	movl	%ecx, %eax
	retq

The same result is produced whether or not we pass the -enable-misched=0 flag.

With my patch with the command:

bin/llc -disable-peephole -mtriple=x86_64-linux-unknown X86ex.txt -o newPreRASched.s

We get:

smuloi8:                                # @smuloi8
# %bb.0:
	movl	%edi, %eax
	imulb	%sil
	movb	%al, (%rdx)
	seto	%al
	retq

So there are also cases that are not closely related to the scheduler.

@qcolombet
Collaborator

Couple of comments:

  • If I am not mistaken what you are reporting is not CTMark. Your cmake command should have -DTEST_SUITE_SUBDIRS=CTMark
  • Usually the baseline is on the lhs, but here it is the opposite. That's fine but it is surprising at first.
  • How was your compiler compiled? (Release, release + asserts, ...?)

Could you re-run with CTMark?
The current tests are too small to be relevant (sub 1 second for most of them).

@spaits
Contributor Author

spaits commented Aug 26, 2024

* If I am not mistaken what you are reporting is not CTMark. Your cmake command should have `-DTEST_SUITE_SUBDIRS=CTMark`

Okay, I will change that. I thought the benchmarking suite includes CTMark, based on this: https://llvm.org/docs/TestSuiteGuide.html#common-configuration-options. I used the TEST_SUITE_BENCHMARKING_ONLY flag, which is described like this:

Disable tests that are unsuitable for performance measurements. The disabled tests either run for a very short time or are dominated by I/O performance making them unsuitable as compiler performance tests.
* Usually the baseline is on the lhs, but here it is the opposite. That's fine but it is surprising at first.

Will change that for the next measurement.

* How was your compiler compiled? (Release, release + asserts, ...?)

Debug + dynamically linked. I should do release plus statically linked, right?

Could you re-run with CTMark? The current tests are too small to be relevant (sub 1 second for most of them).

I will do that.

@spaits
Contributor Author

spaits commented Aug 26, 2024

Here are the updated results. I used this cmake command now:

cmake -DCMAKE_C_COMPILER=/home/spaits/repo/spare-llvm/llvm-project/build/bin/clang -DCMAKE_CXX_FLAGS='-lstdc++ -lrt -lm -lpthread'   -DTEST_SUITE_SUBDIRS=CTMark .. -GNinja

My clang was compiled in release mode.

This time I have run the benchmarks four times for the baseline and four times for my changes and merged them with compare.py.
Here are the results. Now the baseline is on the left.
Compile time:

Metric: compile_time

/home/spaits/repo/llvm-test-suite/build/../utils/compare.py:206: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  name0 = names[0]
Program                                       compile_time             
                                              lhs          rhs    diff 
kimwitu++/kc                                   36.90        41.08 11.3%
mafft/pairlocalalign                           22.59        24.13  6.8%
consumer-typeset/consumer-typeset              27.19        28.21  3.8%
tramp3d-v4/tramp3d-v4                          35.39        35.97  1.6%
Bullet/bullet                                  84.59        85.60  1.2%
lencod/lencod                                  43.72        43.72  0.0%
7zip/7zip-benchmark                           178.00       175.89 -1.2%
ClamAV/clamscan                                45.49        44.81 -1.5%
sqlite3/sqlite3                                14.66        14.25 -2.8%
SPASS/SPASS                                    40.12        38.79 -3.3%
                           Geomean difference                      1.5%
      compile_time                      
l/r            lhs        rhs       diff
count  10.000000    10.00000   10.000000
mean   52.864370    53.24520   0.015967 
std    47.796605    47.02377   0.045956 
min    14.659300    14.25310  -0.033186 
25%    29.238775    30.15100  -0.014311 
50%    38.510750    39.93680   0.006018 
75%    45.049625    44.53395   0.032314 
max    177.996800   175.89140  0.113380 

Execution time:

Metric: exec_time

/home/spaits/repo/llvm-test-suite/build/../utils/compare.py:206: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  name0 = names[0]
Program                                       exec_time              
                                              lhs       rhs    diff  
kimwitu++/kc                                    0.00      0.01 163.3%
tramp3d-v4/tramp3d-v4                           0.06      0.06   5.3%
lencod/lencod                                   1.74      1.82   4.6%
SPASS/SPASS                                     3.39      3.48   2.7%
Bullet/bullet                                   1.56      1.59   2.1%
7zip/7zip-benchmark                             4.71      4.77   1.4%
ClamAV/clamscan                                 0.05      0.05   1.2%
mafft/pairlocalalign                            8.40      8.33  -0.8%
sqlite3/sqlite3                                 1.00      0.99  -1.2%
consumer-typeset/consumer-typeset               0.05      0.05  -2.1%
                           Geomean difference                   11.6%
       exec_time                      
l/r          lhs        rhs       diff
count  10.000000  10.000000  10.000000
mean   2.094720   2.114030   0.176381 
std    2.722439   2.714371   0.512490 
min    0.003000   0.007900  -0.021413 
25%    0.053225   0.054475  -0.003368 
50%    1.279450   1.289500   0.017292 
75%    2.975650   3.064350   0.041438 
max    8.396700   8.326000   1.633333

Text section size:

Metric: size..text

/home/spaits/repo/llvm-test-suite/build/../utils/compare.py:206: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  name0 = names[0]
Program                                       size..text                
                                              lhs        rhs       diff 
mafft/pairlocalalign                          452119.00  452103.00 -0.0%
kimwitu++/kc                                  403971.00  403955.00 -0.0%
lencod/lencod                                 763725.00  763693.00 -0.0%
tramp3d-v4/tramp3d-v4                         884131.00  884083.00 -0.0%
Bullet/bullet                                 726622.00  726558.00 -0.0%
consumer-typeset/consumer-typeset             442577.00  442529.00 -0.0%
7zip/7zip-benchmark                           907121.00  906913.00 -0.0%
ClamAV/clamscan                               540338.00  540178.00 -0.0%
sqlite3/sqlite3                               490639.00  490447.00 -0.0%
SPASS/SPASS                                   505890.00  505586.00 -0.1%
                           Geomean difference                      -0.0%
          size..text                          
l/r              lhs            rhs       diff
count  10.000000      10.000000      10.000000
mean   611713.300000  611604.500000 -0.000189 
std    190314.056507  190310.634149  0.000190 
min    403971.000000  403955.000000 -0.000601 
25%    461749.000000  461689.000000 -0.000279 
50%    523114.000000  522882.000000 -0.000098 
75%    754449.250000  754409.250000 -0.000045 
max    907121.000000  906913.000000 -0.000035 

@spaits
Contributor Author

spaits commented Aug 27, 2024

Maybe it would be a good idea to introduce a limit in the dependency checking: for example, at most ten dependencies could be checked, and if there are more, then we don't do anything.
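
One way to make that cap tunable is a command-line knob plus an early bail-out in the walk. The flag name below is hypothetical and the snippet is only a sketch of the idea, not part of the patch:

#include "llvm/Support/CommandLine.h"

// Hypothetical knob: once the budget is exhausted, the dependency walk gives
// up and no instructions are moved for that copy.
static llvm::cl::opt<unsigned> MCPMaxDepNodes(
    "mcp-max-dep-nodes", llvm::cl::Hidden, llvm::cl::init(10),
    llvm::cl::desc("Maximum number of dependency graph nodes inspected "
                   "before giving up on moving instructions in MCP"));

// Inside the BFS over the dependency graph:
//   if (++NumProcessedNode > MCPMaxDepNodes)
//     return false; // too many dependencies, stay conservative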

@spaits
Contributor Author

spaits commented Aug 27, 2024

When decreasing the node limit to 10, we no longer see a major effect on the compile time:

Metric: compile_time

/home/spaits/repo/llvm-test-suite/build/../utils/compare.py:206: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  name0 = names[0]
Program                                       compile_time             
                                              lhs          rhs    diff 
mafft/pairlocalalign                           22.59        23.23  2.8%
tramp3d-v4/tramp3d-v4                          35.39        36.21  2.3%
sqlite3/sqlite3                                14.66        14.95  2.0%
consumer-typeset/consumer-typeset              27.19        27.54  1.3%
kimwitu++/kc                                   36.90        37.32  1.1%
lencod/lencod                                  43.72        43.99  0.6%
Bullet/bullet                                  84.59        83.49 -1.3%
ClamAV/clamscan                                45.49        44.86 -1.4%
7zip/7zip-benchmark                           178.00       170.98 -3.9%
SPASS/SPASS                                    40.12        38.26 -4.6%
                           Geomean difference                     -0.1%
      compile_time                       
l/r            lhs         rhs       diff
count  10.000000    10.000000   10.000000
mean   52.864370    52.082750  -0.001092 
std    47.796605    45.601328   0.026105 
min    14.659300    14.948100  -0.046405 
25%    29.238775    29.706625  -0.013767 
50%    38.510750    37.789450   0.008828 
75%    45.049625    44.639600   0.017983 
max    177.996800   170.979400  0.028410

And another measurement:

Metric: compile_time

/home/spaits/repo/llvm-test-suite/build/../utils/compare.py:206: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  name0 = names[0]
Program                                       compile_time             
                                              lhs          rhs    diff 
tramp3d-v4/tramp3d-v4                          35.91        36.21  0.8%
7zip/7zip-benchmark                           169.90       170.98  0.6%
Bullet/bullet                                  82.98        83.49  0.6%
sqlite3/sqlite3                                14.87        14.95  0.6%
SPASS/SPASS                                    38.07        38.26  0.5%
lencod/lencod                                  43.81        43.99  0.4%
kimwitu++/kc                                   37.22        37.32  0.3%
ClamAV/clamscan                                44.74        44.86  0.3%
mafft/pairlocalalign                           23.34        23.23 -0.5%
consumer-typeset/consumer-typeset              28.06        27.54 -1.9%
                           Geomean difference                      0.2%
      compile_time                       
l/r            lhs         rhs       diff
count  10.000000    10.000000   10.000000
mean   51.889700    52.082750   0.001741 
std    45.243923    45.601328   0.007972 
min    14.865300    14.948100  -0.018582 
25%    30.023225    29.706625   0.002573 
50%    37.646100    37.789450   0.004540 
75%    44.508800    44.639600   0.006052 
max    169.895800   170.979400  0.008292

Here are code sizes:

Tests: 10
Metric: size..text

/home/spaits/repo/llvm-test-suite/build/../utils/compare.py:206: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  name0 = names[0]
Program                                       size..text                
                                              lhs        rhs       diff 
lencod/lencod                                 763725.00  763725.00  0.0%
mafft/pairlocalalign                          452119.00  452103.00 -0.0%
kimwitu++/kc                                  403971.00  403955.00 -0.0%
tramp3d-v4/tramp3d-v4                         884131.00  884083.00 -0.0%
Bullet/bullet                                 726622.00  726574.00 -0.0%
consumer-typeset/consumer-typeset             442577.00  442529.00 -0.0%
7zip/7zip-benchmark                           907121.00  906945.00 -0.0%
ClamAV/clamscan                               540338.00  540178.00 -0.0%
sqlite3/sqlite3                               490639.00  490447.00 -0.0%
SPASS/SPASS                                   505890.00  505586.00 -0.1%
                           Geomean difference                      -0.0%
          size..text                         
l/r              lhs           rhs       diff
count  10.000000      10.00000      10.000000
mean   611713.300000  611612.50000 -0.000179 
std    190314.056507  190320.06691  0.000195 
min    403971.000000  403955.00000 -0.000601 
25%    461749.000000  461689.00000 -0.000271 
50%    523114.000000  522882.00000 -0.000087 
75%    754449.250000  754437.25000 -0.000043 
max    907121.000000  906945.00000  0.000000

and the exec time:

Tests: 10
Metric: exec_time

/home/spaits/repo/llvm-test-suite/build/../utils/compare.py:206: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  name0 = names[0]
Program                                       exec_time              
                                              lhs       rhs    diff  
sqlite3/sqlite3                                 1.00      1.02   2.5%
ClamAV/clamscan                                 0.05      0.05   2.3%
consumer-typeset/consumer-typeset               0.05      0.05   1.8%
SPASS/SPASS                                     3.42      3.47   1.5%
tramp3d-v4/tramp3d-v4                           0.06      0.06   0.3%
lencod/lencod                                   1.81      1.81   0.2%
Bullet/bullet                                   1.60      1.59  -0.4%
7zip/7zip-benchmark                             4.84      4.80  -0.8%
mafft/pairlocalalign                            8.87      8.77  -1.1%
kimwitu++/kc                                    0.01      0.01 -10.0%
                           Geomean difference                   -0.4%
       exec_time                      
l/r          lhs        rhs       diff
count  10.000000  10.000000  10.000000
mean   2.169180   2.163620  -0.003505 
std    2.856737   2.829455   0.036214 
min    0.009000   0.008100  -0.100000 
25%    0.054050   0.055000  -0.006569 
50%    1.297800   1.307700   0.002756 
75%    3.017175   3.056900   0.016900 
max    8.865500   8.769800   0.025436 

@qcolombet
Collaborator

When decreasing the node limit to 10, we no longer see a major effect on the compile time:

The impact is still significant on the compile time without a huge impact on the generated code.
What is the reason for tests with faster compile time?

This is surprising to me.

@spaits
Contributor Author

spaits commented Aug 29, 2024

The impact is still significant on the compile time without a huge impact on the generated code.
What is the reason for tests with faster compile time?

This is surprising to me.

I don't know. I was surprised too.

So basically the benchmarking process is:

  • I compile a version of llvm with my patch
  • I go to the benchmark build dir and run cmake
  • run ninja
  • run the lit command four times; each time a different json file is produced
  • let the laptop cool down for 10 mins
  • compile a version of llvm without my patch
    and the other steps are the same as before

Only one cycle has been done with the new node-restricted patch.

Those results were compared against a without-patch measurement set that was produced before (so 4 files vs 4 files).
I saw -0.1, so I decided to dig up an older without-patch measurement set and also compared that to the result of this measurement. There I got +0.2.

This is why I assumed that the impact is not significant: once it is slightly positive, once it is slightly negative.

@spaits
Contributor Author

spaits commented Aug 29, 2024

Maybe another interesting approach is to build the scheduler graph only for the code section between the copy source and the copy destination. This way we only spend resources where there is a high chance of a benefit.
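
A minimal sketch of that idea using the ScheduleDAGInstrs region API is below. It is essentially what the updated patch does inside moveInstructionsOutOfTheWayIfWeCan (see the clang-format output above); ScheduleDAGMCP is the patch's ScheduleDAGInstrs subclass, while the wrapper function and its name here are illustrative and assume the code lives in MachineCopyPropagation.cpp:

// Build the dependency graph only for the instructions between the copy
// source and the copy destination (inclusive), instead of the whole block.
// SrcInstr is expected to precede DstInstr in the same basic block.
static void buildGraphForCopyRange(ScheduleDAGMCP &DG, MachineInstr *SrcInstr,
                                   MachineInstr *DstInstr) {
  MachineBasicBlock *MBB = SrcInstr->getParent();
  assert(MBB == DstInstr->getParent() && "copy and use must share a block");

  unsigned NumInstrs =
      std::distance(SrcInstr->getIterator(), DstInstr->getIterator()) + 1;

  DG.startBlock(MBB);
  // The region is [SrcInstr, DstInstr]; the end iterator is one past Dst.
  DG.enterRegion(MBB, SrcInstr->getIterator(),
                 std::next(DstInstr->getIterator()), NumInstrs);
  DG.buildSchedGraph(/*AA=*/nullptr);

  // ... query DG.getSUnit(...) and walk the dependencies here ...

  DG.exitRegion();
  DG.finishBlock();
}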

@spaits
Contributor Author

spaits commented Aug 29, 2024

Also, maybe I should run the benchmarks 10 to 20 times instead of just 4 times, and merge and compare those results.

@spaits
Contributor Author

spaits commented Aug 30, 2024

I have done yet another round of benchmarking. Now, for each version, the benchmark suite was compiled five times, and for each compilation I ran llvm-lit 4 times.

The following three versions were considered:

There is a compile-time regression relative to the state when the DDG was built for whole basic blocks.

Left (baseline), right (DDG for regions).

Compile time:

python3 ../utils/compare.py -m compile_time resold10.json resold11.json resold12.json resold13.json resold14.json resold15.json resold16.json resold17.json resold18.json resold19.json resold110.json resold111.json resold112.json resold113.json resold114.json resold115.json resold116.json resold117.json resold118.json resold119.json  vs resnew10.json resnew11.json resnew12.json resnew13.json resnew14.json resnew15.json resnew16.json resnew17.json resnew18.json resnew19.json resnew110.json resnew111.json resnew112.json resnew113.json resnew114.json resnew115.json resnew116.json resnew117.json resnew118.json resnew119.json
Tests: 10
Metric: compile_time

/home/spaits/repo/llvm-test-suite/build/../utils/compare.py:206: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  name0 = names[0]
Program                                       compile_time             
                                              lhs          rhs    diff 
lencod/lencod                                  42.12        43.52  3.3%
kimwitu++/kc                                   36.16        37.06  2.5%
tramp3d-v4/tramp3d-v4                          35.01        35.65  1.8%
mafft/pairlocalalign                           22.52        22.85  1.4%
Bullet/bullet                                  81.81        82.85  1.3%
7zip/7zip-benchmark                           174.25       176.05  1.0%
consumer-typeset/consumer-typeset              27.11        27.33  0.8%
SPASS/SPASS                                    37.69        37.78  0.3%
ClamAV/clamscan                                44.05        44.03 -0.0%
sqlite3/sqlite3                                14.02        13.97 -0.3%
                           Geomean difference                      1.2%
      compile_time                       
l/r            lhs         rhs       diff
count  10.000000    10.000000   10.000000
mean   51.474390    52.109240   0.012104 
std    46.747508    47.224755   0.011271 
min    14.017200    13.974600  -0.003039 
25%    29.084400    29.413425   0.003998 
50%    36.923200    37.421600   0.011474 
75%    43.570300    43.903125   0.017345 
max    174.252400   176.048400  0.033065

Exec_time:

python3 ../utils/compare.py -m exec_time resold10.json resold11.json resold12.json resold13.json resold14.json resold15.json resold16.json resold17.json resold18.json resold19.json resold110.json resold111.json resold112.json resold113.json resold114.json resold115.json resold116.json resold117.json resold118.json resold119.json  vs resnew10.json resnew11.json resnew12.json resnew13.json resnew14.json resnew15.json resnew16.json resnew17.json resnew18.json resnew19.json resnew110.json resnew111.json resnew112.json resnew113.json resnew114.json resnew115.json resnew116.json resnew117.json resnew118.json resnew119.json
Tests: 10
Metric: exec_time

/home/spaits/repo/llvm-test-suite/build/../utils/compare.py:206: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  name0 = names[0]
Program                                       exec_time              
                                              lhs       rhs    diff  
sqlite3/sqlite3                                 1.01      1.03   2.4%
SPASS/SPASS                                     3.50      3.55   1.3%
ClamAV/clamscan                                 0.05      0.05   1.0%
mafft/pairlocalalign                            8.75      8.82   0.8%
lencod/lencod                                   1.80      1.81   0.7%
tramp3d-v4/tramp3d-v4                           0.06      0.06  -0.6%
Bullet/bullet                                   1.65      1.63  -0.8%
7zip/7zip-benchmark                             4.87      4.82  -1.1%
consumer-typeset/consumer-typeset               0.04      0.04  -6.3%
kimwitu++/kc                                    0.01      0.01 -14.3%
                           Geomean difference                   -1.8%
       exec_time                      
l/r          lhs        rhs       diff
count  10.000000  10.000000  10.000000
mean   2.174090   2.182130  -0.016837 
std    2.833687   2.847120   0.050182 
min    0.007000   0.006000  -0.142857 
25%    0.051375   0.051650  -0.009887 
50%    1.327900   1.333600   0.000346 
75%    3.075775   3.112825   0.009726 
max    8.749700   8.816100   0.023854

Size of text section:

python3 ../utils/compare.py -m size..text resold10.json resold11.json resold12.json resold13.json resold14.json resold15.json resold16.json resold17.json resold18.json resold19.json resold110.json resold111.json resold112.json resold113.json resold114.json resold115.json resold116.json resold117.json resold118.json resold119.json  vs resnew10.json resnew11.json resnew12.json resnew13.json resnew14.json resnew15.json resnew16.json resnew17.json resnew18.json resnew19.json resnew110.json resnew111.json resnew112.json resnew113.json resnew114.json resnew115.json resnew116.json resnew117.json resnew118.json resnew119.json
Tests: 10
Metric: size..text

/home/spaits/repo/llvm-test-suite/build/../utils/compare.py:206: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  name0 = names[0]
Program                                       size..text                
                                              lhs        rhs       diff 
lencod/lencod                                 763725.00  763725.00  0.0%
mafft/pairlocalalign                          452119.00  452103.00 -0.0%
kimwitu++/kc                                  403971.00  403955.00 -0.0%
tramp3d-v4/tramp3d-v4                         884131.00  884083.00 -0.0%
Bullet/bullet                                 726622.00  726574.00 -0.0%
consumer-typeset/consumer-typeset             442577.00  442529.00 -0.0%
7zip/7zip-benchmark                           907121.00  906945.00 -0.0%
ClamAV/clamscan                               540338.00  540178.00 -0.0%
sqlite3/sqlite3                               490639.00  490447.00 -0.0%
SPASS/SPASS                                   505890.00  505586.00 -0.1%
                           Geomean difference                      -0.0%
          size..text                         
l/r              lhs           rhs       diff
count  10.000000      10.00000      10.000000
mean   611713.300000  611612.50000 -0.000179 
std    190314.056507  190320.06691  0.000195 
min    403971.000000  403955.00000 -0.000601 
25%    461749.000000  461689.00000 -0.000271 
50%    523114.000000  522882.00000 -0.000087 
75%    754449.250000  754437.25000 -0.000043 
max    907121.000000  906945.00000  0.000000

I have done these benchmarks on my laptop.
I have a fairly strong laptop with a 12th Gen Intel i7-1265U (12) @ 4.800GHz and 32 GB of RAM, but I don't think I can measure things like compile time and runtime effectively. Just the heat of the laptop can add or remove whole percents from the compile time and exec time. I tried to conduct these measurements so that each benchmark session begins in the same state, but it is really hard.
When running llvm-lit, only the runtime results change; the compile time is decided when running ninja.
So basically, when comparing exec time, there were really 20 results merged and compared, but when comparing the compile-time results there were really just 5 results merged and compared. So, after the code size, which is a fairly static thing (you can compile the code 100 times with the same compiler and you get the same code size each time), the execution time is the most accurate result.

I don't have better equipment right now. Also, I have only checked the x86-64 target, which is not that prone to patterns like the one addressed by this patch. I think the best would be to try this out on ARM or RISC-V and do the compilation in a more consistent environment.

Also one more possible improvement:

Since I only build the DDG when needed, we no longer have to deal with register renames after the DDG build, so this whole thing can be done in one stage again.
I will try that; it may reduce compile time further.
