[X86] For minsize memset/memcpy, use byte or double-word accesses #87003
@llvm/pr-subscribers-backend-x86
Author: AtariDreams (AtariDreams)

Changes: Assume AlwaysInline is true by the time we reach the getMemset code, since inlining ought to be profitable by then, just as getMemcpy already does.

Full diff: https://github.com/llvm/llvm-project/pull/87003.diff
2 Files Affected:
diff --git a/llvm/lib/Target/X86/X86SelectionDAGInfo.cpp b/llvm/lib/Target/X86/X86SelectionDAGInfo.cpp
index 7c630a2b0da080..50d273e69ada44 100644
--- a/llvm/lib/Target/X86/X86SelectionDAGInfo.cpp
+++ b/llvm/lib/Target/X86/X86SelectionDAGInfo.cpp
@@ -66,8 +66,10 @@ SDValue X86SelectionDAGInfo::EmitTargetCodeForMemset(
// If not DWORD aligned or size is more than the threshold, call the library.
// The libc version is likely to be faster for these cases. It can use the
// address value and run time information about the CPU.
- if (Alignment < Align(4) || !ConstantSize ||
- ConstantSize->getZExtValue() > Subtarget.getMaxInlineSizeThreshold())
+ if (!ConstantSize ||
+ (!AlwaysInline &&
+ (Alignment < Align(4) ||
+ ConstantSize->getZExtValue() > Subtarget.getMaxInlineSizeThreshold())))
return SDValue();
uint64_t SizeVal = ConstantSize->getZExtValue();
@@ -142,7 +144,7 @@ SDValue X86SelectionDAGInfo::EmitTargetCodeForMemset(
DAG.getNode(ISD::ADD, dl, AddrVT, Dst,
DAG.getConstant(Offset, dl, AddrVT)),
Val, DAG.getConstant(BytesLeft, dl, SizeVT), Alignment,
- isVolatile, AlwaysInline,
+ isVolatile, /* AlwaysInline */ true,
/* isTailCall */ false, DstPtrInfo.getWithOffset(Offset));
}
diff --git a/llvm/test/CodeGen/X86/memset-vs-memset-inline.ll b/llvm/test/CodeGen/X86/memset-vs-memset-inline.ll
index b8fdd936b43895..16022c6cbb3934 100644
--- a/llvm/test/CodeGen/X86/memset-vs-memset-inline.ll
+++ b/llvm/test/CodeGen/X86/memset-vs-memset-inline.ll
@@ -28,137 +28,10 @@ define void @regular_memset_calls_external_function(ptr %a, i8 %value) nounwind
define void @inlined_set_doesnt_call_external_function(ptr %a, i8 %value) nounwind {
; CHECK-LABEL: inlined_set_doesnt_call_external_function:
; CHECK: # %bb.0:
-; CHECK-NEXT: movzbl %sil, %ecx
-; CHECK-NEXT: movabsq $72340172838076673, %rax # imm = 0x101010101010101
-; CHECK-NEXT: imulq %rcx, %rax
-; CHECK-NEXT: movq %rax, 1016(%rdi)
-; CHECK-NEXT: movq %rax, 1008(%rdi)
-; CHECK-NEXT: movq %rax, 1000(%rdi)
-; CHECK-NEXT: movq %rax, 992(%rdi)
-; CHECK-NEXT: movq %rax, 984(%rdi)
-; CHECK-NEXT: movq %rax, 976(%rdi)
-; CHECK-NEXT: movq %rax, 968(%rdi)
-; CHECK-NEXT: movq %rax, 960(%rdi)
-; CHECK-NEXT: movq %rax, 952(%rdi)
-; CHECK-NEXT: movq %rax, 944(%rdi)
-; CHECK-NEXT: movq %rax, 936(%rdi)
-; CHECK-NEXT: movq %rax, 928(%rdi)
-; CHECK-NEXT: movq %rax, 920(%rdi)
-; CHECK-NEXT: movq %rax, 912(%rdi)
-; CHECK-NEXT: movq %rax, 904(%rdi)
-; CHECK-NEXT: movq %rax, 896(%rdi)
-; CHECK-NEXT: movq %rax, 888(%rdi)
-; CHECK-NEXT: movq %rax, 880(%rdi)
-; CHECK-NEXT: movq %rax, 872(%rdi)
-; CHECK-NEXT: movq %rax, 864(%rdi)
-; CHECK-NEXT: movq %rax, 856(%rdi)
-; CHECK-NEXT: movq %rax, 848(%rdi)
-; CHECK-NEXT: movq %rax, 840(%rdi)
-; CHECK-NEXT: movq %rax, 832(%rdi)
-; CHECK-NEXT: movq %rax, 824(%rdi)
-; CHECK-NEXT: movq %rax, 816(%rdi)
-; CHECK-NEXT: movq %rax, 808(%rdi)
-; CHECK-NEXT: movq %rax, 800(%rdi)
-; CHECK-NEXT: movq %rax, 792(%rdi)
-; CHECK-NEXT: movq %rax, 784(%rdi)
-; CHECK-NEXT: movq %rax, 776(%rdi)
-; CHECK-NEXT: movq %rax, 768(%rdi)
-; CHECK-NEXT: movq %rax, 760(%rdi)
-; CHECK-NEXT: movq %rax, 752(%rdi)
-; CHECK-NEXT: movq %rax, 744(%rdi)
-; CHECK-NEXT: movq %rax, 736(%rdi)
-; CHECK-NEXT: movq %rax, 728(%rdi)
-; CHECK-NEXT: movq %rax, 720(%rdi)
-; CHECK-NEXT: movq %rax, 712(%rdi)
-; CHECK-NEXT: movq %rax, 704(%rdi)
-; CHECK-NEXT: movq %rax, 696(%rdi)
-; CHECK-NEXT: movq %rax, 688(%rdi)
-; CHECK-NEXT: movq %rax, 680(%rdi)
-; CHECK-NEXT: movq %rax, 672(%rdi)
-; CHECK-NEXT: movq %rax, 664(%rdi)
-; CHECK-NEXT: movq %rax, 656(%rdi)
-; CHECK-NEXT: movq %rax, 648(%rdi)
-; CHECK-NEXT: movq %rax, 640(%rdi)
-; CHECK-NEXT: movq %rax, 632(%rdi)
-; CHECK-NEXT: movq %rax, 624(%rdi)
-; CHECK-NEXT: movq %rax, 616(%rdi)
-; CHECK-NEXT: movq %rax, 608(%rdi)
-; CHECK-NEXT: movq %rax, 600(%rdi)
-; CHECK-NEXT: movq %rax, 592(%rdi)
-; CHECK-NEXT: movq %rax, 584(%rdi)
-; CHECK-NEXT: movq %rax, 576(%rdi)
-; CHECK-NEXT: movq %rax, 568(%rdi)
-; CHECK-NEXT: movq %rax, 560(%rdi)
-; CHECK-NEXT: movq %rax, 552(%rdi)
-; CHECK-NEXT: movq %rax, 544(%rdi)
-; CHECK-NEXT: movq %rax, 536(%rdi)
-; CHECK-NEXT: movq %rax, 528(%rdi)
-; CHECK-NEXT: movq %rax, 520(%rdi)
-; CHECK-NEXT: movq %rax, 512(%rdi)
-; CHECK-NEXT: movq %rax, 504(%rdi)
-; CHECK-NEXT: movq %rax, 496(%rdi)
-; CHECK-NEXT: movq %rax, 488(%rdi)
-; CHECK-NEXT: movq %rax, 480(%rdi)
-; CHECK-NEXT: movq %rax, 472(%rdi)
-; CHECK-NEXT: movq %rax, 464(%rdi)
-; CHECK-NEXT: movq %rax, 456(%rdi)
-; CHECK-NEXT: movq %rax, 448(%rdi)
-; CHECK-NEXT: movq %rax, 440(%rdi)
-; CHECK-NEXT: movq %rax, 432(%rdi)
-; CHECK-NEXT: movq %rax, 424(%rdi)
-; CHECK-NEXT: movq %rax, 416(%rdi)
-; CHECK-NEXT: movq %rax, 408(%rdi)
-; CHECK-NEXT: movq %rax, 400(%rdi)
-; CHECK-NEXT: movq %rax, 392(%rdi)
-; CHECK-NEXT: movq %rax, 384(%rdi)
-; CHECK-NEXT: movq %rax, 376(%rdi)
-; CHECK-NEXT: movq %rax, 368(%rdi)
-; CHECK-NEXT: movq %rax, 360(%rdi)
-; CHECK-NEXT: movq %rax, 352(%rdi)
-; CHECK-NEXT: movq %rax, 344(%rdi)
-; CHECK-NEXT: movq %rax, 336(%rdi)
-; CHECK-NEXT: movq %rax, 328(%rdi)
-; CHECK-NEXT: movq %rax, 320(%rdi)
-; CHECK-NEXT: movq %rax, 312(%rdi)
-; CHECK-NEXT: movq %rax, 304(%rdi)
-; CHECK-NEXT: movq %rax, 296(%rdi)
-; CHECK-NEXT: movq %rax, 288(%rdi)
-; CHECK-NEXT: movq %rax, 280(%rdi)
-; CHECK-NEXT: movq %rax, 272(%rdi)
-; CHECK-NEXT: movq %rax, 264(%rdi)
-; CHECK-NEXT: movq %rax, 256(%rdi)
-; CHECK-NEXT: movq %rax, 248(%rdi)
-; CHECK-NEXT: movq %rax, 240(%rdi)
-; CHECK-NEXT: movq %rax, 232(%rdi)
-; CHECK-NEXT: movq %rax, 224(%rdi)
-; CHECK-NEXT: movq %rax, 216(%rdi)
-; CHECK-NEXT: movq %rax, 208(%rdi)
-; CHECK-NEXT: movq %rax, 200(%rdi)
-; CHECK-NEXT: movq %rax, 192(%rdi)
-; CHECK-NEXT: movq %rax, 184(%rdi)
-; CHECK-NEXT: movq %rax, 176(%rdi)
-; CHECK-NEXT: movq %rax, 168(%rdi)
-; CHECK-NEXT: movq %rax, 160(%rdi)
-; CHECK-NEXT: movq %rax, 152(%rdi)
-; CHECK-NEXT: movq %rax, 144(%rdi)
-; CHECK-NEXT: movq %rax, 136(%rdi)
-; CHECK-NEXT: movq %rax, 128(%rdi)
-; CHECK-NEXT: movq %rax, 120(%rdi)
-; CHECK-NEXT: movq %rax, 112(%rdi)
-; CHECK-NEXT: movq %rax, 104(%rdi)
-; CHECK-NEXT: movq %rax, 96(%rdi)
-; CHECK-NEXT: movq %rax, 88(%rdi)
-; CHECK-NEXT: movq %rax, 80(%rdi)
-; CHECK-NEXT: movq %rax, 72(%rdi)
-; CHECK-NEXT: movq %rax, 64(%rdi)
-; CHECK-NEXT: movq %rax, 56(%rdi)
-; CHECK-NEXT: movq %rax, 48(%rdi)
-; CHECK-NEXT: movq %rax, 40(%rdi)
-; CHECK-NEXT: movq %rax, 32(%rdi)
-; CHECK-NEXT: movq %rax, 24(%rdi)
-; CHECK-NEXT: movq %rax, 16(%rdi)
-; CHECK-NEXT: movq %rax, 8(%rdi)
-; CHECK-NEXT: movq %rax, (%rdi)
+; CHECK-NEXT: movl %esi, %eax
+; CHECK-NEXT: movl $1024, %ecx # imm = 0x400
+; CHECK-NEXT: # kill: def $al killed $al killed $eax
+; CHECK-NEXT: rep;stosb %al, %es:(%rdi)
; CHECK-NEXT: retq
tail call void @llvm.memset.inline.p0.i64(ptr %a, i8 %value, i64 1024, i1 0)
ret void
Title should mention "memset". Otherwise it's unclear what case you're talking about.
@phoebewang is this adequate?
Can you split the 2nd commit into one NFC commit for the refactor and one for the change?
Refactored the memset and memcpy codegen to share the alignment-determining code.
repstosb and repstosd encode to the same size, but stosd is only used for a fill value of 0, because broadcasting a nonzero constant across the bytes of the 32-bit value takes extra instructions that increase the code size; for 0, no broadcast is needed at all. The same applies to memcpy, and as a result the minsize check was moved ahead, because an encoded jmp to memcpy takes more bytes than rep movsb.
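A back-of-the-envelope byte count supports this. The sizes below are standard x86-64 encodings; the snippet is an illustrative sketch, not anything from the PR:

```python
# Approximate x86-64 instruction sizes in bytes (standard encodings);
# the count-register setup is identical for both variants and omitted.
REP_STOSB = 2       # f3 aa
REP_STOSD = 2       # f3 ab
MOV_EAX_ESI = 2     # 89 f0: low byte of the value is already in place for stosb
MOVZX_EAX_SIL = 4   # 40 0f b6 c6
IMUL_EAX_IMM32 = 6  # 69 c0 id: broadcast the byte across the dword
XOR_EAX_EAX = 2     # 31 c0

# Nonzero value: stosd must first broadcast the byte across 32 bits.
stosb_nonzero = MOV_EAX_ESI + REP_STOSB
stosd_nonzero = MOVZX_EAX_SIL + IMUL_EAX_IMM32 + REP_STOSD

# Zero value: xor produces the broadcast dword for free, so stosd breaks even.
stosb_zero = XOR_EAX_EAX + REP_STOSB
stosd_zero = XOR_EAX_EAX + REP_STOSD

print(stosb_nonzero, stosd_nonzero)  # 4 12
print(stosb_zero, stosd_zero)        # 4 4
```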
Done!
@phoebewang @topperc Is this good now?
LGTM.
Thank you @phoebewang. Can we please merge?
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/51/builds/4778
Here is the relevant piece of the build log for reference:
We're hitting an assert after this change:
I'll revert until this gets fixed.
…sses (#87003)"

This caused assertion failures:

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:7736: SDValue getMemsetValue(SDValue, EVT, SelectionDAG &, const SDLoc &): Assertion `C->getAPIntValue().getBitWidth() == 8' failed.

See comment on the PR for a reproducer.

> repstosb and repstosd encode to the same size, but stosd is only used
> for a fill value of 0, because broadcasting a nonzero constant across
> the bytes of the 32-bit value takes extra instructions that increase
> the code size; for 0, no broadcast is needed at all.
>
> For memcpy, the same applies, and as a result the minsize check was
> moved ahead, because an encoded jmp to memcpy takes more bytes than
> rep movsb.

This reverts commit 6de5305.
…esses (llvm#87003)" Restore old Val if bytes are left over.
…esses (llvm#87003)" Restore old Val if bytes are left over to prevent an assertion failure.
repstosb and repstosd encode to the same size, but stosd is only used for a fill value of 0, because broadcasting a nonzero constant across the bytes of the 32-bit value takes extra instructions that increase the code size; for 0, no broadcast is needed at all.
For memcpy, the same applies, and as a result the minsize check was moved ahead, because an encoded jmp to memcpy takes more bytes than rep movsb.
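The memcpy claim (an encoded jmp is larger than rep movsb) can be checked against standard x86-64 encodings; this sketch is illustrative, not from the PR:

```python
# Standard x86-64 encodings, in bytes.
REP_MOVSB = 2   # f3 a4: inline copy with operands in rdi/rsi/rcx
JMP_REL32 = 5   # e9 + rel32: tail-call into the libc memcpy
CALL_REL32 = 5  # e8 + rel32

# The inline transfer instruction alone is smaller than either branch
# into the library routine.
print(REP_MOVSB < JMP_REL32 and REP_MOVSB < CALL_REL32)  # True
```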