[IR] Initial introduction of memset_pattern #97583
Conversation
@llvm/pr-subscribers-backend-powerpc @llvm/pr-subscribers-llvm-transforms

Author: Alex Bradbury (asb)

Changes

Supersedes the draft PR #94992, taking a different approach following feedback:

* Lower in PreISelIntrinsicLowering
* Don't require that the number of bytes to set is a compile-time constant
* Define llvm.memset_pattern rather than llvm.memset_pattern.inline

As discussed in the RFC thread, the intent is that the intrinsic will be lowered to loops, a sequence of stores, or libcalls depending on the expected cost and availability of libcalls on the target. Right now, there's just a single lowering path that aims to handle all cases (though it hasn't yet been generalised beyond i128, i.e. the equivalent of memset_pattern16). My intent would be to clean this up, then follow up with additional PRs that add further optimisations when possible (e.g. when libcalls are available, when arguments are known to be constant, etc.).

Patch is 42.45 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/97583.diff

12 Files Affected:
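For readers skimming the patch, here is a minimal hand-written sketch of a call to the proposed intrinsic, using the mangled name that appears in the LangRef and test changes below (the surrounding function and value names are illustrative, not taken from the patch):

declare void @llvm.memset_pattern.p0.i64.i128(ptr, i128, i64, i1)

define void @fill_with_pattern(ptr %dst, i128 %pattern, i64 %len) {
  ; Fill %len bytes starting at %dst with the 16-byte pattern %pattern.
  call void @llvm.memset_pattern.p0.i64.i128(ptr %dst, i128 %pattern, i64 %len, i1 false)
  ret void
}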
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index c98332d3a24fc..2641dc1a45348 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -15231,6 +15231,62 @@ The behavior of '``llvm.memset.inline.*``' is equivalent to the behavior of
'``llvm.memset.*``', but the generated code is guaranteed not to call any
external functions.
+.. _int_memset_pattern:
+
+'``llvm.memset_pattern``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+This is an overloaded intrinsic. You can use ``llvm.memset_pattern`` on
+any integer bit width and for different address spaces. Not all targets
+support all bit widths however.
+
+::
+
+ declare void @llvm.memset_pattern.p0.i64.i128(ptr <dest>, i128 <val>,
+ i64 <len>, i1 <isvolatile>)
+
+Overview:
+"""""""""
+
+The '``llvm.memset_pattern.*``' intrinsics fill a block of memory with
+a particular value. This may be expanded to an inline loop, a sequence of
+stores, or a libcall depending on what is available for the target and the
+expected performance and code size impact.
+
+Arguments:
+""""""""""
+
+The first argument is a pointer to the destination to fill, the second
+is the value with which to fill it, the third argument is an integer
+argument specifying the number of bytes to fill, and the fourth is a boolean
+indicating a volatile access.
+
+The :ref:`align <attr_align>` parameter attribute can be provided
+for the first argument.
+
+If the ``isvolatile`` parameter is ``true``, the
+``llvm.memset_pattern`` call is a :ref:`volatile operation <volatile>`. The
+detailed access behavior is not very cleanly specified and it is unwise to
+depend on it.
+
+Semantics:
+""""""""""
+
+The '``llvm.memset_pattern.*``' intrinsics fill "len" bytes of memory
+starting at the destination location. If the argument is known to be aligned
+to some boundary, this can be specified as an attribute on the argument.
+
+If ``<len>`` is not an integer multiple of the pattern width in bytes, then any
+remainder bytes will be copied from ``<val>``.
+If ``<len>`` is 0, it is no-op modulo the behavior of attributes attached to
+the arguments.
+If ``<len>`` is not a well-defined value, the behavior is undefined.
+If ``<len>`` is not zero, ``<dest>`` should be well-defined, otherwise the
+behavior is undefined.
+
.. _int_sqrt:
'``llvm.sqrt.*``' Intrinsic
diff --git a/llvm/include/llvm/IR/InstVisitor.h b/llvm/include/llvm/IR/InstVisitor.h
index 311e0ac47ddfa..aa4f0f36e4ed7 100644
--- a/llvm/include/llvm/IR/InstVisitor.h
+++ b/llvm/include/llvm/IR/InstVisitor.h
@@ -208,6 +208,7 @@ class InstVisitor {
RetTy visitDbgInfoIntrinsic(DbgInfoIntrinsic &I){ DELEGATE(IntrinsicInst); }
RetTy visitMemSetInst(MemSetInst &I) { DELEGATE(MemIntrinsic); }
RetTy visitMemSetInlineInst(MemSetInlineInst &I){ DELEGATE(MemSetInst); }
+ RetTy visitMemSetPatternInst(MemSetPatternInst &I) { DELEGATE(MemSetInst); }
RetTy visitMemCpyInst(MemCpyInst &I) { DELEGATE(MemTransferInst); }
RetTy visitMemCpyInlineInst(MemCpyInlineInst &I){ DELEGATE(MemCpyInst); }
RetTy visitMemMoveInst(MemMoveInst &I) { DELEGATE(MemTransferInst); }
@@ -295,6 +296,8 @@ class InstVisitor {
case Intrinsic::memset: DELEGATE(MemSetInst);
case Intrinsic::memset_inline:
DELEGATE(MemSetInlineInst);
+ case Intrinsic::memset_pattern:
+ DELEGATE(MemSetPatternInst);
case Intrinsic::vastart: DELEGATE(VAStartInst);
case Intrinsic::vaend: DELEGATE(VAEndInst);
case Intrinsic::vacopy: DELEGATE(VACopyInst);
diff --git a/llvm/include/llvm/IR/IntrinsicInst.h b/llvm/include/llvm/IR/IntrinsicInst.h
index 3963a5c8ab8f9..72439ebc14a54 100644
--- a/llvm/include/llvm/IR/IntrinsicInst.h
+++ b/llvm/include/llvm/IR/IntrinsicInst.h
@@ -1171,6 +1171,7 @@ class MemIntrinsic : public MemIntrinsicBase<MemIntrinsic> {
case Intrinsic::memmove:
case Intrinsic::memset:
case Intrinsic::memset_inline:
+ case Intrinsic::memset_pattern:
case Intrinsic::memcpy_inline:
return true;
default:
@@ -1182,7 +1183,8 @@ class MemIntrinsic : public MemIntrinsicBase<MemIntrinsic> {
}
};
-/// This class wraps the llvm.memset and llvm.memset.inline intrinsics.
+/// This class wraps the llvm.memset, llvm.memset.inline, and
+/// llvm.memset_pattern intrinsics.
class MemSetInst : public MemSetBase<MemIntrinsic> {
public:
// Methods for support type inquiry through isa, cast, and dyn_cast:
@@ -1190,6 +1192,7 @@ class MemSetInst : public MemSetBase<MemIntrinsic> {
switch (I->getIntrinsicID()) {
case Intrinsic::memset:
case Intrinsic::memset_inline:
+ case Intrinsic::memset_pattern:
return true;
default:
return false;
@@ -1215,6 +1218,21 @@ class MemSetInlineInst : public MemSetInst {
}
};
+/// This class wraps the llvm.memset.pattern intrinsic.
+class MemSetPatternInst : public MemSetInst {
+public:
+ ConstantInt *getLength() const {
+ return cast<ConstantInt>(MemSetInst::getLength());
+ }
+ // Methods for support type inquiry through isa, cast, and dyn_cast:
+ static bool classof(const IntrinsicInst *I) {
+ return I->getIntrinsicID() == Intrinsic::memset_pattern;
+ }
+ static bool classof(const Value *V) {
+ return isa<IntrinsicInst>(V) && classof(cast<IntrinsicInst>(V));
+ }
+};
+
/// This class wraps the llvm.memcpy/memmove intrinsics.
class MemTransferInst : public MemTransferBase<MemIntrinsic> {
public:
@@ -1294,6 +1312,7 @@ class AnyMemIntrinsic : public MemIntrinsicBase<AnyMemIntrinsic> {
case Intrinsic::memmove:
case Intrinsic::memset:
case Intrinsic::memset_inline:
+ case Intrinsic::memset_pattern:
case Intrinsic::memcpy_element_unordered_atomic:
case Intrinsic::memmove_element_unordered_atomic:
case Intrinsic::memset_element_unordered_atomic:
@@ -1316,6 +1335,7 @@ class AnyMemSetInst : public MemSetBase<AnyMemIntrinsic> {
switch (I->getIntrinsicID()) {
case Intrinsic::memset:
case Intrinsic::memset_inline:
+ case Intrinsic::memset_pattern:
case Intrinsic::memset_element_unordered_atomic:
return true;
default:
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index c7d383a5d0c0c..09e759e0e25b3 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -1003,6 +1003,14 @@ def int_memset_inline
NoCapture<ArgIndex<0>>, WriteOnly<ArgIndex<0>>,
ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<3>>]>;
+// Memset variant that writes a given pattern.
+def int_memset_pattern
+ : Intrinsic<[],
+ [llvm_anyptr_ty, llvm_anyint_ty, llvm_anyint_ty, llvm_i1_ty],
+ [IntrWriteMem, IntrArgMemOnly, IntrWillReturn, IntrNoFree, IntrNoCallback,
+ NoCapture<ArgIndex<0>>, WriteOnly<ArgIndex<0>>,
+ ImmArg<ArgIndex<3>>], "llvm.memset_pattern">;
+
// FIXME: Add version of these floating point intrinsics which allow non-default
// rounding modes and FP exception handling.
diff --git a/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp b/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
index 0777acf633187..59e06dd92be14 100644
--- a/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
+++ b/llvm/lib/CodeGen/PreISelIntrinsicLowering.cpp
@@ -263,6 +263,13 @@ bool PreISelIntrinsicLowering::expandMemIntrinsicUses(Function &F) const {
break;
}
+ case Intrinsic::memset_pattern: {
+ auto *Memset = cast<MemSetPatternInst>(Inst);
+ expandMemSetAsLoop(Memset);
+ Changed = true;
+ Memset->eraseFromParent();
+ break;
+ }
default:
llvm_unreachable("unhandled intrinsic");
}
@@ -280,6 +287,7 @@ bool PreISelIntrinsicLowering::lowerIntrinsics(Module &M) const {
case Intrinsic::memcpy:
case Intrinsic::memmove:
case Intrinsic::memset:
+ case Intrinsic::memset_pattern:
Changed |= expandMemIntrinsicUses(F);
break;
case Intrinsic::load_relative:
diff --git a/llvm/lib/IR/Verifier.cpp b/llvm/lib/IR/Verifier.cpp
index c98f61d555140..a0bef394ecee3 100644
--- a/llvm/lib/IR/Verifier.cpp
+++ b/llvm/lib/IR/Verifier.cpp
@@ -5446,7 +5446,8 @@ void Verifier::visitIntrinsicCall(Intrinsic::ID ID, CallBase &Call) {
case Intrinsic::memcpy_inline:
case Intrinsic::memmove:
case Intrinsic::memset:
- case Intrinsic::memset_inline: {
+ case Intrinsic::memset_inline:
+ case Intrinsic::memset_pattern: {
break;
}
case Intrinsic::memcpy_element_unordered_atomic:
diff --git a/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp b/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp
index d84e9f094e03a..3adcc51a81ed5 100644
--- a/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp
+++ b/llvm/lib/Transforms/Utils/LowerMemIntrinsics.cpp
@@ -455,6 +455,106 @@ static void createMemMoveLoop(Instruction *InsertBefore, Value *SrcAddr,
ElseTerm->eraseFromParent();
}
+static void createMemSetPatternLoop(Instruction *InsertBefore, Value *DstAddr,
+ Value *CopyLen, Value *SetValue,
+ Align DstAlign, bool IsVolatile) {
+
+ // To start with, let's assume SetValue is an i128 and bail out if it's not.
+ if (SetValue->getType()->getScalarSizeInBits() != 128) {
+ report_fatal_error("Only 128-bit variant supported for now");
+ }
+ unsigned PatternSize = SetValue->getType()->getScalarSizeInBits() / 8;
+
+ Type *TypeOfCopyLen = CopyLen->getType();
+ BasicBlock *OrigBB = InsertBefore->getParent();
+ Function *F = OrigBB->getParent();
+ const DataLayout &DL = F->getDataLayout();
+
+ BasicBlock *NewBB = OrigBB->splitBasicBlock(InsertBefore, "split");
+ BasicBlock *LoopBB =
+ BasicBlock::Create(F->getContext(), "storeloop", F, NewBB);
+ BasicBlock *RemCheckBB =
+ BasicBlock::Create(F->getContext(), "remcheck", F, NewBB);
+ BasicBlock *RemainderLoopBB =
+ BasicBlock::Create(F->getContext(), "remainderloop", F, NewBB);
+ IRBuilder<> Builder(OrigBB->getTerminator());
+
+ ConstantInt *CILoopOpSize =
+ ConstantInt::get(dyn_cast<IntegerType>(TypeOfCopyLen), PatternSize);
+ Value *RuntimeLoopCount =
+ getRuntimeLoopCount(DL, Builder, CopyLen, CILoopOpSize, PatternSize);
+ Value *RuntimeRemainder =
+ getRuntimeLoopRemainder(DL, Builder, CopyLen, CILoopOpSize, PatternSize);
+
+ Builder.CreateCondBr(Builder.CreateICmpEQ(ConstantInt::get(TypeOfCopyLen, 0),
+ RuntimeLoopCount),
+ RemCheckBB, LoopBB);
+ OrigBB->getTerminator()->eraseFromParent();
+
+ IRBuilder<> LoopBuilder(LoopBB);
+ PHINode *CurrentDst = LoopBuilder.CreatePHI(DstAddr->getType(), 0);
+ CurrentDst->addIncoming(DstAddr, OrigBB);
+ PHINode *LoopCount = LoopBuilder.CreatePHI(TypeOfCopyLen, 0);
+ LoopCount->addIncoming(RuntimeLoopCount, OrigBB);
+
+ // Create the store instruction for the pattern
+ LoopBuilder.CreateAlignedStore(SetValue, CurrentDst, DstAlign, IsVolatile);
+
+ Value *NextDst = LoopBuilder.CreateInBoundsGEP(
+ SetValue->getType(), CurrentDst,
+ ConstantInt::get(TypeOfCopyLen, PatternSize));
+ CurrentDst->addIncoming(NextDst, LoopBB);
+
+ Value *NewLoopCount =
+ LoopBuilder.CreateSub(LoopCount, ConstantInt::get(TypeOfCopyLen, 1));
+ LoopCount->addIncoming(NewLoopCount, LoopBB);
+
+ LoopBuilder.CreateCondBr(
+ LoopBuilder.CreateICmpNE(NewLoopCount,
+ ConstantInt::get(TypeOfCopyLen, 0)),
+ LoopBB, RemCheckBB);
+
+ IRBuilder<> RemCheckBuilder(RemCheckBB, RemCheckBB->begin());
+ // Branch to the end if there are no remainder bytes.
+ PHINode *RemainderDstPHI = RemCheckBuilder.CreatePHI(NextDst->getType(), 0);
+ RemainderDstPHI->addIncoming(DstAddr, OrigBB);
+ RemainderDstPHI->addIncoming(NextDst, LoopBB);
+ RemCheckBuilder.CreateCondBr(
+ RemCheckBuilder.CreateICmpEQ(RuntimeRemainder,
+ ConstantInt::get(TypeOfCopyLen, 0)),
+ NewBB, RemainderLoopBB);
+
+ // Remainder loop
+ IRBuilder<> RemainderLoopBuilder(RemainderLoopBB);
+ PHINode *ByteIndex = RemainderLoopBuilder.CreatePHI(TypeOfCopyLen, 0);
+ ByteIndex->addIncoming(ConstantInt::get(TypeOfCopyLen, 0), RemCheckBB);
+ Type *TypeOfSetValue = SetValue->getType();
+ PHINode *ShiftedValue = RemainderLoopBuilder.CreatePHI(TypeOfSetValue, 0);
+ ShiftedValue->addIncoming(SetValue, RemCheckBB);
+
+ // Get the byte to store
+ Value *ByteToStore = RemainderLoopBuilder.CreateTrunc(
+ ShiftedValue, RemainderLoopBuilder.getInt8Ty());
+
+ // Store the byte
+ RemainderLoopBuilder.CreateStore(
+ ByteToStore,
+ RemainderLoopBuilder.CreateInBoundsGEP(RemainderLoopBuilder.getInt8Ty(),
+ RemainderDstPHI, ByteIndex),
+ IsVolatile);
+
+ Value *NewByteIndex = RemainderLoopBuilder.CreateAdd(
+ ByteIndex, ConstantInt::get(TypeOfCopyLen, 1));
+ ByteIndex->addIncoming(NewByteIndex, RemainderLoopBB);
+ Value *NewShiftedValue = RemainderLoopBuilder.CreateLShr(
+ ShiftedValue, ConstantInt::get(TypeOfSetValue, 8));
+ ShiftedValue->addIncoming(NewShiftedValue, RemainderLoopBB);
+
+ RemainderLoopBuilder.CreateCondBr(
+ RemainderLoopBuilder.CreateICmpULT(NewByteIndex, RuntimeRemainder),
+ RemainderLoopBB, NewBB);
+}
+
static void createMemSetLoop(Instruction *InsertBefore, Value *DstAddr,
Value *CopyLen, Value *SetValue, Align DstAlign,
bool IsVolatile) {
@@ -590,6 +690,16 @@ bool llvm::expandMemMoveAsLoop(MemMoveInst *Memmove,
}
void llvm::expandMemSetAsLoop(MemSetInst *Memset) {
+ if (isa<MemSetPatternInst>(Memset)) {
+ return createMemSetPatternLoop(
+ /* InsertBefore */ Memset,
+ /* DstAddr */ Memset->getRawDest(),
+ /* CopyLen */ Memset->getLength(),
+ /* SetValue */ Memset->getValue(),
+ /* Alignment */ Memset->getDestAlign().valueOrOne(),
+ Memset->isVolatile());
+ }
+
createMemSetLoop(/* InsertBefore */ Memset,
/* DstAddr */ Memset->getRawDest(),
/* CopyLen */ Memset->getLength(),
diff --git a/llvm/test/CodeGen/RISCV/memset-pattern.ll b/llvm/test/CodeGen/RISCV/memset-pattern.ll
new file mode 100644
index 0000000000000..ea50ae0b56e40
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/memset-pattern.ll
@@ -0,0 +1,591 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=riscv32 -mattr=+m \
+; RUN: | FileCheck %s --check-prefixes=RV32-BOTH,RV32
+; RUN: llc < %s -mtriple=riscv64 -mattr=+m \
+; RUN: | FileCheck %s --check-prefixes=RV64-BOTH,RV64
+; RUN: llc < %s -mtriple=riscv32 -mattr=+m,+unaligned-scalar-mem \
+; RUN: | FileCheck %s --check-prefixes=RV32-BOTH,RV32-FAST
+; RUN: llc < %s -mtriple=riscv64 -mattr=+m,+unaligned-scalar-mem \
+; RUN: | FileCheck %s --check-prefixes=RV64-BOTH,RV64-FAST
+
+define void @memset_1(ptr %a, i128 %value) nounwind {
+; RV32-BOTH-LABEL: memset_1:
+; RV32-BOTH: # %bb.0:
+; RV32-BOTH-NEXT: lw a1, 0(a1)
+; RV32-BOTH-NEXT: sb a1, 0(a0)
+; RV32-BOTH-NEXT: ret
+;
+; RV64-BOTH-LABEL: memset_1:
+; RV64-BOTH: # %bb.0:
+; RV64-BOTH-NEXT: sb a1, 0(a0)
+; RV64-BOTH-NEXT: ret
+ tail call void @llvm.memset_pattern.p0.i64.i128(ptr %a, i128 %value, i64 1, i1 0)
+ ret void
+}
+
+define void @memset_2(ptr %a, i128 %value) nounwind {
+; RV32-LABEL: memset_2:
+; RV32: # %bb.0:
+; RV32-NEXT: lw a1, 0(a1)
+; RV32-NEXT: sb a1, 0(a0)
+; RV32-NEXT: srli a1, a1, 8
+; RV32-NEXT: sb a1, 1(a0)
+; RV32-NEXT: ret
+;
+; RV64-LABEL: memset_2:
+; RV64: # %bb.0:
+; RV64-NEXT: sb a1, 0(a0)
+; RV64-NEXT: srli a1, a1, 8
+; RV64-NEXT: sb a1, 1(a0)
+; RV64-NEXT: ret
+;
+; RV32-FAST-LABEL: memset_2:
+; RV32-FAST: # %bb.0:
+; RV32-FAST-NEXT: lw a1, 0(a1)
+; RV32-FAST-NEXT: sh a1, 0(a0)
+; RV32-FAST-NEXT: ret
+;
+; RV64-FAST-LABEL: memset_2:
+; RV64-FAST: # %bb.0:
+; RV64-FAST-NEXT: sh a1, 0(a0)
+; RV64-FAST-NEXT: ret
+ tail call void @llvm.memset_pattern.p0.i64.i128(ptr %a, i128 %value, i64 2, i1 0)
+ ret void
+}
+
+define void @memset_3(ptr %a, i128 %value) nounwind {
+; RV32-LABEL: memset_3:
+; RV32: # %bb.0:
+; RV32-NEXT: lw a1, 0(a1)
+; RV32-NEXT: sb a1, 0(a0)
+; RV32-NEXT: srli a2, a1, 8
+; RV32-NEXT: sb a2, 1(a0)
+; RV32-NEXT: srli a1, a1, 16
+; RV32-NEXT: sb a1, 2(a0)
+; RV32-NEXT: ret
+;
+; RV64-LABEL: memset_3:
+; RV64: # %bb.0:
+; RV64-NEXT: sb a1, 0(a0)
+; RV64-NEXT: srli a2, a1, 8
+; RV64-NEXT: sb a2, 1(a0)
+; RV64-NEXT: srli a1, a1, 16
+; RV64-NEXT: sb a1, 2(a0)
+; RV64-NEXT: ret
+;
+; RV32-FAST-LABEL: memset_3:
+; RV32-FAST: # %bb.0:
+; RV32-FAST-NEXT: lw a1, 0(a1)
+; RV32-FAST-NEXT: sh a1, 0(a0)
+; RV32-FAST-NEXT: srli a1, a1, 16
+; RV32-FAST-NEXT: sb a1, 2(a0)
+; RV32-FAST-NEXT: ret
+;
+; RV64-FAST-LABEL: memset_3:
+; RV64-FAST: # %bb.0:
+; RV64-FAST-NEXT: sh a1, 0(a0)
+; RV64-FAST-NEXT: srli a1, a1, 16
+; RV64-FAST-NEXT: sb a1, 2(a0)
+; RV64-FAST-NEXT: ret
+ tail call void @llvm.memset_pattern.p0.i64.i128(ptr %a, i128 %value, i64 3, i1 0)
+ ret void
+}
+
+define void @memset_4(ptr %a, i128 %value) nounwind {
+; RV32-LABEL: memset_4:
+; RV32: # %bb.0:
+; RV32-NEXT: lw a1, 0(a1)
+; RV32-NEXT: sb a1, 0(a0)
+; RV32-NEXT: srli a2, a1, 24
+; RV32-NEXT: sb a2, 3(a0)
+; RV32-NEXT: srli a2, a1, 16
+; RV32-NEXT: sb a2, 2(a0)
+; RV32-NEXT: srli a1, a1, 8
+; RV32-NEXT: sb a1, 1(a0)
+; RV32-NEXT: ret
+;
+; RV64-LABEL: memset_4:
+; RV64: # %bb.0:
+; RV64-NEXT: sb a1, 0(a0)
+; RV64-NEXT: srli a2, a1, 24
+; RV64-NEXT: sb a2, 3(a0)
+; RV64-NEXT: srli a2, a1, 16
+; RV64-NEXT: sb a2, 2(a0)
+; RV64-NEXT: srli a1, a1, 8
+; RV64-NEXT: sb a1, 1(a0)
+; RV64-NEXT: ret
+;
+; RV32-FAST-LABEL: memset_4:
+; RV32-FAST: # %bb.0:
+; RV32-FAST-NEXT: lw a1, 0(a1)
+; RV32-FAST-NEXT: sw a1, 0(a0)
+; RV32-FAST-NEXT: ret
+;
+; RV64-FAST-LABEL: memset_4:
+; RV64-FAST: # %bb.0:
+; RV64-FAST-NEXT: sw a1, 0(a0)
+; RV64-FAST-NEXT: ret
+ tail call void @llvm.memset_pattern.p0.i64.i128(ptr %a, i128 %value, i64 4, i1 0)
+ ret void
+}
+
+define void @memset_5(ptr %a, i128 %value) nounwind {
+; RV32-LABEL: memset_5:
+; RV32: # %bb.0:
+; RV32-NEXT: lw a2, 0(a1)
+; RV32-NEXT: lw a1, 4(a1)
+; RV32-NEXT: sb a2, 0(a0)
+; RV32-NEXT: sb a1, 4(a0)
+; RV32-NEXT: srli a1, a2, 24
+; RV32-NEXT: sb a1, 3(a0)
+; RV32-NEXT: srli a1, a2, 16
+; RV32-NEXT: sb a1, 2(a0)
+; RV32-NEXT: srli a2, a2, 8
+; RV32-NEXT: sb a2, 1(a0)
+; RV32-NEXT: ret
+;
+; RV64-LABEL: memset_5:
+; RV64: # %bb.0:
+; RV64-NEXT: sb a1, 0(a0)
+; RV64-NEXT: srli a2, a1, 24
+; RV64-NEXT: sb a2, 3(a0)
+; RV64-NEXT: srli a2, a1, 16
+; RV64-NEXT: sb a2, 2(a0)
+; RV64-NEXT: srli a2, a1, 8
+; RV64-NEXT: sb a2, 1(a0)
+; RV64-NEXT: srli a1, a1, 32
+; RV64-NEXT: sb a1, 4(a0)
+; RV64-NEXT: ret
+;
+; RV32-FAST-LABEL: memset_5:
+; RV32-FAST: # %bb.0:
+; RV32-FAST-NEXT: lw a2, 4(a1)
+; RV32-FAST-NEXT: lw a1, 0(a1)
+; RV32-FAST-NEXT: sb a2, 4(a0)
+; RV32-FAST-NEXT: sw a1, 0(a0)
+; RV32-FAST-NEXT: ret
+;
+; RV64-FAST-LABEL: memset_5:
+; RV64-FAST: # %bb.0:
+; RV64-FAST-NEXT: sw a1, 0(a0)
+; RV64-FAST-NEXT: srli a1, a1, 32
+; RV64-FAST-NEXT: sb a1, 4(a0)
+; RV64-FAST-NEXT: ret
+ tail call void @llvm.memset_pattern.p0.i64.i128(ptr %a, i128 %value, i64 5, i1 0)
+ ret void
+}
+
+define void @memset_6(ptr %a, i128 %value) nounwind {
+; RV32-LABEL: memset_6:
+; RV32: # %bb.0:
+; RV32-NEXT: lw a2, 4(a1)
+; RV32-NEXT: lw a1, 0(a1)
+; RV32-NEXT: sb a2, 4(a0)
+; RV32-NEXT: sb a1, 0(a0)
+; RV32-NEXT: srli a2, a2, 8
+; RV32-NEXT: sb a2, 5(a0)
+; RV32-NEXT: srli a2, a1, 24
+; RV32-NEXT: sb a2, 3(a0)
+; RV32-NEXT: srli a2, a1, 16
+; RV32-NEXT: sb a2, 2(a0)
+; RV32-NEXT: srli a1, a1, 8
+; RV32-NEXT: sb a1, 1(a0)
+; RV32-NEXT: ret
+;
+; RV64-LABEL: memset_6:
+; RV64: # %bb.0:
+; RV64-NEXT: sb a1, 0(a0)
+; RV64-NEXT: srli a2, a1, 40
+; RV64-NEXT: sb a2, 5(a0)
+; RV64-NEXT: srli a2, a1, 32
+; RV64-NEXT: sb a2, 4(a0)
+; RV64-NEXT: srli a2, a1, 24
+; RV64-NEXT: sb a2, 3(a0)
+; RV64-NEXT: srli a2, a1, 16
+; RV64-NEXT: sb a2, 2(...
[truncated]
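For orientation, the createMemSetPatternLoop expansion added in LowerMemIntrinsics.cpp above builds control flow roughly like the following hand-written IR (a simplified sketch assuming an i128 pattern, not actual pass output). Note that the byte-at-a-time remainder loop peels off the low byte first, which implicitly assumes a little-endian layout; that is what the next comment asks about.

define void @memset_pattern_expanded(ptr %dst, i128 %pattern, i64 %len) {
entry:
  %loopcount = udiv i64 %len, 16        ; number of whole 16-byte patterns
  %remainder = urem i64 %len, 16        ; trailing bytes, if any
  %skipmain = icmp eq i64 %loopcount, 0
  br i1 %skipmain, label %remcheck, label %storeloop

storeloop:                              ; store the full pattern each iteration
  %cur = phi ptr [ %dst, %entry ], [ %next, %storeloop ]
  %n = phi i64 [ %loopcount, %entry ], [ %n.dec, %storeloop ]
  store i128 %pattern, ptr %cur
  %next = getelementptr inbounds i8, ptr %cur, i64 16
  %n.dec = sub i64 %n, 1
  %again = icmp ne i64 %n.dec, 0
  br i1 %again, label %storeloop, label %remcheck

remcheck:                               ; any partial pattern left to write?
  %remdst = phi ptr [ %dst, %entry ], [ %next, %storeloop ]
  %no.rem = icmp eq i64 %remainder, 0
  br i1 %no.rem, label %done, label %remainderloop

remainderloop:                          ; write the remainder one byte at a time
  %idx = phi i64 [ 0, %remcheck ], [ %idx.next, %remainderloop ]
  %shift = phi i128 [ %pattern, %remcheck ], [ %shift.next, %remainderloop ]
  %byte = trunc i128 %shift to i8
  %addr = getelementptr inbounds i8, ptr %remdst, i64 %idx
  store i8 %byte, ptr %addr
  %idx.next = add i64 %idx, 1
  %shift.next = lshr i128 %shift, 8
  %more = icmp ult i64 %idx.next, %remainder
  br i1 %more, label %remainderloop, label %done

done:
  ret void
}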
Does this need to care about big endian vs little endian? I notice LoopIdiomRecognize.cpp only handles little endian targets.
You're right, I've adjusted the current implementation to bail out if targeting a big endian system. With additional tests added and some cleanup now done, I think this is ready for review as-is.
Ping - review comments greatly appreciated.
Looked at this again now that the 19.x branching has taken place, and I believe all comments are now addressed.
Something I'm still not completely sold on is that the size is specified in bytes. I think the only advantage of doing that is that it directly maps onto the apple libcalls. On the flip side, it has the disadvantage that we may end up generating an unnecessary tail loop.
However, the larger conceptual problem I see is that it may prevent future generalization (though I kind of wonder whether that should not be supported right away) to non-integer types. memset.pattern is a good opportunity to perform stores with the correct types. E.g. consider filling memory with non-integral pointers -- in that case you can't just perform the memset after a ptrtoint. And a store of only part of a non-integral pointer would be pretty ill-defined. To a lesser degree this problem also exists for integral pointers, due to provenance considerations.
I'd rather not end up needing another separate intrinsic to satisfy that case.
Also simplify/remove tests that are no longer relevant given we don't need to worry about writes of a partial pattern.
I think this is a good call - thank you. I've gone ahead and made that change. One awkwardness I want to flag for review is that MemSetPatternInst still inherits from MemSetInst, which inherits from …
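For context, the change being discussed is roughly the switch from a byte length to an element count, as in the sketch below. The llvm.experimental.memset.pattern spelling is taken from the LangRef excerpts later in the thread; the exact mangled names are assumptions, not copied from the patch.

declare void @llvm.memset_pattern.p0.i64.i128(ptr, i128, i64, i1)
declare void @llvm.experimental.memset.pattern.p0.i128.i64(ptr, i128, i64, i1)

define void @length_vs_count(ptr %dst, i128 %pat) {
  ; Byte-length form (as originally proposed): 100 = 6*16 + 4, so the trailing
  ; 4 bytes need a partial-pattern tail loop.
  call void @llvm.memset_pattern.p0.i64.i128(ptr %dst, i128 %pat, i64 100, i1 false)
  ; Element-count form: exactly 6 full-width stores of the pattern, no tail.
  call void @llvm.experimental.memset.pattern.p0.i128.i64(ptr %dst, i128 %pat, i64 6, i1 false)
  ret void
}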
@@ -1003,6 +1003,14 @@ def int_memset_inline
              NoCapture<ArgIndex<0>>, WriteOnly<ArgIndex<0>>,
              ImmArg<ArgIndex<3>>]>;

// Memset variant that writes a given pattern.
Comment what the operands are
Added with inline comments (there's not a totally consistent pattern for describing args - many intrinsics have no description at all, but I see some other examples in the file using inline comments for the args)
Really we ought to have mandatory TableGen doc strings.
def int_memset_pattern
    : Intrinsic<[],
                [llvm_anyptr_ty, llvm_anyint_ty, llvm_anyint_ty, llvm_i1_ty],
                [IntrWriteMem, IntrArgMemOnly, IntrWillReturn, IntrNoFree, IntrNoCallback,
DefaultAttrIntrinsic would hide most of these
I dug through this and unfortunately I can't make use of DefaultAttrIntrinsic, because nosync isn't necessarily true. https://reviews.llvm.org/D86021 switched memset over to using DefaultAttrIntrinsic, but this was later backed out in a888e49 due to nosync not applying unconditionally.
LGTM
This is an overloaded intrinsic. You can use
``llvm.experimental.memset.pattern`` on any integer bit width that is an
integral number of bytes and for different address spaces. Not all targets
Do we need a Verifier check that the size is an integral number of bytes?
I've now added this check to the verifier.
Actually that was my bad. There was some discussion about this before, and in the end non-integral bitwidths should work and are tested in llvm/test/Transforms/PreISelIntrinsicLowering/RISCV/memset-pattern.ll. I've now pushed a fix to LangRef to remove the stated restriction.
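For illustration, the sort of non-byte-sized pattern being discussed might look like the following (a hypothetical example, not copied from the referenced test; the mangled name is an assumption):

declare void @llvm.experimental.memset.pattern.p0.i17.i64(ptr, i17, i64, i1)

define void @fill_i17(ptr %dst, i17 %pat) {
  ; Each i17 element is written with ordinary store semantics and the address
  ; advances by the type's allocation size, so padding bits behave as they
  ; would for a plain i17 store.
  call void @llvm.experimental.memset.pattern.p0.i17.i64(ptr %dst, i17 %pat, i64 8, i1 false)
  ret void
}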
…al number of bytes" This reverts commit 78bad3b.
I think I'm ok to merge with Philip's LGTM and I believe all outstanding comments have been addressed, but I'll give this another day or so as it's been open so long. Please speak up if there's anything you feel wasn't fully reflected in the current code.
the target: i.e. on little endian targets, the least significant byte of the
pattern is first in memory, while the most significant byte is first in memory
for big endian targets. The memory address is incremented by the allocation
size of the type.
If the size is not an integral number of bytes, what happens with the padding bits?
Do they get unspecified / undefined values or do they preserve their values?
Should this be clarified in LangRef?
The behavior is the same as whatever stores of the type would do.
Possibly we should just explicitly refer to store semantics here and reduce the description to something like this:

The '``llvm.experimental.memset.pattern*``' intrinsic stores the provided value ``<count>`` times, incrementing by the allocation size of the type each time. The stores follow the usual semantics of ``store`` instructions, including regarding endianness and padding.
Thanks, that's much clearer - I've adopted that wording.
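To make the adopted wording concrete, here is a small hand-written example of the byte layout it implies (values and names chosen for illustration): a pattern store of i32 168496141 (0x0A0B0C0D) with a count of 3 behaves like three plain i32 stores, 4 bytes apart.

define void @byte_layout_example(ptr %p) {
  store i32 168496141, ptr %p
  %p1 = getelementptr inbounds i8, ptr %p, i64 4
  store i32 168496141, ptr %p1
  %p2 = getelementptr inbounds i8, ptr %p1, i64 4
  store i32 168496141, ptr %p2
  ; On a little-endian target the bytes written are:
  ;   0D 0C 0B 0A 0D 0C 0B 0A 0D 0C 0B 0A
  ; and on a big-endian target:
  ;   0A 0B 0C 0D 0A 0B 0C 0D 0A 0B 0C 0D
  ret void
}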
LGTM
static bool classof(const Value *V) {
  return isa<IntrinsicInst>(V) && classof(cast<IntrinsicInst>(V));
}
};
Why do we need both MemSetPatternIntrinsic and MemSetPatternInst?
I'm mirroring the pattern followed by other mem intrinsics, and although it's not super pretty, having both classes like this (as for the standard MemSet intrinsics) seems the way that reduces copy and paste of accessor code.
Hm okay, I thought this might only be needed for cases where we have both atomic and non-atomic variants.
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/185/builds/8572
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/175/builds/8577
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/137/builds/8684
…#97583) Relands 7ff3a9a after regenerating the test case.

Supersedes the draft PR #94992, taking a different approach following feedback:

* Lower in PreISelIntrinsicLowering
* Don't require that the number of bytes to set is a compile-time constant
* Define llvm.memset_pattern rather than llvm.memset_pattern.inline

As discussed in the [RFC thread](https://discourse.llvm.org/t/rfc-introducing-an-llvm-memset-pattern-inline-intrinsic/79496), the intent is that the intrinsic will be lowered to loops, a sequence of stores, or libcalls depending on the expected cost and availability of libcalls on the target. Right now, there's just a single lowering path that aims to handle all cases. My intent would be to follow up with additional PRs that add additional optimisations when possible (e.g. when libcalls are available, when arguments are known to be constant etc).
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/153/builds/14776
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/33/builds/6533
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/16/builds/8935
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/56/builds/12314
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/60/builds/12817
Supersedes the draft PR #94992, taking a different approach following feedback:

* Lower in PreISelIntrinsicLowering
* Don't require that the number of bytes to set is a compile-time constant
* Define llvm.memset_pattern rather than llvm.memset_pattern.inline
As discussed in the RFC thread, the intent is that the intrinsic will be lowered to loops, a sequence of stores, or libcalls depending on the expected cost and availability of libcalls on the target. Right now, there's just a single lowering path that aims to handle all cases. My intent would be to follow up with additional PRs that add additional optimisations when possible (e.g. when libcalls are available, when arguments are known to be constant etc).