Skip to content

[memprof] Use std::vector<Frame> instead of llvm::SmallVector<Frame> (NFC) #94432

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 6, 2024

Conversation

kazutakahirata
Copy link
Contributor

@kazutakahirata kazutakahirata commented Jun 5, 2024

This patch replaces llvm::SmallVector with std::vector.

llvm::SmallVector sets aside one inline element. Meanwhile,
when I sort all call stacks by their lengths, the length at the first
percentile is already 2. That is, 99 percent of call stacks do not
take advantage of the inline element.

Using std::vector reduces the cycle and instruction counts by
11% and 22%, respectively, with "llvm-profdata show" modified to
deserialize all MemProfRecords.

…(NFC)

This patch replaces llvm::SmallVector<Frame> with std::vector<Frame>.

llvm::SmallVector<Frame> sets aside one inline element.  Meanwhile,
when I sort all call stacks by their lengths, the length at the first
percentile is already 2.  That is, 99 percent of call stacks do not
take advantage of the inline element.

Using std::vector<Frame> reduces the cycle and instruction counts by
4.5% and 10.3%, respectively, with "llvm-profdata show" modified to
deserialize all MemProfRecords.
@llvmbot llvmbot added PGO Profile Guided Optimizations llvm:transforms labels Jun 5, 2024
@llvmbot
Copy link
Member

llvmbot commented Jun 5, 2024

@llvm/pr-subscribers-pgo

@llvm/pr-subscribers-llvm-transforms

Author: Kazu Hirata (kazutakahirata)

Changes

This patch replaces llvm::SmallVector<Frame> with std::vector<Frame>.

llvm::SmallVector<Frame> sets aside one inline element. Meanwhile,
when I sort all call stacks by their lengths, the length at the first
percentile is already 2. That is, 99 percent of call stacks do not
take advantage of the inline element.

Using std::vector<Frame> reduces the cycle and instruction counts by
4.5% and 10.3%, respectively, with "llvm-profdata show" modified to
deserialize all MemProfRecords.


Full diff: https://github.com/llvm/llvm-project/pull/94432.diff

3 Files Affected:

  • (modified) llvm/include/llvm/ProfileData/MemProf.h (+9-10)
  • (modified) llvm/lib/ProfileData/MemProf.cpp (+1-2)
  • (modified) llvm/lib/Transforms/Instrumentation/MemProfiler.cpp (+2-2)
diff --git a/llvm/include/llvm/ProfileData/MemProf.h b/llvm/include/llvm/ProfileData/MemProf.h
index 406144d9db1e8..667b9fa033d75 100644
--- a/llvm/include/llvm/ProfileData/MemProf.h
+++ b/llvm/include/llvm/ProfileData/MemProf.h
@@ -364,7 +364,7 @@ struct IndexedAllocationInfo {
 // be used for temporary in-memory instances.
 struct AllocationInfo {
   // Same as IndexedAllocationInfo::CallStack with the frame contents inline.
-  llvm::SmallVector<Frame> CallStack;
+  std::vector<Frame> CallStack;
   // Same as IndexedAllocationInfo::Info;
   PortableMemInfoBlock Info;
 
@@ -446,8 +446,7 @@ struct IndexedMemProfRecord {
   // Convert IndexedMemProfRecord to MemProfRecord.  Callback is used to
   // translate CallStackId to call stacks with frames inline.
   MemProfRecord toMemProfRecord(
-      llvm::function_ref<llvm::SmallVector<Frame>(const CallStackId)> Callback)
-      const;
+      llvm::function_ref<std::vector<Frame>(const CallStackId)> Callback) const;
 
   // Returns the GUID for the function name after canonicalization. For
   // memprof, we remove any .llvm suffix added by LTO. MemProfRecords are
@@ -462,7 +461,7 @@ struct MemProfRecord {
   // Same as IndexedMemProfRecord::AllocSites with frame contents inline.
   llvm::SmallVector<AllocationInfo> AllocSites;
   // Same as IndexedMemProfRecord::CallSites with frame contents inline.
-  llvm::SmallVector<llvm::SmallVector<Frame>> CallSites;
+  llvm::SmallVector<std::vector<Frame>> CallSites;
 
   MemProfRecord() = default;
   MemProfRecord(
@@ -472,7 +471,7 @@ struct MemProfRecord {
       AllocSites.emplace_back(IndexedAI, IdToFrameCallback);
     }
     for (const ArrayRef<FrameId> Site : Record.CallSites) {
-      llvm::SmallVector<Frame> Frames;
+      std::vector<Frame> Frames;
       for (const FrameId Id : Site) {
         Frames.push_back(IdToFrameCallback(Id));
       }
@@ -490,7 +489,7 @@ struct MemProfRecord {
 
     if (!CallSites.empty()) {
       OS << "    CallSites:\n";
-      for (const llvm::SmallVector<Frame> &Frames : CallSites) {
+      for (const std::vector<Frame> &Frames : CallSites) {
         for (const Frame &F : Frames) {
           OS << "    -\n";
           F.printYAML(OS);
@@ -844,8 +843,8 @@ template <typename MapTy> struct CallStackIdConverter {
   CallStackIdConverter(const CallStackIdConverter &) = delete;
   CallStackIdConverter &operator=(const CallStackIdConverter &) = delete;
 
-  llvm::SmallVector<Frame> operator()(CallStackId CSId) {
-    llvm::SmallVector<Frame> Frames;
+  std::vector<Frame> operator()(CallStackId CSId) {
+    std::vector<Frame> Frames;
     auto CSIter = Map.find(CSId);
     if (CSIter == Map.end()) {
       LastUnmappedId = CSId;
@@ -886,8 +885,8 @@ struct LinearCallStackIdConverter {
                              std::function<Frame(LinearFrameId)> FrameIdToFrame)
       : CallStackBase(CallStackBase), FrameIdToFrame(FrameIdToFrame) {}
 
-  llvm::SmallVector<Frame> operator()(LinearCallStackId LinearCSId) {
-    llvm::SmallVector<Frame> Frames;
+  std::vector<Frame> operator()(LinearCallStackId LinearCSId) {
+    std::vector<Frame> Frames;
 
     const unsigned char *Ptr =
         CallStackBase +
diff --git a/llvm/lib/ProfileData/MemProf.cpp b/llvm/lib/ProfileData/MemProf.cpp
index 1d9860e0ea7e8..2b227a65c1d8f 100644
--- a/llvm/lib/ProfileData/MemProf.cpp
+++ b/llvm/lib/ProfileData/MemProf.cpp
@@ -338,8 +338,7 @@ IndexedMemProfRecord::deserialize(const MemProfSchema &Schema,
 }
 
 MemProfRecord IndexedMemProfRecord::toMemProfRecord(
-    llvm::function_ref<llvm::SmallVector<Frame>(const CallStackId)> Callback)
-    const {
+    llvm::function_ref<std::vector<Frame>(const CallStackId)> Callback) const {
   MemProfRecord Record;
 
   Record.AllocSites.reserve(AllocSites.size());
diff --git a/llvm/lib/Transforms/Instrumentation/MemProfiler.cpp b/llvm/lib/Transforms/Instrumentation/MemProfiler.cpp
index c0a3bf8464d2d..d70c6a7a0a152 100644
--- a/llvm/lib/Transforms/Instrumentation/MemProfiler.cpp
+++ b/llvm/lib/Transforms/Instrumentation/MemProfiler.cpp
@@ -759,7 +759,7 @@ static void readMemprof(Module &M, Function &F,
   std::map<uint64_t, std::set<const AllocationInfo *>> LocHashToAllocInfo;
   // For the callsites we need to record the index of the associated frame in
   // the frame array (see comments below where the map entries are added).
-  std::map<uint64_t, std::set<std::pair<const SmallVector<Frame> *, unsigned>>>
+  std::map<uint64_t, std::set<std::pair<const std::vector<Frame> *, unsigned>>>
       LocHashToCallSites;
   for (auto &AI : MemProfRec->AllocSites) {
     // Associate the allocation info with the leaf frame. The later matching
@@ -815,7 +815,7 @@ static void readMemprof(Module &M, Function &F,
       // and another callsite).
       std::map<uint64_t, std::set<const AllocationInfo *>>::iterator
           AllocInfoIter;
-      std::map<uint64_t, std::set<std::pair<const SmallVector<Frame> *,
+      std::map<uint64_t, std::set<std::pair<const std::vector<Frame> *,
                                             unsigned>>>::iterator CallSitesIter;
       for (const DILocation *DIL = I.getDebugLoc(); DIL != nullptr;
            DIL = DIL->getInlinedAt()) {

@MaskRay
Copy link
Member

MaskRay commented Jun 5, 2024

Consider SmallVector<Frame, 0>? sizeof(SmallVector<X, 0>) < sizeof(vector<X>)

@kazutakahirata
Copy link
Contributor Author

Consider SmallVector<Frame, 0>? sizeof(SmallVector<X, 0>) < sizeof(vector<X>)

That's super interesting. Thanks! I just tried SmallVector<Frame, 0>. The cycle count goes up by 2.3% while the instruction goes down slightly by 0.09% (all relative to SmallVector<Frame>). My llvm-profdata is not built with ThinLTO, so maybe a certain call hierarchy is not collapsed!?

@teresajohnson
Copy link
Contributor

This patch replaces llvm::SmallVector with std::vector.

llvm::SmallVector sets aside one inline element.

I thought it allocated a few elements (I guess more than 2 from what you found below)?

Meanwhile, when I sort all call stacks by their lengths, the length at the first percentile is already 2. That is, 99 percent of call stacks do not take advantage of the inline element.

I'm surprised that almost all are <=2: the allocation call contexts should almost always be longer than 2 frames in practice. Is this dominated by the CallSites frames? I can imagine those are short. Maybe that should use std::vector (or SmallVector<...,0>), and the AllocationInfo CallStack remain SmallVector?

@kazutakahirata
Copy link
Contributor Author

This patch replaces llvm::SmallVector with std::vector.
llvm::SmallVector sets aside one inline element.

I thought it allocated a few elements (I guess more than 2 from what you found below)?

No, by default, the number of inlined elements is the maximum of:

  • 1
  • the number of elements that would fit within 64 bytes, including SmallVector's data pointer, the length field, etc.

So, SmallVector<Frame> has only one inlined element because sizeof(Frame) == 32 (after #94655).

Meanwhile, when I sort all call stacks by their lengths, the length at the first percentile is already 2. That is, 99 percent of call stacks do not take advantage of the inline element.

I'm surprised that almost all are <=2: the allocation call contexts should almost always be longer than 2 frames in practice. Is this dominated by the CallSites frames? I can imagine those are short. Maybe that should use std::vector (or SmallVector<...,0>), and the AllocationInfo CallStack remain SmallVector?

I think you've got the percentiles backwards. I'm sorting the call stacks in the ascending order of their lengths. Here are the stats for call stack lengths:

  • the 1st percentile: 2 frames
  • the 5th percentile: 5 frames
  • the median: 72 frames

So, we rarely use the inlined element of SmallVector<Frame>.

With the smaller baseline cycle/instruction counts thanks to #94655, which just landed, using std::vector<Frame> here reduces the cycle and instruction counts by 11% and 22%, respectively. I've updated the commit message of this PR accordingly.

@teresajohnson
Copy link
Contributor

This patch replaces llvm::SmallVector with std::vector.
llvm::SmallVector sets aside one inline element.

I thought it allocated a few elements (I guess more than 2 from what you found below)?

No, by default, the number of inlined elements is the maximum of:

  • 1
  • the number of elements that would fit within 64 bytes, including SmallVector's data pointer, the length field, etc.

So, SmallVector<Frame> has only one inlined element because sizeof(Frame) == 32 (after #94655).

Meanwhile, when I sort all call stacks by their lengths, the length at the first percentile is already 2. That is, 99 percent of call stacks do not take advantage of the inline element.

I'm surprised that almost all are <=2: the allocation call contexts should almost always be longer than 2 frames in practice. Is this dominated by the CallSites frames? I can imagine those are short. Maybe that should use std::vector (or SmallVector<...,0>), and the AllocationInfo CallStack remain SmallVector?

I think you've got the percentiles backwards. I'm sorting the call stacks in the ascending order of their lengths. Here are the stats for call stack lengths:

  • the 1st percentile: 2 frames
  • the 5th percentile: 5 frames
  • the median: 72 frames

So, we rarely use the inlined element of SmallVector<Frame>.

Oh got it, yep I mis-read.

With the smaller baseline cycle/instruction counts thanks to #94655, which just landed, using std::vector<Frame> here reduces the cycle and instruction counts by 11% and 22%, respectively. I've updated the commit message of this PR accordingly.

@kazutakahirata kazutakahirata merged commit 4a918f0 into llvm:main Jun 6, 2024
10 checks passed
@kazutakahirata kazutakahirata deleted the memprof_std_vector branch June 6, 2024 21:24
@dwblaikie
Copy link
Collaborator

My recollection was that even SmallVector<T, 0> had some advantages over std::vector ( https://llvm.org/docs/ProgrammersManual.html#llvm-adt-smallvector-h mentions some possible reasons for that) - so maybe worth using that instead?

@kazutakahirata
Copy link
Contributor Author

My recollection was that even SmallVector<T, 0> had some advantages over std::vector ( https://llvm.org/docs/ProgrammersManual.html#llvm-adt-smallvector-h mentions some possible reasons for that) - so maybe worth using that instead?

I've tried SmallVector<T, 0>, but that didn't improve the performance. Now, my llvm-profdata is not built with FDO+ThinLTO, so there may be some missed inlining opportunities or something that would make SmallVector<T, 0> a better choice than std::vector<T> here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
llvm:transforms PGO Profile Guided Optimizations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants