[memprof] Use std::vector<Frame> instead of llvm::SmallVector<Frame> (NFC) #94432

kazutakahirata · 2024-06-05T06:15:17Z

This patch replaces llvm::SmallVector with std::vector.

llvm::SmallVector sets aside one inline element. Meanwhile,
when I sort all call stacks by their lengths, the length at the first
percentile is already 2. That is, 99 percent of call stacks do not
take advantage of the inline element.

Using std::vector reduces the cycle and instruction counts by
11% and 22%, respectively, with "llvm-profdata show" modified to
deserialize all MemProfRecords.

…(NFC) This patch replaces llvm::SmallVector<Frame> with std::vector<Frame>. llvm::SmallVector<Frame> sets aside one inline element. Meanwhile, when I sort all call stacks by their lengths, the length at the first percentile is already 2. That is, 99 percent of call stacks do not take advantage of the inline element. Using std::vector<Frame> reduces the cycle and instruction counts by 4.5% and 10.3%, respectively, with "llvm-profdata show" modified to deserialize all MemProfRecords.

llvmbot · 2024-06-05T06:15:51Z

@llvm/pr-subscribers-pgo

@llvm/pr-subscribers-llvm-transforms

Author: Kazu Hirata (kazutakahirata)

Changes

This patch replaces llvm::SmallVector<Frame> with std::vector<Frame>.

llvm::SmallVector<Frame> sets aside one inline element. Meanwhile,
when I sort all call stacks by their lengths, the length at the first
percentile is already 2. That is, 99 percent of call stacks do not
take advantage of the inline element.

Using std::vector<Frame> reduces the cycle and instruction counts by
4.5% and 10.3%, respectively, with "llvm-profdata show" modified to
deserialize all MemProfRecords.

Full diff: https://github.com/llvm/llvm-project/pull/94432.diff

3 Files Affected:

(modified) llvm/include/llvm/ProfileData/MemProf.h (+9-10)
(modified) llvm/lib/ProfileData/MemProf.cpp (+1-2)
(modified) llvm/lib/Transforms/Instrumentation/MemProfiler.cpp (+2-2)

diff --git a/llvm/include/llvm/ProfileData/MemProf.h b/llvm/include/llvm/ProfileData/MemProf.h
index 406144d9db1e8..667b9fa033d75 100644
--- a/llvm/include/llvm/ProfileData/MemProf.h
+++ b/llvm/include/llvm/ProfileData/MemProf.h
@@ -364,7 +364,7 @@ struct IndexedAllocationInfo {
 // be used for temporary in-memory instances.
 struct AllocationInfo {
   // Same as IndexedAllocationInfo::CallStack with the frame contents inline.
-  llvm::SmallVector<Frame> CallStack;
+  std::vector<Frame> CallStack;
   // Same as IndexedAllocationInfo::Info;
   PortableMemInfoBlock Info;
 
@@ -446,8 +446,7 @@ struct IndexedMemProfRecord {
   // Convert IndexedMemProfRecord to MemProfRecord.  Callback is used to
   // translate CallStackId to call stacks with frames inline.
   MemProfRecord toMemProfRecord(
-      llvm::function_ref<llvm::SmallVector<Frame>(const CallStackId)> Callback)
-      const;
+      llvm::function_ref<std::vector<Frame>(const CallStackId)> Callback) const;
 
   // Returns the GUID for the function name after canonicalization. For
   // memprof, we remove any .llvm suffix added by LTO. MemProfRecords are
@@ -462,7 +461,7 @@ struct MemProfRecord {
   // Same as IndexedMemProfRecord::AllocSites with frame contents inline.
   llvm::SmallVector<AllocationInfo> AllocSites;
   // Same as IndexedMemProfRecord::CallSites with frame contents inline.
-  llvm::SmallVector<llvm::SmallVector<Frame>> CallSites;
+  llvm::SmallVector<std::vector<Frame>> CallSites;
 
   MemProfRecord() = default;
   MemProfRecord(
@@ -472,7 +471,7 @@ struct MemProfRecord {
       AllocSites.emplace_back(IndexedAI, IdToFrameCallback);
     }
     for (const ArrayRef<FrameId> Site : Record.CallSites) {
-      llvm::SmallVector<Frame> Frames;
+      std::vector<Frame> Frames;
       for (const FrameId Id : Site) {
         Frames.push_back(IdToFrameCallback(Id));
       }
@@ -490,7 +489,7 @@ struct MemProfRecord {
 
     if (!CallSites.empty()) {
       OS << "    CallSites:\n";
-      for (const llvm::SmallVector<Frame> &Frames : CallSites) {
+      for (const std::vector<Frame> &Frames : CallSites) {
         for (const Frame &F : Frames) {
           OS << "    -\n";
           F.printYAML(OS);
@@ -844,8 +843,8 @@ template <typename MapTy> struct CallStackIdConverter {
   CallStackIdConverter(const CallStackIdConverter &) = delete;
   CallStackIdConverter &operator=(const CallStackIdConverter &) = delete;
 
-  llvm::SmallVector<Frame> operator()(CallStackId CSId) {
-    llvm::SmallVector<Frame> Frames;
+  std::vector<Frame> operator()(CallStackId CSId) {
+    std::vector<Frame> Frames;
     auto CSIter = Map.find(CSId);
     if (CSIter == Map.end()) {
       LastUnmappedId = CSId;
@@ -886,8 +885,8 @@ struct LinearCallStackIdConverter {
                              std::function<Frame(LinearFrameId)> FrameIdToFrame)
       : CallStackBase(CallStackBase), FrameIdToFrame(FrameIdToFrame) {}
 
-  llvm::SmallVector<Frame> operator()(LinearCallStackId LinearCSId) {
-    llvm::SmallVector<Frame> Frames;
+  std::vector<Frame> operator()(LinearCallStackId LinearCSId) {
+    std::vector<Frame> Frames;
 
     const unsigned char *Ptr =
         CallStackBase +
diff --git a/llvm/lib/ProfileData/MemProf.cpp b/llvm/lib/ProfileData/MemProf.cpp
index 1d9860e0ea7e8..2b227a65c1d8f 100644
--- a/llvm/lib/ProfileData/MemProf.cpp
+++ b/llvm/lib/ProfileData/MemProf.cpp
@@ -338,8 +338,7 @@ IndexedMemProfRecord::deserialize(const MemProfSchema &Schema,
 }
 
 MemProfRecord IndexedMemProfRecord::toMemProfRecord(
-    llvm::function_ref<llvm::SmallVector<Frame>(const CallStackId)> Callback)
-    const {
+    llvm::function_ref<std::vector<Frame>(const CallStackId)> Callback) const {
   MemProfRecord Record;
 
   Record.AllocSites.reserve(AllocSites.size());
diff --git a/llvm/lib/Transforms/Instrumentation/MemProfiler.cpp b/llvm/lib/Transforms/Instrumentation/MemProfiler.cpp
index c0a3bf8464d2d..d70c6a7a0a152 100644
--- a/llvm/lib/Transforms/Instrumentation/MemProfiler.cpp
+++ b/llvm/lib/Transforms/Instrumentation/MemProfiler.cpp
@@ -759,7 +759,7 @@ static void readMemprof(Module &M, Function &F,
   std::map<uint64_t, std::set<const AllocationInfo *>> LocHashToAllocInfo;
   // For the callsites we need to record the index of the associated frame in
   // the frame array (see comments below where the map entries are added).
-  std::map<uint64_t, std::set<std::pair<const SmallVector<Frame> *, unsigned>>>
+  std::map<uint64_t, std::set<std::pair<const std::vector<Frame> *, unsigned>>>
       LocHashToCallSites;
   for (auto &AI : MemProfRec->AllocSites) {
     // Associate the allocation info with the leaf frame. The later matching
@@ -815,7 +815,7 @@ static void readMemprof(Module &M, Function &F,
       // and another callsite).
       std::map<uint64_t, std::set<const AllocationInfo *>>::iterator
           AllocInfoIter;
-      std::map<uint64_t, std::set<std::pair<const SmallVector<Frame> *,
+      std::map<uint64_t, std::set<std::pair<const std::vector<Frame> *,
                                             unsigned>>>::iterator CallSitesIter;
       for (const DILocation *DIL = I.getDebugLoc(); DIL != nullptr;
            DIL = DIL->getInlinedAt()) {

MaskRay · 2024-06-05T06:38:26Z

Consider SmallVector<Frame, 0>? sizeof(SmallVector<X, 0>) < sizeof(vector<X>)

kazutakahirata · 2024-06-05T07:39:04Z

Consider SmallVector<Frame, 0>? sizeof(SmallVector<X, 0>) < sizeof(vector<X>)

That's super interesting. Thanks! I just tried SmallVector<Frame, 0>. The cycle count goes up by 2.3% while the instruction goes down slightly by 0.09% (all relative to SmallVector<Frame>). My llvm-profdata is not built with ThinLTO, so maybe a certain call hierarchy is not collapsed!?

teresajohnson · 2024-06-06T15:11:54Z

This patch replaces llvm::SmallVector with std::vector.

llvm::SmallVector sets aside one inline element.

I thought it allocated a few elements (I guess more than 2 from what you found below)?

Meanwhile, when I sort all call stacks by their lengths, the length at the first percentile is already 2. That is, 99 percent of call stacks do not take advantage of the inline element.

I'm surprised that almost all are <=2: the allocation call contexts should almost always be longer than 2 frames in practice. Is this dominated by the CallSites frames? I can imagine those are short. Maybe that should use std::vector (or SmallVector<...,0>), and the AllocationInfo CallStack remain SmallVector?

kazutakahirata · 2024-06-06T20:23:24Z

This patch replaces llvm::SmallVector with std::vector.
llvm::SmallVector sets aside one inline element.

I thought it allocated a few elements (I guess more than 2 from what you found below)?

No, by default, the number of inlined elements is the maximum of:

1
the number of elements that would fit within 64 bytes, including SmallVector's data pointer, the length field, etc.

So, SmallVector<Frame> has only one inlined element because sizeof(Frame) == 32 (after #94655).

Meanwhile, when I sort all call stacks by their lengths, the length at the first percentile is already 2. That is, 99 percent of call stacks do not take advantage of the inline element.

I'm surprised that almost all are <=2: the allocation call contexts should almost always be longer than 2 frames in practice. Is this dominated by the CallSites frames? I can imagine those are short. Maybe that should use std::vector (or SmallVector<...,0>), and the AllocationInfo CallStack remain SmallVector?

I think you've got the percentiles backwards. I'm sorting the call stacks in the ascending order of their lengths. Here are the stats for call stack lengths:

the 1st percentile: 2 frames
the 5th percentile: 5 frames
the median: 72 frames

So, we rarely use the inlined element of SmallVector<Frame>.

With the smaller baseline cycle/instruction counts thanks to #94655, which just landed, using std::vector<Frame> here reduces the cycle and instruction counts by 11% and 22%, respectively. I've updated the commit message of this PR accordingly.

teresajohnson · 2024-06-06T20:34:18Z

This patch replaces llvm::SmallVector with std::vector.
llvm::SmallVector sets aside one inline element.

I thought it allocated a few elements (I guess more than 2 from what you found below)?

No, by default, the number of inlined elements is the maximum of:

1

the number of elements that would fit within 64 bytes, including SmallVector's data pointer, the length field, etc.

So, SmallVector<Frame> has only one inlined element because sizeof(Frame) == 32 (after #94655).

Meanwhile, when I sort all call stacks by their lengths, the length at the first percentile is already 2. That is, 99 percent of call stacks do not take advantage of the inline element.

I'm surprised that almost all are <=2: the allocation call contexts should almost always be longer than 2 frames in practice. Is this dominated by the CallSites frames? I can imagine those are short. Maybe that should use std::vector (or SmallVector<...,0>), and the AllocationInfo CallStack remain SmallVector?

I think you've got the percentiles backwards. I'm sorting the call stacks in the ascending order of their lengths. Here are the stats for call stack lengths:

the 1st percentile: 2 frames

the 5th percentile: 5 frames

the median: 72 frames

So, we rarely use the inlined element of SmallVector<Frame>.

Oh got it, yep I mis-read.

With the smaller baseline cycle/instruction counts thanks to #94655, which just landed, using std::vector<Frame> here reduces the cycle and instruction counts by 11% and 22%, respectively. I've updated the commit message of this PR accordingly.

dwblaikie · 2024-06-17T23:12:44Z

My recollection was that even SmallVector<T, 0> had some advantages over std::vector ( https://llvm.org/docs/ProgrammersManual.html#llvm-adt-smallvector-h mentions some possible reasons for that) - so maybe worth using that instead?

kazutakahirata · 2024-06-25T19:59:57Z

My recollection was that even SmallVector<T, 0> had some advantages over std::vector ( https://llvm.org/docs/ProgrammersManual.html#llvm-adt-smallvector-h mentions some possible reasons for that) - so maybe worth using that instead?

I've tried SmallVector<T, 0>, but that didn't improve the performance. Now, my llvm-profdata is not built with FDO+ThinLTO, so there may be some missed inlining opportunities or something that would make SmallVector<T, 0> a better choice than std::vector<T> here.

kazutakahirata requested review from teresajohnson and snehasish June 5, 2024 06:15

llvmbot added PGO Profile Guided Optimizations llvm:transforms labels Jun 5, 2024

teresajohnson approved these changes Jun 6, 2024

View reviewed changes

kazutakahirata merged commit 4a918f0 into llvm:main Jun 6, 2024
10 checks passed

kazutakahirata deleted the memprof_std_vector branch June 6, 2024 21:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[memprof] Use std::vector<Frame> instead of llvm::SmallVector<Frame> (NFC) #94432

[memprof] Use std::vector<Frame> instead of llvm::SmallVector<Frame> (NFC) #94432

Uh oh!

kazutakahirata commented Jun 5, 2024 •

edited

Loading

Uh oh!

llvmbot commented Jun 5, 2024 •

edited

Loading

Uh oh!

MaskRay commented Jun 5, 2024

Uh oh!

kazutakahirata commented Jun 5, 2024

Uh oh!

teresajohnson commented Jun 6, 2024

Uh oh!

kazutakahirata commented Jun 6, 2024

Uh oh!

teresajohnson commented Jun 6, 2024

Uh oh!

Uh oh!

dwblaikie commented Jun 17, 2024

Uh oh!

kazutakahirata commented Jun 25, 2024

Uh oh!

Uh oh!

[memprof] Use std::vector<Frame> instead of llvm::SmallVector<Frame> (NFC) #94432

[memprof] Use std::vector<Frame> instead of llvm::SmallVector<Frame> (NFC) #94432

Uh oh!

Conversation

kazutakahirata commented Jun 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Jun 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MaskRay commented Jun 5, 2024

Uh oh!

kazutakahirata commented Jun 5, 2024

Uh oh!

teresajohnson commented Jun 6, 2024

Uh oh!

kazutakahirata commented Jun 6, 2024

Uh oh!

teresajohnson commented Jun 6, 2024

Uh oh!

Uh oh!

dwblaikie commented Jun 17, 2024

Uh oh!

kazutakahirata commented Jun 25, 2024

Uh oh!

Uh oh!

kazutakahirata commented Jun 5, 2024 •

edited

Loading

llvmbot commented Jun 5, 2024 •

edited

Loading