
[memprof] Dump call site matching information #125130


Merged: 3 commits merged into llvm:main from memprof_dump_callsite on Feb 7, 2025

Conversation

@kazutakahirata (Contributor) commented on Jan 30, 2025:

MemProfiler.cpp annotates the IR with the memory profile so that we can later duplicate context. This patch dumps the entire inline call stack for each call site match.

MemProfiler.cpp annotates the IR with the memory profile so that we can later duplicate context. Dumping the call site matching information here allows us to analyze how well we manage to annotate the IR. Specifically, this patch dumps:

- the full stack ID (to identify the profile call stack)
- the index within the profile call stack where matching starts
- the size of InlinedCallStack

This way, we can see which part of the profile call stack we are matching, not just one frame somewhere in it.
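
For concreteness, with the version of the patch in the diff below, each matched call site is reported on a single line of the form (the angle-bracket fields are placeholders, not literal output):

    MemProf callsite <full stack ID> <index where matching starts> <size of InlinedCallStack>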

Now, obtaining the full stack ID requires a little bit of refactoring. This patch modifies the value type of LocHashToCallSites so that it contains the full stack as well as the starting index of a match. Essentially, this patch partially reverts:

  commit 7c294eb
  Author: Kazu Hirata <[email protected]>
  Date:   Sat Dec 14 00:03:27 2024 -0800
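
For reference, the value-type change described above looks as follows (excerpted from the diff below, with explanatory comments added):

    // Before: sliced call stacks keyed by the leaf frame's location hash.
    std::map<uint64_t, std::unordered_set<ArrayRef<Frame>, CallStackHash>>
        LocHashToCallSites;

    // After: the full profile call stack plus the index where matching starts.
    std::map<uint64_t, std::unordered_set<std::pair<ArrayRef<Frame>, unsigned>,
                                          CallStackHash>>
        LocHashToCallSites;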
@llvmbot (Member) commented on Jan 30, 2025:

@llvm/pr-subscribers-pgo

@llvm/pr-subscribers-llvm-transforms

Author: Kazu Hirata (kazutakahirata)

Full diff: https://github.com/llvm/llvm-project/pull/125130.diff

1 file affected:

  • (modified) llvm/lib/Transforms/Instrumentation/MemProfiler.cpp (+15-6)
diff --git a/llvm/lib/Transforms/Instrumentation/MemProfiler.cpp b/llvm/lib/Transforms/Instrumentation/MemProfiler.cpp
index 91c48338d032089..66d7466a92c3b39 100644
--- a/llvm/lib/Transforms/Instrumentation/MemProfiler.cpp
+++ b/llvm/lib/Transforms/Instrumentation/MemProfiler.cpp
@@ -1034,13 +1034,15 @@ readMemprof(Module &M, Function &F, IndexedInstrProfReader *MemProfReader,
   std::map<uint64_t, std::set<const AllocationInfo *>> LocHashToAllocInfo;
   // A hash function for std::unordered_set<ArrayRef<Frame>> to work.
   struct CallStackHash {
-    size_t operator()(ArrayRef<Frame> CS) const {
-      return computeFullStackId(CS);
+    size_t operator()(const std::pair<ArrayRef<Frame>, unsigned> &CS) const {
+      auto &[CallStack, Idx] = CS;
+      return computeFullStackId(ArrayRef<Frame>(CallStack).drop_front(Idx));
     }
   };
   // For the callsites we need to record slices of the frame array (see comments
   // below where the map entries are added).
-  std::map<uint64_t, std::unordered_set<ArrayRef<Frame>, CallStackHash>>
+  std::map<uint64_t, std::unordered_set<std::pair<ArrayRef<Frame>, unsigned>,
+                                        CallStackHash>>
       LocHashToCallSites;
   for (auto &AI : MemProfRec->AllocSites) {
     NumOfMemProfAllocContextProfiles++;
@@ -1058,7 +1060,7 @@ readMemprof(Module &M, Function &F, IndexedInstrProfReader *MemProfReader,
     unsigned Idx = 0;
     for (auto &StackFrame : CS) {
       uint64_t StackId = computeStackId(StackFrame);
-      LocHashToCallSites[StackId].insert(ArrayRef<Frame>(CS).drop_front(Idx++));
+      LocHashToCallSites[StackId].emplace(CS, Idx++);
       ProfileHasColumns |= StackFrame.Column;
       // Once we find this function, we can stop recording.
       if (StackFrame.Function == FuncGUID)
@@ -1201,15 +1203,22 @@ readMemprof(Module &M, Function &F, IndexedInstrProfReader *MemProfReader,
       // instruction's leaf location in the callsites map and not the allocation
       // map.
       assert(CallSitesIter != LocHashToCallSites.end());
-      for (auto CallStackIdx : CallSitesIter->second) {
+      for (auto &[ProfileCallStack, Idx] : CallSitesIter->second) {
         // If we found and thus matched all frames on the call, create and
         // attach call stack metadata.
-        if (stackFrameIncludesInlinedCallStack(CallStackIdx,
+        if (stackFrameIncludesInlinedCallStack(ProfileCallStack.drop_front(Idx),
                                                InlinedCallStack)) {
           NumOfMemProfMatchedCallSites++;
           addCallsiteMetadata(I, InlinedCallStack, Ctx);
           // Only need to find one with a matching call stack and add a single
           // callsite metadata.
+
+          // Dump call site matching information upon request.
+          if (ClPrintMemProfMatchInfo) {
+            uint64_t FullStackId = computeFullStackId(ProfileCallStack);
+            errs() << "MemProf callsite " << FullStackId << " " << Idx << " "
+                   << InlinedCallStack.size() << "\n";
+          }
           break;
         }
       }

@teresajohnson (Contributor) left a comment:

You should probably clarify in the description that this will just be the full profiled context of frames on the interior callsite (i.e. including all possible inlined frames), unlike in the allocation case where we have the whole allocation context.

Also, can you add a test?

@llvmbot llvmbot added the PGO Profile Guided Optimizations label Feb 5, 2025
@kazutakahirata (Contributor, Author) replied:

> You should probably clarify in the description that this will just be the full profiled context of frames on the interior callsite (i.e. including all possible inlined frames), unlike in the allocation case where we have the whole allocation context.

Thank you for pointing this out! I've switched to printing the entire inline call stack per line (see the sketch after this comment).

> Also, can you add a test?

Done in the latest revision.
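
A minimal sketch of the revised dump, assuming InlinedCallStack holds the per-frame stack IDs of the matched inline call stack (an assumption; this is not the verbatim committed code, and the exact output is what the test below checks):

    // Sketch: print one line per match, listing the stack ID of every frame
    // in the matched inline call stack instead of a single full stack ID.
    if (ClPrintMemProfMatchInfo) {
      errs() << "MemProf callsite match for inline call stack";
      for (uint64_t StackId : InlinedCallStack)
        errs() << " " << StackId;
      errs() << "\n";
    }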


github-actions bot commented Feb 5, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -0,0 +1,114 @@
; Tests that the compiler dumps call site matches upon request.
;
; The test case is generated from:
A Contributor commented:

Can we use Elaborated Tests to generate the yaml and the IR?
https://llvm.org/docs/TestingGuide.html#elaborated-tests

@kazutakahirata (Contributor, Author) replied:

Thanks for the pointer! I wasn't aware of the gen feature, but the ability to regenerate the whole thing is very appealing. That said, I like cleaned-up IR in this particular case because I care about the debugging information being human readable. I'll definitely keep "Elaborated Tests" in my toolbox. Thanks again!

...
;--- memprof-dump-matched-call-site.ll
; CHECK: MemProf notcold context with id 3894143216621363392 has total profiled size 4 is matched
; CHECK: MemProf callsite match for inline call stack 4745611964195289084 10616861955219347331
A Contributor commented:

These will be impossible to compare against the allocation match messages which use the full context hash - should we (optionally?) emit the constituent stack ids of the allocations as well?

@kazutakahirata (Contributor, Author) replied on Feb 5, 2025:

Right, we cannot compare these stack IDs against anything else dumped by the compiler. I am thinking about an external tool to dump the full call stack along with its constituent stack IDs.

If we are interested in assessing how well we are matching call sites, we need an external tool to know the universe (i.e. the list of all full call stacks) so that we can compute the set complement -- "universe" - "all call site matches" (roughly speaking). We could ask one invocation of the compiler to dump the entire profile contents, including constituent stack IDs, but picking one compiler invocation feels a bit weird. It's cleaner to have a separate tool to dump the profile IMHO.
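
As a rough illustration of the set-complement idea (a hypothetical external tool, not part of this patch): given the universe of full call stack IDs from the profile and the IDs matched by the compiler, the unmatched call stacks are simply the difference of the two sets.

    // Hypothetical sketch: compute "universe" minus "all call site matches".
    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <set>
    #include <vector>

    std::vector<uint64_t> unmatchedCallStacks(const std::set<uint64_t> &Universe,
                                              const std::set<uint64_t> &Matched) {
      std::vector<uint64_t> Unmatched;
      std::set_difference(Universe.begin(), Universe.end(), Matched.begin(),
                          Matched.end(), std::back_inserter(Unmatched));
      return Unmatched;
    }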

A Contributor replied:

The other alternative is to have the compiler optionally dump out this correspondence for each matched allocation; however, there is already a need to deduplicate these across compilations, and that would increase the volume quite a bit. The upside is that we wouldn't need another tool and tool invocation. The other upside is that the compiler trims unneeded parts of the context during matching, which would reduce the volume and the need for redoing this analysis externally. But that can be considered for a follow-on change if needed.

A Contributor replied:

@teresajohnson can you elaborate on this -- compiler trims unneeded parts of the context during matching? Are you referring to this FIXME?

A Contributor replied:

No, I meant the trimming done in CallStackTrie::buildAndAttachMIBMetadata, called here:

bool MemprofMDAttached = AllocTrie.buildAndAttachMIBMetadata(CI);

A Contributor replied:

Chatted with @teresajohnson about this offline -- the fact that it trims unneeded contexts is orthogonal to how we want to evaluate matching quality, so we can continue with the current approach. We can evaluate whether more precision is required if simple heuristics for prioritization (e.g., looking at N frames from the leaf) don't yield good results.

@teresajohnson (Contributor) left a comment:

lgtm

@kazutakahirata merged commit b7feccb into llvm:main on Feb 7, 2025 (8 checks passed).
@kazutakahirata deleted the memprof_dump_callsite branch on February 7, 2025 at 07:37.
Icohedron pushed a commit to Icohedron/llvm-project that referenced this pull request on Feb 11, 2025.
Labels: llvm:transforms, PGO Profile Guided Optimizations