[LLVM][DecoderEmitter] Add option to use function table in decodeToMCInst #144814

Open: jurahul wants to merge 3 commits into main from decoder_to_mcinst_lambda

Conversation

@jurahul (Contributor) commented Jun 18, 2025

Add option `use-fn-table-in-decode-to-mcinst` to use a table of function pointers instead of a switch statement in the generated `decodeToMCInst` function.

When the number of switch cases in this function is large, the generated code takes a long time to compile in release builds. Using a table of function pointers instead improves the compile time significantly (~3x speedup when compiling the generated code for a downstream target). This option will allow targets to opt into this mode if they want better build times.

Tested with `check-llvm-mc` with the option enabled by default.
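For illustration, here is a rough sketch of the two shapes the generated decodeToMCInst can take. It mirrors the emitter changes in the diff below, but the signature details, the decodeLambda0/decodeLambda1 bodies, and the surrounding includes/usings are placeholders rather than actual TableGen output (the real code is emitted into the target's *GenDisassemblerTables.inc):

#include "llvm/ADT/STLFunctionalExtras.h" // llvm::function_ref
#include "llvm/MC/MCDisassembler/MCDisassembler.h"
#include "llvm/MC/MCInst.h"
#include "llvm/Support/ErrorHandling.h"
#include <iterator>
#include <type_traits>

using namespace llvm;
using DecodeStatus = MCDisassembler::DecodeStatus;

// Sketch of the generated decodeToMCInst; the two decoder bodies are placeholders.
template <typename InsnType>
static DecodeStatus decodeToMCInst(DecodeStatus S, unsigned Idx, InsnType insn,
                                   MCInst &MI, uint64_t Address,
                                   const MCDisassembler *Decoder,
                                   bool &DecodeComplete) {
  using TmpType =
      std::conditional_t<std::is_integral<InsnType>::value, InsnType, uint64_t>;

  // Default mode: a single switch over Idx, with every decoder body inside
  // this one (potentially huge) function:
  //
  //   TmpType tmp;
  //   switch (Idx) {
  //   default: llvm_unreachable("Invalid index!");
  //   case 0: /* decoder body 0 */ return S;
  //   case 1: /* decoder body 1 */ return S;
  //   }

  // Opt-in mode: one small lambda per case plus a table indexed by Idx, so the
  // optimizer never sees a single giant function.
  auto decodeLambda0 = [](DecodeStatus S, InsnType insn, MCInst &MI,
                          uint64_t Address, const MCDisassembler *Decoder,
                          bool &DecodeComplete) {
    [[maybe_unused]] TmpType tmp;
    // ... decoder body 0 ...
    return S;
  };
  auto decodeLambda1 = [](DecodeStatus S, InsnType insn, MCInst &MI,
                          uint64_t Address, const MCDisassembler *Decoder,
                          bool &DecodeComplete) {
    [[maybe_unused]] TmpType tmp;
    // ... decoder body 1 ...
    return S;
  };
  using LambdaTy =
      function_ref<DecodeStatus(DecodeStatus, InsnType, MCInst &, uint64_t,
                                const MCDisassembler *, bool &)>;
  // The emitter makes this table `const static`; it is non-static here so the
  // function_refs never outlive the lambdas they point to.
  const LambdaTy decodeLambdaTable[] = {decodeLambda0, decodeLambda1};
  if (Idx >= std::size(decodeLambdaTable))
    llvm_unreachable("Invalid index!");
  return decodeLambdaTable[Idx](S, insn, MI, Address, Decoder, DecodeComplete);
}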

@jurahul jurahul force-pushed the decoder_to_mcinst_lambda branch 4 times, most recently from 98602f8 to 4e39dae on June 19, 2025 at 14:46
@jurahul (Contributor, Author) commented Jun 20, 2025

Note: there is room for further refinement here, such as using the lambda only when the number of switch cases exceeds a threshold, and adding a benchmark (using AMDGPU as a target, which has ~1700 cases here) to measure the performance delta between the switch-case and lambda versions (I am assuming the lambda version is slower). I am thinking that can come later, based on need, once this basic feature is adopted and we have more data.

Also noting that this code is (AFAIK) not part of the "regular" compilation flow (i.e., not built into llc).

@jurahul jurahul marked this pull request as ready for review June 20, 2025 15:31
@jurahul jurahul requested a review from topperc June 20, 2025 15:31
@llvmbot (Member) commented Jun 20, 2025

@llvm/pr-subscribers-tablegen

Author: Rahul Joshi (jurahul)

Changes

Add option use-lambda-in-decode-to-mcinst to use a table of lambdas instead of a switch case in the generated decodeToMCInst function.

When the number of switch cases in this function is large, the generated code takes a long time to compile in release builds. Using a table of lambdas instead improves the compile time significantly (~3x speedup in compiling the code in a downstream target). This option will allow targets to opt into this mode if they want better build times.

Tested with check-llvm-mc with the option enabled by default.


Full diff: https://github.com/llvm/llvm-project/pull/144814.diff

2 Files Affected:

  • (added) llvm/test/TableGen/DecoderEmitterLambda.td (+84)
  • (modified) llvm/utils/TableGen/DecoderEmitter.cpp (+47-8)
diff --git a/llvm/test/TableGen/DecoderEmitterLambda.td b/llvm/test/TableGen/DecoderEmitterLambda.td
new file mode 100644
index 0000000000000..4926c8d7def66
--- /dev/null
+++ b/llvm/test/TableGen/DecoderEmitterLambda.td
@@ -0,0 +1,84 @@
+// RUN: llvm-tblgen -gen-disassembler -use-lambda-in-decode-to-mcinst -I %p/../../include %s | FileCheck %s
+
+include "llvm/Target/Target.td"
+
+def archInstrInfo : InstrInfo { }
+
+def arch : Target {
+  let InstructionSet = archInstrInfo;
+}
+
+let Namespace = "arch" in {
+  def R0 : Register<"r0">;
+  def R1 : Register<"r1">;
+  def R2 : Register<"r2">;
+  def R3 : Register<"r3">;
+}
+def Regs : RegisterClass<"Regs", [i32], 32, (add R0, R1, R2, R3)>;
+
+class TestInstruction : Instruction {
+  let Size = 1;
+  let OutOperandList = (outs);
+  field bits<8> Inst;
+  field bits<8> SoftFail = 0;
+}
+
+// Define instructions to generate 4 cases in decodeToMCInst.
+// Lower 2 bits define the number of operands. Each register operand
+// needs 2 bits to encode.
+
+// An instruction with no inputs. Encoded with lower 2 bits = 0 and upper
+// 6 bits = 0 as well.
+def Inst0 : TestInstruction {
+  let Inst = 0x0;
+  let InOperandList = (ins);
+  let AsmString = "Inst0";
+}
+
+// An instruction with a single input. Encoded with lower 2 bits = 1 and the
+// single input in bits 2-3.
+def Inst1 : TestInstruction {
+  bits<2> r0;
+  let Inst{1-0} = 1;
+  let Inst{3-2} = r0;
+  let InOperandList = (ins Regs:$r0);
+  let AsmString = "Inst1";
+}
+
+// An instruction with two inputs. Encoded with lower 2 bits = 2 and the
+// inputs in bits 2-3 and 4-5.
+def Inst2 : TestInstruction {
+  bits<2> r0;
+  bits<2> r1;
+  let Inst{1-0} = 2;
+  let Inst{3-2} = r0;
+  let Inst{5-4} = r1;
+  let InOperandList = (ins Regs:$r0, Regs:$r1);
+  let AsmString = "Inst2";
+}
+
+// An instruction with three inputs. Encoded with lower 2 bits = 3 and the
+// inputs in bits 2-3 and 4-5 and 6-7.
+def Inst3 : TestInstruction {
+  bits<2> r0;
+  bits<2> r1;
+  bits<2> r2;
+  let Inst{1-0} = 3;
+  let Inst{3-2} = r0;
+  let Inst{5-4} = r1;
+  let Inst{7-6} = r2;
+  let InOperandList = (ins Regs:$r0, Regs:$r1, Regs:$r2);
+  let AsmString = "Inst3";
+}
+
+// CHECK-LABEL: decodeToMCInst
+// CHECK: decodeLambda0 =
+// CHECK: decodeLambda1 =
+// CHECK: decodeLambda2 =
+// CHECK: decodeLambda3 =
+// CHECK: decodeLambdaTable[]
+// CHECK-NEXT: decodeLambda0
+// CHECK-NEXT: decodeLambda1
+// CHECK-NEXT: decodeLambda2
+// CHECK-NEXT: decodeLambda3
+// CHECK: return decodeLambdaTable[Idx]
diff --git a/llvm/utils/TableGen/DecoderEmitter.cpp b/llvm/utils/TableGen/DecoderEmitter.cpp
index 37814113b467a..4d5225e21680b 100644
--- a/llvm/utils/TableGen/DecoderEmitter.cpp
+++ b/llvm/utils/TableGen/DecoderEmitter.cpp
@@ -83,6 +83,13 @@ static cl::opt<bool> LargeTable(
              "in the table instead of the default 16 bits."),
     cl::init(false), cl::cat(DisassemblerEmitterCat));
 
+static cl::opt<bool> UseLambdaInDecodetoMCInst(
+    "use-lambda-in-decode-to-mcinst",
+    cl::desc("Use a table of lambdas instead of a switch case in the\n"
+             "generated `decodeToMCInst` function. Helps improve compile time\n"
+             "of the generated code."),
+    cl::init(false), cl::cat(DisassemblerEmitterCat));
+
 STATISTIC(NumEncodings, "Number of encodings considered");
 STATISTIC(NumEncodingsLackingDisasm,
           "Number of encodings without disassembler info");
@@ -1082,15 +1089,47 @@ void DecoderEmitter::emitDecoderFunction(formatted_raw_ostream &OS,
      << "using TmpType = "
         "std::conditional_t<std::is_integral<InsnType>::"
         "value, InsnType, uint64_t>;\n";
-  OS << Indent << "TmpType tmp;\n";
-  OS << Indent << "switch (Idx) {\n";
-  OS << Indent << "default: llvm_unreachable(\"Invalid index!\");\n";
-  for (const auto &[Index, Decoder] : enumerate(Decoders)) {
-    OS << Indent << "case " << Index << ":\n";
-    OS << Decoder;
-    OS << Indent + 2 << "return S;\n";
+
+  if (UseLambdaInDecodetoMCInst) {
+    // Emit one lambda for each case first.
+    for (const auto &[Index, Decoder] : enumerate(Decoders)) {
+      OS << Indent << "auto decodeLambda" << Index << " = [](DecodeStatus S,\n"
+         << Indent << "                   InsnType insn, MCInst &MI,\n"
+         << Indent << "                   uint64_t Address, \n"
+         << Indent << "                   const MCDisassembler *Decoder,\n"
+         << Indent << "                   bool &DecodeComplete) {\n";
+      OS << Indent + 2 << "[[maybe_unused]] TmpType tmp;\n";
+      OS << Decoder;
+      OS << Indent + 2 << "return S;\n";
+      OS << Indent << "};\n";
+    }
+    // Build a table of lambdas.
+
+    OS << R"(
+  using LambdaTy =
+      function_ref<DecodeStatus(DecodeStatus, InsnType, MCInst &, uint64_t,
+                                const MCDisassembler *, bool &)>;
+    )";
+    OS << Indent << "const static LambdaTy decodeLambdaTable[] = {\n";
+    for (size_t Index : llvm::seq(Decoders.size()))
+      OS << Indent + 2 << "decodeLambda" << Index << ",\n";
+    OS << Indent << "};\n";
+    OS << Indent << "if (Idx >= " << Decoders.size() << ")\n";
+    OS << Indent + 2 << "llvm_unreachable(\"Invalid index!\");\n";
+    OS << Indent
+       << "return decodeLambdaTable[Idx](S, insn, MI, Address, Decoder, "
+          "DecodeComplete);\n";
+  } else {
+    OS << Indent << "TmpType tmp;\n";
+    OS << Indent << "switch (Idx) {\n";
+    OS << Indent << "default: llvm_unreachable(\"Invalid index!\");\n";
+    for (const auto &[Index, Decoder] : enumerate(Decoders)) {
+      OS << Indent << "case " << Index << ":\n";
+      OS << Decoder;
+      OS << Indent + 2 << "return S;\n";
+    }
+    OS << Indent << "}\n";
   }
-  OS << Indent << "}\n";
   Indent -= 2;
   OS << Indent << "}\n";
 }

@mshockwave (Member) left a comment

Using a table of lambdas instead improves the compile time significantly (~3x speedup in compiling the code in a downstream target)

Do you know why? My guess is that it'll spend lots of time on control flow related optimizations, while the lambda / function approach scales better in terms of compilation time.

Aside from that, I think the question now becomes whether the LLVM you built runs slower: instead of plain switch cases you start to pay the price of function calls, and inlining might not always kick in (even in an -O3 build) given the sheer number of callees to be inlined. Have you done any related measurements regarding this?

@jurahul (Contributor, Author) commented Jun 21, 2025

I did not look into exactly why the compile time is so large with the switch case, but I am guessing there is a single function with very large IR, and using lambdas (which won't get inlined) avoids that. As expected, it probably comes with some performance cost, both due to the indirect function call and due to losing any cross-switch-case optimizations that may happen in the switch version. See my note above about further refinement, which includes setting up a benchmark to measure this. For our use case, a 3x reduction in compile time may be enough to adopt this in some builds (for example, pre-commit CI can use this mode, while post-commit builds can still use the switch case).

…oMCInst

Add option `use-lambda-in-decode-to-mcinst` to use a table of lambdas
instead of a switch case in the generated `decodeToMCInst` function.

When the number of switch cases in this function is large, the generated
code takes a long time to compile in release builds. Using a table of
lambdas instead improves the compile time significantly (~3x speedup
in compiling the code in a downstream target). This option will allow
targets to opt into this mode if they desire for better build times.

Tested with `check-llvm-mc` with the option enabled by default.
@jurahul (Contributor, Author) commented Jun 21, 2025

I did a few things here after converting the PR to emit static functions.

(a) I measured the compile time as reported by clang on our downstream code (I am compiling with clang-18, as that's what I have installed). In the switch version, the three worst offenders for compile time are:

  663.5226 ( 82.0%)   0.0003 (  0.0%)  663.5229 ( 81.8%)  663.5452 ( 81.8%)  Two-Address instruction pass
  149.9866 ( 47.4%)   0.0008 (  1.3%)  149.9874 ( 47.4%)  149.9874 ( 47.4%)  SimplifyCFGPass
  120.5514 ( 38.1%)   0.0103 ( 17.3%)  120.5618 ( 38.1%)  120.6282 ( 38.1%)  SROAPass

===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 1148.0450 seconds (1148.2543 wall clock)

Total compile time is ~19 minutes. Unfortunately, it's not possible for me to file a bug report with the input, as it contains non-public code. It may be worth seeing if a newer version of clang improves things, but in any case we need to support the tools folks might actually use, so we will still need this change for improved build times.

(b) I compiled the file with https://github.com/llvm/llvm-project/releases/tag/llvmorg-20.1.7 binaries and I see the following improvements in the switch version:

 702.9634 ( 84.5%)   0.0001 (  0.0%)  702.9634 ( 84.5%)  703.0168 ( 84.5%)  Two-Address instruction pass
  65.8698 ( 41.9%)   0.0005 (  0.4%)  65.8702 ( 41.9%)  65.9420 ( 41.9%)  SROAPass
  47.4754 ( 30.2%)   0.0006 (  0.5%)  47.4760 ( 30.2%)  47.5398 ( 30.2%)  SimplifyCFGPass 

===-------------------------------------------------------------------------===
                               Clang time report
===-------------------------------------------------------------------------===
  Total Execution Time: 1004.9997 seconds (1005.5482 wall clock)

So compile time has improved from ~19 minutes to ~16.7 minutes, but the issue still persists in the Two-Address instruction pass.
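For reference, per-pass breakdowns like the ones above are what clang prints with the -ftime-report flag. A hypothetical invocation (the source file name is a placeholder for whichever target disassembler file includes the generated tables):

  clang++ -c -O2 -ftime-report MyTargetDisassembler.cpp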

(c) I set up some profiling code in llvm-mc that tries to disassemble each byte pattern a large number of times and profiles the loop using the TimeTraceScope API, and ran it with 3 AMDGPU llvm-mc unit tests (AMDGPU has the largest number of cases, ~1500, in the upstream code). The results are below (first run with function pointers, second run with switch). It seems the switch-case version is actually slower than the function-pointer version. It could be measurement noise, but the signal seems consistent. The changes for profiling are here for reference (including the time.sh script): main...jurahul:llvm-project:llvm_mc_profile. A sketch of this kind of measurement loop is shown after the timings.

$ source time.sh 
                "avg ms": 539
                "avg ms": 604
                "avg ms": 219
$ source time.sh
                "avg ms": 571
                "avg ms": 632
                "avg ms": 231

@jurahul jurahul force-pushed the decoder_to_mcinst_lambda branch from 4e39dae to e3eb094 on June 21, 2025 at 17:28
@jurahul jurahul changed the title [LLVM][DecoderEmitter] Add option to use lambdas in decodeToMCInst [LLVM][DecoderEmitter] Add option to use funcion table in decodeToMCInst Jun 21, 2025
@jurahul jurahul changed the title [LLVM][DecoderEmitter] Add option to use funcion table in decodeToMCInst [LLVM][DecoderEmitter] Add option to use function table in decodeToMCInst Jun 21, 2025
@jayfoad (Contributor) commented Jun 23, 2025

(a) I measured the compile time as reported by clang on our downstream code (I am compiling using clang-18 as that's what I have installed.

When I was working on #117351 I tested various clang versions from 11 onwards and I found that clang-18 and clang-19 were by far the slowest. clang "trunk" (as it was in November 2024) was much better, like over 10x faster in some cases of extremely large switches.

@jurahul (Contributor, Author) commented Jun 23, 2025

(a) I measured the compile time as reported by clang on our downstream code (I am compiling using clang-18 as that's what I have installed.

When I was working on #117351 I tested various clang versions from 11 onwards and I found that clang-18 and clang-19 were by far the slowest. clang "trunk" (as it was in November 2024) was much better, like over 10x faster in some cases of extremely large switches.

I am happy to try that as well. Apart from building it, is there any other way to get the trunk version of clang as a downloadable package?

@mshockwave (Member) commented

( 81.8%) Two-Address instruction pass

This one, IIRC, scales linearly with the number of instructions. So it seems the compilation slowdown was primarily caused by the function simply being too big.

( 47.4%) SimplifyCFGPass

This is kind of expected.

I setup some profiling code in llvm-mc that will try to disassemble each byte pattern some large number of times and profile the loop using the TimeTraceScope API and ran it with 3 AMDGPU llvm-mc unit tests (since it has the largest number of cases ~1500 in the upstream code), and I see the following (first run with function pointers, second run with switch). It seems the switch-case version is actually slower than the function pointer version. It could be measurement noise, but the signal seems consistent.

Thanks for the experiments!

@mshockwave (Member) left a comment

LGTM

@jurahul (Contributor, Author) commented Jun 23, 2025

( 81.8%) Two-Address instruction pass

This one, IIRC, scales linearly with the number of instructions. So it seems the compilation slowdown was primarily caused by the function simply being too big.

There is still some mystery here, since the number of instructions generated is the same in both cases (?). That is, the amount of code is the same, just split across functions (at a very coarse level). This suggests to me that either some code duplication is happening in the switch-case version, leading to a super-linear increase in code size before the two-address pass, or the pass itself is super-linear. But that would need more investigation with the actual test case.
