[LLVM][DecoderEmitter] Add option to use function table in decodeToMCInst #144814

Open: jurahul wants to merge 3 commits into main from decoder_to_mcinst_lambda

Conversation

@jurahul (Contributor) commented Jun 18, 2025

Add option `use-fn-table-in-decode-to-mcinst` to use a table of function pointers instead of a switch statement in the generated `decodeToMCInst` function.

When the number of switch cases in this function is large, the generated code takes a long time to compile in release builds. Using a table of function pointers instead improves the compile time significantly (~3x speedup when compiling the generated code for a downstream target). This option will allow targets to opt into this mode if they want better build times.

Tested with `check-llvm-mc` with the option enabled by default.
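For illustration, here is a rough sketch of the two shapes the generated decodeToMCInst can take. It mirrors the emitter changes in the diff below, but the signature details, the decodeLambda0/decodeLambda1 bodies, and the surrounding includes/usings are placeholders rather than actual TableGen output (the real code is emitted into the target's *GenDisassemblerTables.inc):

#include "llvm/ADT/STLFunctionalExtras.h" // llvm::function_ref
#include "llvm/MC/MCDisassembler/MCDisassembler.h"
#include "llvm/MC/MCInst.h"
#include "llvm/Support/ErrorHandling.h"
#include <iterator>
#include <type_traits>

using namespace llvm;
using DecodeStatus = MCDisassembler::DecodeStatus;

// Sketch of the generated decodeToMCInst; the two decoder bodies are placeholders.
template <typename InsnType>
static DecodeStatus decodeToMCInst(DecodeStatus S, unsigned Idx, InsnType insn,
                                   MCInst &MI, uint64_t Address,
                                   const MCDisassembler *Decoder,
                                   bool &DecodeComplete) {
  using TmpType =
      std::conditional_t<std::is_integral<InsnType>::value, InsnType, uint64_t>;

  // Default mode: a single switch over Idx, with every decoder body inside
  // this one (potentially huge) function:
  //
  //   TmpType tmp;
  //   switch (Idx) {
  //   default: llvm_unreachable("Invalid index!");
  //   case 0: /* decoder body 0 */ return S;
  //   case 1: /* decoder body 1 */ return S;
  //   }

  // Opt-in mode: one small lambda per case plus a table indexed by Idx, so the
  // optimizer never sees a single giant function.
  auto decodeLambda0 = [](DecodeStatus S, InsnType insn, MCInst &MI,
                          uint64_t Address, const MCDisassembler *Decoder,
                          bool &DecodeComplete) {
    [[maybe_unused]] TmpType tmp;
    // ... decoder body 0 ...
    return S;
  };
  auto decodeLambda1 = [](DecodeStatus S, InsnType insn, MCInst &MI,
                          uint64_t Address, const MCDisassembler *Decoder,
                          bool &DecodeComplete) {
    [[maybe_unused]] TmpType tmp;
    // ... decoder body 1 ...
    return S;
  };
  using LambdaTy =
      function_ref<DecodeStatus(DecodeStatus, InsnType, MCInst &, uint64_t,
                                const MCDisassembler *, bool &)>;
  // The emitter makes this table `const static`; it is non-static here so the
  // function_refs never outlive the lambdas they point to.
  const LambdaTy decodeLambdaTable[] = {decodeLambda0, decodeLambda1};
  if (Idx >= std::size(decodeLambdaTable))
    llvm_unreachable("Invalid index!");
  return decodeLambdaTable[Idx](S, insn, MI, Address, Decoder, DecodeComplete);
}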

@jurahul jurahul force-pushed the decoder_to_mcinst_lambda branch 4 times, most recently from 98602f8 to 4e39dae on June 19, 2025 at 14:46
@jurahul (Contributor, Author) commented Jun 20, 2025

Note: there is room for further refinement here, such as using the lambda only when the number of switch cases exceeds a threshold, and adding a benchmark (using AMDGPU as a target, which has ~1700 cases here) to measure the performance delta between the switch-case and lambda versions (I am assuming the lambda version is slower). I am thinking that can come later, based on need, once this basic feature is adopted and we have more data.

Also noting that this code is (AFAIK) not part of the "regular" compilation flow (i.e., not built into llc).

@jurahul jurahul marked this pull request as ready for review June 20, 2025 15:31
@jurahul jurahul requested a review from topperc June 20, 2025 15:31
@llvmbot (Member) commented Jun 20, 2025

@llvm/pr-subscribers-tablegen

Author: Rahul Joshi (jurahul)

Changes

Add option use-lambda-in-decode-to-mcinst to use a table of lambdas instead of a switch case in the generated decodeToMCInst function.

When the number of switch cases in this function is large, the generated code takes a long time to compile in release builds. Using a table of lambdas instead improves the compile time significantly (~3x speedup in compiling the code in a downstream target). This option will allow targets to opt into this mode if they want better build times.

Tested with check-llvm-mc with the option enabled by default.


Full diff: https://github.com/llvm/llvm-project/pull/144814.diff

2 Files Affected:

  • (added) llvm/test/TableGen/DecoderEmitterLambda.td (+84)
  • (modified) llvm/utils/TableGen/DecoderEmitter.cpp (+47-8)
diff --git a/llvm/test/TableGen/DecoderEmitterLambda.td b/llvm/test/TableGen/DecoderEmitterLambda.td
new file mode 100644
index 0000000000000..4926c8d7def66
--- /dev/null
+++ b/llvm/test/TableGen/DecoderEmitterLambda.td
@@ -0,0 +1,84 @@
+// RUN: llvm-tblgen -gen-disassembler -use-lambda-in-decode-to-mcinst -I %p/../../include %s | FileCheck %s
+
+include "llvm/Target/Target.td"
+
+def archInstrInfo : InstrInfo { }
+
+def arch : Target {
+  let InstructionSet = archInstrInfo;
+}
+
+let Namespace = "arch" in {
+  def R0 : Register<"r0">;
+  def R1 : Register<"r1">;
+  def R2 : Register<"r2">;
+  def R3 : Register<"r3">;
+}
+def Regs : RegisterClass<"Regs", [i32], 32, (add R0, R1, R2, R3)>;
+
+class TestInstruction : Instruction {
+  let Size = 1;
+  let OutOperandList = (outs);
+  field bits<8> Inst;
+  field bits<8> SoftFail = 0;
+}
+
+// Define instructions to generate 4 cases in decodeToMCInst.
+// Lower 2 bits define the number of operands. Each register operand
+// needs 2 bits to encode.
+
+// An instruction with no inputs. Encoded with lower 2 bits = 0 and upper
+// 6 bits = 0 as well.
+def Inst0 : TestInstruction {
+  let Inst = 0x0;
+  let InOperandList = (ins);
+  let AsmString = "Inst0";
+}
+
+// An instruction with a single input. Encoded with lower 2 bits = 1 and the
+// single input in bits 2-3.
+def Inst1 : TestInstruction {
+  bits<2> r0;
+  let Inst{1-0} = 1;
+  let Inst{3-2} = r0;
+  let InOperandList = (ins Regs:$r0);
+  let AsmString = "Inst1";
+}
+
+// An instruction with two inputs. Encoded with lower 2 bits = 2 and the
+// inputs in bits 2-3 and 4-5.
+def Inst2 : TestInstruction {
+  bits<2> r0;
+  bits<2> r1;
+  let Inst{1-0} = 2;
+  let Inst{3-2} = r0;
+  let Inst{5-4} = r1;
+  let InOperandList = (ins Regs:$r0, Regs:$r1);
+  let AsmString = "Inst2";
+}
+
+// An instruction with three inputs. Encoded with lower 2 bits = 3 and the
+// inputs in bits 2-3 and 4-5 and 6-7.
+def Inst3 : TestInstruction {
+  bits<2> r0;
+  bits<2> r1;
+  bits<2> r2;
+  let Inst{1-0} = 3;
+  let Inst{3-2} = r0;
+  let Inst{5-4} = r1;
+  let Inst{7-6} = r2;
+  let InOperandList = (ins Regs:$r0, Regs:$r1, Regs:$r2);
+  let AsmString = "Inst3";
+}
+
+// CHECK-LABEL: decodeToMCInst
+// CHECK: decodeLambda0 =
+// CHECK: decodeLambda1 =
+// CHECK: decodeLambda2 =
+// CHECK: decodeLambda3 =
+// CHECK: decodeLambdaTable[]
+// CHECK-NEXT: decodeLambda0
+// CHECK-NEXT: decodeLambda1
+// CHECK-NEXT: decodeLambda2
+// CHECK-NEXT: decodeLambda3
+// CHECK: return decodeLambdaTable[Idx]
diff --git a/llvm/utils/TableGen/DecoderEmitter.cpp b/llvm/utils/TableGen/DecoderEmitter.cpp
index 37814113b467a..4d5225e21680b 100644
--- a/llvm/utils/TableGen/DecoderEmitter.cpp
+++ b/llvm/utils/TableGen/DecoderEmitter.cpp
@@ -83,6 +83,13 @@ static cl::opt<bool> LargeTable(
              "in the table instead of the default 16 bits."),
     cl::init(false), cl::cat(DisassemblerEmitterCat));
 
+static cl::opt<bool> UseLambdaInDecodetoMCInst(
+    "use-lambda-in-decode-to-mcinst",
+    cl::desc("Use a table of lambdas instead of a switch case in the\n"
+             "generated `decodeToMCInst` function. Helps improve compile time\n"
+             "of the generated code."),
+    cl::init(false), cl::cat(DisassemblerEmitterCat));
+
 STATISTIC(NumEncodings, "Number of encodings considered");
 STATISTIC(NumEncodingsLackingDisasm,
           "Number of encodings without disassembler info");
@@ -1082,15 +1089,47 @@ void DecoderEmitter::emitDecoderFunction(formatted_raw_ostream &OS,
      << "using TmpType = "
         "std::conditional_t<std::is_integral<InsnType>::"
         "value, InsnType, uint64_t>;\n";
-  OS << Indent << "TmpType tmp;\n";
-  OS << Indent << "switch (Idx) {\n";
-  OS << Indent << "default: llvm_unreachable(\"Invalid index!\");\n";
-  for (const auto &[Index, Decoder] : enumerate(Decoders)) {
-    OS << Indent << "case " << Index << ":\n";
-    OS << Decoder;
-    OS << Indent + 2 << "return S;\n";
+
+  if (UseLambdaInDecodetoMCInst) {
+    // Emit one lambda for each case first.
+    for (const auto &[Index, Decoder] : enumerate(Decoders)) {
+      OS << Indent << "auto decodeLambda" << Index << " = [](DecodeStatus S,\n"
+         << Indent << "                   InsnType insn, MCInst &MI,\n"
+         << Indent << "                   uint64_t Address, \n"
+         << Indent << "                   const MCDisassembler *Decoder,\n"
+         << Indent << "                   bool &DecodeComplete) {\n";
+      OS << Indent + 2 << "[[maybe_unused]] TmpType tmp;\n";
+      OS << Decoder;
+      OS << Indent + 2 << "return S;\n";
+      OS << Indent << "};\n";
+    }
+    // Build a table of lambdas.
+
+    OS << R"(
+  using LambdaTy =
+      function_ref<DecodeStatus(DecodeStatus, InsnType, MCInst &, uint64_t,
+                                const MCDisassembler *, bool &)>;
+    )";
+    OS << Indent << "const static LambdaTy decodeLambdaTable[] = {\n";
+    for (size_t Index : llvm::seq(Decoders.size()))
+      OS << Indent + 2 << "decodeLambda" << Index << ",\n";
+    OS << Indent << "};\n";
+    OS << Indent << "if (Idx >= " << Decoders.size() << ")\n";
+    OS << Indent + 2 << "llvm_unreachable(\"Invalid index!\");\n";
+    OS << Indent
+       << "return decodeLambdaTable[Idx](S, insn, MI, Address, Decoder, "
+          "DecodeComplete);\n";
+  } else {
+    OS << Indent << "TmpType tmp;\n";
+    OS << Indent << "switch (Idx) {\n";
+    OS << Indent << "default: llvm_unreachable(\"Invalid index!\");\n";
+    for (const auto &[Index, Decoder] : enumerate(Decoders)) {
+      OS << Indent << "case " << Index << ":\n";
+      OS << Decoder;
+      OS << Indent + 2 << "return S;\n";
+    }
+    OS << Indent << "}\n";
   }
-  OS << Indent << "}\n";
   Indent -= 2;
   OS << Indent << "}\n";
 }

@mshockwave (Member) left a comment

Using a table of lambdas instead improves the compile time significantly (~3x speedup in compiling the code in a downstream target)

Do you know why? My guess is that it'll spend lots of time on control flow related optimizations, while the lambda / function approach scales better in terms of compilation time.

Aside from that, I think the question now becomes whether the LLVM you built runs slower: instead of plain switch cases you start to pay the price of function calls, and inlining might not always kick in (even in an -O3 build) given the sheer number of callees to be inlined. Have you done any related measurements regarding this?

@jurahul (Contributor, Author) commented Jun 21, 2025

I did not look into exactly why the compile time is so large with the switch case, but I am guessing there is a single function with very large IR, and using lambdas (which won't get inlined) avoids that. As expected, it probably comes with some performance cost, both due to the indirect function call and due to losing any cross-switch-case optimizations that may happen in the switch version. See my note above about further refinement, which includes setting up a benchmark to measure this. For our use case, a 3x reduction in compile time may be enough to adopt this in some builds (for example, pre-commit CI can use this mode, while post-commit builds can still use the switch case).

…oMCInst

Add option `use-lambda-in-decode-to-mcinst` to use a table of lambdas
instead of a switch case in the generated `decodeToMCInst` function.

When the number of switch cases in this function is large, the generated
code takes a long time to compile in release builds. Using a table of
lambdas instead improves the compile time significantly (~3x speedup
in compiling the code in a downstream target). This option will allow
targets to opt into this mode if they desire for better build times.

Tested with `check-llvm-mc` with the option enabled by default.
@jurahul (Contributor, Author) commented Jun 21, 2025

I did a few things here after converting the PR to emit static functions.

(a) I measured the compile time as reported by clang on our downstream code (I am compiling with clang-18, as that's what I have installed). In the switch version, the three worst offenders for compile time are:

  663.5226 ( 82.0%)   0.0003 (  0.0%)  663.5229 ( 81.8%)  663.5452 ( 81.8%)  Two-Address instruction pass
  149.9866 ( 47.4%)   0.0008 (  1.3%)  149.9874 ( 47.4%)  149.9874 ( 47.4%)  SimplifyCFGPass
  120.5514 ( 38.1%)   0.0103 ( 17.3%)  120.5618 ( 38.1%)  120.6282 ( 38.1%)  SROAPass

===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 1148.0450 seconds (1148.2543 wall clock)

Total compile time is ~19 minutes. Unfortunately, it's not possible for me to file a bug report with the input, as it contains non-public code. It may be worth seeing if a newer version of clang improves things, but in any case we need to support the tools folks might actually use, so we will still need this change for improved build times.

(b) I compiled the file with https://github.com/llvm/llvm-project/releases/tag/llvmorg-20.1.7 binaries and I see the following improvements in the switch version:

 702.9634 ( 84.5%)   0.0001 (  0.0%)  702.9634 ( 84.5%)  703.0168 ( 84.5%)  Two-Address instruction pass
  65.8698 ( 41.9%)   0.0005 (  0.4%)  65.8702 ( 41.9%)  65.9420 ( 41.9%)  SROAPass
  47.4754 ( 30.2%)   0.0006 (  0.5%)  47.4760 ( 30.2%)  47.5398 ( 30.2%)  SimplifyCFGPass 

===-------------------------------------------------------------------------===
                               Clang time report
===-------------------------------------------------------------------------===
  Total Execution Time: 1004.9997 seconds (1005.5482 wall clock)

So compile time has improved from ~19 minutes to ~16.7 minutes, but the issue still persists in the Two-Address instruction pass.
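For reference, per-pass breakdowns like the ones above are what clang prints with the -ftime-report flag. A hypothetical invocation (the source file name is a placeholder for whichever target disassembler file includes the generated tables):

  clang++ -c -O2 -ftime-report MyTargetDisassembler.cpp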

(c) I set up some profiling code in llvm-mc that tries to disassemble each byte pattern a large number of times and profiles the loop using the TimeTraceScope API, and ran it with 3 AMDGPU llvm-mc unit tests (AMDGPU has the largest number of cases, ~1500, in the upstream code). The results are below (first run with function pointers, second run with switch). It seems the switch-case version is actually slower than the function-pointer version. It could be measurement noise, but the signal seems consistent. The changes for profiling are here for reference (including the time.sh script): main...jurahul:llvm-project:llvm_mc_profile. A sketch of this kind of measurement loop is shown after the timings.

$ source time.sh 
                "avg ms": 539
                "avg ms": 604
                "avg ms": 219
$ source time.sh
                "avg ms": 571
                "avg ms": 632
                "avg ms": 231

@jurahul jurahul force-pushed the decoder_to_mcinst_lambda branch from 4e39dae to e3eb094 on June 21, 2025 at 17:28
@jurahul jurahul changed the title [LLVM][DecoderEmitter] Add option to use lambdas in decodeToMCInst [LLVM][DecoderEmitter] Add option to use funcion table in decodeToMCInst Jun 21, 2025
@jurahul jurahul changed the title [LLVM][DecoderEmitter] Add option to use funcion table in decodeToMCInst [LLVM][DecoderEmitter] Add option to use function table in decodeToMCInst Jun 21, 2025
@jayfoad (Contributor) commented Jun 23, 2025

(a) I measured the compile time as reported by clang on our downstream code (I am compiling using clang-18 as that's what I have installed.

When I was working on #117351 I tested various clang versions from 11 onwards and I found that clang-18 and clang-19 were by far the slowest. clang "trunk" (as it was in November 2024) was much better, like over 10x faster in some cases of extremely large switches.

@jurahul (Contributor, Author) commented Jun 23, 2025

(a) I measured the compile time as reported by clang on our downstream code (I am compiling using clang-18 as that's what I have installed.

When I was working on #117351 I tested various clang versions from 11 onwards and I found that clang-18 and clang-19 were by far the slowest. clang "trunk" (as it was in November 2024) was much better, like over 10x faster in some cases of extremely large switches.

I am happy to try that as well. Apart from building it, is there any other way to get the trunk version of clang as a downloadable package?

@mshockwave (Member) commented

( 81.8%) Two-Address instruction pass

This one, IIRC, scales linearly with the number of instructions. So it seems the compilation slowdown was primarily caused by the function simply being too big.

( 47.4%) SimplifyCFGPass

This is kind of expected.

I setup some profiling code in llvm-mc that will try to disassemble each byte pattern some large number of times and profile the loop using the TimeTraceScope API and ran it with 3 AMDGPU llvm-mc unit tests (since it has the largest number of cases ~1500 in the upstream code), and I see the following (first run with function pointers, second run with switch). It seems the switch-case version is actually slower than the function pointer version. It could be measurement noise, but the signal seems consistent.

Thanks for the experiments!

@mshockwave (Member) left a comment

LGTM

@jurahul (Contributor, Author) commented Jun 23, 2025

( 81.8%) Two-Address instruction pass

This one, IIRC, scales linearly with the number of instructions. So it seems the compilation slowdown was primarily caused by the function simply being too big.

There is still some mystery here, since the number of instructions generated is the same in both cases (?). That is, the amount of code is the same, just split across functions (at a very coarse level). This suggests to me that either some code duplication is happening in the switch-case version, leading to a super-linear increase in code size before the two-address pass, or the pass itself is super-linear. But that would need more investigation with the actual test case.
