[LLVM][DecoderEmitter] Add option to use function table in decodeToMCInst #144814
Conversation
Force-pushed from 98602f8 to 4e39dae.
Note: It's possible to refine this further, e.g., using lambdas only when the number of switch cases exceeds some threshold, and also adding a benchmark (using AMDGPU as a target, which has ~1700 cases here) to measure the perf delta between the switch-case and lambda versions (I am assuming the lambda version is slower), but I am thinking that can come in the future, based on needs, after this basic feature is adopted and we have some more data. Also noting that this code is (AFAIK) not part of the "regular" compilation flow (i.e., not built into llc).
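For concreteness, a hypothetical sketch of what such a threshold could look like; the PR does not implement this, and the option name and default value are invented for illustration:

```cpp
// Hypothetical sketch of the threshold refinement described above. The PR
// does not implement this; the option name and default value are invented.
#include "llvm/Support/CommandLine.h"
#include <cstdio>

static llvm::cl::opt<unsigned> LambdaCaseThreshold(
    "lambda-case-threshold",
    llvm::cl::desc("Emit a lambda table in decodeToMCInst only when the "
                   "number of cases exceeds this threshold."),
    llvm::cl::init(1024));

int main(int argc, char **argv) {
  llvm::cl::ParseCommandLineOptions(argc, argv);
  // AMDGPU has ~1700 cases per the note above, so it would exceed a
  // 1024-case threshold and opt into the lambda table.
  unsigned NumCases = 1700;
  bool UseLambdaTable = NumCases > LambdaCaseThreshold;
  std::printf("use lambda table: %s\n", UseLambdaTable ? "yes" : "no");
  return 0;
}
```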
@llvm/pr-subscribers-tablegen

Author: Rahul Joshi (jurahul)

Changes

Add option `use-lambda-in-decode-to-mcinst` to use a table of lambdas instead of a switch case in the generated `decodeToMCInst` function.

When the number of switch cases in this function is large, the generated code takes a long time to compile in release builds. Using a table of lambdas instead improves the compile time significantly (~3x speedup in compiling the code in a downstream target). This option will allow targets to opt into this mode if they desire better build times.

Tested with `check-llvm-mc` with the option enabled by default.

Full diff: https://github.com/llvm/llvm-project/pull/144814.diff

2 Files Affected:
diff --git a/llvm/test/TableGen/DecoderEmitterLambda.td b/llvm/test/TableGen/DecoderEmitterLambda.td
new file mode 100644
index 0000000000000..4926c8d7def66
--- /dev/null
+++ b/llvm/test/TableGen/DecoderEmitterLambda.td
@@ -0,0 +1,84 @@
+// RUN: llvm-tblgen -gen-disassembler -use-lambda-in-decode-to-mcinst -I %p/../../include %s | FileCheck %s
+
+include "llvm/Target/Target.td"
+
+def archInstrInfo : InstrInfo { }
+
+def arch : Target {
+ let InstructionSet = archInstrInfo;
+}
+
+let Namespace = "arch" in {
+ def R0 : Register<"r0">;
+ def R1 : Register<"r1">;
+ def R2 : Register<"r2">;
+ def R3 : Register<"r3">;
+}
+def Regs : RegisterClass<"Regs", [i32], 32, (add R0, R1, R2, R3)>;
+
+class TestInstruction : Instruction {
+ let Size = 1;
+ let OutOperandList = (outs);
+ field bits<8> Inst;
+ field bits<8> SoftFail = 0;
+}
+
+// Define instructions to generate 4 cases in decodeToMCInst.
+// Lower 2 bits define the number of operands. Each register operand
+// needs 2 bits to encode.
+
+// An instruction with no inputs. Encoded with lower 2 bits = 0 and upper
+// 6 bits = 0 as well.
+def Inst0 : TestInstruction {
+ let Inst = 0x0;
+ let InOperandList = (ins);
+ let AsmString = "Inst0";
+}
+
+// An instruction with a single input. Encoded with lower 2 bits = 1 and the
+// single input in bits 2-3.
+def Inst1 : TestInstruction {
+ bits<2> r0;
+ let Inst{1-0} = 1;
+ let Inst{3-2} = r0;
+ let InOperandList = (ins Regs:$r0);
+ let AsmString = "Inst1";
+}
+
+// An instruction with two inputs. Encoded with lower 2 bits = 2 and the
+// inputs in bits 2-3 and 4-5.
+def Inst2 : TestInstruction {
+ bits<2> r0;
+ bits<2> r1;
+ let Inst{1-0} = 2;
+ let Inst{3-2} = r0;
+ let Inst{5-4} = r1;
+ let InOperandList = (ins Regs:$r0, Regs:$r1);
+ let AsmString = "Inst2";
+}
+
+// An instruction with three inputs. Encoded with lower 2 bits = 3 and the
+// inputs in bits 2-3 and 4-5 and 6-7.
+def Inst3 : TestInstruction {
+ bits<2> r0;
+ bits<2> r1;
+ bits<2> r2;
+ let Inst{1-0} = 3;
+ let Inst{3-2} = r0;
+ let Inst{5-4} = r1;
+ let Inst{7-6} = r2;
+ let InOperandList = (ins Regs:$r0, Regs:$r1, Regs:$r2);
+ let AsmString = "Inst3";
+}
+
+// CHECK-LABEL: decodeToMCInst
+// CHECK: decodeLambda0 =
+// CHECK: decodeLambda1 =
+// CHECK: decodeLambda2 =
+// CHECK: decodeLambda3 =
+// CHECK: decodeLambdaTable[]
+// CHECK-NEXT: decodeLambda0
+// CHECK-NEXT: decodeLambda1
+// CHECK-NEXT: decodeLambda2
+// CHECK-NEXT: decodeLambda3
+// CHECK: return decodeLambdaTable[Idx]
diff --git a/llvm/utils/TableGen/DecoderEmitter.cpp b/llvm/utils/TableGen/DecoderEmitter.cpp
index 37814113b467a..4d5225e21680b 100644
--- a/llvm/utils/TableGen/DecoderEmitter.cpp
+++ b/llvm/utils/TableGen/DecoderEmitter.cpp
@@ -83,6 +83,13 @@ static cl::opt<bool> LargeTable(
"in the table instead of the default 16 bits."),
cl::init(false), cl::cat(DisassemblerEmitterCat));
+static cl::opt<bool> UseLambdaInDecodetoMCInst(
+ "use-lambda-in-decode-to-mcinst",
+ cl::desc("Use a table of lambdas instead of a switch case in the\n"
+ "generated `decodeToMCInst` function. Helps improve compile time\n"
+ "of the generated code."),
+ cl::init(false), cl::cat(DisassemblerEmitterCat));
+
STATISTIC(NumEncodings, "Number of encodings considered");
STATISTIC(NumEncodingsLackingDisasm,
"Number of encodings without disassembler info");
@@ -1082,15 +1089,47 @@ void DecoderEmitter::emitDecoderFunction(formatted_raw_ostream &OS,
<< "using TmpType = "
"std::conditional_t<std::is_integral<InsnType>::"
"value, InsnType, uint64_t>;\n";
- OS << Indent << "TmpType tmp;\n";
- OS << Indent << "switch (Idx) {\n";
- OS << Indent << "default: llvm_unreachable(\"Invalid index!\");\n";
- for (const auto &[Index, Decoder] : enumerate(Decoders)) {
- OS << Indent << "case " << Index << ":\n";
- OS << Decoder;
- OS << Indent + 2 << "return S;\n";
+
+ if (UseLambdaInDecodetoMCInst) {
+ // Emit one lambda for each case first.
+ for (const auto &[Index, Decoder] : enumerate(Decoders)) {
+ OS << Indent << "auto decodeLambda" << Index << " = [](DecodeStatus S,\n"
+ << Indent << " InsnType insn, MCInst &MI,\n"
+ << Indent << " uint64_t Address, \n"
+ << Indent << " const MCDisassembler *Decoder,\n"
+ << Indent << " bool &DecodeComplete) {\n";
+ OS << Indent + 2 << "[[maybe_unused]] TmpType tmp;\n";
+ OS << Decoder;
+ OS << Indent + 2 << "return S;\n";
+ OS << Indent << "};\n";
+ }
+ // Build a table of lambdas.
+
+ OS << R"(
+ using LambdaTy =
+ function_ref<DecodeStatus(DecodeStatus, InsnType, MCInst &, uint64_t,
+ const MCDisassembler *, bool &)>;
+ )";
+ OS << Indent << "const static LambdaTy decodeLambdaTable[] = {\n";
+ for (size_t Index : llvm::seq(Decoders.size()))
+ OS << Indent + 2 << "decodeLambda" << Index << ",\n";
+ OS << Indent << "};\n";
+ OS << Indent << "if (Idx >= " << Decoders.size() << ")\n";
+ OS << Indent + 2 << "llvm_unreachable(\"Invalid index!\");\n";
+ OS << Indent
+ << "return decodeLambdaTable[Idx](S, insn, MI, Address, Decoder, "
+ "DecodeComplete);\n";
+ } else {
+ OS << Indent << "TmpType tmp;\n";
+ OS << Indent << "switch (Idx) {\n";
+ OS << Indent << "default: llvm_unreachable(\"Invalid index!\");\n";
+ for (const auto &[Index, Decoder] : enumerate(Decoders)) {
+ OS << Indent << "case " << Index << ":\n";
+ OS << Decoder;
+ OS << Indent + 2 << "return S;\n";
+ }
+ OS << Indent << "}\n";
}
- OS << Indent << "}\n";
Indent -= 2;
OS << Indent << "}\n";
}
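For orientation, here is a minimal standalone sketch (not the actual generated code; names and signatures are simplified) of the two dispatch shapes that `decodeToMCInst` switches between under this option:

```cpp
// Simplified, standalone illustration of the two dispatch shapes. The real
// generated code decodes into an MCInst; here each "decoder" just returns
// a value so the example compiles on its own.
#include <cstdint>
#include <cstdio>

// Shape 1 (default): one big switch. With thousands of cases, this is a
// single huge function that the optimizer processes as one unit.
static int decodeSwitch(unsigned Idx, uint64_t Insn) {
  switch (Idx) {
  case 0: return 0;
  case 1: return static_cast<int>(Insn & 0x3);
  default: return -1; // llvm_unreachable in the generated code
  }
}

// Shape 2 (with the new option): one small callable per case plus a
// dispatch table, so each case is a separate unit for the optimizer,
// at the cost of an indirect call.
static int decodeCase0(uint64_t) { return 0; }
static int decodeCase1(uint64_t Insn) { return static_cast<int>(Insn & 0x3); }
using DecodeFn = int (*)(uint64_t);
static const DecodeFn DecodeTable[] = {decodeCase0, decodeCase1};

static int decodeTable(unsigned Idx, uint64_t Insn) {
  return DecodeTable[Idx](Insn);
}

int main() {
  std::printf("%d %d\n", decodeSwitch(1, 0x7), decodeTable(1, 0x7));
  return 0;
}
```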
Using a table of lambdas instead improves the compile time significantly (~3x speedup in compiling the code in a downstream target)
Do you know why? My guess is that it spends lots of time on control-flow-related optimizations, while the lambda / function approach scales better in terms of compilation time.
Aside from that, I think the question now becomes whether the LLVM you built runs slower: instead of plain switch cases you start to pay the price of function calls, and inlining might not always kick in (even in a -O3 build) given the sheer number of callees to be inlined. Have you done any related measurements?
I did not look into why exactly the compile time is large with the switch case, but I am guessing it's just that there is a single function with a large IR, and making lambdas (which won't get inlined) avoids that. As expected, it probably comes with some perf cost, both due to the indirect function call and due to any cross-switch-case optimizations that may be happening in the switch version. See my note above about further refinement, which includes setting up a benchmark to measure this. For our use case, a 3x reduction in compile time may be enough to adopt this in some builds (for example, pre-commit CI can use this, and post-commit builds can still use the switch case).
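As a toy illustration of the kind of measurement such a benchmark could make, here is a self-contained sketch comparing the two dispatch styles (illustrative only; the workload is trivial compared to real decoding, so real numbers would differ):

```cpp
// Toy comparison of switch dispatch vs an indirect call through a function
// table. Illustrative only; real decode work would dwarf this workload.
#include <chrono>
#include <cstdint>
#include <cstdio>

static uint64_t caseA(uint64_t X) { return X + 1; }
static uint64_t caseB(uint64_t X) { return X * 3; }
using Fn = uint64_t (*)(uint64_t);
static const Fn Table[] = {caseA, caseB};

static uint64_t viaSwitch(unsigned Idx, uint64_t X) {
  switch (Idx) {
  case 0: return caseA(X);
  case 1: return caseB(X);
  default: return 0;
  }
}

static uint64_t viaTable(unsigned Idx, uint64_t X) { return Table[Idx](X); }

template <typename F> static void bench(const char *Name, F Dispatch) {
  auto Start = std::chrono::steady_clock::now();
  uint64_t Acc = 0;
  for (uint64_t I = 0; I < 100000000; ++I)
    Acc += Dispatch(static_cast<unsigned>(I & 1), I);
  auto Ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::steady_clock::now() - Start)
                .count();
  std::printf("%s: %lld ms (acc=%llu)\n", Name, static_cast<long long>(Ms),
              static_cast<unsigned long long>(Acc));
}

int main() {
  bench("switch", viaSwitch);
  bench("table", viaTable);
  return 0;
}
```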
…oMCInst Add option `use-lambda-in-decode-to-mcinst` to use a table of lambdas instead of a switch case in the generated `decodeToMCInst` function. When the number of switch cases in this function is large, the generated code takes a long time to compile in release builds. Using a table of lambdas instead improves the compile time significantly (~3x speedup in compiling the code in a downstream target). This option will allow targets to opt into this mode if they desire better build times. Tested with `check-llvm-mc` with the option enabled by default.
I did a couple of things here after converting the PR to create static functions. (a) I measured the compile time as reported by clang on our downstream code (I am compiling with clang-18, as that's what I have installed). In the switch version, the 3 worst offenders for compile time are:
Total compile time is ~19 minutes. Unfortunately, it's not possible for me to file a bug report with the input, as it contains non-public stuff. It may be worth seeing if a newer version of clang improves things, but in any case, we need to support tools that folks might use, so we will still need this fix for improved build times. (b) I compiled the file with https://github.com/llvm/llvm-project/releases/tag/llvmorg-20.1.7 binaries and I see the following improvements in the switch version:
So compile time has improved from 19 mins to 16.7 mins, but the issue still persists in the Two-Address Instruction pass. (c) I set up some profiling code in llvm-mc that will try to disassemble each byte pattern some large number of times and profile the loop using the
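The actual profiling code added to llvm-mc for this experiment is not shown in the PR; below is a rough standalone sketch of the same idea, assuming the llvm-c disassembler API (the triple, byte pattern, and iteration count are placeholders):

```cpp
// Standalone sketch of a disassembly profiling loop using the llvm-c
// disassembler API; the actual llvm-mc profiling code from the experiment
// is not shown in the PR. Triple, bytes, and iteration count are placeholders.
#include "llvm-c/Disassembler.h"
#include "llvm-c/Target.h"
#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
  LLVMInitializeAllTargetInfos();
  LLVMInitializeAllTargetMCs();
  LLVMInitializeAllDisassemblers();

  LLVMDisasmContextRef DC = LLVMCreateDisasm("x86_64-unknown-linux-gnu",
                                             nullptr, 0, nullptr, nullptr);
  if (!DC) {
    std::fprintf(stderr, "failed to create disassembler\n");
    return 1;
  }

  uint8_t Bytes[] = {0x48, 0x89, 0xe5}; // placeholder byte pattern
  char Text[128];

  auto Start = std::chrono::steady_clock::now();
  for (int I = 0; I < 1000000; ++I)
    LLVMDisasmInstruction(DC, Bytes, sizeof(Bytes), /*PC=*/0, Text,
                          sizeof(Text));
  auto Ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::steady_clock::now() - Start)
                .count();
  std::printf("disassembled 1M times in %lld ms\n", static_cast<long long>(Ms));

  LLVMDisasmDispose(DC);
  return 0;
}
```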
Force-pushed from 4e39dae to e3eb094.
When I was working on #117351 I tested various clang versions from 11 onwards, and I found that clang-18 and clang-19 were by far the slowest. clang "trunk" (as it was in November 2024) was much better, over 10x faster in some cases of extremely large switches.
I am happy to try that as well. Apart from building it myself, is there any other way to get the trunk version of clang as a downloadable package?
This one, IIRC, scales linearly with the number of instructions. So it seems like the compile-time slowdown was primarily caused by the fact that the function is just too big.
This is kind of expected.
Thanks for the experiments!
LGTM
There is still some mystery here, since the number of instructions generated is the same in both cases (?). In the sense that the amount of code is the same, just split across functions (at a very coarse level). This to me indicates that either there is some code duplication happening in the switch-case version, leading to a super-linear increase in code size before the two-address pass, or the pass itself may be super-linear. But that would need more investigation with the actual test case.
Add option `use-fn-table-in-decode-to-mcinst` to use a table of function pointers instead of a switch case in the generated `decodeToMCInst` function.

When the number of switch cases in this function is large, the generated code takes a long time to compile in release builds. Using a table of function pointers instead improves the compile time significantly (~3x speedup in compiling the code in a downstream target). This option will allow targets to opt into this mode if they desire better build times.

Tested with `check-llvm-mc` with the option enabled by default.