-
Notifications
You must be signed in to change notification settings - Fork 13.6k
[llvm][CodeGen] Add a new software pipeliner 'Window Scheduler' #84443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This commit implements the Window Scheduler as described in the RFC: https://discourse.llvm.org/t/rfc-window-scheduling-algorithm-for-machinepipeliner-in-llvm/74718 This Window Scheduler implements the window algorithm designed by Steven Muchnick in the book "Advanced Compiler Design And Implementation", with some improvements: 1. Copy 3 times of the loop kernel and construct the corresponding DAG to identify dependencies between MIs; 2. Use heuristic algorithm to obtain a set of window offsets. The window algorithm is equivalent to modulo scheduling algorithm with a stage of 2. It is mainly applied in targets where hardware resource conflicts are severe, and the SMS algorithm often fails in such cases. On our own DSA, this window algorithm typically can achieve a performance improvement of over 10%. Co-authored-by: Kai Yan <[email protected]> Co-authored-by: Ran Xiao <[email protected]>
body: | | ||
bb.0.entry: | ||
successors: %bb.2(0x30000000), %bb.1(0x50000000) | ||
liveins: $r0, $r1, $r2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you run this through -run-pass=none to compact the register numbers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
declare <32 x i32> @llvm.hexagon.V6.vsubw.128B(<32 x i32>, <32 x i32>) | ||
|
||
attributes #0 = { "target-features"="+hvx-length128b,+hvxv69,+v66,-long-calls" } | ||
... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the IR references in the MMOs relevant to the scheduling test? If not, can you drop the IR section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Context.MF = MF; | ||
Context.MLI = MLI; | ||
Context.MDT = MDT; | ||
Context.PassConfig = &getAnalysis<TargetPassConfig>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, two ScheduleDAGs are used here: one for analyzing the dependencies between all instructions after copying, and the other for scheduling instructions in the window. The difference is that the former does not need to consider register pressure.
@@ -199,6 +199,9 @@ class TargetSubtargetInfo : public MCSubtargetInfo { | |||
/// True if the subtarget should run MachinePipeliner | |||
virtual bool enableMachinePipeliner() const { return true; }; | |||
|
|||
/// True if the subtarget should run WindowScheduler. | |||
virtual bool enableWindowScheduler() const { return true; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On our own DSA, this window algorithm typically can achieve a performance
improvement of over 10%.
Could you please share some performance data (e.g., SPEC benchmarks) on other non-VLIW architectures? IIRC AArch64 and PowerPC also support MachinePipeliner
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverse ping :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Over the past few days, we have specifically tested the performance of the Software Pipeliner on aarch64. The test hardware and software environment is as follows: Apple M1 Pro 32GB, Docker 25.0.3, Ubuntu 22.04.4 LTS, GCC 11.4.0, and SPEC2006 1.2. We ran 11 integer benchmarks in ref mode, each 5 times. The final result shows that the base performance is 51.2, and the performance with software pipelining enabled is 51.0, which are almost the same.
Although this result is strongly related to the local test conditions, we believe the relative results are credible. This is because there are very few loops in SPEC that meet the criteria for applying software pipelining, and even fewer loops with long computation times (which aligns with the original design intention of SPEC, "Computer Architecture: A Quantitative Approach" 1.11). Therefore, we still believe that the software pipelining algorithm should play a major role in DSP or DSA.
Hello everyone, do you have some new review coimments? Thank you for your time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to do a deeper look
llvm/lib/CodeGen/WindowScheduler.cpp
Outdated
|
||
WindowScheduler::WindowScheduler(MachineSchedContext *C, MachineLoop &ML) | ||
: Context(C), MF(C->MF), MBB(ML.getHeader()), Loop(ML) { | ||
Subtarget = &(MF->getSubtarget()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Subtarget = &(MF->getSubtarget()); | |
Subtarget = &MF->getSubtarget(); |
Can do this in initializer list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
for (auto Def : PhiDefs) | ||
if (MI.readsRegister(Def, TRI)) { | ||
LLVM_DEBUG( | ||
dbgs() | ||
<< "Consecutive phis are not allowed in window scheduling!\n"); | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wouldn't be valid IR, so why is this validating this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the current scheduler operates on post-SSA MIR, so the scheduler never encounters phis in the first place. Is this running somewhere different / earlier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this algorithm is used in the MachinePipeliner pass, where PHIs are present.
llvm/lib/CodeGen/WindowScheduler.cpp
Outdated
// Step 1: Performing the first copy of MBB instructions, excluding | ||
// terminators. At the same time, we back up the anti-register of phis. | ||
// DefPairs hold the old and new define register pairs. | ||
std::map<Register, Register> DefPairs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid std::map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
llvm/lib/CodeGen/WindowScheduler.cpp
Outdated
// %1 = phi i32 [%2, %BB.1], [%7, %BB.3] | ||
// The new phi is: | ||
// %1 = phi i32 [%2, %BB.1], [%11, %BB.3] | ||
for (auto &Phi : MBB->phis()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Braces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated,thank you!
if (Phi.readsRegister(DefRegPair.first, TRI)) | ||
Phi.substituteRegister(DefRegPair.first, DefRegPair.second, 0, *TRI); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you just go direct to substituteRegister?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I have understood your comment correctly. Let me elaborate on this part of the code.
In this section of the algorithm, DefPairs also includes the substitute registers for the phi-defined registers. Therefore it must be constrained to replacing the registers that are read by phi.
assert(SearchRatio <= 100 && "SearchRatio should be equal or less than 100!"); | ||
unsigned MaxIdx = SchedInstrNum * SearchRatio / 100; | ||
unsigned Step = SearchNum > 0 && SearchNum <= MaxIdx ? MaxIdx / SearchNum : 1; | ||
SmallVector<unsigned> SearchIndexes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you just produce this computed list when the vector would be consumed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is indeed a very good question!
Our main consideration is to facilitate modifications in downstream targets. These indexes are crucial for the performance of the window algorithm, and we have implemented target-specific search algorithm on our own DSA. We also recommend that other target developers, if possible, consider adopting more complex search algorithms.
So, we would still prefer to have these logics encapsulated in a separate function:)
llvm/lib/CodeGen/WindowScheduler.cpp
Outdated
LLVM_DEBUG(dbgs() << "\tCycle " << CurCycle << " [S." | ||
<< getOriStage(getOriMI(&MI), Offset) << "]: "); | ||
LLVM_DEBUG(MI.dump()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can just directly << MI in the first debug statement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
llvm/lib/CodeGen/WindowScheduler.cpp
Outdated
LLVM_DEBUG(dbgs() << "\tCycle range [0, " << LateCycle << "] "); | ||
LLVM_DEBUG(Phi.dump()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, just << Phi above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
llvm/lib/CodeGen/WindowScheduler.cpp
Outdated
Stages[MI] = std::get<2>(Info); | ||
LLVM_DEBUG(dbgs() << "\tCycle " << Cycles[MI] << " [S." << Stages[MI] | ||
<< "]: "); | ||
LLVM_DEBUG(MI->dump()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
@@ -0,0 +1,124 @@ | |||
# REQUIRES: asserts | |||
# RUN: llc --march=hexagon %s -run-pass=pipeliner -O2 -debug-only=pipeliner \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-O2 won't do anything here, I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, thank you!
Updated
Thank you for your time. Are there any new review comments or feedback? @arsenm @dtcxzyw @davemgreen @bcahoon @ytmukai @jayfoad @chenzheng1030 |
llvm/lib/CodeGen/WindowScheduler.cpp
Outdated
SmallVector<Register, 8> PhiDefs; | ||
auto PLI = TII->analyzeLoopForPipelining(MBB); | ||
for (auto &MI : *MBB) { | ||
if (MI.isDebugInstr() || MI.isTerminator()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing this should be upgraded to isMetaInst
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
for (auto Def : PhiDefs) | ||
if (MI.readsRegister(Def, TRI)) { | ||
LLVM_DEBUG( | ||
dbgs() | ||
<< "Consecutive phis are not allowed in window scheduling!\n"); | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the current scheduler operates on post-SSA MIR, so the scheduler never encounters phis in the first place. Is this running somewhere different / earlier?
llvm/lib/CodeGen/WindowScheduler.cpp
Outdated
"window scheduling!\n"); | ||
return false; | ||
} | ||
for (auto &Def : MI.defs()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be worried about all_defs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
return false; | ||
} | ||
for (auto &Def : MI.defs()) | ||
if (Def.isReg() && Def.getReg().isPhysical()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this ignore dead defs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we believe that dead defs do not need a special handling process.
(MI->isTerminator() && Cnt < DuplicateNum - 1)) | ||
continue; | ||
auto *NewMI = MF->CloneMachineInstr(MI); | ||
DenseMap<Register, Register> NewDefs; | ||
// New defines are updated. | ||
for (auto MO : NewMI->defs()) | ||
for (auto MO : NewMI->all_defs()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these type of changes should have tests to go with them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, we have added test cases for dead def, implicit-def, and meta instruction.
Does anyone have any more comments? We really hope that this algorithm can be merged into the mainline so that other DSA developers can use it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know much about scheduling, but I think this has waited long enough for more comments
Could someone please help merge our PR? Thank you very much! |
@dtcxzyw hi,yingwei,looks like our multiple commits were not squashed together~ |
Do you mean the commit message? |
|
Emm, you need to update the PR description before requesting a merge :) |
Sorry for the trouble. How do you suggest we handle it now? |
Just keep it as is. Please remember to do it next time :) |
But squash-and-merge is the only enabled mode for the repository. How did you submit it? |
I see it squashed in the repo, so what is the issue? |
Sorry, my mistake. I was looking at the wrong branch. There's no issue. Thank you, everyone! |
I am seeing a crash when building the Linux kernel for Hexagon after this change. A C and LLVM IR reproducer from struct khazad_ctx {
long long E[1];
long D[];
};
long T7[] = {};
void *khazad_setkey_in_key___trans_tmp_1;
int khazad_setkey_in_key_r;
void khazad_setkey_in_key() {
struct khazad_ctx *ctx = khazad_setkey_in_key___trans_tmp_1;
unsigned *key = (unsigned *)khazad_setkey_in_key;
long long K1 = key[3];
for (; khazad_setkey_in_key_r; khazad_setkey_in_key_r++) {
K1 = ctx->E[khazad_setkey_in_key_r];
ctx->D[khazad_setkey_in_key_r] = T7[K1 >> 32] ^ T7[K1 >> 4 & 5];
}
} target datalayout = "e-m:e-p:32:32:32-a:0-n16:32-i64:64:64-i32:32:32-i16:16:16-i1:8:8-f32:32:32-f64:64:64-v32:32:32-v64:64:64-v512:512:512-v1024:1024:1024-v2048:2048:2048"
target triple = "hexagon-unknown-linux"
define void @khazad_setkey_in_key(ptr %0, ptr %1) {
br label %3
3: ; preds = %3, %2
%4 = phi i32 [ %16, %3 ], [ 0, %2 ]
%5 = load i64, ptr %0, align 8
%6 = trunc i64 %5 to i32
store i32 %6, ptr %1, align 4
%7 = lshr i64 %5, 32
%8 = trunc i64 %7 to i32
%9 = getelementptr [0 x i32], ptr null, i32 0, i32 %8
%10 = load i32, ptr %9, align 4
%11 = lshr i32 %6, 1
%12 = and i32 %11, 1
%13 = getelementptr [0 x i32], ptr null, i32 0, i32 %12
%14 = load i32, ptr %13, align 4
%15 = xor i32 %14, %10
store i32 %15, ptr %0, align 4
%16 = add i32 %4, 1
%17 = icmp eq i32 %4, 0
br i1 %17, label %18, label %3
18: ; preds = %3
ret void
} @ a144bf2
@ b6bf402
|
Okay, we will handle it right away. |
Here is the patch with the corresponding solution. Please try it out to see if it resolves the issue. Sorry for the inconvenience.@nathanchance |
No worries, thanks for the forward fix! I found another crash that is not resolved with it. void *poll_for_response_data;
long poll_for_response_response_0;
int poll_for_response_crc_err_retries;
void poll_for_response(char size) {
int i;
char *data_byte = poll_for_response_data;
retry:
if (poll_for_response_crc_err_retries)
goto fail;
{
unsigned val = poll_for_response_response_0;
i = 0;
for (; i < size; i++)
data_byte[-i - 1] = val >>= 8;
}
goto retry;
fail:;
}
target datalayout = "e-m:e-p:32:32:32-a:0-n16:32-i64:64:64-i32:32:32-i16:16:16-i1:8:8-f32:32:32-f64:64:64-v32:32:32-v64:64:64-v512:512:512-v1024:1024:1024-v2048:2048:2048"
target triple = "hexagon-unknown-linux"
define void @poll_for_response(i32 %0, ptr %1) {
br label %4
3: ; preds = %4
ret void
4: ; preds = %4, %2
%5 = phi i32 [ 0, %4 ], [ %0, %2 ]
%6 = phi i32 [ 1, %4 ], [ 0, %2 ]
%7 = phi i32 [ %11, %4 ], [ 0, %2 ]
%8 = lshr i32 %5, 1
%9 = trunc i32 %8 to i8
store i8 %9, ptr %1, align 1
%10 = getelementptr i8, ptr %1, i32 %6
store i8 0, ptr %10, align 1
%11 = add i32 %7, 1
%12 = icmp eq i32 %7, 1
br i1 %12, label %3, label %4
}
|
We have addressed this issue in this patch #95900. Thank you~ |
|
||
Changed = swingModuloScheduler(L); | ||
if (useWindowScheduler(Changed)) | ||
Changed = runWindowScheduler(L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we run one scheduler after another? This came up in a discussion today at the vectorizer meeting. cc: @ayalz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for your review comment. Let me explain the current design considerations:
- We understand that both SMS (Swing Modulo Scheduling) and WS (Window Scheduling) belong to the category of software pipelining algorithms, and the conditions for determining their feasibility are the same. To avoid redundant checks, we have placed both in the MachinePipeliner.
- The basic principle of both SMS and WS scheduling is to fold the loop multiple times to obtain the kernel. The advantage of SMS is that it can fold more times, i.e., the stage can be greater than 2. On the other hand, the advantage of WS is that it is less affected by resource conflicts and can always get a scheduling result. Therefore, performing WS after SMS fails can be seen as an enhanced algorithm for targets with many resource conflicts.
I hope my explanation addresses your concerns. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, as an example, our VLIW target experiences more hardware conflicts due to accurate modeling. This results in a higher failure rate for SMS. Therefore, we have opted to directly use the WS algorithm.
…#84443) This commit implements the Window Scheduler as described in the RFC: https://discourse.llvm.org/t/rfc-window-scheduling-algorithm-for-machinepipeliner-in-llvm/74718 This Window Scheduler implements the window algorithm designed by Steven Muchnick in the book "Advanced Compiler Design And Implementation", with some improvements: 1. Copy 3 times of the loop kernel and construct the corresponding DAG to identify dependencies between MIs; 2. Use heuristic algorithm to obtain a set of window offsets. The window algorithm is equivalent to modulo scheduling algorithm with a stage of 2. It is mainly applied in targets where hardware resource conflicts are severe, and the SMS algorithm often fails in such cases. On our own DSA, this window algorithm typically can achieve a performance improvement of over 10%. Co-authored-by: Kai Yan <[email protected]> Co-authored-by: Ran Xiao <[email protected]> --------- Co-authored-by: Kai Yan <[email protected]> Co-authored-by: Ran Xiao <[email protected]>
This commit implements the Window Scheduler as described in the RFC:
https://discourse.llvm.org/t/rfc-window-scheduling-algorithm-for-machinepipeliner-in-llvm/74718
This Window Scheduler implements the window algorithm designed by
Steven Muchnick in the book "Advanced Compiler Design And Implementation",
with some improvements:
to identify dependencies between MIs;
The window algorithm is equivalent to modulo scheduling algorithm with a
stage of 2. It is mainly applied in targets where hardware resource
conflicts are severe, and the SMS algorithm often fails in such cases.
On our own DSA, this window algorithm typically can achieve a performance
improvement of over 10%.
Co-authored-by: Kai Yan [email protected]
Co-authored-by: Ran Xiao [email protected]