Commit 7b409b7

Merge branch 'main' into lld-relr-fix
2 parents c485fc5 + d0fee98 commit 7b409b7

2,472 files changed: +144842 / -59152 lines


.github/CODEOWNERS

Lines changed: 12 additions & 4 deletions
```diff
@@ -64,8 +64,8 @@ clang/test/AST/Interp/ @tbaederr
 /mlir/Dialect/*/Transforms/Bufferize.cpp @matthias-springer
 
 # Linalg Dialect in MLIR.
-/mlir/include/mlir/Dialect/Linalg/* @dcaballe @nicolasvasilache @rengolin
-/mlir/lib/Dialect/Linalg/* @dcaballe @nicolasvasilache @rengolin
+/mlir/include/mlir/Dialect/Linalg @dcaballe @nicolasvasilache @rengolin
+/mlir/lib/Dialect/Linalg @dcaballe @nicolasvasilache @rengolin
 /mlir/lib/Dialect/Linalg/Transforms/DecomposeLinalgOps.cpp @MaheshRavishankar @nicolasvasilache
 /mlir/lib/Dialect/Linalg/Transforms/DropUnitDims.cpp @MaheshRavishankar @nicolasvasilache
 /mlir/lib/Dialect/Linalg/Transforms/ElementwiseOpFusion.cpp @MaheshRavishankar @nicolasvasilache
@@ -85,8 +85,8 @@ clang/test/AST/Interp/ @tbaederr
 /mlir/**/*VectorToSCF* @banach-space @dcaballe @matthias-springer @nicolasvasilache
 /mlir/**/*VectorToLLVM* @banach-space @dcaballe @nicolasvasilache
 /mlir/**/*X86Vector* @aartbik @dcaballe @nicolasvasilache
-/mlir/include/mlir/Dialect/Vector/* @dcaballe @nicolasvasilache
-/mlir/lib/Dialect/Vector/* @dcaballe @nicolasvasilache
+/mlir/include/mlir/Dialect/Vector @dcaballe @nicolasvasilache
+/mlir/lib/Dialect/Vector @dcaballe @nicolasvasilache
 /mlir/lib/Dialect/Vector/Transforms/* @hanhanW @nicolasvasilache
 /mlir/lib/Dialect/Vector/Transforms/VectorEmulateNarrowType.cpp @MaheshRavishankar @nicolasvasilache
 /mlir/**/*EmulateNarrowType* @dcaballe @hanhanW
@@ -120,6 +120,9 @@ clang/test/AST/Interp/ @tbaederr
 /mlir/**/LLVMIR/**/BasicPtxBuilderInterface* @grypp
 /mlir/**/NVVM* @grypp
 
+# MLIR Index Dialect
+/mlir/**/Index* @mogball
+
 # MLIR Python Bindings
 /mlir/test/python/ @ftynse @makslevental @stellaraccident
 /mlir/python/ @ftynse @makslevental @stellaraccident
@@ -141,3 +144,8 @@ clang/test/AST/Interp/ @tbaederr
 
 # ExtractAPI
 /clang/**/ExtractAPI @daniel-grumberg
+
+# DWARFLinker, dwarfutil, dsymutil
+/llvm/**/DWARFLinker/ @JDevlieghere
+/llvm/**/dsymutil/ @JDevlieghere
+/llvm/**/llvm-dwarfutil/ @JDevlieghere
```

.github/workflows/issue-write.yml

Lines changed: 6 additions & 1 deletion
```diff
@@ -5,6 +5,7 @@ on:
   workflows:
     - "Check code formatting"
     - "Check for private emails used in PRs"
+    - "PR Request Release Note"
   types:
     - completed
 
@@ -92,7 +93,11 @@ jobs:
 
             var pr_number = 0;
             gql_result.repository.ref.associatedPullRequests.nodes.forEach((pr) => {
-              if (pr.baseRepository.owner.login = context.repo.owner && pr.state == 'OPEN') {
+
+              // The largest PR number is the one we care about. The only way
+              // to have more than one associated pull requests is if all the
+              // old pull requests are in the closed state.
+              if (pr.baseRepository.owner.login = context.repo.owner && pr.number > pr_number) {
                 pr_number = pr.number;
               }
             });
```

.github/workflows/pr-request-release-note.yml

Lines changed: 7 additions & 1 deletion
```diff
@@ -2,7 +2,6 @@ name: PR Request Release Note
 
 permissions:
   contents: read
-  pull-requests: write
 
 on:
   pull_request:
@@ -41,3 +40,10 @@ jobs:
         --token "$GITHUB_TOKEN" \
         request-release-note \
         --pr-number ${{ github.event.pull_request.number}}
+
+    - uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8 #v4.3.0
+      if: always()
+      with:
+        name: workflow-args
+        path: |
+          comments
```

bolt/docs/CommandLineArgumentReference.md

Lines changed: 75 additions & 9 deletions
```diff
@@ -56,6 +56,14 @@
 
   Allow processing of stripped binaries
 
+- `--alt-inst-feature-size=<uint>`
+
+  Size of feature field in .altinstructions
+
+- `--alt-inst-has-padlen`
+
+  Specify that .altinstructions has padlen field
+
 - `--asm-dump[=<dump folder>]`
 
   Dump function into assembly
@@ -78,6 +86,16 @@
   in the input is decoded and re-encoded. If the resulting bytes do not match
   the input, a warning message is printed.
 
+- `--comp-dir-override=<string>`
+
+  Overrides DW_AT_comp_dir, and provides an alternative base location, which is
+  used with DW_AT_dwo_name to construct a path to *.dwo files.
+
+- `--create-debug-names-section`
+
+  Creates .debug_names section, if the input binary doesn't have it already, for
+  DWARF5 CU/TUs.
+
 - `--cu-processing-batch-size=<uint>`
 
   Specifies the size of batches for processing CUs. Higher number has better
@@ -93,7 +111,7 @@
 
 - `--debug-skeleton-cu`
 
-  Prints out offsetrs for abbrev and debu_info of Skeleton CUs that get patched.
+  Prints out offsets for abbrev and debug_info of Skeleton CUs that get patched.
 
 - `--deterministic-debuginfo`
 
@@ -104,6 +122,10 @@
 
   Add basic block instructions as tool tips on nodes
 
+- `--dump-alt-instructions`
+
+  Dump Linux alternative instructions info
+
 - `--dump-cg=<string>`
 
   Dump callgraph to the given file
@@ -117,10 +139,34 @@
   Dump function CFGs to graphviz format after each stage;enable '-print-loops'
   for color-coded blocks
 
+- `--dump-linux-exceptions`
+
+  Dump Linux kernel exception table
+
 - `--dump-orc`
 
   Dump raw ORC unwind information (sorted)
 
+- `--dump-para-sites`
+
+  Dump Linux kernel paravirtual patch sites
+
+- `--dump-pci-fixups`
+
+  Dump Linux kernel PCI fixup table
+
+- `--dump-smp-locks`
+
+  Dump Linux kernel SMP locks
+
+- `--dump-static-calls`
+
+  Dump Linux kernel static calls
+
+- `--dump-static-keys`
+
+  Dump Linux kernel static keys jump table
+
 - `--dwarf-output-path=<string>`
 
   Path to where .dwo files or dwp file will be written out to.
@@ -205,6 +251,14 @@
 
   Skip processing of cold functions
 
+- `--log-file=<string>`
+
+  Redirect journaling to a file instead of stdout/stderr
+
+- `--long-jump-labels`
+
+  Always use long jumps/nops for Linux kernel static keys
+
 - `--max-data-relocations=<uint>`
 
   Maximum number of data relocations to process
@@ -274,6 +328,10 @@
 
   Number of tasks to be created per thread
 
+- `--terminal-trap`
+
+  Assume that execution stops at trap instruction
+
 - `--thread-count=<uint>`
 
   Number of threads
@@ -618,10 +676,6 @@
   threshold means fewer functions to process. E.g threshold of 90 means only top
   10 percent of functions with profile will be processed.
 
-- `--mcf-use-rarcs`
-
-  In MCF, consider the possibility of cancelling flow to balance edges
-
 - `--memcpy1-spec=<func1,func2:cs1:cs2,func3:cs1,...>`
 
   List of functions with call sites for which to specialize memcpy() for size 1
@@ -710,7 +764,7 @@
   - `none`: do not reorder functions
   - `exec-count`: order by execution count
   - `hfsort`: use hfsort algorithm
-  - `hfsort+`: use hfsort+ algorithm
+  - `hfsort+`: use cache-directed sort
   - `cdsort`: use cache-directed sort
   - `pettis-hansen`: use Pettis-Hansen algorithm
   - `random`: reorder functions randomly
@@ -804,8 +858,8 @@
 
 - `--stale-matching-min-matched-block=<uint>`
 
-  Minimum percent of exact match block for a function to be considered for
-  profile inference.
+  Percentage threshold of matched basic blocks at which stale profile inference
+  is executed.
 
 - `--stale-threshold=<uint>`
 
@@ -853,6 +907,10 @@
 
   Only apply branch boundary alignment in hot code
 
+- `--x86-strip-redundant-address-size`
+
+  Remove redundant Address-Size override prefix
+
 ### BOLT options in relocation mode:
 
 - `--align-macro-fusion=<value>`
@@ -1039,6 +1097,10 @@
 
   Print clusters
 
+- `--print-estimate-edge-counts`
+
+  Print function after edge counts are set for no-LBR profile
+
 - `--print-finalized`
 
   Print function after CFG is finalized
@@ -1071,6 +1133,10 @@
 
   Print functions after inlining optimization
 
+- `--print-large-functions`
+
+  Print functions that could not be overwritten due to excessive size
+
 - `--print-longjmp`
 
   Print functions after longjmp pass
@@ -1166,4 +1232,4 @@
 
 - `--print-options`
 
-  Print non-default options after command line parsing
+  Print non-default options after command line parsing
```
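
For reference, several of the kernel-related dump options documented above can be combined in a single read-only invocation; the sketch below assumes a `vmlinux` input and an arbitrary log path:

```bash
$ llvm-bolt vmlinux -o /dev/null \
    --dump-orc --dump-static-keys --dump-linux-exceptions \
    --log-file=bolt.log
```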

bolt/docs/OptimizingLinux.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@

# Optimizing Linux Kernel with BOLT

## Introduction

Many Linux applications spend a significant amount of their execution time in the kernel. Thus, when we consider code optimization for system performance, it is essential to improve the CPU utilization not only in the user-space applications and libraries but also in the kernel. BOLT has demonstrated double-digit gains while being applied to user-space programs. This guide shows how to apply BOLT to the x86-64 Linux kernel and enhance your system's performance. In our experiments, BOLT boosted database TPS by 2 percent when applied to the kernel compiled with the highest level optimizations, including PGO and LTO. The database spent ~40% of the time in the kernel and was quite sensitive to kernel performance.

BOLT optimizes code layout based on a low-level execution profile collected with the Linux `perf` tool. The best quality profile should include branch history, such as Intel's last branch records (LBR). BOLT runs on a linked binary and reorders the code while combining frequently executed blocks of instructions in a manner best suited for the hardware. Other than branch instructions, most of the code is left unchanged. Additionally, BOLT updates all metadata associated with the modified code, including DWARF debug information and Linux ORC unwind information.

While BOLT optimizations are not specific to the Linux kernel, certain quirks distinguish the kernel from user-level applications.

BOLT has been successfully applied to and tested with several flavors of the x86-64 Linux kernel.

## QuickStart Guide

BOLT operates on a statically-linked kernel executable, a.k.a. `vmlinux` binary. However, most Linux distributions use a `vmlinuz` compressed image for system booting. To use BOLT on the kernel, you must either repackage `vmlinuz` after BOLT optimizations or add steps for running BOLT into the kernel build and rebuild `vmlinuz`. Uncompressing `vmlinuz` and repackaging it with a new `vmlinux` binary falls beyond the scope of this guide, and at some point, we may add the capability to run BOLT directly on `vmlinuz`. Meanwhile, this guide focuses on steps for integrating BOLT into the kernel build process.

### Building the Kernel

After downloading the kernel sources and configuration for your distribution, you should be able to build `vmlinuz` using the `make bzImage` command. Ideally, the kernel should binary match the kernel on the system you are about to optimize (the target system). The binary matching part is critical as BOLT performs profile matching and optimizations at the binary level. We recommend installing a freshly built kernel on the target system to avoid any discrepancies.

Note that the kernel build will produce several artifacts besides bzImage. The most important of them is the uncompressed `vmlinux` binary, which will be used in the next steps. Make sure to save this file.
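
A typical build sequence might look like the sketch below; the exact configuration steps, paths, and job counts depend on your distribution and hardware:

```bash
# Start from the configuration of the currently running kernel (path may differ).
$ cp /boot/config-$(uname -r) .config
$ make olddefconfig

# Build the compressed boot image; the uncompressed vmlinux is left in the tree root.
$ make -j"$(nproc)" bzImage

# Keep a copy of vmlinux for the BOLT steps below.
$ cp vmlinux vmlinux.orig
```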

Build and target systems should have a `perf` tool installed for collecting and processing profiles. If your build system differs from the target, make sure `perf` versions are compatible. The build system should also have the latest BOLT binary and tools (`llvm-bolt`, `perf2bolt`).
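
As a quick sanity check, assuming the tools are already on `PATH`, you can confirm the versions on both systems:

```bash
$ perf --version
$ llvm-bolt --version
```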

Once the target system boots with the freshly-built kernel, start your workload, such as a database benchmark. While the system is under load, collect the kernel profile using perf:

```bash
$ sudo perf record -a -e cycles -j any,k -F 5000 -- sleep 600
```

Convert the `perf` profile into a format suitable for BOLT, passing the `vmlinux` binary to `perf2bolt`:

```bash
$ sudo chown $USER perf.data
$ perf2bolt -p perf.data -o perf.fdata vmlinux
```

Under a high load, `perf.data` should be several gigabytes in size and you should expect the converted `perf.fdata` not to exceed 100 MB.

Two changes are required for the kernel build. The first one is optional but highly recommended. It introduces a BOLT-reserved space into the `vmlinux` code section:

```diff
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -139,6 +139,11 @@ SECTIONS
 STATIC_CALL_TEXT
 *(.gnu.warning)
 
+ /* Allocate space for BOLT */
+ __bolt_reserved_start = .;
+ . += 2048 * 1024;
+ __bolt_reserved_end = .;
+
 #ifdef CONFIG_RETPOLINE
 __indirect_thunk_start = .;
 *(.text.__x86.*)
```

The second patch adds a step that runs BOLT on the `vmlinux` binary:

```diff
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -340,5 +340,13 @@ if is_enabled CONFIG_KALLSYMS; then
   fi
 fi
 
+# Apply BOLT
+BOLT=llvm-bolt
+BOLT_PROFILE=perf.fdata
+BOLT_OPTS="--dyno-stats --eliminate-unreachable=0 --reorder-blocks=ext-tsp --simplify-conditional-tail-calls=0 --skip-funcs=__entry_text_start,irq_entries_start --split-functions"
+mv vmlinux vmlinux.pre-bolt
+echo BOLTing vmlinux
+${BOLT} vmlinux.pre-bolt -o vmlinux --data ${BOLT_PROFILE} ${BOLT_OPTS}
+
 # For fixdep
 echo "vmlinux: $0" > .vmlinux.d
```

If you skipped the first step or are running BOLT on a pre-built `vmlinux` binary, drop the `--split-functions` option.
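
For reference, a standalone invocation on a pre-built `vmlinux` might look like the following sketch, using the same options as the patch above minus `--split-functions` (the output name is a placeholder):

```bash
$ llvm-bolt vmlinux -o vmlinux.bolt --data perf.fdata \
    --dyno-stats --eliminate-unreachable=0 --reorder-blocks=ext-tsp \
    --simplify-conditional-tail-calls=0 \
    --skip-funcs=__entry_text_start,irq_entries_start
```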

## Performance Expectations

By improving the code layout, BOLT can boost the kernel's performance by up to 5% by reducing instruction cache misses and branch mispredictions. When measuring total system performance, you should scale this number accordingly based on the time your application spends in the kernel (excluding I/O time).

## Profile Quality

The timing and duration of the profiling may have a significant effect on the performance of the BOLTed kernel. If you don't know your workload well, it's recommended that you profile for the whole duration of the benchmark run. As longer times will result in larger `perf.data` files, you can lower the profiling frequency by providing a smaller value to the `-F` flag. E.g., to record the kernel profile for half an hour, use the following command:

```bash
$ sudo perf record -a -e cycles -j any,k -F 1000 -- sleep 1800
```

## BOLT Disassembly

BOLT annotates the disassembly with control-flow information and attaches Linux-specific metadata to the code. To view annotated disassembly, run:

```bash
$ llvm-bolt vmlinux -o /dev/null --print-cfg
```

If you want to limit the disassembly to a set of functions, add `--print-only=<func1regex>,<func2regex>,...`, where a function name is specified using regular expressions.
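
For example, to print only a couple of scheduler-related functions (the names below are purely illustrative):

```bash
$ llvm-bolt vmlinux -o /dev/null --print-cfg --print-only=schedule,__schedule.*
```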

bolt/include/bolt/Core/BinaryFunction.h

Lines changed: 4 additions & 0 deletions
```diff
@@ -930,6 +930,10 @@ class BinaryFunction {
     return const_cast<BinaryFunction *>(this)->getInstructionAtOffset(Offset);
   }
 
+  /// When the function is in disassembled state, return an instruction that
+  /// contains the \p Offset.
+  MCInst *getInstructionContainingOffset(uint64_t Offset);
+
   std::optional<MCInst> disassembleInstructionAtOffset(uint64_t Offset) const;
 
   /// Return offset for the first instruction. If there is data at the
```
