[mlir][gpu][NVPTX] Enable NVIDIA GPU JIT compilation path #66220
Conversation
This patch adds an NVPTX compilation path that enables JIT compilation on NVIDIA targets. The following modifications were performed:

1. Adding a format field to the GPU object attribute, allowing the translation attribute to use the correct runtime function to load the module. Likewise, a dictionary attribute was added to add any possible extra options.
2. Adding the `createObject` method to `GPUTargetAttrInterface`; this method returns a GPU object from a binary string.
3. Adding the function `mgpuModuleLoadJIT`, which is only available for NVIDIA GPUs, as there is no equivalent for AMD. A sketch of such a wrapper is shown below.
4. Adding the CMake flag `MLIR_GPU_COMPILATION_TEST_FORMAT` to specify the format to use during testing.
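For item 3, here is a minimal sketch of what such a JIT-loading wrapper can look like using the CUDA driver API. This is an illustration, not the exact code from the patch; the 4 KiB error-buffer size and the particular option set are assumptions.

```c++
#include "cuda.h"

#include <cstdint>
#include <cstdio>

// Hands a PTX string to the CUDA driver, which JITs it for whatever GPU is
// present at run time. `optLevel` is forwarded as the driver's JIT
// optimization level; the error buffer captures JIT diagnostics.
extern "C" CUmodule mgpuModuleLoadJIT(void *data, int optLevel) {
  CUmodule module = nullptr;
  char jitErrorBuffer[4096] = {0};
  CUjit_option jitOptions[] = {CU_JIT_ERROR_LOG_BUFFER,
                               CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES,
                               CU_JIT_OPTIMIZATION_LEVEL};
  void *jitOptionsVals[] = {
      jitErrorBuffer,
      reinterpret_cast<void *>(sizeof(jitErrorBuffer)),
      reinterpret_cast<void *>(static_cast<uintptr_t>(optLevel))};
  CUresult result =
      cuModuleLoadDataEx(&module, data, 3, jitOptions, jitOptionsVals);
  if (result != CUDA_SUCCESS)
    fprintf(stderr, "JIT compilation failed: %s\n", jitErrorBuffer);
  return module;
}
```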
@llvm/pr-subscribers-mlir-sparse @llvm/pr-subscribers-mlir
Patch is 50.36 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/66220.diff

33 Files Affected:
```diff
diff --git a/mlir/include/mlir/Dialect/GPU/IR/CompilationAttrs.td b/mlir/include/mlir/Dialect/GPU/IR/CompilationAttrs.td
+def GPU_ObjectOffload : I32EnumAttrCase<"Offload", 1, "offload">;
 def GPU_ObjectAttr : GPU_Attr<"Object", "object"> {
diff --git a/mlir/include/mlir/Dialect/GPU/IR/CompilationInterfaces.h b/mlir/include/mlir/Dialect/GPU/IR/CompilationInterfaces.h
 protected:
 LogicalResult ObjectAttr::verify(function_ref<InFlightDiagnostic()> emitError,
+namespace {
+void printObject(AsmPrinter &odsParser, CompilationTarget format,
 //===----------------------------------------------------------------------===//
+CompilationTarget TargetOptions::getCompilationTarget() const {
+CompilationTarget TargetOptions::getDefaultCompilationTarget() {
 std::pair<llvm::BumpPtrAllocator, SmallVector<const char *>>
-TargetOptions::CompilationTarget TargetOptions::getCompilationTarget() const {
 MLIR_DEFINE_EXPLICIT_TYPE_ID(::mlir::gpu::TargetOptions)
 #include "mlir/Dialect/GPU/IR/GPUOpInterfaces.cpp.inc"
 void GpuModuleToBinaryPass::runOnOperation() {
+extern "C" MLIR_CUDA_WRAPPERS_EXPORT CUmodule mgpuModuleLoadJIT(void *data,
 extern "C" MLIR_CUDA_WRAPPERS_EXPORT void mgpuModuleUnload(CUmodule module) {
+extern "C" hipModule_t mgpuModuleLoadJIT(void *data, int optLevel) {
 extern "C" void mgpuModuleUnload(hipModule_t module) {
```
Ad "An option needs to be added to the SparseCompiler to support the format option, however I didn't know if there's any preference." Since I don't see changes in the sparse code, I assume you want some feedback, but I need a bit more context on what you had in mind. In general, we have a lot of "knobs" in the sparse pipeline setup, so generally I am not opposed to adding one more ;-) |
With this patch we have 3 ways to compile code: `isa` (PTX that the driver JITs at load time), `bin` (a CUBIN compiled for a specific architecture), and `fatbin` (a fat binary embedding both the CUBIN and the PTX).
JIT will make the test work, but fatbin is preferable for runtime performance as it can be used for AOT.
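To make the trade-off concrete, here is a hedged sketch of how a loader could dispatch on these formats. The `ObjectFormat` enum and the `loadObject` helper are made up for this example; only the two driver calls are real API.

```c++
#include "cuda.h"

#include <cstdint>

// Hypothetical format tag mirroring the three options above.
enum class ObjectFormat { ISA, Bin, Fatbin };

CUmodule loadObject(ObjectFormat format, void *data, int optLevel) {
  CUmodule module = nullptr;
  if (format == ObjectFormat::ISA) {
    // PTX: ask the driver to JIT it for the current device.
    CUjit_option opt = CU_JIT_OPTIMIZATION_LEVEL;
    void *val = reinterpret_cast<void *>(static_cast<uintptr_t>(optLevel));
    cuModuleLoadDataEx(&module, data, 1, &opt, &val);
  } else {
    // CUBIN or fatbin: loaded directly; for a fatbin the driver picks a
    // matching CUBIN and only falls back to JITting the embedded PTX when
    // no compatible CUBIN is found.
    cuModuleLoadData(&module, data);
  }
  return module;
}
```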
Ad "Is it okay to add another option to the sparse compiler to specify which format to use?" Yes, more than okay! Ass "Is there a preference to which option to use by default?" If JIT make the test work again, let's make that the default. But please describe the three options in detail with performance implications (possibly indirectly by referring to where you add this as comment) |
I'll do that. Btw, did you have the chance to try the fix I posted in #65857 yesterday?
I haven't looked at the code carefully; I will do that tomorrow, but adding JIT sounds great.
Would it be okay if we didn't do this for default behaviour? Nvidia's state-of-the-art compiler is …
The default behavior remains `fatbin`. Another option is setting a different default just for the sparse compiler. I'm inclined to keep `fatbin` as the general default.
Perfectly okay with another default for the sparse compiler for consistency. I was merely suggesting this so the tests would pass without changes, but explicitly marking them as `isa` works too. I don't feel strongly either way, so please pick whatever feels best.
```tablegen
  GPU_ObjectISA,
  GPU_ObjectBinary,
  GPU_ObjectFatbin
]>;
```
This deserves some doc.
(I'm not totally sure right now what "offload" does in this list actually)
I added the docs in the `ObjectAttr` docs. The `offload` format is meant to be a generic format; for NVPTX and AMDGPU it generates LLVM bitcode. Execution from this format is not enabled in trunk; however, downstream users could use it (see the sketch below).
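As a hedged illustration of that downstream use case, one could parse the LLVM bitcode stored in an `offload` object back into a module and keep compiling it in a custom pipeline. The `loadOffloadObject` helper and the buffer name are assumptions, not part of this patch.

```c++
#include "llvm/ADT/StringRef.h"
#include "llvm/Bitcode/BitcodeReader.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/MemoryBufferRef.h"

// Turns the bitcode payload of an offload-format GPU object back into an
// llvm::Module that a downstream pipeline can lower to PTX or AMDGPU ISA.
std::unique_ptr<llvm::Module> loadOffloadObject(llvm::StringRef bitcode,
                                                llvm::LLVMContext &ctx) {
  llvm::Expected<std::unique_ptr<llvm::Module>> module =
      llvm::parseBitcodeFile(
          llvm::MemoryBufferRef(bitcode, "gpu.offload.object"), ctx);
  if (!module) {
    llvm::consumeError(module.takeError());
    return nullptr;
  }
  return std::move(*module);
}
```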
````tablegen
@@ -32,8 +44,17 @@ def GPU_ObjectAttr : GPU_Attr<"Object", "object"> {
    #gpu.object<#nvvm.target, "...">
    ```
  }];
````
Update the doc please
```c++
  binOrFatbin =
      binary |
      fatbinary, /// The process should produce a binary or fatbinary. It's up
                 /// to the target to decide which.
```
(this is the doc that may have been lost moving to ODS, cf the other comment above)
```c++
@@ -144,6 +144,22 @@ struct SparseCompilerOptions
      desc("GPU target architecture")};
  PassOptions::Option<std::string> gpuFeatures{*this, "gpu-features",
      desc("GPU target features")};
  /// For NVIDIA GPUs there are 3 compilation format options:
  /// 1. `isa`: the compiler generates PTX and the runtime JITs the PTX.
```
```diff
-  /// 1. `isa`: the compiler generates PTX and the runtime JITs the PTX.
+  /// 1. `isa`: the compiler generates PTX and the driver JITs the PTX.
```
```c++
  /// GPU running the program.
  /// Option 3 is the best compromise between options 1 & 2 as it can JIT in
  /// case of an arch mismatch, however, it's only possible to JIT to a higher
  /// CC than `gpuChip`.
```
What is the CC target when using 1?
To some extent there shouldn't be any difference between 1 and 3?
It's never specified; that's why `gpu-to-cubin` always worked, it was always JITted to the running arch.
If there's an arch mismatch then 1 and 3 have the same performance hit; however, if the compiled arch matches the running arch, then 3 behaves like 2 and there's no performance hit. A sketch of how to query the running arch is shown below.
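For completeness, the architecture the driver JITs to can be inspected at run time with the driver API. This is a standalone example, not part of the patch.

```c++
#include "cuda.h"

#include <cstdio>

// Prints the compute capability of device 0 (e.g. sm_80), i.e. the
// architecture the driver JIT compiles PTX for on this machine.
int main() {
  cuInit(0);
  CUdevice device;
  cuDeviceGet(&device, /*ordinal=*/0);
  int major = 0, minor = 0;
  cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR,
                       device);
  cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR,
                       device);
  printf("sm_%d%d\n", major, minor);
  return 0;
}
```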
The last commit updated the docs, migrated all tests to use …
```c++
  /// 3. `fatbin`: generates a fat binary with a CUBIN object for `gpuChip` and
  ///    also embeds the PTX in the fat binary.
  /// Notes:
  /// Option 1 adds a significant runtime performance hit, however, tests are
```
Thank you for adding this detailed explanation.
Thank you for this change! I have a few nits, but good to go once addressed, so I am approving this (for the sparse changes part).
```c++
  /// Option 1 adds a significant runtime performance hit, however, tests are
  ///    more likely to pass with this option.
  /// Option 2 is better for execution time as there is no JIT; however, the
  /// program will fail if there's an arch mismatch between `gpuChip` and the
```
nit: can you please spell out "architecture" (unless this is NVidia convention to write it that way)
```diff
@@ -1,2 +1,4 @@
 if not config.enable_cuda_runner or not config.mlir_run_cuda_sm80_tests:
   config.unsupported = True
 
+config.substitutions.append(("%format", config.gpu_compilation_format))
```
can we please use a slightly more specific name for this (`format` is very generic, how about at least `gpu_format` or so)
Just a random note, @fabianmcg, that I really appreciate your refactoring. The GPU "pipeline" for the sparse compiler still had a few rough edges and you really smoothed these out! So, thanks!
Thank you, happy to help! Also thanks for all the feedback and testing!
NOTE:

- The compilation format used during testing is selected with `MLIR_GPU_COMPILATION_TEST_FORMAT`.
- An option needs to be added to the `SparseCompiler` to support the format option, however I didn't know if there's any preference.
- `mgpuModuleLoadJIT` relies on the assumption there's a JIT cache. Another option is to implement the cache itself in MLIR.